Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-8290: [Python] Improve FileSystemDataset constructor #6913

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Apr 13, 2020

  • Handle the required schema, format and filesystem keyword validation manually, instead of letting cython handle it (which gives very cryptic error messages, like complaining about positional arguments if you made a typo in one of the keyword names)
  • Move the paths to the first argument, move root_expression to the end
  • Make partitions and root_expression optional
  • Rename file_format to format, since this name is what we are using elsewhere (eg in ds.dataset(..)

@github-actions
Copy link

FileFormat file_format not None,
FileSystem filesystem not None,
paths_or_selector, partitions):
def __init__(self, paths_or_selector, schema=None, format=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche I think you can leave the type annotations here, just the not None part needs to be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is then that it are positional keywords to cython. And although you can call it as keyword arg, you get all kinds of uninformative error messages once you do something wrong.

(maybe there is a better way to do this with cython, but didn't directly figure out. There is also not much docs about this)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like "TypeError: init() takes exactly 6 positional arguments (1 given)" when using all keyword arguments, but having a typo in one of them.

This was actually the reason that I first thought that you needed to pass everything as positional arguments, because I used format= instead of file_format=, and then you get such error. And so I wanted to solve that in this PR to allow keyword arguments, to find out that the cause was actually a wrong name and that keyword arguments were already supported .. :)

Copy link
Member

@kszucs kszucs Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that's annoying. Do you want to report this upstream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can do that. Need to make a small reproducer then, will put that on my to do list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche schema, format and filesystem arguments are mandatory, please remove the default values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it might be confusing from the signature, but I did this to ensure proper error messages (and the docstring indicates which are optional, and you will get a proper error message when not passing a value)

root_partition.unwrap(),
file_format.unwrap(),
filesystem.unwrap(),
(<Expression> root_partition).unwrap(),
Copy link
Member

@kszucs kszucs Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I pass an object which is not castable to <Expression> then I get a segfault, e.g. root_partition=1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You change the cast to <Expression?> then it is raising a cannot convert type error - although it might be better to set default values before the casts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, all other keywords are validated to be of the correct class before casting, except this one, thanks for the catch!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed this

Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Joris!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants