New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8290: [Python] Improve FileSystemDataset constructor #6913
ARROW-8290: [Python] Improve FileSystemDataset constructor #6913
Conversation
FileFormat file_format not None, | ||
FileSystem filesystem not None, | ||
paths_or_selector, partitions): | ||
def __init__(self, paths_or_selector, schema=None, format=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I think you can leave the type annotations here, just the not None
part needs to be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is then that it are positional keywords to cython. And although you can call it as keyword arg, you get all kinds of uninformative error messages once you do something wrong.
(maybe there is a better way to do this with cython, but didn't directly figure out. There is also not much docs about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like "TypeError: init() takes exactly 6 positional arguments (1 given)" when using all keyword arguments, but having a typo in one of them.
This was actually the reason that I first thought that you needed to pass everything as positional arguments, because I used format=
instead of file_format=
, and then you get such error. And so I wanted to solve that in this PR to allow keyword arguments, to find out that the cause was actually a wrong name and that keyword arguments were already supported .. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's annoying. Do you want to report this upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can do that. Need to make a small reproducer then, will put that on my to do list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche schema, format and filesystem arguments are mandatory, please remove the default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it might be confusing from the signature, but I did this to ensure proper error messages (and the docstring indicates which are optional, and you will get a proper error message when not passing a value)
root_partition.unwrap(), | ||
file_format.unwrap(), | ||
filesystem.unwrap(), | ||
(<Expression> root_partition).unwrap(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I pass an object which is not castable to <Expression>
then I get a segfault, e.g. root_partition=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You change the cast to <Expression?>
then it is raising a cannot convert type error - although it might be better to set default values before the casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, all other keywords are validated to be of the correct class before casting, except this one, thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fixed this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Joris!
schema
,format
andfilesystem
keyword validation manually, instead of letting cython handle it (which gives very cryptic error messages, like complaining about positional arguments if you made a typo in one of the keyword names)root_expression
to the endpartitions
androot_expression
optionalfile_format
toformat
, since this name is what we are using elsewhere (eg inds.dataset(..)