-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-8290: [Python] Improve FileSystemDataset constructor #6913
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -276,55 +276,68 @@ cdef class FileSystemDataset(Dataset): | |
|
||
Parameters | ||
---------- | ||
paths_or_selector : Union[FileSelector, List[FileInfo]] | ||
List of files/directories to consume. | ||
schema : Schema | ||
The top-level schema of the DataDataset. | ||
root_partition : Expression | ||
The top-level partition of the DataDataset. | ||
file_format : FileFormat | ||
format : FileFormat | ||
File format to create fragments from, currently only | ||
ParquetFileFormat and IpcFileFormat are supported. | ||
filesystem : FileSystem | ||
The filesystem which files are from. | ||
paths_or_selector : Union[FileSelector, List[FileInfo]] | ||
List of files/directories to consume. | ||
partitions : List[Expression] | ||
partitions : List[Expression], optional | ||
Attach aditional partition information for the file paths. | ||
root_partition : Expression, optional | ||
The top-level partition of the DataDataset. | ||
""" | ||
|
||
cdef: | ||
CFileSystemDataset* filesystem_dataset | ||
|
||
def __init__(self, Schema schema not None, Expression root_partition, | ||
FileFormat file_format not None, | ||
FileSystem filesystem not None, | ||
paths_or_selector, partitions): | ||
def __init__(self, paths_or_selector, schema=None, format=None, | ||
filesystem=None, partitions=None, root_partition=None): | ||
cdef: | ||
FileInfo info | ||
Expression expr | ||
vector[CFileInfo] c_file_infos | ||
vector[shared_ptr[CExpression]] c_partitions | ||
CResult[shared_ptr[CDataset]] result | ||
|
||
# validate required arguments | ||
for arg, class_, name in [ | ||
(schema, Schema, 'schema'), | ||
(format, FileFormat, 'format'), | ||
(filesystem, FileSystem, 'filesystem') | ||
]: | ||
if not isinstance(arg, class_): | ||
raise TypeError( | ||
"Argument '{0}' has incorrect type (expected {1}, " | ||
"got {2})".format(name, class_.__name__, type(arg)) | ||
) | ||
|
||
for info in filesystem.get_file_info(paths_or_selector): | ||
c_file_infos.push_back(info.unwrap()) | ||
|
||
if partitions is None: | ||
partitions = [ | ||
ScalarExpression(True) for _ in range(c_file_infos.size())] | ||
for expr in partitions: | ||
c_partitions.push_back(expr.unwrap()) | ||
|
||
if c_file_infos.size() != c_partitions.size(): | ||
raise ValueError( | ||
'The number of files resulting from paths_or_selector must be ' | ||
'equal to the number of partitions.' | ||
'The number of files resulting from paths_or_selector ' | ||
'must be equal to the number of partitions.' | ||
) | ||
|
||
if root_partition is None: | ||
root_partition = ScalarExpression(True) | ||
|
||
result = CFileSystemDataset.Make( | ||
pyarrow_unwrap_schema(schema), | ||
root_partition.unwrap(), | ||
file_format.unwrap(), | ||
filesystem.unwrap(), | ||
(<Expression> root_partition).unwrap(), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I pass an object which is not castable to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You change the cast to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, yes, all other keywords are validated to be of the correct class before casting, except this one, thanks for the catch! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, fixed this |
||
(<FileFormat> format).unwrap(), | ||
(<FileSystem> filesystem).unwrap(), | ||
c_file_infos, | ||
c_partitions | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche I think you can leave the type annotations here, just the
not None
part needs to be removed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is then that it are positional keywords to cython. And although you can call it as keyword arg, you get all kinds of uninformative error messages once you do something wrong.
(maybe there is a better way to do this with cython, but didn't directly figure out. There is also not much docs about this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like "TypeError: init() takes exactly 6 positional arguments (1 given)" when using all keyword arguments, but having a typo in one of them.
This was actually the reason that I first thought that you needed to pass everything as positional arguments, because I used
format=
instead offile_format=
, and then you get such error. And so I wanted to solve that in this PR to allow keyword arguments, to find out that the cause was actually a wrong name and that keyword arguments were already supported .. :)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that's annoying. Do you want to report this upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can do that. Need to make a small reproducer then, will put that on my to do list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche schema, format and filesystem arguments are mandatory, please remove the default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it might be confusing from the signature, but I did this to ensure proper error messages (and the docstring indicates which are optional, and you will get a proper error message when not passing a value)