ARROW-8290: [Python] Improve FileSystemDataset constructor #6913

Closed
43 changes: 28 additions & 15 deletions python/pyarrow/_dataset.pyx
```diff
@@ -276,55 +276,68 @@ cdef class FileSystemDataset(Dataset):

     Parameters
     ----------
+    paths_or_selector : Union[FileSelector, List[FileInfo]]
+        List of files/directories to consume.
     schema : Schema
         The top-level schema of the Dataset.
-    root_partition : Expression
-        The top-level partition of the Dataset.
-    file_format : FileFormat
+    format : FileFormat
         File format to create fragments from, currently only
         ParquetFileFormat and IpcFileFormat are supported.
     filesystem : FileSystem
         The filesystem which files are from.
-    paths_or_selector : Union[FileSelector, List[FileInfo]]
-        List of files/directories to consume.
-    partitions : List[Expression]
+    partitions : List[Expression], optional
         Attach additional partition information for the file paths.
+    root_partition : Expression, optional
+        The top-level partition of the Dataset.
     """

     cdef:
         CFileSystemDataset* filesystem_dataset

-    def __init__(self, Schema schema not None, Expression root_partition,
-                 FileFormat file_format not None,
-                 FileSystem filesystem not None,
-                 paths_or_selector, partitions):
+    def __init__(self, paths_or_selector, schema=None, format=None,
+                 filesystem=None, partitions=None, root_partition=None):
```

Review thread on the new `__init__` signature:

Member: @jorisvandenbossche I think you can leave the type annotations here; just the `not None` part needs to be removed.

Member Author: The problem is that they then become positional arguments to Cython. And although you can still call them as keyword arguments, you get all kinds of uninformative error messages once you do something wrong. (Maybe there is a better way to do this with Cython, but I didn't directly figure one out; there is also not much documentation about this.)

Member Author: Like "TypeError: `__init__()` takes exactly 6 positional arguments (1 given)" when using all keyword arguments but having a typo in one of them. This was actually the reason I first thought you needed to pass everything as positional arguments: I used `format=` instead of `file_format=` and got such an error. So I wanted to solve that in this PR by allowing keyword arguments, only to find out that the cause was actually a wrong name and that keyword arguments were already supported. :)

kszucs (Member, Apr 13, 2020): Indeed, that's annoying. Do you want to report this upstream?

Member Author: Yes, I can do that. I need to make a small reproducer first; I'll put it on my to-do list.

Member: @jorisvandenbossche schema, format and filesystem arguments are mandatory, please remove the default values.

Member Author: I know it might be confusing from the signature, but I did this to ensure proper error messages (the docstring indicates which arguments are optional, and you get a proper error message when not passing a required value).
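The approach defended in this thread can be sketched in plain Python (the classes below are hypothetical stand-ins, not the real pyarrow types): required arguments default to `None` so Cython accepts them as keywords, and an explicit `isinstance` loop then raises an informative `TypeError` naming the offending argument.

```python
# Hypothetical stand-ins for the real pyarrow types, for illustration only.
class Schema: pass
class FileFormat: pass
class FileSystem: pass


def validate_required(schema=None, format=None, filesystem=None):
    # Mirror of the validation loop in the diff: check each "required"
    # keyword against its expected class and name it in the error message.
    for arg, class_, name in [
        (schema, Schema, 'schema'),
        (format, FileFormat, 'format'),
        (filesystem, FileSystem, 'filesystem'),
    ]:
        if not isinstance(arg, class_):
            raise TypeError(
                "Argument '{0}' has incorrect type (expected {1}, "
                "got {2})".format(name, class_.__name__, type(arg))
            )


# All required arguments present: passes silently.
validate_required(schema=Schema(), format=FileFormat(), filesystem=FileSystem())

# A missing (or mistyped) argument produces a message naming it.
try:
    validate_required(schema=Schema(), filesystem=FileSystem())
except TypeError as exc:
    print(exc)  # message names the 'format' argument
```

This trades signature-level enforcement for error messages that point at the actual problem, which is exactly the trade-off discussed above.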
```diff
     cdef:
         FileInfo info
         Expression expr
         vector[CFileInfo] c_file_infos
         vector[shared_ptr[CExpression]] c_partitions
         CResult[shared_ptr[CDataset]] result

+        # validate required arguments
+        for arg, class_, name in [
+            (schema, Schema, 'schema'),
+            (format, FileFormat, 'format'),
+            (filesystem, FileSystem, 'filesystem')
+        ]:
+            if not isinstance(arg, class_):
+                raise TypeError(
+                    "Argument '{0}' has incorrect type (expected {1}, "
+                    "got {2})".format(name, class_.__name__, type(arg))
+                )
+
         for info in filesystem.get_file_info(paths_or_selector):
             c_file_infos.push_back(info.unwrap())

+        if partitions is None:
+            partitions = [
+                ScalarExpression(True) for _ in range(c_file_infos.size())]
         for expr in partitions:
             c_partitions.push_back(expr.unwrap())

         if c_file_infos.size() != c_partitions.size():
             raise ValueError(
-                'The number of files resulting from paths_or_selector must be '
-                'equal to the number of partitions.'
+                'The number of files resulting from paths_or_selector '
+                'must be equal to the number of partitions.'
             )

+        if root_partition is None:
+            root_partition = ScalarExpression(True)
+
         result = CFileSystemDataset.Make(
             pyarrow_unwrap_schema(schema),
-            root_partition.unwrap(),
-            file_format.unwrap(),
-            filesystem.unwrap(),
+            (<Expression> root_partition).unwrap(),
+            (<FileFormat> format).unwrap(),
+            (<FileSystem> filesystem).unwrap(),
             c_file_infos,
             c_partitions
         )
```

Review thread on the `(<Expression> root_partition).unwrap()` cast:

kszucs (Member, Apr 13, 2020): If I pass an object which is not castable to `<Expression>` then I get a segfault, e.g. `root_partition=1`.

Member: If you change the cast to `<Expression?>` then it raises a "cannot convert" type error, although it might be better to set the default values before the casts.

Member Author: Ah, yes, all the other keywords are validated to be of the correct class before casting, except this one. Thanks for the catch!

Member Author: OK, fixed this.
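The defaulting and cross-checking logic added above can be illustrated with a small Python sketch (`resolve_partitions` is a hypothetical helper; plain `True` stands in for `ScalarExpression(True)`):

```python
def resolve_partitions(n_files, partitions=None):
    # When no partitions are given, default each file to an always-true
    # partition expression, mirroring the diff above.
    if partitions is None:
        partitions = [True] * n_files  # stand-in for ScalarExpression(True)
    # Then cross-check that the counts line up, as the constructor does.
    if n_files != len(partitions):
        raise ValueError(
            'The number of files resulting from paths_or_selector '
            'must be equal to the number of partitions.')
    return partitions


print(resolve_partitions(2))  # → [True, True]
```

Computing the default from the already-unwrapped file count is what lets the length check stay meaningful whether or not the caller supplied partitions.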
20 changes: 17 additions & 3 deletions python/pyarrow/tests/test_dataset.py
```diff
@@ -191,15 +191,29 @@ def test_filesystem_dataset(mockfs):
     partitions = [ds.ScalarExpression(True), ds.ScalarExpression(True)]

     dataset = ds.FileSystemDataset(
-        schema,
+        schema=schema,
         root_partition=None,
-        file_format=file_format,
+        format=file_format,
         filesystem=mockfs,
         paths_or_selector=paths,
         partitions=partitions
     )
     assert isinstance(dataset.format, ds.ParquetFileFormat)

+    # the root_partition and partitions keywords have defaults
+    dataset = ds.FileSystemDataset(
+        paths, schema, format=file_format, filesystem=mockfs,
+    )
+    assert isinstance(dataset.format, ds.ParquetFileFormat)
+
+    # validation of required arguments
+    with pytest.raises(TypeError, match="incorrect type"):
+        ds.FileSystemDataset(paths, format=file_format, filesystem=mockfs)
+    with pytest.raises(TypeError, match="incorrect type"):
+        ds.FileSystemDataset(paths, schema=schema, filesystem=mockfs)
+    with pytest.raises(TypeError, match="incorrect type"):
+        ds.FileSystemDataset(paths, schema=schema, format=file_format)
+
     root_partition = ds.ComparisonExpression(
         ds.CompareOperator.Equal,
         ds.FieldExpression('level'),
@@ -223,7 +237,7 @@ def test_filesystem_dataset(mockfs):
         root_partition=root_partition,
         filesystem=mockfs,
         partitions=partitions,
-        file_format=file_format
+        format=file_format
     )
     assert dataset.partition_expression.equals(root_partition)
     assert set(dataset.files) == set(paths)
```
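As a standalone illustration of the `pytest.raises(..., match=...)` pattern used in these tests (`require_schema` is a hypothetical toy function, not the pyarrow constructor): `match` is treated as a regular expression searched in the string form of the raised exception, so a distinctive substring such as "incorrect type" is enough to pin down which error was raised.

```python
import pytest


def require_schema(schema=None):
    # Toy stand-in for a constructor with explicit argument validation.
    if schema is None:
        raise TypeError("Argument 'schema' has incorrect type")
    return schema


# The context manager passes only if the expected error (with a matching
# message) is raised inside the block.
with pytest.raises(TypeError, match="incorrect type"):
    require_schema()
```

Matching on the message, not just the exception class, guards against the test passing for an unrelated `TypeError` such as the uninformative positional-argument one discussed earlier in the review.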