Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir #7907

Closed
wants to merge 6 commits into from

Conversation

bkietz
Copy link
Member

@bkietz bkietz commented Aug 5, 2020

I still apply ignore_prefixes to all segments of paths yielded by a selector which lie outside an explicit partition base directory.

@github-actions
Copy link

github-actions bot commented Aug 5, 2020

@jorisvandenbossche
Copy link
Member

And the "partition base directory" is automatically set if a user does something like ds.dataset("_shouldnt_be_ignored/dataset/") ?

@bkietz
Copy link
Member Author

bkietz commented Aug 6, 2020

And the "partition base directory" is automatically set if a user does something like ds.dataset("_shouldnt_be_ignored/dataset/") ?

Currently in python the partition_base_dir is always identical to the base directory of the recursive selector used, so yes.

In principle it would be possible to select /mnt/data/** and set partition_base_dir=/mnt/data/partitioned, in which case /mnt/data/other/** would be included in the dataset but would not have partition information attached. Alternatively, one could set partition_base_dir=/mnt and select /mnt/year=2020/**, in which case all fragments would include the "year"_ == 2020 partition expression. I question the utility of these cases, honestly.

@bkietz
Copy link
Member Author

bkietz commented Aug 6, 2020

(again, moot for python since the two directories are identical) Since the user's selection of paths is really based in selector.base_dir, I think it'd be more approprate to apply ignore_prefixes relative to that

@jorisvandenbossche
Copy link
Member

Added the specific case of a directory to the existing test about this

@bkietz bkietz force-pushed the 9644-ignore_prefixes-base_dir branch from 147115e to 6d4779d Compare August 7, 2020 20:56
@nealrichardson
Copy link
Member

I didn't understand what the Appveyor failure was so I re-triggered the build to see if it was a flake.

@nealrichardson
Copy link
Member

Segfault in test_dataset.py?

============================= test session starts =============================
platform win32 -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: C:\projects\arrow\python, configfile: setup.cfg
plugins: hypothesis-5.24.0, lazy-fixture-0.6.3
collected 3415 items / 3 skipped / 3412 selected
pyarrow\tests\test_adhoc_memory_leak.py s                                [  0%]
pyarrow\tests\test_array.py ......................s..................... [  1%]
.....................................................................sss [  3%]
sssssss...............................................................ss [  5%]
.......                                                                  [  5%]
pyarrow\tests\test_builder.py ....                                       [  5%]
pyarrow\tests\test_cffi.py ....                                          [  5%]
pyarrow\tests\test_compute.py .......................................... [  7%]
.................................................................        [  9%]
pyarrow\tests\test_convert_builtin.py .................................. [ 10%]
........................................................................ [ 12%]
........................................................................ [ 14%]
........................................................................ [ 16%]
............................x........................................... [ 18%]
.......x....ssss......................................................   [ 20%]
pyarrow\tests\test_csv.py .............................................. [ 21%]
..........................s.                                             [ 22%]
pyarrow\tests\test_cython.py .                                           [ 22%]
pyarrow\tests\test_dataset.py ....................
(arrow) C:\projects\arrow\python>set lastexitcode=3 
(arrow) C:\projects\arrow\python>set  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp728E.tmp 
(arrow) C:\projects\arrow\python>echo C:\projects\arrow\python  1>C:\Users\appveyor\AppData\Local\Temp\1\tmp728F.tmp 
(arrow) C:\projects\arrow\python>exit /b 3 
Command exited with code 3

@pitrou pitrou force-pushed the 9644-ignore_prefixes-base_dir branch from 6d4779d to b281399 Compare August 12, 2020 13:50
@pitrou
Copy link
Member

pitrou commented Aug 12, 2020

I'm taking a look at the segfault.

@@ -117,6 +116,9 @@ def get_type_name(self):
protocol = protocol[0]
return "fsspec+{0}".format(protocol)

def normalize_path(self, path):
return path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche This seems to work for now. I'm not sure if fsspec normalizes paths internally (especially on Windows).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick looks, it seems that their LocalFileSystem should handle incoming "windows" (non-normalized) paths (see eg https://github.com/intake/filesystem_spec/pull/65/files, where a bunch of path = make_path_posix(path) were added to the LocalFileSystem methods). So I think this should be fine.

Only, do we rely on this method on the C++ side to actually return a normalized path?
Eg in the C++ dataset discovery code, that might be relying on being able to split the normalized path on / ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When calling GetFileInfo(selector), we need the return values to be a lexical child of the selector base_dir.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if GetFileInfo normalizes, so should NormalizePath. Otherwise, not necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, we can also special-case fsspec's LocalFileSystem to use our own LocalFileSystem instead... do we already do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed already do that:

if type(filesystem).__name__ == 'LocalFileSystem':
# In case its a simple LocalFileSystem, use native arrow one
return LocalFileSystem(use_mmap=use_mmap)

at least in cases where the _ensure_filesystem is used to process the user specified filesystem before passing it to the underlying functions (but which should in principle be done by all user facing functions).

So the actual fsspec's LocalFileSystem is only used for testing, so I think this is certainly fine then (since other filesystems should not have this problem of windows paths)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume it's fine, then.

@pitrou pitrou closed this in aaf467a Aug 13, 2020
kszucs pushed a commit to kszucs/arrow that referenced this pull request Aug 17, 2020
…ase_dir

I still apply ignore_prefixes to all segments of paths yielded by a selector which lie *outside* an explicit partition base directory.

Closes apache#7907 from bkietz/9644-ignore_prefixes-base_dir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
kszucs pushed a commit to kszucs/arrow that referenced this pull request Aug 17, 2020
…ase_dir

I still apply ignore_prefixes to all segments of paths yielded by a selector which lie *outside* an explicit partition base directory.

Closes apache#7907 from bkietz/9644-ignore_prefixes-base_dir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@bkietz bkietz deleted the 9644-ignore_prefixes-base_dir branch February 25, 2021 16:41
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…ase_dir

I still apply ignore_prefixes to all segments of paths yielded by a selector which lie *outside* an explicit partition base directory.

Closes apache#7907 from bkietz/9644-ignore_prefixes-base_dir

Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants