ARROW-9644: [C++][Dataset] Don't apply ignore_prefixes to partition base_dir #7907
And the "partition base directory" is automatically set if a user does something like |
Currently in python the partition_base_dir is always identical to the base directory of the recursive selector used, so yes. In principle it would be possible to select |
(again, moot for python since the two directories are identical) Since the user's selection of paths is really based in |
Added the specific case of a directory to the existing test about this.
147115e to 6d4779d
I didn't understand what the Appveyor failure was, so I re-triggered the build to see if it was a flake.
Segfault in test_dataset.py?
6d4779d to b281399
I'm taking a look at the segfault.
@@ -117,6 +116,9 @@ def get_type_name(self):
            protocol = protocol[0]
        return "fsspec+{0}".format(protocol)

    def normalize_path(self, path):
        return path
@jorisvandenbossche This seems to work for now. I'm not sure if fsspec normalizes paths internally (especially on Windows).
From a quick look, it seems that their LocalFileSystem should handle incoming "windows" (non-normalized) paths (see e.g. https://github.com/intake/filesystem_spec/pull/65/files, where a bunch of path = make_path_posix(path) calls were added to the LocalFileSystem methods). So I think this should be fine.
Only, do we rely on this method on the C++ side to actually return a normalized path? E.g. the C++ dataset discovery code might be relying on being able to split the normalized path on /?
When calling GetFileInfo(selector), we need each returned path to be a lexical child of the selector's base_dir.
So if GetFileInfo normalizes, so should NormalizePath. Otherwise, it's not necessary.
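To make the "lexical child" requirement concrete, here is a minimal Python sketch. The helper names `to_posix` and `is_lexical_child` are hypothetical and are not part of Arrow or fsspec; they only illustrate why a path that still contains backslashes cannot be matched against a `/`-separated base_dir by splitting on `/`.

```python
def to_posix(path):
    # Hypothetical normalization step: use "/" as the only separator.
    return path.replace("\\", "/")


def is_lexical_child(base_dir, path):
    # Purely lexical check: compare "/"-separated segments, no filesystem access.
    base = [seg for seg in base_dir.split("/") if seg]
    parts = [seg for seg in path.split("/") if seg]
    return len(parts) > len(base) and parts[:len(base)] == base


base_dir = "C:/data"
raw = "C:\\data\\part=1\\f.parquet"

# Without normalization the whole path is a single segment, so the check fails.
unnormalized_ok = is_lexical_child(base_dir, raw)
# After normalization the leading segments match base_dir.
normalized_ok = is_lexical_child(base_dir, to_posix(raw))
```

If the paths returned for a selector and the selector's base_dir are both normalized, or both left untouched, the check behaves consistently, which is the point made above.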
That said, we can also special-case fsspec's LocalFileSystem to use our own LocalFileSystem instead... do we already do that?
We indeed already do that:
Lines 77 to 79 in 90d1ab7
    if type(filesystem).__name__ == 'LocalFileSystem':
        # In case its a simple LocalFileSystem, use native arrow one
        return LocalFileSystem(use_mmap=use_mmap)
at least in cases where _ensure_filesystem is used to process the user-specified filesystem before passing it to the underlying functions (which should in principle be done by all user-facing functions).
So fsspec's actual LocalFileSystem is only used for testing, and I think this is certainly fine then (since other filesystems should not have this problem of Windows paths)?
Let's assume it's fine, then.
…ase_dir
I still apply ignore_prefixes to all segments of paths yielded by a selector which lie *outside* an explicit partition base directory.
Closes apache#7907 from bkietz/9644-ignore_prefixes-base_dir
Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
I still apply ignore_prefixes to all segments of paths yielded by a selector which lie outside an explicit partition base directory.
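One way to read this rule, sketched in Python rather than the C++ of the actual change: segments that belong to the partition base directory are exempt from the prefix check, while every other segment is still checked. The function names and the `(".", "_")` prefixes passed below are assumptions made for illustration; this is not the code merged in the PR.

```python
def split_segments(path):
    # Split a "/"-separated path into its non-empty segments.
    return [seg for seg in path.split("/") if seg]


def remove_ancestor(ancestor, path):
    # Return the segments of `path` below `ancestor`, or None if `path`
    # is not lexically inside `ancestor`.
    a = split_segments(ancestor)
    p = split_segments(path)
    if p[:len(a)] == a:
        return p[len(a):]
    return None


def is_ignored(path, partition_base_dir, ignore_prefixes):
    # Segments of the partition base directory itself are never checked.
    relative = remove_ancestor(partition_base_dir, path)
    # Assumption: a path outside the base directory has all segments checked.
    segments = split_segments(path) if relative is None else relative
    return any(seg.startswith(prefix)
               for seg in segments
               for prefix in ignore_prefixes)


prefixes = (".", "_")
# The base directory starts with "_", but that segment is exempt.
kept = is_ignored("/data/_tmp/part=1/f.parquet", "/data/_tmp", prefixes)
# A file below the base directory whose own name starts with "_" is ignored.
dropped = is_ignored("/data/_tmp/part=1/_metadata", "/data/_tmp", prefixes)
# A path outside the base directory is checked segment by segment.
outside = is_ignored("/other/_x/f.parquet", "/data/_tmp", prefixes)
```

Under this reading, a dataset rooted at a directory such as `/data/_tmp` is still discovered, while hidden or underscore-prefixed files and subdirectories below it continue to be skipped.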