-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata glob bug due to missing regex special character escaping #390
Comments
This is an interesting and important topic, but I don't think there is a rigorous solution. Note that globbing is used everywhere for filepath matching, not regexs, and we have essentially copied the implementation from pythons glob.glob (or fnmatch). Yes, some regex-containing filenames fail, and this is true for pathlib too. Unless these do work with python's glob.glob, I am tempted to say that fsspec is doing the right thing here. An alternative would be to implement a regex=True variant for glob, but this is called from a number of places, so I'm thinking it would take too much passing of kwargs around to make it work everywhere.
I've never understood why this exists! It calls the same os functions underneath and has the same problems. In fsspec, we convert all (windows) paths to posix as soon as we can, rather than try to support both in all the operations. Actually, Cpython has "magic" to convert them back again for the appropriate file-system drivers, so this is all rather circular. |
See https://docs.python.org/3/library/glob.html#glob.escape which I think should be implemented and also first called on the path name passed to glob and before the trailing "*" is added. fsspec's glob seems to not be intended to be a generic glob method (or people would just use the native glob), and especially the way it gets used by dask/fastparquet, so this seems sensible. |
We do Not want to use the OS-specific handles in fsspec glob, because we are usually not dealing with a local file-system. You could say that fsspec is more generic. Perhaps LocalFileSystem could directly call glob (previous versions did), but we prefer to work in posix-like filepaths where possible. |
I agree about avoiding os.glob for the fsspec broader intentions. That said, I believe the issue here is that the fsspec glob function is implemented with regex, and if it is intended to be "shell like" then it must escape non-shell specifics such as "^" to avoid clashes with its chosen regex implementation. With regard to paths that are passed to glob that happen to contain shell-like wildcards such as "*", supporting an escape method seems to also make sense for those chars. In the case we found, using dask/fastparquet to serialize to a file with things like '^GSPC' (the Yahoo! Finance S&P 500 index), needs those escapes or it's a nightmare to normalize ^GSPC (and all its brethren) everywhere and denormalize to display for users so there's no ambiguity. "^GSPC" is a 100% valid file system name, it must be reminded. |
OK: so I suppose the best thing would be to add your extra replacements to the code as a PR, and add tests for more exotic filenames both to the local and one of the non-local (memory or ftp) filesystems. You are quite right, it's often best to speak of particular example cases. |
Hi, @martindurant , I'm working on this now... |
PR submitted. Thanks, again. |
Related to #94 which was incomplete.
Greetings and thank you for a fine library. Having trouble using this with fastparquet under dask due to file name issues that manifest when the path name contains special regex characters. This issue looks like it should also have included ^ in the list of magic characters. Per https://docs.python.org/3.8/library/re.html the list of single characters to watch out for is: .^$*+?{}[]|()\
It seems that if these are all properly escaped when present in a path name then degenerate cases such as these would be avoided as well: {2} would be {2}, (?family) would be (?family), \A would be \A, \d000 would be \d000, etc.
These are legitimate in Unix file systems, at least, even if people think that "shell" special characters are illlegal--they are not. Perhaps if special case lists of characters are needed, then a specialized local fsspec implementation can provide its own list vs. a universal list which isn't actually universal.
In spec.py, I'd change it as follows (I tested this for my cases and it works)
I did not have time to review the logic of handling the embedded * case as it looks like this may need some thought:
I can submit a PR if you agree with the above logic.
P.S. Looking at the implementation of the code, perhaps using https://docs.python.org/3/library/pathlib.html might avoid both the complexities of local file system differences among platforms already handled and avoid the bugs likely to have been encountered and fixed.
The text was updated successfully, but these errors were encountered: