Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata glob bug due to missing regex special character escaping #390

Closed
shipmints opened this issue Aug 29, 2020 · 7 comments · Fixed by #408
Closed

Metadata glob bug due to missing regex special character escaping #390

shipmints opened this issue Aug 29, 2020 · 7 comments · Fixed by #408

Comments

@shipmints
Copy link
Contributor

Related to #94 which was incomplete.

Greetings and thank you for a fine library. Having trouble using this with fastparquet under dask due to file name issues that manifest when the path name contains special regex characters. This issue looks like it should also have included ^ in the list of magic characters. Per https://docs.python.org/3.8/library/re.html the list of single characters to watch out for is: .^$*+?{}[]|()\

It seems that if these are all properly escaped when present in a path name then degenerate cases such as these would be avoided as well: {2} would be {2}, (?family) would be (?family), \A would be \A, \d000 would be \d000, etc.

These are legitimate in Unix file systems, at least, even if people think that "shell" special characters are illlegal--they are not. Perhaps if special case lists of characters are needed, then a specialized local fsspec implementation can provide its own list vs. a universal list which isn't actually universal.

In spec.py, I'd change it as follows (I tested this for my cases and it works)


    def glob(self, path, **kwargs):
    <snip>
        pattern = (
            "^"
            + (
                # List of special characters adapted from Python's re documentation
                # .^$*+?{}[]|()\
                path.replace("\\", r"\\")
                .replace(".", r"\.")
                .replace("+", r"\+")
                .replace("//", "/")
                .replace("(", r"\(")
                .replace(")", r"\)")
                .replace("|", r"\|")
                .replace("^", r"\^")
                .replace("$", r"\$")
                # .replace("*", r"\*") # this should be dealt with as well
                .replace("{", r"\{")
                .replace("}", r"\}")
                .replace("[", r"\[")
                .replace("]", r"\]")
                .rstrip("/")
                .replace("?", ".")
            )
            + "$"
        )
    <snip>

I did not have time to review the logic of handling the embedded * case as it looks like this may need some thought:

        paths = fs.glob(paths[0] + fs.sep + "*") # how will this work with embedded *s?

I can submit a PR if you agree with the above logic.

P.S. Looking at the implementation of the code, perhaps using https://docs.python.org/3/library/pathlib.html might avoid both the complexities of local file system differences among platforms already handled and avoid the bugs likely to have been encountered and fixed.

@martindurant
Copy link
Member

This is an interesting and important topic, but I don't think there is a rigorous solution. Note that globbing is used everywhere for filepath matching, not regexs, and we have essentially copied the implementation from pythons glob.glob (or fnmatch). Yes, some regex-containing filenames fail, and this is true for pathlib too. Unless these do work with python's glob.glob, I am tempted to say that fsspec is doing the right thing here.

An alternative would be to implement a regex=True variant for glob, but this is called from a number of places, so I'm thinking it would take too much passing of kwargs around to make it work everywhere.

pathlib might avoid both the complexities

I've never understood why this exists! It calls the same os functions underneath and has the same problems. In fsspec, we convert all (windows) paths to posix as soon as we can, rather than try to support both in all the operations. Actually, Cpython has "magic" to convert them back again for the appropriate file-system drivers, so this is all rather circular.

@shipmints
Copy link
Contributor Author

See https://docs.python.org/3/library/glob.html#glob.escape which I think should be implemented and also first called on the path name passed to glob and before the trailing "*" is added. fsspec's glob seems to not be intended to be a generic glob method (or people would just use the native glob), and especially the way it gets used by dask/fastparquet, so this seems sensible.

@martindurant
Copy link
Member

re.compile('([*?[])').sub(r'[\1]') seems like a reasonable thing to do, that is what escape calls.

fsspec's glob seems to not be intended to be a generic glob

We do Not want to use the OS-specific handles in fsspec glob, because we are usually not dealing with a local file-system. You could say that fsspec is more generic. Perhaps LocalFileSystem could directly call glob (previous versions did), but we prefer to work in posix-like filepaths where possible.

@shipmints
Copy link
Contributor Author

I agree about avoiding os.glob for the fsspec broader intentions.

That said, I believe the issue here is that the fsspec glob function is implemented with regex, and if it is intended to be "shell like" then it must escape non-shell specifics such as "^" to avoid clashes with its chosen regex implementation. With regard to paths that are passed to glob that happen to contain shell-like wildcards such as "*", supporting an escape method seems to also make sense for those chars. In the case we found, using dask/fastparquet to serialize to a file with things like '^GSPC' (the Yahoo! Finance S&P 500 index), needs those escapes or it's a nightmare to normalize ^GSPC (and all its brethren) everywhere and denormalize to display for users so there's no ambiguity. "^GSPC" is a 100% valid file system name, it must be reminded.

@martindurant
Copy link
Member

OK: so I suppose the best thing would be to add your extra replacements to the code as a PR, and add tests for more exotic filenames both to the local and one of the non-local (memory or ftp) filesystems. You are quite right, it's often best to speak of particular example cases.

@shipmints
Copy link
Contributor Author

Hi, @martindurant , I'm working on this now...

@shipmints
Copy link
Contributor Author

PR submitted. Thanks, again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants