
glob and rglob can be slow for large directories or lots of files #274

Closed
Gilthans opened this issue Oct 7, 2022 · 7 comments · Fixed by #276 or #304

Comments

@Gilthans (Contributor) commented Oct 7, 2022

Hello!
While gleefully using cloudpathlib, I needed a recursive iteration of the files in a directory. The directory is large (6,238 files), so my first approach, a recursive `iterdir()` + `is_file()`, took way too long (likely due to #176).
I remembered that `glob` makes better use of the cloud list calls, so I tried `list(p.rglob("*"))`. On my directory, that took 17m34s.
I then tried to 'cheat' and call `[f for f, is_dir in p.client._list_dir(p, recursive=True) if not is_dir]`. It took 1.435s.

I looked at the glob logic, but I still can't understand the discrepancy (using Google Cloud). That may warrant a separate issue.

In the meantime, I wonder whether it might be a good idea to make `_list_dir` a public function as a workaround.
Another option is to add `recursive` and/or `files_only` keywords to `iterdir`. That deviates from the pathlib API, but since they would be added keywords, it might be OK?
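For concreteness, the second option could look roughly like the sketch below. This is purely illustrative: `recursive` and `files_only` are the *proposed* keywords, not existing ones, and `list_dir` / `fake_list_dir` are stand-ins for cloudpathlib's private `_list_dir`, which yields `(path, is_dir)` pairs.

```python
# Illustrative sketch only; these keywords do not exist in cloudpathlib today.
def iterdir(path, list_dir, recursive=False, files_only=False):
    # list_dir mirrors the (path, is_dir) pairs of the private _list_dir.
    for p, is_dir in list_dir(path, recursive=recursive):
        if files_only and is_dir:
            continue  # skip directory entries when only files are wanted
        yield p

# Fake flat listing standing in for a single cloud "list objects" call.
def fake_list_dir(path, recursive):
    entries = [("a.txt", False), ("sub", True), ("sub/b.txt", False)]
    return entries if recursive else entries[:2]

print(list(iterdir("dir", fake_list_dir, recursive=True, files_only=True)))
# → ['a.txt', 'sub/b.txt']
```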

I'm suggesting these options even though solving #176 would probably fix most of this, because these workarounds are much simpler.
I'd of course be happy to send a PR.
WDYT?

@pjbull (Member) commented Oct 10, 2022

Yes, I agree with the sentiment: we want a much faster simple recursive directory listing than what we currently expose.

As you point out, we work to make CloudPaths and Paths as interchangeable as possible, which means not relying on an expanded API for core functionality.

With that in mind, there are two APIs for recursively iterating over the contents of a directory in pathlib. First, in Python 3.12+ we will have `Path.walk`; we'll put that aside for now. Second, I think we want the current most common ways of recursively listing directories, `glob('**/*')` and `rglob('*')`, to be sufficiently fast, which brings us to your next point:

> I looked at the glob logic, but I still can't understand the discrepancy (using Google Cloud).

The glob DSL is annoyingly complex, and to make maintainability somewhat easier, we copy the CPython implementation of glob pretty directly. Unfortunately, this ends up being quite slow on large directory trees (perhaps because of network overhead or the complexity of the implementation?).
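One way to see why network overhead bites: a CPython-style recursive glob lists each directory level separately, so the number of list calls grows with the number of directories, and on a cloud backend each call is a network round trip. A toy call counter (pure illustration, not cloudpathlib internals):

```python
# Toy directory tree: {dir: ([files], [subdirs])}
TREE = {
    "root": (["a"], ["root/x", "root/y"]),
    "root/x": (["b"], []),
    "root/y": (["c"], ["root/y/z"]),
    "root/y/z": (["d"], []),
}

calls = {"n": 0}

def list_one_level(d):
    calls["n"] += 1  # each call = one network round trip on a cloud backend
    return TREE[d]

def walk(d):
    files, subdirs = list_one_level(d)
    out = list(files)
    for s in subdirs:
        out += walk(s)
    return out

files = walk("root")
print(calls["n"])  # → 4: one list call per directory
# A recursive listing API returns everything in a single call,
# independent of how many directories there are.
```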

In my mind, short-circuiting these common "list everything" recursive calls to `glob` and `rglob` to a much faster `_list_dir`-based version may be the best path forward: no new API, direct speedups, and a relatively simple implementation. The downside is that the runtime difference between a glob with more complex logic and a simple recursive one could be substantial (which may be confusing). However, if we do this, I think perf improvements to `glob` and `rglob` themselves can be tracked separately.
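A minimal sketch of that short-circuit idea, to make the proposal concrete. Everything here is illustrative: `list_dir` and `fake_list_dir` mimic the `(path, is_dir)` shape of the private `_list_dir` described in this thread, and the set of "easy" patterns is an assumption, not cloudpathlib's actual logic.

```python
# Patterns equivalent to "everything under this prefix, recursively".
EASY_RECURSIVE_PATTERNS = {"*", "**", "**/*"}

def rglob_shortcut(prefix, pattern, list_dir):
    """Illustrative only. list_dir(prefix, recursive=...) yields
    (relative_path, is_dir) pairs, mirroring _list_dir's shape."""
    if pattern in EASY_RECURSIVE_PATTERNS:
        # One recursive listing call instead of one list call per directory.
        return [p for p, _is_dir in list_dir(prefix, recursive=True)]
    # Anything more complex would fall back to the full glob machinery.
    raise NotImplementedError("defer to the existing glob implementation")

# Fake flat listing standing in for a single cloud "list objects" call.
def fake_list_dir(prefix, recursive):
    entries = [("a.txt", False), ("sub", True), ("sub/b.parquet", False)]
    return entries if recursive else entries[:2]

print(rglob_shortcut("bucket/dir", "**/*", fake_list_dir))
# → ['a.txt', 'sub', 'sub/b.parquet']
```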

Would that work for your case?

@Gilthans (Contributor, author)

Yes, that would work well.
I could try to draft a PR later in the week, if that's alright.

@pjbull (Member) commented Oct 10, 2022

That would be great, thanks!

Gilthans changed the title from "Make `_list_dir` public" to "Short-circuit `rglob`/`glob` 'easy' patterns to simply use `_list_dir`, speeding up listing directories" (Oct 11, 2022)
@Gilthans (Contributor, author)

After some consideration, I realized that without #176 it wouldn't work, since I would still need to filter out files.
However, this is still a useful thing to have.

@pjbull (Member) commented Nov 2, 2022

Per discussion in #276, this needs some profiling to really get to the bottom of why glob is particularly slow. Shortcuts for common patterns were added as a workaround.

pjbull changed the title from "Short-circuit `rglob`/`glob` 'easy' patterns to simply use `_list_dir`, speeding up listing directories" to "glob and rglob can be slow for large directories or lots of files" (Nov 2, 2022)
@RyanMarten commented Dec 20, 2022

A similar problem arose for me:
`path = [x for x in folder.iterdir() if ".parquet" in x.key]`
is much, much faster than
`paths = list(folder.glob('*.parquet'))`
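The `iterdir` variant wins for the same reason discussed above: it is one flat listing plus a cheap client-side filter. A rough local sketch of that filtering pattern; `fnmatch` here stands in for the suffix test and none of this is cloudpathlib's code:

```python
import fnmatch

# One flat listing, i.e. what a single cloud "list" call would return.
names = ["part-000.parquet", "part-001.parquet", "_SUCCESS", "logs.txt"]

# Filtering client-side over that one listing, as the iterdir snippet does...
by_suffix = [n for n in names if n.endswith(".parquet")]

# ...is equivalent here to matching the glob pattern against each entry.
by_pattern = [n for n in names if fnmatch.fnmatch(n, "*.parquet")]

print(by_suffix)  # → ['part-000.parquet', 'part-001.parquet']
```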

@pjbull (Member) commented Dec 30, 2022

@Gilthans @RyanMarten We released a pretty big improvement here in #304.

You can test it in version 0.12.0, which is on PyPI now. Could you give it a spin and let us know if this improves performance in your scenarios?
