Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up glob implementation #304

Merged
merged 6 commits into from Dec 30, 2022
Merged

Speed up glob implementation #304

merged 6 commits into from Dec 30, 2022

Conversation

pjbull
Copy link
Member

@pjbull pjbull commented Dec 20, 2022

I believe that globs were slow because the internals of how pathlib calls our functions made us loop over all of the files in the directory every time it was checking children, which happened for any nested file.

Instead of passing the whole file tree to every selectable instance and letting that object filter it down on demand, this implementation builds the entire file tree as nested dicts and then lets scandir walk those nested dicts. Each selectable now knows only about its direct parents/children.

Closes #274

Perf results

The list and glob scenarios listed below are executed against the same file trees, so this lets us more or less compare raw _list_dir calls to the performance of glob. (Note: N: items is different because the glob test filters out folders).

Before this change

Super slow........

                                  Performance suite results: (2022-12-20T01:29:29.267986)                                  
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │          2 │ 0:00:01.256078 │ ± 0:00:00.037748 │ 0:00:01.282770 │   5,500 │
│ List Folders   │ List shallow non-recursive │          2 │ 0:00:01.261018 │ ± 0:00:00.003631 │ 0:00:01.263586 │   5,500 │
│ List Folders   │ List normal recursive      │          2 │ 0:00:01.778471 │ ± 0:00:00.019042 │ 0:00:01.791936 │   7,877 │
│ List Folders   │ List normal non-recursive  │          2 │ 0:00:00.035925 │ ± 0:00:00.000327 │ 0:00:00.036156 │     113 │
│ List Folders   │ List deep recursive        │          2 │ 0:00:03.105114 │ ± 0:00:00.059515 │ 0:00:03.147197 │   7,955 │
│ List Folders   │ List deep non-recursive    │          2 │ 0:00:00.040619 │ ± 0:00:00.010220 │ 0:00:00.047846 │      31 │
│ Glob scenarios │ Glob shallow recursive     │          2 │ 0:06:58.346311 │ ± 0:00:00.956693 │ 0:06:59.022795 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │          2 │ 0:04:39.675058 │ ± 0:00:00.115285 │ 0:04:39.756577 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │          2 │ 0:00:54.873498 │ ± 0:00:00.014340 │ 0:00:54.883638 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │          2 │ 0:00:06.609309 │ ± 0:00:00.005283 │ 0:00:06.613045 │      12 │
│ Glob scenarios │ Glob deep recursive        │          2 │ 0:04:09.120405 │ ± 0:00:01.541716 │ 0:04:10.210563 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │          2 │ 0:00:05.651228 │ ± 0:00:00.232796 │ 0:00:05.815840 │      25 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘

After this change

Very little overhead when doing actual comparisons and filtering (~10%) beyond the vanilla _list_dir calls!

                                  Performance suite results: (2022-12-20T09:26:30.055262)                                  
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:01.213074 │ ± 0:00:00.047582 │ 0:00:01.307935 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:01.235682 │ ± 0:00:00.020873 │ 0:00:01.276783 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:01.764192 │ ± 0:00:00.059765 │ 0:00:01.825311 │   7,877 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:00.036764 │ ± 0:00:00.004089 │ 0:00:00.047984 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:03.052238 │ ± 0:00:00.457676 │ 0:00:04.332024 │   7,955 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:00.032839 │ ± 0:00:00.002672 │ 0:00:00.038686 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.484482 │ ± 0:00:00.065573 │ 0:00:01.561880 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:01.481584 │ ± 0:00:00.029798 │ 0:00:01.523332 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:02.017116 │ ± 0:00:00.055220 │ 0:00:02.115873 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:00.037116 │ ± 0:00:00.003030 │ 0:00:00.043329 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:03.404524 │ ± 0:00:00.244392 │ 0:00:04.028038 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:00.033715 │ ± 0:00:00.002128 │ 0:00:00.036057 │      25 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘

Also, CI timings are fickle, but we may have knocked a few minutes off the test suite when compared to master:
image

@github-actions
Copy link
Contributor

github-actions bot commented Dec 20, 2022

@pjbull pjbull requested a review from jayqi December 20, 2022 17:49
@codecov-commenter
Copy link

codecov-commenter commented Dec 21, 2022

Codecov Report

Merging #304 (9d0a83d) into master (84f73c6) will decrease coverage by 0.0%.
The diff coverage is 96.1%.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #304     +/-   ##
========================================
- Coverage    94.5%   94.4%   -0.1%     
========================================
  Files          21      21             
  Lines        1384    1378      -6     
========================================
- Hits         1309    1302      -7     
- Misses         75      76      +1     
Impacted Files Coverage Δ
cloudpathlib/cloudpath.py 92.5% <96.1%> (-0.3%) ⬇️

Copy link
Member

@jayqi jayqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! One partially-baked thought but this looks like a big improvement and tests passing seems convincing.

cloudpathlib/cloudpath.py Show resolved Hide resolved
@pjbull
Copy link
Member Author

pjbull commented Dec 30, 2022

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

glob and rglob can be slow for large directories or lots of files
3 participants