Reduce network overhead of _refresh_cache #111

Merged
pjbull merged 7 commits into master from 110-refresh-network on Jan 11, 2021

@pjbull pjbull commented Dec 22, 2020

We used to make many network calls in _refresh_cache, which slowed down scripts that touch lots of files. With this change, _refresh_cache needs only one network call.

  • Add a NoStat exception, raised when we can't get stats for a particular path (two cases: the path is a directory, or it does not exist)
  • Get stats once at the top of the call
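
To make the "stat once at the top" idea concrete, here is a minimal sketch of the pattern. It is not the cloudpathlib implementation: `CountingClient`, `FakeStat`, and this standalone `refresh_cache` function are hypothetical stand-ins, and only the `NoStat` name comes from this PR. The fake client counts simulated network calls so the single stat per refresh is visible.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


class NoStat(Exception):
    """Stats could not be fetched: the key is a directory or does not exist."""


@dataclass
class FakeStat:
    mtime: float
    size: int


class CountingClient:
    """Hypothetical in-memory stand-in for a cloud client (assumption)."""

    def __init__(self, objects: Dict[str, bytes], mtimes: Dict[str, float]):
        self.objects = objects
        self.mtimes = mtimes
        self.calls = 0  # number of simulated network round trips

    def stat(self, key: str) -> FakeStat:
        self.calls += 1
        if key not in self.objects:
            # Directory and missing key are indistinguishable without a second call.
            raise NoStat(f"No stats available for {key}")
        return FakeStat(self.mtimes[key], len(self.objects[key]))

    def download(self, key: str, dest: Path) -> None:
        self.calls += 1
        dest.write_bytes(self.objects[key])


def refresh_cache(client: CountingClient, key: str, local: Path) -> bool:
    """Refresh the local copy of `key`; return True if a download happened."""
    stats = client.stat(key)  # the only stat call; NoStat propagates to the caller

    # Reuse `stats` for every decision instead of asking the cloud again.
    if local.exists() and local.stat().st_mtime >= stats.mtime:
        return False  # cache is already up to date

    client.download(key, local)
    # Align the local mtime with the cloud mtime so the next refresh is a no-op.
    os.utime(local, (stats.mtime, stats.mtime))
    return True
```

In this sketch an up-to-date cache costs exactly one simulated call, and a stale or empty cache costs two (the stat plus the download).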

Downsides:

  • We used to be able to provide a specific error if somehow you tried to cache a directory instead of a file. I don't think this will happen, so it probably is not worth the additional network call.

On a slow connection, it seems ~4x faster after removing the additional calls. (The first _refresh_cache is slow because it actually downloads the file over my slow connection.)

BEFORE:
Screen Shot 2020-12-21 at 11 30 52 AM

AFTER:
Screen Shot 2020-12-21 at 11 25 52 AM

Closes #110

Make a single network call in _refresh_cache
@pjbull pjbull requested a review from jayqi December 22, 2020 01:22

codecov bot commented Dec 22, 2020

Codecov Report

Merging #111 (3a3d69b) into master (8b230c3) will decrease coverage by 0.1%.
The diff coverage is 91.4%.

@@           Coverage Diff            @@
##           master    #111     +/-   ##
========================================
- Coverage    91.4%   91.3%   -0.2%     
========================================
  Files           8       8             
  Lines         680     691     +11     
========================================
+ Hits          622     631      +9     
- Misses         58      60      +2     
Impacted Files                      Coverage Δ
cloudpathlib/azure/azblobpath.py    92.4% <77.7%> (-3.2%) ⬇️
cloudpathlib/s3/s3client.py         91.6% <85.7%> (-1.7%) ⬇️
cloudpathlib/cloudpath.py           90.1% <100.0%> (+0.4%) ⬆️
cloudpathlib/s3/s3path.py           97.8% <100.0%> (+0.1%) ⬆️

@jayqi jayqi left a comment

This looks like a reasonable solution to me.

Upload was also slow. Caching stats calls here as well.

Bonus: updated comment typo
Use objects filter instead of getting metadata to reduce cloud calls in file/dir check
Bonus: loop to list comprehension

pjbull commented Dec 28, 2020

Two updates, and I think this is ready to go if all the tests pass.

(1) In testing, found that _upload_local_to_cloud also had repeated stat calls, so refactored that.

(2) While looking for speed improvements, found that filtering the objects list made both our _exists and _is_file_or_dir calls on the S3 client faster. Here are some test runs:

image
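
For readers who want the idea behind (2) spelled out, here is a rough sketch of a file/dir check driven by a prefix-filtered listing rather than a per-object metadata request. It is not the code from this PR: `list_keys` is a hypothetical in-memory stand-in for a prefix-filtered listing (with boto3, that role would be played by something like `Bucket.objects.filter(Prefix=...)`), and `file_or_dir` and `exists` are illustrative names.

```python
from typing import Iterable, Iterator, Optional


def list_keys(all_keys: Iterable[str], prefix: str) -> Iterator[str]:
    """Hypothetical stand-in for a prefix-filtered object listing (assumption)."""
    return (k for k in sorted(all_keys) if k.startswith(prefix))


def file_or_dir(all_keys: Iterable[str], key: str) -> Optional[str]:
    """Return "file", "dir", or None using only the filtered listing."""
    key = key.rstrip("/")
    if not key:
        return "dir"  # treat the bucket root as a directory

    for found in list_keys(all_keys, key):
        if found == key:
            return "file"  # exact match: an object with this key exists
        if found.startswith(key + "/"):
            return "dir"  # something lives under this prefix
        # Otherwise a sibling such as "data.csv" matched the prefix "data"; keep looking.
    return None


def exists(all_keys: Iterable[str], key: str) -> bool:
    """Existence falls out of the same listing, with no extra request."""
    return file_or_dir(all_keys, key) is not None
```

The point of the sketch is that one listing answers both questions, so an existence check and a file/dir check do not each need their own metadata lookup.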

@pjbull pjbull marked this pull request as ready for review December 28, 2020 00:22
@pjbull pjbull changed the title from "[WIP] Reduce network overhead of _refresh_cache" to "Reduce network overhead of _refresh_cache" Dec 28, 2020
@jayqi jayqi self-requested a review January 2, 2021 02:11

@jayqi jayqi left a comment

LGTM! 🎉

@pjbull pjbull merged commit 7ae1cd9 into master Jan 11, 2021
@pjbull pjbull deleted the 110-refresh-network branch January 11, 2021 23:28
Successfully merging this pull request may close these issues.

Slow performance for small files
2 participants