Reduce network overhead of _refresh_cache#111

Merged
pjbull merged 7 commits into master from 110-refresh-network on Jan 11, 2021
Conversation

@pjbull (Member) commented Dec 22, 2020

We used to do a lot of network calls in _refresh_cache, which slowed down scripts that looked at lots of files. This change makes it so _refresh_cache only needs one network call.

  • Add a NoStat exception for when we can't get stats for a particular path (two cases: the path is a directory, or it does not exist)
  • Get stats once at top of call

Downsides:

  • We used to be able to provide a specific error if somehow you tried to cache a directory instead of a file. I don't think this will happen, so it probably is not worth the additional network call.

On a slow connection, it seems ~4x faster after removing the additional calls. (First _refresh_cache is slow because it actually downloads the file on my slow connection)
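The single-call pattern described above can be sketched roughly as follows. All names here (`NoStatError`, `FakeClient`, `refresh_cache`) are hypothetical stand-ins for illustration, not cloudpathlib's actual API:

```python
from datetime import datetime


class NoStatError(Exception):
    """Stand-in for the new NoStat exception: raised when stats can't be
    fetched for a path (it is a directory, or it does not exist)."""


class FakeClient:
    """Illustrative client that counts network round trips."""

    def __init__(self, mtime):
        self.mtime = mtime
        self.network_calls = 0

    def stat(self, path):
        self.network_calls += 1          # each stat is one network call
        if path.endswith("/"):
            raise NoStatError(path)      # directories have no object stats
        return {"mtime": self.mtime, "size": 123}


def refresh_cache(client, path, cached_mtime=None):
    # Fetch stats exactly once at the top of the call, then reuse the
    # result for both the existence check and the staleness comparison,
    # instead of re-stat-ing the path for each check.
    stats = client.stat(path)
    if cached_mtime is None or stats["mtime"] > cached_mtime:
        return "download"                # cache missing or stale
    return "up-to-date"
```

Under this sketch, a full refresh costs one `stat` call regardless of how many checks are performed on the result.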

BEFORE: (screenshot: timing run before the change, Dec 21 2020)

AFTER: (screenshot: timing run after the change, Dec 21 2020)

Closes #110

Make a single network call in _refresh_cache
@pjbull pjbull requested a review from jayqi December 22, 2020 01:22
codecov bot commented Dec 22, 2020

Codecov Report

Merging #111 (3a3d69b) into master (8b230c3) will decrease coverage by 0.1%.
The diff coverage is 91.4%.

@@           Coverage Diff            @@
##           master    #111     +/-   ##
========================================
- Coverage    91.4%   91.3%   -0.2%     
========================================
  Files           8       8             
  Lines         680     691     +11     
========================================
+ Hits          622     631      +9     
- Misses         58      60      +2     
Impacted Files                       Coverage Δ
cloudpathlib/azure/azblobpath.py     92.4% <77.7%> (-3.2%) ⬇️
cloudpathlib/s3/s3client.py          91.6% <85.7%> (-1.7%) ⬇️
cloudpathlib/cloudpath.py            90.1% <100.0%> (+0.4%) ⬆️
cloudpathlib/s3/s3path.py            97.8% <100.0%> (+0.1%) ⬆️

@jayqi (Member) left a comment:
This looks like a reasonable solution to me.

Upload was also slow. Caching stats calls here as well.

Bonus: updated comment typo
Use objects filter instead of getting metadata to reduce cloud calls in file/dir check
Bonus: loop to list comprehension
@pjbull (Member, Author) commented Dec 28, 2020

Two updates, and I think this is ready to go if all the tests pass.

(1) In testing, found that _upload_local_to_cloud also had repeated stat calls, so refactored that.

(2) While looking for speed improvements, found that filtering the objects list made both our _exists and _is_file_or_dir calls on the S3 client faster. Here are some test runs:

(screenshot: benchmark test runs)
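The listing-based check from point (2) can be sketched in plain Python over an in-memory key list. This is a hypothetical stand-in for filtering an S3 objects listing by prefix; `file_or_dir` is not the library's real method name:

```python
def file_or_dir(keys, path):
    """Classify `path` against a flat list of object keys using a single
    prefix-filtered listing, rather than a separate metadata call per path.

    Returns "file", "dir", or None (does not exist).
    """
    prefix = path.rstrip("/")
    if prefix in keys:
        # An exact key match means an object exists at this path.
        return "file"
    if any(k.startswith(prefix + "/") for k in keys):
        # No exact match, but keys nested under the prefix make it a "dir".
        return "dir"
    return None
```

Note the `prefix + "/"` check: it prevents `a/b` from being misclassified as a directory just because a sibling key like `a/bc.txt` shares the shorter prefix.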

@pjbull pjbull marked this pull request as ready for review December 28, 2020 00:22
@pjbull pjbull changed the title [WIP] Reduce network overhead of _refresh_cache Reduce network overhead of _refresh_cache Dec 28, 2020
@jayqi jayqi self-requested a review January 2, 2021 02:11
@jayqi (Member) left a comment:
LGTM! 🎉

@pjbull pjbull merged commit 7ae1cd9 into master Jan 11, 2021
@pjbull pjbull deleted the 110-refresh-network branch January 11, 2021 23:28

Development

Successfully merging this pull request may close these issues.

Slow performance for small files

2 participants