Reduce cost for repeated accesses #255
Comments
Hey, thank you for the feedback! We've heard this ask from a few customers and we're looking into it, but we have nothing to share right now.
Local file caching would make a huge performance gain in our workflows, where we access the same file dozens or hundreds of times in some cases. Currently I switch to s3fs when I need to run these workflows.
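To illustrate the access pattern described above, here is a minimal sketch in Python of what a local read-through cache does for repeated reads of the same file. The `cached_read` helper and the cache directory are hypothetical illustrations, not part of Mountpoint:

```python
import hashlib
import os
import shutil
import tempfile

# Hypothetical cache location; not part of Mountpoint itself.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "mp-read-cache")

def cached_read(path: str) -> bytes:
    """Read `path`, serving repeated accesses from a local copy.

    The first access copies the file into CACHE_DIR; subsequent
    accesses read the local copy instead of going back through the
    mount (and thus back to S3 with another GET).
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(os.path.abspath(path).encode()).hexdigest()
    local = os.path.join(CACHE_DIR, key)
    if not os.path.exists(local):
        shutil.copyfile(path, local)
    with open(local, "rb") as f:
        return f.read()
```

With this shape, a workflow that reads the same file a hundred times pays the remote cost once; the trade-off is that the cache must be invalidated if the object can change.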
Yes, I second this issue.
Additionally, I wonder if it would be possible to provide a parameter to avoid re-listing the bucket once it has been done. This is a costly operation and can lead to slower training.
Hey Thomas, would you be able to give a little more detail here on the access pattern? Maybe some short example code? I'm wondering if the cost you're referring to is caused by some explicit call like […]. Either way, I see why listing the bucket/prefix up front and storing the results would help both of these cases.
Hey @dannycjones, yes, let's assume the dataset doesn't change over time and is used in read-only mode. Right now, every time I list ImageNet from Python (~1.2M files), it takes 300 seconds. Here are 3 additional things I wish to have on top of the possibility to cache the files locally:
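In the meantime, the re-listing cost described above can be worked around in user code by memoizing the listing once per process. A minimal sketch, assuming a read-only dataset (the `list_once` helper is hypothetical, not a Mountpoint API):

```python
import os
from functools import lru_cache

@lru_cache(maxsize=None)
def list_once(root: str) -> tuple:
    """Walk `root` a single time and memoize the result, so a
    read-only dataset is only listed once per process instead of
    triggering repeated LIST traffic through the mount."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            entries.append(os.path.join(dirpath, name))
    return tuple(sorted(entries))
```

The memoized tuple can then be passed to a data loader directly, so training epochs after the first never touch the directory tree again.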
Ongoing work on this issue is being integrated behind a build-time feature flag (see mountpoint-s3/mountpoint-s3/Cargo.toml, lines 64 to 66 at fa0d516).
@dannycjones What are the steps to compile it with the feature flag?
and I think step 4 should be changed from […]
Exactly this. When running cargo commands (Rust's build system), you can add […]
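For reference, enabling an opt-in Cargo feature at build time generally looks like the following. The feature name `caching` here is a placeholder, not confirmed by this thread; the actual flag is the one listed in mountpoint-s3/Cargo.toml:

```shell
# Build Mountpoint with an opt-in feature enabled.
# "caching" is a placeholder name; substitute the feature flag
# declared in mountpoint-s3/Cargo.toml.
cargo build --release --features caching
```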
Cool @dannycjones. I will give it a try next week.
Hey @dannycjones. Going to bump it today and run a heavy stress test over it for the next week. I will ping you if we find anything.
I'm planning to update the default metadata cache TTL to 1 second or a similarly short value, so that it suits general-purpose workloads. I'd recommend pinning the metadata cache TTL to a value that suits your workload (using […]).
For those following this thread and compiling from the main branch (like me), it looks like this parameter has changed to […]. I'm excited to give this another try with a bucket that has 1000s of files in it.
Cool, I will try that.
You will also need to specify […]. As noted above, both options are currently only available when building with the […] feature flag.
Support for caching is now available in Mountpoint 1.2.0. See the updated docs for help configuring the cache.
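As a concrete starting point, a mount with local caching and a pinned metadata TTL might look like the following. The flag names reflect my reading of the Mountpoint docs at the time of writing, and the bucket name and paths are placeholders; verify against `mount-s3 --help` for your version:

```shell
# Mount with a local data cache and a 5-minute metadata TTL
# (Mountpoint >= 1.2.0). Bucket name and directories are placeholders.
mkdir -p /tmp/mountpoint-cache /mnt/data
mount-s3 --cache /tmp/mountpoint-cache --metadata-ttl 300 \
    amzn-s3-demo-bucket /mnt/data
```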
Hey @dannycjones @passaro. After a week of heavy testing, it appears mountpoint-s3 is more flaky than other open source solutions. We are seeing a failure ratio of 7/10 in our heavy benchmarks, with transport errors. I will update you with more details once we validate this isn't coming from our side.
Tell us more about this new feature.
After reading a file (or a portion of it) using Mountpoint for Amazon S3, customers want to keep the data on their compute instance for a configurable amount of time. With this enhancement, Mountpoint will make fewer GET requests to Amazon S3 when customers repeatedly access the same file data.