Reduce cost for repeated accesses #255

Closed
plgounod opened this issue May 23, 2023 · 19 comments
Labels
enhancement New feature or request

Comments

@plgounod

Tell us more about this new feature.

After reading a file (or a portion of it) using Mountpoint for Amazon S3, customers want to keep the data on their compute instance for a configurable amount of time. With this enhancement, Mountpoint will make fewer GET requests to Amazon S3 when customers repeatedly access the same file data.

@plgounod plgounod added the enhancement New feature or request label May 23, 2023
@dannycjones
Contributor

Hey, thank you for the feedback!

We've heard this ask from a few customers and we're looking into it, but we have nothing to share right now.

@stevew3344

Local file caching would bring a huge performance gain in our workflows, where we access the same file dozens or even hundreds of times in some cases. Currently I switch to s3fs when I need to run these workflows.

@tchaton

tchaton commented Sep 10, 2023

Yes, I second this issue.

@tchaton

tchaton commented Oct 6, 2023

Additionally, I wonder if it would be possible to provide a parameter to avoid re-listing the bucket after it has been listed once. This is a costly operation and can lead to slower training.

@dannycjones
Contributor

dannycjones commented Oct 6, 2023

Additionally, I wonder if it would be possible to provide a parameter to avoid re-listing the bucket after it has been listed once. This is a costly operation and can lead to slower training.

Hey Thomas, would you be able to give a little more detail here on the access pattern? Maybe a short code example?

I'm wondering if the cost you're referring to is caused by an explicit call like os.listdir(path), or by the ListObjectsV2 and HeadObject requests performed when opening individual files, i.e. during f = open(file_path).

Either way, I see why listing the bucket/prefix up front and storing the results would help both of these cases.
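One way to tell the two cases apart is to time them separately. The following is a minimal sketch, not something posted in the thread: the helper names `time_listing` and `time_opens` are hypothetical, and it assumes the bucket is already mounted at a local directory.

```python
import os
import time


def time_listing(directory):
    """Time one explicit os.listdir call; returns (seconds, entry_count)."""
    start = time.perf_counter()
    entries = os.listdir(directory)
    return time.perf_counter() - start, len(entries)


def time_opens(directory, names, read_bytes=1):
    """Time opening each named file and reading a few bytes from it.

    On a mounted bucket, any per-open metadata lookups would show up
    in this number rather than in the listing time.
    """
    start = time.perf_counter()
    for name in names:
        with open(os.path.join(directory, name), "rb") as f:
            f.read(read_bytes)
    return time.perf_counter() - start
```

For example, calling `time_listing("/mnt/s3/dataset")` (a hypothetical mount path) and then `time_opens` on a sample of file names from the same directory shows whether the listing or the per-file opens dominate.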

@tchaton

tchaton commented Oct 9, 2023

Hey @dannycjones,

Yes, let's assume the dataset doesn't change over time and is used in read-only mode. Right now, every time I list ImageNet from Python (~1.2M files), it takes 300 seconds.

Here are 3 additional things I wish to have on top of the possibility to cache the files locally:

    1. Keep the list of files in RAM or in a file once it has been listed, e.g., the second time I list from Python, it is as fast as listing locally.
    2. Some Python utilities to quickly list a bucket folder by recursively listing sub-folders in parallel, and to pre-load some files ahead of time with fine control from Python.
    3. Add support for dumping / restoring a bucket index to avoid listing the bucket over and over.
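Items 1 to 3 can be approximated on the client side with the standard library. The sketch below is illustrative only and is not a Mountpoint feature: the function names and the JSON index format are hypothetical, and it assumes the dataset is read-only, since the saved index is never refreshed.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def _scan(directory):
    """List one directory; return (file_paths, subdirectory_paths)."""
    files, subdirs = [], []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
            else:
                files.append(entry.path)
    return files, subdirs


def list_parallel(root, max_workers=16):
    """Walk root level by level, scanning each level's directories in parallel."""
    all_files = []
    pending = [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            next_level = []
            for files, subdirs in pool.map(_scan, pending):
                all_files.extend(files)
                next_level.extend(subdirs)
            pending = next_level
    return sorted(all_files)


def load_or_build_index(root, index_path, max_workers=16):
    """Return the saved file list if index_path exists, else list and save it."""
    if os.path.exists(index_path):
        with open(index_path) as f:
            return json.load(f)
    files = list_parallel(root, max_workers)
    with open(index_path, "w") as f:
        json.dump(files, f)
    return files
```

The first call to `load_or_build_index` pays the full listing cost; later calls only read the JSON file. Delete the index file to force a fresh listing if the bucket contents change.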

@dannycjones
Contributor

Ongoing work on this issue is being integrated behind the build-time feature flag "caching".

[features]
# Experimental features
caching = []

@tchaton

tchaton commented Oct 29, 2023

@dannycjones What are the steps to compile it with the feature flag?

@goldstar611

@dannycjones What are the steps to compile it with the feature flag?

https://github.com/awslabs/mountpoint-s3/blob/main/doc/INSTALL.md#building-mountpoint-for-amazon-s3-from-source

and I think step 4 should be changed from
cargo build --release
to
cargo build --release --features "caching"
if I'm reading the cargo features page correctly.

@dannycjones
Contributor

@dannycjones What are the steps to compile it with the feature flag?

https://github.com/awslabs/mountpoint-s3/blob/main/doc/INSTALL.md#building-mountpoint-for-amazon-s3-from-source

and I think step 4 should be changed from cargo build --release to cargo build --release --features "caching" if I'm reading the cargo features page correctly.

Exactly this. When running cargo commands (cargo is Rust's build system), you can add --features "caching", as in the build command above. We're deliberately limiting the differences when building with this flag at the moment, so the main difference is that the CLI arguments for enabling the cache are hidden when the caching flag is not provided.

@tchaton

tchaton commented Nov 4, 2023

Cool @dannycjones. I will give it a try next week.

@tchaton

tchaton commented Nov 15, 2023

Hey @dannycjones. Going to bump it today and run heavy stress tests over it for the next week. I will ping you if we find anything.

@dannycjones
Contributor

Hey @dannycjones. Going to bump it today and run heavy stress tests over it for the next week. I will ping you if we find anything.

I'm planning to update the default metadata cache TTL to 1 second or a similar short value, so that it suits general purpose workloads.

I'd recommend pinning the metadata cache TTL to a value that suits your workload (using --metadata-cache-ttl <SECONDS>). Since it's ML training and we don't expect objects in the prefix to change during training, a value that exceeds the expected duration of training sounds about right.

@goldstar611

I'd recommend pinning the metadata cache TTL to a value that suits your workload (using --metadata-cache-ttl <SECONDS>).

For those following this thread and compiling from the main branch (like me), it looks like this parameter has changed to --metadata-ttl in 7d38be7#diff-dc57c703340f88ddb4eab99dd8d870135117972f0f323c0a57b22a3f803b99ffL241

I'm excited to give this another try with a bucket that has 1000s of files in it.

@tchaton

tchaton commented Nov 18, 2023

Cool, I will try that.

@passaro
Contributor

passaro commented Nov 18, 2023

You will also need to specify --cache <DIR> to enable caching (both object metadata and content), in addition to --metadata-ttl <SECONDS>.

As noted above, both options are currently only available when building with the --features "caching" flag. You can find an early draft of the documentation in #587 .
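Putting the two options together, a launcher script might assemble the mount command as sketched below. This is a hedged illustration: the --cache and --metadata-ttl flags are the ones named in this thread, but the `mount-s3` binary name, the positional order of bucket and directory, and the `safety_factor` margin are assumptions. Check `--help` on your build, since the TTL flag was renamed during development.

```python
import math


def ttl_for_run(expected_seconds, safety_factor=1.5):
    """Pick a metadata TTL (whole seconds) that outlasts the expected run.

    safety_factor is a hypothetical margin, not a Mountpoint setting.
    """
    return math.ceil(expected_seconds * safety_factor)


def build_mount_command(bucket, mount_dir, cache_dir, metadata_ttl_seconds):
    """Return the argv list for mounting a bucket with caching enabled.

    Assumes the binary is called mount-s3 and takes the bucket name and
    the mount directory as positional arguments.
    """
    return [
        "mount-s3",
        bucket,
        mount_dir,
        "--cache", cache_dir,
        "--metadata-ttl", str(metadata_ttl_seconds),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. For a training job expected to last two hours, `ttl_for_run(7200)` gives 10800 seconds with the default margin.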

@passaro
Contributor

passaro commented Nov 22, 2023

Support for caching is now available in Mountpoint 1.2.0

@passaro passaro closed this as completed Nov 22, 2023
@passaro
Contributor

passaro commented Nov 22, 2023

See updated docs for help configuring the cache.

@tchaton

tchaton commented Nov 24, 2023

Hey @dannycjones @passaro. After a week of heavy testing, it appears mountpoint-s3 is flakier than other open source solutions. We are seeing a failure ratio of 7/10 in our heavy benchmarks, with transport errors. I will update you with more details once we validate this isn't coming from us.


6 participants