[Storage Cleaner] Migrate to cached_path #378
Conversation
Nice!
The cache that `cached_path` uses only ever grows; there is no built-in expiry. Also, the files we'll do this with are going to be very large, stretching the largest disks we can get. Sticking them into `~/.local/cached_path` will almost never work.
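For what it's worth, `cached_path` does let you redirect its cache root; below is a minimal sketch, assuming the library's `set_cache_dir` helper (and its `CACHED_PATH_CACHE_ROOT` env variable) behave as documented. Exact names may vary by version, and the example path and URL are hypothetical:

```python
from cached_path import cached_path, set_cache_dir

# Sketch: point the cache at a large scratch volume instead of the default
# under the home directory. Note this relocates the cache but adds no
# expiry; entries still accumulate until deleted manually.
set_cache_dir("/mnt/scratch/cached_path")  # hypothetical path

local_file = cached_path("s3://my-bucket/checkpoint.tar.gz")  # hypothetical URL
```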
scripts/storage_cleaner.py (Outdated)

```python
s3_resource = session.resource(
    "s3",
    endpoint_url=endpoint_url,
    config=botocore.client.Config(signature_version=botocore.UNSIGNED),
)
```
Why unsigned?
This was copy & paste, so I'll get rid of it.
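For context: `botocore.UNSIGNED` makes boto3 send anonymous, unsigned requests, which only succeed against publicly readable buckets. A minimal sketch of the fix, assuming the surrounding session already carries valid credentials:

```python
# Sketch: drop the UNSIGNED config so requests are signed with the
# session's credentials, as required for private S3/R2 buckets.
s3_resource = session.resource(
    "s3",
    endpoint_url=endpoint_url,
)
```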
scripts/storage_cleaner.py (Outdated)

```python
bucket_name = parsed_path.netloc
r2_path = parsed_path.path.lstrip("/")

session = boto3.session.Session()
```
How does authentication work for this? Does it read environment variables? If so, I assume there is no way to have both R2 and S3 working at the same time, because they would read auth information from the same place?
According to the boto docs, boto looks for credentials in the order below. I believe we typically leverage the 3rd option (env variables), though I personally use option 4 because I can easily switch between R2 and S3 by setting the env variable `AWS_PROFILE` to `R2` or `S3` (those are just profile names I chose for myself).

1. Passing credentials as parameters in the `boto.client()` method
2. Passing credentials as parameters when creating a `Session` object
3. Environment variables
4. Shared credential file (`~/.aws/credentials`)
5. AWS config file (`~/.aws/config`)
6. Assume Role provider
7. Boto2 config file (`/etc/boto.cfg` and `~/.boto`)
8. Instance metadata service on an Amazon EC2 instance that has an IAM role configured
The current code doesn't support R2 and S3 at the same time, but I can make it work with option 4 or 5. Boto sessions can take a `profile_name`, which is equivalent to the `AWS_PROFILE` env variable. The main problem with this approach is that the script user needs to set up their credentials appropriately. @epwalsh thoughts? Would you be willing to use credentials from `~/.aws/*` instead of env variables for this script? If we ever wanted to add R2 to our training, this approach is the least messy way of adding R2 support to that (so it may be worth us getting used to it now).
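A minimal sketch of the profile-based setup (the `R2` and `S3` profile names, credentials layout, and endpoint URL are hypothetical):

```python
# ~/.aws/credentials (hypothetical):
#   [S3]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
#   [R2]
#   aws_access_key_id = ...
#   aws_secret_access_key = ...
import boto3

# One session per named profile, so S3 and R2 credentials can coexist
# in a single process.
s3_session = boto3.session.Session(profile_name="S3")
r2_session = boto3.session.Session(profile_name="R2")

s3_resource = s3_session.resource("s3")

# R2 is S3-compatible and only needs the account-specific endpoint URL.
r2_endpoint_url = "https://<ACCOUNT_ID>.r2.cloudflarestorage.com"  # placeholder
r2_resource = r2_session.resource("s3", endpoint_url=r2_endpoint_url)
```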
Seems reasonable, but it would be nice if we could keep backwards compatibility for the env-var-only setup, at least for training jobs, not necessarily this script.
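One way to keep that compatibility is to use a named profile only when one is configured and otherwise fall back to boto3's default credential chain, which includes env variables; a sketch (the `R2_PROFILE` variable name is a hypothetical choice):

```python
import os

import boto3

# Sketch: profile_name=None makes boto3 fall back to its default
# credential chain (env vars, ~/.aws files, instance metadata), so the
# existing env-var-only setup keeps working unchanged.
session = boto3.session.Session(profile_name=os.environ.get("R2_PROFILE"))
```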
```diff
@@ -304,8 +291,7 @@ def _list_entries(
         bucket_name, key = self._get_bucket_name_and_key(path)

         if self.local_fs_adapter.has_supported_archive_extension(path):
-            log.info("Downloading archive %s", path)
-            file_path = self._download_file(bucket_name, key)
+            file_path = str(cached_path(path, extract_archive=True))
```
If `path` is already local, `cached_path()` will return the same path. What happens when it has to extract in that case? Where does the extraction go?

When you think about this, please imagine that every archive involved will be 10 TB in size.
4f59fc5: I've added the option to set where `cached_path` will store artifacts.

Downloaded artifacts are currently not getting deleted, but that was a problem before using `cached_path` too. I will address this in a separate PR.
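For reference, `cached_path()` accepts a `cache_dir` argument that overrides the default cache location per call; a minimal sketch, with `scratch_dir` as an assumed variable and `path` taken from the surrounding script:

```python
from cached_path import cached_path

# Sketch: send downloads and archive extraction to a disk with enough
# room for very large artifacts, instead of the default cache root.
file_path = str(cached_path(path, cache_dir=scratch_dir, extract_archive=True))
```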
He is on vacation and so cannot re-review
The cloud file download and unarchiving functionality of `scripts/storage_cleaner.py` can be implemented with the `cached_path` module instead of manual implementations. I initially avoided this because `cached_path` does not natively support R2, but in the long run I have found that implementing R2 for `cached_path` is the lesser of the two poisons.

PR Train (in repair):