
[Storage Cleaner] Migrate to cached_path #378

Merged: 11 commits merged into main from shanea/storage-cleaner-cached-path on Dec 7, 2023

Conversation

@2015aroras (Contributor) commented on Nov 22, 2023:

The cloud file download and unarchiving functionality of scripts/storage_cleaner.py can be implemented with the cached_path module instead of manual implementations. I initially avoided this because cached_path does not natively support R2, but in the long run I have found that implementing R2 support for cached_path is the lesser of two evils.

PR Train (in repair):

  1. [Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
  2. This PR [Storage Cleaner] Migrate to cached_path #378
  3. [Storage Cleaner] Add option to get full paths when listing entries #379
  4. [Storage Cleaner] Move unarchiving logic to cleaning jobs #390
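For context, a rough sketch (not this PR's actual code) of the kind of replacement described above: a single cached_path call in place of a manual boto3 download plus tarfile extraction. The bucket and archive names are placeholders.

    from cached_path import cached_path

    # Hypothetical remote archive; with extract_archive=True, cached_path
    # downloads the file into its cache and returns the extracted directory.
    extracted = cached_path("s3://placeholder-bucket/run-001/step1000.tar.gz", extract_archive=True)
    print(extracted)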

@2015aroras changed the title from "Migrate storage cleaner to cached_path" to "[Storage Cleaner] Migrate to cached_path" on Nov 22, 2023
@epwalsh (Member) left a comment:

Nice!

Base automatically changed from shanea/storage-cleaner to main November 29, 2023 22:22
@dirkgr (Member) requested changes on Nov 29, 2023:

The cache that cached_path uses only ever grows; there is no built-in expiry. Also, the files we'll do this with are going to be very large, stretching the largest disks we can get. Sticking them into ~/.local/cached_path will almost never work.

    s3_resource = session.resource(
        "s3",
        endpoint_url=endpoint_url,
        config=botocore.client.Config(signature_version=botocore.UNSIGNED),
    )
Member (reviewer) commented on the snippet above:

Why unsigned?

@2015aroras (Contributor, author) replied:

This was copy & paste, so I'll get rid of it.
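For reference, a small sketch of what the unsigned config does compared with the default signed behaviour; the client variable names are illustrative only.

    import boto3
    import botocore
    from botocore.client import Config

    # signature_version=UNSIGNED disables request signing, so it only works
    # against buckets that allow anonymous access.
    anonymous_client = boto3.client("s3", config=Config(signature_version=botocore.UNSIGNED))

    # Omitting that config lets boto3 sign requests with whatever credentials
    # it resolves, which is what a private S3 or R2 bucket needs.
    signed_client = boto3.client("s3")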

    bucket_name = parsed_path.netloc
    r2_path = parsed_path.path.lstrip("/")

    session = boto3.session.Session()
Member (reviewer) commented on the snippet above:

How does authentication work for this? It reads environment variables? Then I assume there is no way to have both R2 and S3 work at the same time, because they read auth information from the same place?

@2015aroras (Contributor, author) replied:
According to the boto docs, boto looks for credentials in the order below. I believe we typically leverage the 3rd option (env variables), though I personally use option 4 because I can easily switch between R2 and S3 by setting the env variable AWS_PROFILE to R2 or S3 (those are just profile names I chose for myself).

  1. Passing credentials as parameters in the boto3.client() method
  2. Passing credentials as parameters when creating a Session object
  3. Environment variables
  4. Shared credential file (~/.aws/credentials)
  5. AWS config file (~/.aws/config)
  6. Assume Role provider
  7. Boto2 config file (/etc/boto.cfg and ~/.boto)
  8. Instance metadata service on an Amazon EC2 instance that has an IAM role configured.

The current code doesn't support R2 and S3 at the same time, but I can make it work with option 4 or 5. Boto sessions can take a profile_name, which is equivalent to the AWS_PROFILE env variable. The main problem with this approach is that the script user needs to set up their credentials appropriately. @epwalsh thoughts? Would you be willing to use credentials from ~/.aws/* instead of env variables for this script? If we ever wanted to add R2 to our training, this approach is the least messy way of adding R2 support to that (so it may be worth us getting used to it now).
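To make the profile-based approach (options 4/5 above) concrete, a minimal sketch follows; the profile names "s3" and "r2" and the R2 endpoint URL are placeholders, not anything defined in the repo.

    import boto3

    # Assumes ~/.aws/credentials contains separate [s3] and [r2] profiles.
    s3_client = boto3.session.Session(profile_name="s3").client("s3")

    r2_client = boto3.session.Session(profile_name="r2").client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder R2 endpoint
    )

    # Both clients can now live in the same process without their credentials clashing.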

@epwalsh (Member) replied:

Seems reasonable, but it would be nice if we could keep backwards compatibility for the env-var-only setup, at least for training jobs, not necessarily this script.

@2015aroras (Contributor, author) replied:

I've updated main to use auth in a backwards-compatible way that can support both S3 and R2 simultaneously. I'm leveraging those changes here now.

36e485f
4b86ebb
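A hedged sketch of what a backwards-compatible setup supporting both stores at once could look like; the env var names and the "r2" profile are assumptions for illustration, not necessarily what those commits do.

    import os

    import boto3

    def get_r2_session() -> boto3.session.Session:
        # If the old env-var-only setup is present, keep honouring it.
        if "R2_ACCESS_KEY_ID" in os.environ and "R2_SECRET_ACCESS_KEY" in os.environ:
            return boto3.session.Session(
                aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
                aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
            )
        # Otherwise fall back to a named profile, leaving the default
        # credential chain free for regular S3 access.
        return boto3.session.Session(profile_name="r2")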

@@ -304,8 +291,7 @@ def _list_entries(
         bucket_name, key = self._get_bucket_name_and_key(path)

         if self.local_fs_adapter.has_supported_archive_extension(path):
             log.info("Downloading archive %s", path)
-            file_path = self._download_file(bucket_name, key)
+            file_path = str(cached_path(path, extract_archive=True))
Member (reviewer) commented on this change:
If path is already local, cached_path() will return the same path. What happens when it has to extract in that case? Where does the extraction go?

When you think about this, please imagine that every archive involved will be 10TB in size.

@2015aroras (Contributor, author) replied:

In 4f59fc5 I've added the option to set where cached_path will store artifacts.

Downloaded artifacts are currently not getting deleted, but that was a problem before using cached_path too. I will address this in a separate PR.
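For illustration, a minimal sketch of redirecting the cached_path cache to a larger scratch volume via the cache_dir argument; the directory and archive paths are placeholders.

    from cached_path import cached_path

    # Placeholder scratch directory with room for very large archives.
    CACHE_DIR = "/mnt/scratch/cached_path"

    # A remote archive is downloaded into CACHE_DIR and extracted there; a path
    # that is already local is not copied, but its extraction should also land
    # under CACHE_DIR.
    extracted_dir = cached_path(
        "s3://placeholder-bucket/run/step1000.tar.gz",
        cache_dir=CACHE_DIR,
        extract_archive=True,
    )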

@2015aroras dismissed dirkgr's stale review on December 6, 2023: "He is on vacation and so cannot re-review."

@2015aroras merged commit 1dbc346 into main on Dec 7, 2023 (10 checks passed) and deleted the shanea/storage-cleaner-cached-path branch on December 7, 2023.