
[Storage Cleaner] Migrate to cached_path #378

Merged: 11 commits merged into main from shanea/storage-cleaner-cached-path on Dec 7, 2023

Conversation

@2015aroras (Contributor) commented on Nov 22, 2023:

The cloud file download and unarchiving functionality of scripts/storage_cleaner.py can be implemented with the cached_path module instead of manual implementations. I initially avoided this because cached_path does not natively support R2, but in the long run I have found that implementing R2 support for cached_path is the lesser of two evils.

PR Train (in repair):

  1. [Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
  2. This PR [Storage Cleaner] Migrate to cached_path #378
  3. [Storage Cleaner] Add option to get full paths when listing entries #379
  4. [Storage Cleaner] Move unarchiving logic to cleaning jobs #390
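For context, a rough sketch (not this PR's actual code) of the kind of replacement described above: a single cached_path call in place of a manual boto3 download plus tarfile extraction. The bucket and archive names are placeholders.

    from cached_path import cached_path

    # Hypothetical remote archive; with extract_archive=True, cached_path
    # downloads the file into its cache and returns the extracted directory.
    extracted = cached_path("s3://placeholder-bucket/run-001/step1000.tar.gz", extract_archive=True)
    print(extracted)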

@2015aroras changed the title from "Migrate storage cleaner to cached_path" to "[Storage Cleaner] Migrate to cached_path" on Nov 22, 2023
@epwalsh (Member) left a comment:

Nice!

Base automatically changed from shanea/storage-cleaner to main November 29, 2023 22:22
@dirkgr (Member) requested changes on Nov 29, 2023:

The cache that cached_path uses only ever grows; there is no built-in expiry. Also, the files we'll do this with are going to be very large, stretching the largest disks we can get. Sticking them into ~/.local/cached_path will almost never work.

    s3_resource = session.resource(
        "s3",
        endpoint_url=endpoint_url,
        config=botocore.client.Config(signature_version=botocore.UNSIGNED),
    )
Member (reviewer) commented on the snippet above:

Why unsigned?

@2015aroras (Contributor, author) replied:

This was copy & paste, so I'll get rid of it.
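For reference, a small sketch of what the unsigned config does compared with the default signed behaviour; the client variable names are illustrative only.

    import boto3
    import botocore
    from botocore.client import Config

    # signature_version=UNSIGNED disables request signing, so it only works
    # against buckets that allow anonymous access.
    anonymous_client = boto3.client("s3", config=Config(signature_version=botocore.UNSIGNED))

    # Omitting that config lets boto3 sign requests with whatever credentials
    # it resolves, which is what a private S3 or R2 bucket needs.
    signed_client = boto3.client("s3")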

    bucket_name = parsed_path.netloc
    r2_path = parsed_path.path.lstrip("/")

    session = boto3.session.Session()
Member (reviewer) commented on the snippet above:

How does authentication work for this? It reads environment variables? Then I assume there is no way to have both R2 and S3 work at the same time, because they read auth information from the same place?

@2015aroras (Contributor, author) replied:
According to the boto docs, boto looks for credentials in the order below. I believe we typically leverage the 3rd option (env variables), though I personally use option 4 because I can easily switch between R2 and S3 by setting the env variable AWS_PROFILE to R2 or S3 (those are just profile names I chose for myself).

  1. Passing credentials as parameters in the boto3.client() method
  2. Passing credentials as parameters when creating a Session object
  3. Environment variables
  4. Shared credential file (~/.aws/credentials)
  5. AWS config file (~/.aws/config)
  6. Assume Role provider
  7. Boto2 config file (/etc/boto.cfg and ~/.boto)
  8. Instance metadata service on an Amazon EC2 instance that has an IAM role configured.

The current code doesn't support R2 and S3 at the same time, but I can make it work with option 4 or 5. Boto sessions can take a profile_name, which is equivalent to the AWS_PROFILE env variable. The main problem with this approach is that the script user needs to set up their credentials appropriately. @epwalsh thoughts? Would you be willing to use credentials from ~/.aws/* instead of env variables for this script? If we ever wanted to add R2 to our training, this approach is the least messy way of adding R2 support to that (so it may be worth us getting used to it now).
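To make the profile-based approach (options 4/5 above) concrete, a minimal sketch follows; the profile names "s3" and "r2" and the R2 endpoint URL are placeholders, not anything defined in the repo.

    import boto3

    # Assumes ~/.aws/credentials contains separate [s3] and [r2] profiles.
    s3_client = boto3.session.Session(profile_name="s3").client("s3")

    r2_client = boto3.session.Session(profile_name="r2").client(
        "s3",
        endpoint_url="https://<account-id>.r2.cloudflarestorage.com",  # placeholder R2 endpoint
    )

    # Both clients can now live in the same process without their credentials clashing.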

@epwalsh (Member) replied:

Seems reasonable, but it would be nice if we could keep backwards compatibility for the env-var-only setup, at least for training jobs, not necessarily this script.

@2015aroras (Contributor, author) replied:

I've updated main to use auth in a backwards-compatible way that can support both S3 and R2 simultaneously. I'm leveraging those changes here now.

36e485f
4b86ebb
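A hedged sketch of what a backwards-compatible setup supporting both stores at once could look like; the env var names and the "r2" profile are assumptions for illustration, not necessarily what those commits do.

    import os

    import boto3

    def get_r2_session() -> boto3.session.Session:
        # If the old env-var-only setup is present, keep honouring it.
        if "R2_ACCESS_KEY_ID" in os.environ and "R2_SECRET_ACCESS_KEY" in os.environ:
            return boto3.session.Session(
                aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
                aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
            )
        # Otherwise fall back to a named profile, leaving the default
        # credential chain free for regular S3 access.
        return boto3.session.Session(profile_name="r2")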

@@ -304,8 +291,7 @@ def _list_entries(
         bucket_name, key = self._get_bucket_name_and_key(path)

         if self.local_fs_adapter.has_supported_archive_extension(path):
             log.info("Downloading archive %s", path)
-            file_path = self._download_file(bucket_name, key)
+            file_path = str(cached_path(path, extract_archive=True))
Member (reviewer) commented on this change:
If path is already local, cached_path() will return the same path. What happens when it has to extract in that case? Where does the extraction go?

When you think about this, please imagine that every archive involved will be 10TB in size.

@2015aroras (Contributor, author) replied:

In 4f59fc5 I've added the option to set where cached_path will store artifacts.

Downloaded artifacts are currently not getting deleted, but that was a problem before using cached_path too. I will address this in a separate PR.
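For illustration, a minimal sketch of redirecting the cached_path cache to a larger scratch volume via the cache_dir argument; the directory and archive paths are placeholders.

    from cached_path import cached_path

    # Placeholder scratch directory with room for very large archives.
    CACHE_DIR = "/mnt/scratch/cached_path"

    # A remote archive is downloaded into CACHE_DIR and extracted there; a path
    # that is already local is not copied, but its extraction should also land
    # under CACHE_DIR.
    extracted_dir = cached_path(
        "s3://placeholder-bucket/run/step1000.tar.gz",
        cache_dir=CACHE_DIR,
        extract_archive=True,
    )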

@2015aroras dismissed dirkgr's stale review on December 6, 2023: "He is on vacation and so cannot re-review."

@2015aroras merged commit 1dbc346 into main on Dec 7, 2023 (10 checks passed) and deleted the shanea/storage-cleaner-cached-path branch on December 7, 2023.