
[Storage Cleaner] Add script for removing bad runs from local & cloud storage #364

Merged: 57 commits merged into main from shanea/storage-cleaner on Nov 29, 2023

Conversation

@2015aroras (Contributor) commented on Nov 7, 2023

This PR adds a script for cleaning our local & cloud storage. The script is intended to have the following functionality:

  1. This PR - It will delete bad runs (runs with no checkpoints other than maybe step 0).
  2. Future - It will unshard and re-upload checkpoints of runs.
  3. Future - It will rename runs to match their wandb id.

The script supports doing dry runs, where the main functionality is not performed (but some auxiliary actions like downloading files may still occur).

This PR has been tested against GCS, R2, S3, and local FS, mostly in dry-run mode, but I have also attempted a test file deletion in each of these locations.
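As a rough illustration of the dry-run behavior described above (this is a hypothetical sketch; the function body, the dry_run parameter, and the logging shown are not taken from the actual script):

```python
import logging

log = logging.getLogger(__name__)


def delete_bad_run(run_path: str, dry_run: bool = False) -> None:
    """Hypothetical sketch of gating a destructive operation behind a dry-run flag."""
    # Auxiliary work (e.g. listing files to decide whether the run is "bad")
    # could still happen at this point, even in dry-run mode.
    if dry_run:
        log.info("Dry run: would delete %s", run_path)
        return
    log.info("Deleting %s", run_path)
    # The actual deletion against local or cloud storage would go here.
```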

PR Train:

  1. This PR: [Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
  2. [Storage Cleaner] Migrate to cached_path #378
  3. [Storage Cleaner] Add option to get full paths when listing entries #379
  4. [Storage Cleaner] Add more basic functionality to storage adapters #380
  5. [Storage Cleaner] Add unsharding to storage cleaner #381
  6. [Storage Cleaner] Handle some legacy checkpoints in unsharding #382

@epwalsh (Member) left a comment:

Overall looks good, I just left a few suggestions in the comments. Also consider using pathlib.Path instead of raw strings with the os.path.* functions.
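For illustration, a hypothetical path construction could move from os.path to pathlib like this (the path and names are made up for the example, not taken from the PR):

```python
import os.path
from pathlib import Path

run_dir_path = "/data/runs/example-run"  # hypothetical path, for illustration only

# os.path style: paths stay plain strings
checkpoint_dir = os.path.join(run_dir_path, "checkpoints")

# pathlib style: Path objects compose with "/" and carry helpers such as
# .exists(), .iterdir(), and .unlink()
checkpoint_dir = Path(run_dir_path) / "checkpoints"
print(checkpoint_dir.exists())
```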

def main():
    args = get_parser().parse_args()

    logging.basicConfig(level=getattr(logging, args.log_level.upper()))
@epwalsh (Member):

Consider using util.prepare_cli_environment() instead
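A minimal sketch of what that swap might look like, assuming prepare_cli_environment() can be called with no arguments and takes care of logging setup itself (not verified against the repo's current signature):

```python
from olmo import util


def main():
    # get_parser() is the argument-parser factory from the hunk above.
    args = get_parser().parse_args()

    # Assumption: util.prepare_cli_environment() configures logging (and related
    # CLI niceties) in one call, replacing the manual logging.basicConfig(...).
    util.prepare_cli_environment()
```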


elif args.run_path is not None:
    storage_cleaner.delete_bad_run(args.run_path)
else:
    raise ValueError("Neither runs directory nor run path provided for run cleaning")
@epwalsh (Member):

Always good to future-proof:

Suggested change (add a fallback for unrecognized operations after the existing raise):

    raise ValueError("Neither runs directory nor run path provided for run cleaning")
else:
    raise NotImplementedError(args.op)


class S3StorageAdapter(StorageAdapter):
    def __init__(self, endpoint_url: Optional[str] = None):
        super().__init__()
        self._s3_client = boto3.client(
@epwalsh (Member):

Is there a reason you're not using util._get_s3_client()?

@2015aroras (Contributor, author):

I was trying to avoid using the Olmo methods since they are probably more likely to change than this script (and I don't want them to accidentally break this). Also, I need to pass in an endpoint_url to use this for R2. I can modify util._get_s3_client to take an endpoint_url if you want.

@epwalsh (Member):

> I was trying to avoid using the Olmo methods since they are probably more likely to change than this script (and I don't want them to accidentally break this).

Hard breaking changes (like a function being removed from util.py) should be caught by CI (mypy or ruff usually), but if you're worried about subtle breaks then you could always add more tests. I think the benefits of consolidating code here outweigh the risks.


import google.cloud.storage as gcs
from botocore.config import Config
from google.api_core.exceptions import NotFound
from tqdm import tqdm
@epwalsh (Member):

tqdm is not one of our direct dependencies, so consider using rich.progress instead.
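For reference, a minimal rich.progress equivalent of a tqdm loop might look like this (object_keys and the print call are placeholders, not names from the PR):

```python
from rich.progress import track

object_keys = ["runs/run1/config.yaml", "runs/run2/config.yaml"]  # placeholder data

# track() wraps an iterable and renders a progress bar, much like tqdm(iterable)
for key in track(object_keys, description="Processing objects"):
    print(key)  # stand-in for the real per-object work
```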


@2015aroras changed the title from "Add script for removing bad runs from local & cloud storage" to "[Storage Cleaner] Add script for removing bad runs from local & cloud storage" on Nov 22, 2023
@2015aroras (Contributor, author) commented on Nov 22, 2023

There are a lot of changes to come (since I want to add unsharding too), so I've tried to break them down and put them in separate PRs. I have added this chain of PRs ("PR train") to the description.

olmo/util.py (outdated), comment on lines 565 to 570:
def _get_s3_client(endpoint_url: Optional[str] = None):
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client(
            "s3",
            endpoint_url=endpoint_url,
@epwalsh (Member):

This does not work, because _s3_client is cached. If you call it twice with two different endpoints, the second time you'll get the result from the first time.
Try getting rid of the global _s3_client and instead using a functools.cache decorator. That will memoize this function per argument, so a call with a different endpoint won't reuse the previously cached client.
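A minimal sketch of that suggestion (the retry/config options the real client setup would need are omitted); functools.cache memoizes per distinct argument, so each endpoint_url gets its own client:

```python
import functools
from typing import Optional

import boto3


@functools.cache  # Python 3.9+; use functools.lru_cache(maxsize=None) on older versions
def _get_s3_client(endpoint_url: Optional[str] = None):
    # Memoized per endpoint_url: calling this with an R2 endpoint and then with
    # the default S3 endpoint returns two separately cached clients.
    return boto3.client("s3", endpoint_url=endpoint_url)
```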


scripts/storage_cleaner.py (review thread resolved)
    raise RuntimeError(f"Failed to get size for file with bucket | key: {bucket_name} | {key}")
size_in_bytes: int = head_response["ContentLength"]

with Progress(transient=True) as progress:
@epwalsh (Member): 👍

scripts/storage_cleaner.py (review thread resolved)
scripts/storage_cleaner.py (review thread resolved)
@2015aroras merged commit e30d29f into main on Nov 29, 2023 (10 checks passed)
@2015aroras deleted the shanea/storage-cleaner branch on Nov 29, 2023 at 22:22