
[Storage Cleaner] Add script for removing bad runs from local & cloud storage #364

Merged: 57 commits merged into main from shanea/storage-cleaner on Nov 29, 2023

Conversation

@2015aroras (Contributor) commented on Nov 7, 2023

This PR adds a script for cleaning our local & cloud storage. The script is intended to have the following functionality:

  1. This PR - It will delete bad runs (runs with no checkpoints other than maybe step 0).
  2. Future - It will unshard and re-upload checkpoints of runs.
  3. Future - It will rename runs to match their wandb id.

The script supports doing dry runs, where the main functionality is not performed (but some auxiliary actions like downloading files may still occur).

This PR has been tested against GCS, R2, S3, and local FS, mostly in dry-run mode, but I have also attempted a test file deletion in each of these locations.
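As a rough illustration of the dry-run behavior described above (this is a hypothetical sketch; the function body, the dry_run parameter, and the logging shown are not taken from the actual script):

```python
import logging

log = logging.getLogger(__name__)


def delete_bad_run(run_path: str, dry_run: bool = False) -> None:
    """Hypothetical sketch of gating a destructive operation behind a dry-run flag."""
    # Auxiliary work (e.g. listing files to decide whether the run is "bad")
    # could still happen at this point, even in dry-run mode.
    if dry_run:
        log.info("Dry run: would delete %s", run_path)
        return
    log.info("Deleting %s", run_path)
    # The actual deletion against local or cloud storage would go here.
```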

PR Train:

  1. This PR: [Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
  2. [Storage Cleaner] Migrate to cached_path #378
  3. [Storage Cleaner] Add option to get full paths when listing entries #379
  4. [Storage Cleaner] Add more basic functionality to storage adapters #380
  5. [Storage Cleaner] Add unsharding to storage cleaner #381
  6. [Storage Cleaner] Handle some legacy checkpoints in unsharding #382

@epwalsh (Member) left a comment:

Overall looks good, I just left a few suggestions in the comments. Also consider using pathlib.Path instead of raw strings with the os.path.* functions.
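For illustration, a hypothetical path construction could move from os.path to pathlib like this (the path and names are made up for the example, not taken from the PR):

```python
import os.path
from pathlib import Path

run_dir_path = "/data/runs/example-run"  # hypothetical path, for illustration only

# os.path style: paths stay plain strings
checkpoint_dir = os.path.join(run_dir_path, "checkpoints")

# pathlib style: Path objects compose with "/" and carry helpers such as
# .exists(), .iterdir(), and .unlink()
checkpoint_dir = Path(run_dir_path) / "checkpoints"
print(checkpoint_dir.exists())
```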

def main():
    args = get_parser().parse_args()

    logging.basicConfig(level=getattr(logging, args.log_level.upper()))
@epwalsh (Member):

Consider using util.prepare_cli_environment() instead
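A minimal sketch of what that swap might look like, assuming prepare_cli_environment() can be called with no arguments and takes care of logging setup itself (not verified against the repo's current signature):

```python
from olmo import util


def main():
    # get_parser() is the argument-parser factory from the hunk above.
    args = get_parser().parse_args()

    # Assumption: util.prepare_cli_environment() configures logging (and related
    # CLI niceties) in one call, replacing the manual logging.basicConfig(...).
    util.prepare_cli_environment()
```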


elif args.run_path is not None:
    storage_cleaner.delete_bad_run(args.run_path)
else:
    raise ValueError("Neither runs directory nor run path provided for run cleaning")
@epwalsh (Member):

Always good to future-proof:

Suggested change (add a fallback for unrecognized operations after the existing raise):

    raise ValueError("Neither runs directory nor run path provided for run cleaning")
else:
    raise NotImplementedError(args.op)


class S3StorageAdapter(StorageAdapter):
    def __init__(self, endpoint_url: Optional[str] = None):
        super().__init__()
        self._s3_client = boto3.client(
@epwalsh (Member):

Is there a reason you're not using util._get_s3_client()?

@2015aroras (Contributor, author):

I was trying to avoid using the Olmo methods since they are probably more likely to change than this script (and I don't want them to accidentally break this). Also, I need to pass in an endpoint_url to use this for R2. I can modify util._get_s3_client to take an endpoint_url if you want.

@epwalsh (Member):

> I was trying to avoid using the Olmo methods since they are probably more likely to change than this script (and I don't want them to accidentally break this).

Hard breaking changes (like a function being removed from util.py) should be caught by CI (mypy or ruff usually), but if you're worried about subtle breaks then you could always add more tests. I think the benefits of consolidating code here outweigh the risks.


import google.cloud.storage as gcs
from botocore.config import Config
from google.api_core.exceptions import NotFound
from tqdm import tqdm
@epwalsh (Member):

tqdm is not one of our direct dependencies, so consider using rich.progress instead.
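For reference, a minimal rich.progress equivalent of a tqdm loop might look like this (object_keys and the print call are placeholders, not names from the PR):

```python
from rich.progress import track

object_keys = ["runs/run1/config.yaml", "runs/run2/config.yaml"]  # placeholder data

# track() wraps an iterable and renders a progress bar, much like tqdm(iterable)
for key in track(object_keys, description="Processing objects"):
    print(key)  # stand-in for the real per-object work
```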


@2015aroras changed the title from "Add script for removing bad runs from local & cloud storage" to "[Storage Cleaner] Add script for removing bad runs from local & cloud storage" on Nov 22, 2023
@2015aroras (Contributor, author) commented on Nov 22, 2023

There are a lot of changes to come (since I want to add unsharding too), so I've tried to break them down and put them in separate PRs. I have added this chain of PRs ("PR train") to the description.

olmo/util.py (outdated), comment on lines 565 to 570:
def _get_s3_client(endpoint_url: Optional[str] = None):
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client(
            "s3",
            endpoint_url=endpoint_url,
@epwalsh (Member):

This does not work, because _s3_client is cached. If you call it twice with two different endpoints, the second time you'll get the result from the first time.
Try getting rid of the global _s3_client and instead using a functools.cache decorator. That will memoize this function per argument, so a call with a different endpoint won't reuse the previously cached client.
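A minimal sketch of that suggestion (the retry/config options the real client setup would need are omitted); functools.cache memoizes per distinct argument, so each endpoint_url gets its own client:

```python
import functools
from typing import Optional

import boto3


@functools.cache  # Python 3.9+; use functools.lru_cache(maxsize=None) on older versions
def _get_s3_client(endpoint_url: Optional[str] = None):
    # Memoized per endpoint_url: calling this with an R2 endpoint and then with
    # the default S3 endpoint returns two separately cached clients.
    return boto3.client("s3", endpoint_url=endpoint_url)
```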


scripts/storage_cleaner.py (review thread resolved)
    raise RuntimeError(f"Failed to get size for file with bucket | key: {bucket_name} | {key}")
size_in_bytes: int = head_response["ContentLength"]

with Progress(transient=True) as progress:
@epwalsh (Member): 👍

scripts/storage_cleaner.py (review thread resolved)
scripts/storage_cleaner.py (review thread resolved)
@2015aroras merged commit e30d29f into main on Nov 29, 2023 (10 checks passed)
@2015aroras deleted the shanea/storage-cleaner branch on Nov 29, 2023 at 22:22