[Storage Cleaner] Add script for removing bad runs from local & cloud storage #364
Conversation
Overall looks good, I just left a few suggestions in the comments. Also consider using `pathlib.Path` instead of raw strings with `os.path.*` functions.
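For instance, a small sketch of what that swap looks like (the paths here are made up):

```python
from pathlib import Path

run_dir = Path("/data/runs") / "run-001"         # instead of os.path.join("/data/runs", "run-001")
config_path = run_dir / "config.yaml"

if config_path.is_file():                        # instead of os.path.isfile(...)
    print(config_path.name, config_path.suffix)  # basename and extension without os.path.splitext
```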
`scripts/storage_cleaner.py` (outdated):

```python
def main():
    args = get_parser().parse_args()

    logging.basicConfig(level=getattr(logging, args.log_level.upper()))
```
Consider using `util.prepare_cli_environment()` instead.
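A minimal sketch of the suggested change, assuming `prepare_cli_environment()` takes care of logging setup internally (the import path is an assumption):

```python
from olmo import util

def main():
    args = get_parser().parse_args()

    # Replaces the manual logging.basicConfig(...) call; prepare_cli_environment
    # is expected to configure logging and related CLI environment details.
    util.prepare_cli_environment()
```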
`scripts/storage_cleaner.py` (outdated):

```python
elif args.run_path is not None:
    storage_cleaner.delete_bad_run(args.run_path)
else:
    raise ValueError("Neither runs directory nor run path provided for run cleaning")
```
Always good to future-proof:

```diff
     raise ValueError("Neither runs directory nor run path provided for run cleaning")
+else:
+    raise NotImplementedError(args.op)
```
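Put together, the future-proofed dispatch would look something like this (the operation enum and argument names are assumed from the surrounding excerpts):

```python
if args.op == CleaningOperations.DELETE_BAD_RUNS:  # hypothetical op enum
    if args.runs_directory is not None:
        storage_cleaner.delete_bad_runs(args.runs_directory)
    elif args.run_path is not None:
        storage_cleaner.delete_bad_run(args.run_path)
    else:
        raise ValueError("Neither runs directory nor run path provided for run cleaning")
else:
    # Fail loudly if a new operation is added but not wired up here.
    raise NotImplementedError(args.op)
```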
`scripts/storage_cleaner.py` (outdated):

```python
class S3StorageAdapter(StorageAdapter):
    def __init__(self, endpoint_url: Optional[str] = None):
        super().__init__()
        self._s3_client = boto3.client(
```
Is there a reason you're not using `util._get_s3_client()`?
I was trying to avoid using the Olmo methods, since they are probably more likely to change than this script (and I don't want them to accidentally break it). Also, I need to pass in an `endpoint_url` to use this for R2. I can modify `util._get_s3_client` to take an `endpoint_url` if you want.
> I was trying to avoid using the Olmo methods since they are probably more likely to change than this script (and I don't want them to accidentally break this).

Hard breaking changes (like a function being removed from `util.py`) should be caught by CI (usually `mypy` or `ruff`), but if you're worried about subtle breaks then you could always add more tests. I think the benefits of consolidating code here outweigh the risks.
`scripts/storage_cleaner.py` (outdated):

```python
import google.cloud.storage as gcs
from botocore.config import Config
from google.api_core.exceptions import NotFound
from tqdm import tqdm
```
`tqdm` is not one of our direct dependencies, so consider using `rich.progress` instead.
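For example, `rich.progress.track` can wrap an iterable much like `tqdm` does (the iterable and delete callback below are placeholders):

```python
from rich.progress import track

# track() renders a progress bar while iterating, similar to tqdm(iterable).
for key in track(keys_to_delete, description="Deleting objects"):
    delete_object(key)
```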
There are a lot of changes to come (since I want to add unsharding too), so I've tried to break them down and put them in separate PRs. I have added this chain of PRs ("PR train") to the description.
`olmo/util.py` (outdated):

```python
def _get_s3_client(endpoint_url: Optional[str] = None):
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client(
            "s3",
            endpoint_url=endpoint_url,
```
This does not work, because `_s3_client` is cached. If you call it twice with two different endpoints, the second time you'll get the result from the first call.

Try getting rid of the global `_s3_client`, and instead use a `functools.cache` decorator. That will memoize this function per argument, so it won't reuse a stale client when you call it with a different endpoint.
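A minimal sketch of that suggestion (keeping the rest of the function's body as-is):

```python
import functools
from typing import Optional

import boto3

@functools.cache
def _get_s3_client(endpoint_url: Optional[str] = None):
    # Memoized per argument value: each distinct endpoint_url gets (and keeps)
    # its own boto3 client, so an R2 endpoint no longer collides with the S3 default.
    return boto3.client("s3", endpoint_url=endpoint_url)
```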
```python
raise RuntimeError(f"Failed to get size for file with bucket | key: {bucket_name} | {key}")
size_in_bytes: int = head_response["ContentLength"]

with Progress(transient=True) as progress:
```
👍
This PR adds a script for cleaning our local & cloud storage. The script is intended to have the following functionality:
The script supports dry runs, in which the main functionality is not performed (though some auxiliary actions, like downloading files, may still occur).

This PR has been tested on GCS, R2, S3, and local FS, mostly in the dry-run setting, but I have also attempted a test file deletion in each of these locations.
PR Train: