Fix occasional KeyError in S3 logic #301
Conversation
Good find on the exception handling issue.
I'm pretty sure Boto already has retries built in. It would involve some setting when s3_client is initialized. And I think we should be way more persistent than 2 retries. I'm thinking exponential back-off for 5 minutes?
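A minimal sketch of that kind of back-off, separate from boto's built-in client retries (the helper name, base delay, and five-minute cap here are illustrative assumptions, not this repo's actual code):

```python
import time


def retry_with_backoff(fn, base_delay: float = 1.0, max_total: float = 300.0):
    """Call fn(), retrying with doubling delays for up to roughly five minutes."""
    delay, waited = base_delay, 0.0
    while True:
        try:
            return fn()
        except Exception:  # in practice, only catch retry-able botocore errors
            if waited + delay > max_total:
                raise
            time.sleep(delay)
            waited += delay
            delay *= 2
```

As the discussion below notes, the client-level retry setting covers the request itself, so a wrapper like this is still useful around reading the response stream.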
Nice, good catch.
I can copy both the retry logic and the conversion to OlmoNetworkError to the other s3 methods if desired.
I think you should.
Boto gives us a stream if it succeeds when we ask for the bucket. Reading from the stream causes the failure; as far as boto is concerned, it has succeeded. I couldn't see a way to retry the reading via a parameter.
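A rough sketch of the shape of that problem: get_object returns successfully with a StreamingBody, and it is the later read() that can raise ResponseStreamingError (the bucket, key, and byte range below are placeholders):

```python
import boto3

s3_client = boto3.client("s3")

# The request itself succeeds and hands back a StreamingBody...
response = s3_client.get_object(
    Bucket="my-bucket", Key="data/part-000.npy", Range="bytes=0-1023"
)

# ...but the bytes only travel over the network here, so a dropped connection
# surfaces from read() even though boto considers the get_object call a success.
chunk = response["Body"].read()
```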
Co-authored-by: Pete <epwalsh10@gmail.com>
- Add retry and OlmoNetworkError transform to all s3 methods
- Add exponential-backoff delay to retries
- Fix linter issues
- Fixed e.e. typo
LGTM!
You can run …
We may also want to increase the number of retries in the client's config, like this:
s3_client = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))
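For completeness, Config there comes from botocore; a small sketch of the full client setup being suggested:

```python
import boto3
from botocore.config import Config

s3_client = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)
```

The "standard" retry mode adds client-side back-off for throttling and transient errors on the request itself.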
olmo/util.py (outdated)
    if int(e.response["Error"]["Code"]) == 404:
        raise FileNotFoundError(f"s3://{bucket_name}/{key}")
    err = e
except ResponseStreamingError as e:
I was just reading a little more about boto3 error handling and I think we can make this a little more robust. Instead of ResponseStreamingError, I think we should catch a combination of HTTPClientError (the base class for ResponseStreamingError) and ConnectionError (from botocore, not the Python built-in exception), which is the base class for a lot of other retry-able errors.
Suggested change:
- except ResponseStreamingError as e:
+ except (HTTPClientError, ConnectionError) as e:
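Both names come from botocore.exceptions, which is worth noting because ConnectionError would otherwise be read as the Python built-in:

```python
# botocore's ConnectionError, not the Python built-in; HTTPClientError is the
# base class of ResponseStreamingError.
from botocore.exceptions import ConnectionError, HTTPClientError
```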
Can we log some warnings too when we have to retry? That way if a run slows down due to connection errors we'll know the root cause.
Addressed both error handling and logging comments.
- Add warning logs on s3 retries
- Add retry s3 client config
This looks great, I'm excited to not have to worry about these annoying S3 errors again. One final comment: we're importing botocore exceptions three different ways... can we just use one way? I'd suggest we import botocore.exceptions as boto_exceptions at the top and then reference botocore exceptions like boto_exceptions.ClientError, boto_exceptions.HTTPClientError, etc.
This way if we end up doing something similar with the Google Cloud client there's no chance of conflict with exception names.
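A small sketch of that import style (the try body and except bodies are illustrative placeholders):

```python
import botocore.exceptions as boto_exceptions

try:
    ...  # some S3 call
except boto_exceptions.ClientError as e:
    if int(e.response["Error"]["Code"]) == 404:
        raise FileNotFoundError("object not found") from e
    raise
except (boto_exceptions.HTTPClientError, boto_exceptions.ConnectionError):
    ...  # retry-able network error; no name clash with other SDKs' exceptions
```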
I'm hoping that we won't run into any major s3 issues anymore, though I do still get SSLErrors from time to time.
This is a great fix, and very timely given how much more running from S3 we're about to do. But I fear it will not address the segfault with multiple workers. I can't imagine how it could be related to this.
Context: There is an issue where we can't run with data.num_workers > 0 in LUMI and sometimes in MosaicML. In MosaicML, our newer setup, a KeyError: 'error' often shows up when I run with more than 0 workers. We suspected that the LUMI and MosaicML num_workers issues are linked to the KeyError.

There appear to be 2 different issues that have compounded:
1. The KeyError: 'error' in our callstack; the exception constructor is expecting error as a kwarg.
2. A ResponseStreamingError that gets lost due to the first issue. A retry seems to get around this just fine for me.

Fixes (see the sketch below):
- Transform the exception raised by _s3_get_bytes_range into a custom Olmo error that has a single-parameter constructor.
- Retry _s3_get_bytes_range in some failure cases.
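A minimal sketch of how the pieces discussed in this PR fit together, using the names from the thread (OlmoNetworkError, _s3_get_bytes_range) but not mirroring the repo's exact code; the retry count and delays are illustrative:

```python
import logging
import time

import boto3
import botocore.exceptions as boto_exceptions
from botocore.config import Config

log = logging.getLogger(__name__)

s3_client = boto3.client(
    "s3", config=Config(retries={"max_attempts": 10, "mode": "standard"})
)


class OlmoNetworkError(Exception):
    """Single-argument error that is safe to re-raise across dataloader workers."""


def _s3_get_bytes_range(
    bucket_name: str, key: str, start: int, length: int, max_attempts: int = 5
) -> bytes:
    err = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = s3_client.get_object(
                Bucket=bucket_name,
                Key=key,
                Range=f"bytes={start}-{start + length - 1}",
            )
            # The request can succeed while reading the stream still fails.
            return response["Body"].read()
        except boto_exceptions.ClientError as e:
            if int(e.response["Error"]["Code"]) == 404:
                raise FileNotFoundError(f"s3://{bucket_name}/{key}") from e
            err = e
        except (boto_exceptions.HTTPClientError, boto_exceptions.ConnectionError) as e:
            err = e
        if attempt < max_attempts:
            delay = 2 ** attempt  # exponential back-off between attempts
            log.warning(
                "s3 read of s3://%s/%s failed (attempt %d): %s; retrying in %ds",
                bucket_name, key, attempt, err, delay,
            )
            time.sleep(delay)
    # Botocore exceptions can expect an `error` kwarg when reconstructed, which is
    # what surfaces as KeyError: 'error'; a plain single-string error avoids that.
    raise OlmoNetworkError(
        f"Failed to read s3://{bucket_name}/{key} after {max_attempts} attempts"
    )
```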