
Fix occasional KeyError in S3 logic #301

Merged
2015aroras merged 9 commits into main from shanea/fix-s3-keyerror-failures on Sep 30, 2023

Conversation

2015aroras (Contributor)

Context: There is an issue where we can't run with data.num_workers > 0 on LUMI and sometimes on MosaicML. On MosaicML, our newer setup, a KeyError: 'error' often shows up when I run with more than 0 workers. We suspected that the LUMI and MosaicML num_workers issues are linked to this KeyError.

There appear to be 2 different issues that have compounded:

  1. When torch's DataLoader intercepts an exception from a worker, it may try to re-raise it by calling the exception's constructor with a single message argument. Torch has logic to deal with the absence of a single-parameter constructor, but it doesn't gracefully handle other failures from calling such a constructor. This torch handling is what turns a real issue with workers > 0 into KeyError: 'error' in our callstack; the exception constructor expects error as a kwarg (a minimal reproduction follows this list).
  2. Reading from an s3 bucket using boto3 can occasionally fail to read bytes. This causes a ResponseStreamingError that gets lost due to the first issue. A retry seems to get around this just fine for me.
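A minimal reproduction of issue 1, under the assumption (consistent with the description above) that the lost exception is botocore's ResponseStreamingError, whose message is built from an error keyword argument; the torch behavior is paraphrased in the comments, and this is a sketch rather than the actual callstack:

```python
from botocore.exceptions import ResponseStreamingError

# torch.utils.data re-raises a worker exception roughly as `exc_type(message)`,
# guarding only against TypeError from constructors that need extra arguments.
message = "traceback text captured in the worker process"
try:
    raise ResponseStreamingError(message)
except TypeError:
    print("torch handles this case by falling back to a plain RuntimeError")
except KeyError as e:
    # botocore exceptions format their message from keyword arguments, so the
    # positional message leaves the {error} placeholder unfilled and the original
    # streaming failure is masked by KeyError: 'error'.
    print(f"masked by KeyError: {e}")
```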

Fixes:

  1. Transform the error in _s3_get_bytes_range into a custom Olmo error that has a single-parameter constructor.
  2. Add a retry to _s3_get_bytes_range in some failure cases. (Both fixes are sketched below.)
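A minimal sketch of how the two fixes fit together (OlmoNetworkError and _s3_get_bytes_range are names from this PR; the signature, attempt count, and back-off are illustrative assumptions, not the merged code):

```python
import time

import boto3
from botocore.exceptions import ResponseStreamingError


class OlmoNetworkError(Exception):
    """Takes a single message argument, so torch's DataLoader can re-raise it
    from a worker process without hitting the KeyError described above."""


s3_client = boto3.client("s3")


def _s3_get_bytes_range(bucket_name: str, key: str, bytes_start: int, num_bytes: int) -> bytes:
    err = None
    for attempt in range(1, 4):  # attempt count is an assumption
        try:
            return s3_client.get_object(
                Bucket=bucket_name,
                Key=key,
                Range=f"bytes={bytes_start}-{bytes_start + num_bytes - 1}",
            )["Body"].read()
        except ResponseStreamingError as e:  # fix 2: retry the failed read
            err = e
            time.sleep(2**attempt)  # back off before the next attempt
    # Fix 1: surface the failure as an error with a single-parameter constructor.
    raise OlmoNetworkError(f"Failed to get bytes range from s3: {err}") from err
```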

2015aroras self-assigned this Sep 28, 2023
2015aroras (Contributor, Author)

I can copy the retry logic and/or the conversion to OlmoNetworkError to the other s3 methods if desired.

olmo/util.py (outdated review comment; resolved)
dirkgr (Member) left a comment:

Good find on the exception handling issue.

I'm pretty sure Boto already has retries built in. It would involve some setting when s3_client is initialized. And I think we should be way more persistent than 2 retries. I'm thinking exponential back-off for 5 minutes?

epwalsh (Member) left a comment:

Nice, good catch.

> I can copy the retry logic and/or the conversion to OlmoNetworkError to the other s3 methods if desired.

I think you should

olmo/util.py (outdated review comment; resolved)
olmo/util.py (review comment; resolved)
olmo/util.py (outdated review comment; resolved)
2015aroras (Contributor, Author)

> Good find on the exception handling issue.
>
> I'm pretty sure Boto already has retries built in. It would involve some setting when s3_client is initialized. And I think we should be way more persistent than 2 retries. I'm thinking exponential back-off for 5 minutes?

Boto gives us a stream if the request for the object succeeds. Reading from the stream is what fails; as far as boto is concerned, it has already succeeded. I couldn't see a way to retry the reading via a parameter.
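A hedged illustration of that point (read_first_kilobyte is a made-up helper, not code from the PR):

```python
import boto3

s3_client = boto3.client("s3")


def read_first_kilobyte(bucket_name: str, key: str) -> bytes:
    # get_object() has already "succeeded" by the time it returns, so the client's
    # built-in retry settings cover the request itself, not the streaming read below.
    response = s3_client.get_object(Bucket=bucket_name, Key=key, Range="bytes=0-1023")
    # ResponseStreamingError can be raised here, mid-stream, which is why the read
    # itself has to be retried manually rather than through a get_object parameter.
    return response["Body"].read()
```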

Co-authored-by: Pete <epwalsh10@gmail.com>
2015aroras linked an issue Sep 29, 2023 that may be closed by this pull request
- Add retry and OlmoNetworkError transform to all s3 methods
- Add exponential-backoff delay to retries
- Fix linter issues
- Fixed e.e. typo
epwalsh (Member) left a comment:

LGTM!

epwalsh (Member) commented Sep 29, 2023:

You can run black olmo/ to fix formatting.

epwalsh (Member) left a comment:

We may also want to increase the number of retries in the client's config, like this:

s3_client = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))

See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#defining-a-retry-configuration-in-a-config-object-for-your-boto3-client
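For reference, the suggestion with the imports it relies on (a sketch; the merged code may differ):

```python
import boto3
from botocore.config import Config

s3_client = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))
```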

olmo/util.py (outdated diff excerpt under review):

```python
    if int(e.response["Error"]["Code"]) == 404:
        raise FileNotFoundError(f"s3://{bucket_name}/{key}")
    err = e
except ResponseStreamingError as e:
```
Member left a comment:

I was just reading a little more about boto3 error handling and I think we can make this a little more robust. Instead of ResponseStreamingError I think we should catch a combination of HTTPClientError (the base class for ResponseStreamingError) and ConnectionError (from botocore, not the Python built-in exception) - the base class for a lot of other retry-able errors.

Suggested change:

```diff
-except ResponseStreamingError as e:
+except (HTTPClientError, ConnectionError) as e:
```
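Both exception classes live in botocore.exceptions, so the imports the suggestion relies on would look roughly like this (an assumption, not the merged diff; note that botocore's ConnectionError shadows the Python built-in of the same name when imported directly):

```python
# Assumed imports for the suggested change above.
from botocore.exceptions import ConnectionError, HTTPClientError  # botocore's ConnectionError, not the built-in
```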

Member left a comment:

Can we log some warnings too when we have to retry? That way if a run slows down due to connection errors we'll know the root cause.

Contributor (Author) left a comment:

Addressed both error handling and logging comments.

- Add warning logs on s3 retries
- Add retry s3 client config
epwalsh (Member) left a comment:

This looks great, I'm excited to not have to worry about these annoying S3 errors again. One final comment: we're importing botocore exceptions three different ways... can we just use one way? I'd suggest we import botocore.exceptions as boto_exceptions at the top and then reference botocore exceptions like boto_exceptions.ClientError, boto_exceptions.HTTPClientError, etc.

This way if we end up doing something similar with the Google Cloud client there's no chance of conflict with exception names.
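A small sketch of that convention (the try/except bodies are elided; this is illustrative, not the merged code):

```python
import botocore.exceptions as boto_exceptions

try:
    ...  # S3 request / stream read goes here
except (boto_exceptions.HTTPClientError, boto_exceptions.ConnectionError) as e:
    ...  # retry with back-off, then raise OlmoNetworkError
```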

2015aroras (Contributor, Author)

I'm hoping that we won't run into any major s3 issues anymore, though I do still get SSLErrors from time to time.

dirkgr (Member) commented Sep 29, 2023:

This is a great fix, and very timely given how much more running from S3 we're about to do. But I fear it will not address the segfault with multiple workers. I can't imagine how it could be related to this.

2015aroras merged commit 0a1455b into main Sep 30, 2023
10 checks passed
2015aroras deleted the shanea/fix-s3-keyerror-failures branch September 30, 2023 00:06
Development

Successfully merging this pull request may close these issues.

Why can't we run with workers >0? Same as KeyError?
3 participants