
Fix occasional KeyError in S3 logic #301

Merged
2015aroras merged 9 commits into main from shanea/fix-s3-keyerror-failures on Sep 30, 2023

Conversation

2015aroras (Contributor)

Context: There is an issue where we can't run with data.num_workers > 0 on LUMI and sometimes on MosaicML. On MosaicML, our newer setup, a KeyError: 'error' often shows up when I run with more than 0 workers. We suspected that the LUMI and MosaicML num_workers issues are linked to this KeyError.

There appear to be 2 different issues that have compounded:

  1. When torch's DataLoader intercepts an exception from a worker, it may try to re-raise it by calling the exception's constructor with a single message argument. Torch has logic to deal with the absence of a single-parameter constructor, but it doesn't gracefully handle other failures from calling such a constructor. This torch handling is what turns a real issue with workers > 0 into KeyError: 'error' in our callstack; the exception constructor expects error as a kwarg (a minimal reproduction follows this list).
  2. Reading from an s3 bucket using boto3 can occasionally fail to read bytes. This causes a ResponseStreamingError that gets lost due to the first issue. A retry seems to get around this just fine for me.
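A minimal reproduction of issue 1, under the assumption (consistent with the description above) that the lost exception is botocore's ResponseStreamingError, whose message is built from an error keyword argument; the torch behavior is paraphrased in the comments, and this is a sketch rather than the actual callstack:

```python
from botocore.exceptions import ResponseStreamingError

# torch.utils.data re-raises a worker exception roughly as `exc_type(message)`,
# guarding only against TypeError from constructors that need extra arguments.
message = "traceback text captured in the worker process"
try:
    raise ResponseStreamingError(message)
except TypeError:
    print("torch handles this case by falling back to a plain RuntimeError")
except KeyError as e:
    # botocore exceptions format their message from keyword arguments, so the
    # positional message leaves the {error} placeholder unfilled and the original
    # streaming failure is masked by KeyError: 'error'.
    print(f"masked by KeyError: {e}")
```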

Fixes:

  1. Transform the error in _s3_get_bytes_range into a custom Olmo error that has a single-parameter constructor.
  2. Add a retry to _s3_get_bytes_range in some failure cases. (Both fixes are sketched below.)
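A minimal sketch of how the two fixes fit together (OlmoNetworkError and _s3_get_bytes_range are names from this PR; the signature, attempt count, and back-off are illustrative assumptions, not the merged code):

```python
import time

import boto3
from botocore.exceptions import ResponseStreamingError


class OlmoNetworkError(Exception):
    """Takes a single message argument, so torch's DataLoader can re-raise it
    from a worker process without hitting the KeyError described above."""


s3_client = boto3.client("s3")


def _s3_get_bytes_range(bucket_name: str, key: str, bytes_start: int, num_bytes: int) -> bytes:
    err = None
    for attempt in range(1, 4):  # attempt count is an assumption
        try:
            return s3_client.get_object(
                Bucket=bucket_name,
                Key=key,
                Range=f"bytes={bytes_start}-{bytes_start + num_bytes - 1}",
            )["Body"].read()
        except ResponseStreamingError as e:  # fix 2: retry the failed read
            err = e
            time.sleep(2**attempt)  # back off before the next attempt
    # Fix 1: surface the failure as an error with a single-parameter constructor.
    raise OlmoNetworkError(f"Failed to get bytes range from s3: {err}") from err
```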

2015aroras self-assigned this Sep 28, 2023
2015aroras (Contributor, Author)

I can copy the retry logic and/or the conversion to OlmoNetworkError to the other s3 methods if desired.

olmo/util.py (outdated review comment; resolved)
dirkgr (Member) left a comment:

Good find on the exception handling issue.

I'm pretty sure Boto already has retries built in. It would involve some setting when s3_client is initialized. And I think we should be way more persistent than 2 retries. I'm thinking exponential back-off for 5 minutes?

epwalsh (Member) left a comment:

Nice, good catch.

> I can copy the retry logic and/or the conversion to OlmoNetworkError to the other s3 methods if desired.

I think you should

olmo/util.py (outdated review comment; resolved)
olmo/util.py (review comment; resolved)
olmo/util.py (outdated review comment; resolved)
2015aroras (Contributor, Author)

> Good find on the exception handling issue.
>
> I'm pretty sure Boto already has retries built in. It would involve some setting when s3_client is initialized. And I think we should be way more persistent than 2 retries. I'm thinking exponential back-off for 5 minutes?

Boto gives us a stream if the request for the object succeeds. Reading from the stream is what fails; as far as boto is concerned, it has already succeeded. I couldn't see a way to retry the reading via a parameter.
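A hedged illustration of that point (read_first_kilobyte is a made-up helper, not code from the PR):

```python
import boto3

s3_client = boto3.client("s3")


def read_first_kilobyte(bucket_name: str, key: str) -> bytes:
    # get_object() has already "succeeded" by the time it returns, so the client's
    # built-in retry settings cover the request itself, not the streaming read below.
    response = s3_client.get_object(Bucket=bucket_name, Key=key, Range="bytes=0-1023")
    # ResponseStreamingError can be raised here, mid-stream, which is why the read
    # itself has to be retried manually rather than through a get_object parameter.
    return response["Body"].read()
```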

Co-authored-by: Pete <epwalsh10@gmail.com>
2015aroras linked an issue Sep 29, 2023 that may be closed by this pull request
- Add retry and OlmoNetworkError transform to all s3 methods
- Add exponential-backoff delay to retries
- Fix linter issues
- Fixed e.e. typo
epwalsh (Member) left a comment:

LGTM!

epwalsh (Member) commented Sep 29, 2023:

You can run black olmo/ to fix formatting.

epwalsh (Member) left a comment:

We may also want to increase the number of retries in the client's config, like this:

s3_client = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))

See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#defining-a-retry-configuration-in-a-config-object-for-your-boto3-client
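For reference, the suggestion with the imports it relies on (a sketch; the merged code may differ):

```python
import boto3
from botocore.config import Config

s3_client = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "standard"}))
```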

olmo/util.py (outdated diff excerpt under review):

```python
    if int(e.response["Error"]["Code"]) == 404:
        raise FileNotFoundError(f"s3://{bucket_name}/{key}")
    err = e
except ResponseStreamingError as e:
```
Member left a comment:

I was just reading a little more about boto3 error handling and I think we can make this a little more robust. Instead of ResponseStreamingError I think we should catch a combination of HTTPClientError (the base class for ResponseStreamingError) and ConnectionError (from botocore, not the Python built-in exception) - the base class for a lot of other retry-able errors.

Suggested change:

```diff
-except ResponseStreamingError as e:
+except (HTTPClientError, ConnectionError) as e:
```
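Both exception classes live in botocore.exceptions, so the imports the suggestion relies on would look roughly like this (an assumption, not the merged diff; note that botocore's ConnectionError shadows the Python built-in of the same name when imported directly):

```python
# Assumed imports for the suggested change above.
from botocore.exceptions import ConnectionError, HTTPClientError  # botocore's ConnectionError, not the built-in
```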

Member left a comment:

Can we log some warnings too when we have to retry? That way if a run slows down due to connection errors we'll know the root cause.

Contributor (Author) left a comment:

Addressed both error handling and logging comments.

- Add warning logs on s3 retries
- Add retry s3 client config
epwalsh (Member) left a comment:

This looks great, I'm excited to not have to worry about these annoying S3 errors again. One final comment: we're importing botocore exceptions three different ways... can we just use one way? I'd suggest we import botocore.exceptions as boto_exceptions at the top and then reference botocore exceptions like boto_exceptions.ClientError, boto_exceptions.HTTPClientError, etc.

This way if we end up doing something similar with the Google Cloud client there's no chance of conflict with exception names.
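A small sketch of that convention (the try/except bodies are elided; this is illustrative, not the merged code):

```python
import botocore.exceptions as boto_exceptions

try:
    ...  # S3 request / stream read goes here
except (boto_exceptions.HTTPClientError, boto_exceptions.ConnectionError) as e:
    ...  # retry with back-off, then raise OlmoNetworkError
```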

2015aroras (Contributor, Author)

I'm hoping that we won't run into any major s3 issues anymore, though I do still get SSLErrors from time to time.

dirkgr (Member) commented Sep 29, 2023:

This is a great fix, and very timely given how much more running from S3 we're about to do. But I fear it will not address the segfault with multiple workers. I can't imagine how it could be related to this.

2015aroras merged commit 0a1455b into main Sep 30, 2023
10 checks passed
2015aroras deleted the shanea/fix-s3-keyerror-failures branch September 30, 2023 00:06
Development

Successfully merging this pull request may close these issues.

Why can't we run with workers >0? Same as KeyError?
3 participants