Conversation

@bistline (Contributor) commented Dec 4, 2023

BACKGROUND & CHANGES

This update doubles the maximum number of retries for MongoDB inserts from 5 to 10, and increases the maximum amount of jitter on exponential backoffs by a factor of 2.5. This is follow-on work for #331, as the previous fixes were insufficient for ingesting very large matrix files with extremely dense data (small slices of the parent matrix were able to ingest, but the entire merged file still failed reproducibly). The combined updates proved successful for ingesting the matrix at least once (a second try is underway).

The logic is that when a MongoDB connection is reset, the issue is less about the time it takes to reconnect than about the total number of attempts. We do not differentiate between AutoReconnect and BulkWriteError when retrying, and both errors tend to travel together. As such, tuning serverSelectionTimeoutMS and decreasing batch size only addressed part of the problem. Additionally, waiting longer between retries allows more time for the connection to stabilize.

Also, this changes the retry-limit constant reference to MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS, for both consistency and testing.
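
For context, the retry pattern described above looks roughly like the sketch below. This is illustrative only, not the exact ingest code; the jitter formula and helper names are assumptions.

import random
import time

from pymongo.errors import AutoReconnect, BulkWriteError

MAX_AUTO_RECONNECT_ATTEMPTS = 10  # raised from 5 in this PR


def insert_with_retry(collection, documents):
    # Retry bulk inserts, backing off exponentially with added jitter.
    for attempt_num in range(MAX_AUTO_RECONNECT_ATTEMPTS):
        try:
            return collection.insert_many(documents, ordered=False)
        except (AutoReconnect, BulkWriteError):
            # Both exceptions are handled in the same block, since they tend to travel together.
            # (A real implementation would also need to handle partially inserted batches.)
            if attempt_num == MAX_AUTO_RECONNECT_ATTEMPTS - 1:
                raise
            exp_backoff = pow(2, attempt_num)
            jitter = random.uniform(0, exp_backoff * 2.5)  # wider jitter window (assumed formula)
            time.sleep(exp_backoff + jitter)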

MANUAL TESTING

Since manual testing involves parsing a ~110 GB dense matrix, this is not advised (see timing information below).

Your Single Cell Portal parse job has completed with the following results:

Total parse time: 21 Hours, 48 Minutes and 37 Seconds
Gene-level entries created: 18677

codecov bot commented Dec 4, 2023

Codecov Report

Merging #332 (41224b9) into development (c1fbf13) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #332      +/-   ##
===============================================
- Coverage        73.84%   73.83%   -0.01%     
===============================================
  Files               30       30              
  Lines             4171     4170       -1     
===============================================
- Hits              3080     3079       -1     
  Misses            1091     1091              
Files                        Coverage            Δ
ingest/mongo_connection.py   84.84% <100.00%>    (-0.23%) ⬇️

@bistline marked this pull request as ready for review December 4, 2023 15:56
@bistline requested review from eweitz and jlchang December 4, 2023 15:56
@jlchang (Contributor) left a comment

The new changes seem logical, given the large dataset issues. I'm glad that these changes seem to suffice.

Re: large datasets and Mongo disk space
Is real-time (just hourly-ish) monitoring of available disk or notification of large dataset ingest currently done Rails-side?
Do we think we need any monitoring/alerting from the ingest side to suggest checking available disk based on notably large dataset ingest?

@bistline (Contributor, Author) commented Dec 4, 2023

> The new changes seem logical, given the large dataset issues. I'm glad that these changes seem to suffice.
>
> Re: large datasets and Mongo disk space Is real-time (just hourly-ish) monitoring of available disk or notification of large dataset ingest currently done Rails-side? Do we think we need any monitoring/alerting from the ingest side to suggest checking available disk based on notably large dataset ingest?

Great question - we do have real-time monitoring available in the Observability tab for any of our GCE instances.

Production: https://console.cloud.google.com/compute/instances/observability?project=broad-singlecellportal&tab=observability&pageState=(%22nav%22:(%22section%22:%22disk%22))
Staging: https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-a/instances/singlecell-mongo-02?project=broad-singlecellportal-staging&tab=monitoring&pageState=(%22observabilityTab%22:(%22mainContent%22:%22metrics%22,%22section%22:%22diskCapacity%22),%22duration%22:(%22groupValue%22:%22PT1H%22,%22customValue%22:null))

Note: the Ops agent is still installing in staging so the view/data may not be available yet but it should be soon.

As far as alerts or monitoring go, I think we can set up alerts in GCE re: disk size. I'll look into that.

@eweitz (Member) left a comment

The robustness improvements look good! I suggest a future possible refinement.

> when a MongoDB connection is reset, there is (apparently) no "timeout" that is invoked - the connection is dropped and the AutoReconnect exception triggers immediately

Thanks, this is useful to know. Looking closer into the meaning of our new custom setting for serverSelectionTimeoutMS, I see the MongoDB docs note that it "Specifies how long (in milliseconds) to block for server selection before throwing an exception. Default: 30,000 milliseconds."

So it seems this updated backoff will increase time between the initiation of reconnection attempts, whereas that timeout increased the max duration of a reconnection attempt. If MongoDB resources are saturated, I can see how increasing wait times between reconnection attempts (as this PR does) would help robustness in addition to increased wait times within each attempt.
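
To make the distinction concrete, serverSelectionTimeoutMS is a client-level option in PyMongo, separate from any sleep between retry attempts. A minimal sketch, with illustrative values and a placeholder URI rather than the portal's actual configuration:

from pymongo import MongoClient

# serverSelectionTimeoutMS bounds how long a single operation blocks while
# looking for a usable server before raising ServerSelectionTimeoutError.
client = MongoClient(
    "mongodb://localhost:27017",     # placeholder URI
    serverSelectionTimeoutMS=60000,  # per-attempt wait; the driver default is 30000
)

# The backoff discussed in this PR is different: it is the sleep *between*
# application-level retry attempts after AutoReconnect / BulkWriteError.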

 def retry(attempt_num):
-    if attempt_num < MAX_ATTEMPTS - 1:
+    if attempt_num < MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS - 1:
         exp_backoff = pow(2, attempt_num)
@eweitz (Member) commented:

As I understand it, an issue this PR addresses is that the backoff period was too brief, not that retries were too synchronized.

So, given no signs of thundering herds per se, I suspect we could more simply, and maybe more effectively, mitigate failed large ingests by increasing the underlying backoff while leaving jitter unchanged. The change's wider jitter does increase backoff, which indeed ought to help, just more subtly and indirectly.

Below is a trivial way we could implement that. The cost to test it would be non-trivial, though, so the refinement suggested below doesn't strike me as worthwhile unless we see future similar failures of large ingests.

Suggested change:
-    exp_backoff = pow(2, attempt_num)
+    exp_backoff = pow(2, attempt_num + 1)
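
A quick illustration of how the two formulas compare across the 10 attempts (base backoff only, before jitter; units depend on the implementation):

# pow(2, n):      1, 2, 4, 8, 16, 32, 64, 128, 256, 512
# pow(2, n + 1):  2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
for n in range(10):
    print(n, pow(2, n), pow(2, n + 1))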

@bistline (Contributor, Author) commented Dec 4, 2023

> Specifies how long (in milliseconds) to block for server selection before throwing an exception. Default: 30,000 milliseconds.

Thanks for the insightful comment. That is indeed correct, and I'm realizing now that my comment is slightly incorrect/misleading.

What I've found through testing is that serverSelectionTimeoutMS was only a small part of the overall issue. When the connection is dropped, the ingest process does find the MongoDB server again, but that connection keeps getting dropped/reset. Why that happens is still somewhat of a mystery to me, but it does correlate with large inserts. Reducing the batch size helped cut down on these drops, but not completely. The same goes for the timeout: it made individual connection retries more resilient, but instances where the process couldn't find the server in time are much rarer than it finding the server only to have the connection dropped again.

Also, we don't differentiate between the AutoReconnect exception and the BulkWriteError in the same retry logic, and these two very often travel together. Doubling the number of retries makes the entire process much more resilient. The jitter increase was less about "thundering herd" and more about "just wait longer". I think your suggestion of just adding 1 makes more sense than increasing the jitter coefficient, so I'll incorporate that change.
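
For context on the batch-size mitigation mentioned above, here is a generic sketch of chunking a large insert; the function name and chunk size are arbitrary, not the ingest pipeline's actual values:

def insert_in_batches(collection, documents, batch_size=1000):
    # Insert documents in smaller batches so each bulk write stays modest in size.
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Each batch would go through the retry/backoff wrapper shown earlier.
        collection.insert_many(batch, ordered=False)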

@bistline merged commit 25c28fa into development Dec 5, 2023
@bistline deleted the jb-retry-attempt-increase branch August 26, 2024 16:12