Increasing maximum number of MongoDB retries, jitter for backoff (SCP-5429) #332
Conversation
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           development     #332      +/-   ##
===============================================
- Coverage        73.84%   73.83%    -0.01%
===============================================
  Files               30       30
  Lines             4171     4170        -1
===============================================
- Hits              3080     3079        -1
  Misses            1091     1091
jlchang left a comment
The new changes seem logical, given the large dataset issues. I'm glad that these changes seem to suffice.
Re: large datasets and Mongo disk space
Is real-time (or even just hourly-ish) monitoring of available disk, or notification of large dataset ingests, currently done Rails-side?
Do we think we need any monitoring/alerting from the ingest side to suggest checking available disk based on notably large dataset ingest?
Great question - we do have real-time monitoring available in the Observability tab for any of our GCE instances.
Production: https://console.cloud.google.com/compute/instances/observability?project=broad-singlecellportal&tab=observability&pageState=(%22nav%22:(%22section%22:%22disk%22))
Note: the Ops agent is still installing in staging, so the view/data may not be available yet, but it should be soon.
As far as alerts or monitoring go, I think we can set up alerts in GCE re: disk size. I'll look into that.
The robustness improvements look good! I suggest a future possible refinement.
when a MongoDB connection is reset, there is (apparently) no "timeout" that is invoked - the connection is dropped and the AutoReconnect exception triggers immediately
Thanks, this is useful to know. Looking closer into the meaning of our new custom setting for serverSelectionTimeoutMS, I see the MongoDB docs note that it "Specifies how long (in milliseconds) to block for server selection before throwing an exception. Default: 30,000 milliseconds."
So it seems this updated backoff will increase time between the initiation of reconnection attempts, whereas that timeout increased the max duration of a reconnection attempt. If MongoDB resources are saturated, I can see how increasing wait times between reconnection attempts (as this PR does) would help robustness in addition to increased wait times within each attempt.
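To make that distinction concrete, here is a minimal sketch separating the two knobs: the per-attempt serverSelectionTimeoutMS budget versus the sleep between attempts. The function name and constants are illustrative, not the PR's actual code.

```python
import random

# MongoDB driver default for serverSelectionTimeoutMS: the maximum time the
# driver blocks *within* a single attempt before raising an exception.
SERVER_SELECTION_TIMEOUT_MS = 30_000

def backoff_seconds(attempt_num: int, jitter_max: float = 1.0) -> float:
    """Delay *between* reconnection attempts: exponential base plus random
    jitter. Hypothetical helper for illustration, not the PR's exact code."""
    return 2 ** attempt_num + random.uniform(0, jitter_max)

# The two knobs are independent: raising serverSelectionTimeoutMS lengthens
# each attempt, while backoff_seconds lengthens the gap between attempts.
delay = backoff_seconds(3)  # base 2**3 = 8 s, plus up to 1 s of jitter
assert 8.0 <= delay <= 9.0
```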
ingest/mongo_connection.py (Outdated)

    def retry(attempt_num):
-       if attempt_num < MAX_ATTEMPTS - 1:
+       if attempt_num < MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS - 1:
            exp_backoff = pow(2, attempt_num)
AIUI, an issue this PR addresses is that the backoff period was too brief, not that retries were too synchronized.
So given no signs of thundering herds per se, I suspect we could more simply -- and maybe more effectively -- mitigate failed large ingests by increasing the underlying backoff, while leaving jitter unchanged. The change's wider jitter does increase backoff, which indeed ought to help, just more subtly and indirectly.
Below is a trivial way we could implement that. The cost to test it would be non-trivial, though, so the refinement suggested below doesn't strike me as worthwhile unless we see future similar failures of large ingests.
Suggested change:
-    exp_backoff = pow(2, attempt_num)
+    exp_backoff = pow(2, attempt_num + 1)
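To illustrate the difference between the two approaches, here is a sketch comparing widening the jitter range against bumping the exponent. The constants are assumed for illustration; neither function is the PR's literal code.

```python
import random

def delay_wider_jitter(attempt_num: int, jitter_max: float = 2.5) -> float:
    # Approach in this PR (as described): same exponential base,
    # with jitter drawn from a range widened by a factor of 2.5.
    return 2 ** attempt_num + random.uniform(0, jitter_max)

def delay_bumped_base(attempt_num: int, jitter_max: float = 1.0) -> float:
    # Suggested refinement: shift the exponent up by one, leave jitter as-is.
    return 2 ** (attempt_num + 1) + random.uniform(0, jitter_max)

# At attempt 3, the mean delays are 8 + 1.25 = 9.25 s vs 16 + 0.5 = 16.5 s:
# bumping the base grows the wait faster than widening the jitter does.
```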
Thanks for the insightful comment. That is indeed correct, and I'm realizing now that my comment is slightly incorrect/misleading. What I've found through testing is that the …
BACKGROUND & CHANGES
This update doubles the maximum number of retries for MongoDB inserts from 5 to 10, and increases the maximum amount of jitter on exponential backoffs by a factor of 2.5. This is follow-on work for #331, as the previous fixes were insufficient for ingesting very large matrix files with extremely dense data (small slices of the parent matrix were able to ingest, but the entire merged file still failed reproducibly). The combined updates proved successful for ingesting the matrix at least once (a second try is underway).
The logic is that when a MongoDB connection is reset, the issue is less about the time it takes to reconnect than about the total number of attempts. We do not differentiate between AutoReconnect and BulkWriteError when retrying, and both errors tend to travel together. As such, the serverSelectionTimeoutMS and batch size decreases only addressed part of the problem. Additionally, waiting longer between retries allows more time for the connection to stabilize.

Also, this changes the reference for the retry constant to MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS for both consistency and testing.

MANUAL TESTING
Since manual testing involves parsing a ~110 GB dense matrix, this is not advised (see timing information below).
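To put the combined changes in perspective, the worst-case cumulative backoff implied by the description above can be sketched as follows. This assumes the pow(2, attempt_num)-plus-jitter formula from the diff, a pre-change jitter ceiling of 1 s, and a sleep after every attempt but the last; all three are assumptions, not the PR's exact behavior.

```python
MAX_AUTO_RECONNECT_ATTEMPTS = 10  # doubled from 5 in this PR
JITTER_MAX = 2.5                  # jitter ceiling widened by a factor of 2.5

def total_backoff(attempts: int, jitter_max: float) -> float:
    """Worst-case cumulative sleep across all retries (illustrative sketch)."""
    return sum(2 ** n + jitter_max for n in range(attempts - 1))

# Before: 5 attempts, jitter up to 1 s   -> (1+2+4+8)  + 4*1.0 =  19.0 s
# After: 10 attempts, jitter up to 2.5 s -> (2**9 - 1) + 9*2.5 = 533.5 s
print(total_backoff(5, 1.0), total_backoff(10, 2.5))
```

The roughly 28x larger retry budget is consistent with the PR's rationale: give a saturated MongoDB instance substantially more time to stabilize before a large ingest is declared failed.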