Conversation

@bistline (Contributor) commented Dec 4, 2023

BACKGROUND & CHANGES

This update doubles the maximum number of retries for MongoDB inserts from 5 to 10, and increases the maximum amount of jitter on exponential backoffs by a factor of 2.5. This is follow-on work for #331, as the previous fixes were insufficient for ingesting very large matrix files with extremely dense data (small slices of the parent matrix were able to ingest, but the entire merged file still failed reproducibly). The combined updates proved successful for ingesting the matrix at least once (a second try is underway).

The logic is that when a MongoDB connection is reset, the issue is less about the time it takes to reconnect than about the total number of attempts. We do not differentiate between AutoReconnect and BulkWriteError when retrying, and both errors tend to travel together. As such, tuning serverSelectionTimeoutMS and decreasing batch size only addressed part of the problem. Additionally, waiting longer between retries allows more time for the connection to stabilize.

Also, this changes the retry-limit constant reference to MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS, for both consistency and testing.
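
For context, the retry pattern described above looks roughly like the sketch below. This is illustrative only, not the exact ingest code; the jitter formula and helper names are assumptions.

import random
import time

from pymongo.errors import AutoReconnect, BulkWriteError

MAX_AUTO_RECONNECT_ATTEMPTS = 10  # raised from 5 in this PR


def insert_with_retry(collection, documents):
    # Retry bulk inserts, backing off exponentially with added jitter.
    for attempt_num in range(MAX_AUTO_RECONNECT_ATTEMPTS):
        try:
            return collection.insert_many(documents, ordered=False)
        except (AutoReconnect, BulkWriteError):
            # Both exceptions are handled in the same block, since they tend to travel together.
            # (A real implementation would also need to handle partially inserted batches.)
            if attempt_num == MAX_AUTO_RECONNECT_ATTEMPTS - 1:
                raise
            exp_backoff = pow(2, attempt_num)
            jitter = random.uniform(0, exp_backoff * 2.5)  # wider jitter window (assumed formula)
            time.sleep(exp_backoff + jitter)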

MANUAL TESTING

Since manual testing involves parsing a ~110 GB dense matrix, this is not advised (see timing information below).

Your Single Cell Portal parse job has completed with the following results:

Total parse time: 21 Hours, 48 Minutes and 37 Seconds
Gene-level entries created: 18677

codecov bot commented Dec 4, 2023

Codecov Report

Merging #332 (41224b9) into development (c1fbf13) will decrease coverage by 0.01%.
The diff coverage is 100.00%.

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #332      +/-   ##
===============================================
- Coverage        73.84%   73.83%   -0.01%     
===============================================
  Files               30       30              
  Lines             4171     4170       -1     
===============================================
- Hits              3080     3079       -1     
  Misses            1091     1091              
Files                        Coverage            Δ
ingest/mongo_connection.py   84.84% <100.00%>    (-0.23%) ⬇️

@bistline marked this pull request as ready for review December 4, 2023 15:56
@bistline requested review from eweitz and jlchang December 4, 2023 15:56
@jlchang (Contributor) left a comment

The new changes seem logical, given the large dataset issues. I'm glad that these changes seem to suffice.

Re: large datasets and Mongo disk space
Is real-time (just hourly-ish) monitoring of available disk or notification of large dataset ingest currently done Rails-side?
Do we think we need any monitoring/alerting from the ingest side to suggest checking available disk based on notably large dataset ingest?

@bistline (Contributor, Author) commented Dec 4, 2023

> The new changes seem logical, given the large dataset issues. I'm glad that these changes seem to suffice.
>
> Re: large datasets and Mongo disk space Is real-time (just hourly-ish) monitoring of available disk or notification of large dataset ingest currently done Rails-side? Do we think we need any monitoring/alerting from the ingest side to suggest checking available disk based on notably large dataset ingest?

Great question - we do have real-time monitoring available in the Observability tab for any of our GCE instances.

Production: https://console.cloud.google.com/compute/instances/observability?project=broad-singlecellportal&tab=observability&pageState=(%22nav%22:(%22section%22:%22disk%22))
Staging: https://console.cloud.google.com/compute/instancesDetail/zones/us-central1-a/instances/singlecell-mongo-02?project=broad-singlecellportal-staging&tab=monitoring&pageState=(%22observabilityTab%22:(%22mainContent%22:%22metrics%22,%22section%22:%22diskCapacity%22),%22duration%22:(%22groupValue%22:%22PT1H%22,%22customValue%22:null))

Note: the Ops agent is still installing in staging so the view/data may not be available yet but it should be soon.

As far as alerts or monitoring go, I think we can set up alerts in GCE re: disk size. I'll look into that.

@eweitz (Member) left a comment

The robustness improvements look good! I suggest a future possible refinement.

> when a MongoDB connection is reset, there is (apparently) no "timeout" that is invoked - the connection is dropped and the AutoReconnect exception triggers immediately

Thanks, this is useful to know. Looking closer into the meaning of our new custom setting for serverSelectionTimeoutMS, I see the MongoDB docs note that it "Specifies how long (in milliseconds) to block for server selection before throwing an exception. Default: 30,000 milliseconds."

So it seems this updated backoff will increase time between the initiation of reconnection attempts, whereas that timeout increased the max duration of a reconnection attempt. If MongoDB resources are saturated, I can see how increasing wait times between reconnection attempts (as this PR does) would help robustness in addition to increased wait times within each attempt.
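
To make the distinction concrete, serverSelectionTimeoutMS is a client-level option in PyMongo, separate from any sleep between retry attempts. A minimal sketch, with illustrative values and a placeholder URI rather than the portal's actual configuration:

from pymongo import MongoClient

# serverSelectionTimeoutMS bounds how long a single operation blocks while
# looking for a usable server before raising ServerSelectionTimeoutError.
client = MongoClient(
    "mongodb://localhost:27017",     # placeholder URI
    serverSelectionTimeoutMS=60000,  # per-attempt wait; the driver default is 30000
)

# The backoff discussed in this PR is different: it is the sleep *between*
# application-level retry attempts after AutoReconnect / BulkWriteError.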

 def retry(attempt_num):
-    if attempt_num < MAX_ATTEMPTS - 1:
+    if attempt_num < MongoConnection.MAX_AUTO_RECONNECT_ATTEMPTS - 1:
         exp_backoff = pow(2, attempt_num)
@eweitz (Member) commented:

As I understand it, an issue this PR addresses is that the backoff period was too brief, not that retries were too synchronized.

So, given no signs of thundering herds per se, I suspect we could more simply, and maybe more effectively, mitigate failed large ingests by increasing the underlying backoff while leaving jitter unchanged. The change's wider jitter does increase backoff, which indeed ought to help, just more subtly and indirectly.

Below is a trivial way we could implement that. The cost to test it would be non-trivial, though, so the refinement suggested below doesn't strike me as worthwhile unless we see future similar failures of large ingests.

Suggested change:
-    exp_backoff = pow(2, attempt_num)
+    exp_backoff = pow(2, attempt_num + 1)
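
A quick illustration of how the two formulas compare across the 10 attempts (base backoff only, before jitter; units depend on the implementation):

# pow(2, n):      1, 2, 4, 8, 16, 32, 64, 128, 256, 512
# pow(2, n + 1):  2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
for n in range(10):
    print(n, pow(2, n), pow(2, n + 1))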

@bistline (Contributor, Author) commented Dec 4, 2023

> Specifies how long (in milliseconds) to block for server selection before throwing an exception. Default: 30,000 milliseconds.

Thanks for the insightful comment. That is indeed correct, and I'm realizing now that my comment is slightly incorrect/misleading.

What I've found through testing is that serverSelectionTimeoutMS was only a small part of the overall issue. When the connection is dropped, the ingest process does find the MongoDB server again, but that connection keeps getting dropped/reset. Why that happens is still somewhat of a mystery to me, but it does correlate with large inserts. Reducing the batch size helped cut down on these drops, but not completely. The same goes for the timeout: it made individual connection retries more resilient, but instances where the process couldn't find the server in time are much rarer than it finding the server only to have the connection dropped again.

Also, we don't differentiate between the AutoReconnect exception and the BulkWriteError in the same retry logic, and these two very often travel together. Doubling the number of retries makes the entire process much more resilient. The jitter increase was less about "thundering herd" and more about "just wait longer". I think your suggestion of just adding 1 makes more sense than increasing the jitter coefficient, so I'll incorporate that change.
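
For context on the batch-size mitigation mentioned above, here is a generic sketch of chunking a large insert; the function name and chunk size are arbitrary, not the ingest pipeline's actual values:

def insert_in_batches(collection, documents, batch_size=1000):
    # Insert documents in smaller batches so each bulk write stays modest in size.
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # Each batch would go through the retry/backoff wrapper shown earlier.
        collection.insert_many(batch, ordered=False)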

@bistline merged commit 25c28fa into development Dec 5, 2023
@bistline deleted the jb-retry-attempt-increase branch August 26, 2024 16:12