Helper in daphne_worker_test may lose reports in AggregateStore #109
Comments
Thanks for the report, @divergentdave. I believe this is related to #73. My understanding is that the transaction guarantee no longer holds as soon as you have DO->DO communication. This happens here because AggregateStore calls GarbageCollector. This is not a bug in Miniflare -- Miniflare is correctly emulating production here. If this is indeed the root cause, then this would get resolved once we use an alarm for garbage collection: https://github.com//issues/94 I plan to work on this in the coming weeks (DAP-02 changes are higher priority right now). In the meantime: How frequently does this bug come up? Can you tolerate flakiness for a few weeks?
I think we can likely tolerate flakiness for a few weeks. (at worst, we might disable running this integration test by default, until a fix is in place -- not ideal but it'd certainly fix flakiness, and the test is not nearly so flaky that we can't run it manually to ensure we haven't broken Daphne-compatibility while making changes to Janus, which I suppose is my primary concern.)
As an alternative to disabling the test, would you accept/have time to review a workaround similar to what was put in place for #73? I'd view this as an acceptable tradeoff in test-realism for the time being. (I'm also not sure if this workaround would be as cut-and-dried as the one for #73 -- if not, that would be a good reason not to spend the time implementing it.)
I think we've seen it a few times a day in CI, but it's infrequent enough that retries can take care of it. Hmm, I wonder if the DO->DO communication issue falls under the "with no other intervening I/O" clause, or if there's something else going on. With this particular conflict, the garbage collector is only touched outside of the critical section, at the start of the fetch method. I notice that the request's body is read between the "agg_share" get and put calls. I wonder if that opens a hole in "the system will prevent concurrent events from executing while awaiting a read operation", such that another DO call can interleave while the first is reading its request body.
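To make the hypothesis above concrete, here is a minimal sketch of it as a toy scheduler. Everything in it is an assumption for illustration: the `Step` enum, the `run` function, and the rule that only a body read lets another request run are hypothetical and do not describe the real Workers runtime or Miniflare. It only shows that, if the body read is an interleaving point, moving it ahead of the get would close the gap.

```rust
/// One step of a hypothetical merge handler. These names are illustrative
/// and are not part of Daphne or the Workers API.
#[derive(Clone, Copy)]
enum Step {
    Get,      // read "agg_share" (an absent key counts as 0)
    ReadBody, // await the request body; ASSUMED to let another request run
    Put,      // write back the value seen by Get plus this request's count
}

/// Runs one handler per entry in `counts`, all following the same `steps`.
/// Toy scheduling rule: control moves to the next request only after a
/// ReadBody step; storage steps never yield.
fn run(steps: &[Step], counts: &[u64]) -> u64 {
    let n = counts.len();
    let mut stored: Option<u64> = None; // the "agg_share" value
    let mut pc = vec![0usize; n];       // next step index per request
    let mut seen = vec![0u64; n];       // value each request read
    let mut current = 0;

    while pc.iter().any(|&p| p < steps.len()) {
        if pc[current] >= steps.len() {
            // This request has finished; move on to the next one.
            current = (current + 1) % n;
            continue;
        }
        let step = steps[pc[current]];
        pc[current] += 1;
        match step {
            Step::Get => seen[current] = stored.unwrap_or(0),
            Step::Put => stored = Some(seen[current] + counts[current]),
            Step::ReadBody => current = (current + 1) % n,
        }
    }
    stored.unwrap_or(0)
}
```

Under these assumptions, `run(&[Step::Get, Step::ReadBody, Step::Put], &[13, 33])` ends with 33 stored, while `run(&[Step::ReadBody, Step::Get, Step::Put], &[13, 33])` ends with 46. The 13/33 split is made up; only the totals 46 and 33 come from the report.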
Yes, I would review a workaround. I think what we could do is use the environment variable you added last week to disable garbage collection altogether. For example, we would make this line conditional: https://github.com/cloudflare/daphne/blob/main/daphne_worker/src/durable/aggregate_store.rs#L65
We are testing Janus against Daphne in an integration test, and we have noticed an intermittent failure. (I can trigger the failure more reliably when my CPU is under high load.) We're testing against commit e1b503e, but I think the issue has not yet been addressed on main. When it receives a merge request, AggregateStore tries to read from the "agg_share" key, uses an empty aggregate share if it's not yet present, updates the share, and writes it back. I updated this method as follows with some logging, and I saw the following leading up to a test failure.
Later, the leader sends an aggregate share request for 46 reports, but the Daphne helper returns an error due to a batch mismatch, as it is expecting 33 reports.
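The failure described here has the shape of a classic lost update. The sketch below is a self-contained illustration, not Daphne's code: `Storage` is a toy stand-in for Durable Object storage, the aggregate share is reduced to a plain report count, and the split of 46 reports into 13 and 33 is an assumption (the report only gives the totals 46 and 33).

```rust
use std::collections::HashMap;

/// Toy stand-in for a Durable Object's key-value storage. This is NOT the
/// real storage API; it only models get and put on a report count.
#[derive(Default)]
struct Storage {
    map: HashMap<String, u64>,
}

impl Storage {
    fn get(&self, key: &str) -> Option<u64> {
        self.map.get(key).copied()
    }

    fn put(&mut self, key: &str, value: u64) {
        self.map.insert(key.to_string(), value);
    }
}

/// Two merge requests whose read-modify-write sequences overlap: both read
/// "agg_share" before either one writes it back.
fn interleaved_merges(first: u64, second: u64) -> u64 {
    let mut storage = Storage::default();
    // Request A reads; the key is absent, so it starts from an empty share.
    let a_seen = storage.get("agg_share").unwrap_or(0);
    // Request B reads before A has written, so it also sees an empty share.
    let b_seen = storage.get("agg_share").unwrap_or(0);
    // A writes its result, then B overwrites it with a stale base.
    storage.put("agg_share", a_seen + first);
    storage.put("agg_share", b_seen + second);
    storage.get("agg_share").unwrap_or(0)
}

/// The same two merges when each read-modify-write runs to completion
/// before the next one starts.
fn serialized_merges(first: u64, second: u64) -> u64 {
    let mut storage = Storage::default();
    for count in [first, second] {
        let seen = storage.get("agg_share").unwrap_or(0);
        storage.put("agg_share", seen + count);
    }
    storage.get("agg_share").unwrap_or(0)
}
```

With the assumed split, `interleaved_merges(13, 33)` leaves 33 in storage while `serialized_merges(13, 33)` leaves 46, which matches the mismatch between what the leader requested and what the helper expected.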
Clearly there is some issue with the transactionality of this DO's storage accesses, but it's hard for me to say whether it's a bug in Daphne or Miniflare, i.e. whether the same would occur when running on Cloudflare's real runtime. The DO API documentation says that "a series of reads followed by a series of writes (with no other intervening I/O) are automatically atomic and behave like a transaction." When the DO tries to read "agg_share" and gets an error with "No such value in storage.", does that count as a read or not for the purposes of blocking other concurrent events? In other words, does the transactional storage API handle phantom reads correctly? There is a transaction() method available, which takes a callback and runs it in an explicit transaction. It's possible the phantom read behavior is different with that. (Again, this could differ between Miniflare and Cloudflare.)