ActorCache Read Coalesce #1916
Conversation
src/workerd/io/actor-cache-test.c++
Outdated
@@ -1444,76 +1424,131 @@ KJ_TEST("ActorCache read retry on flush containing only puts") {
  KJ_ASSERT(KJ_ASSERT_NONNULL(promise.wait(ws)) == "123");
}

-KJ_TEST("ActorCache read hard fail") {
+// KJ_TEST("ActorCache read hard fail") {
This test is currently commented out because:

- When we `kj::mv(mockGet).thenThrow(KJ_EXCEPTION(FAILED, "read failed"));` on the read promise, we cause the `syncGet()` storage operation to throw.
- That causes `startFlushTransaction()` to throw, which in turn causes the `flushProm` in `flushImpl()` to throw.
- We end up in the `catch_()` handler of the `flushProm`, and since our exception cannot be retried, we fall into the last `else`, throw another exception, and break the output gate.
Prior to the coalesced read implementation, we would expect the failure of a single storage `get()` to propagate the exception to the JS caller. This was easy because storage `get()`s were completely independent operations from writes. Now that `get()` goes through the same flush as writes, we've got a few more things to worry about.
**Propagating the exception to the caller**
We'll focus on the single-key `get()` for now. Currently, the JS get caller must wait for the flush to complete, and upon success, the caller will receive the result of the `get()`.

This breaks when we force a storage `get()` to throw, because instead of succeeding, the flush fails with `broken.outputGateBroken; jsg.Error: Internal error in Durable Object storage write caused object to be reset`.

I believe we can fix this by waiting on a joint promise in `ActorCache::get()`. We can keep the successful flush case (linked above), and add another promise for when the storage get (in `syncGet()`) fails. To make things explicit, we can wrap much of `syncGet()` in a `try...catch`, and upon entering the catch branch, propagate the exception to the waiting `ActorCache::get()`. This would ensure that the JS caller receives the exception.
I've already tested this and confirmed it works, and it doesn't look too daunting. In short, the constructor for `GetWaiter` takes all the entries we care about for this `get()`. For each `Entry`, we create a PaF, give the fulfiller to the `Entry`, and add the promise to a single joint "the read failed" promise owned by the `GetWaiter`.
- At this point, the `GetWaiter` has a promise that resolves when any of the associated `Entry`s use their fulfiller, indicating the entire get has failed.
Back in the `try...catch` for `syncGet()`, if we're in the catch branch, we can just use the fulfiller on the `Entry` to signal that the storage `get()` failed. (Rough sketch of the shape below.)
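For concreteness, here is a minimal sketch of that shape using plain kj primitives. This is not the actual workerd implementation: `readFailedFulfiller` and `anyReadFailed` are illustrative names, and the real `Entry`/`GetWaiter` carry much more state.

```c++
#include <kj/async.h>
#include <kj/string.h>
#include <kj/vector.h>

struct Entry {
  kj::String key;
  // Rejected from syncGet()'s catch branch if the storage read for this key fails.
  kj::Maybe<kj::Own<kj::PromiseFulfiller<void>>> readFailedFulfiller;
};

struct GetWaiter {
  kj::Vector<kj::Own<Entry>> entries;
  // Settles as soon as any associated Entry's fulfiller is rejected.
  kj::Promise<void> anyReadFailed = kj::NEVER_DONE;

  explicit GetWaiter(kj::Vector<kj::Own<Entry>> newEntries)
      : entries(kj::mv(newEntries)) {
    for (auto& entry: entries) {
      auto paf = kj::newPromiseAndFulfiller<void>();
      entry->readFailedFulfiller = kj::mv(paf.fulfiller);
      // exclusiveJoin() settles with whichever side settles first, so a single
      // rejected per-entry promise settles the whole joint promise.
      anyReadFailed = anyReadFailed.exclusiveJoin(kj::mv(paf.promise));
    }
  }
};
```

`ActorCache::get()` could then wait on something like `flushSucceeded.exclusiveJoin(kj::mv(waiter->anyReadFailed))`, so the caller sees whichever settles first: the successful flush or the failed read.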
This approach should extend to the multi-key `get()` fairly easily, since we can just iterate over all the `Entry`s in a `GetFlush` and fulfill each of their fulfillers. That said, I don't know if this is what we want -- I would have to look into it more. Not sure if we have a precedent for this.
**What to do with the flush**
Assuming we've propagated the exception to `ActorCache::get()`, we now need to decide what to do about this ongoing flush. Should we throw a fatal exception from `syncGet()`? Maybe only if it's a `DISCONNECTED` exception?

At least in the case of `FAILED`, if we still throw the exception, then this `ActorCache read hard fail` test will still fail because the output gate will break. Perhaps there are certain types of exceptions that should be propagated to the `get()` caller (`FAILED`/`OVERLOADED`), and certain types that should only be thrown from `syncGet()` (`DISCONNECTED`, so we retry).
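To make that split concrete, the catch branch of `syncGet()` might look roughly like the sketch below (building on the `Entry` sketch above). This is only an illustration of the idea, not the real implementation; `doStorageGet` is a hypothetical stand-in for the actual storage RPC.

```c++
#include <kj/async.h>
#include <kj/function.h>

kj::Promise<void> syncGetSketch(kj::Own<Entry> entry,
                                kj::Function<kj::Promise<void>()> doStorageGet) {
  return doStorageGet().catch_(
      [entry = kj::mv(entry)](kj::Exception&& exception) mutable -> kj::Promise<void> {
    if (exception.getType() == kj::Exception::Type::DISCONNECTED) {
      // Retriable: rethrow so the flush's normal retry handling sees it.
      kj::throwFatalException(kj::mv(exception));
    }
    // FAILED / OVERLOADED: hand the exception to the waiting ActorCache::get()
    // and swallow it here so the flush itself doesn't break the output gate.
    KJ_IF_SOME(fulfiller, entry->readFailedFulfiller) {
      fulfiller->reject(kj::mv(exception));
    }
    return kj::READY_NOW;
  });
}
```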
**Do `Entry`s need to be removed from dirtyList/cache?**
If we don't propagate `FAILED` exceptions from `syncGet()`, do we need to worry about the `Entry`s associated with the failed `get()` in the `dirtyList`?

For example, let's say we fail the storage `get()`, send the exception to `ActorCache::get()`, and then don't throw from `syncGet()`. Assuming there are no other storage operations, we will continue down our flush path to the successful flush case and iterate over the `dirtyList`, moving values into the `cleanList`. This seems wrong, although perhaps if the `GetWaiter` is destroyed these entries will also be removed from the `dirtyList` before the `flushProm` continuation runs?

I would think we would need to remove the entries from `dirtyList` inside `syncGet()`, but I may need to think about this some more.
This all feels like the natural consequence of reusing `lastFlush` and the `dirtyList`. Can we keep the storage reads segregated onto a separate promise rather than integrating them so tightly into `lastFlush`? e.g. a `coalescedReads` or `lastBatchedGet` promise or something.

It seems like that would simplify some of this desire to treat read failures separately from write failures (and, as asked about in my other comment, it'd allow reads to resolve prior to needing to wait on writes to flush).

As for removing entries from the `dirtyList`, I'm not sure I totally understand the question, unfortunately. If we're throwing an exception to the caller for a given `get`, then I'd think we should remove it from the list (although presumably not by adding it to the `cleanList`, given that we don't know its value). If we're still going to try to get the key's value and return it, then we shouldn't. But I suspect that isn't an answer to your question.
src/workerd/io/actor-cache.h
Outdated
KJ_DISALLOW_COPY_AND_MOVE(GetWaiter);
~GetWaiter() noexcept(false) {
  for (auto& entry : entries) {
    if (!entry->isShared()) {
On the topic of caching keys even if the `GetWaiter` goes away (reference):

This "cancel if the `GetWaiter` is destroyed" approach doesn't really work as is now.

If the `GetWaiter` is holding the last reference to the `Entry`, then upon its destruction (because JS dropped the `get()` promise) we would remove the `Entry` from the `dirtyList` and we wouldn't attempt to do the `get()` when we get around to flushing. The thing is, we almost always give the `currentValues` map a strong reference to the `Entry`, then `addRef` that and put it in the `dirtyList` (see `findInCache()`). This means that by the time the `GetWaiter` has been initialized, we will (in almost all cases) have at least two strong references to the `Entry` -- one owned by the `GetWaiter` and one owned by `currentValues`. When `GetWaiter`'s dtor runs, we won't even try to remove the `Entry` from `dirtyList` because it looks like it's shared.

I think one way to get around this is to change the check from "is shared" to "our `Entry` matches what's in `currentValues` + the status is `UNKNOWN`". This would imply we haven't changed the `Entry` since we initiated the `get()`.
There's an open question on whether we want to try to cancel the storage `get()` during the flush. I think this might be more complicated than it's worth, since we would have already put `Entry`s into a `GetFlush`, and the storage operation might have already run. What if the values have already been retrieved from storage, and the waiter is just waiting for the flush promise to finish?
> I think one way to get around this is to change the check from "is shared" to "our `Entry` matches what's in `currentValues` + the status is `UNKNOWN`". This would imply we haven't changed the `Entry` since we initiated the `get()`.

But what if some other `get()` request has been made for the same key? Couldn't that logic cause a false positive that would inadvertently cancel the read when it shouldn't?

(Again, sorry for commenting on this without having yet read the full implementation -- I've gone from the test file to the header file so far.)
> There's an open question on if we want to try to cancel the storage get() during the flush.

If the storage operation has already been sent (or packed into a `GetFlush` in preparation for sending), then yeah, I see little marginal return in trying to cancel it. But it does (to me at least) seem meaningfully more valuable to attempt to avoid sending a get to storage if it's canceled before we get around to the flush call.
> But what if some other get() request has been made for the same key? Couldn't that logic cause a false positive that would inadvertently cancel the read when it shouldn't?

Yeah, that's a solid point. If we have two outstanding `get()`s for one key (i.e. two separate `GetWaiter`s), then they'll each hold a strong ref (via `addAtomicRef()`), and so if one of the `GetWaiter`s is destroyed, we still can't assume that a match within `currentValues` + status `UNKNOWN` means we're the only outstanding get.

Maybe we can put a counter on the `Entry` which we increment/decrement whenever a `GetWaiter` with that `Entry` is created/destroyed? If it goes down to 1, we know we're the last waiter? (Rough sketch below.)
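A minimal sketch of that counter idea, with illustrative names only (the real `Entry` is far more involved, and this ignores the existing atomic refcounting). Here the count tracks live `GetWaiter`s, so decrementing to zero is the "we're the last waiter" signal described above:

```c++
#include <kj/memory.h>
#include <kj/string.h>
#include <kj/vector.h>

struct Entry {
  kj::String key;
  uint getWaiterCount = 0;  // number of live GetWaiters referencing this Entry
};

struct GetWaiter {
  kj::Vector<kj::Own<Entry>> entries;

  explicit GetWaiter(kj::Vector<kj::Own<Entry>> newEntries)
      : entries(kj::mv(newEntries)) {
    for (auto& entry: entries) ++entry->getWaiterCount;
  }

  ~GetWaiter() noexcept(false) {
    for (auto& entry: entries) {
      if (--entry->getWaiterCount == 0) {
        // We were the last waiter interested in this key: no caller is left to
        // receive the value, so this is where the pending read could be dropped
        // (e.g. removed from dirtyList), regardless of how many strong refs
        // currentValues still holds.
      }
    }
  }
};
```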
Now that I've made it to the implementation file, I see enough TODO(now) comments that I suppose I may not be intended to review that yet, but if there's more you want me to comment on beyond the couple of big questions you left, let me know. I'll get to those big questions next, at least.
// No writes will be done until our reads finish!
mockStorage->expectNoActivity(ws);

// Let's have the second read complete first.
Looks like previously we had test coverage of reads returning out of order, but now we've lost that. Could we re-add a case in which `ActorCache` sends a read request out, then another read is requested by the client, and that second read gets sent and responded to before the first read gets its result back?

Perhaps it's not even possible to have two outstanding reads to storage at the same time in this new version of the code, though? I haven't gotten that far yet, so apologies in advance if that turns out not to be possible.
Will give this a go tomorrow after looking at cancelling reads, but FWIW I suspect it's not possible. If they end up in separate flushes, then they'll resolve in the order they were requested, but if they end up in the same flush, they resolve at the same time (although I suspect whichever `GetWaiter` started waiting on `readFlush` first will resolve first).
src/workerd/io/actor-cache-test.c++
Outdated
// we actually expect an exception.
auto promise = test.get("foo").get<kj::Promise<kj::Maybe<kj::String>>>();
{
  // We drop this quickly, the get should still be cached later.
My comment on the previous PR (4a28176#r1540069776) was more about the case where all gets are canceled, not where one is canceled but another isn't.

Because while it's relatively harmless to ask for `corge` in this case, given that we're already having to ask for `foo`, in the case where all gets are canceled (e.g. if we also immediately dropped the promise for `foo`), sending the storage read is more wasteful since we don't really need either value and the read is blocking writes from proceeding.
// We won't write anything until the read completes.
mockStorage->expectNoActivity(ws);
// Since we still have the promise for foo and bar, we do send a get for them. However, 'baz' will
Huh :/ This is weirdly different from the `ActorCache read cancel` test above, where we drop the promise for the `get("corge")` call but end up fetching it from storage anyway. What's the cause of the difference? It'd be nice to note it here in the test somewhere.
src/workerd/io/actor-cache-test.c++
Outdated
@@ -1715,6 +1752,8 @@ KJ_TEST("ActorCache list() with limit") {

  // Stuff after the last key is not in cache.
  (void)expectUncached(test.get("fooa"));
  // Return our uncached get.
  mockStorage->expectCall("get", ws).withParams(CAPNP(key = "fooa")).thenReturn(CAPNP());
Sorry if I've missed where 4a28176#r1543935758 got resolved, but similar to my above comments, what's the deal with why our reads sometimes get canceled when we drop the returned promise and they sometimes don't?
src/workerd/io/actor-cache.h
Outdated
// for reads instead.
kj::Promise<kj::Maybe<Value>> getForPreview(kj::Own<Entry> entry, ReadOptions options);

// TODO(now): Need to implement getMultiple for preview sessions.
Leaving a comment as a reminder.
src/workerd/io/actor-cache.c++
Outdated
getFlush.batches.clear();
getFlush.entries.clear();

// TODO(now): This doesn't seem to be firing...
The `readCancellation` promise only gets resolved during the `ActorCache` destructor. The `cancelReadFlush->fulfill()` call there adds an event to the kj event loop to call any callbacks that were waiting on the `readCancellation` promise, but it doesn't immediately call them. The `ActorCache` finishes getting destructed (and whatever code path caused the destruction of the `ActorCache` would also continue running up until it yields to the event loop) before any of those callbacks would run.

And, importantly, `syncGets()` is only called from within the `flushImpl()` code path, which is a chain of async work owned by the `lastFlush` promise on the `ActorCache` class. So when the `ActorCache` instance gets destructed, that `lastFlush` promise will also get canceled.

So it would surprise me greatly (and probably be a bad thing) if there was some code path in which this promise fired. Which means we may need a different approach to more synchronously do whatever needs doing here during destruction (assuming it can't be handled gracefully via destructors/RAII).
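For what it's worth, that scheduling behavior is easy to see in isolation with a tiny standalone kj program (not workerd code): fulfilling a promise only queues its continuation, which doesn't run until something yields to the event loop.

```c++
#include <kj/async.h>
#include <kj/debug.h>

int main() {
  kj::EventLoop loop;
  kj::WaitScope ws(loop);

  auto paf = kj::newPromiseAndFulfiller<void>();
  bool ran = false;
  auto continuation = paf.promise.then([&ran]() { ran = true; });

  paf.fulfiller->fulfill();
  KJ_ASSERT(!ran);       // the continuation has only been queued...

  continuation.wait(ws);
  KJ_ASSERT(ran);        // ...it runs once we yield to the event loop.
  return 0;
}
```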
This seems to get the test passing locally, but some of our internal tests are seeing segfaults now (need to look into them more; this is weirdly tricky). Any idea why destroying the joint promise wouldn't also cause the `GetMultiStreamImpl` to be dropped? I had to explicitly clear out `GetMultiStreamImpl::entries`; I was hoping to just dtor the object, but it seems like its lifetime is extended further, and I'm not sure what's keeping it alive.
> Any idea why destroying the joint promise wouldn't also cause the GetMultiStreamImpl to be dropped?

If the call to `syncGets()` is still being awaited on somewhere, then the `clients` vector would still be getting held on the relevant coroutine stack. We could presumably fix that by either (A) moving the `clients` vector into an attachment on the joint promise, or (B) making sure that whatever promise is keeping the `syncGets` call alive gets canceled.
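For option (A), kj's `.attach()` makes this straightforward: the joint promise takes ownership of the vector, so cancelling or destroying the promise destroys the clients too instead of leaving them on a coroutine frame. A minimal sketch with stand-in names (`branches`, `clients`), not the actual `syncGets()` code:

```c++
#include <capnp/capability.h>
#include <kj/async.h>
#include <kj/vector.h>

kj::Promise<void> joinOwningClients(kj::Array<kj::Promise<void>> branches,
                                    kj::Vector<capnp::Capability::Client> clients) {
  // The returned promise now owns `clients`; they are destroyed when it
  // completes or is canceled.
  return kj::joinPromises(kj::mv(branches)).attach(kj::mv(clients));
}
```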
And I'd guess what's keeping it alive is the `lastFlush` promise, which is held as a member variable on `ActorCache`.
But would you like me to try poking at this locally? And if so, can you point me at which test(s) to look at? It's always easier to figure cancellation / destruction order stuff out by testing stuff out than just by looking at the code and guessing.
So, the test that would fail is `bazel test @workerd//src/workerd/io:actor-cache-test`, specifically `ActorCache list stream cancellation`. It seems like the joint promise does get cancelled (and the individual promises in the loop also get cancelled), but the `GetMultiStreamImpl`, which we move into an `rpc::ActorStorage::ListStream::Client`, doesn't seem to get destroyed.

I'm not sure if it's because the `ListStream::Client` is living long, or if the `req` object which takes a reference to the `ListStream::Client` is living long. I suspect we're somehow creating a strong reference to the capability (briefly looking at `req.setStream()`, I see we eventually call `Capability::Client::hook->addRef()` in `PointerHelpers<>::set()`), preventing it from being dtor'd even though the promise is done.

It would be good to find a way to rely on RAII here, since this approach is causing a segfault in a test in our internal repo (our `clients` array contains raw pointers, so no surprise there).
It looks like this is fixed now? At least it's not failing for me when I pull this branch down and try to repro it.
Yup! This test is fine, but one of our internal tests is failing because of a segfault that was introduced when fixing `ActorCache list stream cancellation`. I'll let you know which one some time this week; I'm not able to connect to wifi on my laptop for some reason.

When we destroy `ActorCache`, we go through and destroy the promises in `syncGets()`, which results in us calling `GetMultiStreamImpl::clear()`. However, sometimes this object has already been destroyed by the time we call `clear()`. There's probably a reasonable fix, but this already feels a bit hacky, which is why I mentioned that it'd be better to find a more robust approach.
875e4ac fixes the segfault and I think is sufficiently commented on.
    getFlush.entries.clear();
    co_await syncGet(storage, kj::mv(entry));
  } else if (getFlush.entries.size() > 1) {
    co_await syncGets(storage, kj::mv(getFlush));
Does this structuring of the code -- where after doing our coalesced reads we have to go ahead and finish executing this `startFlushTransaction` method, including successfully flushing all of our writes to storage, before the `lastFlush` promise will resolve -- mean that our `storage.get()`/`storage.getMultiple()` promises no longer resolve until after any dirty writes have been flushed to disk?

If so, that feels like a meaningful latency regression for reads, given that writes take much longer than reads.

If not, I'm not seeing what mechanism would cause the reads to resolve prior to this method finishing.
Here's an initial go at having reads complete separately.

I think the most obvious (potential) issue here is that although reads can now resolve before the flush finishes, the next batch of reads will have to wait for the next flush, which will only start after all the currently flushing writes are finished. This of course assumes subsequent reads aren't hitting cache.

I've been dealing with a headache today, so I haven't really been able to think about whether that's such a bad thing just yet. I suppose we have to introduce delay at some point, otherwise we can't really batch the reads up...
> I think the most obvious (potential) issue here is that although reads can now resolve before the flush finishes, the next batch of reads will have to wait for the next flush

That isn't the end of the world, although I also don't see why it would be necessary (aside from it being more convenient in the code).
I turned the 2nd commit into a fixup because I resolved the TODOs I had and the rest of it was very small. Will get to the other comments that I haven't replied to tomorrow, I just had a bunch of code I had
Sorry I waited so long to look at this, I assumed there was still more work to be done. And TBH I'm still not sure if this was waiting on me or if you were planning on doing more. So to be clearer, what is the actual state of this and what can I do to help right now?
FYI @a-robinson I'm going to push up some new work as a third commit today, with the eventual goal of squashing the commits into one before we finally merge this. Just want to keep this one separate for now since it's giving me some trouble...

This third commit focuses on having reads intercept exceptions from RPC calls, and then either propagating them to the

This fixes the test

It seems this change has also broken 9 other tests, and I haven't had time to look into why just yet because: tl;dr, the other tests were broken due to a very subtle issue with my initial implementation, and I was chasing that bug the last day and a half 😬. Anyways, I suspect these are all breaking for one or two reasons that shouldn't be too hard to clear up.

Edit: Ok, it turns out it was very simple, we've changed the size of
The latest commit 00ba985 uncomments the "getMultiple hard fail" test and attempts to address the reason for the test failure, but for some reason the exception seems to be causing more problems in this case than in the single-key get() test. We do manage to propagate it to the caller (the test in this case), but the exception is getting logged and the test is failing. I thought it might have to do with how we are catching the exception over RPC, or maybe that we give copies to more than one promise fulfiller (which all end up in a single joined promise)?
This commit attempts to provide a mechanism to cancel the promise that waits for the read batches to complete in syncGets(). This is needed for when the `ActorCache` is destroyed, since we need to destroy all `Entry`s before we destroy the `SharedLru`.
Prior to this commit, coalescing reads meant that exceptions on reads would be subject to flushImpl()'s exception handling. In the retriable case, this meant up to 4 retries. Otherwise, it meant breaking the output gate. Prior to coalescing reads, a get() would never break the output gate, and would only be subject to a single retry. This commit gets us a bit closer to how things used to be, in that reads will not break the output gate, but we do not guarantee at least one retry upon DISCONNECTED.
If we drop all our GetWaiters, we might as well not do the read since it could evict another Entry from cache (and no one is waiting for the result anyways).
The test is still failing because we're throwing a fatal somewhere in ActorCache (which isn't being caught). I'm not really sure where this is happening just yet.
https://github.com/cloudflare/workerd/compare/00ba98560d35f0f7fe216b989ec17f4d07eec522..7f5b18fac3bab8fc6901ba1345a82b4ad0df709d is a rebase near or on