ActorCache Read Coalesce #1916

Draft: wants to merge 6 commits into main

Conversation

MellowYarker (Contributor)

No description provided.

@@ -1444,76 +1424,131 @@ KJ_TEST("ActorCache read retry on flush containing only puts") {
KJ_ASSERT(KJ_ASSERT_NONNULL(promise.wait(ws)) == "123");
}

KJ_TEST("ActorCache read hard fail") {
// KJ_TEST("ActorCache read hard fail") {
Contributor Author:

This test is currently commented out because:

  1. When we call kj::mv(mockGet).thenThrow(KJ_EXCEPTION(FAILED, "read failed")); on the read promise, we cause the syncGet() storage operation to throw.
  2. This causes startFlushTransaction() to throw, which in turn causes the flushProm in flushImpl() to throw.
  3. We end up in the catch_() handler of the flushProm, and since our exception cannot be retried, we fall into the final else branch, throw another exception, and break the output gate.

Prior to the coalesced read implementation, we expected the failure of a single storage get() to propagate its exception to the JS caller. This was easy because storage get()s were completely independent operations from writes. Now that get() goes through the same flush as writes, we have a few more things to worry about.


Propagating the Exception to the caller

We'll focus on the single key get() for now. Currently, the JS get caller must wait for the flush to complete, and upon success, the caller will receive the result of the get().

This breaks when we force a storage get() to throw, because instead of succeeding, the flush will fail with broken.outputGateBroken; jsg.Error: Internal error in Durable Object storage write caused object to be reset.

I believe we can fix this by waiting on a joint promise in ActorCache::get(). We can keep the successful flush case (linked above), and add another promise for when the storage get (in syncGet()) fails. To make things explicit, we can wrap much of syncGet() in a try...catch, and upon entering the catch branch, propagate the exception to the waiting ActorCache::get(). This would effectively ensure that the JS caller receives the exception.

I've already tested this and confirmed it works, and it doesn't look too daunting. In short, the constructor for GetWaiter takes all the entries we care about for this get(). For each Entry, we create a PaF (promise/fulfiller pair), give the fulfiller to the Entry, and add the promise to a single joint "the read failed" promise owned by the GetWaiter.

  • At this point, the GetWaiter has a promise that resolves when any of the associated Entrys use their fulfiller, indicating the entire get has failed.

Back in the try...catch for syncGet(), if we're in the catch branch, we can just use the fulfiller on the Entry to signal that the storage get() failed.

This approach should extend to the multi-key get() fairly easily, since we can just iterate over all the Entrys in a GetFlush and fulfill each of their fulfillers. That said, I don't know if this is what we want -- I would have to look into it more. Not sure if we have a precedent for this.
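A minimal sketch of that wiring, assuming hypothetical names (readFailedFulfiller, readFailed) and a simplified Entry; the real types in actor-cache.h carry much more state than shown here:

```c++
#include <kj/async.h>
#include <kj/vector.h>

struct Entry {
  // ...existing fields elided...
  // Rejected by the flush path if the storage get() covering this entry fails.
  kj::Maybe<kj::Own<kj::PromiseFulfiller<void>>> readFailedFulfiller;
};

class GetWaiter {
public:
  explicit GetWaiter(kj::Vector<kj::Own<Entry>> entriesParam)
      : entries(kj::mv(entriesParam)) {
    // One promise/fulfiller pair per Entry: the fulfiller lives on the Entry so
    // syncGet()'s catch branch can reject it, and the promises are joined into
    // a single "the read failed" promise owned by this waiter.
    auto promises = kj::heapArrayBuilder<kj::Promise<void>>(entries.size());
    for (auto& entry: entries) {
      auto paf = kj::newPromiseAndFulfiller<void>();
      entry->readFailedFulfiller = kj::mv(paf.fulfiller);
      promises.add(kj::mv(paf.promise));
    }
    // Rejects as soon as any entry's fulfiller is rejected.
    readFailed = kj::joinPromisesFailFast(promises.finish());
  }

  kj::Vector<kj::Own<Entry>> entries;
  kj::Promise<void> readFailed = nullptr;
};
```

ActorCache::get() could then wait on the existing flush-success promise together with waiter->readFailed (e.g. via kj::Promise::exclusiveJoin()), so whichever settles first determines what the JS caller sees.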

What to do with the flush

Assuming we've propagated the exception to ActorCache::get(), we now need to decide what to do about this ongoing flush. Should we throw a fatal exception from syncGet()? Maybe only if it's a DISCONNECTED exception?

At least in the case of FAILED, if we still throw the exception, then this ActorCache read hard fail test will still fail because the output gate will break. Perhaps there are certain types of exceptions that should be propagated to the get() caller (FAILED/OVERLOADED), and certain types that should only be thrown from syncGet() (DISCONNECTED so we retry).
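As a rough sketch of that split (the helper name and the readFailedFulfiller member are assumptions carried over from the sketch above, not actual workerd code):

```c++
#include <kj/async.h>
#include <kj/exception.h>

// Hypothetical helper called from the catch branch around the storage call in
// syncGet(); `readFailedFulfiller` is the per-entry fulfiller from the
// GetWaiter sketch above.
void handleReadFailure(kj::Maybe<kj::Own<kj::PromiseFulfiller<void>>>& readFailedFulfiller,
                       kj::Exception&& exception) {
  if (exception.getType() == kj::Exception::Type::DISCONNECTED) {
    // Retriable: rethrow so flushImpl()'s normal retry machinery kicks in.
    kj::throwFatalException(kj::mv(exception));
  } else {
    // FAILED / OVERLOADED: hand the error to the waiting get() caller rather
    // than letting it escape the flush and break the output gate.
    KJ_IF_SOME(f, readFailedFulfiller) {
      f->reject(kj::mv(exception));
    }
  }
}
```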

Do Entrys need to be removed from dirtyList/cache?

If we don't propagate FAILED exceptions from syncGet(), do we need to worry about the Entrys associated with the failed get() in the dirtyList?

For example, let's say we fail the storage get(), send the exception to ActorCache::get(), and then don't throw from syncGet(). Assuming there are no other storage operations, we will continue down our flush path to the successful flush case, and iterate over the dirtyList, moving values into the cleanList. This seems wrong, although perhaps if the GetWaiter is destroyed these entries will also be removed from the dirtyList before the flushProm continuation runs?

I would think we would need to remove the entries from dirtyList inside syncGet(), but may need to think about this some more.

Member:

This all feels like the natural consequences of reusing lastFlush and the dirtyList. Can we keep the storage reads segregated onto a separate promise rather than integrating them so tightly into lastFlush? e.g. a coalescedReads or lastBatchedGet promise or something.

It seems like that would simplify some of this desire to treat read failures separately from write failures (and, as asked about in my other comment, it'd allow reads to resolve prior to needing to wait on writes to flush).


As for removing entries from the dirtyList, I'm not sure I totally understand the question unfortunately. If we're throwing an exception to the caller for a given get, then I'd think we should remove it from the list (although presumably not by adding it to the cleanList given that we don't know its value). If we're still going to try to get the key's value and return it, then we shouldn't. But I suspect that isn't an answer to your question.
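A bare-bones sketch of the separate read-promise idea suggested above, with member and method names invented for illustration (they are not the actual ActorCache API):

```c++
#include <kj/async.h>

class ActorCacheSketch {
public:
  void scheduleCoalescedRead() {
    // Chain the next read batch onto the read promise rather than onto the
    // write flush, so get() callers don't wait for dirty writes to commit.
    lastBatchedGet = lastBatchedGet.then([this]() { return flushReads(); });
  }

private:
  kj::Promise<void> flushReads() {
    // ...would build a GetFlush from the pending entries and hit storage...
    return kj::READY_NOW;
  }

  kj::Promise<void> lastFlush = kj::READY_NOW;       // writes only
  kj::Promise<void> lastBatchedGet = kj::READY_NOW;  // coalesced reads
};
```

With this shape, a read failure would only reject lastBatchedGet (and the affected GetWaiters), leaving the write flush and the output gate untouched.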

KJ_DISALLOW_COPY_AND_MOVE(GetWaiter);
~GetWaiter() noexcept(false) {
for (auto& entry : entries) {
if (!entry->isShared()) {
Contributor Author (MellowYarker, Apr 19, 2024):

On the topic of caching keys even if the GetWaiter goes away (reference).

This "cancel if the GetWaiter is destroyed" approach doesn't really work as is now.

If the GetWaiter is holding the last reference to the Entry, then upon its destruction (because JS dropped the get() promise) we would remove the Entry from the dirtyList and we wouldn't attempt to do the get() when we get around to flushing. The thing is, we almost always give the currentValues map a strong reference to the Entry, then addRef that and put it in the dirtyList (see findInCache()). This means by the time the GetWaiter has been initialized, we will (in almost all cases) have at least 2 strong references to the Entry -- one owned by the GetWaiter and one owned by currentValues. When GetWaiter's dtor runs, we won't even try to remove the Entry from dirtyList because it looks like it's shared.

I think one way to get around this is to change the check from "is shared" to "our Entry matches what's in currentValues + the status is UNKNOWN". This would imply we haven't changed the Entry since we initiated the get().


There's an open question on if we want to try to cancel the storage get() during the flush. I think this might be more complicated than it's worth, since we would have already put Entrys into a GetFlush, and the storage operation might have already run. What if the values have already been retrieved from storage, and the waiter is just waiting for the flush promise to finish?

Member:

I think one way to get around this is to change the check from "is shared" to "our Entry matches what's in currentValues + the status is UNKNOWN". This would imply we haven't changed the Entry since we initiated the get().

But what if some other get() request has been made for the same key? Couldn't that logic cause a false positive that would inadvertently cancel the read when it shouldn't?

(again, sorry for commenting on this without having yet read the full implementation -- I've gone from the test file to the header file so far).


There's an open question on if we want to try to cancel the storage get() during the flush.

If the storage operation has already been sent (or packed into a GetFlush in preparation for sending), then yeah I see little marginal return in trying to cancel it. But it does (to me at least) seem meaningfully more valuable to attempt to avoid sending a get to storage if it's canceled before we get around to the flush call.

Contributor Author:

But what if some other get() request has been made for the same key? Couldn't that logic cause a false positive that would inadvertently cancel the read when it shouldn't?

Yeah, that's a solid point. If we have 2 outstanding get()s for one key (i.e. two separate GetWaiters), then they'll each hold a strong ref (via addAtomicRef()), so if one of the GetWaiters is destroyed, we still can't assume that a match within currentValues & status UNKNOWN means we're the only outstanding get.

Maybe we can put a counter on the Entry which we increment/decrement whenever a GetWaiter with that Entry is created/destroyed? If it goes down to 1 we know we're the last waiter?
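Roughly, the counter idea could look like this (waiterCount is a hypothetical field, and the real Entry is atomically refcounted rather than this simplified struct):

```c++
#include <kj/common.h>
#include <kj/memory.h>
#include <kj/vector.h>

struct Entry {
  // ...existing fields elided...
  uint waiterCount = 0;  // number of live GetWaiters that include this entry
};

class GetWaiter {
public:
  explicit GetWaiter(kj::Vector<kj::Own<Entry>> entriesParam)
      : entries(kj::mv(entriesParam)) {
    for (auto& entry: entries) ++entry->waiterCount;
  }

  ~GetWaiter() noexcept(false) {
    for (auto& entry: entries) {
      if (--entry->waiterCount == 0) {
        // We were the last waiter for this entry, so no other outstanding
        // get() needs it; it would be safe to pull it back out of dirtyList
        // here (e.g. via a hypothetical cache.removeEntryIfUnused(*entry))
        // so the flush skips the storage read entirely.
      }
    }
  }

  kj::Vector<kj::Own<Entry>> entries;
};
```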

Member (a-robinson) left a comment:

Now that I've made it to the implementation file, I see enough TODO(now) comments that I suppose I'm not meant to review that part yet, but if there's more you want me to comment on beyond the couple of big questions you left, let me know. I'll get to those big questions next, at least.

// No writes will be done until our reads finish!
mockStorage->expectNoActivity(ws);

// Let's have the second read complete first.
Member:

Looks like previously we had test coverage of reads returning out of order, but now we've lost that. Could we re-add a case in which ActorCache sends a read request out, then another read is requested by the client, and that second read gets sent and responded to before the first read gets its result back?

Perhaps it's not even possible to have two outstanding reads to storage at the same time in this new version of the code, though? I haven't gotten that far yet, so apologies in advance if that turns out not to be possible.

Contributor Author:

Will give this a go tomorrow after looking at cancelling reads, but FWIW I suspect it's not possible. If they end up in separate flushes, then they'll resolve in the order they were requested, but if they end up in the same flush, they resolve at the same time (although I suspect whichever GetWaiter started waiting on readFlush first will resolve first).

// we actually expect an exception.
auto promise = test.get("foo").get<kj::Promise<kj::Maybe<kj::String>>>();
{
// We drop this quickly, the get should still be cached later.
Member:

My comment on the previous PR (4a28176#r1540069776) was more about the case where all gets are canceled, not where one is canceled but another isn't.

Because while it's relatively harmless to ask for corge in this case given that we're already having to ask for foo, in the case where all gets are canceled (e.g. if we also immediately dropped the promise for foo), sending the storage read is more wasteful since we don't really need either value and the read is blocking writes from proceeding.


// We won't write anything until the read completes.
mockStorage->expectNoActivity(ws);
// Since we still have the promise for foo and bar, we do send a get for them. However, 'baz' will
Member:

Huh :/ This is weirdly different from the ActorCache read cancel test above, where we drop the promise for the get("corge") call but end up fetching it from storage anyway. What's the cause of the difference? It'd be nice to note it here in the test somewhere.

@@ -1715,6 +1752,8 @@ KJ_TEST("ActorCache list() with limit") {

// Stuff after the last key is not in cache.
(void)expectUncached(test.get("fooa"));
// Return our uncached get.
mockStorage->expectCall("get", ws).withParams(CAPNP(key = "fooa")).thenReturn(CAPNP());
Member:

Sorry if I've missed where 4a28176#r1543935758 got resolved, but similar to my above comments, what's the deal with why our reads sometimes get canceled when we drop the returned promise and they sometimes don't?

src/workerd/io/actor-cache-test.c++ (outdated, resolved)

src/workerd/io/actor-cache.h (resolved)
src/workerd/io/actor-cache.h (resolved)
// for reads instead.
kj::Promise<kj::Maybe<Value>> getForPreview(kj::Own<Entry> entry, ReadOptions options);

// TODO(now): Need to implement getMultiple for preview sessions.
Member:

Leaving a comment as a reminder.

getFlush.batches.clear();
getFlush.entries.clear();

// TODO(now): This doesn't seem to be firing...
Member:

The readCancellation promise only gets resolved during the ActorCache destructor. That cancelReadFlush->fulfill() call there adds an event to the kj event loop to call any callbacks that were waiting on the readCancellation promise, but it doesn't immediately call them. The ActorCache finishes getting destructed (and whatever code path caused the destruction of the ActorCache would also continue running up until it yields the event loop) before any of those callbacks would run.

And, importantly, syncGets() is only called from within the flushImpl() code path, which is a chain of async work owned by the lastFlush promise on the ActorCache class. So when the ActorCache instance gets destructed, that lastFlush promise will also get canceled.

So it would surprise me greatly (and probably be a bad thing) if there was some code path in which this promise fired. Which means we may need a different approach to more synchronously do whatever needs doing here during destruction (assuming it can't be handled gracefully via destructors/RAII).
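A tiny standalone illustration of that kj behavior, independent of ActorCache:

```c++
#include <kj/async.h>
#include <kj/debug.h>

int main() {
  kj::EventLoop loop;
  kj::WaitScope waitScope(loop);

  auto paf = kj::newPromiseAndFulfiller<void>();
  bool ran = false;
  auto promise = paf.promise.then([&ran]() { ran = true; });

  paf.fulfiller->fulfill();
  KJ_ASSERT(!ran);          // fulfill() only queues the continuation...
  promise.wait(waitScope);  // ...it runs once the event loop gets a turn.
  KJ_ASSERT(ran);
  return 0;
}
```

So by the time anything waiting on readCancellation could run, ~ActorCache has already finished and the lastFlush promise (and the syncGets() work chained off it) has been canceled.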

Contributor Author:

497ded4

This seems to get the test passing locally, but some of our internal tests are seeing segfaults now (need to look into them more, this is weirdly tricky). Any idea why destroying the joint promise wouldn't also cause the GetMultiStreamImpl to be dropped? I had to explicitly clear out GetMultiStreamImpl::entries; I was hoping to just destroy the object, but it seems its lifetime is extended further and I'm not sure what's keeping it alive.

Member (a-robinson, May 2, 2024):

Any idea why destroying the joint promise wouldn't also cause the GetMultiStreamImpl to be dropped?

If the call to syncGets() is still being awaited on somewhere, then the clients vector would still be getting held on the relevant coroutine stack. We could presumably fix that by either (A) moving the clients vector into an attachment on the joint promise, or (B) making sure that whatever promise that's keeping the syncGets call alive gets canceled.
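For reference, option (A) might look roughly like this (the function and parameter names are assumptions based on this thread, and the generated actor-storage.capnp.h header providing rpc::ActorStorage is assumed to be included):

```c++
#include <kj/async.h>
#include <kj/vector.h>
// (assumes the generated actor-storage.capnp.h header is also included)

// Hypothetical: join the per-batch promises and attach the capability clients,
// tying their lifetime to the joined promise instead of to the suspended
// syncGets() coroutine frame.
kj::Promise<void> joinGetBatches(
    kj::Array<kj::Promise<void>> batchPromises,
    kj::Vector<rpc::ActorStorage::ListStream::Client> clients) {
  // Cancelling this promise (e.g. when lastFlush is dropped in ~ActorCache)
  // now also destroys the clients, without depending on coroutine unwind order.
  return kj::joinPromisesFailFast(kj::mv(batchPromises)).attach(kj::mv(clients));
}
```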

And I'd guess what's keeping it alive is the lastFlush promise, which is held as a member variable on ActorCache.

But would you like me to try poking at this locally? And if so, can you point me at which test(s) to look at? It's always easier to figure out cancellation/destruction order issues by experimenting than by just looking at the code and guessing.

Contributor Author (MellowYarker, May 4, 2024):

So, the test that would fail is bazel test @workerd//src/workerd/io:actor-cache-test, specifically ActorCache list stream cancellation. It seems like the joint promise does get cancelled (and the individual promises in the loop also get cancelled), but the GetMultiStreamImpl, which we move into a rpc::ActorStorage::ListStream::Client, doesn't seem to get destroyed.

I'm not sure if it's the ListStream::Client that is living too long, or the req object that takes a reference to the ListStream::Client. I suspect we're somehow creating a strong reference to the capability (briefly looking at req.setStream(), I see we eventually call Capability::Client::hook->addRef() in PointerHelpers<>::set()), preventing it from being destroyed even though the promise is done.


It would be good to find a way to rely on RAII here, since this approach is causing a segfault in a test in our internal repo (our clients array contains raw pointers, so no surprise there).

Member:

It looks like this is fixed now? At least it's not failing for me when I pull this branch down and try to repro it.

Contributor Author:

Yup! This test is fine, but one of our internal tests is failing because of a segfault that was introduced when fixing ActorCache list stream cancellation. I'll let you know which one sometime this week; I'm not able to connect to wifi on my laptop for some reason.

When we destroy ActorCache, we go through and destroy the promises in syncGets(), which results in us calling GetMultiStreamImpl::clear(). However, sometimes this object has already been destroyed by the time we call clear(). There's probably a reasonable fix, but this already feels a bit hacky, which is why I mentioned that it'd be better to find a more robust approach.

Contributor Author:

875e4ac fixes the segfault and I think is sufficiently commented on.

getFlush.entries.clear();
co_await syncGet(storage, kj::mv(entry));
} else if (getFlush.entries.size() > 1) {
co_await syncGets(storage, kj::mv(getFlush));
Member:

Does this structuring of the code -- where after doing our coalesced reads we have to go ahead and finish executing this startFlushTransaction method, including successfully flushing all of our writes to storage, before the lastFlush promise will resolve -- mean that our storage.get()/storage.getMultiple() promises no longer resolve until after any dirty writes have been flushed to disk?

If so, that feels like a meaningful latency regression for reads given that writes take much longer than reads.

If not, I'm not seeing what mechanism would cause the reads to resolve prior to this method finishing.

Contributor Author:

1d3c4b1

Here's an initial go at having reads complete separately.

I think the most obvious (potential) issue here is that although reads can now resolve before the flush finishes, the next batch of reads will have to wait for the next flush, which will only start after all the currently flushing writes are finished. This of course assumes subsequent reads aren't hitting cache.

I've been dealing with a headache today so haven't really been able to think about whether that's such a bad thing just yet. I suppose we have to introduce delay at some point, otherwise we can't really batch the reads up...

Member:

I think the most obvious (potential) issue here is that although reads can now resolve before the flush finishes, the next batch of reads will have to wait for the next flush

That isn't the end of the world, although I also don't see why it would be necessary (aside from it being more convenient in the code).


MellowYarker force-pushed the milan/counted-deletes branch 2 times, most recently from 047bf09 to f284b17 on April 22, 2024 17:37
MellowYarker changed the base branch from milan/counted-deletes to main on April 24, 2024 15:36
MellowYarker (Contributor Author) commented Apr 26, 2024:

I turned the 2nd commit into a fixup because I resolved the TODOs I had and the rest of it was very small.

I'll get to the other comments I haven't replied to tomorrow. I had a bunch of code git-stashed that I wanted to try to get in first, but I decided to deal with the separate read promise chain instead.

MellowYarker force-pushed the milan/read-coalesce branch 2 times, most recently from 9ab5599 to 773db0c on April 27, 2024 00:05
a-robinson (Member):

Sorry I waited so long to look at this, I assumed there was still more work to be done. And TBH I'm still not sure if this was waiting on me or if you were planning on doing more.

So to be clearer, what is the actual state of this and what can I do to help right now?

MellowYarker (Contributor Author) commented May 8, 2024:

FYI @a-robinson I'm going to push up some new work as a third commit today, with the eventual goal of squashing the commits into 1 before we finally merge this.

Just want to keep this one separate for now since it's giving me some trouble... This third commit focuses on having reads intercept exceptions from RPC calls, and then either propagating them to the GetWaiters or allowing the exception to trigger a retry (via the normal flush retry mechanism).

This fixes the "ActorCache read hard fail" test, but that test only checks that single-key get()s receive the exception. I've added another (currently broken) test that extends "ActorCache read hard fail" to see if we get the same behavior when we have multiple keys awaiting a response from storage. I'm not sure if the mock framework actually lets us test this or not, though.

It seems this change has also broken 9 other tests; I haven't had time to look into why just yet (tl;dr: the other tests were broken by a very subtle issue in my initial implementation, and I was chasing that bug for the last day and a half 😬). Anyways, I suspect these are all breaking for one or two reasons that shouldn't be too hard to clear up.


Edit: OK, it turns out it was very simple: we changed the size of Entry and the tests hardcoded the expected value, so we started evicting when we weren't before. My one test is still broken, but I'll push up what I have once I fully understand the new Entry size (I'd expect it to be 136 bytes but it's 142).

MellowYarker (Contributor Author):

The latest commit 00ba985 uncomments the "getMultiple hard fail" test and attempts to address the reason for the test failure, but for some reason the exception seems to be causing more problems in this case than in the single-key get() test.

We do manage to propagate it to the caller (the test in this case), but the exception is getting logged and the test is failing.

I thought it might have to do with how we are catching the exception over RPC, or maybe that we give copies to more than 1 promise fulfiller (which ends up in a single joined promise)?

bcaimano and others added 6 commits May 22, 2024 20:49

This commit attempts to provide a mechanism to cancel the promise
that waits for the read batches to complete in syncGets(). This is
needed for when the `ActorCache` is destroyed, since we need to destroy
all `Entry`s before we destroy the `SharedLru`.

Prior to this commit, coalescing reads meant that exceptions on reads
would be subject to flushImpl()'s exception handling. In the retriable
case, this meant up to 4 retries. Otherwise, this meant breaking the
output gate.

Prior to coalescing reads, a get() would never break the output gate,
and would only be subject to a single retry.

This commit gets us a bit closer to how things used to be in that reads
will not break the output gate, but we do not guarantee at least one
retry upon DISCONNECT.

If we drop all our GetWaiters, we might as well not do the read since it
could evict another Entry from cache (and no one is waiting for the
result anyways).

The test is still failing because we're throwing a fatal somewhere in
ActorCache (which isn't being caught). I'm not really sure where this is
happening just yet.
MellowYarker (Contributor Author):

https://github.com/cloudflare/workerd/compare/00ba98560d35f0f7fe216b989ec17f4d07eec522..7f5b18fac3bab8fc6901ba1345a82b4ad0df709d is a rebase near or on main since this work might be paused for a bit.
