
Do persistent cache writebacks asynchronously #7025

Merged
merged 13 commits into master from async-cache
Mar 13, 2024

Conversation

msullivan
Member

Whenever an entry is added to the in-memory cache, enqueue it in a
queue that a per-db worker task consumes. The worker task manages all
cache inserts and evictions, which should fully eliminate the
serialization error dangers. This also lets us persistently cache
queries that were first compiled from within a rolled-back
transaction.

I also re-enabled the query-cache-changes notifications, with some
very minor rate limiting. We'll see how it goes.

Thoughts?

FUTURE NOTE: The situation for function caching, once we want to
reintroduce it, will be a little more complicated. We'll need to put
both the SQL text and the function call code in the in-memory cache,
and use the text until the function is visible everywhere (at which
point we can drop the text). We'll also need to do some thinking about
how to test it properly, because the downside of the approach is that
in the typical path, the first execution of a query can't use the
function. We may need some sort of testing path to allow us to
exercise the functions easily in the test code.
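The queue-plus-single-worker scheme described above can be sketched roughly like this (a hypothetical, simplified model; `CacheWorker` and the dict standing in for the persistent store are illustrative, not the actual EdgeDB code):

```python
import asyncio

class CacheWorker:
    """Hypothetical sketch of the per-db writeback worker described above."""

    def __init__(self):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._store: dict = {}  # stands in for the persistent cache table

    def enqueue(self, key, value):
        # Called whenever an entry is added to the in-memory cache.
        self._queue.put_nowait((key, value))

    async def run(self):
        # The single consumer task owns all inserts and evictions, so
        # writers can never race each other into serialization errors.
        while True:
            key, value = await self._queue.get()
            self._store[key] = value
```

Because only the worker task ever touches the persistent store, concurrent frontend connections just enqueue and move on.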

@msullivan msullivan requested review from elprans and fantix March 8, 2024 23:30
@msullivan msullivan changed the title Do persistent cache writebacks asynchrounously Do persistent cache writebacks asynchronously Mar 8, 2024
@msullivan msullivan marked this pull request as ready for review March 11, 2024 19:33
added_since_signal
and (
    self._cache_queue.empty()
    or added_since_signal > 100
Member

Ideally we want to implement a "debounce" pattern here. Basically it would be:

  • A task that exists solely to send the query-cache-changes signal
  • Every time we call it, it would extend the wait by some T_ADD time
  • When the wait time elapses, it would send the PG signal
  • If the total time waiting exceeds T_MAX, it should send a signal and go back to square one

This would:

  • eliminate magic numbers like 100
  • save the system from spamming everyone with PG events
  • make the system reasonable with respect to the max lag it can take before notifying others

Member Author

Yeah, I planned to do something along those lines, but I first wanted to see whether the aarch64 trouble recurred after these changes without it. The aarch64 builds are too broken by spot-instance trouble to tell yet, though.

Member Author

Alright, I implemented something. Let me know if it basically matches what you were thinking.

Member

I read the code a few times and I think it's good. I want @fantix to thoroughly read it too.

Member
@1st1 left a comment

Good work, Sully. Added some comments, lmk if you need help implementing the debounce thingy.

@@ -174,6 +174,9 @@ cdef class Database:
metrics.background_errors.inc(
Member

^^ Also, we should probably stop using debug.dump directly in code and use proper logging.error or the like. If it's not printing when debugging, that should be debugged.

# Empty the queue, for batching reasons.
ops = [await self._cache_queue.get()]
while not self._cache_queue.empty():
    ops.append(self._cache_queue.get_nowait())
Member

I'd add a comment that this is safe for asyncio queues (unlike thread queues, where the empty()/get_nowait() combo is not guaranteed to succeed due to races).
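For illustration, a drain of this shape is safe with asyncio queues because the event loop runs coroutines cooperatively: with no await between empty() and get_nowait(), no other consumer can slip in. A hypothetical standalone sketch:

```python
import asyncio

async def drain(queue: asyncio.Queue) -> list:
    # Block for the first item, then drain whatever else is queued.
    # Safe for asyncio queues: nothing else can run between empty()
    # and get_nowait() because we never await in between (unlike a
    # threaded queue.Queue, where another consumer could race us).
    ops = [await queue.get()]
    while not queue.empty():
        ops.append(queue.get_nowait())
    return ops
```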

Member

(I froze here for a few moments thinking, mistakenly, that your code was incorrect)

Member Author

Oh yeah; I felt naughty while doing this, but I should have added a comment.

Member
@1st1 left a comment

I think this is in a good state now. time.time() must be swapped with time.monotonic() -- that's my last major comment. @fantix please review this thoroughly.
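The reason for preferring time.monotonic() over time.time() for deadlines: the wall clock can be stepped backwards (NTP adjustments, manual changes), while the monotonic clock is guaranteed never to go backwards. A quick illustration:

```python
import time

# time.monotonic() never goes backwards, so it is the right clock for
# measuring elapsed intervals (e.g. debounce deadlines); time.time()
# can jump if the system clock is adjusted mid-measurement.
start = time.monotonic()
time.sleep(0.01)
elapsed = time.monotonic() - start
assert elapsed > 0
```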

Member
@fantix left a comment

I think the caching part is good; some nits below. But the pgcon state tracking is broken (I'm surprised no test caught it; I'll add a test or fix one).

Comment on lines 312 to 313
if query_unit.tx_rollback:
    be_conn.last_state = None
Member

What was out of sync that led to this change? We should fix that instead.

Not updating last_state here will cripple this pgcon with a mismatching state; a concurrent frontend connection with a different state may pick it up and run into wrong data.

Member Author

Good catch.

The problem was that run_ddl didn't sync state in all cases.

except TimeoutError:
    t = time.time()
else:
    pending_keys.append(str(key))
Member

I would also drain the queue up to MAX_KEYS pending keys here, because asyncio.wait_for() takes at least one extra iteration of the event loop when timeout is not None, so our while True loop could end up spinning through 100 tight event-loop iterations, which sounds like an expensive CPU moment.

Member Author

I may just try changing it to use timeout_at instead of wait_for, which looks like it won't create a task and so won't yield back to the event loop when get succeeds immediately.

One reason I'm reluctant to switch to explicitly draining the queue is that the latest version has been refactored to take an input function instead of operating on a queue (though I could change it back to an input queue).

@msullivan msullivan requested a review from fantix March 13, 2024 00:19
@msullivan
Member Author

I pushed some changes but they aren't visible yet, and there appears to be a github incident: https://www.githubstatus.com/incidents/q3gl44chgxwv

msullivan added a commit that referenced this pull request Mar 13, 2024
This is extracted from #7025, for history cleanliness and backportability.
Something was getting out of sync, and the extra connection traffic
was exposing it.
while True:
    try:
        async with asyncio.timeout_at(target_time):
            v = await input()
Member

Hm, so you will be waiting with a timeout even for the first input() return, but that's not what we want to do.

We want to start debouncing only when it returns. The first await on it shouldn't be with a timeout at all.

Member

Oh sigh, this is why I don't like None for cases like this. Please add an if:

if target_time is not None:
    async with ...
else:
    await input()

Member
@1st1 left a comment

One nit comment which I do want fixed; otherwise all good.

@msullivan msullivan merged commit 6bbfde4 into master Mar 13, 2024
23 checks passed
@msullivan msullivan deleted the async-cache branch March 13, 2024 17:38
msullivan added a commit that referenced this pull request Mar 15, 2024
Whenever an entry is added to the in-memory cache, enqueue it in a
queue that a per-db worker task consumes. The worker task manages all
cache inserts and evictions, which should fully eliminate the
serialization error dangers. This also lets us persistently cache
queries that were first compiled from within a rolled-back
transaction.

Also, re-enable query-cache-notifications, but make them include the
specific keys to query, and implement a debouncing/batching algorithm
to cut down on traffic.

Unfortunately, even with query-cache-notifications off, the async caching
starts triggering the bizarre failures to look up transactions in postgres
on arm64 linux. I'm going to put up a follow-up that disables the cache
there, so we can move ahead with testing and the release.

FUTURE NOTE: The situation for function caching, once we want to
reintroduce it, will be a little more complicated. We'll need to put
both the SQL text and the function call code in the in-memory cache,
and use the text until the function is visible everywhere (at which
point we can drop the text). We'll also need to do some thinking about
how to test it properly, because the downside of the approach is that
in the typical path, the first execution of a query can't use the
function. We may need some sort of testing path to allow us to
exercise the functions easily in the test code.
msullivan added a commit that referenced this pull request Mar 15, 2024
This is extracted from #7025, for history cleanliness and backportability.
msullivan added a commit that referenced this pull request Mar 15, 2024
msullivan added a commit that referenced this pull request Mar 15, 2024
This is extracted from #7025, for history cleanliness and backportability.
aljazerzen pushed a commit that referenced this pull request Mar 19, 2024
This is extracted from #7025, for history cleanliness and backportability.
3 participants