
sql: can OOM a cluster with concurrent TPC-H load #64906

Closed
jordanlewis opened this issue May 8, 2021 · 19 comments
Labels
C-investigation Further steps needed to qualify. C-label will change. T-sql-queries SQL Queries Team

Comments

@jordanlewis
Member

jordanlewis commented May 8, 2021

It's easy to overwhelm a cluster and cause it to OOM by throwing highly concurrent TPC-H load at it.

Instructions to repro:

  1. Start a roachprod cluster.
  2. ./cockroach workload init tpch --data-loader=import
  3. ./cockroach workload run tpch --queries 1 --concurrency=1024 --tolerate-errors
  4. Wait. Eventually, some of the nodes will probably be OOM-killed.

I spent a good deal of time today playing with OOM conditions in Cockroach in general, after learning about this fun issue: golang/go#42805. That issue talks about how spiky allocations can cause problems in Go because of the GC pacer implementation. I think we see this in practice all the time. Someone wrote a "watchdog" (https://github.com/raulk/go-watchdog) that runs GC manually when certain watermarks are exceeded, so I thought I'd play with this idea.

Things I learned:

  1. If you add a goroutine that calls runtime.GC() once every second (a minimal sketch follows this list), the cluster does not do much better, although it lasts a little bit longer.
  2. The accounted-for SQL memory stays relatively constant.
  3. Turning off the load allows the memory usage to return to baseline relatively quickly.
  4. The GC allows the nodes to return to ~50% memory usage for a while, but it slowly begins inching up from that 50% baseline until the node eventually dies. I think the 50% is expected (25% for SQL, 25% for pebble).
  5. The cluster can do zero work this entire time (all queries either return SQL memory reservation errors or hang).
  6. Heap profiles show "expected size" profiles, with all memory in expected places.
  7. The major unaccounted-for part of the profiles is the scan buffer from pebble.
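
For context, here's a minimal, self-contained sketch of the "call runtime.GC() once every second" experiment from item 1. This is illustrative only - the real experiment was a hack inside cockroach, and go-watchdog itself triggers GC on heap watermarks rather than on a timer:

```go
package main

import (
	"runtime"
	"time"
)

// startGCTicker forces a garbage collection on a fixed interval until stop is
// closed, mirroring the "runtime.GC() once every second" experiment above.
func startGCTicker(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				runtime.GC()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startGCTicker(time.Second, stop)
	// ... the rest of the server would run here ...
	time.Sleep(3 * time.Second)
	close(stop)
}
```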

When the oomkiller strikes, it can be very sudden. One time I saw the memory go from less than 50% heap usage to dead without another sample (I was using https://github.com/arl/statsviz, an amazing project that samples once per second when you have it open in the browser, and that we should integrate into Cockroach's DB Console). This was pretty scary - nothing that I know of allocates so quickly. What could it mean?

In general I think we have 2 avenues we should poke at:

  1. What is that huge allocation that sometimes strikes right before oomkill? It could very well be a red herring, but I'm not sure.

     [image]

  2. Should we account for pebble scan batches? It's odd because we immediately transfer all of that memory into an accounted-for SQL batch, and then allow the pebble scan batch to be freed, but perhaps the moments at which both the SQL batch and the pebble scan batch are live are too long when the cluster is overloaded?

     [image]

Jira issue: CRDB-7339

@jordanlewis jordanlewis added the C-investigation Further steps needed to qualify. C-label will change. label May 8, 2021
@jordanlewis
Member Author

Related issue: #64916. After fixing that issue, I observe better memory behavior, but the problem isn't completely gone - I can still crash the cluster with high concurrency on certain TPCH queries (I've been playing with query 18), though it takes somewhat longer now.


I noticed something interesting: there might be a cascading failure behavior here, caused by the differences in performance when distsql plans are correct (table readers on leaseholders) vs when they are incorrect (table readers are not on leaseholders).

DistSQL plans table readers on leaseholders at plan time. But if the cluster starts getting overwhelmed and leaseholders move around due to liveness issues, then by the time a plan is run, the leaseholder may have changed. DistSQL does not know how to replan dynamically: it is going to run the same plan that it started with, even if leaseholders change.

The performance difference when a table reader is on the leaseholder and when it's not is quite significant. This is because, when the table reader is not on the leaseholder, it has to do a gRPC request to the leaseholder to fetch data.

Also, when table readers are on leaseholders, there is 1 memory buffer unaccounted for: the pebble one discussed above.

On the other hand, when table readers are not on leaseholders, there are 2 memory buffers unaccounted for:

  1. the gRPC buffers that are used to receive batches from the leaseholder (always allocated and thrown away in recv - shoutout to the classic issue "transport: reuse memory on read path", grpc/grpc-go#1455)
  2. the BatchResponse protobuf that is decoded into from step 1 (see BatchResponse.Unmarshal generated code)

This is twice the unaccounted-for memory!

So I'm theorizing that, as a cluster's performance begins to falter due to lack of resources, leaseholders move around, which leads to worse memory behavior (not to mention slower reads in the first place, due to the extra serialization and network IO of the remote hop). That overwhelms the cluster further, until it eventually blows up: everything is slow, and the slop in the memory accounting gets worse as the plans become inaccurate with respect to leaseholder info.

@jordanlewis
Member Author

Example of the "double unaccounted-for memory" as it would appear in a pprof flamegraph:

[flamegraph image]

When a tablereader is planned on the leaseholder, neither of these stacks will appear: instead, you'll see the pebbleResults.put stack like in the flamegraph higher up in the thread.

@petermattis
Collaborator

Are you aware of GODEBUG=allocfreetrace=1? It may be infeasible to use this facility due to the volume of output, but it is another tool for tracking down a hard-to-discern, sudden spike in memory usage. If the volume of output is too large, you could hack the Go runtime to only output allocations/frees larger than some threshold size.

@jordanlewis
Member Author

Yes, I was considering exactly this strategy yesterday, but I haven't tried it yet.

At the moment, I am theorizing that the large, sudden allocations are caused by gRPC. gRPC has no memory-pressure-based flow control, which means that if a node sends several batch requests to another node that is stuck, and that node suddenly gets unstuck, the sender will happily eat itself to death when all of the batch responses start actually coming back.

@joshimhoff
Collaborator

joshimhoff commented May 8, 2021

With the two concrete issues mentioned (pebble scan unaccounted for mem usage & table reader != leaseholder leading to grpc buffer + BatchResponse unaccounted for mem usage), will admission control keep the lack of accounting from leading to OOMs, at least maybe?

  1. pebble scan unaccounted for mem usage

Obviously the accounting being missing could / should (well, y'all tell me) be corrected. But for this to cause OOMs, it sounds like you need a very high level of concurrency. Perhaps with admission control, you will start queuing work before reaching that level of concurrency. Perhaps also, CPU usage will be kept low enough that the theory below will not play out:

perhaps the moments at which both the SQL batch and the pebble scan batch are live are too long when the cluster is overloaded?

  2. table reader != leaseholder leading to grpc buffer + BatchResponse unaccounted for mem usage

Again the accounting being missing could / should be corrected. But IIUC admission control will queue work so as to reserve enough resources to maintain liveness. So leaseholders won't move around, at least not for the concrete reason of failing to maintain liveness.

Thoughts? Maybe there are more "OOM when highly overloaded due to DML-only workload" bugs that will also be "fixed" by admission control, as there will be a limit to how overloaded CRDB ever gets, due to pulling (some of) the queueing out of the Go runtime into the application.

@jordanlewis
Member Author

But for this to cause OOMs, it sounds like you need a very high level of concurrency. Perhaps with admission control, you will start queuing work before reaching that level of concurrency.

I agree that admission control will likely help us get out of this situation. But I'm not sure if it would be enough on its own, depending on how it deals with long-running, giant distsql requests, which are pretty different than kv-style load.


I think we probably need to hook gRPC up to our memory accounting. An appropriate place would be here: as you can see, we read a message length off the wire and then unconditionally allocate a message of that size.

https://github.com/grpc/grpc-go/blob/d2d6bdae07e844b8a3502dcaf00dc7b1b5519a59/rpc_util.go#L562-L575
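
For illustration, here's a self-contained simplification of the pattern in that linked code - gRPC frames each message as a 1-byte compression flag plus a 4-byte big-endian length, then allocates a buffer of exactly that length. This is a sketch, not the actual grpc-go implementation, and the hook location is an assumption:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// recvMsg reads one length-prefixed message: a 5-byte header followed by the
// payload, which is allocated unconditionally at the advertised size.
func recvMsg(r io.Reader, maxRecvSize int) ([]byte, error) {
	var header [5]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	length := int(binary.BigEndian.Uint32(header[1:]))
	if length > maxRecvSize {
		return nil, fmt.Errorf("message of %d bytes exceeds max %d", length, maxRecvSize)
	}
	// A memory-accounting hook would need to reserve `length` bytes from the
	// SQL memory monitor here, before the allocation below, so that the batch
	// response is visible to the budget for as long as it is held alive.
	msg := make([]byte, length)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

func main() {
	// Build a fake frame around a small payload and read it back.
	payload := []byte("BatchResponse bytes")
	var frame bytes.Buffer
	frame.WriteByte(0) // uncompressed
	var lenHdr [4]byte
	binary.BigEndian.PutUint32(lenHdr[:], uint32(len(payload)))
	frame.Write(lenHdr[:])
	frame.Write(payload)

	msg, err := recvMsg(&frame, 1<<20)
	fmt.Printf("read %d bytes, err=%v\n", len(msg), err)
}
```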

I'm less convinced that we need to hook pebble up to our memory accounting just yet. I think that the memory accounting delay here is not as significant as that of gRPC, but more experimentation is needed.

@adityamaru
Contributor

@jordanlewis I went down a similar path when debugging an index backfiller OOM a few months ago - #58933 (comment). The only thing growing unbounded was the pebble scan batch. We fixed it by using a new Fetcher per index backfill batch, rather than a single, long-lived one - #61366, but I couldn't understand why the underlying pebble batch wasn't being freed. Not much to add to your investigation but I'm keenly following along in case my fix just papered over something more sinister.

@jordanlewis
Member Author

Oh, thanks for the tip! That does seem extremely suspicious. I see a couple of places where the fetcher could be holding onto memory from a pebble batch that we could be clearing, but doing that wouldn't hold more than 1 extra pebble batch in memory at a time, so I'm not totally sure how your fix could have solved a problem that was worse than an accidental 2x in memory - I don't see how we could have had unbounded growth.

I'm thinking about the indexKey, kv, keyRemainingBytes fields in particular - those are slices into underlying memory that isn't copied out of the pebble batch for performance reasons. We should clear those when the fetcher is finished and that will save us some memory, possibly up to 2x.

I think the cfetcher has similar issues, in the nextKV and lastRowPrefix fields.
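
A minimal sketch of what "clear those when the fetcher is finished" could look like - the struct and field names below are hypothetical stand-ins for the real fetcher types, not actual CockroachDB code:

```go
package main

// fetcher stands in for the row fetcher discussed above: these fields alias
// memory owned by the most recent pebble batch / batch response.
type fetcher struct {
	indexKey          []byte
	keyRemainingBytes []byte
	lastRowPrefix     []byte
}

// reset drops the aliasing references so the underlying batch memory can be
// garbage collected as soon as the fetcher is done with it.
func (f *fetcher) reset() {
	f.indexKey = nil
	f.keyRemainingBytes = nil
	f.lastRowPrefix = nil
}

func main() {
	f := &fetcher{indexKey: make([]byte, 1<<20)}
	f.reset() // the 1 MiB buffer is now collectable
}
```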

Also, there are some similar issues in kv_batch_fetcher, which stores its responses and remainingBatches slices in a queue pattern and continually slices off the front of them to do the queuing. But it doesn't nil out the front elements of the queue. I actually made a change to fix this earlier on the branch that I've been playing with for this issue, but it didn't improve anything. It's possible that this was the root cause of #58933, though - it could look like unbounded growth depending on how Go's slice reallocation algorithm behaves when you need more space in a slice that you've also been "cutting off the front of".

Diff for this last thing that I've been playing with:

diff --git a/pkg/sql/row/kv_batch_fetcher.go b/pkg/sql/row/kv_batch_fetcher.go
index 6197c2d746..5a4e5a95b3 100644
--- a/pkg/sql/row/kv_batch_fetcher.go
+++ b/pkg/sql/row/kv_batch_fetcher.go
@@ -504,11 +506,13 @@ func (f *txnKVFetcher) nextBatch(
 ) (ok bool, kvs []roachpb.KeyValue, batchResponse []byte, origSpan roachpb.Span, err error) {
        if len(f.remainingBatches) > 0 {
                batch := f.remainingBatches[0]
+               f.remainingBatches[0] = nil
                f.remainingBatches = f.remainingBatches[1:]
                return true, nil, batch, f.origSpan, nil
        }
        for len(f.responses) > 0 {
                reply := f.responses[0].GetInner()
+               f.responses[0] = roachpb.ResponseUnion{}
                f.responses = f.responses[1:]
                origSpan := f.requestSpans[0]
                f.requestSpans = f.requestSpans[1:]
@@ -517,12 +521,14 @@ func (f *txnKVFetcher) nextBatch(
                case *roachpb.ScanResponse:
                        if len(t.BatchResponses) > 0 {
                                batchResp = t.BatchResponses[0]
+                               t.BatchResponses[0] = nil
                                f.remainingBatches = t.BatchResponses[1:]
                        }
                        return true, t.Rows, batchResp, origSpan, nil
                case *roachpb.ReverseScanResponse:
                        if len(t.BatchResponses) > 0 {
                                batchResp = t.BatchResponses[0]
+                               t.BatchResponses[0] = nil
                                f.remainingBatches = t.BatchResponses[1:]
                        }
                        return true, t.Rows, batchResp, origSpan, nil
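
To see in isolation the retention behavior this diff guards against, here's a self-contained illustration (not CockroachDB code; the sizes are arbitrary) showing that slicing off the front of a queue keeps the popped element reachable via the backing array until it is nil'ed out:

```go
package main

import (
	"fmt"
	"runtime"
)

func liveMiB() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc >> 20
}

func main() {
	// A queue of "batch responses": one large payload followed by a small one.
	q := [][]byte{make([]byte, 8<<20), make([]byte, 1)}

	q = q[1:] // pop the front without clearing it
	runtime.GC()
	// The 8 MiB payload is still live: q points into the same backing array,
	// and the array's element 0 still references the payload.
	fmt.Printf("front merely sliced off:  ~%d MiB live (len=%d cap=%d)\n", liveMiB(), len(q), cap(q))

	// The pattern from the diff above: drop the reference before advancing.
	q2 := [][]byte{make([]byte, 8<<20), make([]byte, 1)}
	q2[0] = nil
	q2 = q2[1:]
	runtime.GC()
	// The large payload is now unreachable and has been collected.
	fmt.Printf("front nil'ed then sliced: ~%d MiB live (len=%d cap=%d)\n", liveMiB(), len(q2), cap(q2))
}
```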

@jordanlewis
Member Author

jordanlewis commented May 9, 2021

It's completely sad that Go doesn't have any tools that allow you to explore a heap dump and understand what is retaining different objects. A problem like this would take, seriously, all of 15 minutes to solve with Java and YourKit. </rant>

@jordanlewis
Member Author

Btw, I filed golang/go#45984 recently because the Go viewcore program doesn't work on the cores we emit for whatever reason.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 10, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
@joshimhoff
Collaborator

joshimhoff commented May 10, 2021

It's completely sad that Go doesn't have any tools that allow you to explore a heap dump and understand what is retaining different objects.

Was reading about this too. Sad!

I agree that admission control will likely help us get out of this situation.

One concrete Q I have is: When you are hitting these OOMs, what CPU & disk IOPS utilization are ya at? High enough that we would be queueing work in an admission control world?

But I'm not sure if it would be enough on its own, depending on how it deals with long-running, giant distsql requests, which are pretty different than kv-style load.

Can you say more about this? Do you have a concrete worry about such load or is it more uncertainty about how admission control (at the SQL level) will interact with such load?

@jordanlewis
Member Author

One concrete Q I have is: When you are hitting these OOMs, what CPU & disk IOPS utilization are ya at? High enough that we would be queueing work in an admission control world?

I'm not sure but I will check. I'm almost sure we'd be queueing work here. I didn't make it clear, but sending 1024 concurrent TPCH query 18s at a 3-node cluster is a completely absurd thing to be doing. It's full overload - there is no chance for the cluster to be able to respond to these queries successfully without queueing. The absurdity of this workload is why it's a good testbed for these OOMs, but also probably (hopefully) why admission control would help protect us here in the future.

Can you say more about this? Do you have a concrete worry about such load or is it more uncertainty about how admission control (at the SQL level) will interact with such load?

It's uncertainty about where the admission control spigots will be placed. Since these queries are large and relatively long-running, and since we kick them off all at once from an idle system, I don't think admission control would necessarily notice anything to reject or queue at first. So I'm uncertain whether anybody will detect the overload and be able to error out some of these queries while they're running.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 17, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
craig bot pushed a commit that referenced this issue May 17, 2021
64946: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None

65195: kvserver: change shouldCampaignOnWake to campaign when the leader is dead r=lunevalex a=lunevalex

Touches #57093

For a lease to move off a dead node to a new healthy node, a healthy node
must first win a raft election and then apply the lease transfer. For a
quiescent range this means the following:
 - Wait 9s for liveness to expire
 - Attempt to acquire the lease on a healthy node, this is rejected
   because lease acquisition can only happen on the raft leader
 - The rejection unquiesces the range
 - The nodes wait out the election timeout (3s) before trying to
   campaign
 - A raft election occurs and a healthy node wins
 - A lease transfer can now be processed by the newly elected leader

This change proposes to eliminate the extra wait between when the
range is unquiesced and a new campaign is kicked off. The node
unquiescing the range will now check if the raft leader is alive according
to liveness and if it's not it will kick off a campaign to try
and win raft leadership.

Release note: None

65339: tracing: fix benign data race r=irfansharif a=irfansharif

Fixes #65148, it's possible to override the op name associated with a
trace after having instantiated it. We'll want to protect access to it
through a mutex.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 18, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 18, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
craig bot pushed a commit that referenced this issue May 18, 2021
65373: release-20.2: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Backport 1/1 commits from #64946.

/cc @cockroachdb/release

---

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None


Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot pushed a commit that referenced this issue May 18, 2021
65372: release-21.1: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Backport 1/1 commits from #64946.

/cc @cockroachdb/release

---

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None


Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@jordanlewis
Member Author

#65881 improves the behavior for query 18.

Query 4 can still OOM a server in seconds. I think it's because lookup join does not play nicely here and is not well accounted for in the memory accounting system. I have filed a new issue:

#65904

craig bot pushed a commit that referenced this issue Jun 3, 2021
65881: row: fix intra-query memory leaks in kvFetcher and txnKVFetcher r=jordanlewis a=jordanlewis

Updates #64906. The critical change is the first patch, the kvfetcher one.
The kvbatchfetcher one is theoretically good but I haven't found it to make
as large of a difference as the first patch.

With the first patch applied alone, I can no longer cause OOM conditions with
1024 concurrent TPCH query 18s sent at a single machine, which is a major
improvement. Prior to that patch, such a workload would overwhelm the machine
within 1-2 minutes.

This bug was found with help of the new tooling I've been adding to `viewcore`,
mostly the new `pprof` output format, the existing html object explorer, and the
new type explorer. You can see these updates at
https://github.com/jordanlewis/debug/tree/crl-stuff.

----

The KVFetcher is the piece of code that does the first level of decoding
of a KV batch response, doling out slices of keys and values to higher
level code that further decodes the key values into formats that the SQL
engine can operate on.

The KVFetcher uses a slice into the batch response to keep track of
where it is during the decoding process. Once the slice is empty, it's
finished until someone asks it for a new batch.

However, the KVFetcher used to keep around that empty slice pointer for
its lifetime, or until it was asked for a new batch. This causes the
batch response to be un-garbage-collectable, since there is still a
slice pointing at it, even though the slice is empty.

This causes queries to use up to 2x their accounted-for batch memory,
since the memory accounting system assumes that once data is transferred
out of the batch response into the SQL representation, the batch
response is freed - it assumes there's just 1 "copy" of this batch
response memory.

This is especially problematic for long queries (since they will not
allow that KVFetcher memory to be freed until they're finished).

In effect, this causes 1 extra batch per KVFetcher per query to be
retained in memory. This doesn't sound too bad, since a batch is of
fixed size. But the max batch size is 1 megabyte, so with 1024
concurrent queries, each with 3 KVFetchers, like we see in a TPCH
workload with 1024 concurrent query 18s, that's 1024 * 1MB * 3 = 3GB of
unaccounted for memory. This is easily enough memory to push a node over
and cause it to OOM.

This patch nils the batch response pointer once the KVFetcher is
finished decoding it, which allows it to be garbage collected as soon as
possible. In practice, this seems to allow at least a single-node
concurrency-1024 query18 TPCH workload to survive indefinitely (all
queries return out of budget errors) without OOMing.

Release note (bug fix): queries use up to 1MB less actual system memory
per scan, lookup join, index join, zigzag join, or inverted join in
their query plans. This will result in improved memory performance for
workloads with concurrent OLAP-style queries.

----

Previously, we could leave some dangling references to batch responses
around in the txnKVFetcher when we were fetching more than one batch at
a time. This would cause a delay in reclamation of memory for the
lifetime of a given query.

Release note (bug fix): use less memory in some queries, primarily
lookup joins.

65966: build: build cross toolchains r=rail a=rickystewart

The overwhelming majority of the time spent in building the builder
image comes from the cross toolchains. Here we separate out the
toolchain build into a separate step and package them as `.tar.gz`'s
that we simply download and install during the Docker image build.
This also opens up the opportunity to use the same toolchains in the
Bazel build, although consuming these toolchain artifacts from the Bazel
build is a follow-up project.

Closes #65351.

Release note: None

66012: bazel: in `dev` toolchain, don't include unnecessary directory r=rail a=rickystewart

After a macOS update, including the additional directory in the include
search path breaks as the same file is included in two different
locations, leading to some missing declarations.

Closes #64296.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Nov 19, 2021
70809: release-21.2: colfetcher: make txnKVFetcher account for memory when used by cFetcher r=yuzefovich a=blathers-crl[bot]

Backport 1/1 commits from #70729 on behalf of @yuzefovich.
Backport 1/1 commits from #71321 on behalf of @yuzefovich.

/cc @cockroachdb/release

----

**colfetcher: make txnKVFetcher account for memory when used by cFetcher**

Previously, the cFetcher created the txnKVFetcher with a nil monitor
under the assumption that it (the cFetcher) performs the memory
accounting itself. I think the justification was that after receiving the
batch response into the txnKVFetcher, the cFetcher will quickly peel
things off and put into `coldata.Vec`s where it is accounted for. This,
however, isn't right because we always perform a deep copy of byte
slices when we set the values into `coldata.Vec`s. (This is the case for
bytes-like and other types, but bytes-like are likely to be affected the
most. I think previously - when we didn't have the flat bytes
representation - we might have skipped the deep copy.)

This results in the batch responses not being accounted for when the
txnKVFetcher is used by the cFetcher (which is the case for the
vectorized scans and index joins). Those batch responses can be on the
order of MBs which is quite large if we have hundreds of concurrent
cFetchers. This commit fixes the issue by providing the monitor in the
constructor.

Addresses: #64906.

Release note (bug fix): CockroachDB is now less likely to OOM when
queries reading a lot of data are issued with high concurrency (these
queries are likely to hit the memory budget determined by
`--max-sql-memory` startup parameter).

**colfetcher: fix the index join**

The vectorized index join uses the cFetcher which is started multiple
times. Previously, we forgot to close the old kv fetcher on the second
and all subsequent calls. This was mostly ok (modulo not helping the GC
with the collection of old spans) up until recently when we added the
memory accounting to the kv fetcher when used by the cFetcher. After
that change, if the index join issues at least two batches of spans to
fetch, we won't close all of the memory accounts, which would lead to
a crash in non-release build. This is now fixed by eagerly closing the
fetcher when emitting the last batch, and the relevant code path is
exercised by randomizing the index join batch size limit.

Release note: None (no stable release with this bug)

----

Release justification:

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 13, 2022
84229: colexechash: improve memory accounting in the hash table r=yuzefovich a=yuzefovich

This commit improves the memory accounting in the hash table to be more
precise in the case when the `distsql_workmem` limit is exhausted.
Previously, we would allocate large slices first only to perform the
memory accounting after the fact, possibly running out of budget which
would result in an error being thrown. We'd end up in a situation where
the hash table is still referencing larger newly-allocated slices while
only the previous memory usage is accounted for. This commit makes it so
that we account for the needed capacity upfront, then perform the
allocation, and then reconcile the accounting if necessary. This way
we're much more likely to encounter the budget error before making the
large allocations.

Additionally, this commit accounts for some internal slices in the hash
table used only in the hash joiner case.

Also, now both the hash aggregator and the hash joiner eagerly release
references to these internal slices of the hash table when the spilling
to disk occurs (we cannot do the same for the unordered distinct because
there the hash table is actually used after the spilling too).

This required a minor change to the way the unordered distinct spills to
disk. Previously, the memory error could only occur in two spots (and
one of those would leave the hash table in an inconsistent state and we
were "smart" in how we repaired that). However, now the memory error
could occur in more spots (and we could have several different
inconsistent states), so this commit chooses a slight performance
regression of simply rebuilding the hash table from scratch, once, when
the unordered distinct spills to disk.

Addresses: #60022.
Addresses: #64906.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 22, 2022
84319: storage: be tighter with allocations when TargetBytes is present r=yuzefovich a=yuzefovich

Previously, while `put`ting the key into the `repr`, we could make an
allocation that was too large given the remaining `TargetBytes` budget.
This is the case since we're exponentially increasing the capacities of
the buffers until 128MiB and because we're only accounting for the
length of the slice even though the whole capacity would have a memory
footprint.

For example, with 10MiB of `TargetBytes` (which is used by SQL in many
cases) and a ScanResponse that exceeds that limit, we'd allocate
capacities that are powers of two, starting at, say, 256B, and would go
all the way up to 8MiB; however, given that 10MiB limit, we'd only use
2MiB of that last 8MiB `repr` when we encounter the target bytes limit
and stop. Such behavior is kinda ok if the response is marshalled by the
gRPC to be sent to the other node, but it is quite unfortunate in the
local fast-path cases (meaning the SQL and the KV are part of the same
physical machine). In the latter scenario SQL would only account for the
lengths of slices while keeping the whole slices alive for a while,
leading to significant unaccounted for memory. In the example above, on
the order of 6MiB would be unaccounted for - multiply that by some
concurrency, and we have unaccounted memory on the order of hundreds of
MiBs.

The current behavior seems especially bad for the streamer use case
where we issue many requests with the `TargetBytes` set and use
`ScanResponse.NumBytes` field (which tracks the lengths of the slices)
for the memory accounting purposes.

In order to improve here, this commit teaches `put` method about the
maximum capacity it can use. In the example above, the last slice would
be on the order of 2MiB making everything happy: we stay very close to
TargetBytes limit and don't have any wasted space.

Addresses: #64906.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 29, 2022
83904: rfc: add user defined function RFC r=mgartner a=mgartner

Release note: None

84286: streamingccl: heartbeat persisted frontier r=samiskin a=samiskin

Resolves #84086 

Previously we would forward the ingestion frontier's resolved timestamp
at the point of heartbeat, however this would result in the chance of
the protected timestamp record of the producer to exceed that of the
ingestion job's persisted frontier.

This PR uses the last persisted frontier value instead.

Release note (bug fix): The protected timestamp of the producer job is
no longer able to exceed the persisted ingestion frontier.


85285: colexec: improve handling of the metadata r=yuzefovich a=yuzefovich

**colmem: introduce a helper to release all reservations from allocator**

Release note: None

This commit audits all of the places where we're operating with the
producer metadata and improves things a bit. This was prompted by the
fact that some types of the metadata (in particular, the
LeafTxnFinalState) can be of non-trivial footprint, so the sooner we
lose the reference to the metadata objects, the more stable CRDB will
be.

**colexec: improve handling of the metadata**

This commit adds the memory accounting for the LeafTxnFinalState
metadata objects in most (all?) places that buffer metadata (inbox,
columnarizer, materializer, parallel unordered synchronizer). The
reservations are released whenever the buffered component is drained,
however, the drainer - who takes over the metadata - might not perform
the accounting. The difficulty there is that the lifecycle of the
metadata objects are not super clear: it's possible that we're at the
end of the execution, and the whole plan is being drained - in such a
scenario, the metadata is pushed into the DistSQLReceiver and then
imported into the root txn and is discarded (modulo the increased root
txn footprint); it's also possible that the metadata from the drained
component gets buffered somewhere up the tree. But this commit adds the
accounting in most such places.

Addresses: #64906.
Addresses: #81451.

Release note: None

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 1, 2022
84230: kvcoord: account for the span overhead when condensing refresh spans r=yuzefovich a=yuzefovich

Previously, we would only account for the lengths of the key and the end
key of the span for the purposes of memory estimation while condensing
the refresh spans set. However, each span has non-trivial overhead (48
bytes) of `roachpb.Span` object itself which we previously ignored. As
a result, the actual footprint of the refresh spans could previously
significantly exceed the target size, especially when the keys are
small. For example, when looking at a recently collected core dump,
I saw the refresh spans taking up about 24MB in the heap whereas the
target setting is only 4MiB. This memory currently is not tracked
against the memory accounting system at all, so such over-shots are
quite bad, especially so given the recent bump of the setting from
256KiB to 4MiB.

Addresses: #64906.
Addresses: #81451.

Release note (ops change): The way we track memory against
`kv.transaction.max_intents_bytes` and
`kv.transaction.max_refresh_spans_bytes` has been adjusted to be more
precise (we no longer ignore some of the overhead). As a result, the
stability of CRDB improves (we're less likely to OOM), however, this
change effectively reduces the budgets determined by those cluster
settings. In practice, this means that
- the intents might be tracked more coarsely (due to span coalescing)
which makes the intent resolution less efficient
- the refresh spans become more coarse too making it more likely that
`ReadWithinUncertaintyIntervalError`s are returned to the user rather
than are retried transparently.

85156: changefeedccl: reduce allocations in kvevent blocking buffer  r=jayshrivastava a=jayshrivastava

This change removes a pointer from the kvevent.Event struct, reducing overall allocations. The hope is that this reduces the amount of work Go gc has to do, which will reduce SQL latency at the end of the day. When doing backfills, the allocations in kv events add up pretty fast, so reducing even one pointer is significant.

See #84709 for more info. I'm not closing the issue with this PR since we may decide to reduce more pointers in future PRs using some of the ideas in the issue comments.

Here are the benchmark results 
```
name          old time/op    new time/op    delta
MemBuffer-10    98.1µs ± 0%    95.8µs ± 1%   -2.35%  (p=0.008 n=5+5)

name          old alloc/op   new alloc/op   delta
MemBuffer-10    76.9kB ± 0%    64.4kB ± 0%  -16.17%  (p=0.008 n=5+5)

name          old allocs/op  new allocs/op  delta
MemBuffer-10       859 ± 0%       675 ± 0%  -21.42%  (p=0.008 n=5+5)

```



85368: roachtest: add KV/YCSB benchmarks with global MVCC range tombstones r=jbowens a=erikgrinaker

**kvserver: add env var to write global MVCC range tombstone**

This patch adds the envvar `COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE`.
When enabled, it will write a global MVCC range tombstone across the
entire table data keyspan during cluster bootstrapping. This can be used
to test performance and correctness in the presence of MVCC range
tombstones, by activating range key-specific code paths while not
semantically affecting the data above it.

Touches #84384.

Release note: None

**roachtest: add KV/YCSB benchmarks with global MVCC range tombstones**

This patch adds a set of benchmark variants that write a single MVCC
range tombstone across the entire SQL keyspan at cluster start, via the
`COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE` env var. Even though this range
tombstone will not affect the data written during the benchmarks, it
activates range key-specific code paths in the storage layer which can
have a significant impact on performance.

The new benchmarks are:

* `kv0/enc=false/nodes=3/cpu=32/mvcc-range-keys=global`
* `kv95/enc=false/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/A/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/B/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/C/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/D/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/E/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/F/nodes=3/cpu=32/mvcc-range-keys=global`

Resolves #84384.


Release note: None

85424: cmd/dev: add support for --show-diff flag from logictests r=rytaft a=rytaft

This commit adds support for the `--show-diff` flag when running tests
with `dev`. This flag is used by the logictests in order to show diffs
between the expected and actual output.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 3, 2022
85286: kvcoord: subject refresh spans to the limit when importing from leaves r=yuzefovich a=yuzefovich

Whenever we're importing the refresh spans from the LeafTxnFinalState,
we're inserting those refresh spans into the refresh footprint set and
attempt to condense the set to stay under the maximum (determined by
a setting). However, the condensing is performed on a best-effort basis,
so previously it was possible to significantly over-shoot the limit.
Furthermore, this refresh footprint is not accounted for. This commit
improves things a bit by giving up on the ability to refresh (i.e.
invalidating the refresh footprint) once the limit is exceeded when
importing the refresh spans from the leaves - it is similar to what we
do after `Send`ing the BatchRequest.

Addresses: #64906.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 23, 2022
86186: sql: do not distribute queries with subqueries returning OIDs r=yuzefovich a=yuzefovich

If a subquery results in a DOid datum, the datum will get a type
annotation (because DOids are ambiguous) when serializing the
render expression involving the result of the subquery. As a
result, we might need to perform a cast on a remote node which
might fail, thus we prohibit the distribution of the main query.

Fixes: #86075.

Release justification: bug fix.

Release note: None

86357: colmem: improve memory accounting when memory limit is exceeded r=yuzefovich a=yuzefovich

**colmem: improve memory accounting when memory limit is exceeded**

This commit improves the memory accounting when memory limit is
exceeded. Previously, in several operators we could run into a situation
where we perform some allocations and run into a memory limit error
later, which results in those allocations being unaccounted for. In some
cases this is acceptable (when the query results in an error), but in
others the memory error is caught and spilling to disk occurs. In the
latter scenarios we would under-account, and this commit fixes most of
such situations.

Now, each disk-spilling operator instantiates a "limited" allocator that
will grow an unlimited memory account when a memory error is
encountered. The idea is that even though the denied allocations cannot
be registered with the main memory account (meaning the operator has
exceeded its memory limit), we still will draw from the
`--max-sql-memory` pool since the allocations can be live for
non-trivial amount of time. If an error occurs when growing the
unlimited memory account, then that error is returned (not the original
memory error) so that the disk spiller doesn't catch it.

This commit audits all operators in `execplan` to use the limited
allocator where appropriate. The new accounting method is only used in
a handful of places which cover most of the use cases. The goal is to
make this commit backportable whereas the follow-up commit will audit
usages of `AdjustMemoryUsage` and will not be backported.

Addresses: #64906.
Fixes: #86351.
Addresses: https://github.com/cockroachlabs/support/issues/1762.

Release justification: bug fix.

Release note: None

**colmem: audit callers of AdjustMemoryUsage**

This commit audits all callers of `Allocator.AdjustMemoryUsage` to use
the newly-exported `AdjustMemoryUsageAfterAllocation` where applicable
(meaning that if an allocation occurs before the method is called, then
the new method is now used). In many cases this won't result in a change
in the behavior since the allocators are not instantiated with limited
memory accounts, but in some cases it is still useful.

Release justification: bug fix.

Release note: None

86402: externalconn,amazon: support s3 KMS in External Connections r=benbardin a=adityamaru

Informs: #84228

Release note (sql change): Users can now
`CREATE EXTERNAL CONNECTION` to represent an `aws-kms`
scheme that represents an AWS KMS resource.

Release justification: low risk change to new functionality to register s3 KMS as a supported External Connection

86613: streamproducer: check the job type for replication stream r=yuzefovich a=yuzefovich

Previously, we would panic if the job id corresponded to a job type
different from the replication stream job, and this is now fixed.

Fixes: #86508.

Release justification: bug fix.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
craig bot pushed a commit that referenced this issue Feb 24, 2023
97442: kvserver: record ScanStats once per BatchRequest r=yuzefovich a=yuzefovich

Previously, we would record a `kvpb.ScanStats` object into the trace for
each evaluated Get, Scan, and ReverseScan command. This was suboptimal
for two reasons:
- this required an allocation of that `kvpb.ScanStats` object
- this required propagating all of these separate objects via the
tracing infrastructure which might make it so that the tracing limits
are reached resulting in some objects being dropped.

This commit, instead, changes the ScanStats to be tracked at the
BatchRequest level, thus, we only need to record a single object per
BatchRequest. This results in reduced granularity, but that is still
sufficient for the SQL needs which simply aggregates all
`kvpb.ScanStats` from a single SQL processor into one object. As
a result, the tpch_concurrency metric averaged over 20 runs increased
from 76.75 to 84.75.

Additionally, this commit makes it so that we track the number of Gets,
Scans, and ReverseScans actually evaluated as part of the BatchResponse.
This information is plumbed through a couple of protos but is not
exposed in any SQL Observability virtual tables. Still, having it in the
protos means that this information is included in the trace.

Informs: #64906.
Fixes: #71351.

Release note: None

97646: go.mod: bump Pebble to bc4c9afe47a5 r=sumeerbhola a=jbowens

```
bc4c9afe db: use objstorage provider for consistency check
7c833595 db: move CheckConsistency
470c6d49 build: add make targets for asan and msan; add ci jobs
b76bb914 ci: Continuously build for NetBSD
456fd724 vfs: Add support for NetBSD
79eb9477 ci: test against go1.20.1; update staticcheck
ab1a49c9 db: add per-level ValueBlocksSize gauge to Metrics
efd802f1 db: use objstorage to find obsolete objects
f01d8eff sstasble: allow duplicate deletes in deleteObsoleteObject
822e1941 db: fix replay corruption in read only mode
b2e46077 db: add a ScanInternal to scan internal keys
e1e9c2a1 sharedobjcat: store object type
542e01ad sharedobjcat: add creator-id
6adb7237 docs: add benchmark annotation for grandparent boundary splitting
ebc640d1 objstorage: add Sync
ca1ed91d objstorage: return NotExistError when object is not known
```

Epic: None
Release note: None

97655: ui: show events even with max size limit reached r=maryliag a=maryliag

Show events that were returned when we reach the max size limit of the sql-api.

Part Of #96184

https://www.loom.com/share/31e8c55f091c4e2ba4b4243710aa58a4

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
@yuzefovich
Member

yuzefovich commented Mar 14, 2023

Over in #98428 I observed that we can now sustain the tpch_concurrency roachtest with concurrency of 1000. It is still possible to OOM cockroach with it (at least I think those few crashes are OOMs - I haven't examined them yet), but GOMEMLIMIT (introduced in #97666) saves us the vast majority of the time. The OOMs happen when the nodes are so overloaded that extremely rare bugs show up, so I think we can call this issue solved. 🎉
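
(For reference, the Go soft memory limit mentioned above can be set via the GOMEMLIMIT environment variable or programmatically; a minimal illustration with an arbitrary limit, not whatever cockroach actually derives in #97666:)

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// With a soft memory limit set, the Go runtime GCs more aggressively as
	// total memory approaches the limit; it is a soft limit, so it can still
	// be exceeded, but spiky allocations are far less likely to cause an OOM.
	// The 8 GiB value here is arbitrary.
	prev := debug.SetMemoryLimit(8 << 30) // equivalent to GOMEMLIMIT=8GiB
	fmt.Printf("previous limit: %d bytes\n", prev)
}
```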

@jordanlewis thoughts on closing this?

@jordanlewis
Member Author

Amazing work, and great news!

I'm fine to see this get closed if you're feeling satisfied, go ahead.

@yuzefovich
Member

Sounds good, thanks.
