
sql: can OOM a cluster with concurrent TPC-H load #64906

Closed
jordanlewis opened this issue May 8, 2021 · 19 comments
Labels
C-investigation Further steps needed to qualify. C-label will change. T-sql-queries SQL Queries Team

Comments

@jordanlewis
Member

jordanlewis commented May 8, 2021

It's easy to overwhelm a cluster and cause it to OOM by throwing highly concurrent TPC-H load at it.

Instructions to repro:

  1. Start a roachprod cluster.
  2. ./cockroach workload init tpch --data-loader=import
  3. ./cockroach workload run tpch --queries 1 --concurrency=1024 --tolerate-errors
  4. Wait. Eventually, some of the nodes will probably be OOM-killed.

I spent a good deal of time today playing with OOM conditions in Cockroach in general, after learning about this fun issue: golang/go#42805. That issue talks about how spiky allocations can cause problems in Go because of the GC pacer implementation. I think we see this in practice all the time. Someone wrote a "watchdog" (https://github.com/raulk/go-watchdog) that runs GC manually when certain watermarks are exceeded, so I thought I'd play with this idea.

Things I learned:

  1. If you add a goroutine that calls runtime.GC() once every second (a minimal sketch follows this list), the cluster does not do much better, although it lasts a little bit longer.
  2. The accounted-for SQL memory stays relatively constant.
  3. Turning off the load allows the memory usage to return to baseline relatively quickly.
  4. The GC allows the nodes to return to ~50% memory usage for a while, but it slowly begins inching up from that 50% baseline until the node eventually dies. I think the 50% is expected (25% for SQL, 25% for pebble).
  5. The cluster can do zero work this entire time (all queries either return SQL memory reservation errors or hang).
  6. Heap profiles show "expected size" profiles, with all memory in expected places.
  7. The major unaccounted-for part of the profiles is the scan buffer from pebble.
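
For context, here's a minimal, self-contained sketch of the "call runtime.GC() once every second" experiment from item 1. This is illustrative only - the real experiment was a hack inside cockroach, and go-watchdog itself triggers GC on heap watermarks rather than on a timer:

```go
package main

import (
	"runtime"
	"time"
)

// startGCTicker forces a garbage collection on a fixed interval until stop is
// closed, mirroring the "runtime.GC() once every second" experiment above.
func startGCTicker(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				runtime.GC()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startGCTicker(time.Second, stop)
	// ... the rest of the server would run here ...
	time.Sleep(3 * time.Second)
	close(stop)
}
```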

When the oomkiller strikes, it can be very sudden. One time I saw the memory go from less than 50% heap usage to dead without another sample (I was using https://github.com/arl/statsviz, an amazing project that samples once per second when you have it open in the browser, and that we should integrate into Cockroach's DB Console). This was pretty scary - nothing that I know of allocates so quickly. What could it mean?

In general I think we have 2 avenues we should poke at:

  1. What is that huge allocation that sometimes strikes right before oomkill? It could very well be a red herring, but I'm not sure.

     [image]

  2. Should we account for pebble scan batches? It's odd because we immediately transfer all of that memory into an accounted-for SQL batch, and then allow the pebble scan batch to be freed, but perhaps the moments at which both the SQL batch and the pebble scan batch are live are too long when the cluster is overloaded?

     [image]

Jira issue: CRDB-7339

@jordanlewis jordanlewis added the C-investigation Further steps needed to qualify. C-label will change. label May 8, 2021
@jordanlewis
Member Author

Related issue: #64916. After fixing that issue, I observe better memory behavior, but the problem isn't completely gone - I can still crash the cluster with high concurrency on certain TPCH queries (I've been playing with query 18), though it takes somewhat longer now.


I noticed something interesting: there might be a cascading failure behavior here, caused by the differences in performance when distsql plans are correct (table readers on leaseholders) vs when they are incorrect (table readers are not on leaseholders).

DistSQL plans table readers on leaseholders at plan time. But if the cluster starts getting overwhelmed and leaseholders move around due to liveness issues, then by the time a plan is run, the leaseholder may have changed. DistSQL does not know how to replan dynamically: it is going to run the same plan that it started with, even if leaseholders change.

The performance difference when a table reader is on the leaseholder and when it's not is quite significant. This is because, when the table reader is not on the leaseholder, it has to do a gRPC request to the leaseholder to fetch data.

Also, when table readers are on leaseholders, there is 1 memory buffer unaccounted for: the pebble one discussed above.

On the other hand, when table readers are not on leaseholders, there are 2 memory buffers unaccounted for:

  1. the gRPC buffers that are used to receive batches from the leaseholder (always allocated and thrown away in recv - shoutout to the classic issue "transport: reuse memory on read path", grpc/grpc-go#1455)
  2. the BatchResponse protobuf that is decoded into from step 1 (see BatchResponse.Unmarshal generated code)

This is twice the unaccounted-for memory!

So I'm theorizing that, as a cluster's performance begins to falter due to lack of resources, leaseholders move around, which leads to worse memory behavior (not to mention slower reads in the first place, due to the extra serialization and network IO of the remote hop). That overwhelms the cluster further, until it eventually blows up: everything is slow, and the slop in the memory accounting gets worse as the plans become inaccurate with respect to leaseholder info.

@jordanlewis
Member Author

Example of the "double unaccounted-for memory" as it would appear in a pprof flamegraph:

[flamegraph image]

When a tablereader is planned on the leaseholder, neither of these stacks will appear: instead, you'll see the pebbleResults.put stack like in the flamegraph higher up in the thread.

@petermattis
Collaborator

Are you aware of GODEBUG=allocfreetrace=1? It may be infeasible to use this facility due to the volume of output, but it is another tool for tracking down a hard-to-discern, sudden spike in memory usage. If the volume of output is too large, you could hack the Go runtime to only output allocations/frees larger than some threshold size.

@jordanlewis
Member Author

Yes, I was considering exactly this strategy yesterday, but I haven't tried it yet.

At the moment, I am theorizing that the large, sudden allocations are caused by gRPC. gRPC has no memory-pressure-based flow control, which means that if a node sends several batch requests to another node that is stuck, and that node suddenly gets unstuck, the sender will happily eat itself to death when all of the batch responses start actually coming back.

@joshimhoff
Collaborator

joshimhoff commented May 8, 2021

With the two concrete issues mentioned (pebble scan unaccounted for mem usage & table reader != leaseholder leading to grpc buffer + BatchResponse unaccounted for mem usage), will admission control keep the lack of accounting from leading to OOMs, at least maybe?

  1. pebble scan unaccounted for mem usage

Obviously the accounting being missing could / should (well, y'all tell me) be corrected. But for this to cause OOMs, it sounds like you need a very high level of concurrency. Perhaps with admission control, you will start queuing work before reaching that level of concurrency. Perhaps also, CPU usage will be kept low enough that the theory below will not play out:

perhaps the moments at which both the SQL batch and the pebble scan batch are live are too long when the cluster is overloaded?

  2. table reader != leaseholder leading to grpc buffer + BatchResponse unaccounted for mem usage

Again the accounting being missing could / should be corrected. But IIUC admission control will queue work so as to reserve enough resources to maintain liveness. So leaseholders won't move around, at least not for the concrete reason of failing to maintain liveness.

Thoughts? Maybe there are more "OOM when highly overloaded due to DML-only workload" bugs that will also be "fixed" by admission control, as there will be a limit to how overloaded CRDB ever gets, due to pulling (some of) the queueing out of the Go runtime into the application.

@jordanlewis
Member Author

But for this to cause OOMs, it sounds like you need a very high level of concurrency. Perhaps with admission control, you will start queuing work before reaching that level of concurrency.

I agree that admission control will likely help us get out of this situation. But I'm not sure if it would be enough on its own, depending on how it deals with long-running, giant distsql requests, which are pretty different than kv-style load.


I think we probably need to hook gRPC up to our memory accounting. An appropriate place would be here: as you can see, we read a message length off the wire and then unconditionally allocate a message of that size.

https://github.com/grpc/grpc-go/blob/d2d6bdae07e844b8a3502dcaf00dc7b1b5519a59/rpc_util.go#L562-L575
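
For illustration, here's a self-contained simplification of the pattern in that linked code - gRPC frames each message as a 1-byte compression flag plus a 4-byte big-endian length, then allocates a buffer of exactly that length. This is a sketch, not the actual grpc-go implementation, and the hook location is an assumption:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// recvMsg reads one length-prefixed message: a 5-byte header followed by the
// payload, which is allocated unconditionally at the advertised size.
func recvMsg(r io.Reader, maxRecvSize int) ([]byte, error) {
	var header [5]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	length := int(binary.BigEndian.Uint32(header[1:]))
	if length > maxRecvSize {
		return nil, fmt.Errorf("message of %d bytes exceeds max %d", length, maxRecvSize)
	}
	// A memory-accounting hook would need to reserve `length` bytes from the
	// SQL memory monitor here, before the allocation below, so that the batch
	// response is visible to the budget for as long as it is held alive.
	msg := make([]byte, length)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

func main() {
	// Build a fake frame around a small payload and read it back.
	payload := []byte("BatchResponse bytes")
	var frame bytes.Buffer
	frame.WriteByte(0) // uncompressed
	var lenHdr [4]byte
	binary.BigEndian.PutUint32(lenHdr[:], uint32(len(payload)))
	frame.Write(lenHdr[:])
	frame.Write(payload)

	msg, err := recvMsg(&frame, 1<<20)
	fmt.Printf("read %d bytes, err=%v\n", len(msg), err)
}
```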

I'm less convinced that we need to hook pebble up to our memory accounting just yet. I think that the memory accounting delay here is not as significant as that of gRPC, but more experimentation is needed.

@adityamaru
Contributor

@jordanlewis I went down a similar path when debugging an index backfiller OOM a few months ago - #58933 (comment). The only thing growing unbounded was the pebble scan batch. We fixed it by using a new Fetcher per index backfill batch, rather than a single, long-lived one - #61366, but I couldn't understand why the underlying pebble batch wasn't being freed. Not much to add to your investigation but I'm keenly following along in case my fix just papered over something more sinister.

@jordanlewis
Member Author

Oh, thanks for the tip! That does seem extremely suspicious. I see a couple of places where the fetcher could be holding onto memory from a pebble batch that we could be clearing, but doing that wouldn't hold more than 1 extra pebble batch in memory at a time, so I'm not totally sure how your fix could have solved a problem that was worse than an accidental 2x in memory - I don't see how we could have had unbounded growth.

I'm thinking about the indexKey, kv, keyRemainingBytes fields in particular - those are slices into underlying memory that isn't copied out of the pebble batch for performance reasons. We should clear those when the fetcher is finished and that will save us some memory, possibly up to 2x.

I think the cfetcher has similar issues, in the nextKV and lastRowPrefix fields.
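
A minimal sketch of what "clear those when the fetcher is finished" could look like - the struct and field names below are hypothetical stand-ins for the real fetcher types, not actual CockroachDB code:

```go
package main

// fetcher stands in for the row fetcher discussed above: these fields alias
// memory owned by the most recent pebble batch / batch response.
type fetcher struct {
	indexKey          []byte
	keyRemainingBytes []byte
	lastRowPrefix     []byte
}

// reset drops the aliasing references so the underlying batch memory can be
// garbage collected as soon as the fetcher is done with it.
func (f *fetcher) reset() {
	f.indexKey = nil
	f.keyRemainingBytes = nil
	f.lastRowPrefix = nil
}

func main() {
	f := &fetcher{indexKey: make([]byte, 1<<20)}
	f.reset() // the 1 MiB buffer is now collectable
}
```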

Also, there are some similar issues in kv_batch_fetcher, which stores its responses and remainingBatches slices in a queue pattern and continually slices off the front of them to do the queuing. But it doesn't nil out the front elements of the queue. I actually made a change to fix this earlier on the branch that I've been playing with for this issue, but it didn't improve anything. It's possible that this was the root cause of #58933, though - it could look like unbounded growth depending on how Go's slice reallocation algorithm behaves when you need more space in a slice that you've also been "cutting off the front of".

Diff for this last thing that I've been playing with:

diff --git a/pkg/sql/row/kv_batch_fetcher.go b/pkg/sql/row/kv_batch_fetcher.go
index 6197c2d746..5a4e5a95b3 100644
--- a/pkg/sql/row/kv_batch_fetcher.go
+++ b/pkg/sql/row/kv_batch_fetcher.go
@@ -504,11 +506,13 @@ func (f *txnKVFetcher) nextBatch(
 ) (ok bool, kvs []roachpb.KeyValue, batchResponse []byte, origSpan roachpb.Span, err error) {
        if len(f.remainingBatches) > 0 {
                batch := f.remainingBatches[0]
+               f.remainingBatches[0] = nil
                f.remainingBatches = f.remainingBatches[1:]
                return true, nil, batch, f.origSpan, nil
        }
        for len(f.responses) > 0 {
                reply := f.responses[0].GetInner()
+               f.responses[0] = roachpb.ResponseUnion{}
                f.responses = f.responses[1:]
                origSpan := f.requestSpans[0]
                f.requestSpans = f.requestSpans[1:]
@@ -517,12 +521,14 @@ func (f *txnKVFetcher) nextBatch(
                case *roachpb.ScanResponse:
                        if len(t.BatchResponses) > 0 {
                                batchResp = t.BatchResponses[0]
+                               t.BatchResponses[0] = nil
                                f.remainingBatches = t.BatchResponses[1:]
                        }
                        return true, t.Rows, batchResp, origSpan, nil
                case *roachpb.ReverseScanResponse:
                        if len(t.BatchResponses) > 0 {
                                batchResp = t.BatchResponses[0]
+                               t.BatchResponses[0] = nil
                                f.remainingBatches = t.BatchResponses[1:]
                        }
                        return true, t.Rows, batchResp, origSpan, nil
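
To see in isolation the retention behavior this diff guards against, here's a self-contained illustration (not CockroachDB code; the sizes are arbitrary) showing that slicing off the front of a queue keeps the popped element reachable via the backing array until it is nil'ed out:

```go
package main

import (
	"fmt"
	"runtime"
)

func liveMiB() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc >> 20
}

func main() {
	// A queue of "batch responses": one large payload followed by a small one.
	q := [][]byte{make([]byte, 8<<20), make([]byte, 1)}

	q = q[1:] // pop the front without clearing it
	runtime.GC()
	// The 8 MiB payload is still live: q points into the same backing array,
	// and the array's element 0 still references the payload.
	fmt.Printf("front merely sliced off:  ~%d MiB live (len=%d cap=%d)\n", liveMiB(), len(q), cap(q))

	// The pattern from the diff above: drop the reference before advancing.
	q2 := [][]byte{make([]byte, 8<<20), make([]byte, 1)}
	q2[0] = nil
	q2 = q2[1:]
	runtime.GC()
	// The large payload is now unreachable and has been collected.
	fmt.Printf("front nil'ed then sliced: ~%d MiB live (len=%d cap=%d)\n", liveMiB(), len(q2), cap(q2))
}
```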

@jordanlewis
Member Author

jordanlewis commented May 9, 2021

It's completely sad that Go doesn't have any tools that allow you to explore a heap dump and understand what is retaining different objects. A problem like this would take, seriously, all of 15 minutes to solve with Java and YourKit. </rant>

@jordanlewis
Member Author

Btw, I filed golang/go#45984 recently because the Go viewcore program doesn't work on the cores we emit for whatever reason.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 10, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
@joshimhoff
Collaborator

joshimhoff commented May 10, 2021

It's completely sad that Go doesn't have any tools that allow you to explore a heap dump and understand what is retaining different objects.

Was reading about this too. Sad!

I agree that admission control will likely help us get out of this situation.

One concrete Q I have is: When you are hitting these OOMs, what CPU & disk IOPS utilization are ya at? High enough that we would be queueing work in an admission control world?

But I'm not sure if it would be enough on its own, depending on how it deals with long-running, giant distsql requests, which are pretty different than kv-style load.

Can you say more about this? Do you have a concrete worry about such load or is it more uncertainty about how admission control (at the SQL level) will interact with such load?

@jordanlewis
Member Author

One concrete Q I have is: When you are hitting these OOMs, what CPU & disk IOPS utilization are ya at? High enough that we would be queueing work in an admission control world?

I'm not sure but I will check. I'm almost sure we'd be queueing work here. I didn't make it clear, but sending 1024 concurrent TPCH query 18s at a 3-node cluster is a completely absurd thing to be doing. It's full overload - there is no chance for the cluster to be able to respond to these queries successfully without queueing. The absurdity of this workload is why it's a good testbed for these OOMs, but also probably (hopefully) why admission control would help protect us here in the future.

Can you say more about this? Do you have a concrete worry about such load or is it more uncertainty about how admission control (at the SQL level) will interact with such load?

It's uncertainty about where the admission control spigots will be placed. Since these queries are large and relatively long-running, and since we kick them off all at once from an idle system, I don't think admission control would necessarily notice anything to reject or queue at first. So I'm uncertain whether anybody will detect the overload and be able to error out some of these queries while they're running.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 17, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
craig bot pushed a commit that referenced this issue May 17, 2021
64946: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None

65195: kvserver: change shouldCampaignOnWake to campaign when the leader is dead r=lunevalex a=lunevalex

Touches #57093

For a lease to move off a dead node to a new healthy node, a healthy node
must first win a raft election and then apply the lease transfer. For a
quiescent range this means the following:
 - Wait 9s for liveness to expire
 - Attempt to acquire the lease on a healthy node, this is rejected
   because lease acquisition can only happen on the raft leader
 - The rejection unquiesces the range
 - The nodes wait out the election timeout (3s) before trying to
   campaign
 - A raft election occurs and a healthy node wins
 - A lease transfer can now be processed by the newly elected leader

This change proposes to eliminate the extra wait between when the
range is unquiesced and a new campaign is kicked off. The node
unquiescing the range will now check if the raft leader is alive according
to liveness and if it's not it will kick off a campaign to try
and win raft leadership.

Release note: None

65339: tracing: fix benign data race r=irfansharif a=irfansharif

Fixes #65148, it's possible to override the op name associated with a
trace after having instantiated it. We'll want to protect access to it
through a mutex.

Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 18, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue May 18, 2021
Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs cockroachdb#64906

Release note: None
craig bot pushed a commit that referenced this issue May 18, 2021
65373: release-20.2: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Backport 1/1 commits from #64946.

/cc @cockroachdb/release

---

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None


Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
craig bot pushed a commit that referenced this issue May 18, 2021
65372: release-21.1: storage: discard memory in pebbleMVCCScanner before putting in pool r=sumeerbhola a=sumeerbhola

Backport 1/1 commits from #64946.

/cc @cockroachdb/release

---

Previously we would clear fields that can consume significant
memory, like pebbleResults, after getting from the pool. Now this
is done before the put, so that we don't retain memory
unnecessarily.

Informs #64906

Release note: None


Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
@jordanlewis
Member Author

#65881 improves the behavior for query 18.

Query 4 can still OOM a server in seconds. I think it's because lookup join does not play nicely here and is not well accounted for in the memory accounting system. I have filed a new issue:

#65904

craig bot pushed a commit that referenced this issue Jun 3, 2021
65881: row: fix intra-query memory leaks in kvFetcher and txnKVFetcher r=jordanlewis a=jordanlewis

Updates #64906. The critical change is the first patch, the kvfetcher one.
The kvbatchfetcher one is theoretically good but I haven't found it to make
as large of a difference as the first patch.

With the first patch applied alone, I can no longer cause OOM conditions with
1024 concurrent TPCH query 18s sent at a single machine, which is a major
improvement. Prior to that patch, such a workload would overwhelm the machine
within 1-2 minutes.

This bug was found with help of the new tooling I've been adding to `viewcore`,
mostly the new `pprof` output format, the existing html object explorer, and the
new type explorer. You can see these updates at
https://github.com/jordanlewis/debug/tree/crl-stuff.

----

The KVFetcher is the piece of code that does the first level of decoding
of a KV batch response, doling out slices of keys and values to higher
level code that further decodes the key values into formats that the SQL
engine can operate on.

The KVFetcher uses a slice into the batch response to keep track of
where it is during the decoding process. Once the slice is empty, it's
finished until someone asks it for a new batch.

However, the KVFetcher used to keep around that empty slice pointer for
its lifetime, or until it was asked for a new batch. This causes the
batch response to be un-garbage-collectable, since there is still a
slice pointing at it, even though the slice is empty.

This causes queries to use up to 2x their accounted-for batch memory,
since the memory accounting system assumes that once data is transferred
out of the batch response into the SQL representation, the batch
response is freed - it assumes there's just 1 "copy" of this batch
response memory.

This is especially problematic for long queries (since they will not
allow that KVFetcher memory to be freed until they're finished).

In effect, this causes 1 extra batch per KVFetcher per query to be
retained in memory. This doesn't sound too bad, since a batch is of
fixed size. But the max batch size is 1 megabyte, so with 1024
concurrent queries, each with 3 KVFetchers, like we see in a TPCH
workload with 1024 concurrent query 18s, that's 1024 * 1MB * 3 = 3GB of
unaccounted for memory. This is easily enough memory to push a node over
and cause it to OOM.

This patch nils the batch response pointer once the KVFetcher is
finished decoding it, which allows it to be garbage collected as soon as
possible. In practice, this seems to allow at least a single-node
concurrency-1024 query18 TPCH workload to survive indefinitely (all
queries return out of budget errors) without OOMing.

Release note (bug fix): queries use up to 1MB less actual system memory
per scan, lookup join, index join, zigzag join, or inverted join in
their query plans. This will result in improved memory performance for
workloads with concurrent OLAP-style queries.

----

Previously, we could leave some dangling references to batch responses
around in the txnKVFetcher when we were fetching more than one batch at
a time. This would cause a delay in reclamation of memory for the
lifetime of a given query.

Release note (bug fix): use less memory in some queries, primarily
lookup joins.

65966: build: build cross toolchains r=rail a=rickystewart

The overwhelming majority of the time spent in building the builder
image comes from the cross toolchains. Here we separate out the
toolchain build into a separate step and package them as `.tar.gz`'s
that we simply download and install during the Docker image build.
This also opens up the opportunity to use the same toolchains in the
Bazel build, although consuming these toolchain artifacts from the Bazel
build is a follow-up project.

Closes #65351.

Release note: None

66012: bazel: in `dev` toolchain, don't include unnecessary directory r=rail a=rickystewart

After a macOS update, including the additional directory in the include
search path breaks as the same file is included in two different
locations, leading to some missing declarations.

Closes #64296.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Nov 19, 2021
70809: release-21.2: colfetcher: make txnKVFetcher account for memory when used by cFetcher r=yuzefovich a=blathers-crl[bot]

Backport 1/1 commits from #70729 on behalf of @yuzefovich.
Backport 1/1 commits from #71321 on behalf of @yuzefovich.

/cc @cockroachdb/release

----

**colfetcher: make txnKVFetcher account for memory when used by cFetcher**

Previously, the cFetcher created the txnKVFetcher with a nil monitor
under the assumption that it (the cFetcher) performs the memory
accounting itself. I think the justification was that after receiving the
batch response into the txnKVFetcher, the cFetcher will quickly peel
things off and put into `coldata.Vec`s where it is accounted for. This,
however, isn't right because we always perform a deep copy of byte
slices when we set the values into `coldata.Vec`s. (This is the case for
bytes-like and other types, but bytes-like are likely to be affected the
most. I think previously - when we didn't have the flat bytes
representation - we might have skipped the deep copy.)

This results in the batch responses not being accounted for when the
txnKVFetcher is used by the cFetcher (which is the case for the
vectorized scans and index joins). Those batch responses can be on the
order of MBs which is quite large if we have hundreds of concurrent
cFetchers. This commit fixes the issue by providing the monitor in the
constructor.

Addresses: #64906.

Release note (bug fix): CockroachDB is now less likely to OOM when
queries reading a lot of data are issued with high concurrency (these
queries are likely to hit the memory budget determined by
`--max-sql-memory` startup parameter).

**colfetcher: fix the index join**

The vectorized index join uses the cFetcher which is started multiple
times. Previously, we forgot to close the old kv fetcher on the second
and all subsequent calls. This was mostly ok (modulo not helping the GC
with the collection of old spans) up until recently when we added the
memory accounting to the kv fetcher when used by the cFetcher. After
that change, if the index join issues at least two batches of spans to
fetch, we won't close all of the memory accounts, which would lead to
a crash in non-release build. This is now fixed by eagerly closing the
fetcher when emitting the last batch, and the relevant code path is
exercised by randomizing the index join batch size limit.

Release note: None (no stable release with this bug)

----

Release justification:

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 13, 2022
84229: colexechash: improve memory accounting in the hash table r=yuzefovich a=yuzefovich

This commit improves the memory accounting in the hash table to be more
precise in the case when the `distsql_workmem` limit is exhausted.
Previously, we would allocate large slices first only to perform the
memory accounting after the fact, possibly running out of budget which
would result in an error being thrown. We'd end up in a situation where
the hash table is still referencing larger newly-allocated slices while
only the previous memory usage is accounted for. This commit makes it so
that we account for the needed capacity upfront, then perform the
allocation, and then reconcile the accounting if necessary. This way
we're much more likely to encounter the budget error before making the
large allocations.

Additionally, this commit accounts for some internal slices in the hash
table used only in the hash joiner case.

Also, now both the hash aggregator and the hash joiner eagerly release
references to these internal slices of the hash table when the spilling
to disk occurs (we cannot do the same for the unordered distinct because
there the hash table is actually used after the spilling too).

This required a minor change to the way the unordered distinct spills to
disk. Previously, the memory error could only occur in two spots (and
one of those would leave the hash table in an inconsistent state and we
were "smart" in how we repaired that). However, now the memory error
could occur in more spots (and we could have several different
inconsistent states), so this commit chooses a slight performance
regression of simply rebuilding the hash table from scratch, once, when
the unordered distinct spills to disk.

Addresses: #60022.
Addresses: #64906.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 22, 2022
84319: storage: be tighter with allocations when TargetBytes is present r=yuzefovich a=yuzefovich

Previously, while `put`ting the key into the `repr`, we could make an
allocation that was too large given the remaining `TargetBytes` budget.
This is the case since we're exponentially increasing the capacities of
the buffers until 128MiB and because we're only accounting for the
length of the slice even though the whole capacity would have a memory
footprint.

For example, with 10MiB of `TargetBytes` (which is used by SQL in many
cases) and a ScanResponse that exceeds that limit, we'd allocate
capacities that are powers of two, starting at, say, 256B, and would go
all the way up to 8MiB; however, given that 10MiB limit, we'd only use
2MiB of that last 8MiB `repr` when we encounter the target bytes limit
and stop. Such behavior is kinda ok if the response is marshalled by the
gRPC to be sent to the other node, but it is quite unfortunate in the
local fast-path cases (meaning the SQL and the KV are part of the same
physical machine). In the latter scenario SQL would only account for the
lengths of slices while keeping the whole slices alive for a while,
leading to significant unaccounted for memory. In the example above, on
the order of 6MiB would be unaccounted for - multiply that by some
concurrency, and we have unaccounted memory on the order of hundreds of
MiBs.

The current behavior seems especially bad for the streamer use case
where we issue many requests with the `TargetBytes` set and use
`ScanResponse.NumBytes` field (which tracks the lengths of the slices)
for the memory accounting purposes.

In order to improve here, this commit teaches `put` method about the
maximum capacity it can use. In the example above, the last slice would
be on the order of 2MiB making everything happy: we stay very close to
TargetBytes limit and don't have any wasted space.

Addresses: #64906.
Addresses: #82160.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 29, 2022
83904: rfc: add user defined function RFC r=mgartner a=mgartner

Release note: None

84286: streamingccl: heartbeat persisted frontier r=samiskin a=samiskin

Resolves #84086 

Previously we would forward the ingestion frontier's resolved timestamp
at the point of heartbeat, however this would result in the chance of
the protected timestamp record of the producer to exceed that of the
ingestion job's persisted frontier.

This PR uses the last persisted frontier value instead.

Release note (bug fix): The protected timestamp of the producer job is
no longer able to exceed the persisted ingestion frontier.


85285: colexec: improve handling of the metadata r=yuzefovich a=yuzefovich

**colmem: introduce a helper to release all reservations from allocator**

Release note: None

This commit audits all of the places where we're operating with the
producer metadata and improves things a bit. This was prompted by the
fact that some types of the metadata (in particular, the
LeafTxnFinalState) can be of non-trivial footprint, so the sooner we
lose the reference to the metadata objects, the more stable CRDB will
be.

**colexec: improve handling of the metadata**

This commit adds the memory accounting for the LeafTxnFinalState
metadata objects in most (all?) places that buffer metadata (inbox,
columnarizer, materializer, parallel unordered synchronizer). The
reservations are released whenever the buffered component is drained,
however, the drainer - who takes over the metadata - might not perform
the accounting. The difficulty there is that the lifecycle of the
metadata objects are not super clear: it's possible that we're at the
end of the execution, and the whole plan is being drained - in such a
scenario, the metadata is pushed into the DistSQLReceiver and then
imported into the root txn and is discarded (modulo the increased root
txn footprint); it's also possible that the metadata from the drained
component gets buffered somewhere up the tree. But this commit adds the
accounting in most such places.

Addresses: #64906.
Addresses: #81451.

Release note: None

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Shiranka Miskin <shiranka.miskin@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 1, 2022
84230: kvcoord: account for the span overhead when condensing refresh spans r=yuzefovich a=yuzefovich

Previously, we would only account for the lengths of the key and the end
key of the span for the purposes of memory estimation while condensing
the refresh spans set. However, each span has non-trivial overhead (48
bytes) of `roachpb.Span` object itself which we previously ignored. As
a result, the actual footprint of the refresh spans could previously
significantly exceed the target size, especially when the keys are
small. For example, when looking at a recently collected core dump,
I saw the refresh spans taking up about 24MB in the heap whereas the
target setting is only 4MiB. This memory currently is not tracked
against the memory accounting system at all, so such over-shots are
quite bad, especially so given the recent bump of the setting from
256KiB to 4MiB.

Addresses: #64906.
Addresses: #81451.

Release note (ops change): The way we track memory against
`kv.transaction.max_intents_bytes` and
`kv.transaction.max_refresh_spans_bytes` has been adjusted to be more
precise (we no longer ignore some of the overhead). As a result, the
stability of CRDB improves (we're less likely to OOM), however, this
change effectively reduces the budgets determined by those cluster
settings. In practice, this means that
- the intents might be tracked more coarsely (due to span coalescing)
which makes the intent resolution less efficient
- the refresh spans become more coarse too making it more likely that
`ReadWithinUncertaintyIntervalError`s are returned to the user rather
than are retried transparently.

85156: changefeedccl: reduce allocations in kvevent blocking buffer  r=jayshrivastava a=jayshrivastava

This change removes a pointer from the kvevent.Event struct, reducing overall allocations. The hope is that this reduces the amount of work Go gc has to do, which will reduce SQL latency at the end of the day. When doing backfills, the allocations in kv events add up pretty fast, so reducing even one pointer is significant.

See #84709 for more info. I'm not closing the issue with this PR since we may decide to reduce more pointers in future PRs using some of the ideas in the issue comments.

Here are the benchmark results 
```
name          old time/op    new time/op    delta
MemBuffer-10    98.1µs ± 0%    95.8µs ± 1%   -2.35%  (p=0.008 n=5+5)

name          old alloc/op   new alloc/op   delta
MemBuffer-10    76.9kB ± 0%    64.4kB ± 0%  -16.17%  (p=0.008 n=5+5)

name          old allocs/op  new allocs/op  delta
MemBuffer-10       859 ± 0%       675 ± 0%  -21.42%  (p=0.008 n=5+5)

```



85368: roachtest: add KV/YCSB benchmarks with global MVCC range tombstones r=jbowens a=erikgrinaker

**kvserver: add env var to write global MVCC range tombstone**

This patch adds the envvar `COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE`.
When enabled, it will write a global MVCC range tombstone across the
entire table data keyspan during cluster bootstrapping. This can be used
to test performance and correctness in the presence of MVCC range
tombstones, by activating range key-specific code paths while not
semantically affecting the data above it.

Touches #84384.

Release note: None

**roachtest: add KV/YCSB benchmarks with global MVCC range tombstones**

This patch adds a set of benchmark variants that write a single MVCC
range tombstone across the entire SQL keyspan at cluster start, via the
`COCKROACH_GLOBAL_MVCC_RANGE_TOMBSTONE` env var. Even though this range
tombstone will not affect the data written during the benchmarks, it
activates range key-specific code paths in the storage layer which can
have a significant impact on performance.

The new benchmarks are:

* `kv0/enc=false/nodes=3/cpu=32/mvcc-range-keys=global`
* `kv95/enc=false/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/A/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/B/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/C/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/D/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/E/nodes=3/cpu=32/mvcc-range-keys=global`
* `ycsb/F/nodes=3/cpu=32/mvcc-range-keys=global`

Resolves #84384.


Release note: None

85424: cmd/dev: add support for --show-diff flag from logictests r=rytaft a=rytaft

This commit adds support for the `--show-diff` flag when running tests
with `dev`. This flag is used by the logictests in order to show diffs
between the expected and actual output.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 3, 2022
85286: kvcoord: subject refresh spans to the limit when importing from leaves r=yuzefovich a=yuzefovich

Whenever we're importing the refresh spans from the LeafTxnFinalState,
we're inserting those refresh spans into the refresh footprint set and
attempt to condense the set to stay under the maximum (determined by
a setting). However, the condensing is performed on a best-effort basis,
so previously it was possible to significantly over-shoot the limit.
Furthermore, this refresh footprint is not accounted for. This commit
improves things a bit by giving up on the ability to refresh (i.e.
invalidating the refresh footprint) once the limit is exceeded when
importing the refresh spans from the leaves - it is similar to what we
do after `Send`ing the BatchRequest.

Addresses: #64906.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Aug 23, 2022
86186: sql: do not distribute queries with subqueries returning OIDs r=yuzefovich a=yuzefovich

If a subquery results in a DOid datum, the datum will get a type
annotation (because DOids are ambiguous) when serializing the
render expression involving the result of the subquery. As a
result, we might need to perform a cast on a remote node which
might fail, thus we prohibit the distribution of the main query.

Fixes: #86075.

Release justification: bug fix.

Release note: None

86357: colmem: improve memory accounting when memory limit is exceeded r=yuzefovich a=yuzefovich

**colmem: improve memory accounting when memory limit is exceeded**

This commit improves the memory accounting when memory limit is
exceeded. Previously, in several operators we could run into a situation
where we perform some allocations and run into a memory limit error
later, which results in those allocations being unaccounted for. In some
cases this is acceptable (when the query results in an error), but in
others the memory error is caught and spilling to disk occurs. In the
latter scenarios we would under-account, and this commit fixes most of
such situations.

Now, each disk-spilling operator instantiates a "limited" allocator that
will grow an unlimited memory account when a memory error is
encountered. The idea is that even though the denied allocations cannot
be registered with the main memory account (meaning the operator has
exceeded its memory limit), we still will draw from the
`--max-sql-memory` pool since the allocations can be live for
non-trivial amount of time. If an error occurs when growing the
unlimited memory account, then that error is returned (not the original
memory error) so that the disk spiller doesn't catch it.

This commit audits all operators in `execplan` to use the limited
allocator where appropriate. The new accounting method is only used in
a handful of places which cover most of the use cases. The goal is to
make this commit backportable whereas the follow-up commit will audit
usages of `AdjustMemoryUsage` and will not be backported.

Addresses: #64906.
Fixes: #86351.
Addresses: https://github.com/cockroachlabs/support/issues/1762.

Release justification: bug fix.

Release note: None

**colmem: audit callers of AdjustMemoryUsage**

This commit audits all callers of `Allocator.AdjustMemoryUsage` to use
the newly-exported `AdjustMemoryUsageAfterAllocation` where applicable
(meaning that if an allocation occurs before the method is called, then
the new method is now used). In many cases this won't result in a change
in the behavior since the allocators are not instantiated with limited
memory accounts, but in some cases it is still useful.

Release justification: bug fix.

Release note: None

86402: externalconn,amazon: support s3 KMS in External Connections r=benbardin a=adityamaru

Informs: #84228

Release note (sql change): Users can now
`CREATE EXTERNAL CONNECTION` to represent an `aws-kms`
scheme that represents an AWS KMS resource.

Release justification: low risk change to new functionality to register s3 KMS as a supported External Connection

86613: streamproducer: check the job type for replication stream r=yuzefovich a=yuzefovich

Previously, we would panic if the job id corresponded to a job type
different from the replication stream job, and this is now fixed.

Fixes: #86508.

Release justification: bug fix.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Aditya Maru <adityamaru@gmail.com>
craig bot pushed a commit that referenced this issue Feb 24, 2023
97442: kvserver: record ScanStats once per BatchRequest r=yuzefovich a=yuzefovich

Previously, we would record a `kvpb.ScanStats` object into the trace for
each evaluated Get, Scan, and ReverseScan command. This was suboptimal
for two reasons:
- this required an allocation of that `kvpb.ScanStats` object
- this required propagating all of these separate objects via the
tracing infrastructure which might make it so that the tracing limits
are reached resulting in some objects being dropped.

This commit, instead, changes the ScanStats to be tracked at the
BatchRequest level, thus, we only need to record a single object per
BatchRequest. This results in reduced granularity, but that is still
sufficient for the SQL needs which simply aggregates all
`kvpb.ScanStats` from a single SQL processor into one object. As
a result, the tpch_concurrency metric averaged over 20 runs increased
from 76.75 to 84.75.

Additionally, this commit makes it so that we track the number of Gets,
Scans, and ReverseScans actually evaluated as part of the BatchResponse.
This information is plumbed through a couple of protos but is not
exposed in any SQL Observability virtual tables. Still, having it in the
protos means that this information is included in the trace.

Informs: #64906.
Fixes: #71351.

Release note: None

97646: go.mod: bump Pebble to bc4c9afe47a5 r=sumeerbhola a=jbowens

```
bc4c9afe db: use objstorage provider for consistency check
7c833595 db: move CheckConsistency
470c6d49 build: add make targets for asan and msan; add ci jobs
b76bb914 ci: Continuously build for NetBSD
456fd724 vfs: Add support for NetBSD
79eb9477 ci: test against go1.20.1; update staticcheck
ab1a49c9 db: add per-level ValueBlocksSize gauge to Metrics
efd802f1 db: use objstorage to find obsolete objects
f01d8eff sstasble: allow duplicate deletes in deleteObsoleteObject
822e1941 db: fix replay corruption in read only mode
b2e46077 db: add a ScanInternal to scan internal keys
e1e9c2a1 sharedobjcat: store object type
542e01ad sharedobjcat: add creator-id
6adb7237 docs: add benchmark annotation for grandparent boundary splitting
ebc640d1 objstorage: add Sync
ca1ed91d objstorage: return NotExistError when object is not known
```

Epic: None
Release note: None

97655: ui: show events even with max size limit reached r=maryliag a=maryliag

Show events that were returned when we reach the max size limit of the sql-api.

Part Of #96184

https://www.loom.com/share/31e8c55f091c4e2ba4b4243710aa58a4

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
@yuzefovich
Member

yuzefovich commented Mar 14, 2023

Over in #98428 I observed that we can now sustain the tpch_concurrency roachtest with concurrency of 1000. It is still possible to OOM cockroach with it (at least I think those few crashes are OOMs - I haven't examined them yet), but GOMEMLIMIT (introduced in #97666) saves us the vast majority of the time. The OOMs happen when the nodes are so overloaded that extremely rare bugs show up, so I think we can call this issue solved. 🎉
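
(For reference, the Go soft memory limit mentioned above can be set via the GOMEMLIMIT environment variable or programmatically; a minimal illustration with an arbitrary limit, not whatever cockroach actually derives in #97666:)

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// With a soft memory limit set, the Go runtime GCs more aggressively as
	// total memory approaches the limit; it is a soft limit, so it can still
	// be exceeded, but spiky allocations are far less likely to cause an OOM.
	// The 8 GiB value here is arbitrary.
	prev := debug.SetMemoryLimit(8 << 30) // equivalent to GOMEMLIMIT=8GiB
	fmt.Printf("previous limit: %d bytes\n", prev)
}
```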

@jordanlewis thoughts on closing this?

@jordanlewis
Member Author

Amazing work, and great news!

I'm fine to see this get closed if you're feeling satisfied, go ahead.

@yuzefovich
Member

Sounds good, thanks.
