importccl: blazing fast IMPORT #34809
35002: workloadccl: benchmark `fixtures import` for single-node tpcc 1 r=dt a=danhhz

So we can track our progress in #34809. Diff between distsql IMPORT and direct IMPORT:

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-8     7.74s ± 3%     4.09s ± 6%  -47.18%  (p=0.008 n=5+5)

name                 old speed      new speed      delta
ImportFixtureTPCC-8  11.4MB/s ± 3%  21.7MB/s ± 6%  +89.46%  (p=0.008 n=5+5)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-8    7.18GB ± 0%    4.12GB ± 0%  -42.62%  (p=0.008 n=5+5)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-8     89.0M ± 0%     39.5M ± 0%  -55.56%  (p=0.008 n=5+5)
```

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>

In an attempt to start some standard measurements, I added a benchmark that does a tpcc-1 import. `old` is the default; `new` is with `--experimental-direct-ingest`.
tpcc-100
So we're not quite getting our linear speedup to 100, but we already know of some issues that could cause that. The biggest is probably that the old import with non-overlapping sstables shouldn't need any compactions, but the new overlapping ones will. I'd bet that write amplification goes up from tpcc-1 to tpcc-100. We might also now be filling up L1 and increasing read amplification for system traffic (jobs updates, liveness, etc). cc @dt
Another thing I was thinking might be a factor is the fixed vs variable costs of running the sampling phase flow -- in a smaller import, those fixed costs don't shrink, meaning sampling is a bigger % of the overall import... thus skipping sampling entirely might see a bigger % impact on smaller imports.
Perhaps, but sampling sets up a fairly simple flow, which I'd guess would be small in comparison to the 4s that tpcc-1 takes end-to-end. My wild guesses are:
I'm currently trying a single-node tpcc-1000.

34892: importccl: skip detour through CSV when importing from workload r=dt a=dt

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard.

Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`. However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases. This is just a first pass -- mostly to prove it works and solve my original progress concern -- and this initial implementation does not even begin to exploit all the possible optimizations. However, it likely makes testing some of them much easier.

The specialized frontend is used automatically by the reader process when it is configured to read CSV data and all file URIs point to workload-generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row-converter that the CSV reader would have -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the `COCKROACH_IMPORT_WORKLOAD_FASTER` env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
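To make the progress-reporting point concrete: when the total is known in rows rather than bytes, the fraction complete is trivial to compute. A minimal sketch -- `rowProgress` is an illustrative type, not the real importccl plumbing:

```go
package main

import "fmt"

// rowProgress reports the completed fraction for a source that knows
// its total up front in rows. A CSV frontend reporting
// bytesRead/totalBytes can't do this when totalBytes is unknown; a
// workload generator always can.
type rowProgress struct {
	emitted, total int64
}

// record adds one emitted batch and returns the new fraction complete.
func (p *rowProgress) record(batchRows int64) float64 {
	p.emitted += batchRows
	return float64(p.emitted) / float64(p.total)
}

func main() {
	// e.g. the 30,000 orders in one tpcc warehouse (illustrative).
	p := rowProgress{total: 30000}
	for i := 0; i < 3; i++ {
		fmt.Printf("progress: %.0f%%\n", 100*p.record(10000))
	}
}
```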
Note to self: run our new benchmarks to measure the impact of #34886
FWIW, in that PR I mentioned results on a single-node MBP were measurable but not very large -- about 5% on a 1.7gb tpcc import. It did cut the direct ingestion write traffic almost in half, but our direct writes were a fraction of the compactor's. The total disk writes (as measured by the OS) for importing 1.7gb went from 7.5gb to 6gb. #34886 (comment)
Yup! I saw. I actually left it as a note to self instead of doing it right now because I want to see if we make any progress on the direct ingestion compactions. That bit, and the stability issues it comes with, is really preventing us from measuring our progress well right now. I'm mulling it over in the background.
Wrote up some thoughts on #35219 re: how much we're re-compacting.
35317: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% r=nvanbenschoten a=nvanbenschoten

Informs #34809.

Before this change, `ErrNameString` only accepted a pointer to a string. This forced the owner of the string to escape. In the following two cases, this resulted in an entire `ColumnDescriptor` proto being allocated on each call:
- `sqlbase.MarshalColumnValue`
- `sqlbase.CheckDatumTypeFitsColumnType`

This showed up on IMPORT benchmarks, where the allocation in `MarshalColumnValue` was responsible for about a third of heap allocations.

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-4     6.24s ± 2%     5.68s ±11%   -9.02%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
ImportFixtureTPCC-4  14.2MB/s ± 2%  15.6MB/s ±10%  +10.08%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-4    4.10GB ± 0%    2.77GB ± 0%  -32.34%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-4     39.5M ± 0%     33.2M ± 0%  -16.11%  (p=0.000 n=10+9)
```

It appears that this will also have an effect on `row.Inserter` and `row.Updater`.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
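The escape described above can be reproduced in a small standalone program. `errNamePtr`/`errNameVal` are stand-ins, not the real `ErrNameString` signature; the point is that an error which retains a `*string` forces the whole enclosing struct to the heap, while a by-value string lets it stay on the stack:

```go
package main

import (
	"errors"
	"fmt"
)

// columnDesc stands in for a large ColumnDescriptor proto; imagine
// many more fields alongside Name.
type columnDesc struct {
	Name string
}

type nameError struct{ name *string }

func (e *nameError) Error() string { return "invalid column: " + *e.name }

// Pointer version: the returned error retains name, so the compiler
// must heap-allocate whatever the pointer points into.
func errNamePtr(name *string) error { return &nameError{name: name} }

// Value version: only a copy of the string header escapes; the
// caller's struct can stay on the stack.
func errNameVal(name string) error { return errors.New("invalid column: " + name) }

func checkPtr(d columnDesc) error { return errNamePtr(&d.Name) } // all of d escapes
func checkVal(d columnDesc) error { return errNameVal(d.Name) }  // d stays on the stack

func main() {
	fmt.Println(checkPtr(columnDesc{Name: "id"}))
	fmt.Println(checkVal(columnDesc{Name: "id"}))
	// Verify with: go build -gcflags=-m (look for "moved to heap: d").
}
```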
Use bufalloc.ByteAllocator in randStringFromAlphabet instead of allocating many small byte arrays.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

```
name                     old time/op    new time/op    delta
InitTPCC/warehouses=1-8     2.33s ± 0%     2.25s ± 1%   -3.04%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         5.00s ± 2%     4.83s ± 2%   -3.32%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
InitTPCC/warehouses=1-8  30.4MB/s ± 0%  31.4MB/s ± 1%   +3.14%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      17.7MB/s ± 2%  18.3MB/s ± 2%   +3.40%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
InitTPCC/warehouses=1-8     329MB ± 0%     246MB ± 0%  -25.28%  (p=0.008 n=5+5)
ImportFixtureTPCC-8        3.70GB ± 0%    3.61GB ± 0%   -2.20%  (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
InitTPCC/warehouses=1-8     9.38M ± 0%     5.60M ± 0%  -40.27%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         36.0M ± 0%     32.2M ± 0%  -10.49%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
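A simplified re-implementation of the `bufalloc.ByteAllocator` idea, sketched from the description above rather than copied from the real API: carve many small slices out of one large backing chunk, so the allocator and GC see one object instead of thousands.

```go
package main

import "fmt"

// byteAllocator hands out sub-slices of a shared backing chunk. A new
// chunk is allocated only when the current one is exhausted.
type byteAllocator []byte

const chunkSize = 1 << 16 // 64 KiB backing chunks (arbitrary choice)

// alloc returns an n-byte slice carved from the current chunk. The
// three-index slice caps capacity so callers can't clobber neighbors.
func (a byteAllocator) alloc(n int) (byteAllocator, []byte) {
	if cap(a)-len(a) < n {
		size := chunkSize
		if n > size {
			size = n
		}
		a = make(byteAllocator, 0, size)
	}
	a = a[:len(a)+n]
	return a, a[len(a)-n : len(a) : len(a)]
}

func main() {
	var a byteAllocator
	var buf []byte
	for i := 0; i < 3; i++ {
		a, buf = a.alloc(7)
		copy(buf, fmt.Sprintf("row-%03d", i))
		fmt.Println(string(buf))
	}
}
```

Earlier slices stay valid when a fresh chunk replaces the current one; the GC keeps the old chunk alive for as long as anything references it.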
Background: golang/go#21835

The current thinking for go2 is to switch to a Permuted Congruential Generator (PCG), which is much faster than the current "math/rand" generator. It's smaller in memory (128-bit vs 607 64-bit words) and _much_ faster to seed (which is a common operation in workload, which uses seeding to make deterministic results). The linked issue claims "The current implementation, in pure Go, is somewhat slower than math/rand, but could be comparably fast or even faster given compiler support for the 128-bit multiply inside." but for our needs it seems to already be faster.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

```
name                     old time/op    new time/op    delta
InitTPCC/warehouses=1-8     2.25s ± 1%     2.06s ± 1%   -8.07%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         4.90s ± 3%     4.64s ± 2%   -5.44%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
InitTPCC/warehouses=1-8  31.5MB/s ± 1%  34.3MB/s ± 1%   +8.80%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      18.0MB/s ± 3%  19.1MB/s ± 2%   +5.77%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
InitTPCC/warehouses=1-8     246MB ± 0%     246MB ± 0%     ~      (p=0.095 n=5+5)
ImportFixtureTPCC-8        3.61GB ± 0%    3.61GB ± 0%     ~      (p=0.548 n=5+5)

name                     old allocs/op  new allocs/op  delta
InitTPCC/warehouses=1-8     5.60M ± 0%     5.60M ± 0%   +0.06%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         32.2M ± 0%     32.3M ± 0%   +0.08%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
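For reference, a sketch of the PCG family being discussed. This is the standard 64-bit-state pcg32 (PCG-XSH-RR) variant, not the exact 128-bit generator proposed for go2, but it shows the shape of the generator and why seeding is so cheap: two word writes, versus filling math/rand's 607-word lagged-Fibonacci table.

```go
package main

import "fmt"

type pcg32 struct {
	state, inc uint64
}

// seedPCG is O(1): there is no table to fill.
func seedPCG(seed, seq uint64) pcg32 {
	p := pcg32{inc: seq<<1 | 1}
	p.next()
	p.state += seed
	p.next()
	return p
}

// next advances the 64-bit LCG state and applies the XSH-RR output
// permutation (xorshift-high, random rotate).
func (p *pcg32) next() uint32 {
	old := p.state
	p.state = old*6364136223846793005 + p.inc
	xorshifted := uint32(((old >> 18) ^ old) >> 27)
	rot := uint32(old >> 59)
	return (xorshifted >> rot) | (xorshifted << ((-rot) & 31))
}

func main() {
	p := seedPCG(42, 54)
	for i := 0; i < 4; i++ {
		fmt.Println(p.next())
	}
}
```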
The slowest part of generating random strings is generating the random number. Previously, we made a uint64 and threw away most of it to pick one character from an alphabet. Instead, pick as many characters as we can get from 64 bits: floor(log(math.MaxUint64)/log(len(alphabet))).

New random string generation microbenchmark:

```
name                      time/op
RandStringFast/letters-8  75.3ns ± 2%
RandStringFast/numbers-8  74.2ns ± 1%
RandStringFast/aChars-8   84.5ns ± 1%

name                      speed
RandStringFast/letters-8  345MB/s ± 2%
RandStringFast/numbers-8  351MB/s ± 1%
RandStringFast/aChars-8   308MB/s ± 1%
```

Diffs in TPCC data generation and TPCC IMPORT:

```
name                     old time/op    new time/op     delta
InitTPCC/warehouses=1-8     2.03s ± 2%      0.65s ± 1%   -67.85%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         4.11s ± 5%      3.75s ± 2%    -8.67%  (p=0.008 n=5+5)

name                     old speed      new speed        delta
InitTPCC/warehouses=1-8  34.9MB/s ± 2%  108.5MB/s ± 1%  +211.01%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      21.5MB/s ± 5%   23.6MB/s ± 2%    +9.45%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
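A sketch of the multi-character trick under the formula above. `randString` is illustrative, not the actual workload implementation; it assumes len(alphabet) >= 2 and that the slight modulo bias is acceptable for test-data generation.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randString extracts as many alphabet symbols as fit in each 64-bit
// draw -- floor(log(MaxUint64)/log(len(alphabet))) of them -- by
// repeated mod/div, instead of burning a full uint64 per character.
func randString(rng *rand.Rand, length int, alphabet string) string {
	base := uint64(len(alphabet))
	// log(MaxUint64) == 64*ln2; e.g. 13 symbols per draw for base 26.
	perDraw := int(math.Floor(64 * math.Ln2 / math.Log(float64(base))))
	buf := make([]byte, length)
	for i := 0; i < length; {
		v := rng.Uint64()
		for d := 0; d < perDraw && i < length; d++ {
			buf[i] = alphabet[v%base]
			v /= base
			i++
		}
	}
	return string(buf)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(randString(rng, 26, "abcdefghijklmnopqrstuvwxyz"))
}
```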
37575: sql: save some EvalCtx allocations r=nvanbenschoten a=danhhz

EvalCtx is big, so we shouldn't be passing it anyway. Found by digging into allocations that pprof attributed to the method signature of sql.GenerateInsertRow. I tried to see if this showed up on any of our SQL level INSERT benchmarks, but it seemed to be in the noise.

```
name                                  old time/op    new time/op    delta
ConvertToKVs/tpcc/warehouses=1-8         3.83s ± 0%     3.66s ± 1%   -4.53%  (p=0.016 n=4+5)
ConvertToSSTable/tpcc/warehouses=1-8     5.50s ± 4%     5.13s ± 1%   -6.80%  (p=0.008 n=5+5)

name                                  old speed      new speed      delta
ConvertToKVs/tpcc/warehouses=1-8      23.2MB/s ± 0%  24.3MB/s ± 1%   +4.77%  (p=0.016 n=4+5)
ConvertToSSTable/tpcc/warehouses=1-8  15.1MB/s ± 4%  16.2MB/s ± 1%   +7.27%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
ConvertToKVs/tpcc/warehouses=1-8         953MB ± 0%     704MB ± 0%  -26.17%  (p=0.029 n=4+4)
ConvertToSSTable/tpcc/warehouses=1-8    1.64GB ± 0%    1.39GB ± 0%  -15.19%  (p=0.016 n=4+5)

name                                  old allocs/op  new allocs/op  delta
ConvertToKVs/tpcc/warehouses=1-8         17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)
ConvertToSSTable/tpcc/warehouses=1-8     17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)
```

Also a small cleanup in some import code to use EvalCtx's Copy method instead of passing around a function closure.

Touches #34809

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
A few recently merged changes related to this from me and @adityamaru27:
A couple more random ideas from the last time I profiled an ingestion:

a) computing MVCCStats incrementally during SST creation iteration (instead of with a second iteration) should be pretty straightforward given the significant constraints on what we're putting in them (all unique user keys).
b) pass ~1mb
c) figure out how we can do better at splitting and scattering.
Direct-ingest is now on by default. The direct-ingest tpcc1k roachtest is now sitting in the teens of minutes, down from 40+, and manual testing of tpcc2k looks like 25-40min, down from 50-60min in 19.1 (also using direct-ingest, or well over an hour with sorted). At this point I think the big pieces of this that bulk-io had on its roadmap are done, so I think bulk-io will be removing this issue from its roadmap now. I'll leave it up to @danhhz to decide if we should actually close it though.
this is so exciting! i'm looking forward to testing it out. Will this be used for nightlies? can we expect some savings and speedups there?
it is on in nightlies (that use
huge effort from bulk io on direct ingest, thanks @dt! i've got two last ideas i've been meaning to write up. 1) speed up workload colbatch -> kv conversion (currently roundtrips through datums but doesn't need to) and 2) fewer copies in the addsstable path
2a13ce6#diff-b010dc66fcc28059908e35277ad85915R157 is another likely big win during heavy ingestion loads -- we see the hangover of our

If @adityamaru27 doesn't get to it before he leaves us (😢) we should try to make sure that actually gets done.
Ah, good to know!
I did a little more profiling, and one of the big places where we're still spending time is sort.Sort on buffers. We're producing sorted data, so I got to thinking it'd be cheaper to keep track of whether the buffer is sorted (just one comparison during each append while we still think it is sorted) than to sort it from scratch (so O(n) instead of O(n log n)). But when I went to actually try this, I realized that our parallel creation of individually sorted batches feeding into one buffer ends up making an unsorted buffer.

We could potentially sort batches to solve this, but it seems like that might be somewhat workload specific, so I got to thinking maybe the workload frontend should do that internally and emit a sorted stream of sorted batches on kvCh. That would also eliminate the potential "jagged edge" problem that still pushes some of workload's SSTs to L0, when a flush occurs while one producer worker is ahead of another, causing the whole SST to overlap just because of one of the boundary batches.

I'm thinking it might make sense to add one more worker that just collects batches from the kvCh that currently all the other workers are pushing onto and does a little bit of buffering (e.g. numWorkers*2) in an attempt to emit batches sequentially. That, combined with switching to a coldata.Bytes or bulk.kvBuf as the row.KVBatch internal format and bumping it up to 1mb, should make the pipeline to the sst_batcher a fair bit cheaper. I don't know if I'll actually get around to playing with this right away, though, so I stopped here to write it down.
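A minimal sketch of the sortedness-tracking idea above, with `kvBuf` as a simplified stand-in for the real buffer type: one comparison per append keeps the O(n) fast path available, and the real sort only runs if an out-of-order key was actually seen.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type kv struct{ key, value []byte }

type kvBuf struct {
	kvs    []kv
	sorted bool // true while every appended key has been >= the previous one
}

// add costs one extra comparison while the buffer still looks sorted.
func (b *kvBuf) add(k kv) {
	if b.sorted && len(b.kvs) > 0 &&
		bytes.Compare(k.key, b.kvs[len(b.kvs)-1].key) < 0 {
		b.sorted = false
	}
	b.kvs = append(b.kvs, k)
}

// sort is a no-op when appends arrived in order, and falls back to a
// real O(n log n) sort otherwise.
func (b *kvBuf) sort() {
	if b.sorted {
		return
	}
	sort.Slice(b.kvs, func(i, j int) bool {
		return bytes.Compare(b.kvs[i].key, b.kvs[j].key) < 0
	})
	b.sorted = true
}

func main() {
	b := kvBuf{sorted: true}
	for _, k := range []string{"a", "b", "c"} {
		b.add(kv{key: []byte(k)})
	}
	b.sort() // no-op: input was already ordered
	fmt.Println(len(b.kvs), b.sorted)
}
```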
39829: batcheval: MVCCStats for AddSSTable with no shadowing should not contain estimates r=dt a=adityamaru27

There is a significant performance win to be achieved by ensuring that the stats computed when using AddSSTable are not estimates, as this prevents recomputation on splits. Running AddSSTable with disallowShadowing=true gets us close to this, as we do not allow colliding keys to be ingested. However, in the situation that two SSTs have KV(s) which "perfectly" shadow an existing key (equal ts and value), we do not consider this a collision. While the shadowing KV would be skipped during ingestion, its stats would be added to the cumulative stats for the AddSSTable command, causing a double count for such KVs.

Therefore, we compute the stats for these "skipped" KVs on-the-fly while checking for the collision condition in C++, and subtract them from the stats of the SST being ingested before adding them to the running cumulative for this command. These stats can then be marked as accurate.

Tagging #34809

Release note: None

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
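A toy sketch of that bookkeeping: `mvccStats` is a simplified stand-in for the engine's MVCCStats struct, and the point is only the subtract-before-add arithmetic that keeps the cumulative total accurate when perfectly-shadowed KVs were skipped.

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for the engine's MVCCStats.
type mvccStats struct {
	LiveBytes, LiveCount int64
}

func (s *mvccStats) subtract(o mvccStats) {
	s.LiveBytes -= o.LiveBytes
	s.LiveCount -= o.LiveCount
}

func (s *mvccStats) add(o mvccStats) {
	s.LiveBytes += o.LiveBytes
	s.LiveCount += o.LiveCount
}

// accumulateSST folds one SST's stats into the command's running
// total, first deducting the contribution of KVs skipped because they
// perfectly shadowed existing data (equal key, timestamp, and value),
// so nothing is double counted.
func accumulateSST(total *mvccStats, sst, skippedShadowed mvccStats) {
	sst.subtract(skippedShadowed)
	total.add(sst)
}

func main() {
	var total mvccStats
	// 1 MiB of KVs in the SST, 2 KiB of which perfectly shadowed
	// existing keys and were skipped during ingestion.
	accumulateSST(&total,
		mvccStats{LiveBytes: 1 << 20, LiveCount: 1000},
		mvccStats{LiveBytes: 2048, LiveCount: 2})
	fmt.Printf("%+v\n", total) // {LiveBytes:1046528 LiveCount:998}
}
```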
@dt is there more to do here or can this be closed? |
`IMPORT` is one of the first things our customers use in a POC, and we use it extensively in our internal testing. Anything we can do to speed it up is pretty high-leverage.

We've laid a great foundation with direct ingestion of SSTables into RocksDB, but there is currently a big distributed sort that is limiting scalability and throughput. Luckily, the schema change team has recently switched index backfills to be based on a technique that uses direct SSTable ingestion, but without the distributed sort! There are a couple details to work out, but no fundamental reason we can't do the same thing for IMPORT, so now is a great time to revisit where we're at and lay down a roadmap for the medium-term future.

Specifically, I have the following goals:

- `fixtures import` should be as fast or faster than `fixtures load`
- `fixtures import` should scale to 100s of terabytes

`fixtures import` works by running IMPORT over CSVs that are synthesized from thin air, while `fixtures load` is a thin wrapper around `RESTORE`, so this goal is more-or-less that `IMPORT` on data already on a cluster should be faster than `RESTORE`ing data from google storage to a cluster on google compute. `IMPORT` has to do more work than `RESTORE`, so if this is possible, it's because of network bandwidth constraints.

At a high level, `IMPORT` currently reads a CSV line by line, converting it to our internal mvcc key/values. It currently works by putting these kvs in non-overlapping SSTables, which it then directly ingests. Because we don't have control of the order of rows in the CSVs (plus secondary indexes are basically never sorted in the same order), this requires that we first compute key splits such that the kv data between them is approximately the same size. Then we use these splits as buckets in a big distributed sort.

This big distributed sort means we do the CSV -> kv conversion twice. A distsql flow is set up to sample the splits, converting every CSV -> kvs in the process. Then a followup flow reruns the CSV -> kv conversion, except this time it routes everything according to the splits. This feeds a second set of processors, each of which buffers every kv in a bucket, sorts them, and runs AddSSTable. These second processors are also responsible for detecting unique index violations (in either the primary index or a secondary index) by finding duplicate keys when constructing the SSTable for ingestion.
There are a bunch of inefficiencies in this process, but running the CSV -> kv conversion twice and doing the distributed sort, along with all the buffering involved, dominates `IMPORT`'s overall throughput. This is all because, when we first built `IMPORT`, it was easier for `AddSSTable` to only work on non-overlapping buckets (and this was all it needed for RESTORE). The recent index backfill stuff has paved the way for doing overlapping `AddSSTable`s, which means we can rip the double sample and the distributed sort out. @dt has even prototyped this in #34751! (Update: This merged in #34751!)

After that happens, it's worth looking into our new bottlenecks. As always, performance work should be benchmark driven, but here are some preliminary guesses at low hanging fruit:
- It's silly to be using CSV as an intermediate format for importing workload data. The magic IMPORT thing works by taking a set of rows as query parameters and outputting the data as a CSV, which IMPORT then has to parse into rows of Datums. All of this is allocation and cpu heavy. The sql team is adding efficient columnar formats, and they'll have to be able to serialize them to use them in distributed flows. If they end up using a standard format, we could add it as an alternate non-CSV format for IMPORT, removing basically all the allocations between the data generation and the kv generation. (Update: dt had a better idea and it merged in importccl: skip detour through CSV when importing from workload #34892!)
- In AddSSTable, we compute range mvccstats by making an iterator that merges existing rocksdb data and sstable data and handing that iterator to our common mvccstats code. This is because there could be existing data in the range and (because of unavoidable internal retries) we could end up applying the same sstable twice. Instead, we could compute the mvccstats for the sstable as we construct it and blindly add it to the existing stats, setting the estimate flag. This will not always be correct, but it's probably a good enough guess for the duration of the IMPORT. Then, after the IMPORT is finished, we go back and compute the exact mvcc stats and clear the estimate flag. This means that we only iterate the keyspace being imported into once, not once per sstable over a given span. (A sketch of this estimate-flag lifecycle follows this list.)
- SSTables are ingested into RocksDB via the `IngestExternalFile` call. Historically, this meant we had to write them to disk twice, because RocksDB might need to modify a small part of the file, but raft might later need back the exact original payload. At some point, we added an optimization to try `IngestExternalFile` disallowing any modifications to the file and, if that failed, to make a copy and try `IngestExternalFile` again allowing modifications. As of some recent RocksDB version, it's now possible to run `IngestExternalFile` in a mode that never modifies the file, guaranteeing that it's only written once. (Update: This merged in storage/engine,libroach: 1x write amplification on ingest sst #34886!)
- Profile and speed up the tpcc generator's impls of initial data generation: workload/tpcc: initial data load perf improvements #35322, and probably more
- Switch `workload.InitialData.Batch` to be columnar, which will allow for drastically reduced allocations during initial data generation of frequently used workloads (tpcc, bank), while keeping the easy interface for other workloads: workload: allow columnar data generation (alloc -73%, MB/s +60%) #35349
- Profile and speed up IMPORT: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% #35317, and certainly more
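To make the AddSSTable mvccstats bullet above concrete, here is a minimal sketch of the proposed estimate-flag lifecycle. `mvccStats`, `ingestSST`, and `finishImport` are hypothetical stand-ins for the engine's stats machinery, not the real API.

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for the engine's MVCCStats.
type mvccStats struct {
	LiveBytes, LiveCount int64
	ContainsEstimates    bool
}

// ingestSST blindly folds stats computed while the SST was built into
// the range's stats and marks them as estimates: internal retries or
// pre-existing data may make them slightly wrong, which is tolerable
// for the duration of the IMPORT.
func ingestSST(rangeStats *mvccStats, sstStats mvccStats) {
	rangeStats.LiveBytes += sstStats.LiveBytes
	rangeStats.LiveCount += sstStats.LiveCount
	rangeStats.ContainsEstimates = true
}

// finishImport stands in for the post-IMPORT fixup: recompute exact
// stats with one scan of the imported keyspace and clear the flag.
func finishImport(rangeStats *mvccStats, recomputed mvccStats) {
	*rangeStats = recomputed
	rangeStats.ContainsEstimates = false
}

func main() {
	var rs mvccStats
	ingestSST(&rs, mvccStats{LiveBytes: 1 << 20, LiveCount: 1000})
	fmt.Printf("during import: %+v\n", rs)
	finishImport(&rs, mvccStats{LiveBytes: 1<<20 - 512, LiveCount: 998})
	fmt.Printf("after import:  %+v\n", rs)
}
```

The payoff is that the keyspace being imported into is iterated once at the end, rather than once per ingested sstable over a given span.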
TODO(dan): continue fleshing this all out and start running perf experiments
Epic CRDB-2340
Jira issue: CRDB-4624