importccl: blazing fast IMPORT #34809
35002: workloadccl: benchmark `fixtures import` for single-node tpcc 1 r=dt a=danhhz

So we can track our progress in #34809. Diff between distsql IMPORT and direct IMPORT:

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-8     7.74s ± 3%     4.09s ± 6%  -47.18%  (p=0.008 n=5+5)

name                 old speed      new speed      delta
ImportFixtureTPCC-8  11.4MB/s ± 3%  21.7MB/s ± 6%  +89.46%  (p=0.008 n=5+5)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-8    7.18GB ± 0%    4.12GB ± 0%  -42.62%  (p=0.008 n=5+5)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-8     89.0M ± 0%     39.5M ± 0%  -55.56%  (p=0.008 n=5+5)
```

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>

In an attempt to start some standard measurements, I added a benchmark that does a tpcc-1 import. `old` is the default; `new` is with `--experimental-direct-ingest`.
tpcc-100
So we're not quite getting our linear speedup to 100, but we already know of some issues that could cause that. The biggest is probably that the old import with non-overlapping sstables shouldn't need any compactions, but the new overlapping ones will. I'd bet that write amplification goes up from tpcc-1 to tpcc-100. We might also now be filling up L1 and increasing read amplification for system traffic (jobs updates, liveness, etc). cc @dt
Another thing I was thinking might be a factor is the fixed vs variable costs of running the sampling phase flow -- in a smaller import, those fixed costs don't shrink, meaning sampling is a bigger % of the overall import... thus skipping sampling entirely might see a bigger % impact on smaller imports.
Perhaps, but sampling sets up a fairly simple flow, which I'd guess would be small in comparison to the 4s that tpcc-1 takes end-to-end. My wild guesses are:
I'm currently trying a single-node tpcc-1000.

34892: importccl: skip detour through CSV when importing from workload r=dt a=dt

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard.

Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`. However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases. This is just a first pass -- mostly to prove it works and solve my original progress concern -- and this initial implementation does not even begin to exploit all the possible optimizations. However, it likely makes testing some of them much easier.

The specialized frontend is used automatically by the reader process when it is configured to read CSV data and all file URIs point to workload-generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row-converter that the CSV reader would have -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the `COCKROACH_IMPORT_WORKLOAD_FASTER` env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
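To make the progress-reporting point concrete: when the total is known in rows rather than bytes, the fraction complete is trivial to compute. A minimal sketch -- `rowProgress` is an illustrative type, not the real importccl plumbing:

```go
package main

import "fmt"

// rowProgress reports the completed fraction for a source that knows
// its total up front in rows. A CSV frontend reporting
// bytesRead/totalBytes can't do this when totalBytes is unknown; a
// workload generator always can.
type rowProgress struct {
	emitted, total int64
}

// record adds one emitted batch and returns the new fraction complete.
func (p *rowProgress) record(batchRows int64) float64 {
	p.emitted += batchRows
	return float64(p.emitted) / float64(p.total)
}

func main() {
	// e.g. the 30,000 orders in one tpcc warehouse (illustrative).
	p := rowProgress{total: 30000}
	for i := 0; i < 3; i++ {
		fmt.Printf("progress: %.0f%%\n", 100*p.record(10000))
	}
}
```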
Note to self: run our new benchmarks to measure the impact of #34886
FWIW, in that PR I mentioned results on a single-node MBP were measurable but not very large -- about 5% on a 1.7gb tpcc import. It did cut the direct ingestion write traffic almost in half, but our direct writes were a fraction of the compactor's. The total disk writes (as measured by the OS) for importing 1.7gb went from 7.5gb to 6gb. #34886 (comment)
Yup! I saw. I actually left it as a note to self instead of doing it right now because I want to see if we make any progress on the direct ingestion compactions. That bit, and the stability issues it comes with, is really preventing us from measuring our progress well right now. I'm mulling it over in the background.
Wrote up some thoughts on #35219 re: how much we're re-compacting.
35317: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% r=nvanbenschoten a=nvanbenschoten

Informs #34809.

Before this change, `ErrNameString` only accepted a pointer to a string. This forced the owner of the string to escape. In the following two cases, this resulted in an entire `ColumnDescriptor` proto being allocated on each call:
- `sqlbase.MarshalColumnValue`
- `sqlbase.CheckDatumTypeFitsColumnType`

This showed up on IMPORT benchmarks, where the allocation in `MarshalColumnValue` was responsible for about a third of heap allocations.

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-4     6.24s ± 2%     5.68s ±11%   -9.02%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
ImportFixtureTPCC-4  14.2MB/s ± 2%  15.6MB/s ±10%  +10.08%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-4    4.10GB ± 0%    2.77GB ± 0%  -32.34%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-4     39.5M ± 0%     33.2M ± 0%  -16.11%  (p=0.000 n=10+9)
```

It appears that this will also have an effect on `row.Inserter` and `row.Updater`.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
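The escape described above can be reproduced in a small standalone program. `errNamePtr`/`errNameVal` are stand-ins, not the real `ErrNameString` signature; the point is that an error which retains a `*string` forces the whole enclosing struct to the heap, while a by-value string lets it stay on the stack:

```go
package main

import (
	"errors"
	"fmt"
)

// columnDesc stands in for a large ColumnDescriptor proto; imagine
// many more fields alongside Name.
type columnDesc struct {
	Name string
}

type nameError struct{ name *string }

func (e *nameError) Error() string { return "invalid column: " + *e.name }

// Pointer version: the returned error retains name, so the compiler
// must heap-allocate whatever the pointer points into.
func errNamePtr(name *string) error { return &nameError{name: name} }

// Value version: only a copy of the string header escapes; the
// caller's struct can stay on the stack.
func errNameVal(name string) error { return errors.New("invalid column: " + name) }

func checkPtr(d columnDesc) error { return errNamePtr(&d.Name) } // all of d escapes
func checkVal(d columnDesc) error { return errNameVal(d.Name) }  // d stays on the stack

func main() {
	fmt.Println(checkPtr(columnDesc{Name: "id"}))
	fmt.Println(checkVal(columnDesc{Name: "id"}))
	// Verify with: go build -gcflags=-m (look for "moved to heap: d").
}
```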
Use bufalloc.ByteAllocator in randStringFromAlphabet instead of allocating many small byte arrays.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

```
name                     old time/op    new time/op    delta
InitTPCC/warehouses=1-8     2.33s ± 0%     2.25s ± 1%   -3.04%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         5.00s ± 2%     4.83s ± 2%   -3.32%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
InitTPCC/warehouses=1-8  30.4MB/s ± 0%  31.4MB/s ± 1%   +3.14%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      17.7MB/s ± 2%  18.3MB/s ± 2%   +3.40%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
InitTPCC/warehouses=1-8     329MB ± 0%     246MB ± 0%  -25.28%  (p=0.008 n=5+5)
ImportFixtureTPCC-8        3.70GB ± 0%    3.61GB ± 0%   -2.20%  (p=0.008 n=5+5)

name                     old allocs/op  new allocs/op  delta
InitTPCC/warehouses=1-8     9.38M ± 0%     5.60M ± 0%  -40.27%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         36.0M ± 0%     32.2M ± 0%  -10.49%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
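A simplified re-implementation of the `bufalloc.ByteAllocator` idea, sketched from the description above rather than copied from the real API: carve many small slices out of one large backing chunk, so the allocator and GC see one object instead of thousands.

```go
package main

import "fmt"

// byteAllocator hands out sub-slices of a shared backing chunk. A new
// chunk is allocated only when the current one is exhausted.
type byteAllocator []byte

const chunkSize = 1 << 16 // 64 KiB backing chunks (arbitrary choice)

// alloc returns an n-byte slice carved from the current chunk. The
// three-index slice caps capacity so callers can't clobber neighbors.
func (a byteAllocator) alloc(n int) (byteAllocator, []byte) {
	if cap(a)-len(a) < n {
		size := chunkSize
		if n > size {
			size = n
		}
		a = make(byteAllocator, 0, size)
	}
	a = a[:len(a)+n]
	return a, a[len(a)-n : len(a) : len(a)]
}

func main() {
	var a byteAllocator
	var buf []byte
	for i := 0; i < 3; i++ {
		a, buf = a.alloc(7)
		copy(buf, fmt.Sprintf("row-%03d", i))
		fmt.Println(string(buf))
	}
}
```

Earlier slices stay valid when a fresh chunk replaces the current one; the GC keeps the old chunk alive for as long as anything references it.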
Background: golang/go#21835

The current thinking for go2 is to switch to a Permuted Congruential Generator (PCG), which is much faster than the current "math/rand" generator. It's smaller in memory (128-bit vs 607 64-bit words) and _much_ faster to seed (which is a common operation in workload, which uses seeding to make deterministic results). The linked issue claims "The current implementation, in pure Go, is somewhat slower than math/rand, but could be comparably fast or even faster given compiler support for the 128-bit multiply inside." but for our needs it seems to already be faster.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

```
name                     old time/op    new time/op    delta
InitTPCC/warehouses=1-8     2.25s ± 1%     2.06s ± 1%   -8.07%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         4.90s ± 3%     4.64s ± 2%   -5.44%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
InitTPCC/warehouses=1-8  31.5MB/s ± 1%  34.3MB/s ± 1%   +8.80%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      18.0MB/s ± 3%  19.1MB/s ± 2%   +5.77%  (p=0.008 n=5+5)

name                     old alloc/op   new alloc/op   delta
InitTPCC/warehouses=1-8     246MB ± 0%     246MB ± 0%     ~      (p=0.095 n=5+5)
ImportFixtureTPCC-8        3.61GB ± 0%    3.61GB ± 0%     ~      (p=0.548 n=5+5)

name                     old allocs/op  new allocs/op  delta
InitTPCC/warehouses=1-8     5.60M ± 0%     5.60M ± 0%   +0.06%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         32.2M ± 0%     32.3M ± 0%   +0.08%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
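For reference, a sketch of the PCG family being discussed. This is the standard 64-bit-state pcg32 (PCG-XSH-RR) variant, not the exact 128-bit generator proposed for go2, but it shows the shape of the generator and why seeding is so cheap: two word writes, versus filling math/rand's 607-word lagged-Fibonacci table.

```go
package main

import "fmt"

type pcg32 struct {
	state, inc uint64
}

// seedPCG is O(1): there is no table to fill.
func seedPCG(seed, seq uint64) pcg32 {
	p := pcg32{inc: seq<<1 | 1}
	p.next()
	p.state += seed
	p.next()
	return p
}

// next advances the 64-bit LCG state and applies the XSH-RR output
// permutation (xorshift-high, random rotate).
func (p *pcg32) next() uint32 {
	old := p.state
	p.state = old*6364136223846793005 + p.inc
	xorshifted := uint32(((old >> 18) ^ old) >> 27)
	rot := uint32(old >> 59)
	return (xorshifted >> rot) | (xorshifted << ((-rot) & 31))
}

func main() {
	p := seedPCG(42, 54)
	for i := 0; i < 4; i++ {
		fmt.Println(p.next())
	}
}
```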
The slowest part of generating random strings is generating the random number. Previously, we made a uint64 and threw away most of it to pick one character from an alphabet. Instead, pick as many characters as we can get from 64 bits: floor(log(math.MaxUint64)/log(len(alphabet))).

New random string generation microbenchmark:

```
name                      time/op
RandStringFast/letters-8  75.3ns ± 2%
RandStringFast/numbers-8  74.2ns ± 1%
RandStringFast/aChars-8   84.5ns ± 1%

name                      speed
RandStringFast/letters-8  345MB/s ± 2%
RandStringFast/numbers-8  351MB/s ± 1%
RandStringFast/aChars-8   308MB/s ± 1%
```

Diffs in TPCC data generation and TPCC IMPORT:

```
name                     old time/op    new time/op     delta
InitTPCC/warehouses=1-8     2.03s ± 2%      0.65s ± 1%   -67.85%  (p=0.008 n=5+5)
ImportFixtureTPCC-8         4.11s ± 5%      3.75s ± 2%    -8.67%  (p=0.008 n=5+5)

name                     old speed      new speed        delta
InitTPCC/warehouses=1-8  34.9MB/s ± 2%  108.5MB/s ± 1%  +211.01%  (p=0.008 n=5+5)
ImportFixtureTPCC-8      21.5MB/s ± 5%   23.6MB/s ± 2%    +9.45%  (p=0.008 n=5+5)
```

Touches cockroachdb#34809

Release note: None
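A sketch of the multi-character trick under the formula above. `randString` is illustrative, not the actual workload implementation; it assumes len(alphabet) >= 2 and that the slight modulo bias is acceptable for test-data generation.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randString extracts as many alphabet symbols as fit in each 64-bit
// draw -- floor(log(MaxUint64)/log(len(alphabet))) of them -- by
// repeated mod/div, instead of burning a full uint64 per character.
func randString(rng *rand.Rand, length int, alphabet string) string {
	base := uint64(len(alphabet))
	// log(MaxUint64) == 64*ln2; e.g. 13 symbols per draw for base 26.
	perDraw := int(math.Floor(64 * math.Ln2 / math.Log(float64(base))))
	buf := make([]byte, length)
	for i := 0; i < length; {
		v := rng.Uint64()
		for d := 0; d < perDraw && i < length; d++ {
			buf[i] = alphabet[v%base]
			v /= base
			i++
		}
	}
	return string(buf)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(randString(rng, 26, "abcdefghijklmnopqrstuvwxyz"))
}
```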
37575: sql: save some EvalCtx allocations r=nvanbenschoten a=danhhz

EvalCtx is big, so we shouldn't be passing it anyway. Found by digging into allocations that pprof attributed to the method signature of sql.GenerateInsertRow. I tried to see if this showed up on any of our SQL level INSERT benchmarks, but it seemed to be in the noise.

```
name                                  old time/op    new time/op    delta
ConvertToKVs/tpcc/warehouses=1-8         3.83s ± 0%     3.66s ± 1%   -4.53%  (p=0.016 n=4+5)
ConvertToSSTable/tpcc/warehouses=1-8     5.50s ± 4%     5.13s ± 1%   -6.80%  (p=0.008 n=5+5)

name                                  old speed      new speed      delta
ConvertToKVs/tpcc/warehouses=1-8      23.2MB/s ± 0%  24.3MB/s ± 1%   +4.77%  (p=0.016 n=4+5)
ConvertToSSTable/tpcc/warehouses=1-8  15.1MB/s ± 4%  16.2MB/s ± 1%   +7.27%  (p=0.008 n=5+5)

name                                  old alloc/op   new alloc/op   delta
ConvertToKVs/tpcc/warehouses=1-8         953MB ± 0%     704MB ± 0%  -26.17%  (p=0.029 n=4+4)
ConvertToSSTable/tpcc/warehouses=1-8    1.64GB ± 0%    1.39GB ± 0%  -15.19%  (p=0.016 n=4+5)

name                                  old allocs/op  new allocs/op  delta
ConvertToKVs/tpcc/warehouses=1-8         17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)
ConvertToSSTable/tpcc/warehouses=1-8     17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)
```

Also a small cleanup in some import code to use EvalCtx's Copy method instead of passing around a function closure.

Touches #34809

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
A few recently merged changes related to this from me and @adityamaru27:
A couple more random ideas from the last time I profiled an ingestion:

a) computing MVCCStats incrementally during SST creation iteration (instead of with a second iteration) should be pretty straightforward given the significant constraints on what we're putting in them (all unique user keys).
b) pass ~1mb
c) figure out how we can do better at splitting and scattering.
Direct-ingest is now on by default. The direct-ingest tpcc1k roachtest is now sitting in the teens of minutes, down from 40+, and manual testing of tpcc2k looks like 25-40min, down from 50-60min in 19.1 (also using direct-ingest, or well over an hour with sorted). At this point I think the big pieces of this that bulk-io had on its roadmap are done, so I think bulk-io will be removing this issue from its roadmap now. I'll leave it up to @danhhz to decide if we should actually close it though.
this is so exciting! i'm looking forward to testing it out. Will this be used for nightlies? can we expect some savings and speedups there?
it is on in nightlies (that use
huge effort from bulk io on direct ingest, thanks @dt! i've got two last ideas i've been meaning to write up. 1) speed up workload colbatch -> kv conversion (currently roundtrips through datums but doesn't need to) and 2) fewer copies in the addsstable path
2a13ce6#diff-b010dc66fcc28059908e35277ad85915R157 is another likely big win during heavy ingestion loads -- we see the hangover of our

If @adityamaru27 doesn't get to it before he leaves us (😢) we should try to make sure that actually gets done.
Ah, good to know!
I did a little more profiling, and one of the big places where we're still spending time is sort.Sort on buffers. We're producing sorted data, so I got to thinking it'd be cheaper to keep track of whether the buffer is sorted (just one comparison during each append while we still think it is sorted) than to sort it from scratch (so O(n) instead of O(n log n)). But when I went to actually try this, I realized that our parallel creation of individually sorted batches feeding into one buffer ends up making an unsorted buffer.

We could potentially sort batches to solve this, but it seems like that might be somewhat workload specific, so I got to thinking maybe the workload frontend should do that internally and emit a sorted stream of sorted batches on kvCh. That would also eliminate the potential "jagged edge" problem that still pushes some of workload's SSTs to L0, when a flush occurs while one producer worker is ahead of another, causing the whole SST to overlap just because of one of the boundary batches.

I'm thinking it might make sense to add one more worker that just collects batches from the kvCh that currently all the other workers are pushing onto and does a little bit of buffering (e.g. numWorkers*2) in an attempt to emit batches sequentially. That, combined with switching to a coldata.Bytes or bulk.kvBuf as the row.KVBatch internal format and bumping it up to 1mb, should make the pipeline to the sst_batcher a fair bit cheaper. I don't know if I'll actually get around to playing with this right away, though, so I stopped here to write it down.
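A minimal sketch of the sortedness-tracking idea above, with `kvBuf` as a simplified stand-in for the real buffer type: one comparison per append keeps the O(n) fast path available, and the real sort only runs if an out-of-order key was actually seen.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type kv struct{ key, value []byte }

type kvBuf struct {
	kvs    []kv
	sorted bool // true while every appended key has been >= the previous one
}

// add costs one extra comparison while the buffer still looks sorted.
func (b *kvBuf) add(k kv) {
	if b.sorted && len(b.kvs) > 0 &&
		bytes.Compare(k.key, b.kvs[len(b.kvs)-1].key) < 0 {
		b.sorted = false
	}
	b.kvs = append(b.kvs, k)
}

// sort is a no-op when appends arrived in order, and falls back to a
// real O(n log n) sort otherwise.
func (b *kvBuf) sort() {
	if b.sorted {
		return
	}
	sort.Slice(b.kvs, func(i, j int) bool {
		return bytes.Compare(b.kvs[i].key, b.kvs[j].key) < 0
	})
	b.sorted = true
}

func main() {
	b := kvBuf{sorted: true}
	for _, k := range []string{"a", "b", "c"} {
		b.add(kv{key: []byte(k)})
	}
	b.sort() // no-op: input was already ordered
	fmt.Println(len(b.kvs), b.sorted)
}
```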
39829: batcheval: MVCCStats for AddSSTable with no shadowing should not contain estimates r=dt a=adityamaru27

There is a significant performance win to be achieved by ensuring that the stats computed when using AddSSTable are not estimates, as this prevents recomputation on splits. Running AddSSTable with disallowShadowing=true gets us close to this, as we do not allow colliding keys to be ingested. However, in the situation that two SSTs have KV(s) which "perfectly" shadow an existing key (equal ts and value), we do not consider this a collision. While the shadowing KV would be skipped during ingestion, its stats would be added to the cumulative stats for the AddSSTable command, causing a double count for such KVs.

Therefore, we compute the stats for these "skipped" KVs on-the-fly while checking for the collision condition in C++, and subtract them from the stats of the SST being ingested before adding them to the running cumulative for this command. These stats can then be marked as accurate.

Tagging #34809

Release note: None

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
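A toy sketch of that bookkeeping: `mvccStats` is a simplified stand-in for the engine's MVCCStats struct, and the point is only the subtract-before-add arithmetic that keeps the cumulative total accurate when perfectly-shadowed KVs were skipped.

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for the engine's MVCCStats.
type mvccStats struct {
	LiveBytes, LiveCount int64
}

func (s *mvccStats) subtract(o mvccStats) {
	s.LiveBytes -= o.LiveBytes
	s.LiveCount -= o.LiveCount
}

func (s *mvccStats) add(o mvccStats) {
	s.LiveBytes += o.LiveBytes
	s.LiveCount += o.LiveCount
}

// accumulateSST folds one SST's stats into the command's running
// total, first deducting the contribution of KVs skipped because they
// perfectly shadowed existing data (equal key, timestamp, and value),
// so nothing is double counted.
func accumulateSST(total *mvccStats, sst, skippedShadowed mvccStats) {
	sst.subtract(skippedShadowed)
	total.add(sst)
}

func main() {
	var total mvccStats
	// 1 MiB of KVs in the SST, 2 KiB of which perfectly shadowed
	// existing keys and were skipped during ingestion.
	accumulateSST(&total,
		mvccStats{LiveBytes: 1 << 20, LiveCount: 1000},
		mvccStats{LiveBytes: 2048, LiveCount: 2})
	fmt.Printf("%+v\n", total) // {LiveBytes:1046528 LiveCount:998}
}
```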
@dt is there more to do here or can this be closed? |
`IMPORT` is one of the first things our customers use in a POC, and we use it extensively in our internal testing. Anything we can do to speed it up is pretty high-leverage.

We've laid a great foundation with direct ingestion of SSTables into RocksDB, but there is currently a big distributed sort that is limiting scalability and throughput. Luckily, the schema change team has recently switched index backfills to be based on a technique that uses direct SSTable ingestion, but without the distributed sort! There are a couple details to work out, but no fundamental reason we can't do the same thing for IMPORT, so now is a great time to revisit where we're at and lay down a roadmap for the medium-term future.

Specifically, I have the following goals:

- `fixtures import` should be as fast or faster than `fixtures load`
- `fixtures import` should scale to 100s of terabytes

`fixtures import` works by running IMPORT over CSVs that are synthesized from thin air, while `fixtures load` is a thin wrapper around `RESTORE`, so this goal is more-or-less that `IMPORT` on data already on a cluster should be faster than `RESTORE`ing data from google storage to a cluster on google compute. `IMPORT` has to do more work than `RESTORE`, so if this is possible, it's because of network bandwidth constraints.

At a high level, `IMPORT` currently reads a CSV line by line, converting it to our internal mvcc key/values. It currently works by putting these kvs in non-overlapping SSTables, which it then directly ingests. Because we don't have control of the order of rows in the CSVs (plus secondary indexes are basically never sorted in the same order), this requires that we first compute key splits such that the kv data between them is approximately the same size. Then we use these splits as buckets in a big distributed sort.

This big distributed sort means we do the CSV -> kv conversion twice. A distsql flow is set up to sample the splits, converting every CSV -> kvs in the process. Then a followup flow reruns the CSV -> kv conversion, except this time it routes everything according to the splits. This feeds a second set of processors, each of which buffers every kv in a bucket, sorts them, and runs AddSSTable. These second processors are also responsible for detecting unique index violations (in either the primary index or a secondary index) by finding duplicate keys when constructing the SSTable for ingestion.
There are a bunch of inefficiencies in this process, but running the CSV -> kv conversion twice and doing the distributed sort, along with all the buffering involved, dominates `IMPORT`'s overall throughput. This is all because, when we first built `IMPORT`, it was easier for `AddSSTable` to only work on non-overlapping buckets (and this was all it needed for RESTORE). The recent index backfill stuff has paved the way for doing overlapping `AddSSTable`s, which means we can rip the double sample and the distributed sort out. @dt has even prototyped this in #34751! (Update: This merged in #34751!)

After that happens, it's worth looking into our new bottlenecks. As always, performance work should be benchmark driven, but here are some preliminary guesses at low hanging fruit:
- It's silly to be using CSV as an intermediate format for importing workload data. The magic IMPORT thing works by taking a set of rows as query parameters and outputting the data as a CSV, which IMPORT then has to parse into rows of Datums. All of this is allocation and cpu heavy. The sql team is adding efficient columnar formats, and they'll have to be able to serialize them to use them in distributed flows. If they end up using a standard format, we could add it as an alternate non-CSV format for IMPORT, removing basically all the allocations between the data generation and the kv generation. (Update: dt had a better idea and it merged in importccl: skip detour through CSV when importing from workload #34892!)
- In AddSSTable, we compute range mvccstats by making an iterator that merges existing rocksdb data and sstable data and handing that iterator to our common mvccstats code. This is because there could be existing data in the range and (because of unavoidable internal retries) we could end up applying the same sstable twice. Instead, we could compute the mvccstats for the sstable as we construct it and blindly add it to the existing stats, setting the estimate flag. This will not always be correct, but it's probably a good enough guess for the duration of the IMPORT. Then, after the IMPORT is finished, we go back and compute the exact mvcc stats and clear the estimate flag. This means that we only iterate the keyspace being imported into once, not once per sstable over a given span. (A sketch of this estimate-flag lifecycle follows this list.)
- SSTables are ingested into RocksDB via the `IngestExternalFile` call. Historically, this meant we had to write them to disk twice, because RocksDB might need to modify a small part of the file, but raft might later need back the exact original payload. At some point, we added an optimization to try `IngestExternalFile` disallowing any modifications to the file and, if that failed, to make a copy and try `IngestExternalFile` again allowing modifications. As of some recent RocksDB version, it's now possible to run `IngestExternalFile` in a mode that never modifies the file, guaranteeing that it's only written once. (Update: This merged in storage/engine,libroach: 1x write amplification on ingest sst #34886!)
- Profile and speed up the tpcc generator's impls of initial data generation: workload/tpcc: initial data load perf improvements #35322, and probably more
- Switch `workload.InitialData.Batch` to be columnar, which will allow for drastically reduced allocations during initial data generation of frequently used workloads (tpcc, bank), while keeping the easy interface for other workloads: workload: allow columnar data generation (alloc -73%, MB/s +60%) #35349
- Profile and speed up IMPORT: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% #35317, and certainly more
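To make the AddSSTable mvccstats bullet above concrete, here is a minimal sketch of the proposed estimate-flag lifecycle. `mvccStats`, `ingestSST`, and `finishImport` are hypothetical stand-ins for the engine's stats machinery, not the real API.

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for the engine's MVCCStats.
type mvccStats struct {
	LiveBytes, LiveCount int64
	ContainsEstimates    bool
}

// ingestSST blindly folds stats computed while the SST was built into
// the range's stats and marks them as estimates: internal retries or
// pre-existing data may make them slightly wrong, which is tolerable
// for the duration of the IMPORT.
func ingestSST(rangeStats *mvccStats, sstStats mvccStats) {
	rangeStats.LiveBytes += sstStats.LiveBytes
	rangeStats.LiveCount += sstStats.LiveCount
	rangeStats.ContainsEstimates = true
}

// finishImport stands in for the post-IMPORT fixup: recompute exact
// stats with one scan of the imported keyspace and clear the flag.
func finishImport(rangeStats *mvccStats, recomputed mvccStats) {
	*rangeStats = recomputed
	rangeStats.ContainsEstimates = false
}

func main() {
	var rs mvccStats
	ingestSST(&rs, mvccStats{LiveBytes: 1 << 20, LiveCount: 1000})
	fmt.Printf("during import: %+v\n", rs)
	finishImport(&rs, mvccStats{LiveBytes: 1<<20 - 512, LiveCount: 998})
	fmt.Printf("after import:  %+v\n", rs)
}
```

The payoff is that the keyspace being imported into is iterated once at the end, rather than once per ingested sstable over a given span.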
TODO(dan): continue fleshing this all out and start running perf experiments
Epic CRDB-2340
Jira issue: CRDB-4624