
importccl: blazing fast IMPORT #34809

Closed
5 of 6 tasks
danhhz opened this issue Feb 11, 2019 · 18 comments
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) no-issue-activity T-disaster-recovery X-stale

Comments

@danhhz
Contributor

danhhz commented Feb 11, 2019

IMPORT is one of the first things our customers use in a POC and we use it extensively in our internal testing. Anything we can do to speed it up is pretty high-leverage.

We've laid a great foundation with direct ingestion of SSTables into RocksDB, but there is currently a big distributed sort that is limiting scalability and throughput. Luckily, the schema change team has recently switched index backfills to be based on a technique that uses direct SSTable ingestion, but without the distributed sort! There are a couple details to work out, but no fundamental reason we can't do the same thing for IMPORT, so now is a great time to revisit where we're at and lay down a roadmap for the medium-term future.

Specifically, I have the following goals:

  • fixtures import should be as fast or faster than fixtures load
  • fixtures import should scale to 100s of terabytes

fixtures import works by running IMPORT over CSVs that are synthesized from thin air, while fixtures load is a thin wrapper around RESTORE, so this goal is more-or-less that `IMPORT` on data already on a cluster should be faster than `RESTORE`ing data from google storage to a cluster on google compute. `IMPORT` has to do more work than `RESTORE`, so if this is possible, it's because of network bandwidth constraints.

At a high level, IMPORT currently reads a CSV line by line, converting it to our internal mvcc key/values. It works by putting these kvs in non-overlapping SSTables, which it then directly ingests. Because we don't have control of the order of rows in the CSVs (plus secondary indexes are basically never sorted in the same order), this requires that we first compute key splits such that the kv data between them is approximately the same size. Then we use these splits as buckets in a big distributed sort.

This big distributed sort means we do the CSV -> kv conversion twice. A distsql flow is set up to sample the splits, converting every CSV -> kvs in the process. Then a followup flow reruns the CSV -> kv conversion, except this time it routes everything according to the splits. This feeds to a second set of processors which buffers every kv in a bucket, sorts them, and runs AddSSTable. These second processors are also responsible for detecting unique index violations (in either the primary index or a secondary index) by finding duplicate keys when it constructs the SSTable for ingestion.
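
To make the split-based routing concrete, here is a minimal Go sketch of how kvs could be assigned to buckets given a precomputed, sorted slice of split keys. The `kv` type and `routeKV` helper are illustrative stand-ins, not CockroachDB's actual types.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type kv struct {
	key, value []byte
}

// routeKV returns the index of the bucket that should receive k: bucket i
// covers keys in [splits[i-1], splits[i]), with the first and last buckets
// open-ended.
func routeKV(splits [][]byte, k kv) int {
	return sort.Search(len(splits), func(i int) bool {
		return bytes.Compare(k.key, splits[i]) < 0
	})
}

func main() {
	splits := [][]byte{[]byte("g"), []byte("p")} // buckets: [..g), [g..p), [p..)
	for _, k := range []kv{{key: []byte("apple")}, {key: []byte("kiwi")}, {key: []byte("zebra")}} {
		fmt.Printf("%s -> bucket %d\n", k.key, routeKV(splits, k))
	}
}
```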

There are a bunch of inefficiencies in this process, but running the CSV->kv conversion twice and doing the distributed sort, along with all the buffering involved, dominate IMPORT's overall throughput. This is all because when we first built IMPORT, it was easier for AddSSTable to only work on non-overlapping buckets (and this was all it needed for RESTORE). The recent index backfill work has paved the way for doing overlapping AddSSTables, which means we can rip the double sample and the distributed sort out. @dt has even prototyped this in #34751! (Update: This merged in #34751!)

After that happens, it's worth looking into our new bottlenecks. As always, performance work should be benchmark driven, but here are some preliminary guesses at low-hanging fruit.

  • It's silly to be using CSV as an intermediate format for importing workload data. The magic IMPORT thing works by taking a set of rows as query parameters and outputting the data as a CSV, which IMPORT then has to parse into rows of Datums. All of this is allocation and cpu heavy. The sql team is adding efficient columnar formats and they'll have to be able to serialize them to use them in distributed flows. If they end up using a standard format, we could add it as an alternate non-CSV format for IMPORT, reducing basically all the allocations between the data generation and the kv generation. (Update: dt had a better idea and it merged in importccl: skip detour though CSV when importing from workload #34892!)

  • In AddSSTable, we compute range mvccstats by making an iterator that merges existing rocksdb data and sstable data and handing that iterator to our common mvccstats code. This is because there could be existing data in the range and (because of unavoidable internal retries) we could end up applying the same sstable twice. Instead, we could compute the mvccstats for the sstable as we construct it and blindly add it to the existing stats, setting the estimate flag. This will not always be correct, but it's probably a good enough guess for the duration of the IMPORT. Then, after the IMPORT is finished, we go back and compute the exact mvccstats and clear the estimate flag. This means that we only iterate the keyspace being imported into once, not once per sstable over a given span. (A rough sketch of this idea follows this list.)

  • SSTables are ingested into RocksDB via the IngestExternalFile call. Historically, this meant we had to write them to disk twice, because RocksDB might need to modify a small part of the file, but raft might later need the exact original payload back. At some point, we added an optimization that first tries IngestExternalFile with modifications to the file disallowed, and if that fails, makes a copy and tries IngestExternalFile again with modifications allowed. As of some recent RocksDB version, it's now possible to run IngestExternalFile in a mode that never modifies the file, guaranteeing that it's only written once. (Update: This merged in storage/engine,libroach: 1x write amplification on ingest sst #34886!)

  • Profile and speed up tpcc generator's impls of initial data generation: workload/tpcc: initial data load perf improvements #35322, and probably more

  • Switch workload.InitialData.Batch to be columnar, which will allow for drastically reduced allocations during initial data generation of frequently used workloads (tpcc, bank), while keeping the easy interface for other workloads: workload: allow columnar data generation (alloc -73%, MB/s +60%) #35349

  • Profile and speed up IMPORT: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% #35317, and certainly more
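
For the mvccstats bullet above, here is a rough sketch of the estimate-then-recompute idea. The `mvccStats` struct and its methods are simplified stand-ins for enginepb.MVCCStats and the AddSSTable plumbing, so treat this as an illustration of the shape of the change, not the implementation.

```go
package main

import "fmt"

// mvccStats is a simplified stand-in for enginepb.MVCCStats.
type mvccStats struct {
	keyCount, keyBytes, valBytes int64
	containsEstimates            bool
}

// accumulate updates stats for one key/value as the SSTable is built,
// so no extra pass over the keyspace is needed at ingest time.
func (s *mvccStats) accumulate(key, value []byte) {
	s.keyCount++
	s.keyBytes += int64(len(key))
	s.valBytes += int64(len(value))
}

// add blindly folds the SST's stats into the range's existing stats and
// marks the result as an estimate; an exact recomputation (which clears
// the flag) happens once after the IMPORT finishes.
func (s *mvccStats) add(o mvccStats) {
	s.keyCount += o.keyCount
	s.keyBytes += o.keyBytes
	s.valBytes += o.valBytes
	s.containsEstimates = true
}

func main() {
	var sstStats mvccStats
	sstStats.accumulate([]byte("a"), []byte("1"))
	sstStats.accumulate([]byte("b"), []byte("2"))

	rangeStats := mvccStats{keyCount: 10, keyBytes: 100, valBytes: 400}
	rangeStats.add(sstStats)
	fmt.Printf("%+v\n", rangeStats)
}
```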

TODO(dan): continue fleshing this all out and start running perf experiments

Epic CRDB-2340

Jira issue: CRDB-4624

@danhhz danhhz self-assigned this Feb 11, 2019
@vivekmenezes vivekmenezes added this to Triage in Bulk IO via automation Feb 12, 2019
danhhz added a commit to danhhz/cockroach that referenced this issue Feb 15, 2019
So we can track our progress in cockroachdb#34809.

Diff between distsql IMPORT and direct IMPORT

    name                 old time/op    new time/op    delta
    ImportFixtureTPCC-8     7.74s ± 3%     4.09s ± 6%  -47.18%  (p=0.008 n=5+5)

    name                 old speed      new speed      delta
    ImportFixtureTPCC-8  11.4MB/s ± 3%  21.7MB/s ± 6%  +89.46%  (p=0.008 n=5+5)

    name                 old alloc/op   new alloc/op   delta
    ImportFixtureTPCC-8    7.18GB ± 0%    4.12GB ± 0%  -42.62%  (p=0.008 n=5+5)

    name                 old allocs/op  new allocs/op  delta
    ImportFixtureTPCC-8     89.0M ± 0%     39.5M ± 0%  -55.56%  (p=0.008 n=5+5)

Release note: None
craig bot pushed a commit that referenced this issue Feb 15, 2019
35002: workloadccl: benchmark `fixtures import` for single-node tpcc 1 r=dt a=danhhz

So we can track our progress in #34809.

Diff between distsql IMPORT and direct IMPORT

    name                 old time/op    new time/op    delta
    ImportFixtureTPCC-8     7.74s ± 3%     4.09s ± 6%  -47.18%  (p=0.008 n=5+5)

    name                 old speed      new speed      delta
    ImportFixtureTPCC-8  11.4MB/s ± 3%  21.7MB/s ± 6%  +89.46%  (p=0.008 n=5+5)

    name                 old alloc/op   new alloc/op   delta
    ImportFixtureTPCC-8    7.18GB ± 0%    4.12GB ± 0%  -42.62%  (p=0.008 n=5+5)

    name                 old allocs/op  new allocs/op  delta
    ImportFixtureTPCC-8     89.0M ± 0%     39.5M ± 0%  -55.56%  (p=0.008 n=5+5)

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@danhhz
Contributor Author

danhhz commented Feb 15, 2019

In an attempt to start some standard measurements I added a benchmark that does a fixtures import tpcc --warehouses=1 on a single node. I also ran a fixtures import tpcc --warehouses=100 on a single node n1-standard-16 cluster. Multi-node is more representative of real world performance, but I'd first like to get a sense of (and improve) our speed of light.

tpcc-1: old is the default, new is with --experimental-direct-ingest

name                 old time/op    new time/op    delta
ImportFixtureTPCC-8     7.74s ± 3%     4.09s ± 6%  -47.18%  (p=0.008 n=5+5)

name                 old speed      new speed      delta
ImportFixtureTPCC-8  11.4MB/s ± 3%  21.7MB/s ± 6%  +89.46%  (p=0.008 n=5+5)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-8    7.18GB ± 0%    4.12GB ± 0%  -42.62%  (p=0.008 n=5+5)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-8     89.0M ± 0%     39.5M ± 0%  -55.56%  (p=0.008 n=5+5)

tpcc-100

default: 8m49s
direct ingest: 5m45s
delta: -34.8%

So we're not quite getting our linear speedup to 100, but we already know of some issues that could cause that. The biggest is probably that the old import with non-overlapping sstables shouldn't need any compactions, but the new overlapping ones will. I'd bet that write amplification goes up from tpcc-1 to tpcc-100. We might also now be filling up L1 and increasing read amplification for system traffic (jobs updates, liveness, etc). cc @dt

@dt
Member

dt commented Feb 15, 2019

Another thing I was thinking might be a factor is the fixed vs variable costs of running the sampling-phase flow -- in smaller imports, those fixed costs don't shrink, meaning sampling is a bigger % of the overall import... thus skipping sampling entirely might see a bigger % impact on smaller imports.

@danhhz
Contributor Author

danhhz commented Feb 15, 2019

Perhaps, but sampling sets up a fairly simple flow, which I'd guess would be small in comparison to the 4s that tpcc-1 takes end-to-end.

My wild guesses are:

  • Non-linear behavior in IngestExternalFile after some point
  • Non-linear behavior in something else because the CSV "files" are really big
  • GC. This thing generates a TON of garbage, so we're probably missing some of the cost of GC on smaller tests
  • Tail latencies

I'm currently trying a single node tpcc-1000.

craig bot pushed a commit that referenced this issue Feb 19, 2019
34892: importccl: skip detour though CSV when importing from workload r=dt a=dt

Based on some recent suggestions from @danhhz, while I was thinking about ways to fix job progress reporting for workload imports, I looked at how hard it would be to skip our detour through CSV when importing on-the-fly workload-generated data, since that detour is what made progress reporting hard. Specifically, it was harder because the on-the-fly CSV generation doesn't know up front how many bytes it will return. The usual IMPORT progress tracking used by the file-reading frontends is `num_bytes_read/num_bytes`, but that doesn't work if you don't know `num_bytes`.

However, the workload generator itself certainly knows how many rows it has generated and how many it will generate as it goes. If we hook it up directly, without a detour through a "file" of CSV data, we can just report that number for progress directly.

That is what got me looking at this, but more importantly, skipping the detour to CSV lets us skip a lot of expensive string encoding and decoding in some cases.

This is just a first pass at this -- mostly to prove it works and solved my original progress concern, but this initial implementation does not even begin to exploit all the possible optimizations. However it likely makes testing some of them much easier.

The specialized frontend is used automatically by the reader processor when it is configured to read CSV data and all file URIs point to workload generation-backed storage. In that case, it uses the workload configuration values from those file URIs to set up a generator directly and then writes its output to the same row converter that the CSV reader would have used -- simply cutting the CSV encoder and decoder out of the middle.

This is currently gated behind the `COCKROACH_IMPORT_WORKLOAD_FASTER` env var.

This is a baby-step towards one of the areas identified in #34809.

Release note: none.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
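
For illustration, the shape of that shortcut might look like the following sketch. The `toyGenerator`, `kvConverter`, and `importWorkloadDirect` names are hypothetical stand-ins, not the actual workload or importccl interfaces.

```go
package main

import "fmt"

// batch is a stand-in for a workload-generated batch of already-typed rows.
type batch [][]interface{}

// toyGenerator mimics a workload table generator that knows up front how
// many batches it will produce, which is what makes precise progress easy.
type toyGenerator struct{ batches []batch }

func (g toyGenerator) NumBatches() int   { return len(g.batches) }
func (g toyGenerator) Batch(i int) batch { return g.batches[i] }

// kvConverter is a stand-in for the row converter that the CSV reader feeds.
type kvConverter struct{ rows int }

func (c *kvConverter) AddRow(datums []interface{}) { c.rows++ }

// importWorkloadDirect hands generated rows straight to the converter,
// skipping the CSV encode/decode round trip and reporting progress as it goes.
func importWorkloadDirect(g toyGenerator, c *kvConverter, progress func(float64)) {
	n := g.NumBatches()
	for i := 0; i < n; i++ {
		for _, row := range g.Batch(i) {
			c.AddRow(row)
		}
		progress(float64(i+1) / float64(n))
	}
}

func main() {
	gen := toyGenerator{batches: []batch{
		{{1, "new order"}, {2, "payment"}},
		{{3, "delivery"}},
	}}
	var conv kvConverter
	importWorkloadDirect(gen, &conv, func(f float64) { fmt.Printf("progress: %.0f%%\n", f*100) })
	fmt.Println("rows converted:", conv.rows)
}
```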
@danhhz
Contributor Author

danhhz commented Feb 26, 2019

Note to self: run our new benchmarks to measure the impact of #34886

@dt
Member

dt commented Feb 26, 2019

FWIW, in that PR I mentioned that the results on a single-node MBP were measurable but not very large -- about 5% on a 1.7gb tpcc import. It did cut the direct ingestion write traffic almost in half, but our direct writes were a fraction of the compactor's. The total disk writes (as measured by the OS) for importing 1.7gb went from 7.5gb to 6gb. #34886 (comment)

@danhhz
Contributor Author

danhhz commented Feb 26, 2019

Yup! I saw. I actually left it as a note to self instead of doing it right now because I want to see if we make any progress on the direct ingestion compactions. That bit, and the stability issues that come with it, is really preventing us from measuring our progress well right now. I'm mulling it over in the background.

@dt
Member

dt commented Feb 26, 2019

Wrote up some thoughts on #35219 re: how much we're re-compacting.

@danhhz
Contributor Author

danhhz commented Mar 1, 2019

cc @nvanbenschoten

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 1, 2019
Informs cockroachdb#34809.

Before this change, `ErrNameString` only accepted a pointer to a string.
This forced the owner of the string to escape. In the following two
cases, this resulted in an entire `ColumnDescriptor` proto to be
allocated on each call:
- `sqlbase.MarshalColumnValue`
- `sqlbase.CheckDatumTypeFitsColumnType`

This showed up on IMPORT benchmarks, where the allocation in `MarshalColumnValue`
was responsible for about a third of heap allocations.

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-4     6.24s ± 2%     5.68s ±11%   -9.02%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
ImportFixtureTPCC-4  14.2MB/s ± 2%  15.6MB/s ±10%  +10.08%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-4    4.10GB ± 0%    2.77GB ± 0%  -32.34%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-4     39.5M ± 0%     33.2M ± 0%  -16.11%  (p=0.000 n=10+9)
```

It appears that this will also have an effect on `row.Inserter` and
`row.Updater`.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 2, 2019
Informs cockroachdb#34809.

Before this change, `ErrNameString` only accepted a pointer to a string.
This forced the owner of the string to escape. In the following two
cases, this resulted in an entire `ColumnDescriptor` proto to be
allocated on each call:
- `sqlbase.MarshalColumnValue`
- `sqlbase.CheckDatumTypeFitsColumnType`

This showed up on IMPORT benchmarks, where the allocation in `MarshalColumnValue`
was responsible for about a third of heap allocations.

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-4     6.24s ± 2%     5.68s ±11%   -9.02%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
ImportFixtureTPCC-4  14.2MB/s ± 2%  15.6MB/s ±10%  +10.08%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-4    4.10GB ± 0%    2.77GB ± 0%  -32.34%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-4     39.5M ± 0%     33.2M ± 0%  -16.11%  (p=0.000 n=10+9)
```

It appears that this will also have an effect on `row.Inserter` and
`row.Updater`.

Release note: None
craig bot pushed a commit that referenced this issue Mar 2, 2019
35317: sqlbase: avoid allocations due to ErrNameString, speed up IMPORT by 10% r=nvanbenschoten a=nvanbenschoten

Informs #34809.

Before this change, `ErrNameString` only accepted a pointer to a string.
This forced the owner of the string to escape. In the following two
cases, this resulted in an entire `ColumnDescriptor` proto to be
allocated on each call:
- `sqlbase.MarshalColumnValue`
- `sqlbase.CheckDatumTypeFitsColumnType`

This showed up on IMPORT benchmarks, where the allocation in `MarshalColumnValue`
was responsible for about a third of heap allocations.

```
name                 old time/op    new time/op    delta
ImportFixtureTPCC-4     6.24s ± 2%     5.68s ±11%   -9.02%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
ImportFixtureTPCC-4  14.2MB/s ± 2%  15.6MB/s ±10%  +10.08%  (p=0.000 n=10+10)

name                 old alloc/op   new alloc/op   delta
ImportFixtureTPCC-4    4.10GB ± 0%    2.77GB ± 0%  -32.34%  (p=0.000 n=10+10)

name                 old allocs/op  new allocs/op  delta
ImportFixtureTPCC-4     39.5M ± 0%     33.2M ± 0%  -16.11%  (p=0.000 n=10+9)
```

It appears that this will also have an effect on `row.Inserter` and
`row.Updater`.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 4, 2019
Use bufalloc.ByteAllocator in randStringFromAlphabet instead of
allocating many small byte arrays.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with
  COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

    name                     old time/op    new time/op    delta
    InitTPCC/warehouses=1-8     2.33s ± 0%     2.25s ± 1%   -3.04%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8         5.00s ± 2%     4.83s ± 2%   -3.32%  (p=0.008 n=5+5)

    name                     old speed      new speed      delta
    InitTPCC/warehouses=1-8  30.4MB/s ± 0%  31.4MB/s ± 1%   +3.14%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8      17.7MB/s ± 2%  18.3MB/s ± 2%   +3.40%  (p=0.008 n=5+5)

    name                     old alloc/op   new alloc/op   delta
    InitTPCC/warehouses=1-8     329MB ± 0%     246MB ± 0%  -25.28%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8        3.70GB ± 0%    3.61GB ± 0%   -2.20%  (p=0.008 n=5+5)

    name                     old allocs/op  new allocs/op  delta
    InitTPCC/warehouses=1-8     9.38M ± 0%     5.60M ± 0%  -40.27%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8         36.0M ± 0%     32.2M ± 0%  -10.49%  (p=0.008 n=5+5)

Touches cockroachdb#34809

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 4, 2019
Background: golang/go#21835

The current thinking for go2 is to switch to a Permuted Congruential
Generator (PCG), which is much faster than the current "math/rand"
generator. It's smaller in memory (128-bit vs 607 64-bit words) and
_much_ faster to seed (which is a common operation in workload, which
uses seeding to make deterministic results).

The linked issues claims "The current implementation, in pure Go, is
somewhat slower than math/rand, but could be comparably fast or even
faster given compiler support for the 128-bit multiply inside." but for
our needs it seems to already be faster.

Benchmark results:
- InitTPCC only measures generating initial data
- ImportFixtureTPCC measures loading initial data with
  COCKROACH_IMPORT_WORKLOAD_FASTER=true, which skips the CSV roundtrip.

    name                     old time/op    new time/op    delta
    InitTPCC/warehouses=1-8     2.25s ± 1%     2.06s ± 1%  -8.07%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8         4.90s ± 3%     4.64s ± 2%  -5.44%  (p=0.008 n=5+5)

    name                     old speed      new speed      delta
    InitTPCC/warehouses=1-8  31.5MB/s ± 1%  34.3MB/s ± 1%  +8.80%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8      18.0MB/s ± 3%  19.1MB/s ± 2%  +5.77%  (p=0.008 n=5+5)

    name                     old alloc/op   new alloc/op   delta
    InitTPCC/warehouses=1-8     246MB ± 0%     246MB ± 0%    ~     (p=0.095 n=5+5)
    ImportFixtureTPCC-8        3.61GB ± 0%    3.61GB ± 0%    ~     (p=0.548 n=5+5)

    name                     old allocs/op  new allocs/op  delta
    InitTPCC/warehouses=1-8     5.60M ± 0%     5.60M ± 0%  +0.06%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8         32.2M ± 0%     32.3M ± 0%  +0.08%  (p=0.008 n=5+5)

Touches cockroachdb#34809

Release note: None
danhhz added a commit to danhhz/cockroach that referenced this issue Mar 4, 2019
The slowest part of generating random strings is generating the random
number. Previously, we made a uint64 and threw away most of it to pick
one character from an alphabet. Instead, pick as many characters as we
can get from 64 bits: floor(log(math.MaxUint64)/log(len(alphabet))).

New random string generation microbenchmark:

    name                      time/op
    RandStringFast/letters-8    75.3ns ± 2%
    RandStringFast/numbers-8    74.2ns ± 1%
    RandStringFast/aChars-8     84.5ns ± 1%

    name                      speed
    RandStringFast/letters-8   345MB/s ± 2%
    RandStringFast/numbers-8   351MB/s ± 1%
    RandStringFast/aChars-8    308MB/s ± 1%

Diffs in TPCC data generation and TPCC IMPORT:

    name                     old time/op    new time/op     delta
    InitTPCC/warehouses=1-8     2.03s ± 2%      0.65s ± 1%   -67.85%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8         4.11s ± 5%      3.75s ± 2%    -8.67%  (p=0.008 n=5+5)

    name                     old speed      new speed       delta
    InitTPCC/warehouses=1-8  34.9MB/s ± 2%  108.5MB/s ± 1%  +211.01%  (p=0.008 n=5+5)
    ImportFixtureTPCC-8      21.5MB/s ± 5%   23.6MB/s ± 2%    +9.45%  (p=0.008 n=5+5)

Touches cockroachdb#34809

Release note: None
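
For reference, the trick in that commit message -- peeling several characters out of each 64-bit random value instead of spending one per character -- could look like this standalone sketch. It is an illustration only, not the workload package's actual RandStringFast implementation.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randStringFast draws one uint64 per group of characters rather than one
// per character: each 64-bit value yields floor(log(MaxUint64)/log(|alphabet|))
// characters by repeated modulo/divide.
func randStringFast(rng *rand.Rand, n int, alphabet string) string {
	base := uint64(len(alphabet))
	charsPerUint64 := int(math.Floor(math.Log(math.MaxUint64) / math.Log(float64(base))))
	buf := make([]byte, n)
	for i := 0; i < n; {
		r := rng.Uint64()
		for j := 0; j < charsPerUint64 && i < n; j++ {
			buf[i] = alphabet[r%base]
			r /= base
			i++
		}
	}
	return string(buf)
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(randStringFast(rng, 26, "abcdefghijklmnopqrstuvwxyz"))
}
```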
craig bot pushed a commit that referenced this issue Mar 4, 2019
35322: workload/tpcc: initial data load perf improvements r=petermattis a=danhhz

See individual commits for details

Touches #34809

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@awoods187 awoods187 added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 6, 2019
craig bot pushed a commit that referenced this issue May 17, 2019
37575:  sql: save some EvalCtx allocations r=nvanbenschoten a=danhhz

EvalCtx is big, so we shouldn't be passing it anyway. Found by digging
into allocations that pprof attributed to the method signature of
sql.GenerateInsertRow. I tried to see if this showed up on any of our
SQL level INSERT benchmarks, but it seemed to be in the noise.

    name                                  old time/op    new time/op    delta
    ConvertToKVs/tpcc/warehouses=1-8         3.83s ± 0%     3.66s ± 1%   -4.53%  (p=0.016 n=4+5)
    ConvertToSSTable/tpcc/warehouses=1-8     5.50s ± 4%     5.13s ± 1%   -6.80%  (p=0.008 n=5+5)

    name                                  old speed      new speed      delta
    ConvertToKVs/tpcc/warehouses=1-8      23.2MB/s ± 0%  24.3MB/s ± 1%   +4.77%  (p=0.016 n=4+5)
    ConvertToSSTable/tpcc/warehouses=1-8  15.1MB/s ± 4%  16.2MB/s ± 1%   +7.27%  (p=0.008 n=5+5)

    name                                  old alloc/op   new alloc/op   delta
    ConvertToKVs/tpcc/warehouses=1-8         953MB ± 0%     704MB ± 0%  -26.17%  (p=0.029 n=4+4)
    ConvertToSSTable/tpcc/warehouses=1-8    1.64GB ± 0%    1.39GB ± 0%  -15.19%  (p=0.016 n=4+5)

    name                                  old allocs/op  new allocs/op  delta
    ConvertToKVs/tpcc/warehouses=1-8         17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)
    ConvertToSSTable/tpcc/warehouses=1-8     17.4M ± 0%     16.8M ± 0%   -3.45%  (p=0.008 n=5+5)

Also a small cleanup in some import code to use EvalCtx's Copy method
instead of passing around a function closure.

Touches #34809

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@kenliu kenliu added this to Triage in Disaster Recovery Backlog via automation Jul 3, 2019
@kenliu kenliu removed this from Triage in Bulk IO Jul 3, 2019
@rolandcrosby rolandcrosby moved this from Triage to Backlog: Stabilize Bulk-ingest in Disaster Recovery Backlog Jul 10, 2019
@rolandcrosby rolandcrosby moved this from Backlog: IMPORT to Upcoming in Disaster Recovery Backlog Jul 22, 2019
@dt
Member

dt commented Aug 14, 2019

A few recently merged changes related to this from me and @adityamaru27 :

  • ⬆Direct ingest is on by default.
  • ⬇ Import SST ingestion now scans existing data and prohibits key shadowing (required for correctness with unique/PK indexes), which added ~1% overhead when ingesting non-overlapping SSTs, higher with overlaps.
  • ⬆Pebble-based SSTable creation (~10% faster in RESTORE).
  • ⬆Caller-provided MVCCStats.
  • ⬇ Removal of pre-buffering by table/index ID (fixed OOM crashes). Thanks to range splits at index bounds this didn't change overlapping SSTs but did mean we flushed small index SSTs when the overall buffer filled but...
  • ⬆ Splitting KV buffering into pk-vs-index buffers to mitigate removal of per-index buffers above (still uses much less mem, in part because Adder avoids expensive roachpb.Value).

A couple more random ideas from the last time I profiled an ingestion:

a) computing MVCCStats incrementally during SST creation iteration (instead of with a second iteration) should be pretty straightforward given the significant constraints on what we're putting in them (all unique user keys).

b) pass a ~1mb kvBuf as the kv batches sent from the frontends, which could also maintain a `sorted` boolean to skip re-sorting already-sorted input (this would probably require plumbing separate batches or batch channels for pk-vs-index data).

c) figure out how we can do better at splitting and scattering.

@dt
Member

dt commented Aug 20, 2019

Direct-ingest is now on by default.

The direct-ingest tpcc1k roachtest is now sitting in the teens of minutes, down from 40+, and manual testing of tpcc2k looks like 25-40m, down from 50-60min in 19.1 (also using direct-ingest, or well over an hour with the sorted path).

At this point I think the big pieces of this that bulk-io had on its roadmap are done, so I think bulk-io will be removing this issue from its roadmap now. I'll leave it up to @danhhz to decide if we should actually close it though.

@dt dt removed this from Previous Milestone (Aug 19) in Disaster Recovery Backlog Aug 20, 2019
@awoods187
Contributor

this is so exciting! i'm looking forward to testing it out. Will this be used for nightlies? can we expect some savings and speedups there?

@dt
Member

dt commented Aug 20, 2019

It is on in nightlies (the ones that use workload import). Several of the changes that affect this merged late last week, so the change in the tpcc1k roachtest has been visible since 8/16.

@danhhz
Contributor Author

danhhz commented Aug 20, 2019

huge effort from bulk io on direct ingest, thanks @dt!

i've got two last ideas i've been meaning to write up. 1) speed up workload colbatch -> kv conversion (currently roundtrips through datums but doesn't need to) and 2) fewer copies in the addsstable path

@dt
Member

dt commented Aug 20, 2019

2a13ce6#diff-b010dc66fcc28059908e35277ad85915R157 is another likely big win during heavy ingestion loads -- we see the hangover of our ContainsEstimates flag coming up in how expensive splits become, particularly since we need to split a lot during ingestion.

If @adityamaru27 doesn't get to it before he leaves us (😢) we should try to make sure that actually gets done.

@danhhz
Contributor Author

danhhz commented Aug 20, 2019

Ah, good to know!

@dt
Member

dt commented Aug 26, 2019

I did a little more profiling and one of the big places where we're still spending time is sort.Sort on buffers. We're producing sorted data, so I got to thinking it'd be cheaper to keep track of whether the buffer is sorted (just one comparison during each append while we still think it is sorted) than to sort it from scratch (so O(n) instead of O(n log n)), but when I went to actually try this, I realized that our parallel creation of individually sorted batches feeding into one buffer ends up making an unsorted buffer. We could potentially sort batches to solve this, but it seems like that might be somewhat workload specific, so I got to thinking maybe the workload frontend should do that internally, and emit a sorted stream of sorted batches on kvCh. That would also eliminate the potential "jagged edge" problem that still pushes some of workload's SSTs to L0, when a flush occurs while one producer worker is ahead of another, causing the whole SST to overlap just because of one of the boundary batches.

I'm thinking it might make sense to add one more worker that just collects batches from the kvCh that currently all the other workers are pushing on to and does a little bit of buffering (e.g. numWorkers*2) in an attempt to emit batches sequentially.

That, combined with switching to coldata.Bytes or bulk.kvBuf as the row.KVBatch internal format and bumping it up to 1mb, should make the pipeline to the sst_batcher a fair bit cheaper.

I don't know if I'll actually get around to playing with this right away though so I stopped here to write it down.
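
As an illustration of the sortedness-tracking idea above, a minimal sketch might look like the following; the kvBuf here is illustrative, not the real bulk.kvBuf.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// kvBuf tracks whether its keys are still sorted with one comparison per
// append, so already-sorted input never pays for a full sort.
type kvBuf struct {
	keys   [][]byte
	sorted bool
}

func newKVBuf() *kvBuf { return &kvBuf{sorted: true} }

func (b *kvBuf) add(key []byte) {
	if b.sorted && len(b.keys) > 0 && bytes.Compare(key, b.keys[len(b.keys)-1]) < 0 {
		b.sorted = false
	}
	b.keys = append(b.keys, key)
}

// maybeSort only sorts when out-of-order data was actually appended: for
// sorted input the per-append checks cost O(n) total and the sort is skipped.
func (b *kvBuf) maybeSort() {
	if !b.sorted {
		sort.Slice(b.keys, func(i, j int) bool { return bytes.Compare(b.keys[i], b.keys[j]) < 0 })
		b.sorted = true
	}
}

func main() {
	b := newKVBuf()
	for _, k := range []string{"a", "c", "b"} {
		b.add([]byte(k))
	}
	b.maybeSort()
	fmt.Println("sorted:", b.sorted, "keys:", len(b.keys))
}
```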

craig bot pushed a commit that referenced this issue Sep 4, 2019
39829: batcheval: MVCCStats for AddSSTable with no shadowing should not contain estimates r=dt a=adityamaru27

There is a significant performance win to be achieved by ensuring that the
stats computed when using AddSSTable are not estimates as it prevents
recomputation on splits. Running AddSSTable with disallowShadowing=true
gets us close to this as we do not allow colliding keys to be ingested.
However, in the situation that two SSTs have KV(s) which "perfectly"
shadow an existing key (equal ts and value), we do not consider this
a collision. While the shadowing KV would be skipped during ingestion,
the stats would be added to the cumulative stats for the AddSSTable
command, causing a double count for such KVs.
Therefore, we compute the stats for these "skipped" KVs on-the-fly while
checking for the collision condition in C++ and subtract them from the
stats of the SST being ingested before adding them to the running
cumulative for this command. These stats can then be marked as accurate.

Tagging #34809 

Release note: None

Co-authored-by: Aditya Maru <adityamaru@cockroachlabs.com>
@nvanbenschoten
Member

@dt is there more to do here or can this be closed?

@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Oct 3, 2023
Disaster Recovery Backlog automation moved this from Import/Export to Done Oct 3, 2023