Add benchmark data generator, basic ingestion/persist/merge/query benchmarks #2875

Merged
merged 1 commit into from May 25, 2016

jon-wei
Contributor

@jon-wei jon-wei commented Apr 23, 2016

This PR adds a utility class called BenchmarkDataGenerator for randomly generating event rows for benchmarking purposes.

Each column is generated independently (no cross-column relationships are supported). Column configuration is handled by the BenchmarkColumnSchema class, which defines properties like type, cardinality, and distribution of values.
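To make the idea of independent per-column generation concrete, here is a minimal sketch. It is not the BenchmarkColumnSchema or BenchmarkDataGenerator code from this PR: the `ColumnSketch` class, its fields, and the uniform distribution are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not the BenchmarkColumnSchema or
// BenchmarkDataGenerator classes added by this PR.
final class ColumnSketch
{
  final String name;
  final int cardinality;        // number of distinct values the column can take
  final double nullProbability; // chance that a generated value is null
  private final Random random;  // per-column generator, so columns stay independent

  ColumnSketch(String name, int cardinality, double nullProbability, long seed)
  {
    this.name = name;
    this.cardinality = cardinality;
    this.nullProbability = nullProbability;
    this.random = new Random(seed);
  }

  // Draws one value uniformly from [0, cardinality), or returns null.
  Integer nextValue()
  {
    if (random.nextDouble() < nullProbability) {
      return null;
    }
    return random.nextInt(cardinality);
  }

  // Builds one event row by asking each column for a value in turn.
  static List<Integer> nextRow(List<ColumnSketch> columns)
  {
    List<Integer> row = new ArrayList<>();
    for (ColumnSketch column : columns) {
      row.add(column.nextValue());
    }
    return row;
  }
}
```

Giving each column its own seeded `Random` keeps runs reproducible and means that changing one column's configuration does not shift the values drawn for the others.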

This PR also adds benchmarks for:

  • Adding rows to an IncrementalIndex (IndexIngestionBenchmark)
  • Persisting an IncrementalIndex (IndexPersistBenchmark)
  • Merging several segments (IndexMergeBenchmark)

Basic benchmarks for each query type, running against a single segment with no filters or extraction functions, are added as well.

Each query benchmark class contains the following 3 benchmarks:

  • Query a single IncrementalIndex and materialize the results
  • Query a single QueryableIndex and materialize the results
  • Query multiple QueryableIndex in parallel, with 1 thread per segment

The number of rows per segment, the total number of segments, and other relevant configuration are specified as parameters to the benchmarks.

Each benchmark references a schema defined in BenchmarkSchemas, with the schema to be used defined by a benchmark parameter. Currently there is one schema, "basic".

Each query benchmark can have a set of supported queries per schema; currently, each benchmark has one query defined for the "basic" schema. The choice of schema and query is specified by the "schemaAndQuery" parameter, using a schemaName.queryName format.
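As a rough illustration of the schemaName.queryName format, the sketch below splits such a parameter value into its two parts. The `SchemaAndQuery` helper and the query name "A" are hypothetical; the benchmarks in this PR may parse the parameter differently.

```java
// Hypothetical helper; the PR's benchmarks may parse the parameter differently.
final class SchemaAndQuery
{
  final String schemaName;
  final String queryName;

  private SchemaAndQuery(String schemaName, String queryName)
  {
    this.schemaName = schemaName;
    this.queryName = queryName;
  }

  // Splits "schemaName.queryName" at the first dot and rejects values
  // that are missing either part.
  static SchemaAndQuery parse(String param)
  {
    int dot = param.indexOf('.');
    if (dot <= 0 || dot == param.length() - 1) {
      throw new IllegalArgumentException("Expected schemaName.queryName, got: " + param);
    }
    return new SchemaAndQuery(param.substring(0, dot), param.substring(dot + 1));
  }
}
```

With this sketch, `SchemaAndQuery.parse("basic.A")` yields the schema name `basic` and the (assumed) query name `A`.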

@jon-wei jon-wei added this to the 0.9.1 milestone Apr 23, 2016
@jon-wei jon-wei force-pushed the datagen branch 2 times, most recently from 01ec95a to 5954bb9 Compare April 23, 2016 01:03
@fjy
Contributor

fjy commented Apr 23, 2016

@jon-wei any way we can make this run every time we run integration tests?

@gianm
Contributor

gianm commented Apr 24, 2016

@fjy IMO that should be a different PR. It's non-trivial to decide what "pass" and "fail" mean for performance tests, and that will require a bit more work.

@gianm
Contributor

gianm commented Apr 24, 2016

see also #2823

@xvrl
Member

xvrl commented Apr 25, 2016

@jon-wei it would be interesting to see how these benchmarks perform across 0.8.3, 0.9.0, and current master, as well as the differences between building v9 directly or not. Would it be a lot of work to test that?

@jon-wei
Contributor Author

jon-wei commented Apr 25, 2016

@xvrl I have some numbers for master as of 4/25/2016 here:
#2760

There is a little bit of backporting work because of changed interfaces/constructors when going back to 0.8.3/0.9.0 but it's not too bad.

I'll post some 0.8.3 and 0.9.0 numbers here later today.

@jon-wei
Contributor Author

jon-wei commented Apr 26, 2016

@xvrl

Here are numbers for 0.8.3 and 0.9.0. You can find the branches I used to run these here:
https://github.com/jon-wei/druid/tree/083bench
https://github.com/jon-wei/druid/tree/090bench

0.8.3 (no IndexMergerV9 here)

IndexIngestionBenchmark.addRows  avgt   15  565426.186 ± 33740.419  us/op

IndexPersistBenchmark.persist    avgt   15  1793665.248 ± 58084.277  us/op

IndexMergeBenchmark.merge    avgt   15  6333139.896 ± 105724.798  us/op

GroupByBenchmark.queryIncrementalIndex  avgt   15  9660329.264 ± 957270.576  us/op
GroupByBenchmark.queryQueryableIndex    avgt   15  8534107.868 ± 308093.416  us/op

GroupBy with -Xmx8g
GroupByBenchmark.queryIncrementalIndex  avgt   15  10010066.762 ± 1292144.706  us/op
GroupByBenchmark.queryQueryableIndex    avgt   15   7423522.083 ± 1327946.019  us/op

TopNBenchmark.queryIncrementalIndex  avgt   15  174160.393 ± 6363.810  us/op
TopNBenchmark.queryQueryableIndex    avgt   15   24700.923 ± 1361.910  us/op

TimeseriesBenchmark.queryIncrementalIndex  avgt   15  1677029.729 ± 81201.593  us/op
TimeseriesBenchmark.queryQueryableIndex    avgt   15   153731.534 ±  6657.840  us/op

SearchBenchmark.queryIncrementalIndex  avgt   15  1005667.996 ± 55158.067  us/op
SearchBenchmark.queryQueryableIndex    avgt   15    23602.662 ±  1518.212  us/op

SelectBenchmark.queryIncrementalIndex  avgt   15  186545.947 ± 13447.132  us/op
SelectBenchmark.queryQueryableIndex    avgt   15  114733.236 ±  5639.119  us/op

0.9.0

IndexIngestionBenchmark.addRows  avgt   15  434912.853 ± 31638.362  us/op

IndexPersistBenchmark.persist    avgt   15  1372499.310 ± 50927.089  us/op
IndexPersistBenchmark.persistV9  avgt   15  1161832.053 ± 53649.794  us/op

IndexMergeBenchmark.merge    avgt   15  5212872.130 ±  91189.623  us/op
IndexMergeBenchmark.mergeV9  avgt   15  4591738.072 ± 169472.744  us/op

GroupByBenchmark.queryIncrementalIndex  avgt   15  8638107.187 ± 665876.589  us/op
GroupByBenchmark.queryQueryableIndex    avgt   15  7431838.793 ± 607951.815  us/op

TopNBenchmark.queryIncrementalIndex  avgt   15  161786.883 ± 1718.777  us/op
TopNBenchmark.queryQueryableIndex    avgt   15   23048.913 ± 1417.319  us/op

TimeseriesBenchmark.queryIncrementalIndex  avgt   15  1738236.568 ± 101035.241  us/op
TimeseriesBenchmark.queryQueryableIndex    avgt   15   158425.050 ±   6217.590  us/op

SearchBenchmark.queryIncrementalIndex  avgt   15  299506.824 ± 9156.375  us/op
SearchBenchmark.queryQueryableIndex    avgt   15   19393.670 ± 1030.369  us/op

SelectBenchmark.queryIncrementalIndex  avgt   15   89138.661 ± 4694.938  us/op
SelectBenchmark.queryQueryableIndex    avgt   15  120144.442 ± 6814.463  us/op

@jon-wei
Contributor Author

jon-wei commented Apr 26, 2016

On master, 4/25/2016

Benchmark                        Mode  Cnt       Score       Error  Units
IndexIngestionBenchmark.addRows  avgt   15  477402.429 ± 35892.896  us/op

IndexPersistBenchmark.persist    avgt   15  1473236.884 ± 91442.031  us/op
IndexPersistBenchmark.persistV9  avgt   15  1213295.940 ± 71293.593  us/op

IndexMergeBenchmark.merge    avgt   15  5249935.783 ± 139586.096  us/op
IndexMergeBenchmark.mergeV9  avgt   15  4589218.384 ± 178765.372  us/op

GroupByBenchmark.queryIncrementalIndex  avgt   15  8716371.647 ± 708608.337  us/op
GroupByBenchmark.queryQueryableIndex    avgt   15  7592696.179 ± 548766.165  us/op

GroupBy with -Xmx8g
GroupByBenchmark.queryIncrementalIndex  avgt   15  8503967.992 ± 901464.377  us/op
GroupByBenchmark.queryQueryableIndex    avgt   15  7156946.838 ± 732372.516  us/op

TopNBenchmark.queryIncrementalIndex  avgt   15  175975.530 ± 7078.870  us/op
TopNBenchmark.queryQueryableIndex    avgt   15   25152.234 ± 1932.649  us/op

TimeseriesBenchmark.queryIncrementalIndex  avgt   15  1673851.723 ± 103438.826  us/op
TimeseriesBenchmark.queryQueryableIndex    avgt   15   162409.920 ±   9947.578  us/op

SearchBenchmark.queryIncrementalIndex  avgt   15  310672.434 ± 24370.960  us/op
SearchBenchmark.queryQueryableIndex    avgt   15   18733.833 ±   946.751  us/op

SelectBenchmark.queryIncrementalIndex  avgt   15  79989.341 ± 4352.764  us/op
SelectBenchmark.queryQueryableIndex    avgt   15  99538.635 ± 3424.645  us/op

@fjy
Contributor

fjy commented Apr 26, 2016

benchmarks look pretty good

@fjy
Contributor

fjy commented Apr 26, 2016

@jon-wei can you benchmark master + dimension typing?

@jon-wei
Contributor Author

jon-wei commented Apr 26, 2016

@fjy There are numbers for master + dimension interface in #2760

@fjy
Contributor

fjy commented Apr 27, 2016

👍

@xvrl
Member

xvrl commented Apr 27, 2016

@jon-wei interesting to see that topN got about 5-10% slower in master compared to 0.9.0; we might want to investigate that. Can you try to see if you can get the error bands narrower for topN?

private final List<Double> enumeratedProbabilities;
private final Double nullProbability;

public BenchmarkColumnSchema(
Member

Some Javadocs would be nice to explain what all those inputs mean.

Contributor Author

@xvrl Added documentation comments for the distribution types and the properties
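
For readers unfamiliar with the term, an enumerated distribution can be sampled as in the sketch below. This is an assumption about what a field like `enumeratedProbabilities` might drive, not the PR's implementation; the `EnumeratedSampler` class and its method names are hypothetical.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sampler; not taken from BenchmarkColumnSchema.
final class EnumeratedSampler
{
  private final List<String> values;
  private final double[] cumulative; // running sum of the probabilities

  EnumeratedSampler(List<String> values, List<Double> probabilities)
  {
    if (values.isEmpty() || values.size() != probabilities.size()) {
      throw new IllegalArgumentException("values and probabilities must match and be non-empty");
    }
    this.values = values;
    this.cumulative = new double[probabilities.size()];
    double sum = 0.0;
    for (int i = 0; i < probabilities.size(); i++) {
      sum += probabilities.get(i);
      cumulative[i] = sum;
    }
  }

  // Scales a uniform draw to [0, total) and returns the value of the first
  // bucket whose running sum exceeds it.
  String sample(Random random)
  {
    double u = random.nextDouble() * cumulative[cumulative.length - 1];
    for (int i = 0; i < cumulative.length; i++) {
      if (u < cumulative[i]) {
        return values.get(i);
      }
    }
    // Only reachable through floating-point rounding; fall back to the last value.
    return values.get(values.size() - 1);
  }
}
```

Scaling by the total means the probabilities do not have to sum to exactly 1, which is a design choice of this sketch rather than something the PR states.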

@jon-wei jon-wei force-pushed the datagen branch 3 times, most recently from 5a68e00 to b8be128 Compare April 29, 2016 02:07
@jon-wei
Contributor Author

jon-wei commented Apr 29, 2016

@xvrl Sure, I can run this again on master for TopN

@jon-wei
Contributor Author

jon-wei commented Apr 29, 2016

@xvrl

Here are some comparison results for master and 0.9.0 on TopN, with 10 warmup iterations and 25 benchmark iterations.

I ran the benchmark twice for each. I wasn't able to reduce the error bands, but the numbers for the two versions are pretty close.

master 4-28

Benchmark                            Mode  Cnt       Score      Error  Units
TopNBenchmark.queryIncrementalIndex  avgt   25  166855.134 ± 4619.707  us/op
TopNBenchmark.queryQueryableIndex    avgt   25   21872.320 ±  863.520  us/op

Benchmark                            Mode  Cnt       Score      Error  Units
TopNBenchmark.queryIncrementalIndex  avgt   25  168372.170 ± 5809.774  us/op
TopNBenchmark.queryQueryableIndex    avgt   25   22156.555 ±  774.284  us/op

0.9.0

Benchmark                            Mode  Cnt       Score      Error  Units
TopNBenchmark.queryIncrementalIndex  avgt   25  168309.975 ± 5775.354  us/op
TopNBenchmark.queryQueryableIndex    avgt   25   21996.503 ±  553.199  us/op

Benchmark                            Mode  Cnt       Score      Error  Units
TopNBenchmark.queryIncrementalIndex  avgt   25  166868.012 ± 5500.081  us/op
TopNBenchmark.queryQueryableIndex    avgt   25   22098.132 ±  645.229  us/op
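
One rough way to read results like these is to check whether the score ± error ranges of two runs overlap and how large the relative difference is. The sketch below does that arithmetic. The `ScoreInterval` class is hypothetical, and treating overlap as "no clear difference" is only a heuristic, not a formal significance test.

```java
// Hypothetical helper for comparing two "Score ± Error" rows.
final class ScoreInterval
{
  final double score;
  final double error;

  ScoreInterval(double score, double error)
  {
    this.score = score;
    this.error = error;
  }

  double low()
  {
    return score - error;
  }

  double high()
  {
    return score + error;
  }

  // True when the two score ± error ranges share at least one point.
  boolean overlaps(ScoreInterval other)
  {
    return low() <= other.high() && other.low() <= high();
  }

  // Signed difference of this score relative to the baseline, in percent.
  double percentChangeFrom(ScoreInterval baseline)
  {
    return 100.0 * (score - baseline.score) / baseline.score;
  }
}
```

For example, taking the first queryQueryableIndex run on master (21872.320 ± 863.520) and the first on 0.9.0 (21996.503 ± 553.199), the ranges overlap and the scores differ by well under 1%.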

@Benchmark
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public void queryQueryableIndex(Blackhole blackhole) throws Exception
Contributor

Hmm, I think ideally we should have benchmarks for a few levels of things. This will help us track down performance improvements and regressions, and optimize specific things, more easily.

  1. query a single incremental index (with queryRunnerFactory.createRunner(...).run) and materialize the results
  2. query a single queryable index and materialize the results
  3. query a bunch of queryable indexes with mergeRunners + mergeResults applied (like a historical would do), with a variable number of threads in the mergeRunners thread pool

1 and 2 help us test single-segment performance. 3 helps us test whole-query performance, including the merge step.
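
The difference between the single-segment levels and the whole-query level can be sketched with plain `java.util.concurrent`. In this sketch each "segment" is just a `long[]` and the "query" sums it. It does not use Druid's QueryRunner, mergeRunners, or mergeResults, and the `ParallelSegmentSketch` class is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in: each "segment" is a long[] and a "query" sums it.
final class ParallelSegmentSketch
{
  // Per-segment work, analogous to levels 1 and 2 (single-segment cost).
  static long querySegment(long[] segment)
  {
    long sum = 0;
    for (long value : segment) {
      sum += value;
    }
    return sum;
  }

  // Level 3: fan out over all segments on a pool of numThreads threads,
  // then merge the partial results on the calling thread.
  static long queryAll(List<long[]> segments, int numThreads)
  {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<Long>> partials = new ArrayList<>();
      for (final long[] segment : segments) {
        Callable<Long> task = () -> querySegment(segment);
        partials.add(pool.submit(task));
      }
      long merged = 0;
      for (Future<Long> partial : partials) {
        merged += partial.get();
      }
      return merged;
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException(e);
    }
    catch (ExecutionException e) {
      throw new RuntimeException(e.getCause());
    }
    finally {
      pool.shutdown();
    }
  }
}
```

Timing `querySegment` alone isolates single-segment cost, while timing `queryAll` with different `numThreads` values also captures scheduling and merge overhead.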

Contributor

^ this comment applies to all query types

Contributor Author

each query type supports those 3 benchmark types now

@jon-wei jon-wei force-pushed the datagen branch 3 times, most recently from 05f6338 to 985ffa4 Compare May 11, 2016 02:15
@jon-wei jon-wei changed the title Add benchmark data generator, basic ingestion/persist/merge/query benchmarks Add benchmark data generator, basic ingestion/persist/merge/query benchmarks [WIP] May 11, 2016
@jon-wei jon-wei force-pushed the datagen branch 2 times, most recently from 959dda5 to 765e9d5 Compare May 13, 2016 22:02
@jon-wei jon-wei changed the title Add benchmark data generator, basic ingestion/persist/merge/query benchmarks [WIP] Add benchmark data generator, basic ingestion/persist/merge/query benchmarks May 13, 2016
@jon-wei jon-wei closed this May 14, 2016
@jon-wei jon-wei reopened this May 14, 2016
@jon-wei jon-wei force-pushed the datagen branch 2 times, most recently from 6562dc1 to 078ba7b Compare May 24, 2016 23:41
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@Fork(value = 1)
Member

It's good practice to always use the -server vm, i.e. add @Fork(jvmArgsPrepend = "-server", value = 1)

Contributor Author

Added -server option to all of the benchmarks

@xvrl
Member

xvrl commented May 25, 2016

@jon-wei do you think we could give the schema a bit more explicit column names? For instance, I was trying to benchmark multi-value dimensions and it wasn't clear upfront which dimensions may be multi-value or not. Doesn't have to hold up this PR, but it would be nice to have.

@xvrl
Member

xvrl commented May 25, 2016

👍 otherwise for me

@jon-wei
Contributor Author

jon-wei commented May 25, 2016

@xvrl I've renamed the columns to note the type of distribution and whether they're multivalued or not

@fjy fjy merged commit b72c54c into apache:master May 25, 2016
@leventov leventov mentioned this pull request Sep 7, 2016
@jon-wei jon-wei deleted the datagen branch October 6, 2017 22:23