This repository has been archived by the owner on Aug 3, 2020. It is now read-only.

[FLINK-11986] [state backend, tests] Add micro benchmark for state operations #13

Merged
merged 2 commits into dataArtisans:master on Apr 30, 2019

Conversation

@carp84 (Contributor) commented Mar 20, 2019

We already have benchmarks for the whole backend, but none for finer-grained state operations. Here we propose to add more benchmarks, including (but not limited to):

  • ValueState
    • testPut
    • testGet
  • ListState
    • testUpdate
    • testGet
    • testAddAll
  • MapState
    • testPut
    • testGet
    • testContains
    • testKeys
    • testValues
    • testEntries
    • testIterator
    • testRemove
    • testPutAll

We will create separate benchmarks for HeapKeyedStateBackend and RocksDBKeyedStateBackend.
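To illustrate the shape these benchmarks take, here is a minimal JMH sketch for the ValueState case. The class and the BenchmarkHelpers utility are illustrative placeholders, not the actual classes added by this PR:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.runtime.state.KeyedStateBackend;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ValueStateBenchmarkSketch {

    // Mirrors the setupKeyCount constant discussed later in the review; illustrative value.
    private static final long SETUP_KEY_COUNT = 500_000L;

    private KeyedStateBackend<Long> backend;
    private ValueState<Long> valueState;

    @Setup
    public void setUp() throws Exception {
        // BenchmarkHelpers is a hypothetical utility standing in for whatever the PR
        // uses to build a HeapKeyedStateBackend/RocksDBKeyedStateBackend and to
        // register a ValueState against it.
        backend = BenchmarkHelpers.createHeapKeyedStateBackend();
        valueState = BenchmarkHelpers.createValueState(backend, "valueState");
        for (long i = 0; i < SETUP_KEY_COUNT; i++) {
            backend.setCurrentKey(i);
            valueState.update(i);
        }
    }

    @Benchmark
    public Long valueGet() throws Exception {
        // testGet: switch to a random key, then read its value.
        backend.setCurrentKey(ThreadLocalRandom.current().nextLong(SETUP_KEY_COUNT));
        return valueState.value();
    }

    @Benchmark
    public void valuePut() throws Exception {
        // testPut: switch to a random key, then overwrite its value.
        long key = ThreadLocalRandom.current().nextLong(SETUP_KEY_COUNT);
        backend.setCurrentKey(key);
        valueState.update(key);
    }

    @TearDown
    public void tearDown() throws Exception {
        // Also a hypothetical helper; cleans up the backend and any temporary files.
        BenchmarkHelpers.disposeBackend(backend);
    }
}
```

The ListState and MapState benchmarks follow the same pattern, only exercising the respective state primitives.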

@pnowojski (Contributor):

Could you paste/link an example of the full output of the benchmark run? Benchmark scores, including the output of all of the measurement/warm-up iterations, etc.? The things that I would like to know are:

  1. total time required to execute those new benchmarks.
  2. stability of the results

@StefanRRichter (Contributor) left a comment


Thanks for the work @carp84, I think this is a very good addition to the performance tests. I had a couple of comments, mostly smaller ones. I think the most important ones are about iterating the list state in get and about adding read-modify-write cycle benchmarks. I also had a question about the targeted scenarios (cache/mem/disk). Furthermore, I wonder if it would make sense to extend the tests in the future to include: timer service performance, checkpoint/savepoint performance, and operational performance with concurrently running checkpoints.

backend.setCurrentKey(random.nextLong(setupKeyCount));
valueState.value();
}
}

@StefanRRichter (Contributor):

In general, I wonder whether adding read-modify-write tests would be valuable for additional insight.
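For example, such a read-modify-write benchmark could look roughly like this. An illustrative sketch only, reusing the backend, valueState, and SETUP_KEY_COUNT fields from the ValueState sketch above, not code from this PR:

```java
@Benchmark
public void valueReadModifyWrite() throws Exception {
    // Read the current value for a random key, modify it, and write it back,
    // so one invocation covers a full get + update cycle on the backend.
    backend.setCurrentKey(ThreadLocalRandom.current().nextLong(SETUP_KEY_COUNT));
    Long current = valueState.value();
    long next = (current == null ? 0L : current) + 1L;
    valueState.update(next);
}
```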

class StateBenchmarkConstants {
    static final int mapKeyCount = 10;
    static final int listValueCount = 100;
    static final int setupKeyCount = 500_000;

@StefanRRichter (Contributor):

I have a question about the choice of values: what are we targeting in the benchmark for heap and RocksDB? For example, for heap, would we expect that a random op is answered from main memory or from L2/L3 cache because the whole dataset still fits there? A similar question for RocksDB: do we measure performance for cached blocks or when hitting disk? Should we target the different alternatives? The message can be very different: for example, what if our RocksDB benchmarks all look good but never hit the disk, and then seeks get involved and performance becomes terrible for users once they do hit the disk?

@carp84 (Contributor, Author):

We set the values large enough to trigger disk operations for RocksDB, and confirm this by checking the RocksDB log to see whether flush/compaction happened. However, whether we should also cover the all-fit-in-memory case for RocksDB is an open question. What's your opinion? Thanks.

@carp84 (Contributor, Author) commented Mar 22, 2019

Could you paste/link an example of the full output of the benchmark run? Benchmark scores, including the output of all of the measurement/warm-up iterations, etc.? The things that I would like to know are:

  1. total time required to execute those new benchmarks.
  2. stability of the results

Thanks for the review @pnowojski. Let me get the data after resolving Stefan's review comments, since the changes will possibly affect the results.

@carp84 (Contributor, Author) commented Mar 22, 2019

Furthermore, I wonder if it would make sense to extend the tests in the future to include: timer service performance, checkpoint/savepoint performance, and operational performance with concurrently running checkpoints.

We also have internal benchmarks for the timer service and checkpoint performance; let me upstream them one by one (smile). For performance with concurrently running checkpoints, I agree we need to add one.

@carp84 (Contributor, Author) commented Mar 26, 2019

Here are the total time and the stability of the results from 2 rounds:

  • JMH configuration:
JMH version: 1.19
VM version: JDK 1.8.0_102, VM 25.102-b52
VM options: -Djava.rmi.server.hostname=127.0.0.1 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.ssl
Warmup: 10 iterations, 1 s each
Measurement: 10 iterations, 1 s each
Timeout: 10 min per iteration
Threads: 1 thread, will synchronize iterations
Benchmark mode: Throughput, ops/time
  • Round 1 time: Run complete. Total time: 00:43:34

  • Round 2 time: Run complete. Total time: 00:43:44

  • Result stability

Benchmark Mode Cnt Score Error Units
HeapListStateBenchmark.test1Update#1 thrpt 30 2439.935 ±115.324 ops/ms
HeapListStateBenchmark.test1Update#2 thrpt 30 2415.013 ±122.071 ops/ms
HeapListStateBenchmark.test2Add#1 thrpt 30 3140.404 ±152.354 ops/ms
HeapListStateBenchmark.test2Add#2 thrpt 30 3051.865 ±244.304 ops/ms
HeapListStateBenchmark.test3Get#1 thrpt 30 1940.100 ±71.786 ops/ms
HeapListStateBenchmark.test3Get#2 thrpt 30 1869.135 ±109.670 ops/ms
HeapListStateBenchmark.test4GetAndIterate#1 thrpt 30 1857.606 ±79.844 ops/ms
HeapListStateBenchmark.test4GetAndIterate#2 thrpt 30 1788.352 ±93.340 ops/ms
HeapListStateBenchmark.test5AddAll#1 thrpt 30 294.972 ±122.322 ops/ms
HeapListStateBenchmark.test5AddAll#2 thrpt 30 285.633 ±113.834 ops/ms
HeapMapStateBenchmark.test1Add#1 thrpt 30 1991.866 ±109.012 ops/ms
HeapMapStateBenchmark.test1Add#2 thrpt 30 2033.461 ±48.315 ops/ms
HeapMapStateBenchmark.test1Update#1 thrpt 30 1409.441 ±80.476 ops/ms
HeapMapStateBenchmark.test1Update#2 thrpt 30 1527.825 ±53.739 ops/ms
HeapMapStateBenchmark.test2Get#1 thrpt 30 1396.715 ±42.510 ops/ms
HeapMapStateBenchmark.test2Get#2 thrpt 30 1413.608 ±73.645 ops/ms
HeapMapStateBenchmark.test3Contains#1 thrpt 30 1507.149 ±54.452 ops/ms
HeapMapStateBenchmark.test3Contains#2 thrpt 30 1550.425 ±76.011 ops/ms
HeapMapStateBenchmark.test4Keys#1 thrpt 30 9311.358 ±236.535 ops/ms
HeapMapStateBenchmark.test4Keys#2 thrpt 30 9593.196 ±369.914 ops/ms
HeapMapStateBenchmark.test5Values#1 thrpt 30 9100.976 ±267.255 ops/ms
HeapMapStateBenchmark.test5Values#2 thrpt 30 9288.619 ±209.923 ops/ms
HeapMapStateBenchmark.test6Entries#1 thrpt 30 8723.019 ±186.056 ops/ms
HeapMapStateBenchmark.test6Entries#2 thrpt 30 9448.083 ±273.444 ops/ms
HeapMapStateBenchmark.test7Iterator#1 thrpt 30 9522.637 ±344.523 ops/ms
HeapMapStateBenchmark.test7Iterator#2 thrpt 30 9351.307 ±210.986 ops/ms
HeapMapStateBenchmark.test8Remove#1 thrpt 30 1743.482 ±74.815 ops/ms
HeapMapStateBenchmark.test8Remove#2 thrpt 30 1903.833 ±105.975 ops/ms
HeapMapStateBenchmark.test9PutAll#1 thrpt 30 756.448 ±26.231 ops/ms
HeapMapStateBenchmark.test9PutAll#2 thrpt 30 786.871 ±38.946 ops/ms
HeapValueStateBenchmark.test1Update#1 thrpt 30 2048.441 ±101.184 ops/ms
HeapValueStateBenchmark.test1Update#2 thrpt 30 2022.074 ±178.275 ops/ms
HeapValueStateBenchmark.test2Add#1 thrpt 30 5704.857 ±442.342 ops/ms
HeapValueStateBenchmark.test2Add#2 thrpt 30 5264.967 ±938.853 ops/ms
HeapValueStateBenchmark.test3Get#1 thrpt 30 2001.197 ±78.446 ops/ms
HeapValueStateBenchmark.test3Get#2 thrpt 30 2099.319 ±96.104 ops/ms
RocksDBListStateBenchmark.test1Update#1 thrpt 30 187.203 ±11.839 ops/ms
RocksDBListStateBenchmark.test1Update#2 thrpt 30 189.827 ±11.974 ops/ms
RocksDBListStateBenchmark.test2Add#1 thrpt 30 191.165 ±10.647 ops/ms
RocksDBListStateBenchmark.test2Add#2 thrpt 30 193.191 ±9.661 ops/ms
RocksDBListStateBenchmark.test3Get#1 thrpt 30 428.343 ±40.097 ops/ms
RocksDBListStateBenchmark.test3Get#2 thrpt 30 403.014 ±47.031 ops/ms
RocksDBListStateBenchmark.test4GetAndIterate#1 thrpt 30 421.022 ±36.899 ops/ms
RocksDBListStateBenchmark.test4GetAndIterate#2 thrpt 30 433.174 ±40.616 ops/ms
RocksDBListStateBenchmark.test5AddAll#1 thrpt 30 102.472 ±55.705 ops/ms
RocksDBListStateBenchmark.test5AddAll#2 thrpt 30 106.704 ±54.815 ops/ms
RocksDBMapStateBenchmark.test1Add#1 thrpt 30 349.681 ±36.756 ops/ms
RocksDBMapStateBenchmark.test1Add#2 thrpt 30 342.682 ±40.899 ops/ms
RocksDBMapStateBenchmark.test1Update#1 thrpt 30 350.764 ±31.309 ops/ms
RocksDBMapStateBenchmark.test1Update#2 thrpt 30 354.110 ±37.487 ops/ms
RocksDBMapStateBenchmark.test2Get#1 thrpt 30 45.117 ±0.729 ops/ms
RocksDBMapStateBenchmark.test2Get#2 thrpt 30 45.715 ±0.820 ops/ms
RocksDBMapStateBenchmark.test3Contains#1 thrpt 30 45.824 ±0.620 ops/ms
RocksDBMapStateBenchmark.test3Contains#2 thrpt 30 46.671 ±0.592 ops/ms
RocksDBMapStateBenchmark.test4Keys#1 thrpt 30 323.453 ±12.886 ops/ms
RocksDBMapStateBenchmark.test4Keys#2 thrpt 30 318.254 ±10.652 ops/ms
RocksDBMapStateBenchmark.test5Values#1 thrpt 30 320.683 ±9.605 ops/ms
RocksDBMapStateBenchmark.test5Values#2 thrpt 30 316.464 ±13.635 ops/ms
RocksDBMapStateBenchmark.test6Entries#1 thrpt 30 239.602 ±11.713 ops/ms
RocksDBMapStateBenchmark.test6Entries#2 thrpt 30 240.108 ±13.358 ops/ms
RocksDBMapStateBenchmark.test7Iterator#1 thrpt 30 322.243 ±9.507 ops/ms
RocksDBMapStateBenchmark.test7Iterator#2 thrpt 30 316.570 ±11.002 ops/ms
RocksDBMapStateBenchmark.test8Remove#1 thrpt 30 356.717 ±25.874 ops/ms
RocksDBMapStateBenchmark.test8Remove#2 thrpt 30 351.623 ±29.733 ops/ms
RocksDBMapStateBenchmark.test9PutAll#1 thrpt 30 88.432 ±6.160 ops/ms
RocksDBMapStateBenchmark.test9PutAll#2 thrpt 30 87.454 ±5.779 ops/ms
RocksDBValueStateBenchmark.test1Update#1 thrpt 30 343.421 ±23.262 ops/ms
RocksDBValueStateBenchmark.test1Update#2 thrpt 30 337.099 ±23.011 ops/ms
RocksDBValueStateBenchmark.test2Add#1 thrpt 30 351.838 ±19.350 ops/ms
RocksDBValueStateBenchmark.test2Add#2 thrpt 30 330.423 ±27.198 ops/ms
RocksDBValueStateBenchmark.test3Get#1 thrpt 30 556.239 ±25.828 ops/ms

@pnowojski (Contributor):

@carp84 that's a lot of benchmarks :)

What is going on with HeapValueStateBenchmark.test2Add#2? (It had a huge spread)

I'm thinking about modifying our speed center setup so that it won't be flooded/overloaded with the number of benchmarks. I think a solution might be to start using projects. We could keep the current benchmarks in a Flink project, while adding all of the benchmarks from this PR to a State Backends project. I have played around with this by manually adding some results, and you can see the result here on the left:

Executable
  Flink // <----- project #1
    - Flink 
  State Backends // <------ project #2
    - State Backends

It looks like this would allow us to group the benchmarks & results together.

In order to make it fully work, we would need a couple more things:

  1. Modify the save_jmh_result.py script and add optional parameters for project and executable (lines 61 & 62).
  2. Research how we could differentiate between various types of benchmarks here. Currently there is a Jenkins job that runs all benchmarks defined in this repository and then uploads them using these two commands:
sh "mvn -Dflink.version=`cat ../flink-version` clean install exec:exec"
sh 'python save_jmh_result.py --environment Hetzner --branch master --commit COMMIT --codespeed URL'

We would need to research how we can modify the first command to execute either the Flink or the State Backends benchmarks. Then we would upload them using the second command (two different executions), passing the correct values for the parameters added in step 1.

Thanks to that, we might be able to have two independent Jenkins jobs: one for State Backends and another for Flink benchmarks. That could come in handy if you do not want to wait for all of the benchmarks to complete and are only interested in one of them.

Do you think it makes sense?

@carp84 (Contributor, Author) commented Apr 16, 2019

What is going on with HeapValueStateBenchmark.test2Add#2?

I didn't notice this; it's probably due to environment variance. Let me double check.

We could keep the current benchmarks in a Flink project, while adding all of the benchmarks from this PR to a State Backends project.

Agreed on using separate projects. It's only that the naming seems a little bit strange, since backends also belong to Flink (smile). And internally we also have micro benchmarks for checkpoint and timer service that we plan to upstream later (as mentioned above), so maybe it is worth some effort on categorizing.

We would need to research how we can modify the first command to execute either the Flink or the State Backends benchmarks.

Please check the 3rd commit here, which generates a shaded JMH jar. With it we can run a command like java -jar target/benchmarks.jar -rf csv org.apache.flink.state.benchmark.* to run different benchmarks (and save the results separately) against the shaded jar, which is how I generated the above results.
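For reference, the same selection can also be expressed programmatically through the JMH Runner API. A sketch is below; it is not part of this PR, and the shell command above is what we actually use:

```java
import org.openjdk.jmh.results.format.ResultFormatType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class StateBenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                // Only run the state benchmarks, equivalent to passing the
                // org.apache.flink.state.benchmark.* pattern on the command line.
                .include("org.apache.flink.state.benchmark.*")
                // Write results as CSV, equivalent to the -rf csv flag.
                .resultFormat(ResultFormatType.CSV)
                .build();
        new Runner(options).run();
    }
}
```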

@pnowojski (Contributor):

It's only that the naming seems a little bit strange, since backends also belong to Flink (smile).

Agree :( However, we are re-using the Codespeed tool from the PyPy project here, which is not widely adopted and seems tailored to just their single use case, so unless we want to develop a UI from scratch or modify Codespeed, we have to dance around those kinds of issues 😒 I'm open to other suggestions.

java -jar target/benchmarks.jar -rf csv org.apache.flink.state.benchmark.*

This looks OK as long as we are able to integrate this command with the Jenkins job running the benchmarks.

@carp84 (Contributor, Author) commented Apr 29, 2019

I have set up a demo in our speed center, and it successfully reflects the effect of a recent improvement:
[screenshots: speed center charts showing the improvement]

@carp84 (Contributor, Author) commented Apr 29, 2019

@StefanRRichter @pnowojski Mind taking a look at the latest commit and letting me know if you have any comments? Thanks.

And if the current code looks good, I plan to remove the numbering in method names (e.g. from test1ListUpdate to testListUpdate) and clean up all demo data on our Codespeed center.

@pnowojski (Contributor) left a comment

Thanks @carp84 for the update and the integration with Jenkins :) It looks good. A couple of comments from my side.

I would actually also drop the test prefix from the test names (rename test1ListUpdate to just listUpdate). I know that this doesn't follow the usual (and our own) Java coding style convention, but in the UI this test prefix doesn't help with anything and just takes up more space.

@pnowojski (Contributor) left a comment

I think the change LGTM. Unfortunately we do not have any CI hooked up here, so I assume everything compiles and works well with Jenkins? :)

@StefanRRichter do you have some more comments? (you have pending "changes requested").

@StefanRRichter (Contributor):

@pnowojski No, my requests have been addressed.

@carp84 (Contributor, Author) commented Apr 30, 2019

so I assume everything compiles and works well with Jenkins?

Yes, please refer to this Jenkins job, which is a dry run. :-)

And thanks all for the review! @pnowojski @StefanRRichter

@pnowojski (Contributor):

Ok :)

One last comment. Can you @carp84 squash the commits together, except for Add support to generate shaded benchmark package to allow run specifi..., and can you update the README.md to document the new way to run only selected benchmarks? (Basically the java -jar target/benchmarks.jar -rf csv org.apache.flink.state.benchmark.* command.)

(Commit: Add support to generate shaded benchmark package to allow run specific case in command line)

With this change we can run the command below in a shell instead of updating
the pom file manually:
java -jar target/benchmarks.jar -rf csv org.apache.flink.state.benchmark.*

@carp84 (Contributor, Author) commented Apr 30, 2019

Updated. Please check and let me know whether it looks good, thanks. @pnowojski

@pnowojski (Contributor) left a comment

LGTM, merging. Thanks for the big contribution @carp84 !

pnowojski merged commit 8fad36e into dataArtisans:master on Apr 30, 2019
Myasuka pushed a commit to Myasuka/flink-benchmarks that referenced this pull request Jul 15, 2021