
test(broker): add state controller performance tests #13884

Merged: 4 commits merged into main on Aug 16, 2023

Conversation

npepinpe (Member) commented Aug 14, 2023

Description

This PR adds a new performance test to track the StateController#recover process, which includes copying a snapshot and opening the underlying RocksDB state. Specifically, it measures the case where the state is considered "large", up to 6 GB.

Note
The way to compute the snapshot size while we're generating it is far from accurate: since RocksDB is working in the background (compaction, flushing, etc.), the size is constantly fluctuating. This typically means the final state size is less than the desired target size. We could add some factor as a heuristic, but it's unclear to me whether that's very useful.

For now, we log the desired and actual snapshot size. I'm a bit worried, however, that this will significantly skew results, making it hard to assign a correct reference score and maximum deviation.

To simplify adding this test and future ones, a new JMHTestCase utility is added, which lets you easily write a JUnit test that runs a benchmark and asserts its result. Additionally, a new JMHTest annotation is added, coupled with a JUnit 5 extension that injects a pre-configured JMHTestCase into a test method as a parameter. The annotation ensures all JMH tests share an appropriate tag, so they are excluded from the main CI pipeline but included in the performance test job.
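
As a rough illustration, a test written with these utilities might look like the sketch below. This is a hypothetical sketch, not the PR's actual code: the JMHTestCase method names and the benchmark name are assumptions.

    final class StateRecoveryBenchmarkTest {

      // The JUnit 5 extension behind @JMHTest injects a pre-configured JMHTestCase;
      // the annotation also tags the test so CI routes it to the performance test job.
      @JMHTest("measureRecovery") // hypothetical benchmark method name
      void shouldRecoverWithinExpectedDeviation(final JMHTestCase testCase) {
        // reference score and maximum deviation are illustrative values
        final double referenceScore = 0.5; // ops/s
        final double maxDeviation = 0.2; // i.e. 20%

        // run the benchmark and assert on its resulting score
        testCase.run().isWithinDeviationOf(referenceScore, maxDeviation);
      }
    }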

Note
All of this can be expanded later as we need more stuff.

There are some notable downsides to this approach of running JMH tests, however:

  1. Since JMH forks, it's not easy to debug the tests locally. To do so, you have to disable forking (see the sketch after this list). Not too difficult, but not obvious when you're not familiar with it.
  2. The reference score typically differs between local runs and CI, and we already see that it sometimes fluctuates randomly on CI. It would be nice to find a better approach.
  3. When a benchmark fails with an error, the actual assertion error is just "no benchmarks were run". This is pretty bad; luckily, the benchmark logs are present in the test output, but it means you have to wait for the complete workflow to finish before you can grab them.
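
For downside 1, plain JMH already supports this: run with zero forks so the benchmark stays in the current JVM and a debugger can attach. A minimal sketch (the benchmark class name is hypothetical; the JMH API calls are standard):

    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public final class DebugBenchmarkRunner {
      public static void main(final String[] args) throws RunnerException {
        final var options =
            new OptionsBuilder()
                .include("StateRecoveryBenchmark") // hypothetical benchmark class
                .forks(0) // run in the current JVM so IDE breakpoints are hit
                .build();
        new Runner(options).run();
      }
    }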

Related issues

related #13775

Definition of Done

Not all items need to be done depending on the issue and the pull request.

Code changes:

  • The changes are backwards compatible with previous versions
  • If it fixes a bug then PRs are created to backport the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. backport stable/1.3) to the PR, in case that fails you need to create backports manually.

Testing:

  • There are unit/integration tests that verify all acceptance criteria of the issue
  • New tests are written to ensure backwards compatibility with future versions
  • The behavior is tested manually
  • The change has been verified by a QA run
  • The impact of the changes is verified by a benchmark

Documentation:

  • The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
  • If the PR changes how BPMN processes are validated (e.g. support new BPMN element) then the Camunda modeling team should be informed to adjust the BPMN linting.

Other teams:
If the change impacts another team an issue has been created for this team, explaining what they need to do to support this change.

Please refer to our review guidelines.

@npepinpe npepinpe requested a review from Zelldon August 14, 2023 14:43
npepinpe (Member Author) commented Aug 14, 2023

@Zelldon - for now just check that the JMH test makes sense.

It creates a ZeebeDb with all 57 column families and adds 1,000 (or 10,000 in CI) keys to each, with random keys and values (default size of 4 KB per key and per value, so 8 KB per entry), roughly as sketched below. In CI this is about 4 GB of state; locally it's a few hundred MB.
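
The generation is essentially nested loops over column families and random entries; a simplified sketch (the put API shown is a stand-in for ZeebeDb's actual typed interface):

    final var random = new Random();
    final var key = new byte[4 * 1024]; // 4 KB
    final var value = new byte[4 * 1024]; // 4 KB

    // 57 column families x 10,000 entries x 8 KB per entry ~= 4.3 GB of raw data in CI
    for (final var columnFamily : allColumnFamilies) { // 57 in total
      for (int i = 0; i < 10_000; i++) {
        random.nextBytes(key);
        random.nextBytes(value);
        columnFamily.put(key, value); // simplified stand-in for the real typed API
      }
    }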

I think for this specific test we don't really care about the actual data, so this should be fine, though of course RocksDB internals could change things slightly. I don't expect a significant difference, though.

This doesn't include any actual fix yet, by the way 🙃

@npepinpe npepinpe linked an issue Aug 14, 2023 that may be closed by this pull request
npepinpe (Member Author)

Early tests with a dumb, hacky fix (see below) show a major improvement. With the current approach, you can recover a 4 GB state about once every two seconds (i.e., 0.5 times per second). With the hack, you can do it about 6 times per second.

Will be interesting to push further with bigger sizes, but locally I didn't want to test with 20-30 GB 😄

Here's the hack:

    final var snapshotFile = snapshot.getPath().toFile();
    try {
      // open the snapshot read-only and create a RocksDB checkpoint in the
      // runtime directory instead of copying the files one by one
      final var db = RocksDB.openReadOnly(snapshotFile.getAbsolutePath());
      Checkpoint.create(db).createCheckpoint(runtimeDirectory.toString());
      openDb(future);
    } catch (final RocksDBException e) {
      future.completeExceptionally(
          new RuntimeException(
              String.format("Failed to recover from snapshot %s", snapshot.getId()), e));
    }

That replaces the snapshotStore.copy call and callback. The speedup comes from RocksDB checkpoints hard-linking the SST files when source and target are on the same filesystem, rather than physically copying them.

Zelldon (Member) left a comment:

Great! Thanks @npepinpe, it is cool that we are adding more of such tests 👍🏼 I just had some smaller comments and thoughts.

npepinpe (Member Author)

With 18bcf57, I added some utilities to easily create JUnit tests which run JMH benchmarks.

You can see examples in c4c061f.

While I moved the test method to the same class as the benchmark, it can still be moved out, and there are certainly some arguments for keeping them separate. I'm not strongly for or against; let me know what you think. We could still get rid of the JUnit extension and use only the utilities, but IMO the extension is nice to ensure all the tests share the same tag and as such are included in the appropriate job (while remaining runnable locally).

npepinpe (Member Author) commented Aug 15, 2023

@Zelldon - I think I addressed everything. Let me know if there's anything else. I'll get started on the actual fix now, though to be honest we could already merge this PR without the fix, to avoid the fix PR becoming too big.

Zelldon (Member) commented Aug 15, 2023

Will take a look tomorrow

npepinpe (Member Author)

Note: switching from environment variables to tags for enabling groups of tests (the usual JUnit 5 way) has the downside that running all tests for a module in your IDE will also run the performance tests. Before, I believe it would skip the class. (See the sketch below for how the tag mechanism works.)
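
For reference, a tag-carrying meta-annotation in JUnit 5 could look roughly like the sketch below; @Test, @Tag, and @ExtendWith are standard JUnit 5 API, while the extension class and the annotation body are assumptions:

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;
    import org.junit.jupiter.api.Tag;
    import org.junit.jupiter.api.Test;
    import org.junit.jupiter.api.extension.ExtendWith;

    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @Test
    @Tag("performance") // the build can include/exclude this group per job
    @ExtendWith(JMHTestExtension.class) // hypothetical extension injecting JMHTestCase
    public @interface JMHTest {
      String value(); // hypothetical: name of the benchmark method to run
    }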

npepinpe (Member Author)

Another downside right now: the performance test job is quite slow. Since the broker depends on the engine, the tests always run sequentially. The engine test takes 6 minutes, and now the broker one takes 3 minutes. That's 9 minutes alone, so it's often the last job to finish 😅

@npepinpe npepinpe marked this pull request as ready for review August 15, 2023 19:49
npepinpe (Member Author)

One way to get a more accurate snapshot size is to manually pause and resume background work in RocksDB, as sketched below. However, that somewhat breaks our ZeebeDb abstraction (maybe?)
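
The RocksDB Java API exposes this directly via pauseBackgroundWork/continueBackgroundWork; the sketch below assumes direct access to the underlying org.rocksdb.RocksDB handle, which is exactly what would leak through the ZeebeDb abstraction:

    // Sketch: measure a stable on-disk size by pausing RocksDB background work.
    private long measureStableSize(final RocksDB db, final Path dbDirectory)
        throws RocksDBException {
      db.pauseBackgroundWork(); // no compactions or flushes while paused
      try {
        return computeSnapshotSize(dbDirectory); // the file-walking helper from this PR
      } finally {
        db.continueBackgroundWork();
      }
    }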

npepinpe (Member Author)

Yeah, the way I set up building the state to a certain size is too inaccurate, leading to too much variation. 🤔

npepinpe (Member Author) commented Aug 16, 2023

I see a few options:

  1. Allow pausing/resuming background work for the DB. This ensures we compute proper sizes without having to worry about compaction/flushing. I'm not too keen on this because it means modifying the production API to suit tests =/
  2. Close the DB between batch inserts, which gives us an idea of the "final" size. This is a bit better, but increases the setup time. It's my preferred approach: it's clean and simple, and it's really only inefficient once, when the state is being set up.
  3. Instead of asserting the deviation, assert that the score is above a minimum target. This risks missing future improvements, but since what we're testing is fairly small in scope, it might be fine.

EDIT: I tested closing/opening (sketched below), and got it to be pretty accurate. It takes about 30 seconds to build a 4 GB state, which I think is fine.
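
A minimal sketch of what was tested, assuming hypothetical open/writeBatch helpers; closing the DB flushes the memtables and lets background work settle, so the measured directory size is stable:

    var db = open(dbDirectory); // hypothetical helper opening the ZeebeDb
    long currentSize = 0;
    while (currentSize < targetSize) {
      writeBatch(db); // hypothetical: insert one batch of random entries
      db.close(); // flush to disk so the on-disk size stops fluctuating
      currentSize = computeSnapshotSize(dbDirectory); // sum of file sizes on disk
      db = open(dbDirectory);
    }
    db.close(); // final on-disk state is now approximately targetSize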

npepinpe (Member Author)

So it keeps fluctuating between 0.5 and 0.65 ops/s. That's a pretty big deviation, and this time the state size is stable, so I don't know where the deviation is coming from. That said, while it's relatively big, in absolute terms it means a difference of 300ms, which isn't crazy high.

Zelldon (Member) left a comment:

Great! Thanks @npepinpe, I had some smaller comments, but I think we are in a pretty good state.

Is the test still flaky?

      data.wrapBuffer(new UnsafeBuffer(buffer));

      return data;
    }

    private static long computeSnapshotSize(final Path root) {
      try (final var files = Files.walk(root)) {
        // sum the sizes of all regular files under the snapshot directory
        return files.filter(Files::isRegularFile).mapToLong(file -> file.toFile().length()).sum();
      } catch (final IOException e) {
        throw new UncheckedIOException(e);
      }
    }
Zelldon (Member) commented:

💭 I guess taking checkpoints would also cause a flush and make the size easier to estimate. Or isn't there also a statistics property for this? 🤔

npepinpe (Member Author) replied:

The property is (according to their wiki) an estimate of the live SST data. I have no idea how accurate it is; that's why I went with checking the file size 🤷 Since it includes L0, I expect it would fluctuate just as much if we don't stop background work somehow.
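
For reference, querying that property via the RocksDB Java API looks roughly like this; getProperty and the rocksdb.estimate-live-data-size property are real RocksDB API, while the path is a placeholder:

    try (final var db = RocksDB.openReadOnly("/path/to/snapshot")) {
      // returns the estimate as a decimal string, in bytes
      final var estimatedBytes = db.getProperty("rocksdb.estimate-live-data-size");
      System.out.println("Estimated live SST data: " + estimatedBytes + " bytes");
    } catch (final RocksDBException e) {
      throw new RuntimeException("Failed to read size estimate", e);
    }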

npepinpe (Member Author) replied:

Good point about taking a checkpoint, but here we're not taking a checkpoint, so it shouldn't have any impact.

.github/workflows/ci.yml (review thread resolved)
npepinpe (Member Author)

With a 20% max deviation it's not flaky, as results fall within these bounds. Though considering the engine test is sometimes flaky, I'm also not 100% confident.

As I suggested above, we may want a different approach than measuring deviation from a reference score. For example, we could only assert that we meet a specific target (e.g. the score is greater than some fixed lower bound, as sketched below). The downside is that thresholds differ between environments, so a test may pass locally but fail in CI. I guess that's still the case with the max deviation, though 😅
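
A sketch of that assertion style, reusing the hypothetical JMHTestCase API sketched earlier (the bound is illustrative):

    @JMHTest("measureRecovery")
    void shouldRecoverFasterThanLowerBound(final JMHTestCase testCase) {
      // assert an absolute floor instead of a deviation from a reference score;
      // 0.4 ops/s is illustrative and would differ per environment
      testCase.run().isAtLeast(0.4);
    }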

npepinpe (Member Author)

The only remaining thing, IMO, is that the job is now the slowest of all 😄 And the more tests we add, the slower it's going to get. We should eventually look into speeding it up, since each test takes minutes to execute and they likely run sequentially...

npepinpe (Member Author)

bors merge

zeebe-bors-camunda (Contributor)

Build succeeded:

zeebe-bors-camunda bot merged commit 568f246 into main on Aug 16, 2023; 32 checks passed. The np-13775-state-recover-perf branch was deleted on August 16, 2023 at 13:29.
Successfully merging this pull request may close these issues:

Improve ZeebePartition recovery time with large state