merge: #12483
12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this via [the RocksDB Google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.)

[From the Java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory)):
> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) allows splitting key ranges into separate SST files, which should improve compaction and make propagation of SST files less write amplifying.

It will create more files at runtime and in snapshots, because it produces more SST files: at least one for each column family we use at runtime.
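
To make the effect on file count concrete, here is a minimal pure-Java sketch of the idea. It does not call RocksDB. It assumes that every key starts with an 8-byte column family ordinal (matching the `Long.BYTES` prefix length used in this change) and that a fixed-prefix partitioner asks for a new SST file whenever that prefix differs between two adjacent keys. The class and method names are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: models fixed-prefix partitioning on plain byte arrays. */
final class SstPrefixPartitionSketch {

  // Assumption: the first 8 bytes of each key identify the virtual column family.
  static final int PREFIX_LENGTH = Long.BYTES;

  /** Builds a key laid out as [8-byte column family ordinal][payload]. */
  static byte[] key(final long columnFamilyOrdinal, final String payload) {
    final byte[] payloadBytes = payload.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(PREFIX_LENGTH + payloadBytes.length)
        .putLong(columnFamilyOrdinal)
        .put(payloadBytes)
        .array();
  }

  /** Groups already-sorted keys, starting a new group whenever the prefix changes. */
  static List<List<byte[]>> partition(final List<byte[]> sortedKeys) {
    final List<List<byte[]>> files = new ArrayList<>();
    byte[] previousPrefix = null;
    for (final byte[] key : sortedKeys) {
      final byte[] prefix = Arrays.copyOf(key, PREFIX_LENGTH);
      if (previousPrefix == null || !Arrays.equals(previousPrefix, prefix)) {
        // Prefix changed: this is where a new (hypothetical) SST file would begin.
        files.add(new ArrayList<>());
      }
      files.get(files.size() - 1).add(key);
      previousPrefix = prefix;
    }
    return files;
  }

  public static void main(final String[] args) {
    final List<byte[]> keys =
        List.of(key(1, "a"), key(1, "b"), key(2, "a"), key(3, "a"), key(3, "z"));
    System.out.println("files: " + partition(keys).size());
  }
}
```

With three distinct prefixes the sketch produces three groups, which is the "at least one file per column family" behavior described above. Real file counts also depend on RocksDB's own size limits and compaction levels, which the sketch ignores.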

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069 we want to add this as an experimental feature for now, so that people can play around with it, and so can we. The benchmark results so far look quite promising. The feature itself is marked as experimental in RocksDB as well, so it makes sense to mark it as experimental on our side too.
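
For anyone who wants to try the flag: the broker config template touched by this change names the environment variable `ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING`. The sketch below shows how that name would follow from a dotted property path by upper-casing it and replacing dots with underscores. The property path `zeebe.broker.experimental.rocksdb.enableSstPartitioning` is an assumption inferred from the variable name, and the helper is hypothetical, not part of Zeebe.

```java
import java.util.Locale;

/** Illustration only: derives an environment variable name from a property path. */
final class EnvVarNameSketch {

  /** Upper-cases the path and replaces dots with underscores. */
  static String toEnvVar(final String propertyPath) {
    return propertyPath.replace('.', '_').toUpperCase(Locale.ROOT);
  }

  public static void main(final String[] args) {
    // Assumed property path; only the environment variable name appears in the template.
    final String path = "zeebe.broker.experimental.rocksdb.enableSstPartitioning";
    final String envVar = toEnvVar(path);
    // The feature defaults to false, so an unset variable is reported as "false".
    System.out.println(envVar + "=" + System.getenv().getOrDefault(envVar, "false"));
  }
}
```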

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know exactly what this means. Is this a problem for us? Would we just stay on the current version when they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach ~240 PI/s; [with this configuration we are reaching ~220 PI/s](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All). I think it depends on what our priority is right now: the maximum throughput, or stable performance on larger state. Is it OK to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:
```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)
# Run complete. Total time: 00:07:12
Benchmark                                           Mode  Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))
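
To put the JMH numbers in relation to the base, here is a small arithmetic sketch. It uses only the score and error printed above plus the approximate base of ~230 ops/s; the class and method names are made up for illustration.

```java
import java.util.Locale;

/** Illustration only: relates the reported JMH score to the earlier baseline. */
final class JmhResultArithmetic {

  static double lowerBound(final double score, final double error) {
    return score - error;
  }

  static double upperBound(final double score, final double error) {
    return score + error;
  }

  static double speedup(final double score, final double baseline) {
    return score / baseline;
  }

  public static void main(final String[] args) {
    final double score = 656.639; // ops/s, from the JMH output
    final double error = 91.394; // 99.9% error reported by JMH
    final double baseline = 230.0; // approximate base, not an exact measurement

    System.out.printf(
        Locale.ROOT,
        "interval: [%.3f, %.3f] ops/s%n",
        lowerBound(score, error),
        upperBound(score, error));
    System.out.printf(Locale.ROOT, "speedup: %.2fx%n", speedup(score, baseline));
    System.out.printf(
        Locale.ROOT, "speedup at lower bound: %.2fx%n", speedup(lowerBound(score, error), baseline));
  }
}
```

The average is roughly 2.85 times the base, and even the lower end of the interval stays above twice the base. The computed interval can differ from JMH's printed one in the last digit, because the inputs here are already rounded.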

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, such as the one for large state. I wanted to see how it would hold up when we continuously just start instances.

Remember: previously we died after ~1 hour, when reaching 800 MB of state.
[In this benchmark we reached at least ~4.5 GB and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:
![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)


<!-- Please explain the changes you made here. -->

## Related issues

<!-- Which issues are closed by this PR or are related -->
related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
zeebe-bors-camunda[bot] and Zelldon committed May 2, 2023
2 parents 544286f + 135b4ab commit ceb47a1
Showing 5 changed files with 55 additions and 1 deletion.
@@ -26,6 +26,8 @@ public final class RocksdbCfg implements ConfigurationEntry {
private int ioRateBytesPerSecond = RocksDbConfiguration.DEFAULT_IO_RATE_BYTES_PER_SECOND;
private boolean disableWal = RocksDbConfiguration.DEFAULT_WAL_DISABLED;

private boolean enableSstPartitioning = RocksDbConfiguration.DEFAULT_SST_PARTITIONING_ENABLED;

@Override
public void init(final BrokerCfg globalConfig, final String brokerBase) {
if (columnFamilyOptions == null) {
@@ -110,6 +112,14 @@ public void setDisableWal(final boolean disableWal) {
this.disableWal = disableWal;
}

public boolean isEnableSstPartitioning() {
return enableSstPartitioning;
}

public void setEnableSstPartitioning(final boolean enableSstPartitioning) {
this.enableSstPartitioning = enableSstPartitioning;
}

public RocksDbConfiguration createRocksDbConfiguration() {
return new RocksDbConfiguration()
.setColumnFamilyOptions(columnFamilyOptions)
@@ -119,7 +129,8 @@ public RocksDbConfiguration createRocksDbConfiguration() {
.setMinWriteBufferNumberToMerge(minWriteBufferNumberToMerge)
.setStatisticsEnabled(enableStatistics)
.setIoRateBytesPerSecond(ioRateBytesPerSecond)
- .setWalDisabled(disableWal);
+ .setWalDisabled(disableWal)
+ .setSstPartitioningEnabled(enableSstPartitioning);
}

@Override
@@ -141,6 +152,8 @@ public String toString() {
+ ioRateBytesPerSecond
+ ", disableWal="
+ disableWal
+ ", enableSstPartitioning="
+ enableSstPartitioning
+ '}';
}

8 changes: 8 additions & 0 deletions dist/src/main/config/broker.standalone.yaml.template
@@ -985,6 +985,14 @@
# This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_DISABLEWAL
# disableWal: true

# Configures if the RocksDB SST files should be partitioned based on some virtual column families.
# By default RocksDB will not partition the SST files, which might have influence on the compacting of certain key ranges.
# Enabling this option gives RocksDB some good hints how to improve compaction and reduce the write amplification.
# Benchmarks have shown impressive results, allowing performance to be sustained on larger states, but it is not yet 100% clear what other implications it has besides
# increasing the file count of runtime and snapshots.
# This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING
# enableSstPartitioning: false

# consistencyChecks:
# Configures if the basic operations on RocksDB, such as inserting or deleting key-value pairs, should check preconditions,
# for example that a key does not already exist when inserting.
8 changes: 8 additions & 0 deletions dist/src/main/config/broker.yaml.template
@@ -895,6 +895,14 @@
# This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_DISABLEWAL
# disableWal: true

# Configures if the RocksDB SST files should be partitioned based on some virtual column families.
# By default RocksDB will not partition the SST files, which might have influence on the compacting of certain key ranges.
# Enabling this option gives RocksDB some good hints how to improve compaction and reduce the write amplification.
# Benchmarks have shown impressive results, allowing performance to be sustained on larger states, but it is not yet 100% clear what other implications it has besides
# increasing the file count of runtime and snapshots.
# This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING
# enableSstPartitioning: false

# consistencyChecks:
# Configures if the basic operations on RocksDB, such as inserting or deleting key-value pairs, should check preconditions,
# for example that a key does not already exist when inserting.
@@ -32,6 +32,14 @@ public final class RocksDbConfiguration {
*/
public static final boolean DEFAULT_WAL_DISABLED = true;

/**
* This is an experimental feature, it is not 100% clear yet what the implications are besides
* having much better performance (shown in several benchmarks) and generating more SST files.
*
* <p>There will be files created for each virtual column family.
*/
public static final boolean DEFAULT_SST_PARTITIONING_ENABLED = false;

public static final int DEFAULT_IO_RATE_BYTES_PER_SECOND = 0;

private Properties columnFamilyOptions = new Properties();
@@ -41,6 +49,8 @@ public final class RocksDbConfiguration {
private int minWriteBufferNumberToMerge = DEFAULT_MIN_WRITE_BUFFER_NUMBER_TO_MERGE;
private boolean walDisabled = DEFAULT_WAL_DISABLED;

private boolean sstPartitioningEnabled = DEFAULT_SST_PARTITIONING_ENABLED;

/**
* Defines how many files are kept open by RocksDB, per default it is unlimited (-1). This is done
* for performance reasons, if we set a value higher than zero it needs to keep track of open
@@ -135,4 +145,13 @@ public RocksDbConfiguration setWalDisabled(final boolean walDisabled) {
this.walDisabled = walDisabled;
return this;
}

public boolean isSstPartitioningEnabled() {
return sstPartitioningEnabled;
}

public RocksDbConfiguration setSstPartitioningEnabled(final boolean sstPartitioningEnabled) {
this.sstPartitioningEnabled = sstPartitioningEnabled;
return this;
}
}
@@ -32,6 +32,7 @@
import org.rocksdb.RateLimiter;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstPartitionerFixedPrefixFactory;
import org.rocksdb.Statistics;
import org.rocksdb.StatsLevel;
import org.rocksdb.TableFormatConfig;
@@ -183,6 +184,11 @@ private ColumnFamilyOptions createDefaultColumnFamilyOptions(

final var tableConfig = createTableFormatConfig(closeables, blockCacheMemory);

if (rocksDbConfiguration.isSstPartitioningEnabled()) {
columnFamilyOptions.setSstPartitionerFactory(
new SstPartitionerFixedPrefixFactory(Long.BYTES));
}

return columnFamilyOptions
// to extract our column family type (used as prefix) and seek faster
.useFixedLengthPrefixExtractor(Long.BYTES)