merge: #12483

12483: Introduce experimental SST partitioning r=Zelldon a=Zelldon

## Description

Discovered this via [the RocksDB Google group post](https://groups.google.com/g/rocksdb/c/l3CzFD4YBYQ#:~:text=another%20way%20that%20might%20be%20helpful%20is%20using%20sst_partitioner_factory%20.%20By%20using%20this%20experimental%20feature%2C%20you%20can%20partition%20the%20ssts%20based%20on%20your%20desired%20prefix%20which%20means%20you%20would%20only%20have%20to%20tell%20how%20many%20entries%20are%20in%20that%20sst.).

[From the Java docs](https://javadoc.io/static/org.rocksdb/rocksdbjni/6.20.3/org/rocksdb/ColumnFamilyOptionsInterface.html#setSstPartitionerFactory(org.rocksdb.SstPartitionerFactory)):

> use the specified factory for a function to determine the partitioning of sst files. This helps compaction to split the files on interesting boundaries (key prefixes) to make propagation of sst files less write amplifying (covering the whole key space).

### Details

SST partitioning based on the column family prefix (virtual column family) splits key ranges into separate SST files, which should improve compaction and make propagation of SST files less write amplifying. It will create more files at runtime and in snapshots, since there will be more SST files: at least one for each column family we use at runtime.

As discussed here https://camunda.slack.com/archives/C04T7T0RPLY/p1681931668446069 we want to add this as an experimental feature for now, so that people can play around with it, and so can we. The benchmark results so far look quite promising. The feature itself is marked as experimental in RocksDB as well, so it makes sense to mark it as experimental on our side too.

Open questions:

1. The config is marked as an experimental feature in RocksDB. I don't know exactly what this means. Is this a problem for us? Would we just stay on the version when they remove it? Is it unstable? Not sure yet.
2. The maximum throughput seems to be degraded a bit. As I mentioned earlier, we are currently able to reach ~240 PI/s; [with the configuration we reach ~220 PI/s](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&refresh=10s&from=now-6h&to=now&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-max-out-sst-partitioner&var-pod=All&var-partition=All). I think it depends on what our priority is right now: the maximum throughput, or stable performance on larger state. Is it OK to hurt our maximum throughput a little? We will need to investigate this further.

### JMH Benchmarks

I tried it with the JMH benchmark and it gave impressive results:

```
Result "io.camunda.zeebe.engine.perf.EnginePerformanceTest.measureProcessExecutionTime":
  656.639 ±(99.9%) 91.394 ops/s [Average]
  (min, avg, max) = (1.775, 656.639, 1163.635), stdev = 386.967
  CI (99.9%): [565.246, 748.033] (assumes normal distribution)

# Run complete. Total time: 00:07:12

Benchmark                                          Mode   Cnt    Score    Error  Units
EnginePerformanceTest.measureProcessExecutionTime  thrpt  200  656.639 ± 91.394  ops/s
```

[Remember the base was ~230](#12241 (comment))

### Zeebe Benchmarks

After the JMH benchmark I started some new benchmarks, such as the one for large state. I wanted to see how it would survive when we continuously just start instances.

Remember: previously we died after ~1 hour, when reaching 800 MB of state. [In the benchmark we reached at least ~4.5 GB and were still able to handle the same load (over 6 hours).](https://grafana.dev.zeebe.io/d/I4lo7_EZk/zeebe?orgId=1&from=1681912207012&to=1681930704963&var-DS_PROMETHEUS=Prometheus&var-cluster=All&var-namespace=zell-large-state-sst-partition&var-pod=All&var-partition=All) :exploding_head:

![snapshot](https://user-images.githubusercontent.com/2758593/235164591-0ba3cb40-aa47-4bf4-b647-9992ac5d7e88.png)
![general](https://user-images.githubusercontent.com/2758593/235164598-5da0906e-a50f-4235-a5b8-48181dffc9d5.png)

#### Maxing out benchmark

![maxgeneral](https://user-images.githubusercontent.com/2758593/235164601-bab9f40c-20be-4cbe-8530-c0ba791ec0f0.png)

## Related issues

related to #12033

Co-authored-by: Christopher Zell <zelldon91@googlemail.com>
Co-authored-by: Christopher Kujawa (Zell) <zelldon91@googlemail.com>
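To make the "Details" section of the description more concrete, here is a minimal, simplified model of fixed-prefix partitioning in plain Java. It is a sketch only, not RocksDB's implementation: it assumes every key starts with an 8-byte (`Long.BYTES`) big-endian column family id, and it starts a new "file" whenever that prefix changes in a sorted key sequence. The class name `SstPrefixPartitionSketch` and the helpers `key` and `partition` are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; models only the idea of splitting output on a fixed-length key prefix.
final class SstPrefixPartitionSketch {

  /** Builds a key as an 8-byte big-endian column family id followed by the user key (assumed layout). */
  static byte[] key(final long columnFamilyId, final String userKey) {
    final byte[] suffix = userKey.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(Long.BYTES + suffix.length)
        .putLong(columnFamilyId)
        .put(suffix)
        .array();
  }

  /**
   * Splits an already sorted key sequence into groups ("files"), starting a new group whenever
   * the first prefixLength bytes of the key differ from the previous key's prefix.
   */
  static List<List<byte[]>> partition(final List<byte[]> sortedKeys, final int prefixLength) {
    final List<List<byte[]>> files = new ArrayList<>();
    byte[] currentPrefix = null;
    for (final byte[] key : sortedKeys) {
      final byte[] prefix = Arrays.copyOf(key, Math.min(prefixLength, key.length));
      if (currentPrefix == null || !Arrays.equals(currentPrefix, prefix)) {
        files.add(new ArrayList<>()); // prefix changed: cut a new file here
        currentPrefix = prefix;
      }
      files.get(files.size() - 1).add(key);
    }
    return files;
  }

  public static void main(final String[] args) {
    final List<byte[]> keys =
        List.of(key(1L, "a"), key(1L, "b"), key(2L, "a"), key(3L, "x"), key(3L, "y"));
    for (final List<byte[]> file : partition(keys, Long.BYTES)) {
      final long columnFamilyId = ByteBuffer.wrap(file.get(0)).getLong();
      System.out.println("column family " + columnFamilyId + ": " + file.size() + " key(s)");
    }
  }
}
```

In this model, each virtual column family ends up in its own group, which mirrors why the description expects more SST files at runtime and in snapshots. How RocksDB actually cuts files during compaction involves further factors (for example target file sizes) that this sketch ignores.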
camunda · May 2, 2023 · ceb47a1 · ceb47a1
2 parents 544286f + 135b4ab
commit ceb47a1
Showing 5 changed files with 55 additions and 1 deletion.
diff --git a/broker/src/main/java/io/camunda/zeebe/broker/system/configuration/RocksdbCfg.java b/broker/src/main/java/io/camunda/zeebe/broker/system/configuration/RocksdbCfg.java
@@ -26,6 +26,8 @@ public final class RocksdbCfg implements ConfigurationEntry {
   private int ioRateBytesPerSecond = RocksDbConfiguration.DEFAULT_IO_RATE_BYTES_PER_SECOND;
   private boolean disableWal = RocksDbConfiguration.DEFAULT_WAL_DISABLED;
 
+  private boolean enableSstPartitioning = RocksDbConfiguration.DEFAULT_SST_PARTITIONING_ENABLED;
+
   @Override
   public void init(final BrokerCfg globalConfig, final String brokerBase) {
     if (columnFamilyOptions == null) {
@@ -110,6 +112,14 @@ public void setDisableWal(final boolean disableWal) {
     this.disableWal = disableWal;
   }
 
+  public boolean isEnableSstPartitioning() {
+    return enableSstPartitioning;
+  }
+
+  public void setEnableSstPartitioning(final boolean enableSstPartitioning) {
+    this.enableSstPartitioning = enableSstPartitioning;
+  }
+
   public RocksDbConfiguration createRocksDbConfiguration() {
     return new RocksDbConfiguration()
         .setColumnFamilyOptions(columnFamilyOptions)
@@ -119,7 +129,8 @@ public RocksDbConfiguration createRocksDbConfiguration() {
         .setMinWriteBufferNumberToMerge(minWriteBufferNumberToMerge)
         .setStatisticsEnabled(enableStatistics)
         .setIoRateBytesPerSecond(ioRateBytesPerSecond)
-        .setWalDisabled(disableWal);
+        .setWalDisabled(disableWal)
+        .setSstPartitioningEnabled(enableSstPartitioning);
   }
 
   @Override
@@ -141,6 +152,8 @@ public String toString() {
         + ioRateBytesPerSecond
         + ", disableWal="
         + disableWal
+        + ", enableSstPartitioning="
+        + enableSstPartitioning
         + '}';
   }
 

diff --git a/dist/src/main/config/broker.standalone.yaml.template b/dist/src/main/config/broker.standalone.yaml.template
@@ -985,6 +985,14 @@
         # This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_DISABLEWAL
         # disableWal: true
 
+        # Configures whether the RocksDB SST files should be partitioned based on virtual column families.
+        # By default RocksDB does not partition the SST files, which might influence the compaction of certain key ranges.
+        # Enabling this option gives RocksDB hints on how to improve compaction and reduce write amplification.
+        # Benchmarks have shown impressive results, sustaining performance on larger states, but it is not yet 100% clear what other implications it has besides
+        # increasing the file count of runtime and snapshots.
+        # This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING
+        # enableSstPartitioning: false
+
       # consistencyChecks:
         # Configures if the basic operations on RocksDB, such as inserting or deleting key-value pairs, should check preconditions,
         # for example that a key does not already exist when inserting.

diff --git a/dist/src/main/config/broker.yaml.template b/dist/src/main/config/broker.yaml.template
@@ -895,6 +895,14 @@
         # This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_DISABLEWAL
         # disableWal: true
 
+        # Configures whether the RocksDB SST files should be partitioned based on virtual column families.
+        # By default RocksDB does not partition the SST files, which might influence the compaction of certain key ranges.
+        # Enabling this option gives RocksDB hints on how to improve compaction and reduce write amplification.
+        # Benchmarks have shown impressive results, sustaining performance on larger states, but it is not yet 100% clear what other implications it has besides
+        # increasing the file count of runtime and snapshots.
+        # This setting can also be set using the environment variable ZEEBE_BROKER_EXPERIMENTAL_ROCKSDB_ENABLESSTPARTITIONING
+        # enableSstPartitioning: false
+
       # consistencyChecks:
         # Configures if the basic operations on RocksDB, such as inserting or deleting key-value pairs, should check preconditions,
         # for example that a key does not already exist when inserting.

diff --git a/zb-db/src/main/java/io/camunda/zeebe/db/impl/rocksdb/RocksDbConfiguration.java b/zb-db/src/main/java/io/camunda/zeebe/db/impl/rocksdb/RocksDbConfiguration.java
@@ -32,6 +32,14 @@ public final class RocksDbConfiguration {
    */
   public static final boolean DEFAULT_WAL_DISABLED = true;
 
+  /**
+   * This is an experimental feature; it is not 100% clear yet what the implications are besides
+   * having much better performance (shown in several benchmarks) and generating more SST files.
+   *
+   * <p>Files will be created for each virtual column family.
+   */
+  public static final boolean DEFAULT_SST_PARTITIONING_ENABLED = false;
+
   public static final int DEFAULT_IO_RATE_BYTES_PER_SECOND = 0;
 
   private Properties columnFamilyOptions = new Properties();
@@ -41,6 +49,8 @@ public final class RocksDbConfiguration {
   private int minWriteBufferNumberToMerge = DEFAULT_MIN_WRITE_BUFFER_NUMBER_TO_MERGE;
   private boolean walDisabled = DEFAULT_WAL_DISABLED;
 
+  private boolean sstPartitioningEnabled = DEFAULT_SST_PARTITIONING_ENABLED;
+
   /**
    * Defines how many files are kept open by RocksDB, per default it is unlimited (-1). This is done
    * for performance reasons, if we set a value higher than zero it needs to keep track of open
@@ -135,4 +145,13 @@ public RocksDbConfiguration setWalDisabled(final boolean walDisabled) {
     this.walDisabled = walDisabled;
     return this;
   }
+
+  public boolean isSstPartitioningEnabled() {
+    return sstPartitioningEnabled;
+  }
+
+  public RocksDbConfiguration setSstPartitioningEnabled(final boolean sstPartitioningEnabled) {
+    this.sstPartitioningEnabled = sstPartitioningEnabled;
+    return this;
+  }
 }
diff --git a/zb-db/src/main/java/io/camunda/zeebe/db/impl/rocksdb/ZeebeRocksDbFactory.java b/zb-db/src/main/java/io/camunda/zeebe/db/impl/rocksdb/ZeebeRocksDbFactory.java
@@ -32,6 +32,7 @@
 import org.rocksdb.RateLimiter;
 import org.rocksdb.RocksDB;
 import org.rocksdb.RocksDBException;
+import org.rocksdb.SstPartitionerFixedPrefixFactory;
 import org.rocksdb.Statistics;
 import org.rocksdb.StatsLevel;
 import org.rocksdb.TableFormatConfig;
@@ -183,6 +184,11 @@ private ColumnFamilyOptions createDefaultColumnFamilyOptions(
 
     final var tableConfig = createTableFormatConfig(closeables, blockCacheMemory);
 
+    if (rocksDbConfiguration.isSstPartitioningEnabled()) {
+      columnFamilyOptions.setSstPartitionerFactory(
+          new SstPartitionerFixedPrefixFactory(Long.BYTES));
+    }
+
     return columnFamilyOptions
         // to extract our column family type (used as prefix) and seek faster
         .useFixedLengthPrefixExtractor(Long.BYTES)