[SPARK-37224][SS] Optimize write path on RocksDB state store provider #34502
Conversation
@@ -196,6 +210,29 @@ class RocksDB(
    }
  }

  private def countKeys(): Long = {
This doesn't leverage iterator() since here we don't need to deserialize the key-value from RocksDB to produce UnsafeRowPair.
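To illustrate the point, here is a minimal sketch (with an illustrative stub trait, not the actual Spark or RocksDB API) of a count-only scan: it only advances over raw entries and counts them, never deserializing key/value bytes into an `UnsafeRowPair`, and since nothing escapes to the caller the iterator can be closed in `finally`:

```scala
// Illustrative stand-in for a raw RocksDB iterator; names are hypothetical.
trait RawKeyIterator {
  def seekToFirst(): Unit
  def isValid: Boolean
  def next(): Unit
  def close(): Unit
}

def countKeys(iter: RawKeyIterator): Long = {
  try {
    var count = 0L
    iter.seekToFirst()
    // Count entries without touching key/value bytes at all.
    while (iter.isValid) {
      count += 1
      iter.next()
    }
    count
  } finally {
    // Closed in finally: unlike iterator(), this method does not
    // hand the iterator back to the caller.
    iter.close()
  }
}
```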
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #144951 has finished for PR 34502 at commit

Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144952 has finished for PR 34502 at commit

@zsxwing @viirya @xuanyuanking Please take a look. Thanks in advance!
Please note that 553 lines are benchmark code and its result. I'll separate the benchmark and the result into the next PR on top of this one, to reduce the amount of actual code to review.
Force-pushed from 1715105 to 01e9db5
Friendly reminder, @tdas @zsxwing @viirya @xuanyuanking
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145329 has finished for PR 34502 at commit

Left some minor comments
@@ -1956,8 +1956,21 @@ Here are the configs regarding to RocksDB instance of the state store provider:
    <td>Whether we resets all ticker and histogram stats for RocksDB on load.</td>
    <td>True</td>
  </tr>
  <tr>
    <td>spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows</td>
    <td>Whether we track the total number of rows in state store. Please refer the details in "Performance-aspect considerations".</td>
It's better to use a link such as in [Performance-aspect considerations](#performance-aspect-considerations)
(the link syntax may be wrong. Could you try to build the docs locally to check it?)
OK will check it.
    // Attempt to close this iterator if there is a task failure, or a task interruption.
    // This is a hack because it assumes that the RocksDB is running inside a task.
    Option(TaskContext.get()).foreach { tc =>
      tc.addTaskCompletionListener[Unit] { _ => iter.close() }
    }
We can put `iter.close()` in `finally` instead. This method doesn't return an Iterator to the caller.
Nice finding! I was blindly following the iterator() method, my bad.

    val numKeys = if (!conf.trackTotalNumberOfRows) {
      // we don't track the total number of rows - discard the number being tracked
      -1L
Could you point out where we turn `-1` to `0`? I don't find it.
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
Lines 40 to 44 in 6450f6b

class SQLMetric(val metricType: String, initValue: Long = 0L) extends AccumulatorV2[Long, Long] {
  // This is a workaround for SPARK-11013.
  // We may use -1 as initial value of the accumulator, if the accumulator is valid, we will
  // update it at the end of task and the value will be at least 0. Then we can filter out the -1
  // values before calculate max, min, etc.
Even if we separate the values for "no key" vs "don't know", the value will go through SQLMetric, and negative values do not contribute to accumulation.
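The SPARK-11013 workaround quoted above can be sketched in a few lines (a simplified stand-in, not the actual `SQLMetric` implementation): per-task values that were never updated stay at the `-1` initial value and are filtered out before aggregating, so a "don't know" never pollutes the sum or max.

```scala
// Simplified model of how -1 "invalid" markers are dropped before
// aggregating task-level metric values (illustrative, not Spark's code).
def aggregateMetric(taskValues: Seq[Long]): (Long, Long) = {
  val valid = taskValues.filter(_ >= 0L) // drop -1 initial values
  val sum = valid.sum
  val max = if (valid.nonEmpty) valid.max else 0L
  (sum, max)
}
```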
Thanks for the pointer.
Thanks! I addressed the review comments. Please have a look at the new changes. Thanks in advance!
Test build #145366 has finished for PR 34502 at commit

IDEA made fun of me; it didn't flag the compilation error. Just fixed.
Kubernetes integration test unable to build dist. exiting with code: 1
LGTM
Kubernetes integration test starting
Kubernetes integration test status failure
Thanks for the review @zsxwing ! Merging to master.
Sorry for the late LGTM. Just some small comments. We can address them in the second PR.
</table>

##### Performance-aspect considerations

1. For write-heavy workloads, you may want to disable the track of total number of rows.
Do we have other considerations here? Or will we add more in the future? (Just want to double-check that the `1.` is not a typo.)
What does "write-heavy workloads" mean in this context? Should we use terms that are more understandable in a streaming context, e.g. throughput, or rows per second?
Because this seems to refer to the state store, I'm not sure how users would measure whether their workload is write-heavy on the state store.
`1.` is not a typo. I just wanted to reserve space where we would eventually add more. I'm not an expert on RocksDB, so I don't have the insight to write tuning guides, but RocksDB itself seems to provide lots of things to tune, so more may come up later.
I agree that "write-heavy workloads" sounds unclear; basically it means a higher amount of updates (writes/deletes) against the state store. This cannot be inferred from the volume of inputs, since it depends on the operator and the window: if the input produces lots of state keys in a streaming aggregation, it will issue lots of writes against the state store, whereas if the inputs are huge but bind to only a few windows, there will be only a few writes against the state store.
Probably we can leverage the state metrics "rows to update" and "rows to delete"; they represent the amount of updates. Technically this change doesn't introduce a performance regression in any workload, so it's not limited to write-heavy workloads; we make a trade-off on observability, and it's up to end users to choose performance vs. observability.
It looks like it'd be better to remove the phrase "For write-heavy workloads" and simply say "to gain additional performance on state store", hinting that it will be more effective when the state metrics "rows to update" and "rows to delete" are high.
Thanks for the inputs!
@@ -144,26 +156,28 @@ class RocksDB(
   * Put the given value for the given key and return the last written value.
Please also change the comment correspondingly.
Nice finding! Missed that.
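For context, the signature change under discussion can be sketched with a plain map standing in for RocksDB and its WriteBatch (all names here are illustrative, not the actual Spark code): the old `put` paid a lookup on every write just to return the previous value, while the new one writes blindly and returns nothing.

```scala
import scala.collection.mutable

// Stub store contrasting the old read-then-write path with the new blind write.
final class StubStore {
  private val data = mutable.Map.empty[String, String]
  var lookups = 0 // counts reads performed on the write path

  // Old shape: look up the previous value only to return it.
  def putOld(key: String, value: String): String = {
    lookups += 1
    val old = data.getOrElse(key, null)
    data(key) = value
    old
  }

  // New shape: blind write, no lookup, returns Unit.
  def putNew(key: String, value: String): Unit = data(key) = value
}
```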
    }
    oldValue
    writeBatch.put(key, value)
  }

  /**
   * Remove the key if present, and return the previous value if it was present (null otherwise).
Same here
lgtm. Just a question about the doc.
Test build #145368 has finished for PR 34502 at commit

### What changes were proposed in this pull request?
This PR is a follow-up of #34502 to address post-reviews. It rewords the explanation of performance tuning on the RocksDB state store to make it less confusing, and also fixes the method docs to be in sync with the code changes.

### Why are the changes needed?
1. The explanation of performance tuning on the RocksDB state store was unclear in a couple of spots.
2. We changed the method signature, but the change was not reflected in the method doc.

### Does this PR introduce _any_ user-facing change?
Yes, end users will be less confused by the explanation of performance tuning on the RocksDB state store.

### How was this patch tested?
N/A

Closes #34652 from HeartSaVioR/SPARK-37224-follow-up-postreview.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to add a new benchmark to measure the performance of basic state store operations, plus the result file. The proposed change of SPARK-37224 (#34502) is applied in the benchmark. As the benchmark numbers show, turning off the config brings a large performance gain from a micro-benchmark perspective, while it is still slower than the memory-based state store.

### Why are the changes needed?
To track and verify further performance improvements.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The result file from a manual run is included in this PR.

Closes #34630 from HeartSaVioR/SPARK-37224-follow-up-benchmark.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
What changes were proposed in this pull request?
This PR proposes to optimize the write path on RocksDB by removing an unnecessary lookup. Removing the lookup unfortunately also makes it infeasible to track the number of rows, so this PR also introduces a new configuration for the RocksDB state store provider to let end users turn tracking on and off based on their needs.
The new configuration is the following:
spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows
We will report "0" for the number of keys in the state store metric when the config is turned off. The ideal value would be a negative one, but currently SQL metrics don't allow negative values.
We also handle the case where the config is flipped during restart. This lets end users enjoy the benefit without losing the chance to know the number of state rows: they can turn off the flag to maximize performance, and turn it on (restart required) when they want to see the actual number of keys (for observability/debugging/etc.).
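In practice, flipping the flag could look like the following sketch (the config key is the one introduced by this PR; setting it through the session conf before starting the query is an assumption about typical usage, not a prescribed procedure):

```scala
// Trade the "number of total state rows" metric for write-path performance;
// re-enable it later (query restart required) when the count is needed again.
spark.conf.set(
  "spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows", "false")
```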
Why are the changes needed?
This addresses an unnecessary lookup in the write path whose only purpose is to track the number of rows. While the metric is part of the basic metrics for stateful operators, we can sacrifice some observability to gain performance under heavy write load.
Does this PR introduce any user-facing change?
Yes, a new configuration is added. This is neither a backward-incompatible change nor a behavior change, since the default value of the flag retains the current behavior.
But there's a glitch regarding rolling back to a previous Spark version: if you run a query with the config turned off (so that the number of keys is lost) and restart the query on an older Spark version, the older version will still try to track the number, and the number will get messed up. You may want to turn the config on and run some micro-batches before going back to the previous Spark version.
How was this patch tested?
New UT. A benchmark will follow in the next PR.