
[SPARK-47272][SS] Add MapState implementation for State API v2. #45341

Closed
wants to merge 17 commits

Conversation

jingz-db
Contributor

What changes were proposed in this pull request?

This PR adds the MapState implementation for State API v2. The implementation adds a new encoder/decoder that encodes the grouping key and the user key into a composite key stored in RocksDB, so that a key-value pair can be retrieved for a user-specified key with a single RocksDB get.
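As a rough sketch of the composite-key idea (the helper below is illustrative and is not the PR's actual encoder), the grouping-key bytes and user-key bytes can be concatenated behind a length prefix so one RocksDB get can look up a single user key:

```scala
import java.nio.ByteBuffer

object CompositeKeySketch {
  // Length-prefix the grouping key so the two byte arrays can be split apart
  // again when decoding. This mirrors the idea of a composite key, not the
  // exact layout used by the PR.
  def encode(groupingKey: Array[Byte], userKey: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(4 + groupingKey.length + userKey.length)
    buf.putInt(groupingKey.length)
    buf.put(groupingKey)
    buf.put(userKey)
    buf.array()
  }

  def decode(composite: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val buf = ByteBuffer.wrap(composite)
    val groupingKey = new Array[Byte](buf.getInt())
    buf.get(groupingKey)
    val userKey = new Array[Byte](buf.remaining())
    buf.get(userKey)
    (groupingKey, userKey)
  }
}
```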

Why are the changes needed?

These changes are needed to support map values in the State Store. They are part of the work to add a new stateful streaming operator for arbitrary state management, which provides the new features listed in the SPIP JIRA here: https://issues.apache.org/jira/browse/SPARK-45939

Does this PR introduce any user-facing change?

Yes
This PR introduces a new state type (MapState) that users can use in their Spark streaming queries.

How was this patch tested?

Unit tests in TransformWithMapStateSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@jingz-db jingz-db marked this pull request as ready for review March 4, 2024 19:06
@jingz-db
Contributor Author

jingz-db commented Mar 4, 2024

Thanks Eric for the reviews on my old PR. I've resolved them and already incorporated the changes into this one.

@jingz-db jingz-db changed the title [SS] Add MapState implementation for State API v2. [SPARK-47272][SS] Add MapState implementation for State API v2. Mar 4, 2024
// get grouping key byte array
val keyByteArr = keySerializer.apply(groupingKey).asInstanceOf[UnsafeRow].getBytes()
// get user key byte array
val userKeySerializer = encoderFor(userKeyEnc).createSerializer()
Contributor

Could we reuse this instead of creating a new one each time?
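One possible way to do that (a sketch, assuming the encoder is known when the class is constructed; `serializeUserKey` and the type parameter `K` are placeholder names) is to build the serializer once as a field and reuse it:

```scala
// Sketch: create the user-key serializer once instead of calling
// encoderFor(userKeyEnc).createSerializer() on every encode call.
private val userKeySerializer = encoderFor(userKeyEnc).createSerializer()

private def serializeUserKey(userKey: K): Array[Byte] =
  userKeySerializer.apply(userKey).asInstanceOf[UnsafeRow].getBytes()
```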

require(key != null, "User key cannot be null.")
val encodedCompositeKey = stateTypesEncoder.encodeCompositeKey(key, userKeyExprEnc)
val unsafeRowValue = store.get(encodedCompositeKey, stateName)
if (unsafeRowValue == null) {
Contributor

To be consistent, I think we can return null here, similar to the other state types:

null.asInstanceOf[V]

}
val groupingKey = keyOption.get.asInstanceOf[GK]
// generate grouping key byte array
val groupingKeyByteArr = keySerializer.apply(groupingKey).asInstanceOf[UnsafeRow].getBytes()
Contributor

I guess you can just directly call keyOption.get here?

Contributor Author

The compiler will complain on the next line, keySerializer.apply(groupingKey), because groupingKey would be of type Any if we directly call keyOption.get.
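A minimal standalone illustration of the type issue (all names below are stand-ins, not the PR's code):

```scala
object GroupingKeyTypeExample {
  type GK = String                                            // stand-in for the grouping-key type
  val keySerializer: GK => Array[Byte] = _.getBytes("UTF-8")  // stand-in serializer

  // The implicit key tracker exposes the key as Option[Any].
  val keyOption: Option[Any] = Some("someGroupingKey")

  // keySerializer(keyOption.get)                      // does not compile: found Any, required GK
  val groupingKey: GK = keyOption.get.asInstanceOf[GK] // explicit cast restores the expected type
  val keyBytes: Array[Byte] = keySerializer(groupingKey)
}
```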

}
}

object CompositeKeyStateEncoder {
Contributor

Hmm - do we really need these singleton objects? Could we just call new CompositeKeyStateEncoder instead?
cc @HeartSaVioR - is this the preferred pattern for Spark?

Contributor Author

I was following Bhuwan's style in the base class. Maybe I am missing something, but I did not find anything relevant in the style guide.

Contributor

If the list of parameters is exactly the same as the default constructor's, let's just use new.
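For illustration (hypothetical class, not the PR's actual encoder), a companion apply adds nothing when it only forwards the same parameters, so callers can use new directly:

```scala
class EncoderSketch(stateName: String, useColumnFamilies: Boolean)

// A companion object like this only forwards to the constructor:
object EncoderSketch {
  def apply(stateName: String, useColumnFamilies: Boolean): EncoderSketch =
    new EncoderSketch(stateName, useColumnFamilies)
}

// ...so the call site can simply be:
// val encoder = new EncoderSketch("mapState", useColumnFamilies = true)
```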

useColumnFamilies = useColumnFamilies)
}

protected def newStoreProviderWithValueState(
Contributor

Can we just rename this function to be generic? I guess the specific state variable schema is managed by the individual type information anyway, right?

import org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
import org.apache.spark.sql.internal.SQLConf

case class InputMapRow(key: String, action: String, value: (String, String))
Contributor

Could we add a class-level comment describing what this test suite does?

}
}

test("Test put value with null value") {
Contributor

Can we add a test for a batch query using MapState?

@anishshri-db
Contributor

@jingz-db - the test failure seems related?

@jingz-db
Contributor Author

jingz-db commented Mar 6, 2024

> @jingz-db - the test failure seems related?

Weirdly, it passes locally. Let me resolve your comments, retrigger the CI, and see if it still fails. Thanks for the review!

@anishshri-db
Contributor

@jingz-db - could you please fix this style error?

[info] compiling 1 Scala source to /home/runner/work/spark/spark/tools/target/scala-2.13/classes ...
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/ValueStateSuite.scala:195:0: Whitespace at end of line
[info]   Compilation completed in 14.316s.

@anishshri-db (Contributor) left a comment

LGTM! Pending CI completion with success.

@HeartSaVioR (Contributor) left a comment

Mostly looks good. Left some comments.

* @tparam V - type of value for map state variable
* @return - instance of MapState of type [K,V] that can be used to store state persistently
*/
def getMapState[K, V](stateName: String, userKeyEnc: Encoder[K]): MapState[K, V]
Contributor

Just a friendly reminder: I expect the value encoder will be added as a parameter as well once this is rebased.
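A hedged sketch of how the signature might look after the rebase (the parameter name valEncoder is an assumption):

```scala
// Assumed post-rebase shape: the value encoder is passed explicitly
// alongside the user-key encoder.
def getMapState[K, V](
    stateName: String,
    userKeyEnc: Encoder[K],
    valEncoder: Encoder[V]): MapState[K, V]
```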

store.prefixScan(encodedGroupingKey, stateName)
.map {
case iter: UnsafeRowPair =>
(stateTypesEncoder.decodeCompositeKey(iter.key),
Contributor

Note: as you will rebase with #45447, UnsafeProjection will reuse the row instance, so we can't store the row persistently unless we copy it. If we do copy, we probably want to reduce the scope to key-value vs key vs value.

Maybe it would be good to have a private method that decodes key and value into an iterator without creating a map. Each method can get the iterator from the new method, pick key / value / both, and then copy the rows.
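A rough sketch of that suggestion (the helper and the decoder/encoder method names are assumptions, not the PR's exact API), where one private method exposes the prefix-scan iterator and each public method projects and defensively copies what it needs:

```scala
// Hypothetical shape: one private method returns the raw prefix-scan iterator;
// public methods decode key / value / both, copying rows defensively because
// the underlying store may reuse the row instances.
private def unsafeRowPairs(): Iterator[UnsafeRowPair] = {
  // Assumed helper: the grouping key is re-encoded from the implicit key tracker per call.
  val encodedGroupingKey = stateTypesEncoder.encodeGroupingKey()
  store.prefixScan(encodedGroupingKey, stateName)
}

def iterator(): Iterator[(K, V)] = unsafeRowPairs().map { pair =>
  (stateTypesEncoder.decodeCompositeKey(pair.key.copy()),
    stateTypesEncoder.decodeValue(pair.value.copy()))
}

def keys(): Iterator[K] =
  unsafeRowPairs().map(pair => stateTypesEncoder.decodeCompositeKey(pair.key.copy()))

def values(): Iterator[V] =
  unsafeRowPairs().map(pair => stateTypesEncoder.decodeValue(pair.value.copy()))
```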

Contributor Author

Thanks for the input! I'm not sure if I understand you correctly - are you saying that we want to return an Iterator instead of a Map to reduce the copying, and that we need to use different reused rows for key/value in StateTypesEncoder?

Contributor Author

Hi @anishshri-db, I need your input on this: do we want to return a Map or an Iterator from the getMap function?
I talked with Jungtaek on Slack; if we decide to return a Map, we'll probably need to materialize the map and copy everything in it into memory (because we reuse the UnsafeRow in StateTypesEncoder), so Jungtaek is concerned about the case where we have a large map. I also feel that returning an Iterator makes more sense, because ListState also returns an Iterator from its get list function.

Contributor

Yeah, sure - let's use the iterator approach. Thanks!

}
}

object CompositeKeyStateEncoder {
Contributor

If the list of parameters is exactly the same as the default constructor's, let's just use new.

testStream(result, OutputMode.Update())(
AddData(inputData, inputMapRow),
ExpectFailure[SparkIllegalArgumentException](e => {
assert(e.getMessage.contains("ILLEGAL_STATE_STORE_VALUE.NULL_VALUE"))
Contributor

Could we verify the exception with checkError?
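For example, a hedged sketch of what that could look like (the parameters map and its values are assumptions about this error class):

```scala
// Sketch: assert on the error class via checkError instead of matching
// on the message string. Parameter names/values below are assumptions.
ExpectFailure[SparkIllegalArgumentException](e => {
  checkError(
    exception = e.asInstanceOf[SparkIllegalArgumentException],
    errorClass = "ILLEGAL_STATE_STORE_VALUE.NULL_VALUE",
    parameters = Map("stateName" -> "mapState"))
})
```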

@@ -67,6 +67,7 @@ class MapStateSuite extends StateVariableSuiteBase {
assert(!testState.exists())
assert(testState.getMap().hasNext === false)
}
ImplicitGroupingKeyTracker.removeImplicitKey()
Contributor

Maybe just do this in setup (before) or teardown (after) in StateVariableSuiteBase, to ensure the cleanup is guaranteed to run - otherwise a failure in test A won't clean up the thread local and will introduce another failure. We'd like to avoid cascading failures.
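A sketch of what that could look like in the shared base (assuming the suite can mix in ScalaTest's BeforeAndAfterEach; class name and method placement are illustrative):

```scala
import org.scalatest.BeforeAndAfterEach

abstract class StateVariableSuiteBaseSketch extends SparkFunSuite with BeforeAndAfterEach {
  override def afterEach(): Unit = {
    try {
      // Always drop the thread-local grouping key so a failing test
      // cannot leak state into the next one (avoids cascading failures).
      ImplicitGroupingKeyTracker.removeImplicitKey()
    } finally {
      super.afterEach()
    }
  }
}
```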

Contributor Author

Done! Thanks Jungtaek!

}.toMap
val pairsIterator = store.prefixScan(encodedGroupingKey, stateName)

new Iterator[(K, V)] {
Contributor

Maybe a final nit: your previous code just works and seems much simpler. We just need to remove the .toMap call and we're done.

Also, maybe rename the method to remove "map" from the name?

  • getIterator - consistent with getKeys/getValues
  • iterator - consistent with the Map collection; we'd need to change other methods as well, e.g. getKeys to keys and getValues to values
  • get() - consistent with the other state types; the value of the original type is retrieved with get() consistently

Contributor

It's OK to defer the method-name change to a follow-up PR if we want time to figure out the best name. The first one is something that is good to do before merging the code.

Contributor Author

Thanks Jungtaek! As discussed, I renamed getMap() to iterator(), and renamed getKeys() and getValues() to keys() and values() accordingly.
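A hedged sketch of the resulting MapState surface after the renames (signatures are approximate and may differ from the merged code):

```scala
// Approximate shape of the map state interface after renaming; the exact
// merged signatures may differ.
trait MapStateSketch[K, V] extends Serializable {
  def exists(): Boolean
  def getValue(key: K): V
  def containsKey(key: K): Boolean
  def updateValue(key: K, value: V): Unit
  def iterator(): Iterator[(K, V)]   // previously getMap()
  def keys(): Iterator[K]            // previously getKeys()
  def values(): Iterator[V]          // previously getValues()
  def removeKey(key: K): Unit
  def clear(): Unit
}
```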

@jingz-db
Contributor Author

The test suite failure (in PySpark) seems unrelated.

@HeartSaVioR
Contributor

Yeah, it doesn't look related. I'm ignoring the CI error.

@HeartSaVioR (Contributor) left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to master.

sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024

Closes apache#45341 from jingz-db/map-state-impl.

Lead-authored-by: jingz-db <jing.zhan@databricks.com>
Co-authored-by: Jing Zhan <135738831+jingz-db@users.noreply.github.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>