[SS][SPARK-47331] Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2. #45447

jingz-db · 2024-03-08T23:25:10Z

What changes were proposed in this pull request?

In the new operator for arbitrary state v2, we cannot rely on the session or encoder being available, because the various state instances are initialized on the executors. For state serialization, we therefore propose that users explicitly pass in an encoder for each state variable, and that primitives, case classes, and POJOs be serialized with the SQL encoder. Leveraging the SQL encoder can speed up serialization.

Why are the changes needed?

These changes are needed to provide a dedicated serializer for state v2.
They are part of the work to add a new stateful streaming operator for arbitrary state management, which provides the new features listed in the SPIP JIRA: https://issues.apache.org/jira/browse/SPARK-45939

Does this PR introduce any user-facing change?

Users will need to specify the SQL encoder for their state variable:
def getValueState[T](stateName: String, valEncoder: Encoder[T]): ValueState[T]
def getListState[T](stateName: String, valEncoder: Encoder[T]): ListState[T]

For a primitive type, the encoder is something like Encoders.scalaLong; for a case class, Encoders.product[CaseClass]; for a POJO, Encoders.bean(classOf[POJOClass]).
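As a rough illustration of the three encoder choices named above, the following Scala sketch builds one encoder of each kind. `UserCount` and `UserBean` are hypothetical types introduced only for this example, and the commented call at the end assumes a `handle` that exposes the `getValueState` signature quoted in this description.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import scala.beans.BeanProperty

// Hypothetical case class, used only for this illustration.
case class UserCount(user: String, count: Long)

// Hypothetical JavaBean-style class: a no-arg constructor plus
// getters and setters (generated here by @BeanProperty).
class UserBean extends Serializable {
  @BeanProperty var name: String = _
  @BeanProperty var id: Int = 0
}

object EncoderChoices {
  // Primitive type
  val longEncoder: Encoder[Long] = Encoders.scalaLong
  // Case class (any Product type with a TypeTag)
  val caseClassEncoder: Encoder[UserCount] = Encoders.product[UserCount]
  // POJO / JavaBean
  val beanEncoder: Encoder[UserBean] = Encoders.bean(classOf[UserBean])

  // Assumed usage, following the signature quoted above:
  // val countState = handle.getValueState[Long]("countState", longEncoder)
}
```

None of these encoders needs an active session to be constructed, which fits the constraint that state instances are initialized on the executors.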

How was this patch tested?

Unit tests for primitives, case classes, and POJOs, each covered separately in ValueStateSuite.

Was this patch authored or co-authored using generative AI tooling?

No

anishshri-db · 2024-03-09T00:42:47Z

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala

 * @param stateName - name of logical state partition
 * @tparam GK - grouping key type
+ * @tparam S - value type


Nit: Could we rename the type param to V instead?

anishshri-db · 2024-03-09T00:44:09Z

sql/core/src/test/java/org/apache/spark/sql/execution/streaming/state/POJOTestClass.java

+    }
+
+    public POJOTestClass(String name, int id) {
+        this.name = name;


indent/spacing seems off?

anishshri-db · 2024-03-09T00:44:55Z

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/ValueStateSuite.scala

@@ -93,7 +96,7 @@ class ValueStateSuite extends SharedSparkSession
        Encoders.STRING.asInstanceOf[ExpressionEncoder[Any]])

      val stateName = "testState"
-      val testState: ValueState[Long] = handle.getValueState[Long]("testState")
+      val testState: ValueState[Long] = handle.getValueState[Long]("testState", Encoders.scalaLong)


Can we also add tests for other common types such as Double, String, etc.?

anishshri-db · 2024-03-09T01:13:04Z

sql/core/src/test/java/org/apache/spark/sql/execution/streaming/state/POJOTestClass.java

+  }
+
+  // Getter methods
+  public String getName() {


Not sure if this is a GH thing, but this still seems a little off to me?

There was a weird style correction happening without me realizing it. Fixed now!

anishshri-db

lgtm ! thx

HeartSaVioR

+1

Btw, it might be better to have two user-facing APIs, one Scala-friendly and one Java-friendly. See the flatMapGroupsWithState methods: providing the encoder explicitly is only required for the Java-friendly API, while in the Scala-friendly API people can use an implicit. This could be done as a follow-up JIRA ticket.
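The explicit-versus-implicit split suggested here can be sketched roughly as follows. `SketchHandle` and both method names are hypothetical stand-ins invented for this example, not Spark's state handle API; the methods return a description string rather than a real state object so that the sketch stays self-contained.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object TwoApiSketch {
  // Hypothetical stand-in for a state handle (not Spark's actual API).
  final class SketchHandle {
    // Java-friendly shape: the caller passes the encoder explicitly.
    def valueStateExplicit[T](stateName: String, valEncoder: Encoder[T]): String =
      s"$stateName: ${valEncoder.schema.simpleString}"

    // Scala-friendly shape: the encoder is resolved from implicit scope
    // through a context bound and forwarded to the explicit variant.
    def valueStateImplicit[T: Encoder](stateName: String): String =
      valueStateExplicit(stateName, implicitly[Encoder[T]])
  }

  // In an application this would usually come from spark.implicits._;
  // it is declared by hand here so that no session is required.
  implicit val longEncoder: Encoder[Long] = Encoders.scalaLong

  val handle = new SketchHandle
  val explicitResult: String = handle.valueStateExplicit[Long]("count", Encoders.scalaLong)
  val implicitResult: String = handle.valueStateImplicit[Long]("count")
}
```

The two methods carry different names in this sketch because a context bound compiles to an extra `Encoder` parameter, so two overloads taking `(String, Encoder)` would clash after erasure.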

HeartSaVioR · 2024-03-11T00:19:48Z

The GA failure happened only in PySpark Connect, and the failed suites are unrelated to this change.

HeartSaVioR · 2024-03-11T00:19:58Z

Thanks! Merging to master.

Closes apache#45447 from jingz-db/sql-encoder-state-v2.

Authored-by: jingz-db <jing.zhan@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>

jingz-db · 2024-03-14T21:30:15Z

+1

Btw, it might be better to have two user-facing APIs, one Scala-friendly and one Java-friendly. See the flatMapGroupsWithState methods: providing the encoder explicitly is only required for the Java-friendly API, while in the Scala-friendly API people can use an implicit. This could be done as a follow-up JIRA ticket.

Created a task here: https://issues.apache.org/jira/browse/SPARK-47403

SQL encoder

202a759

github-actions bot added SQL STRUCTURED STREAMING labels Mar 8, 2024

jingz-db changed the title ~~SQL encoder~~ [SS] Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2. Mar 8, 2024

jingz-db marked this pull request as ready for review March 8, 2024 23:31

add doc string

534cd7a

jingz-db changed the title ~~[SS] Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2.~~ [SS][SPARK-47331] Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2. Mar 8, 2024

anishshri-db reviewed Mar 9, 2024

anishshri-db reviewed Mar 9, 2024

resolve comments, add more test cases

e554bac

anishshri-db reviewed Mar 9, 2024

jingz-db added 2 commits March 8, 2024 17:16

java style

4495af4

java style

002f6b7

anishshri-db approved these changes Mar 9, 2024

HeartSaVioR approved these changes Mar 11, 2024

HeartSaVioR closed this in afbebfb Mar 11, 2024

HeartSaVioR mentioned this pull request Mar 11, 2024

[SPARK-47272][SS] Add MapState implementation for State API v2. #45341

Closed
