[SPARK-46979][SS] Add support for specifying key and value encoder separately and also for each col family in RocksDB state store provider #45038
Conversation
@HeartSaVioR - PTAL, thx !
LGTM. Thanks for separating the encoders; this lets us evolve the key/value encoders independently and use both the multi-valued and prefix key encoders.
Only minor comments. Looks great overall.
@@ -215,7 +240,9 @@ private[sql] class RocksDBStateStoreProvider
     (keySchema.length > numColsPrefixKey), "The number of columns in the key must be " +
       "greater than the number of columns for prefix key!")

-    this.encoder = RocksDBStateEncoder.getEncoder(keySchema, valueSchema, numColsPrefixKey)
+    keyValueEncoderMap.putIfAbsent(StateStore.DEFAULT_COL_FAMILY_NAME,
(Maybe a microbenchmark could tell whether this regresses the default column family only, since every op now does a map lookup with a carefully crafted lock operation, though I'd rather not worry before we see an actual regression.)
Yeah, I didn't worry about it too much, given that provider init likely happens only once for long-lived queries, and we can retain the same provider on the same executor across microbatch executions.
No, what I meant is that existing stateful operators now look up the concurrent map on every op to find the encoder, whereas previously it was just a reference to a field. But ops are relatively very cheap compared to commit as of now, so let's see.
Ah ok - yeah, I mainly didn't want to maintain two data structures for this. But if we find that it's more expensive, we can split out some of the logic for the default col family case.
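To make the lookup-cost discussion above concrete, here is a minimal sketch (not Spark's actual code; `DemoEncoder` and the wiring below are illustrative stand-ins) of resolving the encoder through a `ConcurrentHashMap` on every operation, instead of dereferencing a plain field as before:

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative stand-in for Spark's RocksDBStateEncoder; not the real type.
final case class DemoEncoder(colFamilyName: String)

object EncoderLookupSketch {
  val DefaultColFamilyName = "default"

  // After the change, every get/put resolves the encoder via this map;
  // before, the default encoder was a plain field reference.
  private val keyValueEncoderMap = new ConcurrentHashMap[String, DemoEncoder]()
  keyValueEncoderMap.putIfAbsent(DefaultColFamilyName, DemoEncoder(DefaultColFamilyName))

  def encoderFor(colFamilyName: String): DemoEncoder = {
    val encoder = keyValueEncoderMap.get(colFamilyName)
    require(encoder != null, s"No encoder registered for column family $colFamilyName")
    encoder
  }

  def main(args: Array[String]): Unit = {
    assert(encoderFor(DefaultColFamilyName) == DemoEncoder(DefaultColFamilyName))
    println("default col family encoder resolved")
  }
}
```

The single map keeps one code path for all column families; if profiling ever shows the per-op lookup matters, a cached field for the default column family could be reintroduced.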
+1 pending CI
Thanks! Merging to master.
What changes were proposed in this pull request?
Add support for specifying key and value encoder separately and also for each col family in RocksDB state store provider
Why are the changes needed?
This change allows us to specify encoders for keys and values separately and avoid encoding additional bytes. It also allows us to set schemas/encoders for individual column families, which will be required for upcoming changes related to the transformWithState operator (listState, timer changes, etc.).
We are also refactoring a bit here given the upcoming changes, so we propose splitting the key and value encoders.
Key encoders can be of 2 types:
Value encoders can also eventually be of 2 types:
And we now also allow setting schema and getting encoder for each column family.
So after the change, we can potentially allow something like this:
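The example snippet that followed this sentence did not survive extraction. As a hedged sketch only (all types and method names below, such as `setColumnFamilySchema`, `getKeyEncoder`, and `getValueEncoder`, are hypothetical and may not match the provider's actual API), the per-column-family flow could look like:

```scala
import scala.collection.mutable

// All types and method names below are illustrative, not Spark's real API.
final case class DemoSchema(fields: Seq[String])
final case class DemoKeyEncoder(schema: DemoSchema, numColsPrefixKey: Int)
final case class DemoValueEncoder(schema: DemoSchema, multiValued: Boolean)

class DemoStateStoreProvider {
  private val encoders =
    mutable.Map.empty[String, (DemoKeyEncoder, DemoValueEncoder)]

  // Register a key/value schema pair per column family; key and value
  // encoders are now constructed independently of each other.
  def setColumnFamilySchema(
      colFamilyName: String,
      keySchema: DemoSchema,
      valueSchema: DemoSchema,
      numColsPrefixKey: Int = 0,
      multiValued: Boolean = false): Unit = {
    encoders(colFamilyName) =
      (DemoKeyEncoder(keySchema, numColsPrefixKey),
        DemoValueEncoder(valueSchema, multiValued))
  }

  def getKeyEncoder(colFamilyName: String): DemoKeyEncoder =
    encoders(colFamilyName)._1

  def getValueEncoder(colFamilyName: String): DemoValueEncoder =
    encoders(colFamilyName)._2
}

object PerColFamilyDemo {
  def main(args: Array[String]): Unit = {
    val provider = new DemoStateStoreProvider
    provider.setColumnFamilySchema("listState", DemoSchema(Seq("key")),
      DemoSchema(Seq("value")), multiValued = true)
    assert(provider.getValueEncoder("listState").multiValued)
    println("per-col-family encoders registered")
  }
}
```

The point of the sketch is only the shape of the API: schemas are registered per column family, and the key encoder (with its optional prefix-key column count) is chosen independently from the value encoder (single- vs multi-valued).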
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests
Was this patch authored or co-authored using generative AI tooling?
No