[SPARK-49630][SS] Add flatten option to process collection types with state data source reader #48110

anishshri-db · 2024-09-13T19:35:06Z

What changes were proposed in this pull request?

Add a flatten option to the state data source reader for processing collection types.

Why are the changes needed?

The option lets the reader emit collection entries row by row, which is needed when there is not enough memory to fit an entire collection inside a single row.

Does this PR introduce any user-facing change?

Yes

Users can provide the following query option:

        val stateReaderDf = spark.read
          .format("statestore")
          .option(StateSourceOptions.PATH, <state_checkpoint_loc>)
          .option(StateSourceOptions.STATE_VAR_NAME, <state_var_name>)
          .option(StateSourceOptions.FLATTEN_COLLECTION_TYPES, <true | false>)
          .load()
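
To make the row-by-row idea concrete, here is a minimal plain-Scala sketch of what flattening a collection-typed value means in general. The `CollectionRow` and `FlatRow` case classes and the `FlattenSketch` object are hypothetical names invented for illustration; they are not the classes or the output schema used by the state data source reader.

```scala
// Illustrative model only: these case classes are hypothetical and are not
// the schema or classes used by the state data source reader.
final case class CollectionRow(key: String, values: Seq[Long])
final case class FlatRow(key: String, value: Long)

object FlattenSketch {
  // Lazily expands each collection into one row per element, so a consumer
  // never needs a whole collection packed into a single output row.
  def flatten(rows: Iterator[CollectionRow]): Iterator[FlatRow] =
    rows.flatMap(row => row.values.iterator.map(v => FlatRow(row.key, v)))

  def main(args: Array[String]): Unit = {
    val grouped = Iterator(
      CollectionRow("user-1", Seq(10L, 20L, 30L)),
      CollectionRow("user-2", Seq(40L))
    )
    // Prints one FlatRow per collection element.
    flatten(grouped).foreach(println)
  }
}
```

Because the sketch works on iterators, elements are produced one at a time as the consumer pulls them. A key with an empty collection yields no rows in this model.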

How was this patch tested?

Added unit tests

[info] Run completed in 1 minute, 10 seconds.
[info] Total number of tests run: 12
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

Was this patch authored or co-authored using generative AI tooling?

No

anishshri-db · 2024-09-14T01:42:43Z

cc - @HeartSaVioR @jingz-db - PTAL, thx !

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

pavel0fadeev · 2024-09-15T20:39:03Z

If you are working on this file anyway, could you also fix the typo in the comment here?

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

Line 152 in 931ab06

* Source. It reads the the state at a particular batchId.

the the state -> the state

anishshri-db · 2024-09-16T03:38:44Z

Sure done

sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateSuite.scala

jingz-db

Small nits, otherwise LGTM! Thanks for making the change! Shall we also document in the PR API section that the default value for the flatten option is True?

HeartSaVioR

First pass. Mostly minors.

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/utils/SchemaUtil.scala

...apache/spark/sql/execution/datasources/v2/state/StateDataSourceTransformWithStateSuite.scala

HeartSaVioR

+1 pending CI

HeartSaVioR · 2024-09-24T10:05:33Z

The build failed only in a test suite from SparkConnect, which seems to be flaky and unrelated to this change.
https://github.com/anishshri-db/spark/actions/runs/11008519536/job/30567579146

HeartSaVioR · 2024-09-24T10:05:41Z

Thanks! Merging to master.

[SPARK-49630] Add flatten option to process collection types with state data source reader

4945c64

github-actions bot added the SQL label Sep 13, 2024

anishshri-db changed the title ~~[SPARK-49630] Add flatten option to process collection types with state data source reader~~ [SPARK-49630][SS] Add flatten option to process collection types with state data source reader Sep 13, 2024

anishshri-db added 3 commits September 13, 2024 13:11

Fix test

c684901

Misc updates

04a7b9a

Add tests

21711de

github-actions bot added the STRUCTURED STREAMING label Sep 14, 2024

Misc fix

a870be2

pavel0fadeev reviewed Sep 15, 2024

...ore/src/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StateDataSource.scala Outdated

pavel0fadeev reviewed Sep 15, 2024

...rc/main/scala/org/apache/spark/sql/execution/datasources/v2/state/StatePartitionReader.scala Outdated

Misc fix

d086f70

anishshri-db requested a review from pavel0fadeev September 16, 2024 03:38

jingz-db reviewed Sep 17, 2024

sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithStateSuite.scala Outdated

jingz-db approved these changes Sep 17, 2024

Address Jing's comments

da5abe5

HeartSaVioR reviewed Sep 23, 2024

anishshri-db added 3 commits September 23, 2024 18:33

Address Jungtaek's comments

4e77d0c

Misc fix

7e9a821

Test fix

c5a540b

anishshri-db requested a review from HeartSaVioR September 24, 2024 06:50

HeartSaVioR approved these changes Sep 24, 2024

HeartSaVioR closed this in 73d6bd7 Sep 24, 2024
