@anishshri-db
Contributor

@anishshri-db anishshri-db commented Sep 13, 2024

What changes were proposed in this pull request?

Add a flatten option for processing collection types with the state data source reader.

Why are the changes needed?

This change lets the reader return collection entries row by row, for cases where there is not enough memory to fit an entire collection inside a single row.
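
To illustrate the idea only, here is a minimal, Spark-free sketch of what flattening means. The case classes `CollectionRow` and `FlatRow` and the helper `flatten` are hypothetical names made up for this example; they are not the reader's actual schema or implementation. The sketch assumes each state row carries a grouping key, a partition id, and a collection of values, and it emits one output row per element.

```scala
// Hypothetical, simplified model of state rows; NOT the reader's real schema.
final case class CollectionRow(key: String, values: Seq[Int], partitionId: Int)
final case class FlatRow(key: String, value: Int, partitionId: Int)

object FlattenSketch {
  // Emit one output row per collection element. Working on iterators keeps the
  // output lazy, so a consumer can process entries one at a time.
  def flatten(rows: Iterator[CollectionRow]): Iterator[FlatRow] =
    rows.flatMap { r =>
      r.values.iterator.map(v => FlatRow(r.key, v, r.partitionId))
    }

  def main(args: Array[String]): Unit = {
    val rows = Iterator(
      CollectionRow("a", Seq(1, 2, 3), 0),
      CollectionRow("b", Seq(4), 1)
    )
    // Prints four flattened rows: three for key "a" and one for key "b".
    flatten(rows).foreach(println)
  }
}
```

In the unflattened form, key `a` would be a single row holding all three values; in the flattened form, each value gets its own row with the key repeated.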

Does this PR introduce any user-facing change?

Yes

Users can provide the following query option:

        val stateReaderDf = spark.read
          .format("statestore")
          .option(StateSourceOptions.PATH, <state_checkpoint_loc>)
          .option(StateSourceOptions.STATE_VAR_NAME, <state_var_name>)
          .option(StateSourceOptions.FLATTEN_COLLECTION_TYPES, <true | false>)
          .load()

How was this patch tested?

Added unit tests

[info] Run completed in 1 minute, 10 seconds.
[info] Total number of tests run: 12
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 13, 2024
@anishshri-db anishshri-db changed the title from "[SPARK-49630] Add flatten option to process collection types with state data source reader" to "[SPARK-49630][SS] Add flatten option to process collection types with state data source reader" Sep 13, 2024
@anishshri-db
Contributor Author

cc - @HeartSaVioR @jingz-db - PTAL, thx !

@pavel0fadeev

If you are working on this file anyway, could you also fix the typo in the comment here?


the the state -> the state

@anishshri-db
Contributor Author

> If you are working on this file anyway, could you also fix the typo in the comment here?
>
> the the state -> the state

Sure, done.

Contributor

@jingz-db jingz-db left a comment

Small nits, otherwise LGTM! Thanks for making the change! Shall we also document in the PR's API section that the default value for the flatten option is True?

Contributor

@HeartSaVioR HeartSaVioR left a comment

First pass. Mostly minors.

Contributor

@HeartSaVioR HeartSaVioR left a comment

+1 pending CI

@HeartSaVioR
Contributor

The build failed only in a test suite from Spark Connect, which seems to be flaky and unrelated to this change.
https://github.com/anishshri-db/spark/actions/runs/11008519536/job/30567579146

@HeartSaVioR
Contributor

Thanks! Merging to master.
