[SPARK-49676][SS][PYTHON] Add Support for Chaining of Operators in transformWithStateInPandas API #48124
Conversation
nit: why wait for 10?
Thanks for leaving comments on the state-v2 PR! Our test suite is not very large, and 10 seconds should be quite sufficient for it to finish. I am also following the same setup as the rest of the suite - you may find other test cases also using 10.
Since this PR is fairly lightweight, should we include that check in the same PR? If not, please create a Jira to track the TODO and link it here.
This is also the way it is on the Scala side here. Let me check if there is any specific reason for Bhuwan to leave this as a TODO.
Confirmed with Anish - the way we have it today is that we'll keep constructing new batches because we may have future timers expiring. I filed a SPARK JIRA to keep track of this issue: https://issues.apache.org/jira/browse/SPARK-50180 and will update the comments in both Scala and Python.
Since this PR is fairly lightweight, should we include that check in the same PR?
Some additional testing on AvailableNow trigger mode may be needed together with fixing the issue, and that exceeds the scope of this PR. I'll leave it as-is for now; let's keep track of the fix in SPARK-50180.
Could you elaborate more on why we need 2 entries for TransformWithStateInPandasExec? Also, can we have TransformWithStateInPandasExec-specific comments? Thanks!
We don't. Deleted a case class and will add more comments.
bogao007 left a comment
LGTM
HeartSaVioR left a comment
First pass. I haven't looked into the tests in detail, as addressing the review comments would simplify the changeset. I'll revisit the tests once the PR is updated.
EliminateEventTimeWatermark is no longer needed, as we now have this as an Analyzer rule.
But maybe we could deal with this in a separate PR to address other cases as well.
EliminateEventTimeWatermark is no longer needed as we now have this in Analyzer rule.
Is this already done? I saw we still keep the TODO here:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala, line 3933 (d6b7334):
* TODO: add this rule into analyzer rule list.
Yeah, I forgot to remove the TODO - my bad. The rule itself is already added.
I don't see any caller for this. Did you miss adding the caller?
Yeah... It got lost while resolving merge conflicts. I just added it back.
nit: the linter will probably complain?
Ran the linter check and it doesn't seem to complain... If it's OK with you, I'll leave it as-is and see if it passes on CI.
Instead of doing this in all places, shall we find a way for pytest to do the same with before()/after()?
Also, since we are reading from files, why not just use Trigger.AvailableNow(), which will terminate the query automatically instead of doing this? The sleep(10) won't be necessary either.
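For reference, a minimal sketch of the suggested shape - not the PR's actual test code; `df`, `check_results`, and the query name are assumptions standing in for the test suite's source and callback. With availableNow the query drains all available input and stops on its own, so the fixed sleep becomes unnecessary:

```python
q = (
    df.writeStream.queryName("tws_chaining_test")  # hypothetical query name
    .foreachBatch(check_results)                   # assumed test callback
    .outputMode("append")
    .trigger(availableNow=True)  # process everything available, then stop
    .start()
)
q.awaitTermination()  # returns once the backlog is drained; no sleep(10) needed
```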
Trigger.AvailableNow() will also keep making new batches... I had a discussion with Bo previously here: #48124 (comment)
Not sure if we can do this in the after(), as the query will keep making new batches because shouldRunAnotherBatch will always return True for processingTime mode. The Scala side doesn't have this issue because we are using the testStream / CheckAnswer framework, which shuts down the query explicitly.
This only happens with the processing-time timer mode, and I don't expect many tests to use it.
But I'm fine either way. I'm OK with using the same approach for all timer modes.
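A sketch of the explicit-shutdown approach that works for all timer modes, assuming the same hypothetical `df` and `check_results` as above: processAllAvailable() blocks until pending input is processed, and stop() terminates the query even though shouldRunAnotherBatch keeps returning True in processingTime mode:

```python
q = (
    df.writeStream.foreachBatch(check_results)  # assumed test callback
    .outputMode("append")
    .start()
)
try:
    q.processAllAvailable()  # block until all pending input has been processed
finally:
    q.stop()  # processing-time timers keep scheduling batches, so stop explicitly
```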
Wow, I wonder how we could have missed this... Nice catch! I was about to ask for a new test covering this, but I'll revisit once you update the tests, as your new test seems to cover this part as well.
Included a check on eviction of late events in the test case.
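As an illustration only (the actual test data, ids, and column names are not shown in this thread), such an eviction check inside a foreachBatch callback could look like:

```python
import datetime

def check_results(batch_df, batch_id):
    if batch_id == 2:
        # hypothetical column name and cutoff: by this batch the late-event
        # watermark has advanced past the late row's event time, so that row
        # must have been dropped before reaching the stateful operator
        late_rows = batch_df.filter(
            batch_df["eventTime"] <= datetime.datetime(2023, 1, 1, 0, 0, 10)
        )
        assert late_rows.isEmpty()
```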
I wonder why a Project is necessary between UpdateEventTimeColumn and TransformWithStateInPandas - could you please check the value of projectList? If that value is the same as the output of TransformWithStateInPandasExec, the optimizer might be able to remove the Project (in the future). If that's the case, we may want to address the pattern without a Project in between as well.
The value we got from the projectList variable here is the output event time column:
projectList: List(outputTimestamp#15)
I really appreciate you bringing the optimization up here. After reading your comments, I tried to figure out where this extra projection comes from, and it took me quite a while - we did not add this projection anywhere in the event-time-related changes. It most likely comes from the projection pushed down from the chained operator. In the test case, a groupBy is chained after TWS, and this groupBy only aggregates on some of the output columns of TWS. I am not super familiar with the analyzer's optimizations, but I think this projection ultimately comes from the pushdown from the aggregation operator, which is why we see the extra layer of Projection here. That being said, the match case that requires the Projection is not sufficient on its own.
I added another match case above, and another test case covering the situation without the Projection.
I did a quick test on the Scala side and it doesn't have this issue: the TWS node is wrapped in SerializeFromObjectExec, which probably doesn't match the optimizer rule.
But I think the exact match on ProjectExec is a bit hacky - there are probably more optimizer rules I am not aware of. Do you think we should do a recursive match here, in case the TWS node is (recursively) wrapped by operators other than Project?
It's probably easier to not allow pushdown between UpdateEventTimeColumnExec and TransformWithStateExec.
I see filters are not allowed to be pushed down below UpdateEventTimeColumnExec, so that part is OK. This is likely a case of column pruning -
it can insert a Project down into the children nodes; we can disallow this for UpdateEventTimeColumnExec, as in the ColumnPruning diff quoted further below.
Could you give this a try and see whether the Projection disappears? If that works, we can consider UpdateEventTimeColumnExec and TransformWithStateExec to always be adjacent.
Thanks for all the code pointers! I added a line in the Optimizer, and the column is no longer pruned, so UpdateEventTimeColumnExec and TransformWithStateInPandasExec are now placed together. I left the rule inside the Optimizer, and IncrementalExecution now matches only the case where UpdateEventTimeColumnExec and TransformWithStateInPandasExec are adjacent, with no extra layer in between.
Hey @HeartSaVioR, could you take another look? Thanks!
HeartSaVioR left a comment
I'll review the test code once the issue about the output mode is addressed.
.count()
.writeStream.queryName("chaining_ops_query")
.foreachBatch(check_results)
.outputMode("update")
update mode? We don't support multiple stateful operators with update mode. If the query does not break, we are missing a check on whether this operator is stateful. We should check it in UnsupportedOperationChecker.
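The implied fix on the test side is to use append mode, since Spark only supports chaining multiple stateful operators in append mode - a sketch with the same hypothetical `df` and `check_results` as earlier:

```python
q = (
    df.writeStream.queryName("chaining_ops_query")
    .foreachBatch(check_results)
    .outputMode("append")  # update mode is unsupported with chained stateful operators
    .start()
)
```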
case p @ Project(_, _: LeafNode) => p

// Can't prune the columns on UpdateEventTimeWatermarkColumn
case p@Project(_, _: UpdateEventTimeWatermarkColumn) => p
nit: space between p and @, and between @ and Project
case d: Deduplicate if d.isStreaming && d.keys.exists(hasEventTimeCol) => true
case d: DeduplicateWithinWatermark if d.isStreaming => true
case t: TransformWithState if t.isStreaming => true
case t: TransformWithStateInPandas if t.isStreaming => true
Thanks for adding this. It's lucky we figured this out before releasing the feature. :)
Did a comparison with the TWS Scala base PR and added a blocker for batch queries. Thanks for the reminder!
import pyspark.sql.functions as f

input_path = tempfile.mkdtemp()
self._prepare_test_resource1(input_path)
nit: We should remove this. It is a no-op because the file is overwritten later, but it is definitely confusing.
Fixed in my new commits! The test suite was still under progress when you made the comments. Feel free to skip the older commits and review the newest change.
self._test_transform_with_state_in_pandas_chaining_ops(
    StatefulProcessorChainingOps(), check_results, "eventTime"
)
self._test_transform_with_state_in_pandas_chaining_ops(
nit: either add some data for another id with its own check_results, or remove this. This is not an exhaustive test anyway, as it doesn't really test that we group by id.
The new test case now includes input with different ids. The purpose of this test is to check that column pruning is set up correctly and that the case match in IncrementalExecution covers all the logical plans.
OK, so it's not our main interest to verify that the groupBy will produce the aggregated result per grouping key.
Yeah, it is not our main interest, as we are testing here that the event time watermark is propagated properly for TWS. I believe we already have abundant tests for aggregation. I am keeping this for safety because, previously, without the column pruning change, a groupBy on a single column versus multiple columns would match different cases in IncrementalExecution.
throwError("dropDuplicatesWithinWatermark is not supported with batch " +
  "DataFrames/DataSets")(d)

case t: TransformWithStateInPandas =>
Let's file a JIRA ticket with a TODO like this:
// TODO(SPARK-40443): support applyInPandasWithState in batch query
throw new SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3176")
I see my own inconsistency here - the above is in BasicOperators in SparkStrategies:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala, lines 966 to 968 (f712213):
case _: FlatMapGroupsInPandasWithState =>
  // TODO(SPARK-40443): support applyInPandasWithState in batch query
  throw new SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3176")
Shall we move this to BasicOperators and add a TODO comment like the above? Let's reserve this place for operators there is a reason not to support and that won't be supported in the future.
}
assert batch_df.isEmpty()
elif batch_id == 1:
    # watermark is 25 - 5 = 20, no more event for eventTime=10
No, it's not.
- The watermark for "late events" in batch N is the watermark for "eviction" in batch N - 1 (default: 0).
- The watermark for "eviction" in batch N is calculated based on the input data in batch N - 1.
That said, the watermark for "eviction" in batch 1 is 15 (max event time from batch 0) - 5 = 10. The watermark for "late events" in batch 0 is 0.
With the same logic, the watermark for "eviction" in batch 2 is 25 - 5 = 20. The watermark for "late events" in batch 2 is the watermark for "eviction" in batch 1, hence 10.
Please update the code comments with the correction. I see the comments in both batch 1 and batch 2 are incorrect.
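Spelling the correction out as code comments - a sketch using only the reviewer's numbers (max event time 15 in batch 0 and 25 in batch 1, with a 5-second delay):

```python
def check_results(batch_df, batch_id):
    if batch_id == 0:
        # late-event watermark: 0; eviction watermark: 0 (no prior batch)
        pass
    elif batch_id == 1:
        # eviction watermark: 15 (max event time in batch 0) - 5 = 10
        # late-event watermark: 0 (batch 0's eviction watermark)
        pass
    elif batch_id == 2:
        # eviction watermark: 25 (max event time in batch 1) - 5 = 20
        # late-event watermark: 10 (batch 1's eviction watermark)
        pass
```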
HeartSaVioR left a comment
Only one comment.
throw new SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3176")
case _: TransformWithStateInPandas =>
  // TODO(SPARK-50428): support TransformWithStateInPandas in batch query
  throw new SparkUnsupportedOperationException("_LEGACY_ERROR_TEMP_3176")
nit: sorry, you need to define another error class; this one says applyInPandasWithState. It's OK to add another temporary error class, since we want to fix it anyway.
Modified to reuse the error class previously used for TransformWithState: https://github.com/apache/spark/commit/289543eba5bd07395282c28cb08934ec625a4935#diff-7c879c08d2f379c139d5229a888[…]69bb48f0a138a3d64e1b2dde3502feR46
+1 pending CI. Thanks for the patience.
I know it's tech debt to leave this as a temp error class, so let's file a JIRA ticket to change the error class to a non-temp one if we can't find time to add batch query support.
Created here: https://issues.apache.org/jira/browse/SPARK-50429.
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR adds support for defining an event time column in the output dataset of the TransformWithStateInPandas operator. The new event time column will be used to evaluate watermark expressions in downstream operators.
Why are the changes needed?
This change pairs with the Scala implementation of operator chaining. Scala PR: #45376
Does this PR introduce any user-facing change?
Yes. Users can now specify an event time column as follows.
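The snippet itself was not preserved in this page extract; below is a hedged reconstruction based on the discussion above. `MyProcessor`, `output_schema`, and the input column names are illustrative; `eventTimeColumnName` is the parameter this PR introduces, and `outputTimestamp` matches the projectList value seen during review:

```python
result = (
    df.withWatermark("eventTime", "5 seconds")
    .groupBy("id")
    .transformWithStateInPandas(
        statefulProcessor=MyProcessor(),  # illustrative StatefulProcessor subclass
        outputStructType=output_schema,   # output schema including outputTimestamp
        outputMode="Append",
        timeMode="eventTime",
        eventTimeColumnName="outputTimestamp",  # new: event time column of the output
    )
    .groupBy("outputTimestamp")  # chained stateful operator now sees the watermark
    .count()
)
```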
How was this patch tested?
Integration tests.
Was this patch authored or co-authored using generative AI tooling?
No.