Skip to content

Conversation

@sahnib
Copy link
Contributor

@sahnib sahnib commented Mar 4, 2024

What changes were proposed in this pull request?

This PR adds support to define event time column in the output dataset of TransformWithState operator. The new event time column will be used to evaluate watermark expressions in downstream operators.

  1. Note that the transformWithState operator does not enforce that values generated by user's computation adhere to the watermark semantics. (no output rows are generated which have event time less than watermark).
  2. Updated the watermark value passed in TimerInfo as evictionWatermark, rather than lateEventsWatermark.
  3. Ensure that event time column can only be defined in output if a watermark has been defined previously.

Why are the changes needed?

This change is required to support chaining of stateful operators after transformWithState. Event time column is required to evaluate watermark expressions in downstream stateful operators.

Does this PR introduce any user-facing change?

Yes. Adds a new version of transformWithState API which allows redefining the event time column.

How was this patch tested?

Added unit test cases.

Was this patch authored or co-authored using generative AI tooling?

No

@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch 4 times, most recently from a60c2f1 to 8b3d169 Compare March 5, 2024 02:26
@sahnib sahnib marked this pull request as ready for review March 5, 2024 04:39
@sahnib sahnib changed the title Allow chaining other stateful operators after transformWIthState operator. [SS] Allow chaining other stateful operators after transformWIthState operator. Mar 5, 2024
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from 8b3d169 to 5789ff2 Compare April 23, 2024 17:02
@sahnib sahnib changed the title [SS] Allow chaining other stateful operators after transformWIthState operator. [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. Apr 23, 2024
@sahnib sahnib requested a review from anishshri-db April 23, 2024 17:14
@anishshri-db
Copy link
Contributor

@sahnib - seems like there are still conflicts on the base branch ?

@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from 5789ff2 to 214142e Compare April 23, 2024 17:34
@github-actions github-actions bot removed the DOCS label Apr 23, 2024
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from 214142e to e21eb92 Compare April 23, 2024 18:02
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from ec376f2 to 7c2a950 Compare April 24, 2024 14:52
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from c442920 to 770f086 Compare April 24, 2024 23:05
@sahnib
Copy link
Contributor Author

sahnib commented Apr 24, 2024

@HeartSaVioR PTAL, thanks.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass.

sahnib added 2 commits May 6, 2024 10:27
…ator. This change allows the user to redefine the event time column in generated output of transformWithState, which is used in watermark expressions for operators following transformWithState
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch 2 times, most recently from f39d941 to 3082693 Compare May 6, 2024 17:28
@sahnib sahnib force-pushed the tws-chaining-stateful-operators branch from 3082693 to 5d02835 Compare May 6, 2024 17:41
Seq(
ResolveWithCTE,
ExtractDistributedSequenceID) ++
Seq(ResolveUpdateEventTimeWatermarkColumn) ++
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeartSaVioR I have added the ResolveUpdateEventTimeWatermarkColumn after all resolution rules. At this place, we are guaranteed (IIUC) to have resolved the query plan and should be able to extract watermark delay from child nodes of UpdateEventTimeWatermarkColumn. Let me know if you think otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not guaranteed for all resolutions to happen in one loop. fixedPoint means having iterations of application of the set of rules. That said, we shouldn't still assume that child is resolved, and only make the rule to be effective when child is resolved. I'll left a comment.

@sahnib sahnib requested a review from HeartSaVioR May 6, 2024 17:43
Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have an issue where we have to fix the issue and update the test case right now or defer both. I'm OK with latter, but let's file a blocker JIRA ticket if we want to do the latter.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@HeartSaVioR
Copy link
Contributor

Let's be sure to either 1) introduce a method to users which gives a watermark value before advancing (late events) or 2) construct a story for users to set the event time timestamp properly without watermark value.
@sahnib Could you please file a JIRA ticket with blocker priority?

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented May 8, 2024

Let's handle that later - I assume we will have a JIRA ticket. Thanks! Merging to master.

@HeartSaVioR HeartSaVioR changed the title [SPARK-47960][SS] Allow chaining other stateful operators after transformWIthState operator. [SPARK-47960][SS] Allow chaining other stateful operators after transformWithState operator. May 8, 2024
@sahnib
Copy link
Contributor Author

sahnib commented May 8, 2024

Let's be sure to either 1) introduce a method to users which gives a watermark value before advancing (late events) or 2) construct a story for users to set the event time timestamp properly without watermark value. @sahnib Could you please file a JIRA ticket with blocker priority?

@HeartSaVioR Created the JIRA https://issues.apache.org/jira/browse/SPARK-48199 for follow up item.

HeartSaVioR pushed a commit that referenced this pull request Nov 27, 2024
…ansformWithStateInPandas API

### What changes were proposed in this pull request?

This PR adds support to define event time column in the output dataset of `TransformWithStateInPandas` operator. The new event time column will be used to evaluate watermark expressions in downstream operators.

### Why are the changes needed?

This change is to couple with the scala implementation of chaining of operators. PR in Scala: #45376

### Does this PR introduce _any_ user-facing change?

Yes. User can now specify a event time column as:
```
df.groupBy("id")
  .transformWithStateInPandas(
      statefulProcessor=stateful_processor,
      outputStructType=output_schema,
      outputMode="Update",
      timeMode=timeMode,
      eventTimeColumnName="outputTimestamp"
  )
```

### How was this patch tested?

Integration tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48124 from jingz-db/python-chaining-op.

Lead-authored-by: jingz-db <jing.zhan@databricks.com>
Co-authored-by: Jing Zhan <135738831+jingz-db@users.noreply.github.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants