-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-48511][SS] Remove TimeMode None from TransformWithState. #46825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
4981316 to
b00a034
Compare
|
Mind filing a JIRA please? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When did we add this? If this is already released out, we can't just remove and break it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added this on April 7, 2024 as part of TTL support for Streaming State with TransformWithState operator. This operator is a new addition to Spark Structured Streaming, and is planned to be released in Spark 4.1. The API is currently in heavy development and evolving. I think its safe to make this change at this point, given its still unreleased in a Spark Major version.
cc: @anishshri-db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea not released yet. This is only in the 4.0 branch.
@sahnib - while we are here - is there a way to make this package private to sql too on the Java side ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java doesn't support package private with specifying ancestor package. It's always limited to the same package and it won't may not work for our case.
(It might change in recent versions though I haven't heard about such change.)
b00a034 to
05ff15b
Compare
Filed SPARK-48511. Thanks. |
|
@HeartSaVioR PTAL, thanks. |
| watermarkPropagator.getInputWatermarkForEviction(tentativeBatchId, | ||
| p.stateInfo.get.operatorId)) | ||
| }.exists(_ == true) | ||
| }.contains(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unrelated change? Please file a new minor PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove these minor changes from this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution. Left comments. General comments are 1) whether we can use any time mode where we could use None, or only ProcessingTime, and 2) why we need manual clock while we just removed out time mode None (doesn't sound to be relevant).
| class ExpiredTimerInfoImpl( | ||
| isValid: Boolean, | ||
| expiryTimeInMsOpt: Option[Long] = None, | ||
| timeMode: TimeMode = TimeMode.None()) extends ExpiredTimerInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we assume we don't need to provide either it was from event time semantic vs processing time semantic? What was the rationale to add this and why this could be removed while we just remove out None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question. TimeMode is passed in the StatefulProcessor.init function to the user. We were also passing it to the ExpiredTimerInfoImpl but the interface ExpiredTimerInfo does not even expose it to the user, and we never set this variable with proper timeMode. Hence, I have removed it. I dont think this was intended to here. cc: @anishshri-db can you confirm?
| val lastExecutionRequiresAnotherBatch = noDataBatchesEnabled && | ||
| // need to check the execution plan of the previous batch | ||
| execCtx.previousContext.map { plan => | ||
| execCtx.previousContext.exists { plan => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: again, separate minor PR for unrelated change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do.
| Dataset<String> transformWithStateMapped = grouped.transformWithState( | ||
| new TestStatefulProcessorWithInitialState(), | ||
| TimeMode.None(), | ||
| TimeMode.ProcessingTime(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: just to be fully sure, either ProcessingTime or EventTime works, do I understand correctly? If either one only works for replacement of None, should be better to be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either ProcessingTime or EventTime works as a replacement.
| val store = provider.getStore(0) | ||
| val handle = new StatefulProcessorHandleImpl(store, UUID.randomUUID(), | ||
| Encoders.STRING.asInstanceOf[ExpressionEncoder[Any]], TimeMode.None()) | ||
| Encoders.STRING.asInstanceOf[ExpressionEncoder[Any]], TimeMode.ProcessingTime()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: same, wanted to know whether this is just a convenience or required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a convenience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why the test code change is required given the code change is just to remove TimeMode None. Do we fix some test flakiness as well here, or what is the rationale of manual clock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TransformWithStateExec currently returns true for ProcessingTime timeMode in shouldRunAnotherBatch. (Note that this behavior is same as FMGWS). I had to add a manual clock to ensure that data processing is triggerred before Check/Expect blocks (otherwise AddData does not finish). Let me know if that answers your question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Also let's revisit the UX. If they use neither timeout nor TTL, they could just do None regardless they have event time column or not. Given we remove None, what is the expectation of UX? Does user have to specify either one based on what they have (event time column is set or not), even though they never use timeout? If that is the case, that doesn't sound like a better UX. |
Even with TimeMode None -> the query has eventTime (if eventTimeColumn is specified), and processingTime (based on driver's clock). I think this makes TimeMode None confusing. |
|
Just leaving a history here: currently processing time mode triggers the batch continuously which is not acceptable if there is no timer/TTL to check. We are now discussing how to handle the case. |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This PR removes the TimeMode None from supported time modes in TransformWithState operator. A structured streaming query works in either Processing time mode, or event Time mode depending on whether eventTime has been specified, hence this change aligns TimeMode properly with Streaming query semantics.
Note that if eventTimeColumn is specified for output dataset in transformWithState operator. TimeMode defaults to EventTime.
Why are the changes needed?
These changes are needed to align TimeMode values with how time flows in Streaming query.
Does this PR introduce any user-facing change?
Yes, modifies the TimeMode semantic for transformWithState.
How was this patch tested?
All existing unit test pass.
Was this patch authored or co-authored using generative AI tooling?
No