New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-47570][SS] Integrate range scan encoder changes with timer implementation #45709
Conversation
@jingz-db mind liking the JIRA into the PR title? See also https://spark.apache.org/contributing.html |
164928f
to
1f128d5
Compare
Just did. Thanks! |
@@ -161,25 +161,23 @@ case class TransformWithStateExec( | |||
assert(batchTimestampMs.isDefined) | |||
val batchTimestamp = batchTimestampMs.get | |||
val procTimeIter = processorHandle.getExpiredTimers() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pass the timeout to the iterator and just return the expired rows instead ? i.e. move the filtering logic within the TimerStateImpl
?
9f78369
to
b34b66c
Compare
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala
Outdated
Show resolved
Hide resolved
val rowPair = iter.next() | ||
val keyRow = rowPair.key | ||
val result = getTimerRowFromSecIndex(keyRow) | ||
val rowPair = if (iter.hasNext) iter.next() else null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets keep the original if condition as it is ? and add a condition to return null
or result
within this case ?
@@ -164,11 +164,13 @@ class StatefulProcessorHandleImpl( | |||
|
|||
/** | |||
* Function to retrieve all registered timers for all grouping keys | |||
* @param expiryTimestampMs Threshold for expired timestamp in milliseconds, this function | |||
* will return every timers that has (strictly) smaller timestamp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: will return all timers that have timestamp less than passed threshold
83e5721
to
150d8c9
Compare
timerTimerStamps3.foreach(timerState3.registerTimer) | ||
ImplicitGroupingKeyTracker.removeImplicitKey() | ||
|
||
ImplicitGroupingKeyTracker.setImplicitKey("test_key1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove setting the implicit key just for this section (for the calls to getExpiredTimers) would the test still work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it works. Will removed them in the next commit.
b73dff6
to
62eee1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -188,9 +187,12 @@ class TimerStateImpl( | |||
|
|||
/** | |||
* Function to get all the registered timers for all grouping keys | |||
* @param expiryTimestampMs Threshold for expired timestamp in milliseconds, this function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe add a small comment here mentioning that we perform a range scan and stop iterating once the key row timestamp exceeds the threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is not user-facing, right? If it is, I'd suggest avoiding implementation detail. Looks like as it doesn't seem to be an user facing, but just to remind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out (I did not realize this before but looks like we did the right thing)!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Thanks! Merging to master. |
…lementation ### What changes were proposed in this pull request? Previously timer state implementation was using No prefix rocksdb state encoder. When doing `iterator()` or `prefix()`, the returned iterator is not sorted on timestamp value. After Anish's PR for supporting range scan encoder, we could integrate it with `TimerStateImpl` such that we will use range scan encoder on `timer to key`. ### Why are the changes needed? The changes are part of the work around adding new stateful streaming operator for arbitrary state mgmt that provides a bunch of new features listed in the SPIP JIRA here - https://issues.apache.org/jira/browse/SPARK-45939 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests in `TimerSuite` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#45709 from jingz-db/integrate-range-scan. Authored-by: jingz-db <jing.zhan@databricks.com> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
What changes were proposed in this pull request?
Previously timer state implementation was using No prefix rocksdb state encoder. When doing
iterator()
orprefix()
, the returned iterator is not sorted on timestamp value. After Anish's PR for supporting range scan encoder, we could integrate it withTimerStateImpl
such that we will use range scan encoder ontimer to key
.Why are the changes needed?
The changes are part of the work around adding new stateful streaming operator for arbitrary state mgmt that provides a bunch of new features listed in the SPIP JIRA here - https://issues.apache.org/jira/browse/SPARK-45939
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests in
TimerSuite
Was this patch authored or co-authored using generative AI tooling?
No