[SPARK-47052][SS] Separate state tracking variables from MicroBatchExecution/StreamExecution #45109
Conversation
Only style nits; the rationale makes total sense and the code change looks great overall. Thanks for making this change!
currentStatus = currentStatus.copy(message = message)

/** Extracts observed metrics from the most recent query execution. */
private def extractObservedMetrics(
  lastExecution: QueryExecution): Map[String, Row] = {
nit: 2 more spaces
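For context, the "2 more spaces" nit refers to the Scala convention Spark follows for multi-line declarations: continuation parameters are indented 4 spaces, not 2. A minimal sketch of the before/after (hypothetical, simplified signatures — not the PR's actual types):

```scala
object IndentationNitExample {
  // Flagged in review: continuation parameter indented only 2 spaces
  private def flagged(
    lastExecution: String): Map[String, Int] = Map.empty

  // Requested: 4-space indent for parameters in multi-line declarations
  private def requested(
      lastExecution: String): Map[String, Int] = Map.empty

  // Both compile identically; only the indentation differs.
  def demo(): Boolean = flagged("qe") == requested("qe")
}
```

The extra indentation visually separates the parameter list from the method body, which starts at a 2-space indent.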
val timeTaken = math.max(endTime - startTime, 0)

/** Extracts statistics from the most recent query execution. */
private def extractExecutionStats(
  hasNewData: Boolean,
nit: 2 more spaces
return Map.empty

/** Extract statistics about stateful operators from the executed query plan. */
private def extractStateOperatorMetrics(
  lastExecution: IncrementalExecution): Seq[StateOperatorProgress] = {
nit: 2 more spaces
if (!hasNewData) {
  return ExecutionStats(Map.empty, stateOperators, watermarkTimestamp)

private def warnIfFinishTriggerTakesTooLong(
  triggerEndTimestamp: Long,
nit: 2 more spaces
 * Override of finishTrigger to extract the map from IncrementalExecution.
 */
def finishTrigger(
  hasNewData: Boolean,
nit: 2 more spaces
  to: StreamProgress,
  latest: StreamProgress): Unit = {

def recordTriggerOffsets(
  from: StreamProgress,
nit: 2 more spaces
 * during the execution lifecycle of a batch that is being processed by the streaming query
 */
abstract class ProgressContext(
  id: UUID,
nit: 2 more spaces
}

def updateIdleness(
  id: UUID,
nit: 2 more spaces
private def getStartOffset(dataStream: SparkDataStream): OffsetV2 = {
  val startOffsetOpt = availableOffsets.get(dataStream)

private def getStartOffset(
  execCtx: MicroBatchExecutionContext,
nit: 2 more spaces
private def populateStartOffsets(sparkSessionToRunBatches: SparkSession): Unit = {
  sinkCommitProgress = None

protected def populateStartOffsets(
  execCtx: MicroBatchExecutionContext,
nit: 2 more spaces
Could you please rebase the master branch of your fork with the latest from the OSS repo, and rebase your PR against the latest master as well? That may resolve the linter failures the CI is showing.
Force-pushed the …ion/StreamExecution branch from 0e3b8ab to 45ae190.
@HeartSaVioR thanks for your review! I have addressed your comments. PTAL!
+1 pending CI.
The CI only failed in pyspark-connect, and the failures look to be unrelated (not related to streaming).
Thanks! Merging to master.
What changes were proposed in this pull request?
To improve code clarity and maintainability, I propose that we move all the variables that track mutable state and metrics for a streaming query into a separate class. With this refactor, it will be easier to track and find all the mutable state a microbatch can have.
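As a rough illustration of the idea (a minimal sketch with hypothetical, simplified names and signatures — `MicroBatchContext` and the `Map[String, Long]` offsets here only mirror, and do not reproduce, the PR's actual `ProgressContext`/`MicroBatchExecutionContext` and `StreamProgress` types):

```scala
import java.util.UUID

// Sketch: per-batch mutable tracking state lives in one context object
// instead of being scattered across the execution class itself.
class MicroBatchContext(val id: UUID, val batchId: Long) {
  // All mutable per-batch state is declared here, in one place.
  var startOffsets: Map[String, Long] = Map.empty
  var endOffsets: Map[String, Long] = Map.empty
  var isNewDataAvailable: Boolean = false

  /** Records the offset range this batch processed. */
  def recordTriggerOffsets(from: Map[String, Long], to: Map[String, Long]): Unit = {
    startOffsets = from
    endOffsets = to
    isNewDataAvailable = from != to
  }
}
```

With this shape, adding a new piece of per-batch state means adding one field to the context class, rather than hunting for the right spot among the execution class's other members.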
Why are the changes needed?
To improve code clarity and maintainability. All the state and metrics needed for the execution lifecycle of a microbatch are consolidated into one class. If we decide to modify or add state to a streaming query, it will be easier to determine 1) where to add it and 2) what existing state there is.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests should suffice
Was this patch authored or co-authored using generative AI tooling?
No