
[SPARK-40460][SS] Fix streaming metrics when selecting _metadata #37905

Closed
Yaohua628 wants to merge 4 commits into apache:master from Yaohua628:spark-40460

Conversation

@Yaohua628 (Contributor) commented on Sep 16, 2022:

What changes were proposed in this pull request?

Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the `_metadata` column, because the logical plan from the batch does not match the actual planned logical plan. As a result, [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348) we cannot find the plan and collect metrics correctly.

This PR fixes this by replacing the initial `LogicalPlan` with the `LogicalPlan` containing the metadata column.
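A minimal sketch of the symptom (not from the PR's test suite; the input directory, schema, and sink below are hypothetical placeholders): a file-source streaming query that selects `_metadata` would, before this fix, report zeroed rates such as `processedRowsPerSecond` in `lastProgress` even while rows were being processed.

```scala
// Hypothetical reproduction sketch, assuming a JSON file source.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()

val query = spark.readStream
  .format("json")
  .schema("id LONG")              // hypothetical schema
  .load("/tmp/stream-input")      // hypothetical input directory
  .select("id", "_metadata")      // pulls in the hidden file-metadata column
  .writeStream
  .format("noop")                 // discard output; we only care about metrics
  .start()

// Before this fix: query.lastProgress.processedRowsPerSecond == 0.0
// even though the micro-batches actually processed rows.
```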

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing + New UTs

@Yaohua628 (Contributor, Author) commented:

Hi, @cloud-fan @HeartSaVioR could you please take a look whenever you have a chance? Thanks! Happy weekend!

```diff
@@ -590,7 +591,7 @@ class MicroBatchExecution(
     val newBatchesPlan = logicalPlan transform {
       // For v1 sources.
       case StreamingExecutionRelation(source, output, catalogTable) =>
-        newData.get(source).map { dataPlan =>
+        mutableNewData.get(source).map { dataPlan =>
           val hasFileMetadata = output.exists {
```
Contributor commented:

Looking at the code, it seems the problem is that we resolve the metadata columns in every micro-batch. Shouldn't we resolve them only once?

Contributor commented:

That would require the `Source` to be told that the metadata column was requested, and to produce the logical plan accordingly when `getBatch` is called. My understanding is that a DSv1 source has no interface for receiving information about which columns will be referenced in the actual query.
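For context, the DSv1 streaming source interface looks roughly like the following (an abridged sketch of `org.apache.spark.sql.execution.streaming.Source`, not the full trait): `getBatch` receives only an offset range, so there is no channel through which the query could tell the source that `_metadata` was requested.

```scala
// Abridged sketch of the DSv1 streaming Source interface; see
// org.apache.spark.sql.execution.streaming.Source for the real trait.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Offset
import org.apache.spark.sql.types.StructType

trait Source {
  // Fixed schema of the source; there is no way to learn which columns
  // the query actually selects (e.g. whether _metadata was requested).
  def schema: StructType

  // Only an offset range is passed in -- no column information -- so the
  // returned DataFrame cannot proactively include _metadata.
  def getBatch(start: Option[Offset], end: Offset): DataFrame

  def commit(end: Offset): Unit
  def stop(): Unit
}
```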

Contributor commented on the same hunk:
While we are here, a probably less intrusive change would be moving L594 ~ L610 up to L567. After that change we wouldn't need to touch `newData` here.

Contributor (Author) commented:

Thanks, I initially thought about that, but we need to know the `output` from `StreamingExecutionRelation(source, output, catalogTable)` to resolve `_metadata`, right (L591 ~ L593)?
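For readers following along, the leaf node under discussion has roughly this shape (an abridged sketch, not the exact definition in Spark's source): resolving `_metadata` needs the relation's `output` attributes, which are only in scope when pattern-matching on this node inside the plan transform.

```scala
// Abridged sketch of the v1 streaming leaf node; field list matches the
// pattern match in the diff above, but details are simplified.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.connector.read.streaming.SparkDataStream

case class StreamingExecutionRelation(
    source: SparkDataStream,
    output: Seq[Attribute],                 // needed to resolve _metadata
    catalogTable: Option[CatalogTable])(session: SparkSession)
  extends LeafNode
```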

Contributor commented:

Yeah, you're right. I missed that.

Btw, it looks like my change (tagging `catalogTable` onto `LogicalRelation`) will also run into this bug. Thanks for fixing this.

Contributor (Author) commented:

Np - an unintentional fix :-)
Thanks for helping!

Contributor commented:

We may want to check the self-union / self-join cases to verify we really didn't break anything. This works only when leaf : source = 1 : 1 holds (otherwise we are overwriting the value in the map), while the code comment in ProgressReporter says there are counterexamples.

Contributor (Author) commented:

Got it. Could you share an example? In that case, does that mean leaf : source = 1 : N?

Contributor commented:

The code comment actually doesn't say much, and I'm speculating. Let's just make a best effort and try self-union and self-join: `df = spark.readStream...` then `df.union(df)` / `df.join(df)`, as sketched below.
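A minimal sketch of the suggested probes (the format, schema, and directory are hypothetical placeholders): the same streaming source then appears as two leaves in the logical plan, so the leaf-to-source mapping is 2 : 1 rather than 1 : 1.

```scala
// Hypothetical self-union / self-join probes for the leaf : source != 1 : 1 case.
val df = spark.readStream
  .format("json")
  .schema("id LONG")                 // hypothetical schema
  .load("/tmp/new-streaming-data")   // hypothetical directory

// Self-union: the plan contains the same source twice as a leaf.
val unioned = df.union(df)

// Self-join (inner equi-join): likewise two leaves backed by one source.
val joined = df.join(df, "id")
```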

@AmplabJenkins commented:

Can one of the admins verify this patch?

```scala
val df1 = spark.read.format("json")
  .load(dir.getCanonicalPath + "/target/new-streaming-data-union")
// Verify self-union results
assert(streamQuery0.lastProgress.numInputRows == 2L)
```
Contributor commented:

Should be `streamQuery1`.

Contributor (Author) commented:

oops

@HeartSaVioR (Contributor) left a comment:

+1 pending build.

@HeartSaVioR (Contributor) commented:

Thanks! Merging to master / 3.3.

@HeartSaVioR (Contributor) commented:

There's conflict in branch-3.3. @Yaohua628 Could you please craft a PR for branch-3.3? Thanks in advance!

Yaohua628 added a commit to Yaohua628/spark that referenced this pull request Sep 19, 2022

Closes apache#37905 from Yaohua628/spark-40460.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
@Yaohua628 (Contributor, Author) commented:

> There's conflict in branch-3.3. @Yaohua628 Could you please craft a PR for branch-3.3? Thanks in advance!

Done! #37932 - Thank you

HyukjinKwon pushed a commit that referenced this pull request Sep 20, 2022
Cherry-picked from #37905.

Closes #37932 from Yaohua628/spark-40460-3-3.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Sep 20, 2022

Closes apache#37905 from Yaohua628/spark-40460.

Authored-by: yaohua <yaohua.zhao@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>