
[SPARK-29450][SS][2.4] Measure the number of output rows for streaming aggregation with append mode #27209

Closed

Conversation

HeartSaVioR
Contributor

What changes were proposed in this pull request?

This patch adds the missing metric for the number of output rows in streaming aggregation with append mode. The other output modes already measure this metric correctly.

Why are the changes needed?

Without the patch, the value of this metric is always 0.
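
For context on where the metric surfaces: it is an operator-level SQL metric, shown as "number of output rows" on the state store save operator in the SQL tab of the web UI (see the screenshots below). As a minimal, hedged sketch, the related per-batch progress can also be watched from code, assuming the running `query` handle from the test snippet below:

```
// Hedged sketch (not part of the patch): dump recent per-batch progress.
// The state operator stats (numRowsTotal, numRowsUpdated) appear here; the
// fixed "number of output rows" metric itself is shown in the SQL UI.
query.recentProgress.foreach(p => println(p.prettyJson))
```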

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A unit test was added. The patch was also manually tested with the query below:

> query

```
import spark.implicits._
import org.apache.spark.sql.functions._  // window, max, min, avg (auto-imported in spark-shell)

spark.conf.set("spark.sql.shuffle.partitions", "5")

val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .load()
  .withWatermark("timestamp", "5 seconds")
  .selectExpr("timestamp", "mod(value, 100) as mod", "value")
  .groupBy(window($"timestamp", "10 seconds"), $"mod")
  .agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))

val query = df
  .writeStream
  .format("memory")
  .option("queryName", "test")
  .outputMode("append")
  .start()

query.awaitTermination()
```
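
Since the sink uses the memory format, the rows emitted so far can be read back as a regular table: the memory sink registers a temp view under the name given by queryName. In append mode only finalized windows appear, because a group is emitted only after the watermark passes the end of its window. A small sketch, assuming the query above is running (e.g. run this from another shell, since awaitTermination() blocks):

```
// Hedged sketch: inspect what the memory sink has received so far.
// Only windows the watermark has already closed will show up (append mode).
spark.table("test").orderBy("window").show(truncate = false)
```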

> before the patch

![screenshot-before-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023217-58d7bc80-0a01-11ea-8cac-40f1cced6d16.png)

> after the patch

![screenshot-after-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023221-5c6b4380-0a01-11ea-8a66-7bf1b7d09fc7.png)

…regation with append mode


Closes apache#26104 from HeartSaVioR/SPARK-29450.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HeartSaVioR
Contributor Author

cc @gatorsmile
also cc @dongjoon-hyun, who is leading the effort to release 2.4.5.

To provide some context: @gatorsmile found that #26104 actually fixed a regression (which I wasn't aware of) that appears to have been introduced in Spark 2.3.0, and asked about backporting the patch to the 2.4 version line.
(#26104 (comment))

I'm not sure whether @gatorsmile wanted this handled in Spark 2.4.5 specifically, or just wanted to ensure the next bugfix release of Spark 2.4 includes it. @gatorsmile, could you please clarify?

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116749 has finished for PR 27209 at commit 25b2769.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116774 has finished for PR 27209 at commit 25b2769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Jan 16, 2020
…g aggregation with append mode


Closes #27209 from HeartSaVioR/SPARK-29450-branch-2.4.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to 2.4!

@cloud-fan cloud-fan closed this Jan 16, 2020
@HeartSaVioR
Contributor Author

Thanks for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-29450-branch-2.4 branch January 16, 2020 08:22
@dongjoon-hyun
Member

+1, late LGTM.
