[SPARK-25414][SS][TEST] make it clear that the numRows metrics should be counted for each scan of the source #22402
Conversation
Test build #95985 has finished for PR 22402 at commit
Test build #95984 has finished for PR 22402 at commit
Test build #95994 has finished for PR 22402 at commit
Test build #96016 has finished for PR 22402 at commit
retest this please
Test build #96019 has finished for PR 22402 at commit
      true
    }
  )
  withSQLConf(SQLConf.EXCHANGE_REUSE_ENABLED.key -> "false") {
Turn off EXCHANGE_REUSE_ENABLED to expose the self-join numRows double-count bug.
Test build #96032 has finished for PR 22402 at commit
@@ -460,9 +460,9 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi
val streamingInputDF = createSingleTriggerStreamingDF(streamingTriggerDF).toDF("value")

val progress = getFirstProgress(streamingInputDF.join(streamingInputDF, "value"))
assert(progress.numInputRows === 20) // data is read multiple times in self-joins
So IIUC, in this line EXCHANGE_REUSE_ENABLED == true, and it's not actually read twice?
Exchange reuse is not triggered here, because the project on one side is eliminated. In the Kafka test, we have a cast in the project, so Spark doesn't eliminate the project on either side, and that triggers exchange reuse.
Test build #96056 has finished for PR 22402 at commit
LGTM
retest this please
Test build #96160 has finished for PR 22402 at commit
retest this please
Test build #96165 has finished for PR 22402 at commit
Test build #96163 has finished for PR 22402 at commit
retest this please
Test build #96172 has finished for PR 22402 at commit
retest this please
Test build #96244 has finished for PR 22402 at commit
thanks, merging to master!
What changes were proposed in this pull request?
For a self-join/self-union, Spark will produce a physical plan which has multiple `DataSourceV2ScanExec` instances referring to the same `ReadSupport` instance. In this case, the streaming source is indeed scanned multiple times, and the `numInputRows` metric should be counted for each scan.

Actually we already have 2 test cases to verify the behavior:
- `StreamingQuerySuite`: "input row calculation with same V2 source used twice in self-join"
- `KafkaMicroBatchSourceSuiteBase`: "ensure stream-stream self-join generates only one offset in log and correct metrics"

However, these 2 tests expect different results, which is super confusing. It turns out that the first test doesn't trigger exchange reuse, so the source is scanned twice, while the second test does trigger exchange reuse, so the source is scanned only once.

This PR proposes to improve these 2 tests to cover both the with- and without-exchange-reuse cases.
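The double-count behavior above can be modeled with a small, Spark-free sketch. All names here (`Source`, `numInputRowsWithoutReuse`, `numInputRowsWithReuse`) are hypothetical and exist only to illustrate the accounting; they are not Spark APIs:

```scala
// Hypothetical model of numInputRows accounting for a streaming self-join.
// Plain Scala, no Spark dependency; names are illustrative only.
object NumInputRowsSketch {
  // A source that counts how many times it is scanned.
  final class Source(val rows: Seq[Int]) {
    var scans: Int = 0
    def scan(): Seq[Int] = { scans += 1; rows }
  }

  // Without exchange reuse: each side of the self-join scans the
  // source, so numInputRows counts the data once per scan (twice total).
  def numInputRowsWithoutReuse(src: Source): Long = {
    val left  = src.scan()
    val right = src.scan()
    (left.size + right.size).toLong
  }

  // With exchange reuse: the second side reuses the first scan's
  // output, so the source is read (and counted) only once.
  def numInputRowsWithReuse(src: Source): Long = {
    val left  = src.scan()
    val right = left // reused exchange output
    left.size.toLong
  }

  def main(args: Array[String]): Unit = {
    val a = new Source(1 to 10)
    println(numInputRowsWithoutReuse(a)) // 20, and a.scans == 2
    val b = new Source(1 to 10)
    println(numInputRowsWithReuse(b))    // 10, and b.scans == 1
  }
}
```

This mirrors why the two tests expect 20 and 10 rows respectively for the same 10-row input, depending on whether exchange reuse kicks in.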
How was this patch tested?
Test-only change.