[SPARK-28423][SQL] Merge Scan and Batch/Stream #25180

cloud-fan · 2019-07-17T15:27:03Z

What changes were proposed in this pull request?

By design, Scan represents a logical data scan, and Batch/Stream represents a physical data scan.

However, this doesn't match reality. The logical plan (DataSourceV2Relation) contains a Table, and the physical plan (BatchScanExec and friends) contains a Scan. The operator pushdown happens at planning time, so Scan and Batch/Stream are always created together in the planner rules. That said, Table is the actual logical data scan.

Since there is not much that can be separated between Scan and Batch/Stream, almost all the existing DS v2 implementations either implement Scan and Batch/Stream together or use an anonymous class to implement Scan.

In addition, the write-side API has no such separation either: it's just WriteBuilder -> BatchWrite/StreamingWrite.

This PR proposes to merge Scan and Batch/Stream, to match the write side API: ScanBuilder -> BatchScan/MicroBatchScan/ContinuousScan.
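
To illustrate the proposed shape, here is a minimal sketch in Java. The interfaces are simplified stand-ins, not the real Spark API. The method names `planInputPartitions` and `createReaderFactory` are borrowed from the existing Batch interface on the assumption that the merged interface keeps them. `RangePartition`, `RangeBatchScan`, `describe()`, and the String-valued schema are hypothetical and exist only to make the example runnable.

```java
// Simplified stand-ins for illustration; these are NOT the real Spark interfaces.
interface InputPartition {}

interface PartitionReaderFactory {
  String describe(); // hypothetical method, only here to make the sketch observable
}

// Proposed shape: one interface exposes the schema and plans the partitions,
// with no intermediate Scan -> Batch hop.
interface BatchScan {
  String readSchema(); // assumption: a String instead of Spark's StructType
  InputPartition[] planInputPartitions();
  PartitionReaderFactory createReaderFactory();
}

// The builder returns the merged interface directly, mirroring the write side.
interface ScanBuilder {
  BatchScan build();
}

// Hypothetical source that splits the range [0, 10) into two partitions.
final class RangePartition implements InputPartition {
  final int start;
  final int end;

  RangePartition(int start, int end) {
    this.start = start;
    this.end = end;
  }
}

final class RangeBatchScan implements BatchScan {
  @Override
  public String readSchema() {
    return "value INT";
  }

  @Override
  public InputPartition[] planInputPartitions() {
    return new InputPartition[] {new RangePartition(0, 5), new RangePartition(5, 10)};
  }

  @Override
  public PartitionReaderFactory createReaderFactory() {
    return () -> "range-reader";
  }
}

class MergedScanSketch {
  public static void main(String[] args) {
    ScanBuilder builder = RangeBatchScan::new;
    BatchScan scan = builder.build();
    System.out.println(scan.readSchema() + ", partitions=" + scan.planInputPartitions().length);
  }
}
```

With this shape a source needs only one class per scan mode, and the planner obtains partitions directly from whatever the builder returns.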

How was this patch tested?

existing tests

cloud-fan · 2019-07-17T15:27:43Z

cc @marmbrus @rdblue @jose-torres @gengliangwang

jose-torres · 2019-07-17T15:36:09Z

This seems fine, but didn’t we decide a while back (Q4 2018, I think) not to do it?

SparkQA · 2019-07-17T15:36:55Z

Test build #107789 has finished for PR 25180 at commit 439d6dc.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-17T17:30:19Z

@jose-torres it seems there is a misunderstanding. This has been on my TODO list for a long time, and I do want to finish it before 3.0.

SparkQA · 2019-07-17T17:49:16Z

Test build #107794 has finished for PR 25180 at commit af96a60.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-18T07:05:02Z

Test build #107824 has finished for PR 25180 at commit 878eaa5.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2019-07-18T09:04:45Z

retest this please.

SparkQA · 2019-07-18T12:22:56Z

Test build #107841 has finished for PR 25180 at commit 878eaa5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchScan.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala

...n/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousTextSocketSource.scala

...c/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketMicroBatchScan.scala

dongjoon-hyun · 2019-07-19T15:30:55Z

Could you review this change please, @rdblue ?

SparkQA · 2019-07-19T17:43:06Z

Test build #107906 has finished for PR 25180 at commit 83f2967.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-07-23T23:56:54Z

Sorry I haven't gotten to this yet. I had some unexpected travel and was out of the office. I should be able to take a look at this tomorrow.

dongjoon-hyun · 2019-07-24T00:16:52Z

Thank you, @rdblue .

cloud-fan · 2019-07-24T00:44:29Z

@rdblue cool, thanks!

SparkQA · 2019-07-24T07:05:01Z

Test build #108077 has finished for PR 25180 at commit 9c826f3.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-07-24T09:09:44Z

retest this please

SparkQA · 2019-07-24T14:14:32Z

Test build #108092 has finished for PR 25180 at commit 9c826f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-07-24T17:28:38Z

> That said, Table is the actual logical data scan.

No, table is a source that can be scanned. A logical scan has filters and a projection.

> The operator pushdown happens at planning time, so Scan and Batch/Stream are always created together in the planner rules.

The scan is created in DataSourceV2Strategy, but batch is a lazy field in BatchScanExec. There's no need for the planner strategy to know about the batch and stream objects at all.

This separation would be useful if we decide to move push-down into a batch in the optimizer. We've been discussing options for doing push-down earlier and being able to use stats in the optimizer. If we did that, then the separation between scan and batch/stream would support that. We would introduce a logical node that has a scan that is produced in the optimizer.
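
As a rough illustration of the lazy `batch` field described above, the following Java sketch shows a physical node that holds only a Scan and converts it to a Batch on first use. All types are simplified stand-ins: `BatchScanNode`, `CountingScan`, and `numPartitions()` are hypothetical names, and the real BatchScanExec is Scala code that is not reproduced here.

```java
// Simplified stand-ins for illustration; these are NOT the real Spark interfaces.
interface Batch {
  int numPartitions(); // hypothetical method, only here to make the sketch observable
}

interface Scan {
  Batch toBatch();
}

// Hypothetical physical node, loosely modeled on the description of BatchScanExec.
final class BatchScanNode {
  private final Scan scan;
  private Batch batch; // created on first use, similar in spirit to a Scala `lazy val`

  BatchScanNode(Scan scan) {
    this.scan = scan;
  }

  synchronized Batch batch() {
    if (batch == null) {
      batch = scan.toBatch();
    }
    return batch;
  }
}

// Hypothetical Scan that counts how often it is converted to a Batch.
final class CountingScan implements Scan {
  int conversions = 0;

  @Override
  public Batch toBatch() {
    conversions++;
    return () -> 4;
  }
}

class LazyBatchSketch {
  public static void main(String[] args) {
    CountingScan scan = new CountingScan();
    BatchScanNode node = new BatchScanNode(scan); // planning touches only the Scan
    System.out.println(scan.conversions); // prints 0
    node.batch();
    node.batch();
    System.out.println(scan.conversions); // prints 1
  }
}
```

The point of the sketch is that code constructing the node never sees a Batch; the conversion happens once, when execution first asks for it.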

cloud-fan · 2019-07-26T02:22:06Z

I think the separation between Scan and Batch is still useless even if we move the operator pushdown to the optimizer. There is no extra information needed to convert a Scan to a Batch, which means if I have a class that implements Scan, there is no problem for me to implement Batch at the same time.

As a result, almost all the existing DS v2 implementations either implement Scan and Batch/Stream together or use an anonymous class to implement Scan. This makes me believe that we should remove this separation.

Conceptually, the physical scan is represented by InputPartition and PartitionReaderFactory, not the interface that creates them. It makes more sense to use a single interface to represent a logical scan, which creates InputPartition and PartitionReaderFactory.
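
The "implement Scan and Batch together" pattern mentioned above can be sketched as follows. This is a minimal Java example with simplified stand-in interfaces, not the real Spark API. `SimpleScan`, its constructor arguments, and the String-valued schema are hypothetical.

```java
// Simplified stand-ins for illustration; these are NOT the real Spark interfaces.
interface InputPartition {}

interface PartitionReaderFactory {}

interface Batch {
  InputPartition[] planInputPartitions();
  PartitionReaderFactory createReaderFactory();
}

interface Scan {
  String readSchema(); // assumption: a String instead of Spark's StructType
  Batch toBatch();
}

// One class plays both roles. Converting the Scan to a Batch needs no extra
// information, so toBatch() simply returns `this`.
final class SimpleScan implements Scan, Batch {
  private final String schema;
  private final int numPartitions;

  SimpleScan(String schema, int numPartitions) {
    this.schema = schema;
    this.numPartitions = numPartitions;
  }

  @Override
  public String readSchema() {
    return schema;
  }

  @Override
  public Batch toBatch() {
    return this;
  }

  @Override
  public InputPartition[] planInputPartitions() {
    InputPartition[] partitions = new InputPartition[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      partitions[i] = new InputPartition() {};
    }
    return partitions;
  }

  @Override
  public PartitionReaderFactory createReaderFactory() {
    return new PartitionReaderFactory() {};
  }
}

class ScanAndBatchSketch {
  public static void main(String[] args) {
    SimpleScan scan = new SimpleScan("id INT", 3);
    Batch batch = scan.toBatch();
    System.out.println(batch == scan); // prints true
    System.out.println(batch.planInputPartitions().length); // prints 3
  }
}
```

In this sketch the `toBatch()` hop carries no information of its own, which is the observation behind the argument for a single interface.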

cloud-fan · 2019-08-06T12:27:27Z

sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/ContinuousScan.java

+ * An interface that defines how to scan the data from data source for continuous streaming
+ * processing.
+ *
+ * The scanning procedure is:


Hi @jose-torres , can you double-check if my explanation is correct?

cloud-fan · 2019-08-06T12:27:34Z

sql/catalyst/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchScan.java

+ * An interface that defines how to scan the data from data source for micro-batch streaming
+ * processing.
+ *
+ * The scanning procedure is:


Hi @jose-torres , can you double-check if my explanation is correct?

SparkQA · 2019-08-06T15:43:17Z

Test build #108717 has finished for PR 25180 at commit a7d0c55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

github-actions · 2019-12-26T00:07:33Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

SparkQA · 2019-12-27T02:21:19Z

Test build #115830 has finished for PR 25180 at commit a7d0c55.

This patch fails build dependency tests.
This patch does not merge cleanly.
This patch adds no public classes.

cloud-fan force-pushed the merge branch from 439d6dc to af96a60 Compare July 17, 2019 17:28

dongjoon-hyun added the SQL label Jul 18, 2019

cloud-fan force-pushed the merge branch from af96a60 to 878eaa5 Compare July 18, 2019 04:23

dongjoon-hyun reviewed Jul 18, 2019

external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchScan.scala (resolved)

dongjoon-hyun reviewed Jul 18, 2019

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala (outdated, resolved)

dongjoon-hyun reviewed Jul 18, 2019

...n/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousTextSocketSource.scala (outdated, resolved)

dongjoon-hyun reviewed Jul 18, 2019

...c/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketMicroBatchScan.scala (outdated, resolved)

dongjoon-hyun changed the title ~~[SPARK-28423][SQL] merge Scan and Batch/Stream~~ [SPARK-28423][SQL] Merge Scan and Batch/Stream Jul 19, 2019

cloud-fan force-pushed the merge branch from 83f2967 to 9c826f3 Compare July 24, 2019 04:46

cloud-fan added 2 commits August 6, 2019 19:58

merge Scan and Batch/Stream

d2832ce

address comments

a7d0c55

cloud-fan force-pushed the merge branch from 9c826f3 to a7d0c55 Compare August 6, 2019 11:58

cloud-fan commented Aug 6, 2019


github-actions bot added the Stale label Dec 26, 2019

github-actions bot closed this Dec 27, 2019

HyukjinKwon reopened this Dec 27, 2019

cloud-fan closed this Dec 27, 2019
