Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28423][SQL] Merge Scan and Batch/Stream #25180

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jul 17, 2019

What changes were proposed in this pull request?

By design, Scan represents a logical data scan, Batch/Stream represents a physical data scan.

However, this doesn't match reality. The logical plan(DataSourceV2Relation) contains Table and the phyiscal plan(BatchScanExec and friends) contains Scan. The operator pushdown happens at planning time, so Scan and Batch/Stream are always created together in the planner rules. That said, Table is the actual logical data scan.

Since there is not much can be separated from Scan and Batch/Stream, almost all the existing DS v2 implementations either implement Scan and Batch/Stream together, or use anonymous class to implement Scan.

In addition, the write side API has no such separation either: it's just WriterBuilder -> BatchWrite/StreamingWrite.

This PR proposes to merge Scan and Batch/Stream, to match the write side API: ScanBuilder -> BatchScan/MicroBatchScan/ContinuousScan.

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

@jose-torres
Copy link
Contributor

This seems fine, but didn’t we decide a while back (Q4 18 I think) not to do it?

@SparkQA
Copy link

SparkQA commented Jul 17, 2019

Test build #107789 has finished for PR 25180 at commit 439d6dc.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

@jose-torres seems there is a misunderstanding. This is on my TODO list a long time ago and I do want to finish it before 3.0.

@SparkQA
Copy link

SparkQA commented Jul 17, 2019

Test build #107794 has finished for PR 25180 at commit af96a60.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107824 has finished for PR 25180 at commit 878eaa5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 18, 2019

Test build #107841 has finished for PR 25180 at commit 878eaa5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28423][SQL] merge Scan and Batch/Stream [SPARK-28423][SQL] Merge Scan and Batch/Stream Jul 19, 2019
@dongjoon-hyun
Copy link
Member

Could you review this change please, @rdblue ?

@SparkQA
Copy link

SparkQA commented Jul 19, 2019

Test build #107906 has finished for PR 25180 at commit 83f2967.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor

rdblue commented Jul 23, 2019

Sorry I haven't gotten to this yet. I had some unexpected travel and was out of the office. I should be able to take a look at this tomorrow.

@dongjoon-hyun
Copy link
Member

Thank you, @rdblue .

@cloud-fan
Copy link
Contributor Author

@rdblue cool, thanks!

@SparkQA
Copy link

SparkQA commented Jul 24, 2019

Test build #108077 has finished for PR 25180 at commit 9c826f3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 24, 2019

Test build #108092 has finished for PR 25180 at commit 9c826f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Copy link
Contributor

rdblue commented Jul 24, 2019

That said, Table is the actual logical data scan.

No, table is a source that can be scanned. A logical scan has filters and a projection.

The operator pushdown happens at planning time, so Scan and Batch/Stream are always created together in the planner rules.

The scan is created in DataSourceV2Strategy, but batch is a lazy field in BatchScanExec. There's no need for the planner strategy to know about the batch and stream objects at all.

This separation would be useful if we decide to move push-down into a batch in the optimizer. We've been discussing options for doing push-down earlier and being able to use stats in the optimizer. If we did that, then the separation between scan and batch/stream would support that. We would introduce a logical node that has a scan that is produced in the optimizer.

@cloud-fan
Copy link
Contributor Author

I think the separation between Scan and Batch is still useless even if we move the operator pushdown to the optimizer. There is no extra information needed to convert a Scan to a Batch, which means if I have a class that implements Scan, there is no problem for me to implement Batch at the same time.

As a result, almost all the existing DS v2 implementations either implement Scan and Batch/Stream together, or use anonymous class to implement Scan. This makes me believe that we should remove this separation.

Conceptually, the physical scan is represented by InputPartition and PartitionReaderFactory, not the interface that creates them. It makes more sense to use a single interface to represent a logical scan, which creates InputPartition and PartitionReaderFactory.

* An interface that defines how to scan the data from data source for continuous streaming
* processing.
*
* The scanning procedure is:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jose-torres , can you double-check if my explanation is correct?

* An interface that defines how to scan the data from data source for micro-batch streaming
* processing.
*
* The scanning procedure is:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jose-torres , can you double-check if my explanation is correct?

@SparkQA
Copy link

SparkQA commented Aug 6, 2019

Test build #108717 has finished for PR 25180 at commit a7d0c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Dec 26, 2019
@github-actions github-actions bot closed this Dec 27, 2019
@HyukjinKwon HyukjinKwon reopened this Dec 27, 2019
@SparkQA
Copy link

SparkQA commented Dec 27, 2019

Test build #115830 has finished for PR 25180 at commit a7d0c55.

  • This patch fails build dependency tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this Dec 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants