[SPARK-28423][SQL] Merge Scan and Batch/Stream #25180
Conversation
This seems fine, but didn’t we decide a while back (Q4 18, I think) not to do it?
Test build #107789 has finished for PR 25180 at commit
@jose-torres seems there is a misunderstanding. This has been on my TODO list for a long time, and I do want to finish it before 3.0.
Test build #107794 has finished for PR 25180 at commit
Test build #107824 has finished for PR 25180 at commit
retest this please.
Test build #107841 has finished for PR 25180 at commit
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchScan.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Source.scala
...n/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousTextSocketSource.scala
...c/main/scala/org/apache/spark/sql/execution/streaming/sources/TextSocketMicroBatchScan.scala
Could you review this change please, @rdblue?
Test build #107906 has finished for PR 25180 at commit
Sorry I haven't gotten to this yet. I had some unexpected travel and was out of the office. I should be able to take a look at this tomorrow. |
Thank you, @rdblue . |
@rdblue cool, thanks! |
Test build #108077 has finished for PR 25180 at commit
retest this please |
Test build #108092 has finished for PR 25180 at commit
No, table is a source that can be scanned. A logical scan has filters and a projection.
The scan is created in ….

This separation would be useful if we decide to move push-down into a batch in the optimizer. We've been discussing options for doing push-down earlier and being able to use stats in the optimizer. If we did that, then the separation between scan and batch/stream would support that. We would introduce a logical node that has a scan that is produced in the optimizer.
I think the separation between `Scan` and `Batch`/`Stream` doesn't have much benefit in practice. As a result, almost all the existing DS v2 implementations either implement `Scan` and `Batch`/`Stream` together, or use an anonymous class to implement `Scan`. Conceptually, the physical scan is represented by `Batch`/`Stream`.
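To illustrate the pattern being discussed, here is a hedged sketch using simplified stand-in interfaces (the names mirror Spark's `Scan`/`Batch`, but the bodies are illustrative only, not Spark's actual API): because the logical and physical scan are always created together, implementations tend to collapse the two into one class.

```java
// Simplified stand-ins for the interfaces under discussion.
interface Batch {
    String[] planInputPartitions();  // "physical" side: which splits to read
}

interface Scan {
    Batch toBatch();                 // "logical" scan -> "physical" scan
}

// In practice the two are implemented together: the Scan simply
// returns itself (or an anonymous class) as the Batch.
class CsvScan implements Scan, Batch {
    private final String path;      // hypothetical example path
    CsvScan(String path) { this.path = path; }

    @Override public Batch toBatch() { return this; }

    @Override public String[] planInputPartitions() {
        return new String[] { path + "/part-0", path + "/part-1" };
    }
}

public class ScanBatchDemo {
    public static void main(String[] args) {
        Scan scan = new CsvScan("/data/events");
        Batch batch = scan.toBatch();   // no real separation: same object
        System.out.println(scan == batch);
        System.out.println(batch.planInputPartitions().length);
    }
}
```

The `toBatch()` call adds an indirection but no information, which is the redundancy the PR removes.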
* An interface that defines how to scan the data from data source for continuous streaming
* processing.
*
* The scanning procedure is:
Hi @jose-torres , can you double-check if my explanation is correct?
* An interface that defines how to scan the data from data source for micro-batch streaming
* processing.
*
* The scanning procedure is:
Hi @jose-torres , can you double-check if my explanation is correct?
Test build #108717 has finished for PR 25180 at commit
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
Test build #115830 has finished for PR 25180 at commit
What changes were proposed in this pull request?

By design, `Scan` represents a logical data scan, and `Batch`/`Stream` represents a physical data scan.

However, this doesn't match reality. The logical plan (`DataSourceV2Relation`) contains `Table`, and the physical plan (`BatchScanExec` and friends) contains `Scan`. The operator pushdown happens at planning time, so `Scan` and `Batch`/`Stream` are always created together in the planner rules. That said, `Table` is the actual logical data scan.

Since there is not much that can be separated between `Scan` and `Batch`/`Stream`, almost all the existing DS v2 implementations either implement `Scan` and `Batch`/`Stream` together, or use an anonymous class to implement `Scan`.

In addition, the write side API has no such separation either: it's just `WriteBuilder` -> `BatchWrite`/`StreamingWrite`.

This PR proposes to merge `Scan` and `Batch`/`Stream`, to match the write side API: `ScanBuilder` -> `BatchScan`/`MicroBatchScan`/`ContinuousScan`.

How was this patch tested?

Existing tests.
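The proposed shape can be sketched as follows. This is a hedged illustration with simplified stand-in interfaces and hypothetical names (`buildForBatch`, `CsvScanBuilder`), not Spark's actual classes: the builder produces the merged `BatchScan` directly, mirroring the write side's `WriteBuilder` -> `BatchWrite`.

```java
// Hedged sketch of the merged API shape this PR proposes.
interface BatchScan {
    String[] planInputPartitions();  // physical planning, formerly on Batch
    String description();            // logical info, formerly on Scan
}

interface ScanBuilder {
    BatchScan buildForBatch();       // one step: no intermediate Scan object
}

// Hypothetical implementation: one class per mode, no Scan/Batch split.
class CsvScanBuilder implements ScanBuilder {
    private final String path;
    CsvScanBuilder(String path) { this.path = path; }

    @Override public BatchScan buildForBatch() {
        return new BatchScan() {
            @Override public String[] planInputPartitions() {
                return new String[] { path + "/part-0" };
            }
            @Override public String description() { return "csv:" + path; }
        };
    }
}

public class MergedScanDemo {
    public static void main(String[] args) {
        BatchScan scan = new CsvScanBuilder("/data/events").buildForBatch();
        System.out.println(scan.description());
        System.out.println(scan.planInputPartitions().length);
    }
}
```

Micro-batch and continuous modes would get their own `MicroBatchScan`/`ContinuousScan` interfaces built the same way, so each execution mode has exactly one object to implement.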