API: Add BatchScan to Table #5922
}

@Test
public void testDataTableScan() {
I'll add more tests if we decide to go with this change.
 * @param columns column names
 * @return a new scan based on this with the given projection columns
 */
default ThisT select(String... columns) {
Moved from TableScan to reuse in BatchScan.
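A rough sketch of why moving `select(String...)` into a shared, self-typed parent lets both scan types reuse it. The interfaces below are simplified stand-ins written for illustration, not Iceberg's actual `Scan`, `TableScan`, or `BatchScan` definitions; `SimpleBatchScan`, `projection()`, and `SelectDemo` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical shared parent: ThisT is the concrete scan type that
// refinement methods return, so chained calls keep their specific type.
interface Scan<ThisT> {
  ThisT select(List<String> columns);

  // Varargs convenience defined once here and inherited by every scan type.
  default ThisT select(String... columns) {
    return select(Arrays.asList(columns));
  }
}

// Simplified stand-in for the new scan type; projection() exists only for the demo.
interface BatchScan extends Scan<BatchScan> {
  List<String> projection();
}

// Hypothetical immutable implementation: each refinement returns a new scan.
final class SimpleBatchScan implements BatchScan {
  private final List<String> columns;

  SimpleBatchScan(List<String> columns) {
    this.columns = List.copyOf(columns);
  }

  @Override
  public BatchScan select(List<String> newColumns) {
    return new SimpleBatchScan(newColumns);
  }

  @Override
  public List<String> projection() {
    return columns;
  }
}

final class SelectDemo {
  public static void main(String[] args) {
    // The inherited default method returns BatchScan, not the generic parent.
    BatchScan scan = new SimpleBatchScan(List.of()).select("id", "data");
    System.out.println(scan.projection());
  }
}
```

Because the default method is declared against `ThisT`, an implementation only has to provide the collection-based overload.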
rdblue left a comment
Overall this looks like the right direction to me. I'm glad it was not too much code to start being able to return other task types!
 *
 * @return a batch scan for this table
 */
default BatchScan newBatchScan() {
I am wondering if batch is the most accurate name. Batch jobs can also do incremental scans. To me, streaming/batch is more about execution mode.
This scan contrasts with the incremental scans below. An incremental scan is meant for the difference between two snapshots (start, end). This scan is more like a snapshot scan: it scans the table using a specific snapshot as the point of view on the table. Just to bootstrap some brainstorming, what about newSnapshotScan?
This method is also used by aggregate metadata tables like all_manifests, which scan across all snapshots. That's why newSnapshotScan may be misleading.
I come from a Spark background, where batch is used in the scan API. However, I am open to other names. Is there a good word with the opposite meaning to incremental?
We could call it V2TableScan but I am not a big fan of using versions in the name.
Yeah, V2TableScan is not very meaningful. I see why SnapshotScan is not appropriate for a metadata table scan. I would be fine with BatchScan if we can't find a better alternative.
This is a scan of the table's or metadata table's state at some point in time. What about PointScan? Or just BaseScan?
I think BatchScan makes sense, although I do see the point that "batch" isn't really the opposite of "incremental". TableScan works the best, but that's already taken. I don't really care for SnapshotScan because that's too specific in a way that we don't really need to state (isn't it almost always a snapshot that gets scanned?).
I'm not sure what a better name would be, so I'd go with BatchScan.
We usually call something BaseXXX if it is a default package-private implementation of an interface. I am also not sure how descriptive PointScan would be.
I spent the morning trying to come up with alternative names but I had a hard time. I'd say Scan or TableScan would work best but they are both taken.
@stevenzwu, would you be OK with keeping it as BatchScan for now to unblock this change?
Yes. BatchScan seems to be the best option so far. I am totally fine with it.
336bec1 to b37e1f8
Thanks for reviewing, @stevenzwu @rdblue!
This PR adds a new scan type called BatchScan that is supposed to gradually replace TableScan.
The primary motivation for adding the new interface:
FileScanTask.
CombinedScanTask and DataTask (currently extends FileScanTask) in the future.
There is an ongoing effort to add the position_deletes metadata table, which requires the scan to produce a new task type that must be treated by readers in a special way. If we make DeleteFileScanTask extend FileScanTask, existing readers may break as they would treat DeleteFileScanTask as a regular FileScanTask. It is pretty much the same situation we have today with DataTask, where readers do an explicit check if a task is DataTask and then handle it in a special way.
One downside of the new API is that we return a generic ScanTask so readers will have to check if they support a particular subtype. I think we can add new methods either to ScanTaskGroup or BatchScan to make that validation easy.
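To make the subtype check concrete, here is one possible shape for such a validation helper. This is only a sketch under assumptions: the interfaces are empty stand-ins rather than Iceberg's real types, and `TaskValidation.requireAll`, `SimpleFileScanTask`, and `SimpleDeleteFileScanTask` are hypothetical names that do not appear in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in task hierarchy: both task kinds are siblings under ScanTask,
// so a delete task can never be mistaken for a regular file task.
interface ScanTask {}

interface FileScanTask extends ScanTask {
  String path();
}

interface DeleteFileScanTask extends ScanTask {
  String path();
}

final class SimpleFileScanTask implements FileScanTask {
  private final String path;

  SimpleFileScanTask(String path) {
    this.path = path;
  }

  @Override
  public String path() {
    return path;
  }
}

final class SimpleDeleteFileScanTask implements DeleteFileScanTask {
  private final String path;

  SimpleDeleteFileScanTask(String path) {
    this.path = path;
  }

  @Override
  public String path() {
    return path;
  }
}

// Hypothetical helper: a reader declares the one task type it supports
// and fails fast if the scan produced anything else.
final class TaskValidation {
  private TaskValidation() {}

  static <T extends ScanTask> List<T> requireAll(
      List<? extends ScanTask> tasks, Class<T> supported) {
    List<T> result = new ArrayList<>();
    for (ScanTask task : tasks) {
      if (!supported.isInstance(task)) {
        throw new UnsupportedOperationException(
            "Unsupported task type: " + task.getClass().getName());
      }
      result.add(supported.cast(task));
    }
    return result;
  }
}

final class TaskValidationDemo {
  public static void main(String[] args) {
    List<ScanTask> tasks = List.of(new SimpleFileScanTask("data-1.parquet"));
    for (FileScanTask task : TaskValidation.requireAll(tasks, FileScanTask.class)) {
      System.out.println(task.path());
    }
  }
}
```

A reader that only understands `FileScanTask` would call the helper once up front instead of scattering `instanceof` checks through its read path.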