Skip to content

API: Add BatchScan to Table#5922

Merged
aokolnychyi merged 1 commit intoapache:masterfrom
aokolnychyi:batch-table-scan
Oct 14, 2022
Merged

API: Add BatchScan to Table#5922
aokolnychyi merged 1 commit intoapache:masterfrom
aokolnychyi:batch-table-scan

Conversation

@aokolnychyi
Copy link
Contributor

This PR adds a new scan type called BatchScan that is supposed to gradually replace TableScan.

The primary motivation for adding the new interface:

  • An ability to produce tasks that are don't inherit FileScanTask.
  • An ability to deprecate CombinedScanTask and DataTask (currently extends FileScanTask) in the future.

There is an ongoing effort to add the position_deletes metadata table, which requires the scan to produce a new task type that must be treated by readers in a special way. If we make DeleteFileScanTask extend FileScanTask, existing readers may break as they would treat DeleteFileScanTask as a regular FileScanTask. It is pretty much the same situation we have today with DataTask, where readers do an explicit check if a task is DataTask and then handle it in a special way.

One downside of the new API is that we return a generic ScanTask so readers will have to check if they support a particular subtype. I think we can add new methods either to ScanTaskGroup or BatchScan to make that validation easy.

}

@Test
public void testDataTableScan() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add more tests if we decide to go with this change.

* @param columns column names
* @return a new scan based on this with the given projection columns
*/
default ThisT select(String... columns) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved from TableScan to reuse in BatchScan.

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks like the right direction to me. I'm glad it was not too much code to start being able to return other task types!

*
* @return a batch scan for this table
*/
default BatchScan newBatchScan() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering if batch is the most accurate name. batch jobs can also do incremental scan. To me, streaming/batch is more about execution mode.

This scan is contrasting to the incremental scans below. Incremental scan is meant for the difference between two snapshots (start, end). This scan is more like a snapshot scan. It scans the table using a specific snapshot (as point of view to the table). Just to bootstrap some brainstorming. what about newSnapshotScan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is also used by aggregate metadata tables like (all_manifests), which scan across all snapshots. That's why newSnapshotScan may be misleading.

I am coming from the Spark background that uses batch in its scan API. However, I am open to other names. Is there a good word with opposite meaning to incremental?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could call it V2TableScan but I am not a big fan of using versions in the name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts, @stevenzwu @rdblue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, V2TableScan is not very meaningful. I see why SnapshotScan is not appropriate for metadata table scan. I would be fine with BatchScan if we can't find better alternative.

This is a scan on the table or metadata table's state at some point of time. what about PointScan? or just BaseScan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think BatchScan makes sense, although I do see the point that "batch" isn't really the opposite of "incremental". TableScan works the best, but that's already taken. I don't really care for SnapshotScan because that's too specific in a way that we don't really need to state (isn't it almost always a snapshot that gets scanned?).

I'm not sure what a better name would be, so I'd go with BatchScan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually call something BaseXXX if it is a default package-private implementation of an interface. I am also not sure how descriptive PointScan would be.

I spent the morning trying to come up with alternative names but I had a hard time. I'd say Scan or TableScan would work best but they are both taken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu, would you be OK with keeping it as BatchScan for now to unblock this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. BatchScan seems to be the best option so far. I am totally fine.

@aokolnychyi aokolnychyi merged commit c2f77aa into apache:master Oct 14, 2022
@aokolnychyi
Copy link
Contributor Author

Thanks for reviewing, @stevenzwu @rdblue!

sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 10, 2023
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants