Spark 3.3: Choose readers based on task types #6345
Conversation
```diff
 import org.slf4j.LoggerFactory;

-class BatchDataReader extends BaseBatchReader<FileScanTask> {
+class BatchDataReader extends BaseBatchReader<FileScanTask>
```
This class was only used as a PartitionReader in SparkScan, where we extended it, mixed in PartitionReader, and called the resulting implementation BatchReader. After adding a common reader factory, we may have multiple batch readers, which is why BatchDataReader seemed like a more accurate name than BatchReader. Since no other places used this class, I decided to implement PartitionReader directly here.
Any feedback is welcome. See SparkScan below for the old usage.
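For illustration, a minimal toy sketch of a reader that implements PartitionReader directly, mirroring the pattern described above (this is not the PR's code; BatchDataReader in the PR inherits its reading logic from BaseBatchReader):

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Toy reader implementing PartitionReader directly, so no anonymous
// subclass mixing in the interface is needed at the scan level.
class DirectBatchReader implements PartitionReader<ColumnarBatch> {
  private final Iterator<ColumnarBatch> batches;
  private ColumnarBatch current = null;

  DirectBatchReader(Iterator<ColumnarBatch> batches) {
    this.batches = batches;
  }

  @Override
  public boolean next() {
    if (!batches.hasNext()) {
      return false;
    }
    current = batches.next();
    return true;
  }

  @Override
  public ColumnarBatch get() {
    return current;
  }

  @Override
  public void close() throws IOException {
    if (current != null) {
      current.close();
    }
  }
}
```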
```java
  }

  BatchDataReader(
      ScanTaskGroup<FileScanTask> task,
```
Most other readers use a different parameter order: table, taskGroup, expectedSchema, caseSensitive.
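For reference, a hedged sketch of the suggested order (signature only; the class name and body here are placeholders, not the PR's code):

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Schema;
import org.apache.iceberg.ScanTaskGroup;
import org.apache.iceberg.Table;

class BatchDataReaderSketch {
  // Hypothetical signature matching the order used by the other readers:
  // table, taskGroup, expectedSchema, caseSensitive.
  BatchDataReaderSketch(
      Table table,
      ScanTaskGroup<FileScanTask> taskGroup,
      Schema expectedSchema,
      boolean caseSensitive) {
    // body elided; in the PR the constructor delegates to BaseBatchReader
  }
}
```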
```diff
 public Batch toBatch() {
-  return new SparkChangelogBatch(
-      spark, table, readConf, taskGroups(), expectedSchema, hashCode());
+  return new SparkBatch(sparkContext, table, readConf, taskGroups(), expectedSchema, hashCode());
```
If you take a look at SparkChangelogBatch, it was identical to SparkBatch except for the reader factory. Since we are adding a common factory for all tasks, it seemed appropriate to always use one class.
```java
SparkInputPartition partition = (SparkInputPartition) inputPartition;

if (partition.allTasksOfType(FileScanTask.class)) {
```
@szehon-ho, this is where you would check the type of tasks and select your reader.
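A hedged sketch of the dispatch this enables (the changelog branch and the reader constructors are assumptions for illustration, not the PR's final code):

```java
import org.apache.iceberg.ChangelogScanTask;
import org.apache.iceberg.FileScanTask;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;

// Illustrative factory method: inspect the task type carried by the
// partition and pick a matching reader implementation.
public PartitionReader<InternalRow> createReader(InputPartition inputPartition) {
  SparkInputPartition partition = (SparkInputPartition) inputPartition;

  if (partition.allTasksOfType(FileScanTask.class)) {
    return new RowDataReader(partition); // hypothetical constructor
  } else if (partition.allTasksOfType(ChangelogScanTask.class)) {
    return new ChangelogRowReader(partition); // hypothetical constructor
  } else {
    throw new UnsupportedOperationException("Unsupported task group");
  }
}
```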
```java
  return (ScanTaskGroup<T>) taskGroup;
}

public <T extends ScanTask> boolean allTasksOfType(Class<T> javaClass) {
```
Curious question here: do we assume a single task type for every taskGroup or not?
Just wondering if we can optimize this to check only the type of the taskGroup.
The idea is to be able to check whether all tasks conform to a particular known parent type. For instance, for changelog tasks we only check whether all tasks implement ChangelogScanTask. The reader itself (e.g. ChangelogRowReader) may then downcast them to a particular child type.
We can't check the type of the taskGroup itself because of Java type erasure.
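A minimal sketch of such a check, assuming the partition can reach its task collection (the accessor below is an assumption; the PR's actual implementation may differ):

```java
import java.util.Collection;
import org.apache.iceberg.ScanTask;

// Type erasure removes ScanTaskGroup's type parameter at runtime, so the
// only reliable check is over the task instances themselves.
public <T extends ScanTask> boolean allTasksOfType(Class<T> javaClass) {
  Collection<? extends ScanTask> tasks = taskGroup().tasks(); // assumed accessor
  return tasks.stream().allMatch(javaClass::isInstance);
}
```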
Do we allow a taskGroup to have tasks of different types? I didn't think so. But I don't think it costs us much to check all of them instead of just the first element.
Makes sense. Yeah, my question was the same as Russell's: whether there is always a one-to-one relation (defined by TaskGroup). But I guess there's nothing in TaskGroup preventing tasks that are different subclasses of T, although that would be strange.
That's true; ScanTaskGroup may have arbitrary tasks.
```java
@Override
public boolean supportColumnarReads(InputPartition inputPartition) {
  return batchSize > 1;
```
Not sure how this relates to vectorized reads?
I see this was in the old code; I guess we keep it.
Correct, I just copied what we had before. Our SparkBatch decides whether vectorized reads are supported and passes a batch size > 1 if supported, 0 otherwise.
How about adding methods .vectorized(boolean) and .batchSize(int)? It would be more code, but logically cleaner.
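A hedged sketch of that suggestion, carrying an explicit vectorized flag instead of overloading batchSize > 1 (names and stub bodies are illustrative, not the PR's final code):

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

// Hypothetical variant: "is this read vectorized?" and "what batch size
// should it use?" become two separate pieces of state.
class SketchPartitionReaderFactory implements PartitionReaderFactory {
  private final boolean vectorized;
  private final int batchSize;

  SketchPartitionReaderFactory(boolean vectorized, int batchSize) {
    this.vectorized = vectorized;
    this.batchSize = batchSize;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    return vectorized; // explicit flag, no magic batch-size check
  }
}
```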
```java
class SparkPartitionReaderFactory implements PartitionReaderFactory {
  private final int batchSize;

  SparkPartitionReaderFactory(int batchSize) {
```
Maybe we should just document here that a batch size > 1 will create a columnar reader; it seems like a bit of a magic parameter.
What about creating separate batch and non-batch reader factories? I just copied the existing code, but having separate classes seems more natural than checking whether batchSize is > 1.
Thoughts, @RussellSpitzer @szehon-ho?
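For illustration, a hedged sketch of what such a split could look like (class names and stub bodies are assumptions; per the follow-up below, the factory was indeed split into two):

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Row-based factory: no batch size at all; supportColumnarReads keeps
// its default of false from PartitionReaderFactory.
class SparkRowReaderFactorySketch implements PartitionReaderFactory {
  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }
}

// Columnar factory: owns the batch size and always reads columnar.
class SparkColumnarReaderFactorySketch implements PartitionReaderFactory {
  private final int batchSize;

  SparkColumnarReaderFactorySketch(int batchSize) {
    this.batchSize = batchSize;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("only columnar reads");
  }

  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    return true;
  }
}
```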
RussellSpitzer
left a comment
This all makes sense to me. @szehon-ho is closer to the implementation here, though, so I'll defer to him on whether this makes sense for his delete file reader implementation.
szehon-ho
left a comment
Yep, I'll still have to see how my new table will fit (I'm a bit behind), but the refactor looks good to me, and we can revisit if needed.
flyrain
left a comment
Looks pretty good. Thanks @aokolnychyi!
I've split the factory into two. I feel that eliminates the confusion pretty well.
Thanks for reviewing, @szehon-ho @RussellSpitzer @flyrain!
This PR adds SparkPartitionReaderFactory, which creates readers based on the tasks in input partitions.