Spark: Adopt the new Scan Task APIs in Spark Readers #5248
aokolnychyi merged 14 commits into apache:master
Conversation
import org.apache.spark.rdd.InputFileBlockHolder;
import org.apache.spark.sql.catalyst.InternalRow;

public class ChangelogRowReader {
This class is only for demonstration. I will remove it before merging and will file a separate PR for the changelog reader.
Stream<ContentFile> stream = Stream.of(contentScanTask.file());
if (contentScanTask.isFileScanTask()) {
  stream = Stream.concat(stream, contentScanTask.asFileScanTask().deletes().stream());
} else if (contentScanTask instanceof AddedRowsScanTask) {
  stream = Stream.concat(stream, ((AddedRowsScanTask) contentScanTask).deletes().stream());
} else if (contentScanTask instanceof DeletedDataFileScanTask) {
  stream = Stream.concat(stream, ((DeletedDataFileScanTask) contentScanTask).existingDeletes().stream());
} else if (contentScanTask instanceof DeletedRowsScanTask) {
  stream = Stream.concat(stream, ((DeletedRowsScanTask) contentScanTask).addedDeletes().stream());
  stream = Stream.concat(stream, ((DeletedRowsScanTask) contentScanTask).existingDeletes().stream());
}
I'm not happy with this downcasting. To eliminate it, we could add a method to ContentScanTask that returns all related content files. For example:
- In FileScanTask, the method returns the data file as well as all delete files.
- In DeletedRowsScanTask, the method returns the data file, the addedDeletes files, and the existingDeletes files.
The method name could be allContentFiles(), relatedContentFiles(), etc.
Yeah, this is not good. I'd be up for exposing something like referencedDataFiles or dataFiles along with referencedDeleteFiles or deleteFiles to avoid ? in the public API.
Maybe we can even add those methods to ScanTask?
I'm OK with these solutions. Adding them to ScanTask also makes sense.
For the inputFile map in BaseDataReader, we don't actually need to know whether a file is a DataFile or a DeleteFile. But to avoid ? in the public APIs, we would still need two methods, one for DataFile and another for DeleteFile. Just wondering if we can remove the type parameter from interface ContentFile<F> so that we can use ContentFile directly in this case.
Here is another use case for unifying DataFile and DeleteFile: #4142 (review).
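For illustration, one way to avoid the instanceof chain is to let each concrete reader report the files its task type references, so the base class only ever deals with a Stream<ContentFile<?>> and the public task APIs stay unchanged. A minimal sketch under that assumption (not necessarily the final design):

// Sketch only: each reader subclass knows its concrete task type, so no downcasting is needed.
// In a reader that handles FileScanTask:
protected Stream<ContentFile<?>> referencedFiles(FileScanTask task) {
  return Stream.concat(Stream.of(task.file()), task.deletes().stream());
}

// In a reader that handles DeletedRowsScanTask:
protected Stream<ContentFile<?>> referencedFiles(DeletedRowsScanTask task) {
  Stream<DeleteFile> deletes =
      Stream.concat(task.addedDeletes().stream(), task.existingDeletes().stream());
  return Stream.concat(Stream.of(task.file()), deletes);
}

A referencedFiles helper along these lines also shows up in a later diff in this review.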
Let me take a look now.
szehon-ho left a comment
Took a preliminary look and made some comments
return stream;
});

dataFileStream
This is a little outside the scope of this change, but now that we are making a variable, how about:
Map<String, ByteBuffer> keyMetadata = dataFileStream.collect.toMap(file -> file.key, file -> file.value)
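Spelled out, that suggestion might look roughly like the following, assuming dataFileStream is a Stream<ContentFile<?>>, that file paths are unique, and that every referenced file carries key metadata (a sketch of the idea, not the PR's final code):

// Collect file path -> encryption key metadata in one pass, using java.util.stream.Collectors.
Map<String, ByteBuffer> keyMetadata = dataFileStream.collect(
    Collectors.toMap(file -> file.path().toString(), ContentFile::keyMetadata));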
 * @param <T> is the Java class returned by this reader whose objects contain one or more rows.
 */
abstract class BaseDataReader<T> implements Closeable {
abstract class BaseDataReader<T, CST extends ContentScanTask<?>, G extends ScanTaskGroup<CST>>
To be honest, I am not convinced we have to restrict this to ContentScanTask. Also, keep in mind that different changelog tasks can be packed into the same task group. That's why the reader should not be restricted to strictly one task type.
I'd consider using ScanTask and changing the hierarchy a bit. This class can become BaseReader. I also don't think we need the second type parameter. We can just work with ScanTaskGroup<ST>.
abstract class BaseReader<T, ST extends ScanTask> implements Closeable {
  ...

  BaseReader(Table table, ScanTaskGroup<ST> taskGroup) {
    this.table = table;
    this.tasks = taskGroup.tasks().iterator();
    this.inputFiles = inputFiles(taskGroup);
    this.currentIterator = CloseableIterator.empty();
  }

  ...
}
Then I'd consider adding BaseRowReader like this (also no data in the name).
abstract class BaseRowReader<ST extends ScanTask> extends BaseReader<InternalRow, ST> {
  private final Schema tableSchema;
  private final Schema expectedSchema;
  private final String nameMapping;
  private final boolean caseSensitive;

  BaseRowReader(Table table, ScanTaskGroup<ST> taskGroup, Schema expectedSchema, boolean caseSensitive) {
    super(table, taskGroup);
    this.tableSchema = table.schema();
    this.expectedSchema = expectedSchema;
    this.nameMapping = table.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
    this.caseSensitive = caseSensitive;
  }

  protected Schema tableSchema() {
    return tableSchema;
  }

  protected Schema expectedSchema() {
    return expectedSchema;
  }

  protected CloseableIterable<InternalRow> newIterable(InputFile file, FileFormat format, long start, long length,
                                                       Expression residual, Schema projection,
                                                       Map<Integer, ?> idToConstant) {
    switch (format) {
      case PARQUET:
        return newParquetIterable(file, start, length, residual, projection, idToConstant);
      case AVRO:
        return newAvroIterable(file, start, length, projection, idToConstant);
      case ORC:
        return newOrcIterable(file, start, length, residual, projection, idToConstant);
      default:
        throw new UnsupportedOperationException("Cannot read unknown format: " + format);
    }
  }

  ...
}
Then the existing RowDataReader won't need to change a lot.
class RowDataReader extends BaseRowReader<FileScanTask> {
...
}
Finally, we will have ChangelogRowReader capable of reading all types of changelog tasks like this.
class ChangelogRowReader extends BaseRowReader<ChangelogScanTask> {
...
}
I agree that the second type parameter isn't needed. Thanks for the suggestion.
If we want to use ScanTask in BaseReader/BaseDataReader, we need to either:
- move some common logic to its subclasses, e.g. getInputFile() and constantsMap(), which need methods from ContentScanTask (a sketch of this option follows), or
- add new methods to ScanTask, e.g. file().
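A minimal sketch of the first option, assuming the base class keeps a path-keyed map of input files behind a hypothetical inputFiles() accessor:

// Illustrative only: a ContentScanTask-aware subclass resolves the InputFile for its task,
// so the generic BaseReader never needs ContentScanTask in its own signatures.
protected InputFile getInputFile(ContentScanTask<?> task) {
  return inputFiles().get(task.file().path().toString());
}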
aokolnychyi left a comment
This looks almost ready to go. I had a question about constructing an empty delete filter for batch reads instead of passing null and a few nits.
protected StructLike asStructLike(InternalRow row) {
  return asStructLike.wrap(row);
}
SparkDeleteFilter deleteFilter = new SparkDeleteFilter(filePath, task.deletes());
I think this changed the previous behavior where we would not construct a delete filter if the list of deletes is empty. Shall we still pass null if the deletes are empty, just to avoid surprises? I am not sure there would be any performance degradation, but it seems safer to keep the old behavior.
Agreed that it is safer to not change the behavior.
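A minimal sketch of keeping the old behavior, following the names in the diff above (only the empty-check is new):

// Only construct a delete filter when the task actually has delete files, as before.
SparkDeleteFilter deleteFilter =
    task.deletes().isEmpty() ? null : new SparkDeleteFilter(filePath, task.deletes());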
  this.batchSize = batchSize;
}

protected CloseableIterable<ColumnarBatch> newBatchIterable(InputFile location, FileFormat format,
nit: I know it was like this in the old implementation but what about renaming location -> file?
  }
}

private CloseableIterable<ColumnarBatch> newParquetIterable(InputFile location, long start, long length,
}

abstract CloseableIterator<T> open(FileScanTask task);
abstract CloseableIterator<T> open(TaskT task);
nit: In other classes (not very consistent), we usually put abstract methods immediately after the constructor so that it is obvious what children must implement. Since we are changing this line anyway and also adding one more abstract method, what about putting those two immediately after the constructor and making both either protected or package-private for consistency?
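For illustration, the suggested ordering might look like this skeleton (the fields and the exact method set are illustrative, not the final code):

abstract class BaseReader<T, TaskT extends ScanTask> implements Closeable {
  private final Iterator<TaskT> tasks;

  BaseReader(Table table, ScanTaskGroup<TaskT> taskGroup) {
    this.tasks = taskGroup.tasks().iterator();
  }

  // Abstract methods immediately after the constructor, both protected for consistency.
  protected abstract CloseableIterator<T> open(TaskT task);

  protected abstract Stream<ContentFile<?>> referencedFiles(TaskT task);
}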
Stream<EncryptedInputFile> encryptedFiles = taskGroup.tasks().stream()
    .flatMap(this::referencedFiles)
    .map(file ->
        EncryptedFiles.encryptedInput(table.io().newInputFile(file.path().toString()), file.keyMetadata()));
optional: You could define a helper method for constructing encrypted input files and fit this on one line.
Stream<EncryptedInputFile> encryptedFiles = taskGroup.tasks().stream()
    .flatMap(this::referencedFiles)
    .map(this::toEncryptedInputFile);

private EncryptedInputFile toEncryptedInputFile(ContentFile<?> file) {
  InputFile inputFile = table.io().newInputFile(file.path().toString());
  return EncryptedFiles.encryptedInput(inputFile, file.keyMetadata());
}
if (readSchema.findField(MetadataColumns.PARTITION_COLUMN_ID) != null) {
  StructType partitionType = Partitioning.partitionType(table);
  return PartitionUtil.constantsMap(task, partitionType, BaseDataReader::convertConstant);
  Types.StructType partitionType = Partitioning.partitionType(table);
Did we add Types. on purpose?
No. Let me remove it.
protected class SparkDeleteFilter extends DeleteFilter<InternalRow> {
  private final InternalRowWrapper asStructLike;

  SparkDeleteFilter(String filePath, List<DeleteFile> deletes, Schema requestedSchema) {
optional: You may remove this constructor by using expectedSchema() in EqualityDeleteRowReader, which is now available via the base reader. It is a bit confusing that we compute the table schema from an instance var of the outer class yet accept a parameter for the expected schema, even though we have access to the expected schema in this class.
Makes sense. I was trying not to touch EqualityDeleteRowReader. Made the change.
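Sketched out, the slimmed-down constructor could pull both schemas from the enclosing reader, assuming DeleteFilter's constructor takes the file path, the delete files, the table schema, and the requested schema (illustrative only):

// Hypothetical two-argument constructor; tableSchema() and expectedSchema() come from the outer reader.
SparkDeleteFilter(String filePath, List<DeleteFile> deletes) {
  super(filePath, deletes, tableSchema(), expectedSchema());
}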
      .build();
}

private CloseableIterable<ColumnarBatch> newOrcIterable(InputFile location, long start, long length,
StructInternalRow row = new StructInternalRow(readSchema.asStruct());
CloseableIterable<InternalRow> asSparkRows = CloseableIterable.transform(
    task.asDataTask().rows(), row::setStruct);
CloseableIterable<InternalRow> asSparkRows = CloseableIterable.transform(task.asDataTask().rows(), row::setStruct);
nit: What about inlining to remove the useless temp var?
return CloseableIterable.transform(task.asDataTask().rows(), row::setStruct);
import static org.apache.iceberg.Files.localOutput;

public class TestSparkBaseDataReader {
public class TestBaseReader {
Is the rename required?
I'm not sure, but I prefer this renaming since we changed the class name from BaseDataReader to BaseReader.
Thanks for the change, @flyrain! Thanks for reviewing, @szehon-ho!
This PR mainly adopts the changes from #5077. I'm also experimenting with the changelog reader to demonstrate how the new reader hierarchy works.
cc @aokolnychyi @rdblue @RussellSpitzer @stevenzwu @szehon-ho @karuppayya