
Data: Add TCK tests for ReadBuilder in BaseFormatModelTests#15633

Open
Guosmilesmile wants to merge 2 commits into apache:main from Guosmilesmile:tck_read_builder

Conversation

Contributor

@Guosmilesmile commented Mar 14, 2026

This PR adds TCK test coverage for the ReadBuilder API in BaseFormatModelTests. The new tests verify the contract of each ReadBuilder configuration method across all supported file formats (Avro, Parquet, ORC) and data generators.

Part of #15415

@github-actions github-actions bot added the data label Mar 14, 2026
throws IOException {

// Orc does not support batch reading.
assumeThat(fileFormat != FileFormat.ORC).isTrue();
Contributor Author

public ReadBuilder<D, S> reuseContainers() {
this.reuseContainers = true;
return this;
}

I found that the ORC reader internals do not have a reuseContainers option, so setting it has no effect. That is why I skip ORC here.

throws IOException {

// Avro does not support batch reading.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> recordsPerBatch(int numRowsPerBatch) {
throw new UnsupportedOperationException("Batch reading is not supported in Avro reader");
}

Avro does not support recordsPerBatch.


// Avro does not support filter push down; caseSensitive has no effect on it.
// Skip this test for Avro to avoid false failures.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> caseSensitive(boolean caseSensitive) {
// Filtering is not supported in Avro reader, so case sensitivity does not matter
// This is not an error since filtering is best-effort.
return this;
}

Avro does not support caseSensitive.


// Avro does not support filter push down
// Skip this test for Avro to avoid false failures.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> filter(Expression filter) {
// Filtering is not supported in Avro reader
// This is not an error since filtering is best-effort.
return this;
}

Avro does not support filter.

Contributor

I think this would be worth a letter to the dev list about how to handle the missing features. We should collect as many examples as we can and ask the community what to do about it.

I don't like hiding them here.

  1. Maybe the FileFormat enum can contain some methods like supportsFilter, supportsCaseSensitive?
  2. Maybe we can have a matrix at the beginning of the test case, like MISSING_FEATURES = ImmutableMap.of(FileFormat.AVRO, new String[] {"filter", "caseSensitive"});
  3. Maybe a separate test class for every format instead of storing this directly in the FileFormat enum?

I prefer 1 or 2 but am open to other suggestions.

Contributor Author

I chose option 2. For now, this part is only used by the test cases, so I’m opting to close the loop in tests first.
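A minimal sketch of what option 2 could look like, using only the JDK. The enum and feature names below are placeholders for illustration, not the actual Iceberg types:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for org.apache.iceberg.FileFormat.
enum FileFormat { AVRO, PARQUET, ORC }

class MissingFeatures {
  // Builder features each format is known to silently ignore or reject.
  static final Map<FileFormat, Set<String>> MISSING_FEATURES =
      Map.of(
          FileFormat.AVRO, Set.of("filter", "caseSensitive", "recordsPerBatch"),
          FileFormat.ORC, Set.of("reuseContainers"));

  // A test would consult this before exercising a builder method and skip when true.
  static boolean isMissing(FileFormat format, String feature) {
    return MISSING_FEATURES.getOrDefault(format, Set.of()).contains(feature);
  }
}
```

A test could then call `assumeThat(MissingFeatures.isMissing(fileFormat, "filter")).isFalse()` instead of hard-coding per-format assumptions.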

Schema fullSchema = dataGenerator.schema();

List<Types.NestedField> columns = fullSchema.columns();
Schema projectedSchema = new Schema(columns.get(columns.size() - 1));
Contributor

I have seen cases where the issue happened when the projection was in the beginning or in the middle of the column list.

Please create a projected schema where we skip some columns in the beginning and in the middle as well, like:

  • Original: (a, b, c, d, e)
  • Projected: (b, d)

This could be random with the same seed, or whatever deterministic technique you would like to use.

Contributor Author

I changed the projection part to select the odd-numbered fields.
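The odd-field selection can be sketched with plain Java. Column names here are illustrative; the real test would build a Schema from Types.NestedField objects:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OddFieldProjection {
  // Keep every second column (indices 1, 3, 5, ...), so the projection
  // skips columns at the beginning and in the middle of the list.
  static List<String> projectOddFields(List<String> columns) {
    return IntStream.range(0, columns.size())
        .filter(i -> i % 2 == 1)
        .mapToObj(columns::get)
        .collect(Collectors.toList());
  }
}
```

For the reviewer's example (a, b, c, d, e) this yields (b, d), skipping columns at the start, in the middle, and at the end.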

List<Types.NestedField> columns = fullSchema.columns();
Schema projectedSchema = new Schema(columns.get(columns.size() - 1));

List<Record> genericRecords = dataGenerator.generateRecords();
Contributor

Do we want to run these tests against all of the generators?
I expect that we will have a default schema for most of the tests, and we only want to run this on a single reasonably wide and a single reasonably narrow schema.


// Construct a filter condition that is smaller than the minimum value to achieve file-level
// filtering.
Types.NestedField firstField = schema.columns().get(0);
Contributor

This heavily builds on the schema of the records generated by the generator.
Maybe run the tests for all of the FileFormats, but only with a single generator.

So the tests should be like:

  @ParameterizedTest
  @FieldSource("FORMATS")
  void testReaderBuilderFilter(FileFormat fileFormat) {
    DataGenerator dataGenerator = new StructOfPrimitive(); // or DataGenerator.structOfPrimitive()

Contributor Author

Added a default schema, and updated the ReadBuilder-related test cases to use it.
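The file-level filtering idea quoted above can be illustrated with plain Java: choose a predicate threshold at or below the column's minimum so no row can match, letting a format with column statistics skip the whole file. The real test would build an Iceberg Expression instead; this sketch only shows the value selection:

```java
import java.util.Collections;
import java.util.List;

class BelowMinFilter {
  // A "col < threshold" predicate with threshold == min(col) matches no rows,
  // so a reader with file-level statistics can prune the entire file.
  static long chooseThreshold(List<Long> columnValues) {
    return Collections.min(columnValues);
  }

  static long countMatching(List<Long> columnValues, long threshold) {
    return columnValues.stream().filter(v -> v < threshold).count();
  }
}
```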

/**
* Write with Generic Record, then read using split to verify that the split range is respected.
* Reading with a zero-length split at the end of the file should return no records, while reading
* with the full file range should return all records.
Contributor

If we have a DataFile instead of only the InputFile, then we have splitOffsets. Based on those, we can verify whether there is enough data for multiple splits; if so, we can check that reading a single split produces more than 0 and fewer than all records, and that reading all of the splits produces the expected records.

Contributor Author

The way to trigger multiple groups within a single DataFile varies across different file formats — such as row groups in Parquet, stripes in ORC, etc.

Would we trigger this by writing 1000 or more records? Though explicitly setting this size feels a bit awkward.

Contributor

I'm not sure what is the best method for this.
We definitely need to check if there are multiple splits in the file before running a test.
Maybe we can start the generation, check the number of splits, and use a constant row count that is high enough to generate multiple splits? The check could state clearly that there are not enough splits and that the row count should be increased until there are enough.

Contributor Author

I tried writing continuously, but before a new split was generated, the unit test failed due to running out of memory. I’m keeping the original approach here and haven’t made any changes for now.
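The split semantics under test can be sketched over plain Java byte ranges. The record offsets below are hypothetical; real readers would derive them from the format's splitOffsets:

```java
import java.util.List;
import java.util.stream.Collectors;

class SplitRead {
  // Return records whose start offset falls inside [splitStart, splitStart + splitLength).
  // A zero-length split at the end of the file therefore selects nothing,
  // and a split covering [0, fileLength) selects everything.
  static List<Long> readSplit(List<Long> recordOffsets, long splitStart, long splitLength) {
    return recordOffsets.stream()
        .filter(o -> o >= splitStart && o < splitStart + splitLength)
        .collect(Collectors.toList());
  }
}
```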

* Verifies the contract of recordsPerBatch: recordsPerBatch is a hint for vectorized readers. The
* total number of records returned must be unaffected regardless of the batch size value.
*/
void testReaderBuilderRecordsPerBatch(FileFormat fileFormat, DataGenerator dataGenerator)
Contributor

Are we testing non-vectorized readers with recordsPerBatch?
This seems strange. We might want to restrict this to vectorized models, and verify with a batch size bigger than 1.

Maybe we should throw an exception when the recordsPerBatch is set and the reader is non-vectorized.

Contributor Author

You’re right—I should adjust this. Non-vectorized readers should throw an exception when recordsPerBatch is set. I’ll update it accordingly. Thanks for pointing it out.

Contributor

The fix and the test for it could go in another PR.

Contributor Author

Right now Avro throws an exception. I think I should adjust my assertions to verify that it throws the exception as expected, instead of skipping the test.
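Asserting the exception instead of skipping could look like this; the method below is a stand-in mimicking the quoted Avro builder behavior, not the real ReadBuilder:

```java
class BatchCheck {
  // Stand-in for the Avro builder method quoted earlier in this thread.
  static void recordsPerBatchOnAvro(int numRowsPerBatch) {
    throw new UnsupportedOperationException("Batch reading is not supported in Avro reader");
  }

  // Returns true when the call throws UnsupportedOperationException,
  // which is what the test would assert for the Avro format.
  static boolean throwsUnsupported() {
    try {
      recordsPerBatchOnAvro(100);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }
}
```

In the actual test, AssertJ's assertThatThrownBy would express the same check more directly.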

Contributor

Either that, or fix the other readers and add the test and the fix in another PR.
