
Data: Add TCK tests for ReadBuilder in BaseFormatModelTests#15633

Open
Guosmilesmile wants to merge 2 commits into apache:main from Guosmilesmile:tck_read_builder

Conversation

Contributor

@Guosmilesmile commented Mar 14, 2026

This PR adds TCK test coverage for the ReadBuilder API in BaseFormatModelTests. The new tests verify the contract of each ReadBuilder configuration method across all supported file formats (Avro, Parquet, ORC) and data generators.

Part of #15415

@github-actions github-actions bot added the data label Mar 14, 2026
throws IOException {

// Orc does not support batch reading.
assumeThat(fileFormat != FileFormat.ORC).isTrue();
Contributor Author

public ReadBuilder<D, S> reuseContainers() {
this.reuseContainers = true;
return this;
}

I found that the ORC reader internals do not have a reuseContainers option, so setting it has no effect. That is why I skip ORC here.

throws IOException {

// Avro does not support batch reading.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> recordsPerBatch(int numRowsPerBatch) {
throw new UnsupportedOperationException("Batch reading is not supported in Avro reader");
}

Avro does not support recordsPerBatch.


// Avro does not support filter push down; caseSensitive has no effect on it.
// Skip this test for Avro to avoid false failures.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> caseSensitive(boolean caseSensitive) {
// Filtering is not supported in Avro reader, so case sensitivity does not matter
// This is not an error since filtering is best-effort.
return this;
}

Avro does not support caseSensitive.


// Avro does not support filter push down
// Skip this test for Avro to avoid false failures.
assumeThat(fileFormat != FileFormat.AVRO).isTrue();
Contributor Author

public ReadBuilder<D, S> filter(Expression filter) {
// Filtering is not supported in Avro reader
// This is not an error since filtering is best-effort.
return this;
}

Avro does not support filter.

Contributor

I think this would be worth a letter to the dev list about how to handle the missing features. We should collect as many examples as we can and ask the community what to do about it.

I don't like hiding them here.

  1. Maybe the FileFormat enum can contain some methods like supportsFilter, supportsCaseSensitive?
  2. Maybe we can have a matrix at the beginning of the test case, like MISSING_FEATURES = ImmutableMap.of(FileFormat.AVRO, new String[] {"filter", "caseSensitive"});
  3. Maybe a separate test class for every format instead of storing this directly in the FileFormat enum?

I prefer 1 or 2 but am open to other suggestions.

Contributor Author

I chose option 2. For now, this part is only used by the test cases, so I’m opting to close the loop in tests first.
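A minimal sketch of what option 2 could look like, using only the JDK. The enum and feature names below are placeholders for illustration, not the actual Iceberg types:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for org.apache.iceberg.FileFormat.
enum FileFormat { AVRO, PARQUET, ORC }

class MissingFeatures {
  // Builder features each format is known to silently ignore or reject.
  static final Map<FileFormat, Set<String>> MISSING_FEATURES =
      Map.of(
          FileFormat.AVRO, Set.of("filter", "caseSensitive", "recordsPerBatch"),
          FileFormat.ORC, Set.of("reuseContainers"));

  // A test would consult this before exercising a builder method and skip when true.
  static boolean isMissing(FileFormat format, String feature) {
    return MISSING_FEATURES.getOrDefault(format, Set.of()).contains(feature);
  }
}
```

A test could then call `assumeThat(MissingFeatures.isMissing(fileFormat, "filter")).isFalse()` instead of hard-coding per-format assumptions.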

Schema fullSchema = dataGenerator.schema();

List<Types.NestedField> columns = fullSchema.columns();
Schema projectedSchema = new Schema(columns.get(columns.size() - 1));
Contributor

I have seen cases where the issue happened when the projection was in the beginning or in the middle of the column list.

Please create a projected schema where we skip some columns in the beginning and in the middle as well, like:

  • Original: (a, b, c, d, e)
  • Projected: (b, d)

This could be random with the same seed, or whatever deterministic technique you would like to use.

Contributor Author

I changed the projection part to select the odd-numbered fields.
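The odd-field selection can be sketched with plain Java. Column names here are illustrative; the real test would build a Schema from Types.NestedField objects:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class OddFieldProjection {
  // Keep every second column (indices 1, 3, 5, ...), so the projection
  // skips columns at the beginning and in the middle of the list.
  static List<String> projectOddFields(List<String> columns) {
    return IntStream.range(0, columns.size())
        .filter(i -> i % 2 == 1)
        .mapToObj(columns::get)
        .collect(Collectors.toList());
  }
}
```

For the reviewer's example (a, b, c, d, e) this yields (b, d), skipping columns at the start, in the middle, and at the end.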

List<Types.NestedField> columns = fullSchema.columns();
Schema projectedSchema = new Schema(columns.get(columns.size() - 1));

List<Record> genericRecords = dataGenerator.generateRecords();
Contributor

Do we want to run these tests against all of the generators?
I expect that we will have a default schema for most of the tests, and we only want to run this on a single reasonably wide and a single reasonably narrow schema.


// Construct a filter condition that is smaller than the minimum value to achieve file-level
// filtering.
Types.NestedField firstField = schema.columns().get(0);
Contributor

This heavily builds on the schema of the records generated by the generator.
Maybe run the tests for all of the FileFormats, but only with a single generator.

So the tests should be like:

  @ParameterizedTest
  @FieldSource("FORMATS")
  void testReaderBuilderFilter(FileFormat fileFormat) {
    DataGenerator dataGenerator = new StructOfPrimitive(); // or DataGenerator.structOfPrimitive()

Contributor Author

Added a default schema, and updated the ReadBuilder-related test cases to use it.
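The file-level filtering idea quoted above can be illustrated with plain Java: choose a predicate threshold at or below the column's minimum so no row can match, letting a format with column statistics skip the whole file. The real test would build an Iceberg Expression instead; this sketch only shows the value selection:

```java
import java.util.Collections;
import java.util.List;

class BelowMinFilter {
  // A "col < threshold" predicate with threshold == min(col) matches no rows,
  // so a reader with file-level statistics can prune the entire file.
  static long chooseThreshold(List<Long> columnValues) {
    return Collections.min(columnValues);
  }

  static long countMatching(List<Long> columnValues, long threshold) {
    return columnValues.stream().filter(v -> v < threshold).count();
  }
}
```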

/**
* Write with Generic Record, then read using split to verify that the split range is respected.
* Reading with a zero-length split at the end of the file should return no records, while reading
* with the full file range should return all records.
Contributor

If we have a DataFile instead of only the InputFile, then we have splitOffsets. Based on those, we can verify whether there is enough data for multiple splits; if so, we can check that reading a single split produces more than 0 and fewer than all records, and that reading all of the splits produces the expected records.

Contributor Author

The way to trigger multiple groups within a single DataFile varies across different file formats — such as row groups in Parquet, stripes in ORC, etc.

Would we trigger this by writing 1000 or more records? Though explicitly setting this size feels a bit awkward.

Contributor

I'm not sure what is the best method for this.
We definitely need to check if there are multiple splits in the file before running a test.
Maybe we can start the generation, check the number of splits, and use a constant row count that is high enough to generate multiple splits? The check could state clearly that there are not enough splits and that the row count should be increased until there are enough.

Contributor Author

I tried writing continuously, but before a new split was generated, the unit test failed due to running out of memory. I’m keeping the original approach here and haven’t made any changes for now.
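The split semantics under test can be sketched over plain Java byte ranges. The record offsets below are hypothetical; real readers would derive them from the format's splitOffsets:

```java
import java.util.List;
import java.util.stream.Collectors;

class SplitRead {
  // Return records whose start offset falls inside [splitStart, splitStart + splitLength).
  // A zero-length split at the end of the file therefore selects nothing,
  // and a split covering [0, fileLength) selects everything.
  static List<Long> readSplit(List<Long> recordOffsets, long splitStart, long splitLength) {
    return recordOffsets.stream()
        .filter(o -> o >= splitStart && o < splitStart + splitLength)
        .collect(Collectors.toList());
  }
}
```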

* Verifies the contract of recordsPerBatch: recordsPerBatch is a hint for vectorized readers. The
* total number of records returned must be unaffected regardless of the batch size value.
*/
void testReaderBuilderRecordsPerBatch(FileFormat fileFormat, DataGenerator dataGenerator)
Contributor

Are we testing non-vectorized readers with recordsPerBatch?
This seems strange. We might want to restrict this to vectorized models, and verify with a batch size bigger than 1.

Maybe we should throw an exception when the recordsPerBatch is set and the reader is non-vectorized.

Contributor Author

You’re right—I should adjust this. Non-vectorized readers should throw an exception when recordsPerBatch is set. I’ll update it accordingly. Thanks for pointing it out.

Contributor

The fix and the test for it could go in another PR.

Contributor Author

Right now Avro throws an exception. I think I should adjust my assertions to verify that it throws the exception as expected, instead of skipping the test.
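Asserting the exception instead of skipping could look like this; the method below is a stand-in mimicking the quoted Avro builder behavior, not the real ReadBuilder:

```java
class BatchCheck {
  // Stand-in for the Avro builder method quoted earlier in this thread.
  static void recordsPerBatchOnAvro(int numRowsPerBatch) {
    throw new UnsupportedOperationException("Batch reading is not supported in Avro reader");
  }

  // Returns true when the call throws UnsupportedOperationException,
  // which is what the test would assert for the Avro format.
  static boolean throwsUnsupported() {
    try {
      recordsPerBatchOnAvro(100);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }
}
```

In the actual test, AssertJ's assertThatThrownBy would express the same check more directly.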

Contributor

Either that, or fix the other readers and add the test and the fix in another PR.
