Data: Add TCK coverage for reader default values#16638

Open
joyhaldar wants to merge 9 commits into apache:main from joyhaldar:tck-default-values

Contributor

@joyhaldar joyhaldar commented May 31, 2026

Adds the reader default-value tests from DataTestBase into the Base Format model TCK:

  • testDefaultValues
  • testNullDefaultValue
  • testNestedDefaultValue
  • testMapNestedDefaultValue
  • testListNestedDefaultValue
  • testMissingRequiredWithoutDefault

Adds DataGenerators.PrimitiveDefaults and the following tests:

  • testPrimitiveDefaultValues
  • testPrimitiveDefaultValuesNotApplied

Adds a supportsTime() hook that defaults to true and gates the TIME column; TestSparkFormatModel overrides it to return false. FIXED is excluded with a TODO.

Removes testReaderSchemaEvolutionNewColumnWithDefault.

@github-actions github-actions Bot added the data label May 31, 2026
Comment thread data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java Outdated
.map(generator -> Arguments.of(format, generator)))
.toList();

private static final List<Arguments> PRIMITIVE_TYPES_AND_DEFAULTS =
Contributor

Could we use a DataGenerator for this?

Contributor Author

The matrix is (type, default-literal) pairs and the test asserts the reader injects exactly that literal, so the type has to stay paired with its expected value. DataGenerator is one schema() + random rows, so it can't carry that pairing, at least as I understand the current interface.

Contributor

Can we create a generator with a schema with default values, and create 2 tests, one with nulls where we check for the default values, and one with random data (not nulls), and check for the generated data?

Contributor Author

@joyhaldar joyhaldar Jun 4, 2026

Added DataGenerators.PrimitiveDefaults and two tests. testPrimitiveDefaultValues writes only id, so the defaulted columns are absent from the file and the reader injects the defaults. testPrimitiveDefaultValuesNotApplied writes random values into all columns and checks the round trip.

These replace the old testReaderSchemaEvolutionNewColumnWithDefault and the type/default matrix, both of which I removed.
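
The behavior those two tests assert can be sketched without any Iceberg classes. The snippet below is only an illustration under stated assumptions: `DefaultInjectionSketch`, `readWithDefaults`, and the column name `int_with_default` are hypothetical, and a plain map stands in for both the data file row and the read schema's defaults. It is not the TCK code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a reader that applies column defaults; not an Iceberg API.
final class DefaultInjectionSketch {
  private DefaultInjectionSketch() {}

  // readSchemaDefaults maps every column of the read schema to its default (null = no default).
  static Map<String, Object> readWithDefaults(
      Map<String, Object> writtenRow, Map<String, Object> readSchemaDefaults) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> column : readSchemaDefaults.entrySet()) {
      String name = column.getKey();
      // A written value always wins; the default is used only when the column is absent.
      Object value = writtenRow.containsKey(name) ? writtenRow.get(name) : column.getValue();
      result.put(name, value);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> defaults = new LinkedHashMap<>();
    defaults.put("id", null);
    defaults.put("int_with_default", 34);

    // Case 1: only id was written, so the default is injected.
    Map<String, Object> onlyId = new LinkedHashMap<>();
    onlyId.put("id", 1L);
    System.out.println(readWithDefaults(onlyId, defaults));

    // Case 2: every column was written, so the default is not applied.
    Map<String, Object> allColumns = new LinkedHashMap<>();
    allColumns.put("id", 2L);
    allColumns.put("int_with_default", 7);
    System.out.println(readWithDefaults(allColumns, defaults));
  }
}
```

The first case corresponds to what testPrimitiveDefaultValues checks, the second to testPrimitiveDefaultValuesNotApplied.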


@ParameterizedTest
@FieldSource("FILE_FORMATS")
void testDefaultValues(FileFormat fileFormat) throws IOException {
Contributor

Do we need this test? It seems the same as testSchemaEvolutionAddColumn.

Contributor Author

@joyhaldar joyhaldar Jun 4, 2026

testSchemaEvolutionAddColumn adds plain optional columns that read back as null. testDefaultValues instead checks that columns with defaults get the default injected, and that a written value isn't overwritten by its default, and there's some overlap there with the new testPrimitiveDefaultValues. @pvary @Guosmilesmile do you think testDefaultValues is worth keeping, or should I remove it?

Contributor

Makes sense, we can keep it for now.

expectedNested.setField("missing_inner_float", -0.0F);
expected.setField("nested", expectedNested);
}
return expected;
Contributor

nit: new line

Contributor Author

Done.

.copy("value_str", val.getField("value_str"), "value_int", 34)));
expected.setField("nested_map", rebuilt);
}
return expected;
Contributor

nit: new line. Please check all the code. Thanks.

Contributor Author

Done.

.collect(Collectors.toList());
expected.setField("nested_list", rebuilt);
}
return expected;
Contributor

nit: new line.

Contributor Author

Done.

List<Record> genericRecords = RandomGenericData.generate(writeSchema, 10, 1L);
writeGenericRecords(fileFormat, writeSchema, genericRecords);

Schema expectedSchema =
Contributor

Can we reuse writeSchema to create expectedSchema?

Contributor Author

Extracted idField and dataField and shared them between the write and expected schemas.

List<Record> genericRecords = RandomGenericData.generate(writeSchema, 10, 1L);
writeGenericRecords(fileFormat, writeSchema, genericRecords);

Schema expectedSchema =
Contributor

Same as above.

Contributor Author

Extracted idField and dataField and shared them between the write and expected schemas.

Comment on lines +101 to +102
// TODO: include TIME once the engine readers support it.
// TODO: include FIXED once Spark supports it.
Contributor

We have 3 implementations:

  • Generic
  • Flink
  • Spark

Which one supports which of these types?
Shall we create specific test methods for these?

Contributor Author

TIME: works on Flink; fails on Spark, java.lang.UnsupportedOperationException: Unsupported logical type: TIME_MICROS.

FIXED: works on Flink; fails on Spark, java.lang.ClassCastException: class [B cannot be cast to class java.nio.ByteBuffer.

Since both work on Flink and only Spark fails, including them would mean running on Flink but skipping on Spark. I'm not sure of the best way to gate that per-engine here, open to suggestions.

Or we could keep both excluded and just update the TODO to note it's Spark specific, or I could look into the Spark side TIME and FIXED handling to see if it can be fixed, though I'm not familiar with those code paths yet and would need to dig in (which I don't mind).

Contributor

I assume both work when we use the Generic readers and writers.

Contributor

In another PR we should add TestGenericFormatModel and exclude the duplicated tests.

Contributor

> TIME: works on Flink; fails on Spark, java.lang.UnsupportedOperationException: Unsupported logical type: TIME_MICROS.

Maybe #16665 could be interesting here. In the meantime we plan to use supportsTime for this, which TestSparkFormatModel could override.

> FIXED: works on Flink; fails on Spark, java.lang.ClassCastException: class [B cannot be cast to class java.nio.ByteBuffer.

Is it a bug? Maybe a separate PR could be good where we discuss what is the expected behavior. In the meantime we can add a TODO.

This is also highlighted in #15795

Contributor Author

So for TIME there's #16665 and #15795 (adds the supportsTime() hook). And FIXED doesn't have a fix yet.

Should we wait for those two to land and build on them, or ship this PR with TODOs for now (I am fine with waiting)? I could look into the FIXED issue in the meantime.

Contributor

I wouldn't wait.
Add the same supportsTime method as defined in #15795 and the faster one wins, the second one rebases 😄
Keep a TODO for the bytebuffer stuff. This is for default values, not the primitive type tests

Contributor Author

I have added supportsTime() matching #15795 and gated TIME on it; Spark overrides it to return false.

FIXED left out with a TODO.
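
The hook pattern described in this comment can be sketched roughly as follows. This is a simplified illustration, not the code in BaseFormatModelTests: the class names `BaseTests`, `GenericLikeTests`, and `SparkLikeTests`, the method `columnTypesUnderTest`, and the use of plain strings for types are all assumptions made to keep the example free of Iceberg dependencies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an overridable capability hook in a shared test base class.
final class SupportsTimeSketch {
  private SupportsTimeSketch() {}

  abstract static class BaseTests {
    // Defaults to true; an engine that cannot read TIME overrides this.
    protected boolean supportsTime() {
      return true;
    }

    // Column types a test would cover for this engine, with TIME gated on the hook.
    List<String> columnTypesUnderTest() {
      List<String> types = new ArrayList<>(List.of("INT", "STRING", "TIME"));
      if (!supportsTime()) {
        types.remove("TIME");
      }
      return types;
    }
  }

  // Stand-in for an engine that supports TIME and keeps the default.
  static final class GenericLikeTests extends BaseTests {}

  // Stand-in for an engine test class that opts out of TIME.
  static final class SparkLikeTests extends BaseTests {
    @Override
    protected boolean supportsTime() {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(new GenericLikeTests().columnTypesUnderTest());
    System.out.println(new SparkLikeTests().columnTypesUnderTest());
  }
}
```

With this shape the shared tests stay in the base class, and each engine only declares what it cannot handle.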

Contributor Author

@joyhaldar joyhaldar Jun 7, 2026

I was looking into the FIXED ClassCastException. The TCK's InternalRowConverter assumed a FIXED value is always a ByteBuffer and cast it directly, but a generic Record actually holds FIXED as a byte[] IIUC, so casting byte[] to ByteBuffer throws the exception. It never showed up before because the TCK had no FIXED column, so that branch never ran on a real FIXED value; adding one as part of this PR is what surfaced it.

The fix checks the value type and handles byte[], ByteBuffer, and GenericData.Fixed, all yielding byte[], which is what Spark's InternalRow expects IIUC. This mirrors what TestHelpers does.

Should I keep this here and drop the TODO, or put it into a separate PR and keep the TODO in this one?
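
A type-checking conversion of the kind described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual InternalRowConverter change: `FixedValueSketch` and `toByteArray` are hypothetical names, and the GenericData.Fixed branch is omitted so that the example needs no Avro dependency.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: normalizes a FIXED value to byte[] whatever its runtime type.
final class FixedValueSketch {
  private FixedValueSketch() {}

  static byte[] toByteArray(Object value) {
    if (value instanceof byte[]) {
      // Already the target representation.
      return (byte[]) value;
    } else if (value instanceof ByteBuffer) {
      // Work on a duplicate so the caller's position and limit are left untouched.
      ByteBuffer buffer = ((ByteBuffer) value).duplicate();
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      return bytes;
    }
    // A real fix would also handle GenericData.Fixed here (omitted in this sketch).
    String type = value == null ? "null" : value.getClass().getName();
    throw new IllegalArgumentException("Unsupported FIXED value type: " + type);
  }

  public static void main(String[] args) {
    byte[] raw = {1, 2, 3};
    System.out.println(Arrays.toString(toByteArray(raw)));
    System.out.println(Arrays.toString(toByteArray(ByteBuffer.wrap(raw))));
  }
}
```

Checking the runtime type instead of casting unconditionally is what avoids the `[B cannot be cast to class java.nio.ByteBuffer` failure quoted earlier in the thread.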

@github-actions github-actions Bot added the spark label Jun 4, 2026
@joyhaldar joyhaldar marked this pull request as ready for review June 5, 2026 04:32

Comment on lines +1106 to +1119
Schema readSchema =
supportsTime()
? DataGenerators.PrimitiveDefaults.READ_SCHEMA
: TypeUtil.selectNot(
DataGenerators.PrimitiveDefaults.READ_SCHEMA,
Set.of(
DataGenerators.PrimitiveDefaults.READ_SCHEMA
.findField("time_with_default")
.fieldId()));

List<Record> sourceRecords = RandomGenericData.generate(readSchema, 10, 1L);
writeGenericRecords(fileFormat, readSchema, sourceRecords);

readAndAssertEngineRecords(fileFormat, readSchema, sourceRecords, record -> record);
Contributor

The situation in testPrimitiveDefaultValues seems to be quite similar. Can we extract the same parts from there?

Contributor Author

Extracted into primitiveDefaultsReadSchema(), shared by both tests.

.build());

assertThatThrownBy(
() -> readAndAssertGenericRecords(fileFormat, expectedSchema, genericRecords))
Contributor

This test only checks the generic reader component. Do we need to add tests for the engine-related components as well?

Contributor Author

You are right. I switched it to readAndAssertEngineRecords, which throws the same IllegalArgumentException.

return schema;
}

return TypeUtil.selectNot(schema, Set.of(schema.findField("time_with_default").fieldId()));
Contributor

Could we do this without relying on the field name?
Collect the fieldIds where the type is not supported? See #15795 supportedSchema for inspiration.

Contributor

I like @rambleraptor's solution which removes the need for supportsTime even more.

Contributor Author

I am now collecting the fieldIds by type instead of the name:

Set<Integer> unsupportedFieldIds =
    schema.columns().stream()
        .filter(field -> field.type().typeId() == Type.TypeID.TIME)
        .map(Types.NestedField::fieldId)
        .collect(Collectors.toSet());
return TypeUtil.selectNot(schema, unsupportedFieldIds);

As for dropping supportsTime() entirely via @rambleraptor's filterUnsupported/excludeColumnsContaining: #15795 adds those same methods to BaseFormatModelTests, so I didn't want to duplicate them here and cause a conflict. Should we move to that approach once #15795 lands? Please let me know what you think.

joyhaldar and others added 4 commits June 5, 2026 16:47
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Co-authored-by: Joy Haldar <joy.haldar@target.com>
@joyhaldar joyhaldar force-pushed the tck-default-values branch from 6d1a5f8 to aede51b Compare June 5, 2026 11:30
joyhaldar and others added 3 commits June 6, 2026 11:45
Co-authored-by: Joy Haldar <joy.haldar@target.com>
Co-authored-by: Joy Haldar <joy.haldar@target.com>
@joyhaldar joyhaldar force-pushed the tck-default-values branch from add7874 to b8ba50d Compare June 6, 2026 06:45