Data: Add TCK coverage for reader default values by joyhaldar · Pull Request #16638 · apache/iceberg

joyhaldar · 2026-05-31T13:02:13Z

Adds the reader default-value tests from DataTestBase into the Base Format model TCK:

testDefaultValues
testNullDefaultValue
testNestedDefaultValue
testMapNestedDefaultValue
testListNestedDefaultValue
testMissingRequiredWithoutDefault

Adds DataGenerators.PrimitiveDefaults and the following tests:

testPrimitiveDefaultValues
testPrimitiveDefaultValuesNotApplied

Adds a supportsTime() hook that defaults to true and gates the TIME column; TestSparkFormatModel overrides it to return false. FIXED is excluded with a TODO.
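A minimal sketch of how such a capability hook can gate a column, assuming hypothetical stand-in classes (TckBaseSketch, FlinkTckSketch, SparkTckSketch) rather than the real BaseFormatModelTests and TestSparkFormatModel. Only the time_with_default column name comes from this PR; the other column names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shared TCK base class; not the real BaseFormatModelTests.
abstract class TckBaseSketch {
  // Capability hook: defaults to true, engines without TIME support override it.
  protected boolean supportsTime() {
    return true;
  }

  // Column names the default-value tests would cover for this engine.
  // "int_with_default" and "string_with_default" are illustrative names only.
  List<String> columnsUnderTest() {
    List<String> columns = new ArrayList<>();
    columns.add("id");
    columns.add("int_with_default");
    columns.add("string_with_default");
    if (supportsTime()) {
      columns.add("time_with_default");
    }
    return columns;
  }
}

// Engine whose readers handle TIME: inherits the default of true.
class FlinkTckSketch extends TckBaseSketch {}

// Engine whose readers reject TIME: opts out through the hook.
class SparkTckSketch extends TckBaseSketch {
  @Override
  protected boolean supportsTime() {
    return false;
  }
}
```

With this shape the shared tests stay in the base class, and an engine that later gains TIME support only has to delete its override.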

Removes testReaderSchemaEvolutionNewColumnWithDefault.

pvary · 2026-06-01T09:39:44Z

                      .map(generator -> Arguments.of(format, generator)))
          .toList();

+  private static final List<Arguments> PRIMITIVE_TYPES_AND_DEFAULTS =


Could we use a DataGenerator for this?

The matrix is (type, default-literal) pairs and the test asserts the reader injects exactly that literal, so the type has to stay paired with its expected value. DataGenerator is one schema() + random rows, so it can't carry that pairing, at least as I understand the current interface.

Can we create a generator with a schema with default values, and create 2 tests, one with nulls where we check for the default values, and one with random data (not nulls), and check for the generated data?

Added DataGenerators.PrimitiveDefaults and two tests. testPrimitiveDefaultValues writes only id, so the defaulted columns are absent from the file and the reader injects the defaults. testPrimitiveDefaultValuesNotApplied writes random values into all columns and checks the round trip.

These replace the old testReaderSchemaEvolutionNewColumnWithDefault and the type/default matrix, both of which I removed.
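The behavior the two tests pin down can be sketched in plain Java. This is an illustration under simplifying assumptions, not the Iceberg reader: a row is a map from column name to value, the read schema is a map from column name to its default (null meaning no default), and DefaultInjectionSketch is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DefaultInjectionSketch {
  private DefaultInjectionSketch() {}

  // Projects a stored row onto the read schema (column name -> default value).
  static Map<String, Object> read(Map<String, Object> stored, Map<String, Object> defaults) {
    Map<String, Object> result = new LinkedHashMap<>();
    for (Map.Entry<String, Object> column : defaults.entrySet()) {
      String name = column.getKey();
      // A column present in the file keeps its written value;
      // a column absent from the file receives the schema default.
      Object value = stored.containsKey(name) ? stored.get(name) : column.getValue();
      result.put(name, value);
    }
    return result;
  }
}
```

A row written with only id reads back with the default filled in, while a row that was written with a value for the defaulted column reads back unchanged.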

Guosmilesmile · 2026-06-02T06:36:50Z

+
+  @ParameterizedTest
+  @FieldSource("FILE_FORMATS")
+  void testDefaultValues(FileFormat fileFormat) throws IOException {


Do we need this test? It seems the same as testSchemaEvolutionAddColumn.

testSchemaEvolutionAddColumn adds plain optional columns that read back as null. testDefaultValues instead checks that columns with defaults get the default injected, and that a written value isn't overwritten by its default, and there's some overlap there with the new testPrimitiveDefaultValues. @pvary @Guosmilesmile do you think testDefaultValues is worth keeping, or should I remove it?

Makes sense, we can keep it for now.

Guosmilesmile · 2026-06-02T06:39:06Z

+            expectedNested.setField("missing_inner_float", -0.0F);
+            expected.setField("nested", expectedNested);
+          }
+          return expected;


nit: new line

Guosmilesmile · 2026-06-02T06:41:26Z

+                                .copy("value_str", val.getField("value_str"), "value_int", 34)));
+            expected.setField("nested_map", rebuilt);
+          }
          return expected;


nit: new line. Please check all the code. Thanks.

Guosmilesmile · 2026-06-02T06:41:41Z

+                    .collect(Collectors.toList());
+            expected.setField("nested_list", rebuilt);
+          }
+          return expected;


nit: new line.

Guosmilesmile · 2026-06-02T06:43:54Z

+    List<Record> genericRecords = RandomGenericData.generate(writeSchema, 10, 1L);
+    writeGenericRecords(fileFormat, writeSchema, genericRecords);
+
+    Schema expectedSchema =


Can we reuse writeSchema to create expectedSchema?

Extracted idField and dataField and shared them between the write and expected schemas.

Guosmilesmile · 2026-06-02T06:45:01Z

+    List<Record> genericRecords = RandomGenericData.generate(writeSchema, 10, 1L);
+    writeGenericRecords(fileFormat, writeSchema, genericRecords);
+
+    Schema expectedSchema =


Same as above.

Extracted idField and dataField and shared them between the write and expected schemas.

pvary · 2026-06-04T08:12:13Z

+  // TODO: include TIME once the engine readers support it.
+  // TODO: include FIXED once Spark supports it.


We have 3 implementations:

Generic

Flink

Spark

Which one supports which of these types?
Shall we create specific test methods for these?

TIME: works on Flink; fails on Spark, java.lang.UnsupportedOperationException: Unsupported logical type: TIME_MICROS.

FIXED: works on Flink; fails on Spark, java.lang.ClassCastException: class [B cannot be cast to class java.nio.ByteBuffer.

Since both work on Flink and only Spark fails, including them would mean running on Flink but skipping on Spark. I'm not sure of the best way to gate that per-engine here, open to suggestions.

Or we could keep both excluded and just update the TODO to note it's Spark specific, or I could look into the Spark side TIME and FIXED handling to see if it can be fixed, though I'm not familiar with those code paths yet and would need to dig in (which I don't mind).

I assume both work when we use Generic readers and writers.

In another PR we should add TestGenericFormatModel and exclude duplicated tests

TIME: works on Flink; fails on Spark, java.lang.UnsupportedOperationException: Unsupported logical type: TIME_MICROS.

Maybe #16665 could be interesting here. We plan to use supportsTime for this in the meantime which could be overridden by TestSparkFormatModel

FIXED: works on Flink; fails on Spark, java.lang.ClassCastException: class [B cannot be cast to class java.nio.ByteBuffer.

Is it a bug? Maybe a separate PR could be good where we discuss what is the expected behavior. In the meantime we can add a TODO.

This is also highlighted in #15795

So for TIME there's #16665 and #15795 (adds the supportsTime() hook). And FIXED doesn't have a fix yet.

Should we wait for those two to land and build on them, or ship this PR with TODOs for now (I am fine with waiting)? I could look into the FIXED issue in the meantime.

I wouldn't wait.
Add the same supportsTime method as defined in #15795 and the faster one wins, the second one rebases 😄
Keep a TODO for the bytebuffer stuff. This is for default values, not the primitive type tests

I have added supportsTime() matching #15795 and gated TIME on it; Spark overrides it to return false.

FIXED left out with a TODO.

I looked into the FIXED ClassCastException. The TCK's InternalRowConverter assumed a FIXED value is always a ByteBuffer and cast it directly, but a generic Record actually holds FIXED as a byte[] IIUC, so casting byte[] to ByteBuffer throws. It never showed up before because the TCK had no FIXED column, so that branch never ran on a real FIXED value; adding one as part of this PR is what surfaced it.

The fix checks the value type, handling byte[], ByteBuffer, and GenericData.Fixed, all yielding byte[], which is what Spark's InternalRow expects IIUC, mirroring what TestHelpers does.
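A rough sketch of that kind of type check, assuming a hypothetical helper named FixedValueSketch.toBytes; it is not the actual InternalRowConverter change. The GenericData.Fixed branch is left out so the example does not depend on Avro.

```java
import java.nio.ByteBuffer;

final class FixedValueSketch {
  private FixedValueSketch() {}

  // Normalizes the in-memory representations of a FIXED value to byte[].
  static byte[] toBytes(Object value) {
    if (value instanceof byte[]) {
      // Already in the target form, as held by a generic Record per the comment above.
      return (byte[]) value;
    }
    if (value instanceof ByteBuffer) {
      // Work on a duplicate so the caller's position and limit are left untouched.
      ByteBuffer buffer = ((ByteBuffer) value).duplicate();
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      return bytes;
    }
    String type = value == null ? "null" : value.getClass().getName();
    throw new IllegalArgumentException("Unsupported FIXED representation: " + type);
  }
}
```

Failing loudly on an unknown representation, instead of casting, means a new representation shows up as a clear message rather than a ClassCastException deep in the converter.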

Should I keep this here and drop the TODO, or put it into a separate PR and keep the TODO in this one?

Guosmilesmile · 2026-06-05T05:29:10Z

+
+  @ParameterizedTest
+  @FieldSource("FILE_FORMATS")
+  void testDefaultValues(FileFormat fileFormat) throws IOException {


Makes sense, we can keep it for now.

Guosmilesmile · 2026-06-05T05:39:19Z

+    Schema readSchema =
+        supportsTime()
+            ? DataGenerators.PrimitiveDefaults.READ_SCHEMA
+            : TypeUtil.selectNot(
+                DataGenerators.PrimitiveDefaults.READ_SCHEMA,
+                Set.of(
+                    DataGenerators.PrimitiveDefaults.READ_SCHEMA
+                        .findField("time_with_default")
+                        .fieldId()));
+
+    List<Record> sourceRecords = RandomGenericData.generate(readSchema, 10, 1L);
+    writeGenericRecords(fileFormat, readSchema, sourceRecords);
+
+    readAndAssertEngineRecords(fileFormat, readSchema, sourceRecords, record -> record);


The situation in testPrimitiveDefaultValues seems to be quite similar. Can we extract the same parts from there?

Extracted into primitiveDefaultsReadSchema(), shared by both tests.

Guosmilesmile · 2026-06-05T05:41:39Z

+                .build());
+
+    assertThatThrownBy(
+            () -> readAndAssertGenericRecords(fileFormat, expectedSchema, genericRecords))


This test only checks the generic reader component. Do we need to add tests for the engine-related components as well?

You are right. Switched it to readAndAssertEngineRecords, which throws the same IllegalArgumentException.
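The failure case this test covers can be illustrated with a small stand-alone sketch. RequiredFieldSketch.resolve and its exception message are hypothetical; only the exception type, IllegalArgumentException, is taken from the thread.

```java
import java.util.Map;

final class RequiredFieldSketch {
  private RequiredFieldSketch() {}

  // Resolves one read-schema field against a stored row (column name -> value).
  static Object resolve(
      String name, boolean required, Object defaultValue, Map<String, Object> stored) {
    if (stored.containsKey(name)) {
      return stored.get(name); // the written value wins
    }
    if (defaultValue != null) {
      return defaultValue; // absent column with a default
    }
    if (required) {
      // Nothing was written and no default exists, so there is no valid value to return.
      throw new IllegalArgumentException("Missing required field: " + name);
    }
    return null; // absent optional column without a default reads as null
  }
}
```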

pvary · 2026-06-05T10:48:32Z

+      return schema;
+    }
+
+    return TypeUtil.selectNot(schema, Set.of(schema.findField("time_with_default").fieldId()));


Could we do this without relying on the field name?
Collect the fieldIds where the type is not supported? See #15795 supportedSchema for inspiration.

I like @rambleraptor's solution which removes the need for supportsTime even more.

I am now collecting the fieldIds by type instead of the name:

    Set<Integer> unsupportedFieldIds =
        schema.columns().stream()
            .filter(field -> field.type().typeId() == Type.TypeID.TIME)
            .map(Types.NestedField::fieldId)
            .collect(Collectors.toSet());
    return TypeUtil.selectNot(schema, unsupportedFieldIds);

About dropping supportsTime() entirely via @rambleraptor's filterUnsupported/excludeColumnsContaining: since #15795 adds those same methods to BaseFormatModelTests, I didn't want to duplicate them here and collide. Should we move to that once #15795 lands? Please let me know what you think.
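For readers without the Iceberg classes at hand, the same select-by-type idea can be sketched with plain Java types, generalized to a set of unsupported types. Everything here (UnsupportedTypeFilterSketch, Field, TypeId, and this selectNot) is a hypothetical stand-in, not the Iceberg API, and like the snippet above it only inspects top-level columns.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class UnsupportedTypeFilterSketch {
  private UnsupportedTypeFilterSketch() {}

  enum TypeId { LONG, STRING, TIME, FIXED }

  // Hypothetical top-level column: field id, name, and type.
  static final class Field {
    final int id;
    final String name;
    final TypeId type;

    Field(int id, String name, TypeId type) {
      this.id = id;
      this.name = name;
      this.type = type;
    }
  }

  // Collects the ids of columns whose type the engine cannot read.
  static Set<Integer> unsupportedFieldIds(List<Field> columns, Set<TypeId> unsupported) {
    return columns.stream()
        .filter(field -> unsupported.contains(field.type))
        .map(field -> field.id)
        .collect(Collectors.toSet());
  }

  // Keeps every column whose id is not in the excluded set.
  static List<Field> selectNot(List<Field> columns, Set<Integer> excludedIds) {
    return columns.stream()
        .filter(field -> !excludedIds.contains(field.id))
        .collect(Collectors.toList());
  }
}
```

Selecting by type means a renamed or newly added TIME column is excluded without touching the test, which is the point of moving away from the field name.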

Co-authored-by: Joy Haldar <joy.haldar@target.com>

github-actions Bot added the data label May 31, 2026

joyhaldar commented May 31, 2026

Comment thread data/src/test/java/org/apache/iceberg/data/BaseFormatModelTests.java Outdated

pvary reviewed Jun 1, 2026

Guosmilesmile reviewed Jun 2, 2026

pvary reviewed Jun 4, 2026

github-actions Bot added the spark label Jun 4, 2026

joyhaldar marked this pull request as ready for review June 5, 2026 04:32

Guosmilesmile reviewed Jun 5, 2026

pvary reviewed Jun 5, 2026

joyhaldar and others added 4 commits June 5, 2026 16:47

Data: Add TCK coverage for reader default values

f8a1084

Co-authored-by: Joy Haldar <joy.haldar@target.com>

Data: Add TCK coverage for reader default values

8e251e7

Co-authored-by: Joy Haldar <joy.haldar@target.com>

Data: Add TCK coverage for reader default values

7b22311

Co-authored-by: Joy Haldar <joy.haldar@target.com>

address review comments

aede51b

Co-authored-by: Joy Haldar <joy.haldar@target.com>

joyhaldar force-pushed the tck-default-values branch from 6d1a5f8 to aede51b Compare June 5, 2026 11:30

joyhaldar and others added 3 commits June 6, 2026 11:45

remove unused imports after rebase

298b4ea

Co-authored-by: Joy Haldar <joy.haldar@target.com>

trigger CI

0b29b9e

collect unsupported fieldIds by type instead of field name

b8ba50d

Co-authored-by: Joy Haldar <joy.haldar@target.com>

joyhaldar force-pushed the tck-default-values branch from add7874 to b8ba50d Compare June 6, 2026 06:45

joyhaldar and others added 2 commits June 6, 2026 20:56

Spark: Add FIXED type support to File Format API TCK

507ab46

Spark: Backport FIXED TCK support to 3.5 and 4.0

b0618ed

Co-authored-by: Joy Haldar <joy.haldar@target.com>
