Conversation

@linliu-code linliu-code commented Dec 15, 2025

Describe the issue this Pull Request addresses

This PR mainly combines two PRs that fix the timestamp_millis logical type issue:

  1. fix(ingest): Repair affected logical timestamp milli tables #14161
  2. feat(metadata): Improve Logical Type Handling on Col Stats #13711

Summary and Changelog

The following is the PR description from #14161.

PR #9743 added more schema evolution functionality and schema processing. However, we used the InternalSchema system to perform various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type; when converting back to Avro, it was assumed to be micros. Therefore, if the schema provider had any millis columns, the processed schema would end up with those columns as micros.
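
For illustration (not from the PR itself): in Avro, both timestamp-millis and timestamp-micros annotate the same physical long, so mislabeling a column changes only how the stored values are interpreted, not the bytes on disk. A minimal sketch in Java:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampLabelSketch {
  public static void main(String[] args) {
    // Both logical types decorate a plain long; only the annotation differs.
    Schema millis = LogicalTypes.timestampMillis().addToSchema(Schema.create(Schema.Type.LONG));
    Schema micros = LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));
    System.out.println(millis); // {"type":"long","logicalType":"timestamp-millis"}
    System.out.println(micros); // {"type":"long","logicalType":"timestamp-micros"}

    // A stored millis value read under the micros label is off by a factor of 1000.
    long storedMillis = 1_700_000_000_000L;
    System.out.println(Instant.ofEpochMilli(storedMillis));                  // 2023-11-14T22:13:20Z
    System.out.println(Instant.EPOCH.plus(storedMillis, ChronoUnit.MICROS)); // 1970-01-20T16:13:20Z
  }
}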

PR #13711, which updated column stats with better support for logical types, fixed those schema issues as well as additional issues with the handling and conversion of timestamps during ingestion.

This PR aims to add functionality to the Spark and Hive readers and writers to automatically repair affected tables. After switching to the 1.1 binary, the affected columns will undergo evolution from timestamp-micros to timestamp-millis. This would normally be a lossy evolution that is not supported, but it is safe here because the data is actually still timestamp-millis; it is merely mislabeled as micros in the Parquet and table schemas.

Impact

When reading from a Hudi table with the Spark or Hive reader, if the table schema has a column as millis but the data schema has it as micros, we assume that the column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers used by the default write paths, so the Parquet files are corrected as the table is rewritten. A table's latest snapshot can be fixed immediately by writing one commit with the 1.1 binary.
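
A sketch of the read-side rule described above, assuming we can compare a field's table-schema type with its file-schema type (the helper names below are hypothetical, not the PR's actual API):

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class MillisRepairSketch {
  // Affected: the table schema says timestamp-millis but the data (file)
  // schema says timestamp-micros, so the stored longs are really millis.
  static boolean isAffected(Schema tableField, Schema fileField) {
    return tableField.getLogicalType() instanceof LogicalTypes.TimestampMillis
        && fileField.getLogicalType() instanceof LogicalTypes.TimestampMicros;
  }

  // Affected columns skip the usual micros-to-millis conversion because the
  // stored value is already in millis despite the micros label.
  static long toMillis(long storedValue, boolean affected) {
    return affected ? storedValue : storedValue / 1000L;
  }
}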

Risk Level

High. Extensive testing was done and functional tests were added.

Documentation Update

#14100

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@linliu-code linliu-code changed the base branch from master to branch-0.x on December 15, 2025 20:34
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from ac2916a to 5ef5773 on December 15, 2025 20:40
@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Dec 15, 2025
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 13 times, most recently from 0c4e026 to 79c4a88 on December 16, 2025 03:22
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 2 times, most recently from 8583da1 to 0c7b7b9 on December 24, 2025 00:18
@linliu-code linliu-code marked this pull request as ready for review on December 24, 2025 00:19
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 6 times, most recently from fcbe23c to 20ada07 on December 30, 2025 09:56
Collaborator

@lokeshj1703 lokeshj1703 left a comment

@linliu-code Thanks for working on this! The PR contains a few changes which are not part of https://github.com/apache/hudi/pull/14161/files. Can we add a description of how the fix works for older Hudi tables? Also, the original PR mentions a limitation.

However, we used the InternalSchema system to perform various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type; when converting back to Avro, it was assumed to be micros.

Is this limitation fixed in older Hudi tables?

Comment on lines +60 to +61
// NOTE: Those are not supported in Avro 1.8.2 (used by Spark 2)
// Only add conversions if they're available
Collaborator

Should we validate the fix and the added tests with Spark 2? I am not sure if CI covers it by default.

Collaborator Author

Right now we only make the conversion for Spark 3.4+.

@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 4 times, most recently from 8c011e0 to ac33414 Compare December 31, 2025 07:39
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 70ca944 to 699e63b Compare January 11, 2026 22:01
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 699e63b to 65f6fdd Compare January 12, 2026 19:34
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 996080f to cc1e856 Compare January 12, 2026 22:13
@yihua yihua self-assigned this Jan 14, 2026
Contributor

@nsivabalan nsivabalan left a comment

I have started reviewing the patch. Will keep sharing my reviews in smaller chunks so that you can start addressing them.

Can you confirm that we are not fixing nested fields for the logical timestamp issue?
Can we test 1.1.1 as well towards this? We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.

<exclusions>
<exclusion>
<groupId>org.eclipse.jetty</groupId>
<artifactId>*</artifactId>
Contributor

why do we need this change?

Collaborator Author

Introduced to resolve some conflicts. Will check whether we can avoid this, or whether it was due to some flakiness.

<groupId>org.pentaho</groupId>
<artifactId>*</artifactId>
</exclusion>
<exclusion>
Contributor

Can you help me understand the necessity of this code change?

Collaborator Author

They are due to some dependency conflicts, most likely because we use Spark 3.5 for Azure CI. I can remove these dependency changes to see which compilation steps or tests fail.

Contributor

@nsivabalan nsivabalan left a comment

Sharing some more feedback.

}
}

public final List<String> getPartitionNames() {
Contributor

Why do we need this? It's not related to the logical timestamp fixes, right?

Collaborator Author

I probably added it because of some test failures. Will remove it to see if any test fails.

@linliu-code
Collaborator Author

I have started reviewing the patch. Will keep sharing my reviews in smaller chunks so that you can start addressing them.

Can you confirm that we are not fixing nested fields for the logical timestamp issue?
The repair logic supports nested fields.

Can we test 1.1.1 as well towards this?
I think so. 1.1.0 should have the fix. Do you mean 1.0.3?

We need to understand what it takes to get it fixed, or at least call out in the documentation what the expected behavior is in 1.1.1, 0.15.1, 0.14.2, etc.
Right; for time-travel queries, it could give wrong results when the as.of.instant is from the old Hudi version.

@linliu-code linliu-code changed the title from "[MINOR] Fix logical type issue for timestamp columns" to "fix: Fix logical type issue for timestamp columns" on Jan 28, 2026
@linliu-code
Collaborator Author

@linliu-code Thanks for working on this! The PR contains a few changes which are not part of https://github.com/apache/hudi/pull/14161/files. Can we add a description of how the fix works for older Hudi tables? Also, the original PR mentions a limitation.
Will do.

However, we used the InternalSchema system to perform various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type; when converting back to Avro, it was assumed to be micros.

Is this limitation fixed in older Hudi tables?

Sure. This limitation has been fixed.

@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from b30dc60 to 833aac2 on January 29, 2026 18:05
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 0b4aa51 to b16342b on January 29, 2026 21:24
requestedSchema = readerSchema;
}
// Set configuration for timestamp_millis type repair.
storage.getConf().set(ENABLE_LOGICAL_TIMESTAMP_REPAIR, Boolean.toString(AvroSchemaUtils.hasTimestampMillisField(readerSchema)));
Contributor

We should try to read the config from the driver when doing this; if it's not set, then we can parse the schema and set it.

Collaborator Author

@nsivabalan, I will focus on the other comments first.

this.forceFullScan = forceFullScan;
this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
this.enableLogicalTimestampFieldRepair = readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema);
Contributor

Can we check if hadoopConf already contains the info and fetch it from there?

Also, can we populate the value in hadoopConf only and use that to pass it around? I wanted to avoid passing individual boolean flags like we are currently doing in this patch.

this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField, InternalSchema.getEmptyInternalSchema());
this(storage, logFile, readerSchema, bufferSize, reverseReader, enableRecordLookups, keyField,
InternalSchema.getEmptyInternalSchema(),
readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema));
Contributor

We should try removing this; let's always look it up in the Hadoop conf/storageConf.

And we should try to set the value in the driver before invoking these classes.
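
A sketch of the suggested pattern, assuming the driver computes the flag once and readers consult the conf before falling back to parsing the schema. ENABLE_LOGICAL_TIMESTAMP_REPAIR and AvroSchemaUtils.hasTimestampMillisField appear in the diff above; the key's literal value and the helper class below are hypothetical:

import java.util.function.BooleanSupplier;
import org.apache.hadoop.conf.Configuration;

final class RepairFlagSketch {
  static final String ENABLE_LOGICAL_TIMESTAMP_REPAIR = "hoodie.repair.logical.timestamp"; // hypothetical key

  // Driver side: compute once (e.g. via AvroSchemaUtils.hasTimestampMillisField)
  // and publish the result in the conf that is shipped to the readers.
  static void setOnDriver(Configuration conf, boolean hasMillisField) {
    conf.setBoolean(ENABLE_LOGICAL_TIMESTAMP_REPAIR, hasMillisField);
  }

  // Reader side: prefer the driver-set value; only fall back to parsing the
  // schema (the expensive path) when the conf does not carry the flag yet.
  static boolean resolve(Configuration conf, BooleanSupplier computeFromSchema) {
    String fromConf = conf.get(ENABLE_LOGICAL_TIMESTAMP_REPAIR);
    if (fromConf != null) {
      return Boolean.parseBoolean(fromConf);
    }
    boolean computed = computeFromSchema.getAsBoolean();
    conf.setBoolean(ENABLE_LOGICAL_TIMESTAMP_REPAIR, computed);
    return computed;
  }
}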

val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)

// Identify timestamp-millis columns from the Avro schema to skip from filter translation
// (even if they're in the index, they may have been indexed before the fix and should not be used for filtering)
Contributor

Can we move this to a separate method?

Contributor

And did we add UTs or functional tests (not end-to-end) directly against the data skipping layer?

Collaborator Author

Will add UTs first, and see how to add FTs.

Collaborator Author

UT added.
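
For reference, a sketch (not the PR's code) of what such a separate helper could look like: collect the dotted paths of timestamp-millis columns from the Avro schema so they can be excluded from filter translation. The recursion into records, unions, arrays, and maps is also what covers nested fields.

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

final class MillisColumnCollector {
  static List<String> timestampMillisColumns(Schema schema) {
    List<String> out = new ArrayList<>();
    collect(schema, "", out);
    return out;
  }

  private static void collect(Schema schema, String path, List<String> out) {
    switch (schema.getType()) {
      case RECORD:
        for (Schema.Field f : schema.getFields()) {
          collect(f.schema(), path.isEmpty() ? f.name() : path + "." + f.name(), out);
        }
        break;
      case UNION: // e.g. nullable fields are unions with null
        for (Schema branch : schema.getTypes()) {
          collect(branch, path, out);
        }
        break;
      case ARRAY:
        collect(schema.getElementType(), path, out);
        break;
      case MAP:
        collect(schema.getValueType(), path, out);
        break;
      case LONG:
        if (schema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
          out.add(path);
        }
        break;
      default:
        break;
    }
  }
}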

@linliu-code
Collaborator Author

#17601 (comment)

We can add some.

@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch 2 times, most recently from 2f1b385 to 5cce8c7 on January 30, 2026 21:00
@linliu-code linliu-code force-pushed the branch-0.x-with-logic_types_fix branch from 5cce8c7 to 7a664d3 on January 30, 2026 22:00
