Skip to content

feat(schema): Migrate SchemaProviders in Hudi-Utilities to use HoodieSchema#14364

Merged
the-other-tim-brown merged 19 commits intoapache:masterfrom
the-other-tim-brown:14276-schema-providers
Dec 4, 2025
Merged

feat(schema): Migrate SchemaProviders in Hudi-Utilities to use HoodieSchema#14364
the-other-tim-brown merged 19 commits intoapache:masterfrom
the-other-tim-brown:14276-schema-providers

Conversation

@the-other-tim-brown
Copy link
Contributor

@the-other-tim-brown the-other-tim-brown commented Nov 26, 2025

Describe the issue this Pull Request addresses

Migrates the SchemaProviders to provide the source and target schemas as HoodieSchema instead of Avro.
Addresses: #14276

Summary and Changelog

  • Update the method signatures for SchemaProvider to output HoodieSchema instead of the Avro schema
  • Update the SchemaPostProcessor to operate on HoodieSchema instead of the Avro schema
  • Added utilities to aide in testing and updated relevant test cases

Impact

Moves the repo towards operating entirely on HoodieSchema which will allow us to expand the types that Hudi supports and other schema related features

Risk Level

Low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Nov 26, 2025
@the-other-tim-brown the-other-tim-brown marked this pull request as ready for review December 1, 2025 15:03
@voonhous voonhous changed the title feat(schema): Migrate SchemProviders in Hudi-Utilities to use HoodieSchema feat(schema): Migrate SchemaProviders in Hudi-Utilities to use HoodieSchema Dec 3, 2025
this.formatAdapter = new SourceFormatAdapter(source, this.errorTableWriter, Option.of(props));

Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceSchema());
Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceHoodieSchema().toAvroSchema());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, schemaProvider#getSourceHoodieSchema() might return null. potential NPE (base class implementation) here?

Suggested change:

() -> Option.ofNullable(schemaProvider.getSourceHoodieSchema()).map(HoodieSchema::toAvroSchema);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating this

MercifulJsonConverter.clearCache(inputBatch.getSchemaProvider().getSourceSchema().getFullName());
AvroConvertor convertor = new AvroConvertor(inputBatch.getSchemaProvider().getSourceSchema(), isFieldNameSanitizingEnabled(), getInvalidCharMask());
MercifulJsonConverter.clearCache(inputBatch.getSchemaProvider().getSourceHoodieSchema().getFullName());
AvroConvertor convertor = new AvroConvertor(inputBatch.getSchemaProvider().getSourceHoodieSchema().toAvroSchema(), isFieldNameSanitizingEnabled(), getInvalidCharMask());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential NPE here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AvroConverter assumes that the schema is non-null today

this.formatAdapter = new SourceFormatAdapter(source, this.errorTableWriter, Option.of(props));

Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceSchema());
Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceHoodieSchema()).map(HoodieSchema::toAvroSchema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess changing schemaSupplier to Supplier<Option> is in subsequent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, do we have a ticket for the StreamSync code or should I make one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I added it. Can you kindly add it. thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#17480 -added this


@Override
public Schema getSourceSchema() {
public HoodieSchema getSourceHoodieSchema() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While, we are the handling external implementation of SchemaProvider correctly, we are not preventing breakages when users extend from these concrete implementations which unfortunately is present. If users extend FilebasedSchemaProvider and override "public Schema getTargetSchema", then their implementation will silently not be used.

I guess it is safe to retain the existing implementation of getSourceSchema and getTargetSchema in each implementation of SchemaProvider with deprecated flag and remove in subsequent release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Should we just delegate for now on the actual implementations? We can have the integration point with the rest of the code be through HoodieSchema while keeping these calls delegate to the older methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes sense 👍

Copy link
Contributor

@bvaradar bvaradar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@the-other-tim-brown
Copy link
Contributor Author

Azure CI is passing but status is not updated
image

@hudi-bot
Copy link
Collaborator

hudi-bot commented Dec 4, 2025

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@the-other-tim-brown the-other-tim-brown merged commit 49938cb into apache:master Dec 4, 2025
195 of 204 checks passed
@the-other-tim-brown the-other-tim-brown deleted the 14276-schema-providers branch December 4, 2025 18:16
the-other-tim-brown added a commit to the-other-tim-brown/hudi that referenced this pull request Dec 5, 2025
…Schema (apache#14364)

* start migrating schema provider classes

* make code compile

* fix test setup

* fix ExampleDataSchemaProvider

* fix integ-test module

* handle null schema in deduce schema step

* fix getField and default value handling, fix test setups

* mainatain existing APIs for compatiblity with user provided code

* fix compile issue

* fix file writer integ test

* throw exception instead of returning null

* fix default handling, add comments

* PR feedback

* Make existing schema providers use deprecated methods to ensure users don't have failures on upgrade to 1.2.o

* mark older method as deprecated and delegate to it for PostProcessor

* fix schemaproviderwithpostprocessor

* fix integ-test

* fix integ test

* make delegating schema provider impl final
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants