feat(schema): Migrate SchemaProviders in Hudi-Utilities to use HoodieSchema #14364
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaProvider.java
  this.formatAdapter = new SourceFormatAdapter(source, this.errorTableWriter, Option.of(props));

- Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceSchema());
+ Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceHoodieSchema().toAvroSchema());
IIUC, schemaProvider#getSourceHoodieSchema() might return null (base class implementation). Potential NPE here?
Suggested change:
() -> Option.ofNullable(schemaProvider.getSourceHoodieSchema()).map(HoodieSchema::toAvroSchema);
Updating this
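The difference between the original line and the suggested change can be sketched in isolation. The example below is a minimal illustration, not Hudi code: it uses java.util.Optional as a stand-in for Hudi's Option, and FakeHoodieSchema, FakeProvider, and NullSafeSupplierSketch are hypothetical placeholders for HoodieSchema, a SchemaProvider whose schema may be null, and the calling code.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for HoodieSchema; toAvroSchema() returns a String here for brevity.
class FakeHoodieSchema {
  private final String name;
  FakeHoodieSchema(String name) { this.name = name; }
  String toAvroSchema() { return "avro:" + name; }
}

// Hypothetical provider whose source schema may be null, as raised in the review comment.
class FakeProvider {
  private final FakeHoodieSchema schema;
  FakeProvider(FakeHoodieSchema schema) { this.schema = schema; }
  FakeHoodieSchema getSourceHoodieSchema() { return schema; }
}

class NullSafeSupplierSketch {
  // Unsafe: dereferences the result before wrapping, so a null schema throws an NPE.
  static Supplier<Optional<String>> unsafe(FakeProvider p) {
    return () -> Optional.ofNullable(p.getSourceHoodieSchema().toAvroSchema());
  }

  // Safe: wraps the possibly-null schema first, then converts only when it is present.
  static Supplier<Optional<String>> safe(FakeProvider p) {
    return () -> Optional.ofNullable(p.getSourceHoodieSchema()).map(FakeHoodieSchema::toAvroSchema);
  }
}
```

With a provider that returns null, the supplier from safe(...) yields an empty Optional, while the supplier from unsafe(...) throws a NullPointerException when invoked.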
- MercifulJsonConverter.clearCache(inputBatch.getSchemaProvider().getSourceSchema().getFullName());
- AvroConvertor convertor = new AvroConvertor(inputBatch.getSchemaProvider().getSourceSchema(), isFieldNameSanitizingEnabled(), getInvalidCharMask());
+ MercifulJsonConverter.clearCache(inputBatch.getSchemaProvider().getSourceHoodieSchema().getFullName());
+ AvroConvertor convertor = new AvroConvertor(inputBatch.getSchemaProvider().getSourceHoodieSchema().toAvroSchema(), isFieldNameSanitizingEnabled(), getInvalidCharMask());
The AvroConvertor assumes that the schema is non-null today.
142ce0e to 0ed42d1
0ed42d1 to e92db40
  this.formatAdapter = new SourceFormatAdapter(source, this.errorTableWriter, Option.of(props));

- Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceSchema());
+ Supplier<Option<Schema>> schemaSupplier = schemaProvider == null ? Option::empty : () -> Option.ofNullable(schemaProvider.getSourceHoodieSchema()).map(HoodieSchema::toAvroSchema);
I guess changing schemaSupplier to Supplier<Option<HoodieSchema>> is in a subsequent PR.
Yes, do we have a ticket for the StreamSync code or should I make one?
I don't think I added it. Could you kindly add it? Thanks.
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaPostProcessor.java
  @Override
- public Schema getSourceSchema() {
+ public HoodieSchema getSourceHoodieSchema() {
While we are handling external implementations of SchemaProvider correctly, we are not preventing breakages when users extend these concrete implementations, which unfortunately does happen. If users extend FilebasedSchemaProvider and override "public Schema getTargetSchema", their implementation will silently not be used.
I guess it is safe to retain the existing implementations of getSourceSchema and getTargetSchema in each implementation of SchemaProvider, mark them as deprecated, and remove them in a subsequent release.
That's a good point. Should we just delegate for now in the actual implementations? We can have the integration point with the rest of the code go through HoodieSchema while having these calls delegate to the older methods.
Yes, that makes sense 👍
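The compatibility concern and the delegation approach agreed on above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code merged in this PR: LegacySchema and NewSchema are hypothetical stand-ins for Avro's Schema and HoodieSchema, and ConcreteProvider and UserProvider are made-up names for a library-provided schema provider and a user subclass of it.

```java
// Hypothetical stand-in for the Avro Schema type.
class LegacySchema {
  final String name;
  LegacySchema(String name) { this.name = name; }
}

// Hypothetical stand-in for HoodieSchema, wrapping the older type.
class NewSchema {
  final LegacySchema wrapped;
  NewSchema(LegacySchema wrapped) { this.wrapped = wrapped; }
  static NewSchema fromLegacy(LegacySchema s) { return s == null ? null : new NewSchema(s); }
}

// Concrete provider shipped by the library (think of FilebasedSchemaProvider).
class ConcreteProvider {
  /** Older API, kept and deprecated so that user overrides continue to take effect. */
  @Deprecated
  public LegacySchema getTargetSchema() { return new LegacySchema("library-default"); }

  /** Newer API used by the rest of the code; it delegates to the older method. */
  public NewSchema getTargetHoodieSchema() { return NewSchema.fromLegacy(getTargetSchema()); }
}

// User code written against the older API: it overrides only getTargetSchema().
class UserProvider extends ConcreteProvider {
  @Override
  public LegacySchema getTargetSchema() { return new LegacySchema("user-override"); }
}
```

Because getTargetHoodieSchema() calls the overridable older method, new UserProvider().getTargetHoodieSchema() reflects the user's override. If ConcreteProvider implemented only the newer method without delegating, the user's getTargetSchema() would be silently bypassed, which is the breakage described in the comment.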
… don't have failures on upgrade to 1.2.0
…Schema (apache#14364)
* start migrating schema provider classes
* make code compile
* fix test setup
* fix ExampleDataSchemaProvider
* fix integ-test module
* handle null schema in deduce schema step
* fix getField and default value handling, fix test setups
* maintain existing APIs for compatibility with user provided code
* fix compile issue
* fix file writer integ test
* throw exception instead of returning null
* fix default handling, add comments
* PR feedback
* Make existing schema providers use deprecated methods to ensure users don't have failures on upgrade to 1.2.0
* mark older method as deprecated and delegate to it for PostProcessor
* fix schemaproviderwithpostprocessor
* fix integ-test
* fix integ test
* make delegating schema provider impl final

Describe the issue this Pull Request addresses
Migrates the SchemaProviders to provide the source and target schemas as HoodieSchema instead of Avro.
Addresses: #14276
Summary and Changelog
- Migrate SchemaProvider to output HoodieSchema instead of the Avro schema
- Migrate SchemaPostProcessor to operate on HoodieSchema instead of the Avro schema
Impact
Moves the repo towards operating entirely on HoodieSchema, which will allow us to expand the types that Hudi supports and to add other schema-related features.
Risk Level
Low
Documentation Update
Contributor's checklist