refactor(flink): Refactor RowData writer factory to use HoodieSchema#18873
Conversation
93fdbc9 to
7a82e9f
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This refactor consolidates the Flink RowData writer factory to use HoodieSchema at the API boundary, removing the duplicate RowType-specific dispatch and aligning with the parent HoodieFileWriterFactory pattern. The reflective constructor signature change is documented as the main risk, and the IO factory dispatch correctly routes HoodieRecordType.FLINK back to the row-data factory through both Hadoop and Flink IO factory paths. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming inconsistency to clean up — the base and subclass use different names for the same HoodieSchema parameter.
cc @yihua
| protected final Configuration hadoopConf; | ||
|
|
||
| public RowDataParquetWriteSupport(RowType rowType, Configuration config) { | ||
| public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) { |
There was a problem hiding this comment.
🤖 nit: could you rename hoodieSchema to just schema? The type HoodieSchema already carries the Hoodie prefix, so the parameter name ends up redundant — and the subclass HoodieRowDataParquetWriteSupport uses schema for the same argument, so aligning them would be more consistent.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
7a82e9f to
beb3eb6
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR cleanly removes the duplicate RowType-specific dispatch in HoodieRowDataFileWriterFactory and routes HoodieRowDataCreateHandle through the parent static HoodieFileWriterFactory.getFileWriter, with HoodieSchema as the boundary. The IO factory dispatch (via both HoodieFlinkIOFactory and HoodieHadoopIOFactory's reflective FLINK case) correctly resolves back to HoodieRowDataFileWriterFactory, and .getNonNullType() properly handles the case where convertToSchema would wrap a nullable top-level rowType in a UNION. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and clarity nits below.
cc @yihua
| protected final Configuration hadoopConf; | ||
|
|
||
| public RowDataParquetWriteSupport(RowType rowType, Configuration config) { | ||
| public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) { |
There was a problem hiding this comment.
🤖 nit: the parameter name hoodieSchema is a bit redundant given its type is already HoodieSchema — could you rename it to just schema to match the naming used in HoodieRowDataParquetWriteSupport's constructor?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| storagePath, | ||
| hoodieTable.getStorage(), | ||
| config, | ||
| HoodieSchemaConverter.convertToSchema(rowType).getNonNullType(), |
There was a problem hiding this comment.
🤖 nit: the inline .getNonNullType() chain makes it a bit non-obvious why the non-null variant is needed here — could you extract this to a local variable (e.g. HoodieSchema schema = HoodieSchemaConverter.convertToSchema(rowType).getNonNullType();) or add a short comment explaining the intent?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18873 +/- ##
============================================
- Coverage 68.80% 68.79% -0.01%
+ Complexity 29151 29134 -17
============================================
Files 2515 2515
Lines 139984 139973 -11
Branches 17196 17196
============================================
- Hits 96310 96291 -19
- Misses 35899 35904 +5
- Partials 7775 7778 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
The Flink RowData writer factory had a RowType-specific writer creation path that duplicated the parent
HoodieFileWriterFactoryschema-based dispatch. This madeHoodieRowDataFileWriterFactorydiffer from the common writer factory mainly by passingRowTypethrough several methods, even though the common API already usesHoodieSchema.This PR refactors the Flink RowData writer path to use
HoodieSchemaat the factory and write-support boundaries, derivingRowTypeonly inside implementations that still need it for RowData parquet or Lance writing.Summary and Changelog
HoodieRowDataFileWriterFactory#getFileWriterand private RowType format dispatcher.HoodieSchema-basednewParquetFileWriteroverrides.RowDataParquetWriteSupportandHoodieRowDataParquetWriteSupportconstructors to acceptHoodieSchema, derivingRowTypeinternally.(Configuration, HoodieSchema, BloomFilter).HoodieRowDataCreateHandleandTestHoodieRowDataParquetConfigInjectorto use schema-based writer creation.Impact
Risk Level
low. The main risk is constructor signature change for custom classes configured through
hoodie.parquet.flink.rowdata.write.support.class. Verification included targeted Flink client tests and compile validation with the Flink 2.1 profile.Documentation Update
none
Contributor's checklist