refactor(flink): Refactor RowData writer factory to use HoodieSchema by cshuo · Pull Request #18873 · apache/hudi

cshuo · 2026-05-28T08:58:07Z

Describe the issue this Pull Request addresses

The Flink RowData writer factory had a RowType-specific writer creation path that duplicated the parent HoodieFileWriterFactory schema-based dispatch. This made HoodieRowDataFileWriterFactory differ from the common writer factory mainly by passing RowType through several methods, even though the common API already uses HoodieSchema.

This PR refactors the Flink RowData writer path to use HoodieSchema at the factory and write-support boundaries, deriving RowType only inside implementations that still need it for RowData parquet or Lance writing.

Summary and Changelog

Removed the RowType-specific HoodieRowDataFileWriterFactory#getFileWriter and private RowType format dispatcher.
Updated RowData parquet writer creation to use the parent HoodieSchema-based newParquetFileWriter overrides.
Changed RowDataParquetWriteSupport and HoodieRowDataParquetWriteSupport constructors to accept HoodieSchema, deriving RowType internally.
Updated reflective loading of Flink RowData parquet write support to use (Configuration, HoodieSchema, BloomFilter).
Updated HoodieRowDataCreateHandle and TestHoodieRowDataParquetConfigInjector to use schema-based writer creation.

Impact

Functional impact: No intended behavior change for RowData parquet or Lance writes.
Maintainability: Reduces duplicate RowType-specific factory dispatch and aligns Flink RowData writer creation with the common writer factory API.
Extensibility: Keeps schema as the factory/write-support boundary, making future writer paths less dependent on Flink-specific types.

Risk Level

low. The main risk is constructor signature change for custom classes configured through hoodie.parquet.flink.rowdata.write.support.class. Verification included targeted Flink client tests and compile validation with the Flink 2.1 profile.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This refactor consolidates the Flink RowData writer factory to use HoodieSchema at the API boundary, removing the duplicate RowType-specific dispatch and aligning with the parent HoodieFileWriterFactory pattern. The reflective constructor signature change is documented as the main risk, and the IO factory dispatch correctly routes HoodieRecordType.FLINK back to the row-data factory through both Hadoop and Flink IO factory paths. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming inconsistency to clean up — the base and subclass use different names for the same HoodieSchema parameter.

cc @yihua

hudi-agent · 2026-05-28T09:14:27Z

  protected final Configuration hadoopConf;

-  public RowDataParquetWriteSupport(RowType rowType, Configuration config) {
+  public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) {


🤖 nit: could you rename hoodieSchema to just schema? The type HoodieSchema already carries the Hoodie prefix, so the parameter name ends up redundant — and the subclass HoodieRowDataParquetWriteSupport uses schema for the same argument, so aligning them would be more consistent.

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR cleanly removes the duplicate RowType-specific dispatch in HoodieRowDataFileWriterFactory and routes HoodieRowDataCreateHandle through the parent static HoodieFileWriterFactory.getFileWriter, with HoodieSchema as the boundary. The IO factory dispatch (via both HoodieFlinkIOFactory and HoodieHadoopIOFactory's reflective FLINK case) correctly resolves back to HoodieRowDataFileWriterFactory, and .getNonNullType() properly handles the case where convertToSchema would wrap a nullable top-level rowType in a UNION. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and clarity nits below.

cc @yihua

hudi-agent · 2026-05-28T11:31:55Z

  protected final Configuration hadoopConf;

-  public RowDataParquetWriteSupport(RowType rowType, Configuration config) {
+  public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) {


🤖 nit: the parameter name hoodieSchema is a bit redundant given its type is already HoodieSchema — could you rename it to just schema to match the naming used in HoodieRowDataParquetWriteSupport's constructor?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-agent · 2026-05-28T11:31:55Z

+        storagePath,
+        hoodieTable.getStorage(),
+        config,
+        HoodieSchemaConverter.convertToSchema(rowType).getNonNullType(),


🤖 nit: the inline .getNonNullType() chain makes it a bit non-obvious why the non-null variant is needed here — could you extract this to a local variable (e.g. HoodieSchema schema = HoodieSchemaConverter.convertToSchema(rowType).getNonNullType();) or add a short comment explaining the intent?

_{- AI-generated; verify before applying. React 👍/👎 to flag quality.}

hudi-bot · 2026-05-28T12:37:16Z

CI report:

beb3eb6 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

codecov-commenter · 2026-05-28T12:37:19Z

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.79%. Comparing base (853cbef) to head (beb3eb6).

Files with missing lines	Patch %	Lines
...io/storage/row/HoodieRowDataFileWriterFactory.java	50.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18873      +/-   ##
============================================
- Coverage     68.80%   68.79%   -0.01%     
+ Complexity    29151    29134      -17     
============================================
  Files          2515     2515              
  Lines        139984   139973      -11     
  Branches      17196    17196              
============================================
- Hits          96310    96291      -19     
- Misses        35899    35904       +5     
- Partials       7775     7778       +3

Flag	Coverage Δ
common-and-other-modules	`44.32% <87.50%> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.90% <ø> (+0.05%)`	⬆️
spark-client-hadoop-common	`48.22% <ø> (ø)`
spark-java-tests	`49.35% <ø> (-0.02%)`	⬇️
spark-scala-tests	`45.26% <ø> (+<0.01%)`	⬆️
utilities	`37.43% <ø> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...hudi/io/storage/row/HoodieRowDataCreateHandle.java	`86.99% <100.00%> (+0.21%)`	⬆️
.../storage/row/HoodieRowDataParquetWriteSupport.java	`100.00% <100.00%> (ø)`
...udi/io/storage/row/RowDataParquetWriteSupport.java	`88.23% <100.00%> (ø)`
...io/storage/row/HoodieRowDataFileWriterFactory.java	`70.58% <50.00%> (+6.75%)`	⬆️

... and 13 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

cshuo force-pushed the refactor_writer_factory branch from 93fdbc9 to 7a82e9f Compare May 28, 2026 09:07

hudi-agent reviewed May 28, 2026

View reviewed changes

refactor(flink): Refactor RowData writer factory to use HoodieSchema

beb3eb6

cshuo force-pushed the refactor_writer_factory branch from 7a82e9f to beb3eb6 Compare May 28, 2026 10:13

hudi-agent reviewed May 28, 2026

View reviewed changes

danny0405 approved these changes May 28, 2026

View reviewed changes

danny0405 merged commit 588998f into apache:master May 28, 2026
58 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(flink): Refactor RowData writer factory to use HoodieSchema#18873

refactor(flink): Refactor RowData writer factory to use HoodieSchema#18873
danny0405 merged 1 commit into
apache:masterfrom
cshuo:refactor_writer_factory

cshuo commented May 28, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent May 28, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

hudi-agent May 28, 2026

Uh oh!

hudi-agent May 28, 2026

Uh oh!

hudi-bot commented May 28, 2026

Uh oh!

codecov-commenter commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

cshuo commented May 28, 2026

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent May 28, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

hudi-agent May 28, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-agent May 28, 2026

Choose a reason for hiding this comment

Uh oh!

hudi-bot commented May 28, 2026

CI report:

Uh oh!

codecov-commenter commented May 28, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants