Skip to content

refactor(flink): Refactor RowData writer factory to use HoodieSchema#18873

Merged
danny0405 merged 1 commit into
apache:masterfrom
cshuo:refactor_writer_factory
May 28, 2026
Merged

refactor(flink): Refactor RowData writer factory to use HoodieSchema#18873
danny0405 merged 1 commit into
apache:masterfrom
cshuo:refactor_writer_factory

Conversation

@cshuo
Copy link
Copy Markdown
Collaborator

@cshuo cshuo commented May 28, 2026

Describe the issue this Pull Request addresses

The Flink RowData writer factory had a RowType-specific writer creation path that duplicated the parent HoodieFileWriterFactory schema-based dispatch. This made HoodieRowDataFileWriterFactory differ from the common writer factory mainly by passing RowType through several methods, even though the common API already uses HoodieSchema.

This PR refactors the Flink RowData writer path to use HoodieSchema at the factory and write-support boundaries, deriving RowType only inside implementations that still need it for RowData parquet or Lance writing.

Summary and Changelog

  • Removed the RowType-specific HoodieRowDataFileWriterFactory#getFileWriter and private RowType format dispatcher.
  • Updated RowData parquet writer creation to use the parent HoodieSchema-based newParquetFileWriter overrides.
  • Changed RowDataParquetWriteSupport and HoodieRowDataParquetWriteSupport constructors to accept HoodieSchema, deriving RowType internally.
  • Updated reflective loading of Flink RowData parquet write support to use (Configuration, HoodieSchema, BloomFilter).
  • Updated HoodieRowDataCreateHandle and TestHoodieRowDataParquetConfigInjector to use schema-based writer creation.

Impact

  • Functional impact: No intended behavior change for RowData parquet or Lance writes.
  • Maintainability: Reduces duplicate RowType-specific factory dispatch and aligns Flink RowData writer creation with the common writer factory API.
  • Extensibility: Keeps schema as the factory/write-support boundary, making future writer paths less dependent on Flink-specific types.

Risk Level

low. The main risk is constructor signature change for custom classes configured through hoodie.parquet.flink.rowdata.write.support.class. Verification included targeted Flink client tests and compile validation with the Flink 2.1 profile.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@cshuo cshuo force-pushed the refactor_writer_factory branch from 93fdbc9 to 7a82e9f Compare May 28, 2026 09:07
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This refactor consolidates the Flink RowData writer factory to use HoodieSchema at the API boundary, removing the duplicate RowType-specific dispatch and aligning with the parent HoodieFileWriterFactory pattern. The reflective constructor signature change is documented as the main risk, and the IO factory dispatch correctly routes HoodieRecordType.FLINK back to the row-data factory through both Hadoop and Flink IO factory paths. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming inconsistency to clean up — the base and subclass use different names for the same HoodieSchema parameter.

cc @yihua

protected final Configuration hadoopConf;

public RowDataParquetWriteSupport(RowType rowType, Configuration config) {
public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you rename hoodieSchema to just schema? The type HoodieSchema already carries the Hoodie prefix, so the parameter name ends up redundant — and the subclass HoodieRowDataParquetWriteSupport uses schema for the same argument, so aligning them would be more consistent.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@cshuo cshuo force-pushed the refactor_writer_factory branch from 7a82e9f to beb3eb6 Compare May 28, 2026 10:13
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR cleanly removes the duplicate RowType-specific dispatch in HoodieRowDataFileWriterFactory and routes HoodieRowDataCreateHandle through the parent static HoodieFileWriterFactory.getFileWriter, with HoodieSchema as the boundary. The IO factory dispatch (via both HoodieFlinkIOFactory and HoodieHadoopIOFactory's reflective FLINK case) correctly resolves back to HoodieRowDataFileWriterFactory, and .getNonNullType() properly handles the case where convertToSchema would wrap a nullable top-level rowType in a UNION. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and clarity nits below.

cc @yihua

protected final Configuration hadoopConf;

public RowDataParquetWriteSupport(RowType rowType, Configuration config) {
public RowDataParquetWriteSupport(HoodieSchema hoodieSchema, Configuration config) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the parameter name hoodieSchema is a bit redundant given its type is already HoodieSchema — could you rename it to just schema to match the naming used in HoodieRowDataParquetWriteSupport's constructor?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

storagePath,
hoodieTable.getStorage(),
config,
HoodieSchemaConverter.convertToSchema(rowType).getNonNullType(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the inline .getNonNullType() chain makes it a bit non-obvious why the non-null variant is needed here — could you extract this to a local variable (e.g. HoodieSchema schema = HoodieSchemaConverter.convertToSchema(rowType).getNonNullType();) or add a short comment explaining the intent?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.79%. Comparing base (853cbef) to head (beb3eb6).

Files with missing lines Patch % Lines
...io/storage/row/HoodieRowDataFileWriterFactory.java 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18873      +/-   ##
============================================
- Coverage     68.80%   68.79%   -0.01%     
+ Complexity    29151    29134      -17     
============================================
  Files          2515     2515              
  Lines        139984   139973      -11     
  Branches      17196    17196              
============================================
- Hits          96310    96291      -19     
- Misses        35899    35904       +5     
- Partials       7775     7778       +3     
Flag Coverage Δ
common-and-other-modules 44.32% <87.50%> (+<0.01%) ⬆️
hadoop-mr-java-client 44.90% <ø> (+0.05%) ⬆️
spark-client-hadoop-common 48.22% <ø> (ø)
spark-java-tests 49.35% <ø> (-0.02%) ⬇️
spark-scala-tests 45.26% <ø> (+<0.01%) ⬆️
utilities 37.43% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...hudi/io/storage/row/HoodieRowDataCreateHandle.java 86.99% <100.00%> (+0.21%) ⬆️
.../storage/row/HoodieRowDataParquetWriteSupport.java 100.00% <100.00%> (ø)
...udi/io/storage/row/RowDataParquetWriteSupport.java 88.23% <100.00%> (ø)
...io/storage/row/HoodieRowDataFileWriterFactory.java 70.58% <50.00%> (+6.75%) ⬆️

... and 13 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@danny0405 danny0405 merged commit 588998f into apache:master May 28, 2026
58 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants