feat(flink): write/read (only unshredded) variant to Flink parquet file writers/readers using Flink's Variant type #18539

kbuci wants to merge 5 commits into
Conversation
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! This PR threads HoodieSchema through the Flink Parquet write path so unshredded Variant columns can be detected and written with the canonical Parquet layout, and unblocks getVariant on the read side via BinaryVariant reflection. A few correctness concerns worth double-checking in the inline comments — notably the broad catch (Exception e) around the 4-arg write-support reflection (which can silently fall back and lose Variant schema info) and null handling in the new getVariant implementations. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small readability nits below.
```java
writeSupport = (HoodieRowDataParquetWriteSupport) ReflectionUtils.loadClass(
    writeSupportClass,
    new Class<?>[] {Configuration.class, RowType.class, BloomFilter.class},
    conf, rowType, filter);
```
🤖 This catch (Exception e) is quite broad — it falls back to the 3-arg constructor not just when the 4-arg signature is missing, but also if the 4-arg constructor itself throws (e.g. a runtime issue inside RowDataParquetWriteSupport/ParquetSchemaConverter while building the schema for a Variant column). In that case we'd silently downgrade to a 3-arg writer that ignores HoodieSchema, so a Variant column would be emitted as a plain ROW rather than the canonical Variant Parquet group — silent schema corruption. Could you narrow this to NoSuchMethodException (or check getConstructor upfront) so genuine instantiation errors propagate? Same pattern at line 76-90 in newParquetFileWriter(... OutputStream ...).
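For concreteness, a minimal sketch of the narrowing suggested above, assuming the constructor shapes shown in the snippet; the helper name and the `hoodieSchemaOpt` parameter are illustrative, not from the PR:

```java
import java.lang.reflect.Constructor;

// Probe the 4-arg constructor explicitly so that only a genuinely missing
// signature triggers the fallback; real instantiation failures propagate
// instead of silently downgrading to the schema-less writer.
private static HoodieRowDataParquetWriteSupport createWriteSupport(
    String writeSupportClass, Configuration conf, RowType rowType,
    BloomFilter filter, Option<HoodieSchema> hoodieSchemaOpt) {
  try {
    Class<?> clazz = Class.forName(writeSupportClass);
    Constructor<?> fourArg = clazz.getConstructor(
        Configuration.class, RowType.class, BloomFilter.class, Option.class);
    return (HoodieRowDataParquetWriteSupport)
        fourArg.newInstance(conf, rowType, filter, hoodieSchemaOpt);
  } catch (NoSuchMethodException e) {
    // Only this case should downgrade to the legacy 3-arg constructor.
    return (HoodieRowDataParquetWriteSupport) ReflectionUtils.loadClass(
        writeSupportClass,
        new Class<?>[] {Configuration.class, RowType.class, BloomFilter.class},
        conf, rowType, filter);
  } catch (ReflectiveOperationException e) {
    throw new HoodieException("Failed to instantiate " + writeSupportClass, e);
  }
}
```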
```diff
-public Variant getVariant(int i) {
-  throw new UnsupportedOperationException("Variant is not supported yet.");
+public Variant getVariant(int ordinal) {
+  RowData variantRow = row.getRow(rebaseOrdinal(ordinal), 2);
```
🤖 Could a null Variant cell reach this path? row.getRow(rebaseOrdinal(ordinal), 2) will return null when the underlying field is null, and HoodieVariant.fromRowData(null) would then NPE on variantRow.isNullAt(0). Most other accessors here (getString, getDecimal, etc.) guard with isNullAt at the call site, but it might be safer to either return null up front when row.isNullAt(rebaseOrdinal(ordinal)) is true, or null-check inside fromRowData. Same concern in BootstrapRowData.getVariant.
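A minimal sketch of the suggested guard, using the accessor shape from the diff above:

```java
@Override
public Variant getVariant(int ordinal) {
  // Mirror the isNullAt guards the other accessors rely on: a null Variant
  // cell returns null instead of reaching HoodieVariant.fromRowData.
  if (row.isNullAt(rebaseOrdinal(ordinal))) {
    return null;
  }
  RowData variantRow = row.getRow(rebaseOrdinal(ordinal), 2);
  return HoodieVariant.fromRowData(variantRow);
}
```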
```diff
-public Variant getVariant(int i) {
-  throw new UnsupportedOperationException("Variant is not supported yet.");
+public Variant getVariant(int pos) {
+  throw new UnsupportedOperationException(
```
🤖 Curious about the read path here — the PR description says read support is added, but ColumnarGroupRowData (vectorized read) and ColumnarGroupArrayData still throw on getVariant. Is the columnar/vectorized reader path expected to be exercised for Variant columns, and if so, will callers fall back to getRow(pos, 2) correctly? It might be worth either implementing this consistently or documenting which read path supports Variant in this PR vs. follow-ups.
```java
 * <p>Variant is stored as a struct with two binary fields: metadata and value.
 * Field order follows the Parquet spec and Iceberg convention (metadata first, value second).</p>
 */
public static HoodieVariant fromRowData(org.apache.flink.table.data.RowData variantRow) {
```
🤖 nit: could you add a proper import org.apache.flink.table.data.RowData; and use the simple name here? A fully-qualified class name in a method signature is unusual enough that readers will wonder if there's a deliberate module-boundary reason — if there is, a brief inline comment would help; if not, the import keeps it consistent with the rest of the file.
```java
 * Resolves the HoodieSchema for a field at the given index, unwrapping nullable unions.
 * Returns null if the field list is null or the index is out of bounds.
 */
private static HoodieSchema resolveFieldSchema(List<HoodieSchemaField> fields, int index, String fieldName) {
```
🤖 nit: fieldName is accepted here but never read — the resolution is purely positional via index. Could you either drop the parameter or actually use it (e.g. as a cross-check or name-based fallback)? As written it gives callers the impression that the name is being validated, which could mask field-ordering bugs silently.
```java
@Override
public Object convert(HoodieSchema schema, Object object) {
  if (schema.getType() == HoodieSchemaType.VARIANT) {
```
why not add a new field converter for variant data type?
```diff
   this.rowType = rowType;
   this.hadoopConf = new Configuration(config);
-  this.schema = ParquetSchemaConverter.convertToParquetMessageType("flink_schema", rowType);
+  this.schema = ParquetSchemaConverter.convertToParquetMessageType(
```
Is it because the Flink RowType cannot represent the variant type with integrity?
Yeah, I couldn't figure out a way to "annotate"/"tag" a Flink type to indicate that it should be a variant, though I considered RawType/DistinctType.
To be transparent, I'm still new to Flink, so if there's a different/out-of-the-box approach we can use to elegantly have the Flink type "indicate" that it's a variant (without having to pass the Hoodie Avro schema), then I would much rather do that.
Actually, based on other comments (#18539 (comment)), let me see if we can avoid relying on RowType and Hoodie schema passing (by limiting support to Flink 2.1).
Thanks for adding this. I had one question about the integration with Flink's native Variant. From the current implementation, it looks like Hudi … I'm wondering how this is expected to work with Flink 2.1+ native Variant support. My understanding is that we probably need to keep these two layers separate: …

And we can just throw an unsupported exception for Flink versions before 2.1. Could you clarify whether the current … ?
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR adds unshredded Variant write/read support for Flink by threading HoodieSchema through the Parquet write path and intercepting Variant types in the Avro converters, with shredded Variant explicitly blocked. One edge case worth double-checking around the isVariantGroup heuristic on the read path. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of duplication nits below — same write-support reflection logic appears in two methods, and BinaryVariant reflection is reimplemented twice across the variant codepaths.
```java
if (inner.getType() != HoodieSchemaType.NULL) {
  return inner.getType();
  }
}
```
🤖 Could this misidentify a regular user struct as a Variant? Any group with two BINARY fields literally named metadata and value will match here, and the read-side then forces both fields to notNull() and drops any other field metadata. Until the LogicalTypeAnnotation.variantType() annotation is emitted (the TODO above), would it be worth at least also requiring both fields to be REQUIRED (matching what convertVariantToParquetType writes) and the group to have exactly 2 fields? Otherwise an existing parquet file with optional binary metadata + optional binary value could be silently re-interpreted as a Variant ROW with non-null fields, which would mismatch the actual nullability of the data.
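One way to tighten the structural check, sketched against parquet-java's schema API; the helper names are illustrative:

```java
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

final class VariantStructure {
  // Require the exact canonical unshredded shape: exactly two fields,
  // both REQUIRED BINARY, named "metadata" and "value".
  static boolean isVariantByStructure(GroupType group) {
    return group.getFieldCount() == 2
        && isRequiredBinary(group, "metadata")
        && isRequiredBinary(group, "value");
  }

  private static boolean isRequiredBinary(GroupType group, String name) {
    if (!group.containsField(name)) {
      return false;
    }
    Type field = group.getType(name);
    return field.isPrimitive()
        && field.getRepetition() == Type.Repetition.REQUIRED
        && field.asPrimitiveType().getPrimitiveTypeName()
            == PrimitiveType.PrimitiveTypeName.BINARY;
  }
}
```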
```java
String writeSupportClass = config.getStringOrDefault(HoodieStorageConfig.HOODIE_PARQUET_FLINK_ROW_DATA_WRITE_SUPPORT_CLASS);
HoodieRowDataParquetWriteSupport writeSupport;
if (ReflectionUtils.hasConstructor(writeSupportClass,
    new Class<?>[] {Configuration.class, RowType.class, BloomFilter.class, Option.class})) {
```
🤖 nit: this 4-arg-with-fallback-to-3-arg writeSupport construction block is duplicated almost verbatim in the other newParquetFileWriter overload below (~line 156). Could you extract a small private helper like loadWriteSupport(conf, rowType, filter, hoodieSchema, configClass) so both call sites share one implementation? Easier to maintain when the constructor signature evolves.
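For concreteness, a sketch of the shared helper, reusing the `hasConstructor` probe from the snippet above; the helper name and the `Option<HoodieSchema>` parameter are illustrative:

```java
// Single construction path for both newParquetFileWriter overloads; when the
// 4-arg (schema-aware) constructor is absent, fall back to the legacy 3-arg one.
private static HoodieRowDataParquetWriteSupport loadWriteSupport(
    String writeSupportClass, Configuration conf, RowType rowType,
    BloomFilter filter, Option<HoodieSchema> hoodieSchema) {
  Class<?>[] fourArg = {Configuration.class, RowType.class, BloomFilter.class, Option.class};
  if (ReflectionUtils.hasConstructor(writeSupportClass, fourArg)) {
    return (HoodieRowDataParquetWriteSupport) ReflectionUtils.loadClass(
        writeSupportClass, fourArg, conf, rowType, filter, hoodieSchema);
  }
  return (HoodieRowDataParquetWriteSupport) ReflectionUtils.loadClass(
      writeSupportClass,
      new Class<?>[] {Configuration.class, RowType.class, BloomFilter.class},
      conf, rowType, filter);
}
```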
Oh, good point. Currently in this PR it is also user-facing, but I think you're right and we shouldn't do it this way. When I made this PR I was trying to keep the existing Flink logic of treating variant as Flink …
@voonhous What are your thoughts on this #18539 (comment) (not allowing Hudi readers/writers with Flink versions older than 2.1 to read/write Variant)? I personally agree and think that, instead of trying to fall back to using …
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR adds Flink read/write support for unshredded Variant via reflection-based detection of Flink 2.1+'s native VariantType. One nullability handling issue worth double-checking in the inline comments — please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small naming and simplification suggestions below.
| "VARIANT type is only supported in Flink 2.1+. " | ||
| + "VariantType class not found on the classpath."); | ||
| } | ||
| dataType = variantDataType.notNull(); |
🤖 I think this .notNull() makes the variant case inconsistent with the other group branches (ARRAY/MAP/ROW above) which produce a nullable dataType and rely on the trailing if (REQUIRED) dataType.notNull() at line 192. Since that trailing check only adds notNull and never makes a type nullable, an OPTIONAL variant group ends up as a non-nullable VariantType in the resulting RowField. Could you drop the .notNull() here so OPTIONAL → nullable / REQUIRED → notNull works the same way as ROW?
```java
// to match the BinaryVariant(value, metadata) constructor.
GroupType variantGroup = physicalType.asGroupType();
int valueIdx = variantGroup.getFieldIndex(HoodieSchema.Variant.VARIANT_VALUE_FIELD);
int metadataIdx = variantGroup.getFieldIndex(HoodieSchema.Variant.VARIANT_METADATA_FIELD);
```
🤖 nit: variantDescs only holds descriptors for the value field specifically — could you rename it to valueDescs (and variantDescs.get(0) becomes valueDescs.get(0) on the next line) to match what it actually contains? As-is it reads like it covers the full variant group, which could trip up a future reader.
```java
 * fields named {@code metadata} and {@code value}.
 */
public static boolean isVariantGroup(GroupType groupType, LogicalTypeAnnotation logicalType) {
  if (hasVariantAnnotation(logicalType)) {
```
🤖 nit: the if/return true / return false pattern here could be collapsed to return hasVariantAnnotation(logicalType) || isVariantByStructure(groupType); — saves a few lines and reads more naturally for a boolean predicate method.
```java
 * Looks up the HoodieSchema for a named field within a record schema.
 * Returns null if the record schema is null or does not contain the field.
 */
static HoodieSchema resolveFieldSchema(HoodieSchema recordSchema, String fieldName) {
```
Ideally I didn't want to make this expansive a change in this PR, but I wanted to make sure that Flink would correctly infer a Parquet variant group written by Spark 4.0 (when Parquet variant annotations aren't available), without always assuming that a Parquet group with metadata/value fields is a variant.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! This adds Flink read/write support for unshredded Variant data with reflection-based handling for the Flink 1.20 vs 2.1+ split, schema-driven variant detection for Spark 4.0 compatibility, and clear error messages for unsupported configurations (shredded, nested vectorized, pre-2.1 Flink). After tracing through the variant detection paths (annotation + schema-driven), the vectorized read child ordering, the null-handling wrapper around the AvroToRowData/RowDataToAvro converters, and the TableSchemaResolver propagation in ClusteringOperator and FlinkRowDataReaderContext, no new issues were flagged from this automated pass beyond what prior rounds have already raised — a Hudi committer or PMC member can take it from here for a final review. A few readability suggestions around an unused parameter and reflection caching consistency.
cc @yihua
```java
    GroupType groupType, LogicalTypeAnnotation logicalType, HoodieSchemaType schemaHint) {
  if (hasVariantAnnotation(logicalType)) {
    return true;
  }
```
🤖 nit: groupType is unused in isVariantGroup — only logicalType and schemaHint are read. Could you either drop the parameter or wire it into a structural check, since right now its presence implies it's part of the detection logic?
```java
 * or {@code null} if the class is not on the classpath (pre-2.1 Flink).
 * The reflection result is cached so the class lookup happens at most once per JVM.
 * Only called during schema conversion (cold path), never per row.
 */
```
🤖 nit: could you use the lazy-holder idiom here (like GetVariantHolder in AbstractHoodieRowData)? That avoids the synchronized on every call and keeps the two reflection-cache patterns in this PR consistent.
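For reference, a sketch of the lazy-holder idiom; the holder name is illustrative and the `VariantType` FQCN is assumed from Flink 2.1 (FLIP-521):

```java
// The JVM initializes VariantTypeHolder at most once, on first access,
// so later reads need no synchronization at all.
final class VariantTypeHolder {
  // Resolved once when the holder class is first referenced; null on pre-2.1 Flink.
  static final Class<?> VARIANT_TYPE_CLASS = loadVariantTypeClass();

  private static Class<?> loadVariantTypeClass() {
    try {
      return Class.forName("org.apache.flink.table.types.logical.VariantType");
    } catch (ClassNotFoundException e) {
      return null;
    }
  }

  private VariantTypeHolder() {}
}
```

Call sites then read `VariantTypeHolder.VARIANT_TYPE_CLASS` directly, with no per-call locking.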
```java
if (recordSchema == null || recordSchema.getType() != HoodieSchemaType.RECORD) {
  return null;
}
org.apache.hudi.common.util.Option<HoodieSchemaField> field = recordSchema.getField(fieldName);
```
🤖 nit: could you import org.apache.hudi.common.util.Option instead of fully-qualifying it inline? It would match the import style used elsewhere in this file.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds native Flink 2.1+ Variant support to the parquet write/read paths via reflection, while keeping the shared module compiled against Flink 1.20. The schema-driven variant detection (annotation OR HoodieSchema hint, with no structural fallback) is a nice safe design that avoids misidentifying regular ROW<metadata,value> columns. One small thread-safety question worth a look in the inline comment. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of consistency/dedup suggestions around the reflection-based VARIANT handling, otherwise the code reads cleanly.
```java
if (metadataMethod == null) {
  metadataMethod = object.getClass().getMethod("getMetadata");
  valueMethod = object.getClass().getMethod("getValue");
}
```
🤖 The lazy init of metadataMethod / valueMethod here isn't thread-safe — if two threads race in convert(), one could see metadataMethod != null after the first assignment but before valueMethod is set, and then NPE on valueMethod.invoke(...). In normal Flink (single-threaded operator task) this can't fire, but since the sibling AvroToRowDataConverters.createVariantConverter() already resolves its Constructor upfront against a known class, would it make sense to do the same here against BinaryVariant.class? That would also remove the per-call getMethod lookup on the first invocation.
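A sketch of eager resolution against a known class, as the comment suggests; the `BinaryVariant` FQCN is assumed from Flink 2.1 and should be verified:

```java
import java.lang.reflect.Method;

final class VariantAccessors {
  static final Method GET_METADATA;
  static final Method GET_VALUE;

  static {
    try {
      // Resolve once at class-load time, so convert() never races on lazy init
      // and pays no per-call getMethod lookup.
      Class<?> binaryVariant = Class.forName("org.apache.flink.types.variant.BinaryVariant");
      GET_METADATA = binaryVariant.getMethod("getMetadata");
      GET_VALUE = binaryVariant.getMethod("getValue");
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  private VariantAccessors() {}
}
```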
```java
case RAW:
default:
  if ("VARIANT".equals(logicalType.getTypeRoot().name())) {
```
🤖 nit: the "VARIANT".equals(logicalType.getTypeRoot().name()) check is duplicated across HoodieSchemaConverter, ParquetSchemaConverter, AvroToRowDataConverters, and RowDataToAvroConverters. Could you extract a small helper (e.g. HoodieSchemaConverter.isVariantLogicalType(LogicalType)) so the intent is obvious in one place and future tweaks (e.g. once a hard dependency on Flink 2.1+ lands) only touch one site?
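A sketch of the shared predicate (placement and name are suggestions):

```java
import org.apache.flink.table.types.logical.LogicalType;

// Name-based check keeps the shared module compilable against Flink 1.20,
// where LogicalTypeRoot.VARIANT does not exist as an enum constant.
public static boolean isVariantLogicalType(LogicalType logicalType) {
  return "VARIANT".equals(logicalType.getTypeRoot().name());
}
```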
```java
 * already rejects shredded variants before a Flink type or converter is ever constructed,
 * and Flink 2.1 itself only supports unshredded variants (FLIP-521).
 */
private static RowDataToAvroConverter createVariantConverter() {
```
🤖 nit: for consistency with AvroToRowDataConverters.createVariantConverter() (which resolves the BinaryVariant constructor once at converter-creation time), could we resolve getMetadata/getValue Methods eagerly here too? The current per-instance lazy resolution adds a null check on every call and diverges from the sibling helper without an obvious reason.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! This PR adds Flink read/write support for unshredded Variant data, with native VARIANT type on Flink 2.1+ via reflection (and clear UnsupportedOperationException on pre-2.1). The variant detection (annotation + schema-hint) and reflection patterns look reasonable to me, and the withTableSchema() plumbing has been applied at all current HoodieRowDataParquetReader call sites. Prior review rounds have already flagged the key concerns I identified. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor consistency nit below; otherwise the code is clean and well-documented.
cc @yihua
```java
byte[] value = (byte[]) valueMethod.invoke(object);

final GenericRecord record = new GenericData.Record(schema.toAvroSchema());
record.put("metadata", ByteBuffer.wrap(metadata));
```
🤖 nit: could you use HoodieSchema.Variant.VARIANT_METADATA_FIELD and VARIANT_VALUE_FIELD here (and on the next line) instead of the raw "metadata"/"value" strings? ParquetSchemaConverter.convertVariantToParquetType already references those constants, so using them here keeps the field names consistent across the codebase.
@danny0405 Thanks for sharing! I guess then it would be better if I wait for that to land and rebase this PR off of it, right?

@cshuo @danny0405 After taking a step back, I think we should first get consensus on how Flink in Hudi should always infer the correct Flink type to use for Variant, Vector, and Blob. Can we discuss on #18711? I can then work on and land initial PR(s) that do the needed "wiring" so that, when we start tackling Blob and Vector, we can use a similar approach as Variant. Although with Variant we are a bit "lucky" in that Parquet has an official variant annotation, it would be good to have a common solution and agree on what kind of backwards compatibility we want (like Hudi 1.3 Flink being able to read data written by Spark 4.0 Hudi 1.2 builds).
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18539       +/-  ##
============================================
- Coverage     68.07%   58.25%    -9.82%
+ Complexity    29108    25250     -3858
============================================
  Files          2528     2528
  Lines        141510   141517        +7
  Branches      17552    17549        -3
============================================
- Hits          96329    82441    -13888
- Misses        37255    51987    +14732
+ Partials       7926     7089      -837
```
@danny0405 @cshuo Based on our discussion in #18711, I updated this PR.
Describe the issue this Pull Request addresses
Add support for reading and writing unshredded Variant data types through the Flink write/read paths in Hudi. This enables Flink-based ingestion pipelines to both write new Variant data (e.g., via `PARSE_JSON()` in Flink SQL) and read existing Variant data (including tables written by Spark 4.0 / PR #18036). Flink Variant writes produce the same canonical Parquet layout (`{metadata: required binary, value: required binary}`) as Spark, ensuring full cross-engine interoperability.

On Flink 2.1+, Variant columns are exposed as the native `VARIANT` `LogicalType` (enabling SQL functions like `PARSE_JSON` and correct `DESCRIBE TABLE` output). Pre-2.1 Flink does not support Variant: all Variant code paths throw `UnsupportedOperationException` with a clear message indicating Flink 2.1+ is required (similar to how Spark pre-4.0 rejects Variant usage). Shredded Variant is explicitly blocked with clear error messages until full support is added in a follow-up.
We explicitly want to block unsupported Variant use cases for Flink (shredded Variant, and any Variant usage on pre-2.1 Flink) with explicit error messages, to reduce issues with debugging or data correctness.
Summary and Changelog
- **Native Flink 2.1+ VARIANT type via `DataTypeAdapter`:** `HoodieSchemaConverter.convertVariant()` emits Flink's native `VariantType` on 2.1+ using the multi-version `DataTypeAdapter` shim (compiled per-Flink-version in `hudi-flink2.1.x`). This avoids runtime reflection entirely: `DataTypeAdapter.createVariantType()` provides the `VariantType` instance, and companion methods (`createVariant()`, `getVariant()`, `getVariantMetadata()`, `getVariantValue()`) handle RowData access. On pre-2.1 Flink, it throws `UnsupportedOperationException` (see the sketch below).
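An illustrative sketch of the per-version shim idea, not the PR's actual adapter; the `VariantType` FQCN and constructor are assumed from FLIP-521, and the real adapter described above has more methods:

```java
import org.apache.flink.table.types.logical.LogicalType;

// hudi-flink2.1.x build: compiled against Flink 2.1, so VariantType is a
// direct compile-time reference and no reflection is needed.
public final class DataTypeAdapterSketch {
  private DataTypeAdapterSketch() {}

  public static LogicalType createVariantType() {
    return new org.apache.flink.table.types.logical.VariantType();
  }
}

// Pre-2.1 builds ship the same entry point but fail fast instead:
//   public static LogicalType createVariantType() {
//     throw new UnsupportedOperationException("VARIANT type requires Flink 2.1+.");
//   }
```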
- **Reverse VARIANT mapping:** `HoodieSchemaConverter.convertToSchema()` handles `LogicalTypeRoot.VARIANT` (by name string comparison, to avoid a compile-time dependency on Flink 2.1) and maps it back to `HoodieSchema.createVariant()`.

- **Factory-based schema injection for Flink Parquet readers:** `HoodieFileReaderFactory` gains a new `getFileReader(config, path, schemaOption)` overload and a `newParquetFileReader(path, schemaOption)` extension point (with a backward-compatible default that delegates to the old `newParquetFileReader(path)`). The Flink factory (`HoodieRowDataFileReaderFactory`) overrides the new method to pass the `Option<HoodieSchema>` to the `HoodieRowDataParquetReader` constructor. This ensures schema flows through at construction time, not via a post-construction setter, so that:
  - the read path (`FlinkRowDataReaderContext`, `ClusteringOperator`) supplies the table schema for correct Variant/Blob/Vector type inference;
  - the merge path (`HoodieMergeHelper`, `HoodieWriteMergeHandle`) supplies the writer schema, fixing upsert failures where the reader previously couldn't derive the correct RowType for Variant columns;
  - the log-block path (`HoodieParquetDataBlock`) supplies the writer schema from the block header;
  - remaining callers pass `Option.empty()` via the existing 2-arg `getFileReader(config, path)` and never invoke schema-dependent methods.

  No changes to Spark, Avro, or Trino factories: they inherit the default, which delegates to their existing `newParquetFileReader(path)` override. A sketch of the extension point follows.
HoodieRowDataParquetReader.getSchema()andgetRowType()throwIllegalStateExceptionif the schema was not supplied at construction time. This prevents silent mis-inference of Variant/Blob/Vector columns as ordinary BYTES/ROW during record reading. The error message directs callers to use the schema-aware factory overload.Parquet schema conversion (write):
ParquetSchemaConverter.convertVariantToParquetType()produces the canonical unshredded Parquet Variant group ({metadata: required binary, value: required binary}).Defense-in-depth Variant annotation detection:
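One way to build that canonical layout with parquet-java's fluent `Types` builder; a sketch only, since the PR's actual construction code may differ, and `variant_col` is a placeholder name:

```java
import org.apache.parquet.schema.GroupType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

final class VariantLayoutExample {
  // {metadata: required binary, value: required binary}
  static GroupType canonicalVariantGroup(String name, boolean optional) {
    return Types.buildGroup(optional ? Type.Repetition.OPTIONAL : Type.Repetition.REQUIRED)
        .required(PrimitiveTypeName.BINARY).named("metadata")
        .required(PrimitiveTypeName.BINARY).named("value")
        .named(name);
  }

  public static void main(String[] args) {
    System.out.println(canonicalVariantGroup("variant_col", true));
  }
}
```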
- **Defense-in-depth Variant annotation detection:** `ParquetSchemaConverter` detects the Parquet `VARIANT` logical type annotation (parquet-java 1.15.2+) via class-name string matching to avoid a compile-time dependency. Shredded variants (with a `value_shredded` field) are rejected with clear errors.

- **Vectorized VARIANT read (Flink 2.1.x):** `ParquetSplitReaderUtil` in `hudi-flink2.1.x` adds `case VARIANT:` in `createColumnReader`, `createWritableColumnVector`, and `createVectorFromConstant`. The reader produces a `HeapRowColumnVector` with child vectors ordered `[value, metadata]` to match Flink's `VectorizedColumnBatch.getVariant()`.

- **Nested Variant support:** Variant nested in containers (`ARRAY<VARIANT>`, `MAP<STRING, VARIANT>`) works through the Avro converter and Parquet schema write paths. The vectorized nested read path (`ColumnarGroupRowData.getVariant()`, `ColumnarGroupArrayData.getVariant()`) throws `UnsupportedOperationException`.

- **RowData ↔ Avro converters:** Both `RowDataToAvroConverters` and `AvroToRowDataConverters` detect the `VARIANT` `LogicalTypeRoot` at runtime (by name) and handle native Flink `Variant` objects via `DataTypeAdapter` methods. On pre-2.1 Flink the VARIANT case never fires, because the schema conversion would have already thrown.

- **Shredded Variant guards:** All Flink Variant code paths (`HoodieSchemaConverter.convertVariant()`, `RowDataToAvroConverters.convertVariantToAvro()`, and `ParquetSchemaConverter.isShreddedVariant()`) throw `UnsupportedOperationException` for shredded Variant schemas until full support is added.
- **Unit tests:** `TestRowDataToAvroConvertersVariant` (shredded rejection, unshredded Avro round-trip, nested variant in arrays/maps), `TestHoodieSchemaConverter` (shredded rejection; pre-2.1 rejection for standalone/nested/record variants), `TestParquetSchemaConverter` (annotation-based detection, shredded rejection with annotation, pre-2.1 rejection). A sketch of the rejection-test style follows.
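A hypothetical JUnit 5 sketch of the shredded-rejection style of test named above; `buildShreddedVariantSchema()` stands in for however the real tests construct a shredded `HoodieSchema`, which this description does not show:

```java
import static org.junit.jupiter.api.Assertions.assertThrows;

import org.junit.jupiter.api.Test;

class VariantRejectionSketch {

  @Test
  void convertVariantRejectsShreddedSchema() {
    HoodieSchema shredded = buildShreddedVariantSchema();
    assertThrows(UnsupportedOperationException.class,
        () -> HoodieSchemaConverter.convertVariant(shredded));
  }

  private static HoodieSchema buildShreddedVariantSchema() {
    return null; // illustrative placeholder; not how the PR's tests build it
  }
}
```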
Impact

- Flink 2.1+ users see the `VARIANT` type in `DESCRIBE TABLE` output and can use Variant SQL functions.
- Pre-2.1 Flink users get an `UnsupportedOperationException` if they attempt to use Variant columns, with a clear message that Flink 2.1+ is required.
- Variant detection is driven by the table's `HoodieSchema` declaration.
- The `HoodieFileReaderFactory` schema overload is a cross-engine addition (in `hudi-common`), but only the Flink factory overrides it. Spark/Avro/Trino factories are untouched.
Risk Level

Low. Changes are additive and scoped to the Flink Variant code path. Non-Variant schemas are completely unaffected: the new logic only activates when `HoodieSchemaType.VARIANT` is detected. The `DataTypeAdapter` shim avoids reflection entirely. The factory `newParquetFileReader(path, schemaOption)` default delegates to the old method, preserving all existing behavior for non-Flink engines.
Documentation Update

None. Variant type support is an internal storage capability that works transparently with existing Flink SQL / DataStream APIs.
Contributor's checklist