Skip to content

[SUPPORT] MERGE INTO with UPDATE */ INESRT * - new incoming columns dropped, automatic schema evolution feature #5899

@kazdy

Description

@kazdy

Describe the problem you faced

I've tried using MERGE INTO with UPDATE * and INSERT * statement with full schema evolution enabled.
I've noticed that during insert new columns from incoming batch (that do not exist in target table yet) are dropped and target schema is applied. No warnings nor failed writes.

Therefore can we as users automatically evolve schema on MERGE INTO operations?
I guess this should only be supported when we use update set * and insert * in merge operation.

Expected behavior

When incoming data is missing columns that already declared in target table these should be injected with default/null values.
When incoming data has new columns that are not yet declared in the target table, these should be added to the target table.
Case when incoming data has both missing columns and new columns, missing coluns should be injected with null/ default values, new columns should be added to the target table.

New columns should be reflected in metastore table schema.

Would be great to support complex types, and nested schemas.

Thread from dev mailing list as a reference:
https://lists.apache.org/thread/kr59hh7yqr2c1y33kzfv3n97h6ydbz9b

Environment Description

  • Hudi version : 0.11

  • Spark version : 3.2.0-amzn

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) :

  • Running on Docker? (yes/no) :

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

Metadata

Metadata

Assignees

Labels

area:schemaSchema evolution and data typespriority:mediumModerate impact; usability gaps

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions