Why does Hudi not support field deletions? #2331

Closed
brandon-stanley opened this issue Dec 13, 2020 · 10 comments
Labels
priority:minor everything else; usability gaps; questions; feature reqs

Comments

@brandon-stanley

Hi Hudi Team! I have a question about field deletions/schema evolution. The FAQ Documentation states the following:

Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility & evolution properties. This is a key aspect of having reliability in your ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly in DeltaStreamer schema provider configs or implicitly by Spark Datasource's Dataset schemas) is backwards compatible (e.g no field deletes, only appending new fields to schema), Hudi will seamlessly handle read/write of old and new data and also keep the Hive schema up-to date.

While reading the Confluent Documentation that is linked above, I noticed that "Delete fields" is an "allowed change" for BACKWARDS compatible schemas. I assume that the Avro schema that is tracked within Hudi is BACKWARDS compatible and therefore should allow field deletions, but the FAQ Documentation states otherwise. Can you please clarify the following:

  1. Why are field deletions not supported within Hudi?
  2. Is there a way to determine (and possibly update) the Avro schema compatibility type for a Hudi table?
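
For context, pure Avro's compatibility rules can be checked programmatically. Below is a minimal sketch (illustrative only; the record and field names are made up) using org.apache.avro.SchemaCompatibility to show that a reader schema with a field deleted can still read data written with the old schema, which is what BACKWARD compatibility means in Avro/Confluent terms:

```java
// Illustrative sketch: checking pure-Avro BACKWARD compatibility when a
// field is deleted. Record/field names ("trip", "tip") are hypothetical.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class DeleteFieldCheck {
  public static void main(String[] args) {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"uuid\",\"type\":\"string\"},"
      + "{\"name\":\"tip\",\"type\":\"int\"}]}");
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"uuid\",\"type\":\"string\"}]}");

    // BACKWARD compatibility asks: can a reader on the new schema read data
    // written with the old one? For a deleted field, Avro answers yes.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(newSchema, oldSchema)
        .getType()); // COMPATIBLE
  }
}
```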
@bvaradar
Contributor

@prashantwason @nbalajee @satishkotha : Can you please look into this ?

@prashantwason
Member

I think the distinction is in UPDATE use-cases. Consider this scenario:

t1: Insert with Schema 1: file1.parquet is created and its records have schema 1.

t2: Update with Schema 2: Suppose schema 2 has one field deleted and a single record is being updated. This leads to file1.parquet being read and rewritten (after updating the single record) into file2.parquet. But all records in file2.parquet would no longer have the deleted field.

Another scenario is possible where the deleted field is later added back with a different, incompatible "type" (e.g. an int field was deleted and another field with the same name but a "string" type was added). This schema will have issues reading historical data within the dataset that was written with the older schema.

If you want to delete a field within a HUDI dataset, it may be simpler to copy the dataset using a new schema.
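
To make the second scenario concrete, here is an illustrative check (field names are hypothetical) showing that a field re-added under the same name with a different type cannot be resolved against historical data, since Avro has no int -> string promotion:

```java
// Illustrative sketch of the re-added-field hazard: "flag" was written as
// an int, deleted, then re-added as a string. Readers on the new schema
// cannot decode the historical int-typed records.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class ReAddedFieldCheck {
  public static void main(String[] args) {
    Schema historical = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"flag\",\"type\":\"int\"}]}");
    Schema reAdded = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"flag\",\"type\":\"string\"}]}");

    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reAdded, historical)
        .getType()); // INCOMPATIBLE
  }
}
```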

@brandon-stanley
Author

Thanks for your response @prashantwason.

Does this mean that Hudi's schema handling is essentially a wrapper around Avro, with an additional check that disallows field deletes because of the scenarios you listed above? I just want to confirm, because the documentation I previously linked states that deletes are supported for BACKWARDS compatible schemas within Avro:

[Screenshot: Confluent compatibility table listing "Delete fields" as an allowed change for BACKWARD compatible schemas]

Cheers,

Brandon

@prashantwason
Member

That's correct.

HUDI does not have a full schema management system. The schema to be used is provided at write time, where we validate that the schema being used for the current write is compatible with the existing schema (from previous writes). Hence, HUDI schema management is very simplistic compared to the documentation you referred to.

In producer-consumer systems, schema compatibility is a simpler job: by upgrading the producer and consumer code with newer schemas, the schema can be changed, since all new data will be generated using a schema that both sides understand and there is no historical data with an older schema version left to process. But within HUDI there are always versions of data saved with older schemas, and to continue to provide features like incremental reads (which read data over a time range) and updates (old data can be changed), we have to restrict schema modification.
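
On the second question from the original post (determining a Hudi table's current schema), one option is Hudi's TableSchemaResolver, which derives the latest Avro schema from the table's commit metadata. The sketch below is hedged: the base path is made up and the exact builder API varies across Hudi versions:

```java
// Hedged sketch: reading a table's latest Avro schema via Hudi's
// TableSchemaResolver. The table path is hypothetical and the
// HoodieTableMetaClient builder API differs between Hudi releases.
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.TableSchemaResolver;

public class ShowTableSchema {
  public static void main(String[] args) throws Exception {
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath("/tmp/hudi/trips") // hypothetical table base path
        .build();
    Schema tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
    System.out.println(tableSchema.toString(true)); // pretty-printed JSON
  }
}
```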

@tooptoop4

@prashantwason does hudi support adding new columns or changing existing column types (i.e. long to string)?

@prashantwason
Member

Yes, adding new columns (fields in the schema) is supported as long as they have default values specified. This is because the new fields will not be present in older records and hence cannot be populated directly when reading records from existing data.
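
As an illustration of such an addition (record and field names are hypothetical), Avro's SchemaBuilder can declare the new field as nullable with a null default, so older records that lack the field remain readable:

```java
// Illustrative sketch: evolving a schema by appending a nullable field
// with a null default, using Avro's SchemaBuilder.
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AddFieldWithDefault {
  public static void main(String[] args) {
    Schema evolved = SchemaBuilder.record("trip").fields()
        .requiredString("uuid")
        .requiredDouble("fare")
        .optionalDouble("tip_amount") // union(null, double) with default null
        .endRecord();
    System.out.println(evolved.toString(true));
  }
}
```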

The following field type changes are allowed:

old type → new type
int → long
int → float
int → double
long → double
float → double
bytes → string
string → bytes

Code references:
https://github.com/apache/hudi/pull/2350/files
https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/SchemaCompatibility.java#L359
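
For completeness, any of the promotions above can be verified with the same Avro utility referenced in the second link; this is an illustrative sketch, not Hudi's own validation code:

```java
// Illustrative sketch: int -> long is an allowed promotion, so a reader
// using the widened type can read data written with the narrower one.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class PromotionCheck {
  public static void main(String[] args) {
    Schema writtenAsInt = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"n\",\"type\":\"int\"}]}");
    Schema readAsLong = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"n\",\"type\":\"long\"}]}");
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(readAsLong, writtenAsInt)
        .getType()); // COMPATIBLE
  }
}
```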

@nsivabalan
Contributor

@prashantwason: In light of this ticket, do you think we can update our documentation with respect to schema evolution? If you don't mind, can you take it up and fix our documentation? https://issues.apache.org/jira/browse/HUDI-1548

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label Feb 6, 2021
@nsivabalan
Contributor

CC @n3nash. This is a schema evolution related ask.

@n3nash
Contributor

n3nash commented Jun 3, 2021

Closing this ticket; docs will be added as part of the JIRA. @brandon-stanley, feel free to re-open if needed.

@n3nash n3nash closed this as completed Jun 3, 2021