Why does Hudi not support field deletions? #2331

Closed
brandon-stanley opened this issue Dec 13, 2020 · 10 comments
Labels
priority:minor everything else; usability gaps; questions; feature reqs

Comments

@brandon-stanley

Hi Hudi Team! I have a question about field deletions/schema evolution. The FAQ Documentation states the following:

Hudi uses Avro as the internal canonical representation for records, primarily due to its nice schema compatibility & evolution properties. This is a key aspect of having reliability in your ingestion or ETL pipelines. As long as the schema passed to Hudi (either explicitly in DeltaStreamer schema provider configs or implicitly by Spark Datasource's Dataset schemas) is backwards compatible (e.g no field deletes, only appending new fields to schema), Hudi will seamlessly handle read/write of old and new data and also keep the Hive schema up-to date.

While reading the Confluent Documentation that is linked above, I noticed that "Delete fields" is an "allowed change" for BACKWARDS compatible schemas. I assume that the Avro schema that is tracked within Hudi is BACKWARDS compatible and therefore should allow field deletions, but the FAQ Documentation states otherwise. Can you please clarify the following:

  1. Why are field deletions not supported within Hudi?
  2. Is there a way to determine (and possibly update) the Avro schema compatibility type for a Hudi table?
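
For context, pure Avro's compatibility rules can be checked programmatically. Below is a minimal sketch (illustrative only; the record and field names are made up) using org.apache.avro.SchemaCompatibility to show that a reader schema with a field deleted can still read data written with the old schema, which is what BACKWARD compatibility means in Avro/Confluent terms:

```java
// Illustrative sketch: checking pure-Avro BACKWARD compatibility when a
// field is deleted. Record/field names ("trip", "tip") are hypothetical.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class DeleteFieldCheck {
  public static void main(String[] args) {
    Schema oldSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"uuid\",\"type\":\"string\"},"
      + "{\"name\":\"tip\",\"type\":\"int\"}]}");
    Schema newSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"uuid\",\"type\":\"string\"}]}");

    // BACKWARD compatibility asks: can a reader on the new schema read data
    // written with the old one? For a deleted field, Avro answers yes.
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(newSchema, oldSchema)
        .getType()); // COMPATIBLE
  }
}
```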
@bvaradar
Contributor

@prashantwason @nbalajee @satishkotha : Can you please look into this ?

@prashantwason
Member

I think the distinction is in UPDATE use-cases. Consider this scenario:

t1: Insert with Schema 1: file1.parquet is created and its records have schema 1.

t2: Update with Schema 2: Suppose schema 2 has one field deleted and a single record is being updated. This leads to file1.parquet being read and rewritten (after updating the single record) into file2.parquet. But all records in file2.parquet would no longer have the deleted field.

Another scenario is possible where the deleted field is later added back with a different, incompatible "type" (e.g. an int field was deleted and another field with the same name but a "string" type was added). This schema will have issues reading historical data within the dataset that was written with the older schema.

If you want to delete a field within a HUDI dataset, it may be simpler to copy the dataset using a new schema.
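
To make the second scenario concrete, here is an illustrative check (field names are hypothetical) showing that a field re-added under the same name with a different type cannot be resolved against historical data, since Avro has no int -> string promotion:

```java
// Illustrative sketch of the re-added-field hazard: "flag" was written as
// an int, deleted, then re-added as a string. Readers on the new schema
// cannot decode the historical int-typed records.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class ReAddedFieldCheck {
  public static void main(String[] args) {
    Schema historical = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"flag\",\"type\":\"int\"}]}");
    Schema reAdded = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
      + "{\"name\":\"flag\",\"type\":\"string\"}]}");

    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(reAdded, historical)
        .getType()); // INCOMPATIBLE
  }
}
```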

@brandon-stanley
Author

Thanks for your response @prashantwason.

Does this mean that Hudi's schema handling is essentially a wrapper around Avro, with an additional check that disallows field deletes because of the scenarios you listed above? I just want to confirm, because the documentation I previously linked states that deletes are supported for BACKWARDS compatible schemas within Avro:

[Screenshot: Confluent compatibility table listing "Delete fields" as an allowed change for BACKWARD compatible schemas]

Cheers,

Brandon

@prashantwason
Member

That's correct.

HUDI does not have a full schema management system. The schema to be used is provided at write time, where we validate that the schema being used for the current write is compatible with the existing schema (from previous writes). Hence, HUDI schema management is very simplistic compared to the documentation you referred to.

In producer-consumer systems, schema compatibility is a simpler job: by upgrading the producer and consumer code with newer schemas, the schema can be changed, since all new data will be generated using a schema that both sides understand and there is no historical data with an older schema version left to process. But within HUDI there are always versions of data saved with older schemas, and to continue to provide features like incremental reads (which read data over a time range) and updates (old data can be changed), we have to restrict schema modification.
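
On the second question from the original post (determining a Hudi table's current schema), one option is Hudi's TableSchemaResolver, which derives the latest Avro schema from the table's commit metadata. The sketch below is hedged: the base path is made up and the exact builder API varies across Hudi versions:

```java
// Hedged sketch: reading a table's latest Avro schema via Hudi's
// TableSchemaResolver. The table path is hypothetical and the
// HoodieTableMetaClient builder API differs between Hudi releases.
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.TableSchemaResolver;

public class ShowTableSchema {
  public static void main(String[] args) throws Exception {
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath("/tmp/hudi/trips") // hypothetical table base path
        .build();
    Schema tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchema();
    System.out.println(tableSchema.toString(true)); // pretty-printed JSON
  }
}
```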

@tooptoop4

@prashantwason does hudi support adding new columns or changing existing column types (i.e. long to string)?

@prashantwason
Member

Yes, adding new columns (fields in the schema) is supported as long as they have default values specified. This is because the new fields will not be present in older records and hence cannot be populated directly when reading records from existing data.
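
As an illustration of such an addition (record and field names are hypothetical), Avro's SchemaBuilder can declare the new field as nullable with a null default, so older records that lack the field remain readable:

```java
// Illustrative sketch: evolving a schema by appending a nullable field
// with a null default, using Avro's SchemaBuilder.
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AddFieldWithDefault {
  public static void main(String[] args) {
    Schema evolved = SchemaBuilder.record("trip").fields()
        .requiredString("uuid")
        .requiredDouble("fare")
        .optionalDouble("tip_amount") // union(null, double) with default null
        .endRecord();
    System.out.println(evolved.toString(true));
  }
}
```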

The following field type changes are allowed:

old type → new type
int → long
int → float
int → double
long → double
float → double
bytes → string
string → bytes

Code references:
https://github.com/apache/hudi/pull/2350/files
https://github.com/rdblue/avro-java/blob/master/avro/src/main/java/org/apache/avro/SchemaCompatibility.java#L359
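
For completeness, any of the promotions above can be verified with the same Avro utility referenced in the second link; this is an illustrative sketch, not Hudi's own validation code:

```java
// Illustrative sketch: int -> long is an allowed promotion, so a reader
// using the widened type can read data written with the narrower one.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class PromotionCheck {
  public static void main(String[] args) {
    Schema writtenAsInt = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"n\",\"type\":\"int\"}]}");
    Schema readAsLong = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
      + "{\"name\":\"n\",\"type\":\"long\"}]}");
    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(readAsLong, writtenAsInt)
        .getType()); // COMPATIBLE
  }
}
```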

@nsivabalan
Contributor

@prashantwason: In light of this ticket, do you think we can update our documentation with respect to schema evolution? If you don't mind, can you take it up and fix our documentation? https://issues.apache.org/jira/browse/HUDI-1548

@nsivabalan nsivabalan added the priority:minor everything else; usability gaps; questions; feature reqs label Feb 6, 2021
@nsivabalan
Contributor

CC @n3nash. This is a schema evolution related ask.

@n3nash
Contributor

n3nash commented Jun 3, 2021

Closing this ticket; docs will be added as part of the JIRA. @brandon-stanley, feel free to re-open if needed.

@n3nash n3nash closed this as completed Jun 3, 2021