[HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column #2790
base: master
cc @nsivabalan could you help take a look? Thank you.
@li36909: can you fix the links in the description? I guess they got cut off.
@li36909: IIUC, this patch is not about failing a bootstrap or upsert with a timestamp column. We are adding support for timestamp columns by upgrading the parquet version. If so, please fix the title of the patch.
1832aa0 to 0996d9a
@nsivabalan thank you, I have changed the link. Let me check the UT failures first, then I will change the title.
Some UTs fail because of: https://github.com/apache/parquet-mr/pull/747/files
b76b16e to afbe55f
afbe55f to 41aec71
@li36909 Thanks for your contribution! Queued up for review.
@umehrot2 in case you have some time, please take a pass.
Upgrading the parquet version is not trivial. We might have to consider EMR Spark compatibility, as we might run into issues.
@n3nash @vinothchandar @umehrot2: this patch proposes upgrading the parquet version. What do we need to consider for such an upgrade?
cc @n3nash, could you please chime in here?
@li36909 Could you please rebase and resolve the conflicts?
@li36909: it would also be good to put up a separate patch for upgrading the parquet version.
@xushiyan @alexeykudinkin: another patch that's awaiting the parquet upgrade.
With Spark 3, parquet version 1.12.2 is being used. Is this issue still happening with that setup?
@jonvex could you check if this is still a problem?
@@ -96,7 +96,7 @@ protected Iterator<GenericRecord> getMergingIterator(HoodieTable<T, I, K, O> tab
 Configuration bootstrapFileConfig = new Configuration(table.getHadoopConf());
 HoodieFileReader<GenericRecord> bootstrapReader = HoodieFileReaderFactory.<GenericRecord>getFileReader(bootstrapFileConfig, externalFilePath);
 Schema bootstrapReadSchema;
-if (externalSchemaTransformation) {
+if (externalSchemaTransformation || baseFile.getBootstrapBaseFile().isPresent()) {
We should check if this is already fixed.
What is the purpose of the pull request
Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file that contains a timestamp column, it fails because of these issues:
During bootstrap, if the original parquet file was written by a Spark application, Spark saves timestamps as INT96 by default (see spark.sql.parquet.int96AsTimestamp), and the bootstrap fails because Hudi cannot read the INT96 type yet. (This issue can be solved by upgrading parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; please check https://github.com/apache/parquet-mr/pull/831/files)
After bootstrap, an upsert fails because we use the Hoodie schema to read the original parquet file. The schemas do not match: the Hoodie schema treats timestamp as long, while in the original file it is INT96.
After bootstrap, a partial update of a parquet file fails because we copy the old record and save it with the Hoodie schema (we are missing a convertFixedToLong operation like the one Spark does).
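To make the fixed-to-long step concrete, here is a minimal sketch of how a 12-byte INT96 value (as returned when parquet.avro.readInt96AsFixed=true) could be turned into a long of microseconds since the epoch. It assumes the layout Spark writes: 8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number. The class and method names (Int96Timestamps, int96ToEpochMicros, epochMicrosToInt96) are hypothetical and are not taken from this patch; the actual convertFixedToLong in the PR may differ.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only; not part of Hudi or this patch.
final class Int96Timestamps {
    // Julian day number of 1970-01-01 (the Unix epoch).
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
    static final long MICROS_PER_DAY = 86_400_000_000L;

    private Int96Timestamps() {
    }

    // Converts a 12-byte INT96 timestamp to microseconds since the epoch.
    // Assumed layout: bytes 0-7 = nanos of day, bytes 8-11 = Julian day,
    // both little-endian.
    static long int96ToEpochMicros(byte[] int96) {
        if (int96 == null || int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt();
        return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1_000L;
    }

    // Inverse conversion, included so the sketch can be checked by round trip.
    static byte[] epochMicrosToInt96(long epochMicros) {
        long daysSinceEpoch = Math.floorDiv(epochMicros, MICROS_PER_DAY);
        long microsOfDay = Math.floorMod(epochMicros, MICROS_PER_DAY);
        ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(microsOfDay * 1_000L);
        buf.putInt((int) (daysSinceEpoch + JULIAN_DAY_OF_EPOCH));
        return buf.array();
    }

    public static void main(String[] args) {
        long micros = 1_617_235_200_000_000L; // an arbitrary sample value
        byte[] encoded = epochMicrosToInt96(micros);
        System.out.println(int96ToEpochMicros(encoded) == micros);
    }
}
```

With a conversion like this applied when old records are copied, the value would match a Hoodie schema that treats the timestamp as long. Sub-microsecond precision in the INT96 value is dropped by the integer division.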
Brief change log
Verify this pull request
Added a new UT, and also verified with the existing UTs.
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.