
[HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column #2790

Open · wants to merge 1 commit into master from bootstrap_timestamp

Conversation

li36909 (Contributor) commented Apr 8, 2021


What is the purpose of the pull request

Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file which contains a timestamp column, it fails because of these issues:

  1. At bootstrap, if the original parquet file was written by a Spark application, Spark by default saves timestamps as INT96 (see spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot read the INT96 type today. (This can be solved by upgrading parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see https://github.com/apache/parquet-mr/pull/831/files.)

  2. After bootstrap, upsert fails because we use the Hoodie schema to read the original parquet file. The schemas do not match, since the Hoodie schema treats the timestamp as long while the original file stores it as INT96.

  3. After bootstrap, a partial update of a parquet file fails, because we copy the old record and save it with the Hoodie schema (we miss a convertFixedToLong operation like the one Spark does; see the sketch after this list).
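
For context, here is a minimal sketch (not part of this patch) of the two pieces referred to above, assuming parquet-avro 1.12.0+ where parquet.avro.readInt96AsFixed is available; the Int96Utils object and the enableInt96AsFixed/fixedToMicros names are hypothetical, for illustration only:

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.apache.hadoop.conf.Configuration

object Int96Utils {

  // With parquet-avro >= 1.12.0, INT96 timestamps can be surfaced to Avro readers as a
  // 12-byte FIXED value instead of failing, by enabling this flag (parquet-mr PR #831).
  def enableInt96AsFixed(conf: Configuration): Unit =
    conf.setBoolean("parquet.avro.readInt96AsFixed", true)

  private val JulianDayOfEpoch = 2440588L             // Julian day number of 1970-01-01
  private val MicrosPerDay     = 86400L * 1000 * 1000

  // The kind of "convertFixedToLong" step item 3 alludes to: decode the 12-byte INT96
  // value (8-byte little-endian nanos-of-day followed by a 4-byte Julian day) into
  // microseconds since the Unix epoch, i.e. the long representation the Hoodie schema
  // expects for a timestamp column.
  def fixedToMicros(bytes: Array[Byte]): Long = {
    require(bytes.length == 12, s"INT96 timestamps are 12 bytes, got ${bytes.length}")
    val buf        = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong
    val julianDay  = buf.getInt
    (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000
  }
}
```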

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

Added a new UT, and also verified with existing UTs.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

li36909 (Contributor, Author) commented Apr 8, 2021

cc @nsivabalan, could you help take a look? Thank you.

nsivabalan (Contributor)

@li36909: can you fix the links in the description? I guess it's cut off:

parquet.avro.readInt96AsFixed=true, please check https://github https://github/.com/apache/parquet-mr/pull/831/files)

nsivabalan (Contributor) commented Apr 8, 2021

@li36909: IIUC, this patch is not about failing a bootstrap or upsert w/ timestamp. We are adding support for timestamp columns by upgrading the parquet version. If yes, please fix the title of the patch.
Also, it looks like we are upgrading the parquet version in this patch. @vinothchandar @n3nash @bvaradar: thoughts? Are there any considerations required on this end?

@nsivabalan nsivabalan self-assigned this Apr 8, 2021
li36909 force-pushed the bootstrap_timestamp branch 2 times, most recently from 1832aa0 to 0996d9a, on April 12, 2021 03:29
li36909 (Contributor, Author) commented Apr 13, 2021

@nsivabalan thank you, I have changed the link. Let me check the UT failure first, then I will change the title.

li36909 (Contributor, Author) commented Apr 13, 2021

There are some UT failures caused by https://github.com/apache/parquet-mr/pull/747/files.
In that PR, parquet sets the requiredSchema first and then applies the filter; when we run count() through Spark's MOR relation, the requiredSchema is empty, so the filter result is empty as well.
I upgraded the parquet version for Spark too, and Spark has no problem there; I checked by hand and the reason is that Spark adds the filter attributes to the requiredSchema like this:
val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq ++ projects
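
For illustration, a minimal sketch of the idea in the quoted Spark line; requiredColumns is a hypothetical helper, not the actual Spark or Hudi code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}

// Sketch: before deriving the parquet requiredSchema, fold the attributes referenced by
// pushed-down filters into the projected columns. A plain count() projects no columns,
// so without this step the read schema would be empty and the filter would match nothing.
def requiredColumns(projects: Seq[NamedExpression],
                    filters: Seq[Expression]): Seq[NamedExpression] = {
  val filterAttributes: Set[Attribute] = filters.flatMap(_.references).toSet
  (filterAttributes.toSeq ++ projects).distinct
}
```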

li36909 force-pushed the bootstrap_timestamp branch 4 times, most recently from b76b16e to afbe55f, on April 14, 2021 03:17
vinothchandar (Member)

@li36909 Thanks for your contribution! Queued up for review.

vinothchandar (Member)

@umehrot2 in case you have some time, please take a pass.

nsivabalan added the priority:major (degraded perf; unable to move forward; potential bugs) label on May 11, 2021
nsivabalan (Contributor) left a comment


Upgrading the parquet version is not trivial. We might have to consider EMR Spark compatibility, as we might run into issues.
@n3nash @vinothchandar @umehrot2: this patch is proposing to upgrade the parquet version. May I know what all we need to consider for such an upgrade?

vinothchandar (Member)

cc @n3nash, could you please chime in here?

codope (Member) left a comment


@li36909 Could you please rebase and resolve the conflicts?

nsivabalan (Contributor)

@li36909: it would also be good to put up a separate patch for upgrading the parquet version.

hudi-bot commented Nov 5, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

nsivabalan (Contributor)

@xushiyan @alexeykudinkin: another patch that's awaiting the parquet upgrade.

bvaradar (Contributor) commented Mar 8, 2023

With Spark 3, parquet version 1.12.2 is being used. Is this issue still happening with that setup?

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Feb 26, 2024
yihua (Contributor) commented Sep 10, 2024

@jonvex could you check if this is still a problem?

@@ -96,7 +96,7 @@ protected Iterator<GenericRecord> getMergingIterator(HoodieTable<T, I, K, O> tab
   Configuration bootstrapFileConfig = new Configuration(table.getHadoopConf());
   HoodieFileReader<GenericRecord> bootstrapReader = HoodieFileReaderFactory.<GenericRecord>getFileReader(bootstrapFileConfig, externalFilePath);
   Schema bootstrapReadSchema;
-  if (externalSchemaTransformation) {
+  if (externalSchemaTransformation || baseFile.getBootstrapBaseFile().isPresent()) {
Contributor (inline review comment):

We should check if this is already fixed.

Labels: priority:major (degraded perf; unable to move forward; potential bugs), schema-and-data-types, size:M (PR with lines of changes in (100, 300])
Projects:
  Status: 🚧 Needs Repro
  Status: 🏗 Under discussion
8 participants