
[HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column #2790

Open · wants to merge 1 commit into master from bootstrap_timestamp

Conversation

li36909 (Contributor) commented Apr 8, 2021


What is the purpose of the pull request

Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file which contains a timestamp column, it fails because of these issues:

  1. At bootstrap, if the original parquet file was written by a Spark application, Spark by default saves timestamps as INT96 (see spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot read the INT96 type today. (This can be solved by upgrading parquet to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see https://github.com/apache/parquet-mr/pull/831/files.)

  2. After bootstrap, upsert fails because we use the Hoodie schema to read the original parquet file. The schemas do not match, since the Hoodie schema treats the timestamp as long while the original file stores it as INT96.

  3. After bootstrap, a partial update of a parquet file fails, because we copy the old record and save it with the Hoodie schema (we miss a convertFixedToLong operation like the one Spark does; see the sketch after this list).
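
For context, here is a minimal sketch (not part of this patch) of the two pieces referred to above, assuming parquet-avro 1.12.0+ where parquet.avro.readInt96AsFixed is available; the Int96Utils object and the enableInt96AsFixed/fixedToMicros names are hypothetical, for illustration only:

```scala
import java.nio.{ByteBuffer, ByteOrder}
import org.apache.hadoop.conf.Configuration

object Int96Utils {

  // With parquet-avro >= 1.12.0, INT96 timestamps can be surfaced to Avro readers as a
  // 12-byte FIXED value instead of failing, by enabling this flag (parquet-mr PR #831).
  def enableInt96AsFixed(conf: Configuration): Unit =
    conf.setBoolean("parquet.avro.readInt96AsFixed", true)

  private val JulianDayOfEpoch = 2440588L             // Julian day number of 1970-01-01
  private val MicrosPerDay     = 86400L * 1000 * 1000

  // The kind of "convertFixedToLong" step item 3 alludes to: decode the 12-byte INT96
  // value (8-byte little-endian nanos-of-day followed by a 4-byte Julian day) into
  // microseconds since the Unix epoch, i.e. the long representation the Hoodie schema
  // expects for a timestamp column.
  def fixedToMicros(bytes: Array[Byte]): Long = {
    require(bytes.length == 12, s"INT96 timestamps are 12 bytes, got ${bytes.length}")
    val buf        = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanosOfDay = buf.getLong
    val julianDay  = buf.getInt
    (julianDay - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000
  }
}
```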

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

Added a new UT, and also verified with existing UTs.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

li36909 (Contributor, Author) commented Apr 8, 2021

cc @nsivabalan, could you help take a look? Thank you.

nsivabalan (Contributor)

@li36909: can you fix the links in the description? I guess it's cut off:

parquet.avro.readInt96AsFixed=true, please check https://github https://github/.com/apache/parquet-mr/pull/831/files)

nsivabalan (Contributor) commented Apr 8, 2021

@li36909: IIUC, this patch is not about failing a bootstrap or upsert w/ timestamp. We are adding support for timestamp columns by upgrading the parquet version. If yes, please fix the title of the patch.
Also, it looks like we are upgrading the parquet version in this patch. @vinothchandar @n3nash @bvaradar: thoughts? Are there any considerations required on this end?

@nsivabalan nsivabalan self-assigned this Apr 8, 2021
li36909 force-pushed the bootstrap_timestamp branch 2 times, most recently from 1832aa0 to 0996d9a, on April 12, 2021 03:29
li36909 (Contributor, Author) commented Apr 13, 2021

@nsivabalan thank you, I have changed the link. Let me check the UT failure first, then I will change the title.

li36909 (Contributor, Author) commented Apr 13, 2021

There are some UT failures caused by https://github.com/apache/parquet-mr/pull/747/files.
In that PR, parquet sets the requiredSchema first and then applies the filter; when we run count() through Spark's MOR relation, the requiredSchema is empty, so the filter result is empty as well.
I upgraded the parquet version for Spark too, and Spark has no problem there; I checked by hand and the reason is that Spark adds the filter attributes to the requiredSchema like this:
val requiredExpressions: Seq[NamedExpression] = filterAttributes.toSeq ++ projects
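
For illustration, a minimal sketch of the idea in the quoted Spark line; requiredColumns is a hypothetical helper, not the actual Spark or Hudi code:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, NamedExpression}

// Sketch: before deriving the parquet requiredSchema, fold the attributes referenced by
// pushed-down filters into the projected columns. A plain count() projects no columns,
// so without this step the read schema would be empty and the filter would match nothing.
def requiredColumns(projects: Seq[NamedExpression],
                    filters: Seq[Expression]): Seq[NamedExpression] = {
  val filterAttributes: Set[Attribute] = filters.flatMap(_.references).toSet
  (filterAttributes.toSeq ++ projects).distinct
}
```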

li36909 force-pushed the bootstrap_timestamp branch 4 times, most recently from b76b16e to afbe55f, on April 14, 2021 03:17
vinothchandar (Member)

@li36909 Thanks for your contribution! Queued up for review.

vinothchandar (Member)

@umehrot2 in case you have some time, please take a pass.

nsivabalan added the priority:major (degraded perf; unable to move forward; potential bugs) label on May 11, 2021
nsivabalan (Contributor) left a comment


Upgrading the parquet version is not trivial. We might have to consider EMR Spark compatibility, as we might run into issues.
@n3nash @vinothchandar @umehrot2: this patch is proposing to upgrade the parquet version. May I know what all we need to consider for such an upgrade?

vinothchandar (Member)

cc @n3nash, could you please chime in here?

codope (Member) left a comment


@li36909 Could you please rebase and resolve the conflicts?

nsivabalan (Contributor)

@li36909: it would also be good to put up a separate patch for upgrading the parquet version.

hudi-bot commented Nov 5, 2021

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

nsivabalan (Contributor)

@xushiyan @alexeykudinkin: another patch that's awaiting the parquet upgrade.

bvaradar (Contributor) commented Mar 8, 2023

With Spark 3, parquet version 1.12.2 is being used. Is this issue still happening with that setup?

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Feb 26, 2024
yihua (Contributor) commented Sep 10, 2024

@jonvex could you check if this is still a problem?

@@ -96,7 +96,7 @@ protected Iterator<GenericRecord> getMergingIterator(HoodieTable<T, I, K, O> tab
   Configuration bootstrapFileConfig = new Configuration(table.getHadoopConf());
   HoodieFileReader<GenericRecord> bootstrapReader = HoodieFileReaderFactory.<GenericRecord>getFileReader(bootstrapFileConfig, externalFilePath);
   Schema bootstrapReadSchema;
-  if (externalSchemaTransformation) {
+  if (externalSchemaTransformation || baseFile.getBootstrapBaseFile().isPresent()) {
Contributor (inline review comment):

We should check if this is already fixed.

Labels: priority:major (degraded perf; unable to move forward; potential bugs), schema-and-data-types, size:M (PR with lines of changes in (100, 300])
Projects:
  Status: 🚧 Needs Repro
  Status: 🏗 Under discussion
8 participants