Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-5989] Fix date conversion issue when performing partition pruning on Spark #8298

Merged
merged 3 commits into from
Apr 10, 2023

Conversation

voonhous
Copy link
Member

@voonhous voonhous commented Mar 27, 2023

Change Logs

When lazy fetching partition path & file slice for HoodieFileIndex is used, date cannot be converted to the correct string representation.

This is the case as Spark store dates as an integer value representing the number of days that has past since 1970-01-01.

When rebuilding the partition path, this FS path could be rebuilt wrongly causing a partition to be empty, and hence, the query result to be empty/incorrect.

INFO DataSourceStrategy: Pruning directories with: isnotnull(country#80), isnotnull(par_date#81),(country#80 = ID),(par_date#81=19415)

...

INFO AbstractTableFileSystemView: Building file system view for partition (country=ID/par_date=19415) 

Impact

None

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Co-authored-by: Rex An <bonean131@gmail.com>
@voonhous
Copy link
Member Author

@boneanxs for visibility.

Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@voonhous Can you also attach the file system view log, how the partition looks like with this fix?

Comment on lines 123 to 127
// TLDR:
// execution order of [B] = pass
// execution order of [B, A] = pass
// execution order of [A] = fail
// execution order of [A, B] = fail
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this comment and add it inline where each part or the combination of different execution orders is being tested.

Copy link
Member Author

@voonhous voonhous Mar 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, will remove these as these are just comments that were used to aid debugging.

Leaving these in the final PR will only confuse any other engineers working on this component.

I will also simplify the tests to only test the paths that will trigger the error.

@codope codope added priority:major degraded perf; unable to move forward; potential bugs spark Issues related to spark index labels Mar 27, 2023
@voonhous
Copy link
Member Author

@voonhous Can you also attach the file system view log, how the partition looks like with this fix?

After the fix, it should look like this:

INFO  org.apache.spark.sql.execution.datasources.DataSourceStrategy [] - Pruning directories with: isnotnull(date_par#166),isnotnull(country#165),(date_par#166 = 19417),(country#165 = ID)

...

org.apache.hudi.common.table.view.AbstractTableFileSystemView [] - Building file system view for partition (country=ID/date_par=2023-02-28)

Co-authored-by: Rex An <bonean131@gmail.com>
Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@voonhous Can you check if #8280 also fixes this issue?

@voonhous
Copy link
Member Author

@voonhous Can you check if #8280 also fixes this issue?

Nope, these are 2 separate issues. #8280 is a schema evolution issue, while this PR fixes a partition pruning issue.

Copy link
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have already verified that the fix could work well and thanks for @voonhous fix.
cc @codope

@voonhous voonhous changed the title [HUDI-5989] Fix date conversion issue [HUDI-5989] Fix date conversion issue when performing partition pruning on Spark Mar 31, 2023
@voonhous
Copy link
Member Author

@codope CI is green, can you please help to review this, thank you!

Copy link
Contributor

@danny0405 danny0405 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, can we add some test for custom timezone because the DataFormat is related.

@voonhous
Copy link
Member Author

Overall looks good, can we add some test for custom timezone because the DataFormat is related.

Done, I am not sure if what i am doing is correct. CMIIW, changing the timezone should have no impact on the date conversion here.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous
Copy link
Member Author

voonhous commented Apr 7, 2023

@danny0405 CI is green. Can we merge this in?

Another PR #8402 has been raised that fixes another issue around this code. Can we merge this in first so that there are no merge conflicts moving forward?

@danny0405 danny0405 merged commit d6ff3d6 into apache:master Apr 10, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023
…ng on Spark (apache#8298)

When lazy fetching partition path & file slice for HoodieFileIndex is used, date cannot be converted to the correct string representation.

This is the case as Spark store dates as an integer value representing the number of days that has past since 1970-01-01.

When rebuilding the partition path, this FS path could be rebuilt wrongly causing a partition to be empty, and hence, the query result to be empty/incorrect.

Co-authored-by: Rex An <bonean131@gmail.com>
@voonhous voonhous deleted the HUDI-5989 branch April 28, 2023 07:34
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
…ng on Spark (apache#8298)

When lazy fetching partition path & file slice for HoodieFileIndex is used, date cannot be converted to the correct string representation.

This is the case as Spark store dates as an integer value representing the number of days that has past since 1970-01-01.

When rebuilding the partition path, this FS path could be rebuilt wrongly causing a partition to be empty, and hence, the query result to be empty/incorrect.

Co-authored-by: Rex An <bonean131@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
index priority:major degraded perf; unable to move forward; potential bugs schema-and-data-types spark Issues related to spark
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants