[HUDI-5989] Fix date conversion issue when performing partition pruning on Spark #8298

voonhous · 2023-03-27T05:35:46Z

Change Logs

When lazy fetching of partition paths and file slices is enabled for HoodieFileIndex, date partition values are not converted to the correct string representation.

This happens because Spark stores dates as an integer representing the number of days that have passed since 1970-01-01.

When the partition path is rebuilt from that raw integer, the resulting FS path can be wrong, so the partition appears empty and the query result is empty or incorrect.

INFO DataSourceStrategy: Pruning directories with: isnotnull(country#80), isnotnull(par_date#81),(country#80 = ID),(par_date#81=19415)

...

INFO AbstractTableFileSystemView: Building file system view for partition (country=ID/par_date=19415)
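For illustration only (this is not the code changed in this PR): a minimal sketch of the conversion that the rebuilt path needs, assuming hive-style partition paths as in the log above. The class and method names are hypothetical, and `java.time.LocalDate` is used here simply because it makes the epoch-day arithmetic explicit.

```java
import java.time.LocalDate;

// Hypothetical helper, not part of Hudi: shows how an epoch-day integer
// maps to the date string expected in the partition path.
class EpochDayPartitionPath {

    // Spark's internal DateType value is the number of days since 1970-01-01.
    // LocalDate.toString() renders it in ISO-8601 form (yyyy-MM-dd).
    static String toDateString(int epochDays) {
        return LocalDate.ofEpochDay(epochDays).toString();
    }

    // Builds a hive-style partition path from the two partition columns
    // seen in the log above (column names taken from that log).
    static String partitionPath(String country, int epochDays) {
        return "country=" + country + "/par_date=" + toDateString(epochDays);
    }

    public static void main(String[] args) {
        // Raw value from the pruning filter in the log above.
        System.out.println(partitionPath("ID", 19415));
        // prints "country=ID/par_date=2023-02-27"
    }
}
```

Using the raw integer in place of `toDateString(...)` yields `country=ID/par_date=19415`, which is the path shown in the log above.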

Impact

None

Risk level (write none, low medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

The config description must be updated if new configs are added or the default value of the configs are changed.
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

Co-authored-by: Rex An <bonean131@gmail.com>

voonhous · 2023-03-27T05:38:40Z

@boneanxs for visibility.

codope

@voonhous Can you also attach the file system view log showing what the partition looks like with this fix?

...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala

codope · 2023-03-27T05:49:00Z

...urce/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestLazyPartitionPathFetching.scala

+      // TLDR:
+      // execution order of [B] = pass
+      // execution order of [B, A] = pass
+      // execution order of [A] = fail
+      // execution order of [A, B] = fail


Remove this comment and add it inline where each part, or each combination of execution orders, is being tested.

Sorry, I will remove these; they are just comments that were used to aid debugging.

Leaving these in the final PR will only confuse other engineers working on this component.

I will also simplify the tests to only test the paths that will trigger the error.

voonhous · 2023-03-27T06:33:27Z

> @voonhous Can you also attach the file system view log showing what the partition looks like with this fix?

After the fix, it should look like this:

INFO  org.apache.spark.sql.execution.datasources.DataSourceStrategy [] - Pruning directories with: isnotnull(date_par#166),isnotnull(country#165),(date_par#166 = 19417),(country#165 = ID)

...

org.apache.hudi.common.table.view.AbstractTableFileSystemView [] - Building file system view for partition (country=ID/date_par=2023-02-28)

Co-authored-by: Rex An <bonean131@gmail.com>

codope

@voonhous Can you check if #8280 also fixes this issue?

voonhous · 2023-03-27T14:20:40Z

> @voonhous Can you check if #8280 also fixes this issue?

Nope, these are 2 separate issues. #8280 is a schema evolution issue, while this PR fixes a partition pruning issue.

SteNicholas

LGTM. I have verified that the fix works well. Thanks to @voonhous for the fix.
cc @codope

voonhous · 2023-03-31T00:26:02Z

@codope CI is green. Could you please help review this? Thank you!

danny0405

Overall looks good. Can we add a test for a custom timezone, since the DateFormat is involved?

voonhous · 2023-03-31T11:14:03Z

> Overall looks good. Can we add a test for a custom timezone, since the DateFormat is involved?

Done, though I am not sure whether what I am doing is correct. Correct me if I'm wrong, but changing the timezone should have no impact on the date conversion here.
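A minimal sketch of why that should hold, under the assumption that the epoch-day value is formatted through `java.time.LocalDate`. This is an assumption for illustration; it does not show how the fix or Spark actually formats the value, and the class and method names are hypothetical. `LocalDate` carries no time zone, so the JVM default zone does not enter the conversion.

```java
import java.time.LocalDate;
import java.util.TimeZone;

// Hypothetical check, not part of Hudi or of the test added in this PR.
class EpochDayTimeZoneCheck {

    // Formats an epoch-day value while the JVM default time zone is
    // temporarily set to zoneId, then restores the previous default.
    static String formatUnder(String zoneId, int epochDays) {
        TimeZone original = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            // LocalDate.ofEpochDay counts whole days from 1970-01-01 and
            // never consults the default zone.
            return LocalDate.ofEpochDay(epochDays).toString();
        } finally {
            TimeZone.setDefault(original);
        }
    }

    public static void main(String[] args) {
        String jakarta = formatUnder("Asia/Jakarta", 19415);
        String losAngeles = formatUnder("America/Los_Angeles", 19415);
        // Both calls return the same string regardless of the zone.
        System.out.println(jakarta.equals(losAngeles)); // prints "true"
    }
}
```

A conversion that goes through a timestamp or an instant instead (for example, midnight in some zone) could shift by a day, which is the kind of case a custom-timezone test would guard against.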

hudi-bot · 2023-03-31T17:34:30Z

CI report:

ea7352f Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

voonhous · 2023-04-07T09:18:58Z

@danny0405 CI is green. Can we merge this in?

Another PR #8402 has been raised that fixes another issue around this code. Can we merge this in first so that there are no merge conflicts moving forward?


[HUDI-5989] Fix date conversion issue

3ee99c4

Co-authored-by: Rex An <bonean131@gmail.com>

codope reviewed Mar 27, 2023


codope added the priority:major, spark, and index labels Mar 27, 2023

Update tests to remove comments used for debugging

87898cd

Co-authored-by: Rex An <bonean131@gmail.com>

codope reviewed Mar 27, 2023


codope added the schema-and-data-types label Mar 27, 2023

SteNicholas approved these changes Mar 28, 2023


voonhous changed the title ~~[HUDI-5989] Fix date conversion issue~~ [HUDI-5989] Fix date conversion issue when performing partition pruning on Spark Mar 31, 2023

danny0405 reviewed Mar 31, 2023


Added test with timezone custom timezone

ea7352f

danny0405 approved these changes Apr 1, 2023


danny0405 merged commit d6ff3d6 into apache:master Apr 10, 2023

voonhous deleted the HUDI-5989 branch April 28, 2023 07:34
