
[HUDI-6729] Fix get partition values from path for non-string type partition column#9484

Merged
danny0405 merged 4 commits into apache:master from wecharyu:HUDI-6729 on Aug 23, 2023
Conversation

@wecharyu
Contributor

Change Logs

When hoodie.datasource.read.extract.partition.values.from.path is enabled to get partition values from the path instead of the data file, an exception is thrown if a partition column is not of string type.

This patch fixes the issue by casting the partition value string to the target data type, following Spark's approach.

Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:103)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt(rows.scala:41)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getInt$(rows.scala:41)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getInt(rows.scala:195)
    at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:97)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:245)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:264)
    at org.apache.spark.sql.execution.datasources.parquet.Spark32LegacyHoodieParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(Spark32LegacyHoodieParquetFileFormat.scala:314)
    at org.apache.hudi.HoodieDataSourceHelper$.$anonfun$buildHoodieParquetReader$1(HoodieDataSourceHelper.scala:67)
    at org.apache.hudi.HoodieBaseRelation.$anonfun$createBaseFileReader$2(HoodieBaseRelation.scala:602)
    at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680)
    at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$1(HoodieBaseRelation.scala:706)
    at org.apache.hudi.HoodieBaseRelation$.$anonfun$projectReader$2(HoodieBaseRelation.scala:711)
    at org.apache.hudi.HoodieBaseRelation$BaseFileReader.apply(HoodieBaseRelation.scala:680)
    at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
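The cast-based fix can be illustrated with a small, self-contained sketch. This is a toy model with assumed names, not the actual Hudi patch: the real code casts to Spark DataTypes, but the idea is the same — convert the raw path string to the column's declared type before building the row, so an int partition column yields an Int rather than a UTF8String.

```scala
// Toy sketch (assumed names, not Hudi/Spark code): cast raw partition-path
// strings to each column's declared type before assembling the row.
sealed trait ColType
case object IntCol extends ColType
case object LongCol extends ColType
case object StringCol extends ColType

def castPartitionValue(raw: String, colType: ColType): Any = colType match {
  case IntCol    => raw.toInt   // "2023" becomes 2023: Int, avoiding the ClassCastException
  case LongCol   => raw.toLong
  case StringCol => raw
}

def partitionRow(rawValues: Seq[String], types: Seq[ColType]): Seq[Any] =
  rawValues.zip(types).map { case (v, t) => castPartitionValue(v, t) }
```

With this shape, a path like `year=2023/region=us` produces `Seq(2023, "us")` for an (int, string) partition schema instead of two strings.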

Impact

No

Risk level (write none, low, medium or high below)

None

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

}
}
val timeZoneId = conf.get("timeZone", sparkSession.sessionState.conf.sessionLocalTimeZone)
val rowValues = HoodieSparkUtils.parsePartitionColumnValues(
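For context on the timeZoneId read above: casting partition strings to date or timestamp types is time-zone sensitive, which is why the session time zone is threaded into the parsing call. A minimal, non-Hudi illustration of the date case (Spark models DateType as days since the Unix epoch; names here are illustrative):

```scala
import java.time.LocalDate

// Illustration only (not Hudi code): a DateType partition value parsed from a
// path segment must be converted to days since the Unix epoch. Timestamp
// columns additionally need the session time zone to resolve the instant.
def parseDatePartition(raw: String): Int =
  LocalDate.parse(raw).toEpochDay.toInt
```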
Contributor

HoodieSparkUtils.parsePartitionColumnValues may return an empty result if it can't parse the partition values; we should add an assertion here to ensure the number of values equals the number of partition columns.

Contributor Author

Addressed!
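The reviewer's suggested assertion could look like the following hedged sketch (assumed names, not the merged code): fail fast when parsing yields fewer values than there are partition columns, instead of silently producing an incomplete row.

```scala
// Sketch of the suggested guard (illustrative names): reject a parsed result
// whose size does not match the partition schema.
def checkedPartitionValues(values: Seq[Any], partitionColumns: Seq[String]): Seq[Any] = {
  require(values.length == partitionColumns.length,
    s"Parsed ${values.length} partition values from path, expected ${partitionColumns.length}")
  values
}
```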

@danny0405 danny0405 self-assigned this Aug 22, 2023
@danny0405 danny0405 added labels engine:spark (Spark integration) and area:schema (Schema evolution and data types) on Aug 22, 2023
…ception in HoodieBaseRelation#getPartitionColumnsAsInternalRowInternal
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405 danny0405 merged commit d1f83de into apache:master Aug 23, 2023
prashantwason pushed a commit that referenced this pull request Sep 1, 2023
…rtition column (#9484)

* reuse HoodieSparkUtils#parsePartitionColumnValues to support multi spark versions
* assert parsed partition values from path
* throw exception instead of return empty InternalRow when encounter exception in HoodieBaseRelation#getPartitionColumnsAsInternalRowInternal
leosanqing pushed a commit to leosanqing/hudi that referenced this pull request Sep 13, 2023
…rtition column (apache#9484)

* reuse HoodieSparkUtils#parsePartitionColumnValues to support multi spark versions
* assert parsed partition values from path
* throw exception instead of return empty InternalRow when encounter exception in HoodieBaseRelation#getPartitionColumnsAsInternalRowInternal
TheR1sing3un pushed a commit to TheR1sing3un/hudi that referenced this pull request Feb 12, 2025
…rtition column (apache#9484)

* reuse HoodieSparkUtils#parsePartitionColumnValues to support multi spark versions
* assert parsed partition values from path
* throw exception instead of return empty InternalRow when encounter exception in HoodieBaseRelation#getPartitionColumnsAsInternalRowInternal

Labels

area:schema Schema evolution and data types engine:spark Spark integration

Projects

Status: ✅ Done

4 participants