
[HUDI-8552] Adding support to read input for clustering based on new FG reader #12426

Merged
codope merged 2 commits into apache:master from nsivabalan:fileGroupReaderBasedClusteringInput
Dec 5, 2024
Conversation

@nsivabalan
Contributor

Change Logs

  • Adding support to read input for clustering based on new FG reader
  • Fixed FileGroup reader to maintain a schema cache and avoid storing entire schema for every record in the spillable map.
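The second bullet can be sketched as a schema-interning cache: each distinct schema is registered once and records in the spillable map carry only a small integer id, rather than a full copy of the schema. This is a hypothetical illustration of the idea, not Hudi's actual `FileGroupReader` API; `SchemaIdCache` and its method names are invented for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: intern each distinct schema once and hand out a
// stable integer id, so per-record state stores an id instead of a schema.
public class SchemaIdCache {
    private final Map<String, Integer> schemaToId = new HashMap<>();
    private final Map<Integer, String> idToSchema = new HashMap<>();

    // Returns a stable id for the schema, registering it on first sight.
    public synchronized int idFor(String schemaJson) {
        return schemaToId.computeIfAbsent(schemaJson, s -> {
            int id = idToSchema.size();
            idToSchema.put(id, s);
            return id;
        });
    }

    // Resolves an id back to the full schema when the record is read back.
    public synchronized String schemaFor(int id) {
        return idToSchema.get(id);
    }
}
```

With many records sharing a handful of schemas, the map then holds one `int` per record instead of one schema string per record.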

Impact

Standardizes file slice reading across the different code paths (including table services)

Risk level (write none, low, medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of existing configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 5, 2024
// Broadcast the SQL conf and a copy of the Hadoop configuration.
sqlConfBroadcast = jsc.broadcast(sqlConf);
// Copying via new Configuration() is critical so that we don't run into ConcurrentModificationException
configurationBroadcast = jsc.broadcast(new SerializableConfiguration(new Configuration(jsc.hadoopConfiguration())));
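The comment above points at a general pitfall: Hadoop's `Configuration` is mutable, so broadcasting a live, shared instance can race with its serialization and throw `ConcurrentModificationException`. The defensive-copy idea can be shown with a minimal stand-in sketch using `java.util.Properties` in place of `Configuration` (the class and method names here are illustrative, not from the PR):

```java
import java.util.Properties;

public class ConfigSnapshot {
    // Takes a point-in-time copy of a live, mutable config so that later
    // mutations of the original cannot race with serialization of the copy.
    public static Properties snapshot(Properties live) {
        Properties copy = new Properties();
        copy.putAll(live); // copy all entries; the original stays untouched
        return copy;
    }
}
```

In the PR's setting, `new Configuration(jsc.hadoopConfiguration())` plays the role of `snapshot`: the broadcast serializes the private copy, not the driver's shared instance.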
Contributor

Can we abstract some utility method for such config fixes?

  • Cosmetic changes for MultipleSparkJobExecutionStrategy
  • Fixing spark parquet reader SQLConf parsing issue
  • Fixing spillable map to contain schema id instead of schema
  • Add schema encode/decode for record metadata schema
  • Renaming INTERNAL_META_SCHEMA to INTERNAL_META_SCHEMA_ID
  • Fixing compilation issues after rebase
  • Fixing schema handling with delete records
  • Fixing failing test
  • Fixing sql config parsing issue
@codope codope force-pushed the fileGroupReaderBasedClusteringInput branch from 4a18b88 to 1a2b767 Compare December 5, 2024 09:42
@hudi-bot
Collaborator

hudi-bot commented Dec 5, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope
Member

codope commented Dec 5, 2024

Landing this. The GitHub Action failed due to

Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_PARQUET_NANOS_AS_LONG()Lorg/apache/spark/internal/config/ConfigEntry;
  at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:73)
  at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.readBaseFile(HoodieFileGroupReaderBasedParquetFileFormat.scala:286)
  at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:204)

This only happens for Spark 3.3.1, while 3.3.2+ is passing. We will update the README with the supported version.
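The `NoSuchMethodError` above is a binary-compatibility failure: `LEGACY_PARQUET_NANOS_AS_LONG` resolves at compile time against newer Spark but is absent from the Spark 3.3.1 jars at runtime. One way to detect such a gap up front, rather than at an inconvenient call site, is a reflective probe. This is only a sketch of the technique, not code from the PR, and the names are illustrative:

```java
public class MethodProbe {
    // Returns true if the named zero-arg method exists on the class at
    // runtime, without triggering NoSuchMethodError / ClassNotFoundException.
    public static boolean hasMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }
}
```

A caller could probe, e.g., `"org.apache.spark.sql.internal.SQLConf$"` for the accessor and fall back to a default when the running Spark version lacks it.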

