
[HUDI-8552] Adding support to read input for clustering based on new FG reader #12426

Merged
codope merged 2 commits into apache:master from nsivabalan:fileGroupReaderBasedClusteringInput
Dec 5, 2024
Conversation

@nsivabalan
Contributor

Change Logs

  • Adding support to read input for clustering based on new FG reader
  • Fixed FileGroup reader to maintain a schema cache and avoid storing entire schema for every record in the spillable map.
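The second bullet can be sketched as a schema-interning cache: each distinct schema is registered once and records in the spillable map carry only a small integer id, rather than a full copy of the schema. This is a hypothetical illustration of the idea, not Hudi's actual `FileGroupReader` API; `SchemaIdCache` and its method names are invented for the sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: intern each distinct schema once and hand out a
// stable integer id, so per-record state stores an id instead of a schema.
public class SchemaIdCache {
    private final Map<String, Integer> schemaToId = new HashMap<>();
    private final Map<Integer, String> idToSchema = new HashMap<>();

    // Returns a stable id for the schema, registering it on first sight.
    public synchronized int idFor(String schemaJson) {
        return schemaToId.computeIfAbsent(schemaJson, s -> {
            int id = idToSchema.size();
            idToSchema.put(id, s);
            return id;
        });
    }

    // Resolves an id back to the full schema when the record is read back.
    public synchronized String schemaFor(int id) {
        return idToSchema.get(id);
    }
}
```

With many records sharing a handful of schemas, the map then holds one `int` per record instead of one schema string per record.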

Impact

Standardizes file slice reading across the different code paths (including table services)

Risk level (write none, low, medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default values of existing configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 5, 2024
// Broadcast the SQL conf and a copy of the Hadoop configuration.
sqlConfBroadcast = jsc.broadcast(sqlConf);
// Copying via new Configuration() is critical so that we don't run into ConcurrentModificationException
configurationBroadcast = jsc.broadcast(new SerializableConfiguration(new Configuration(jsc.hadoopConfiguration())));
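The comment above points at a general pitfall: Hadoop's `Configuration` is mutable, so broadcasting a live, shared instance can race with its serialization and throw `ConcurrentModificationException`. The defensive-copy idea can be shown with a minimal stand-in sketch using `java.util.Properties` in place of `Configuration` (the class and method names here are illustrative, not from the PR):

```java
import java.util.Properties;

public class ConfigSnapshot {
    // Takes a point-in-time copy of a live, mutable config so that later
    // mutations of the original cannot race with serialization of the copy.
    public static Properties snapshot(Properties live) {
        Properties copy = new Properties();
        copy.putAll(live); // copy all entries; the original stays untouched
        return copy;
    }
}
```

In the PR's setting, `new Configuration(jsc.hadoopConfiguration())` plays the role of `snapshot`: the broadcast serializes the private copy, not the driver's shared instance.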
Contributor

Can we abstract some utility method for such config fixes?

  • Cosmetic changes for MultipleSparkJobExecutionStrategy
  • Fixing spark parquet reader SQLConf parsing issue
  • Fixing spillable map to contain schema id instead of schema
  • Add schema encode/decode for record metadata schema
  • Renaming INTERNAL_META_SCHEMA to INTERNAL_META_SCHEMA_ID
  • Fixing compilation issues after rebase
  • Fixing schema handling with delete records
  • Fixing failing test
  • Fixing sql config parsing issue
@codope codope force-pushed the fileGroupReaderBasedClusteringInput branch from 4a18b88 to 1a2b767 Compare December 5, 2024 09:42
@hudi-bot
Collaborator

hudi-bot commented Dec 5, 2024

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope
Member

codope commented Dec 5, 2024

Landing this. The GitHub Action failed due to

Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf$.LEGACY_PARQUET_NANOS_AS_LONG()Lorg/apache/spark/internal/config/ConfigEntry;
  at org.apache.spark.sql.execution.datasources.parquet.SparkParquetReaderBase.read(SparkParquetReaderBase.scala:73)
  at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.readBaseFile(HoodieFileGroupReaderBasedParquetFileFormat.scala:286)
  at org.apache.spark.sql.execution.datasources.parquet.HoodieFileGroupReaderBasedParquetFileFormat.$anonfun$buildReaderWithPartitionValues$3(HoodieFileGroupReaderBasedParquetFileFormat.scala:204)

This only happens for Spark 3.3.1, while 3.3.2+ is passing. We will update the README with the supported version.
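The `NoSuchMethodError` above is a binary-compatibility failure: `LEGACY_PARQUET_NANOS_AS_LONG` resolves at compile time against newer Spark but is absent from the Spark 3.3.1 jars at runtime. One way to detect such a gap up front, rather than at an inconvenient call site, is a reflective probe. This is only a sketch of the technique, not code from the PR, and the names are illustrative:

```java
public class MethodProbe {
    // Returns true if the named zero-arg method exists on the class at
    // runtime, without triggering NoSuchMethodError / ClassNotFoundException.
    public static boolean hasMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }
}
```

A caller could probe, e.g., `"org.apache.spark.sql.internal.SQLConf$"` for the accessor and fall back to a default when the running Spark version lacks it.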

