[HUDI-5987] Fix clustering on bootstrap table with row writer disabled #8342

Merged
yihua merged 4 commits into apache:master from codope:fix-clustering-non-row-writer
Apr 20, 2023
Conversation

@codope (Member) commented Mar 31, 2023

Change Logs

Clustering on a bootstrap table (METADATA_ONLY bootstrap mode) with row writer disabled did not show correct results. Only meta-fields were populated, while data columns were null. This PR fixes the bug. It adds a separate HoodieBootstrapFileReader that stitches the meta columns with the data columns.
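The stitching idea can be illustrated with a minimal plain-Java sketch. This is not the actual Hudi class: in a METADATA_ONLY bootstrap, the skeleton file holds only the meta columns and the original source file holds the data columns, so the reader must conceptually zip the two row by row; the maps and names below are hypothetical stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the bootstrap stitching concept.
// A "skeleton" record carries only Hudi meta columns; its positional
// counterpart in the source data file carries the data columns.
public class BootstrapStitchSketch {

    // Merge one skeleton record with its corresponding data-file record
    // into a single full record (meta fields first, then data fields).
    static Map<String, Object> stitch(Map<String, Object> skeletonRecord,
                                      Map<String, Object> dataRecord) {
        Map<String, Object> merged = new LinkedHashMap<>(skeletonRecord);
        merged.putAll(dataRecord);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> skeleton = new LinkedHashMap<>();
        skeleton.put("_hoodie_record_key", "trip_25");
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("rider", "rider_25");
        // The merged record has both meta and data columns populated.
        System.out.println(stitch(skeleton, data));
    }
}
```

Before this fix, the meta-column half of this merge was returned alone, which is why the data columns showed up as null below.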

Before this fix, a snapshot query after clustering on the bootstrap table returned null for all data columns:

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+--------+--------------+-----+------+---------+---------+-------+-------+----+-----------+------------------+-------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |timestamp|_row_key|partition_path|rider|driver|begin_lat|begin_lon|end_lat|end_lon|fare|tip_history|_hoodie_is_deleted|datestr|
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+--------+--------------+-----+------+---------+---------+-------+-------+----+-----------+------------------+-------+
|00000000000001     |00000000000001_7_0  |trip_25           |datestr=2018          |57b0b650-582e-46be-97c1-24942abf9291-0_0-25-68_20230331123135604.parquet|null     |null    |null          |null |null  |null     |null     |null   |null   |null|null       |null              |null   |
|00000000000001     |00000000000001_7_1  |trip_26           |datestr=2018          |57b0b650-582e-46be-97c1-24942abf9291-0_0-25-68_20230331123135604.parquet|null     |null    |null          |null |null  |null     |null     |null   |null   |null|null       |null              |null   |
|00000000000001     |00000000000001_7_2  |trip_27           |datestr=2018          |57b0b650-582e-46be-97c1-24942abf9291-0_0-25-68_20230331123135604.parquet|null     |null    |null          |null |null  |null     |null     |null   |null   |null|null       |null              |null   |
|00000000000001     |00000000000001_7_3  |trip_28           |datestr=2018          |57b0b650-582e-46be-97c1-24942abf9291-0_0-25-68_20230331123135604.parquet|null     |null    |null          |null |null  |null     |null     |null   |null   |null|null       |null              |null   |
|00000000000001     |00000000000001_7_4  |trip_29           |datestr=2018          |57b0b650-582e-46be-97c1-24942abf9291-0_0-25-68_20230331123135604.parquet|null     |null    |null          |null |null  |null     |null     |null   |null   |null|null       |null              |null   |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+---------+--------+--------------+-----+------+---------+---------+-------+-------+----+-----------+------------------+-------+
only showing top 5 rows

After this fix:

+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+-------------+--------+--------------+--------+---------+------------------+-------------------+-------------------+-------------------+-------------------------+---------------------------+------------------+-------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |timestamp    |_row_key|partition_path|rider   |driver   |begin_lat         |begin_lon          |end_lat            |end_lon            |fare                     |tip_history                |_hoodie_is_deleted|datestr|
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+-------------+--------+--------------+--------+---------+------------------+-------------------+-------------------+-------------------+-------------------------+---------------------------+------------------+-------+
|00000000000001     |00000000000001_6_0  |trip_25           |datestr=2018          |d1ff344c-5f84-41a5-9005-118dca129b30-0_0-25-68_20230331123355809.parquet|1680265998688|trip_25 |1680265998688 |rider_25|driver_25|0.5533512237148628|0.5681057293830623 |0.07047592205107045|0.48492180271084273|[50.64985395541603, USD] |[[13.994074573153537, USD]]|false             |2018   |
|00000000000001     |00000000000001_6_1  |trip_26           |datestr=2018          |d1ff344c-5f84-41a5-9005-118dca129b30-0_0-25-68_20230331123355809.parquet|1680265998688|trip_26 |1680265998688 |rider_26|driver_26|0.5483446057519853|0.4787499068864025 |0.3438732794172973 |0.23687569228870997|[7.205379512650445, USD] |[[35.07044590983842, USD]] |false             |2018   |
|00000000000001     |00000000000001_6_2  |trip_27           |datestr=2018          |d1ff344c-5f84-41a5-9005-118dca129b30-0_0-25-68_20230331123355809.parquet|1680265998688|trip_27 |1680265998688 |rider_27|driver_27|0.537233960307539 |0.40689621626018646|0.9996340618547656 |0.5136491330606007 |[36.44767181591008, USD] |[[41.64897840565314, USD]] |false             |2018   |
|00000000000001     |00000000000001_6_3  |trip_28           |datestr=2018          |d1ff344c-5f84-41a5-9005-118dca129b30-0_0-25-68_20230331123355809.parquet|1680265998688|trip_28 |1680265998688 |rider_28|driver_28|0.1564015159732266|0.939421953589519  |0.5996707207519414 |0.6017343270411845 |[28.086841928072538, USD]|[[73.75301394035617, USD]] |false             |2018   |
|00000000000001     |00000000000001_6_4  |trip_29           |datestr=2018          |d1ff344c-5f84-41a5-9005-118dca129b30-0_0-25-68_20230331123355809.parquet|1680265998688|trip_29 |1680265998688 |rider_29|driver_29|0.8958509615328525|0.3775097084776675 |0.18696411628776888|0.2629518404213518 |[22.826042892060862, USD]|[[62.964192832973396, USD]]|false             |2018   |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+-------------+--------+--------------+--------+---------+------------------+-------------------+-------------------+-------------------+-------------------------+---------------------------+------------------+-------+
only showing top 5 rows

Impact

A bug fix for bootstrap tables.

Risk level: low

The HoodieBootstrapFileReader is used during clustering only when the base file has a bootstrap path.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change:

  • The config description must be updated if new configs are added or the default value of a config is changed.
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope (Member, Author) commented Mar 31, 2023

Rebased on top of #8289 with some cleanup and fixes. Tested for all Spark versions supported by Hudi.

@codope (Member, Author) commented Apr 4, 2023

(Screenshot: CI test results, 2023-04-04)

testHoodieClientBasicMultiWriterWithEarlyConflictDetection is flaky; it is being tracked in HUDI-5831.

Comment on lines 338 to 339
String timeZoneId = jsc.getConf().get("timeZone", SQLConf.get().sessionLocalTimeZone());
boolean shouldValidateColumns = jsc.getConf().getBoolean("spark.sql.sources.validatePartitionColumns", true);
Contributor:
Could this be moved to closure and use hadoopConf.get() to avoid additional variables passed?

Member Author:
Done.


@Override
public Schema getSchema() {
  return skeletonFileReader.getSchema();
}
Contributor:
How is this used? I assume this only contains meta fields.

Member Author:
Actually, this is not used because we enforce the reader schema at the call site. Still, I have changed it to return the merged schema for completeness.
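The "merged schema" being discussed can be sketched in plain Java (hypothetical field lists, not the Avro Schema API): the result is the skeleton file's meta fields followed by the data file's own fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of merging the skeleton (meta) schema with the
// data-file schema, as the bootstrap reader's getSchema() should return.
public class MergedSchemaSketch {

    // Meta fields come first, then the data fields, preserving order.
    static List<String> mergeFields(List<String> metaFields, List<String> dataFields) {
        List<String> merged = new ArrayList<>(metaFields);
        merged.addAll(dataFields);
        return merged;
    }

    public static void main(String[] args) {
        List<String> meta = Arrays.asList("_hoodie_commit_time", "_hoodie_record_key");
        List<String> data = Arrays.asList("timestamp", "rider", "driver");
        System.out.println(mergeFields(meta, data));
    }
}
```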

clusteringOpsPartition.forEachRemaining(clusteringOp -> {
  try {
    Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(writeConfig.getSchema()));
    HoodieFileReader baseFileReader = HoodieFileReaderFactory.getReaderFactory(recordType).getFileReader(hadoopConf.get(), new Path(clusteringOp.getDataFilePath()));
Contributor:
We should skip this for bootstrap file group.

Member Author:
Actually, the bootstrap file reader depends on both the skeleton and the data file readers, so this becomes the skeleton file reader in the conditional block.

Schema readerSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(writeConfig.getSchema()));
HoodieFileReader baseFileReader = HoodieFileReaderFactory.getReaderFactory(recordType).getFileReader(hadoopConf.get(), new Path(clusteringOp.getDataFilePath()));
// handle bootstrap path
if (StringUtils.nonEmpty(clusteringOp.getBootstrapFilePath()) && StringUtils.nonEmpty(bootstrapBasePath)) {
Contributor:
Do we need to provide the same fix for MOR table, in readRecordsForGroupWithLogs(jsc, clusteringOps, instantTime)? E.g., clustering is applied to a bootstrap file group with bootstrap data file, skeleton file, and log files.

Member Author:
Good catch! Refactored and added a test to cover this scenario.
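The reader-selection logic discussed in this thread can be sketched as follows. These are hypothetical stand-in types, not the real Hudi API: when a clustering operation carries a bootstrap file path, the plain base-file reader becomes the skeleton reader and is wrapped together with a reader for the bootstrap source file.

```java
// Hypothetical sketch of conditional bootstrap reader selection.
public class ReaderSelectionSketch {

    interface FileReader { String describe(); }

    static FileReader plainReader(String path) {
        return () -> "plain:" + path;
    }

    // Wraps skeleton + data readers, mirroring what a bootstrap
    // file reader does conceptually.
    static FileReader bootstrapReader(FileReader skeleton, FileReader data) {
        return () -> "bootstrap(" + skeleton.describe() + "," + data.describe() + ")";
    }

    static FileReader selectReader(String dataFilePath, String bootstrapFilePath) {
        FileReader baseFileReader = plainReader(dataFilePath);
        if (bootstrapFilePath != null && !bootstrapFilePath.isEmpty()) {
            // Base file has a bootstrap path: baseFileReader is the
            // skeleton reader, paired with the bootstrap data file reader.
            return bootstrapReader(baseFileReader, plainReader(bootstrapFilePath));
        }
        return baseFileReader;
    }

    public static void main(String[] args) {
        System.out.println(selectReader("skel.parquet", "source.parquet").describe());
        System.out.println(selectReader("base.parquet", null).describe());
    }
}
```

Non-bootstrap file groups take the plain path unchanged, which matches the low-risk assessment in the PR description.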

@hudi-bot (Collaborator):
CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment:
LGTM. This is a good fix!

@yihua yihua merged commit 0451731 into apache:master Apr 20, 2023
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 20, 2023

This commit fixes the bug where the clustering on a bootstrap table (METADATA_ONLY bootstrap mode) with row writer disabled did not show correct results. Only meta-fields were populated, while data columns were null.  The fix adds a separate HoodieBootstrapFileReader that stitches the meta columns with the data columns.
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023

Labels

area:table-service Table services

3 participants