[HUDI-6118] Some fixes to improve the MDT and record index code base.#9106

Merged
danny0405 merged 5 commits into apache:master from prashantwason:pw_testing_fixes
Jul 22, 2023

Conversation

@prashantwason (Member):

[HUDI-6118] Some fixes to improve the MDT and record index code base.

Change Logs

  1. Print the MDT partition name instead of the enum toString in logs
  2. Use fsView.loadAllPartitions()
  3. When publishing size metrics for the MDT, only consider partitions which have been initialized
  4. Fixed job status names
  5. Limited logs which were printing the entire list of partitions; this is very verbose for datasets with a large number of partitions
  6. Added a config to reduce the max parallelism of record index initialization
  7. Changed defaults for MDT write configs to reasonable values
  8. Added a config for MDT logBlock size; larger blocks are preferred to reduce lookup time
  9. Fixed the size metrics for the MDT; these metrics should be set instead of incremented
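Item 9 distinguishes two gauge-update modes. As a minimal sketch of why the difference matters (class and method names here are hypothetical, not Hudi's actual metrics API): a size metric must replace the previous value on each report, because accumulating it double-counts unchanged partitions.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of gauge semantics; all names are illustrative.
class GaugeRegistry {
  private final Map<String, Long> gauges = new HashMap<>();

  // Correct for size metrics: each report replaces the previous value.
  void setGauge(String name, long value) {
    gauges.put(name, value);
  }

  // Wrong for size metrics: repeated reports accumulate, so a stable
  // 10 MB partition reported twice shows up as 20 MB.
  void incGauge(String name, long delta) {
    gauges.merge(name, delta, Long::sum);
  }

  long get(String name) {
    return gauges.getOrDefault(name, 0L);
  }
}
```

Reporting the same 10 MB size twice yields 10 MB with set-style updates but 20 MB with increment-style updates, which is the bug the change log describes.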

Impact

Fixes issues for the recently committed RI and MDT changes

Risk level (write none, low, medium or high below)

Low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@prashantwason prashantwason requested a review from nsivabalan June 30, 2023 17:04
  public Map<String, String> stats() {
-   return metrics.map(m -> m.getStats(true, metadataMetaClient, this)).orElse(new HashMap<>());
+   Set<String> allMetadataPartitionPaths = Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
+   return metrics.map(m -> m.getStats(true, metadataMetaClient, this, allMetadataPartitionPaths)).orElse(new HashMap<>());
Contributor:
Do we need to fetch the enabled partitions here instead of all?

Contributor:
HoodieMetadataMetrics.getStats(boolean detailed, HoodieTableMetaClient metaClient, HoodieTableMetadata metadata)

reloads the timeline. Can we move the reload outside of the caller so that we don't reload for every MDT partition's stats?

Member Author:
Removed the reload of the timeline. It is not actually required, since this code is called right after commit, where the metaClient is reloaded anyway.

  try {
-   LOG.info("Building file system view for partitions " + partitionSet);
+   if (partitionSet.size() < 100) {
+     LOG.info("Building file system view for partitions: " + partitionSet);
Contributor:
Can we just switch to LOG.debug? Is the logging really useful?

Contributor:
Yes, maybe we should reconsider the frequency of logging here, e.g. log every 100 partitions or so. Not sure we gain much by logging this for every partition.

Member Author:
Converted to a debug log.
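The resolution above (log the full partition list only when it is small, and at debug level) can be sketched as follows; the threshold, class name, and use of `java.util.logging` are illustrative, not the exact Hudi code:

```java
import java.util.Set;
import java.util.logging.Logger;

// Sketch of bounding log verbosity by partition count, per the review outcome.
class PartitionLogging {
  private static final Logger LOG = Logger.getLogger(PartitionLogging.class.getName());
  private static final int MAX_PARTITIONS_TO_LOG = 100;

  static String describe(Set<String> partitionSet) {
    // Print the full list only when it is small; otherwise just the count.
    return partitionSet.size() < MAX_PARTITIONS_TO_LOG
        ? "Building file system view for partitions: " + partitionSet
        : "Building file system view for " + partitionSet.size() + " partitions";
  }

  static void log(Set<String> partitionSet) {
    LOG.fine(describe(partitionSet)); // debug-level, so large datasets stay quiet by default
  }
}
```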

// File groups in each partition are fixed at creation time and we do not want them to be split into multiple files
// ever. Hence, we use a very large basefile size in metadata table. The actual size of the HFiles created will
// eventually depend on the number of file groups selected for each partition (See estimateFileGroupCount function)
final long maxHFileSizeBytes = 10 * 1024 * 1024 * 1024L; // 10GB
Contributor:
Can we define them as static constants instead?

Member Author:
Done.
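The refactor agreed on above amounts to hoisting the magic number into a named static constant; the class and constant names below are illustrative, not the actual Hudi identifiers:

```java
// Sketch of the suggested refactor: a named static constant instead of an
// inline literal (names are hypothetical).
class MetadataWriteDefaults {
  // File groups in each MDT partition are fixed at creation time, so a very
  // large base file size prevents them from ever being split.
  static final long MAX_HFILE_SIZE_BYTES = 10 * 1024 * 1024 * 1024L; // 10GB
}
```

Note the trailing `L`: the first three factors multiply as ints, and the final multiplication by `1024L` promotes the result to long before it would overflow.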

// Keeping the log blocks as large as the log files themselves reduces the number of HFile blocks to be checked for
// presence of keys.
final long maxLogFileSizeBytes = writeConfig.getMetadataConfig().getMaxLogFileSize();
final long maxLogBlockSizeBytes = maxLogFileSizeBytes;
Contributor:
redundant?

Member Author:
Removed. Moved the comment to where it is used.

@danny0405 (Contributor) left a comment:
Thanks for the contribution @prashantwason. The cleaning strategy change for the MDT is huge; can you elaborate on the details here?


@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jul 4, 2023
@prashantwason (Member Author):

@danny0405 @nsivabalan I think the cleaning strategy change for the MDT is a bugfix because of the following:

  1. The initial commit on the MDT will create HFiles
  2. Rollbacks do not actually roll back the MDT; instead they add a -f1, -f2 deltacommit

If we use KEEP_LATEST_COMMITS, then a wrong setting here would probably keep only a single HFile, and that will limit rollback: we cannot roll back the MDT beyond the last HFile, as we would lose data.

@danny0405 (Contributor):

then a wrong setting here would probably keep only a single HFile

Can we add some validation logic in the metadata table write config builder to guard correctness? Keeping at least 2 versions for each file group will also double the storage for the metadata table.
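The validation danny0405 asks for could look roughly like this; the builder class, method, and retention bound below are all hypothetical and not the actual Hudi write-config API:

```java
// Hypothetical sketch of validating the MDT cleaner setting at build time;
// none of these names are the real Hudi write-config API.
class MetadataWriteConfigBuilder {
  private int cleanerCommitsRetained = 3;

  MetadataWriteConfigBuilder withCleanerCommitsRetained(int n) {
    this.cleanerCommitsRetained = n;
    return this;
  }

  void validate() {
    // Retaining too few commits risks cleaning away the HFiles needed to
    // roll the MDT back; fail fast instead of accepting an unsafe value.
    if (cleanerCommitsRetained < 2) {
      throw new IllegalArgumentException(
          "MDT cleaner must retain at least 2 commits, got " + cleanerCommitsRetained);
    }
  }
}
```

Failing fast in the builder keeps a misconfigured retention from silently limiting how far the MDT can be rolled back.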

@prashantwason (Member Author):

@hudi-bot run azure


@prashantwason (Member Author):

@danny0405 @nsivabalan I have reverted the change to the cleaning policy. PTAL again.

@nsivabalan (Contributor) left a comment:
@prashantwason: can we increase the value of DEFAULT_METADATA_CLEANER_COMMITS_RETAINED to 20? I understand it may not give full coverage, but at least it gives some buffer during restore.

prashantwason and others added 5 commits July 20, 2023 17:14
Renamed withMaxInitParallelism as it is only for RI
  public static final ConfigProperty<Integer> RECORD_INDEX_MAX_PARALLELISM = ConfigProperty
      .key(METADATA_PREFIX + ".max.init.parallelism")
      .defaultValue(100000)
      .sinceVersion("0.14.0")
Contributor:
Do we need a parallelism of 100000 ?

@danny0405 (Contributor) left a comment:
+1, except one confusion: why do we need a max init parallelism of 100000 for RLI?

@hudi-bot (Collaborator):

CI report:

@prashantwason (Member Author):

@danny0405 The max init values for the other indexes are too low (see HUDI-6553). Indexes are really useful for large datasets which have a large number of partitions and files. Assume a large dataset with 100K+ files. The default parallelism of the index initialization in code is around 200, which would take hours for the indexes to be built. With a large parallelism:

  1. The actual parallelism used is min(number_of_operations, 100,000)
  2. So for small datasets, the lower value is used
  3. For larger datasets, 100K is used

We routinely have datasets with over 1M files in them (as large as 6M files). I have tested with various parallelism values; it's not an exact science, but somewhere around 100,000 was where I got the fastest bootstrap of the indexes. Very large parallelism causes OOM and memory issues on Spark.

If you leave the default at 200, many people would report timeouts building indexes on larger tables.
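The effective-parallelism rule prashantwason describes is a simple clamp; as a sketch (class and method names hypothetical), the configured maximum only caps the parallelism, while small datasets use their own operation count:

```java
// Sketch of the effective-parallelism rule described above: the configured
// maximum (default 100000) only caps parallelism; small datasets use their
// own operation count. Names here are illustrative, not the Hudi code.
class RecordIndexInit {
  static final int DEFAULT_MAX_PARALLELISM = 100_000;

  static int effectiveParallelism(int numberOfOperations, int configuredMax) {
    return Math.min(numberOfOperations, configuredMax);
  }
}
```

So a 5,000-file dataset would initialize with parallelism 5,000, while a 6M-file dataset is capped at 100,000, which is why the large default does not hurt small tables.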

@danny0405 danny0405 merged commit 629349c into apache:master Jul 22, 2023