[HUDI-6118] Some fixes to improve the MDT and record index code base. #9106
danny0405 merged 5 commits into apache:master from
Conversation
```diff
 public Map<String, String> stats() {
-  return metrics.map(m -> m.getStats(true, metadataMetaClient, this)).orElse(new HashMap<>());
+  Set<String> allMetadataPartitionPaths = Arrays.stream(MetadataPartitionType.values()).map(MetadataPartitionType::getPartitionPath).collect(Collectors.toSet());
+  return metrics.map(m -> m.getStats(true, metadataMetaClient, this, allMetadataPartitionPaths)).orElse(new HashMap<>());
```
Do we need to fetch the enabled partitions here instead of all?
`HoodieMetadataMetrics.getStats(boolean detailed, HoodieTableMetaClient metaClient, HoodieTableMetadata metadata)` reloads the timeline.
Can we move the reload outside of the caller so that we don't reload for every MDT partition's stats?
Removed the reload of the timeline. It is actually not required, since this code is called right after commit, where the metaClient is reloaded anyway.
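A minimal sketch of the fix being discussed, with hypothetical names standing in for the real Hudi types: the timeline reload is hoisted out so it runs once, not once per MDT partition.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins for HoodieTableMetaClient and per-partition stats;
// not the real Hudi API.
class StatsSketch {
  interface MetaClient {
    void reloadActiveTimeline();
  }

  static Map<String, String> collectStats(MetaClient metaClient, Set<String> partitionPaths) {
    // Reload once, before iterating, instead of once per MDT partition.
    metaClient.reloadActiveTimeline();
    Map<String, String> stats = new HashMap<>();
    for (String partition : partitionPaths) {
      stats.put(partition + ".stats", "ok"); // placeholder for real per-partition stats
    }
    return stats;
  }
}
```

The reply above goes one step further: since the caller runs right after a commit that already reloaded the metaClient, even this single reload can be dropped.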
```diff
 try {
-  LOG.info("Building file system view for partitions " + partitionSet);
+  if (partitionSet.size() < 100) {
+    LOG.info("Building file system view for partitions: " + partitionSet);
```
Can we just switch to LOG.debug? Is the logging really useful?
Yes, maybe we should reconsider the frequency of logging here. For example, log once every 100 partitions or so; I'm not sure we gain much by logging this for every partition.
Converted to a debug log.
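A sketch of the logging pattern being converged on, using a hypothetical helper rather than the actual Hudi code: the full partition list is only rendered at debug level, while info-level output sticks to the count.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper illustrating the log-verbosity fix; not Hudi code.
class PartitionLogSketch {
  static String describe(Set<String> partitionSet, boolean debugEnabled) {
    if (debugEnabled) {
      // Full list only when debug logging is on.
      return "Building file system view for partitions: " + partitionSet;
    }
    // Otherwise just the count, which stays short for large tables.
    return "Building file system view for " + partitionSet.size() + " partitions";
  }
}
```

With a real logger this is the usual `if (LOG.isDebugEnabled())` guard, which also avoids building the large partition-set string when debug is off.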
```java
// File groups in each partition are fixed at creation time and we do not want them to be split into multiple files
// ever. Hence, we use a very large basefile size in metadata table. The actual size of the HFiles created will
// eventually depend on the number of file groups selected for each partition (See estimateFileGroupCount function)
final long maxHFileSizeBytes = 10 * 1024 * 1024 * 1024L; // 10GB
```
Can we define them as static constants instead?
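The suggested change amounts to lifting the literal into a named constant; a sketch with hypothetical class and field names. One subtlety worth a comment: without a `long` operand early enough, `10 * 1024 * 1024 * 1024` would overflow `int`, so writing the `L` on the first factor keeps every intermediate product in `long`.

```java
// Hypothetical constant holder; names are illustrative, not Hudi's.
class MetadataWriteDefaults {
  // 10L first so all intermediate multiplications are done in long arithmetic
  // (an all-int product of these factors would overflow Integer.MAX_VALUE).
  static final long MAX_HFILE_SIZE_BYTES = 10L * 1024 * 1024 * 1024; // 10GB
}
```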
```java
// Keeping the log blocks as large as the log files themselves reduces the number of HFile blocks to be checked for
// presence of keys.
final long maxLogFileSizeBytes = writeConfig.getMetadataConfig().getMaxLogFileSize();
final long maxLogBlockSizeBytes = maxLogFileSizeBytes;
```
Removed. Moved the comment to where it is used.
...ient/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
danny0405 left a comment
Thanks for the contribution @prashantwason. The cleaning strategy change for MDT is significant; can you elaborate on the details here?
...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java
...ient/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
...ient/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
Force-pushed eb56e1b to 2c07b3e
@danny0405 @nsivabalan I think the cleaning strategy change for MDT is a bugfix because of the following:
If we use KEEP_LATEST_COMMITS, then a wrong setting here would probably keep only a single HFile, and that would limit rollback: we cannot roll back the MDT beyond the last HFile, as we would lose data.
Can we add some validation logic in the metadata table write config builder to guard correctness? Keeping at least 2 versions for each file group will also double the storage for the metadata table.
@hudi-bot run azure
@danny0405 @nsivabalan I have reverted the change to the cleaning policy. PTAL again.
nsivabalan left a comment
@prashantwason: can we increase the value of DEFAULT_METADATA_CLEANER_COMMITS_RETAINED to 20?
I understand it may not give full coverage, but it at least provides some buffer during restore.
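For illustration, overriding the retained-commits setting via writer properties might look like the sketch below. The key string is an assumption modeled on Hudi's `hoodie.metadata.` config prefix; check `HoodieMetadataConfig` in your release for the exact name.

```java
import java.util.Properties;

// Sketch only: the property key below is assumed, not verified against a
// specific Hudi version.
class MetadataCleanerConfigSketch {
  static Properties buildProps() {
    Properties props = new Properties();
    // Retaining more commits leaves a buffer of older MDT file-slice versions
    // that a restore can still fall back to.
    props.setProperty("hoodie.metadata.cleaner.commits.retained", "20");
    return props;
  }
}
```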
1. Print the MDT partition name instead of the enum toString in logs.
2. Use fsView.loadAllPartitions().
3. When publishing size metrics for the MDT, only consider partitions which have been initialized.
4. Fixed job status names.
5. Limited logs which were printing the entire list of partitions; this is very verbose for datasets with a large number of partitions.
6. Added a config to reduce the max parallelism of record index initialization.
7. Changed defaults for MDT write configs to reasonable values.
8. Added a config for MDT log block size; larger blocks are preferred to reduce lookup time.
9. Fixed the size metrics for the MDT; these metrics should be set instead of incremented.
Renamed withMaxInitParallelism as it is only for RI
Force-pushed d5c62b9 to ee2ba88
```java
public static final ConfigProperty<Integer> RECORD_INDEX_MAX_PARALLELISM = ConfigProperty
    .key(METADATA_PREFIX + ".max.init.parallelism")
    .defaultValue(100000)
    .sinceVersion("0.14.0")
```
Do we need a parallelism of 100000 ?
danny0405 left a comment
+1, except one confusion: why do we need a max init parallelism of 100000 for the RLI?
@danny0405 The max init values for the other indexes are too low (see HUDI-6553). Indexes are really useful for large datasets with a large number of partitions and files. Assume a large dataset with 100K+ files: the default parallelism of the index initialization in code is around 200, which would take HOURS for the indexes to be built. With a large parallelism:
We routinely have datasets with over 1M files in them (as large as 6M files). I have tested with various parallelism values; it's not an exact science, but somewhere around 100,000 was where I got the fastest bootstrap of the indexes. Very large parallelism causes OOM and memory issues on Spark. If we leave the default at 200, many people would report timeouts building indexes on larger tables.
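The trade-off described above can be sketched as a simple cap (hypothetical helper, not the actual Hudi logic): one task per file up to the configured maximum, so small tables stay at low parallelism while tables with millions of files fan out to roughly 100K tasks without risking driver memory pressure.

```java
// Hypothetical illustration of a bounded init parallelism.
class InitParallelismSketch {
  static int effectiveParallelism(long fileCount, int maxInitParallelism) {
    // Never spawn more tasks than there are files, and never exceed the cap.
    return (int) Math.min(fileCount, (long) maxInitParallelism);
  }
}
```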
[HUDI-6118] Some fixes to improve the MDT and record index code base.
Change Logs
Impact
Fixes issues for the recently committed RI and MDT changes.
Risk level (write none, low medium or high below)
Low
Documentation Update
None
Contributor's checklist