[HUDI-5348] Cache file slices in HoodieBackedTableMetadata #7436
@@ -379,7 +386,8 @@ private HoodieRecord<HoodieMetadataPayload> composeRecord(GenericRecord avroReco
 private Map<Pair<String, FileSlice>, List<String>> getPartitionFileSliceToKeysMapping(final String partitionName, final List<String> keys) {
 // Metadata is in sync till the latest completed instant on the dataset
 List<FileSlice> latestFileSlices =
- HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, partitionName);
+ HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(
I am thinking we can also cache the file slices, similar to how we cache the file readers. I don't see a reason for file slices to change unless there is a change in the timeline, in which case the entire FileSystemView will be refreshed.
Looks like the FileSystemView (MDFSV) caches the entities, so we are good.
apache#7436) (apache#187) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Change Logs
As of now, we only cache the log file reader inside HoodieBackedTableMetadata. Each time the metadata table is looked up with getRecordByKey or getRecordsByKeyPrefixes in HoodieBackedTableMetadata, the corresponding MT partition is listed through HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices because a file system view is constructed each time. This causes repeated FS list calls on MT partitions and increases the latency for reading metadata table and listing files for data table, affecting Presto query latency for example (sample S3 access log from Presto below for listing files partition in MT).
This PR makes the changes to cache the file system view of the metadata table and thus the latest file slices at the partition level for metadata table inside HoodieBackedTableMetadata to avoid FS list calls.
Impact
This PR avoids repeated file listing on the metadata table and thus reduces the latency of reading the metadata table. This in turn reduces the latency of metadata-table-based file listing overall and improves query performance.
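The caching idea in this PR can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual Hudi change: PartitionFileSliceCache, its lister function, and the use of plain strings in place of FileSlice objects are all hypothetical stand-ins. It only shows the general pattern of listing a partition once, serving later lookups from memory, and dropping the cached entries when the timeline changes.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration only; not a class from the Hudi code base.
// Strings stand in for FileSlice objects, and `lister` stands in for
// whatever performs the (expensive) file system listing of a partition.
final class PartitionFileSliceCache {
  private final Function<String, List<String>> lister;
  private final Map<String, List<String>> slicesByPartition = new ConcurrentHashMap<>();

  PartitionFileSliceCache(Function<String, List<String>> lister) {
    this.lister = lister;
  }

  // Lists the partition on the first lookup and reuses the result afterwards.
  List<String> getLatestFileSlices(String partitionName) {
    return slicesByPartition.computeIfAbsent(partitionName, lister);
  }

  // Drops all cached entries; intended to be called when the timeline changes.
  void reset() {
    slicesByPartition.clear();
  }
}
```

With this shape, repeated lookups against the same partition invoke the lister only once until reset() is called, which mirrors the assumption in the review discussion that file slices only change when the timeline does.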
Risk level
low
Documentation Update
N/A
Contributor's checklist