[HUDI-5348] Cache file slices in HoodieBackedTableMetadata #7436

Merged (1 commit) on Dec 13, 2022

@yihua yihua commented Dec 12, 2022

Change Logs

Currently, only the log file reader is cached inside HoodieBackedTableMetadata. Each time the metadata table (MT) is looked up through getRecordByKey or getRecordsByKeyPrefixes in HoodieBackedTableMetadata, the corresponding MT partition is listed via HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices, because a new file system view is constructed on every call. This causes repeated file system (FS) list calls on MT partitions and increases the latency of reading the metadata table and of listing files for the data table, which affects Presto query latency, for example. The sample S3 access log below, taken from Presto, shows one such listing of the files partition in the MT.

2022-11-24T22:06:43.009Z	INFO	hive-hive-2	org.apache.hudi.common.table.view.AbstractTableFileSystemView	Building file system view for partition (files)
2022-11-24T22:06:43.009Z	DEBUG	hive-hive-2	com.amazonaws.request	Sending Request: GET https://<redacted>.s3.us-east-2.amazonaws.com / Parameters: ({"prefix":["<redacted>/store_sales/.hoodie/metadata/files/"],"delimiter":["/"],"encoding-type":["url"]}Headers: (amz-sdk-invocation-id: 9e963ae0-f2e4-738e-691f-073c5a43264d, Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
2022-11-24T22:06:43.022Z	DEBUG	hive-hive-2	com.amazonaws.request	Received successful response: 200, AWS Request ID: Y4KHZHYVG7SSB0J4

This PR caches the file system view of the metadata table, and thus the latest file slices at the partition level, inside HoodieBackedTableMetadata to avoid repeated FS list calls.
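
The caching idea can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: PartitionSliceCache, getLatestFileSlices, and reset are hypothetical names, file slices are represented as plain strings, and the lister function is an assumed stand-in for the FS list call that builds the file system view. It only shows that a per-partition cache turns repeated lookups into a single listing until the cache is reset.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration only; not a Hudi class.
public class PartitionSliceCache {
  // Assumed stand-in for the expensive FS listing of one MT partition.
  private final Function<String, List<String>> lister;
  // Partition name -> latest file slices (plain strings in this sketch).
  private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

  public PartitionSliceCache(Function<String, List<String>> lister) {
    this.lister = lister;
  }

  // Lists the partition on the first lookup only; later lookups hit the cache.
  public List<String> getLatestFileSlices(String partitionName) {
    return cache.computeIfAbsent(partitionName, lister);
  }

  // Drops all cached slices, e.g. when the timeline changes.
  public void reset() {
    cache.clear();
  }

  public static void main(String[] args) {
    AtomicInteger listCalls = new AtomicInteger();
    PartitionSliceCache sliceCache = new PartitionSliceCache(partition -> {
      listCalls.incrementAndGet(); // count simulated FS list calls
      return Collections.singletonList(partition + "/slice-0");
    });

    sliceCache.getLatestFileSlices("files");
    sliceCache.getLatestFileSlices("files");
    System.out.println(listCalls.get()); // prints 1

    sliceCache.reset();
    sliceCache.getLatestFileSlices("files");
    System.out.println(listCalls.get()); // prints 2
  }
}
```

The explicit reset mirrors the point that cached slices only need to be rebuilt when the underlying view is refreshed.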

Impact

This PR avoids repeated file listing on the metadata table, which reduces the latency of reading it. That in turn lowers the latency of metadata-table-based file listing overall and improves query performance.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua assigned yihua and nsivabalan and unassigned yihua Dec 12, 2022
@yihua yihua added the priority:blocker, metadata (metadata table), and release-0.12.2 (Patches targetted for 0.12.2) labels Dec 12, 2022

@@ -379,7 +386,8 @@ private HoodieRecord<HoodieMetadataPayload> composeRecord(GenericRecord avroReco
   private Map<Pair<String, FileSlice>, List<String>> getPartitionFileSliceToKeysMapping(final String partitionName, final List<String> keys) {
     // Metadata is in sync till the latest completed instant on the dataset
     List<FileSlice> latestFileSlices =
-        HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(metadataMetaClient, partitionName);
+        HoodieTableMetadataUtil.getPartitionLatestMergedFileSlices(

I think we could also cache the file slices, similar to how we cache the file readers. I don't see a reason for the file slices to change unless the timeline changes, in which case the entire FileSystemView will be refreshed.

Looks like the FileSystemView (MDFSV) caches the entities, so we are good.

@nsivabalan nsivabalan merged commit 13a8e5c into apache:master Dec 13, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023