[HUDI-5345] Avoid fs.exists calls for metadata table in HFileBootstrapIndex #7404

Merged

yihua commented Dec 7, 2022

Change Logs

When instantiating the file system view of Hudi, the HFileBootstrapIndex is also instantiated, which includes two fs.exists calls to check if the bootstrap index is present. This can be completely avoided for the file system view built for reading the metadata table, as the metadata table never uses a bootstrap index.

This PR adds a check on the base path of the table in HFileBootstrapIndex and avoids the fs.exists calls if it is a metadata table.
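The guard described above can be sketched roughly as follows. This is a minimal illustration and not the actual patch: the class name, the `isMetadataTable` helper, the `exists` probe parameter, and the `.fileids` folder are assumptions made for the example. Only the `.hoodie/metadata` suffix and the `.partitions` index file path are taken from the log in this PR.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in for the bootstrap index presence check; not Hudi's actual class. */
final class BootstrapIndexPresenceSketch {
    // Metadata table location relative to the data table base path, as seen in the log.
    static final String METADATA_TABLE_SUFFIX = "/.hoodie/metadata";
    // Index file name as it appears in the log.
    static final String INDEX_FILE =
        "00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile";

    /** Drops trailing slashes so suffix checks and path joins behave consistently. */
    static String stripTrailingSlashes(String path) {
        String p = path;
        while (p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        return p;
    }

    /** Assumed helper: true if the base path points at a metadata table. */
    static boolean isMetadataTable(String basePath) {
        return stripTrailingSlashes(basePath).endsWith(METADATA_TABLE_SUFFIX);
    }

    /**
     * Returns whether a bootstrap index is present under the base path.
     * The exists probe stands in for fs.exists and is never invoked
     * when the base path belongs to a metadata table.
     */
    static boolean isIndexPresent(String basePath, Predicate<String> exists) {
        if (isMetadataTable(basePath)) {
            // The metadata table never uses a bootstrap index, so skip the storage calls.
            return false;
        }
        String indexDir = stripTrailingSlashes(basePath) + "/.hoodie/.aux/.bootstrap";
        // ".fileids" is an assumed folder name for the second index file.
        return exists.test(indexDir + "/.partitions/" + INDEX_FILE)
            && exists.test(indexDir + "/.fileids/" + INDEX_FILE);
    }

    public static void main(String[] args) {
        // A probe that logs each call and reports that nothing exists.
        Predicate<String> probe = path -> {
            System.out.println("exists? " + path);
            return false;
        };
        System.out.println(isIndexPresent("s3://bucket/store_sales/.hoodie/metadata", probe));
        System.out.println(isIndexPresent("s3://bucket/store_sales", probe));
    }
}
```

Passing the existence probe as a parameter keeps the sketch independent of any file system library and makes it easy to confirm that no storage call is issued for a metadata table path.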

Below is an example log from Presto showing the FS calls to S3 when instantiating HFileBootstrapIndex.

2022-11-24T22:06:42.979Z	DEBUG	hive-hive-1	com.amazonaws.request	Sending Request: HEAD https://<redacted>.s3.us-east-2.amazonaws.com <redacted>/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile Headers: (amz-sdk-invocation-id: 45caf5e0-6647-d12d-f40b-eabe66add479, Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
2022-11-24T22:06:42.989Z	DEBUG	hive-hive-1	com.amazonaws.request	Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: G9DQ3ZB656TBSPXK; S3 Extended Request ID: XLcukfeUa9gmmVSEWpk3ciemV5lhiGcf8gxkewhlmJVNV6sZGqAl0Pi7o4H7LTzAFQKZDVVditQ=), S3 Extended Request ID: XLcukfeUa9gmmVSEWpk3ciemV5lhiGcf8gxkewhlmJVNV6sZGqAl0Pi7o4H7LTzAFQKZDVVditQ=
2022-11-24T22:06:42.990Z	DEBUG	hive-hive-1	com.amazonaws.request	Sending Request: HEAD https://<redacted>.s3.us-east-2.amazonaws.com <redacted>/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/ Headers: (amz-sdk-invocation-id: 31a4b33c-a381-054d-5323-b41181be1a04, Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
2022-11-24T22:06:43.000Z	DEBUG	hive-hive-1	com.amazonaws.request	Received error response: com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: G9DTM2Z9MYSQBV7G; S3 Extended Request ID: m8M6/eGdNShGwOccPoJfMFdgZLtUQ0esU20ZIfszLUSRJsv0NX+dYtcPLBa+4ucNyfHrvf9RL7Y=), S3 Extended Request ID: m8M6/eGdNShGwOccPoJfMFdgZLtUQ0esU20ZIfszLUSRJsv0NX+dYtcPLBa+4ucNyfHrvf9RL7Y=
2022-11-24T22:06:43.000Z	DEBUG	hive-hive-1	com.amazonaws.request	Sending Request: GET https://<redacted>.s3.us-east-2.amazonaws.com / Parameters: ({"prefix":["<redacted>/store_sales/.hoodie/metadata/.hoodie/.aux/.bootstrap/.partitions/00000000-0000-0000-0000-000000000000-0_1-0-1_00000000000001.hfile/"],"delimiter":["/"],"max-keys":["1"],"encoding-type":["url"]}Headers: (amz-sdk-invocation-id: 45e2ddc4-aa04-1ec8-9181-e66555efb874, Content-Type: application/octet-stream, User-Agent: , aws-sdk-java/1.11.697 Linux/5.4.219-126.411.amzn2.x86_64 OpenJDK_64-Bit_Server_VM/25.342-b07 java/1.8.0_342 vendor/Oracle_Corporation, presto, ) 
2022-11-24T22:06:43.013Z	DEBUG	hive-hive-1	com.amazonaws.request	Received successful response: 200, AWS Request ID: Y4KXTF61RZAM1D6N

Impact

This PR avoids the fs.exists calls and reduces the latency of instantiating the file system view for the metadata table. With S3 as the storage, the 3 requests shown above are avoided, which saves at least 40ms.

This affects partition file listing based on the metadata table in the Presto Hive and Hudi connectors. This performance fix shaves 10+ seconds off listing ~1800 partitions in a Presto query with the metadata table enabled.

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

codope added the priority:critical, release-0.12.2, and metadata labels Dec 7, 2022
codope self-assigned this Dec 7, 2022
hudi-bot commented Dec 8, 2022

CI report:

codope merged commit da9fef6 into apache:master Dec 8, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023