Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: persist sst meta size #1440

Merged
merged 5 commits into from
Jan 31, 2024
Merged

Conversation

jiacai2050
Copy link
Contributor

@jiacai2050 jiacai2050 commented Jan 15, 2024

Rationale

When read sst's metadata, it will use get method to read whole file, but there is no cache in get, so perf may degrade.

Detailed Changes

  • Add meta_size in SST's custom meta.
  • When read SST's metadata, use get_range when we have size.

The parquet writer is kind of massy now, will do some refactor after this PR.

Test Plan

CI

Copy link
Member

@ShiKaiWi ShiKaiWi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the review, I find it's weird to fix the object store cache problem by avoiding using get because the interface of the object store doesn't tell the caller get is worse than get_range.

Maybe the right way is to ensure get method to work as expected so that any caller won't be required to remember the file size to use get_range.

Comment on lines 80 to 82
meta_size = Some(size.parse::<usize>().with_context(|| InvalidSize {
size: size.to_string(),
})?);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess snafu will auto clone for the string reference:

Suggested change
meta_size = Some(size.parse::<usize>().with_context(|| InvalidSize {
size: size.to_string(),
})?);
meta_size = Some(size.parse::<usize>().context(InvalidSize {
size,
}?);

@ShiKaiWi
Copy link
Member

@jiacai2050 Would you please choose the main branch as the base for merging?

@jiacai2050
Copy link
Contributor Author

The best way to fix is indeed to add cache for get method, but it's require more work, such as:

  • Request duplicate
  • Deal with IO failure

So here I choose to bypass get. In future We may use get_range with 0.. to replace get, but
this require object_store's support, so we may need to contribute there first.

@jiacai2050 jiacai2050 changed the base branch from dev to main January 23, 2024 12:15
@apache apache deleted a comment from CLAassistant Jan 23, 2024
Comment on lines +86 to +106
Some(size) => {
let all_range = 0..size;
self.store
.get_range(meta_path, all_range)
.await
.with_context(|| FetchFromStore {
file_path: meta_path.to_string(),
})?
}
None => self
.store
.get(meta_path)
.await
.with_context(|| FetchFromStore {
file_path: meta_path.to_string(),
})?
.bytes()
.await
.with_context(|| FetchAndDecodeSstMeta {
file_path: meta_path.to_string(),
})?,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain why we are in favor of get_range?

@jiacai2050 jiacai2050 merged commit 6f9d426 into apache:main Jan 31, 2024
9 checks passed
@jiacai2050 jiacai2050 deleted the feat-meta-size branch January 31, 2024 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants