chore: change parquet_fast_read_bytes setting from 0 to 16MB #15212

Merged
merged 1 commit into from
Apr 11, 2024

BohuTANG
Member

@BohuTANG BohuTANG commented Apr 11, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

When copying from a Parquet stage, we read each file's metadata by default. For small Parquet files (less than 16MB), this is inefficient because of the cost of the metadata read.

This PR sets the default value of parquet_fast_read_bytes to 16MB: if a Parquet file is smaller than 16MB, we skip the metadata read and read the entire file instead.

Related code:

    let fast_read_bytes = ctx.get_settings().get_parquet_fast_read_bytes()?;
    let mut large_files = vec![];
    let mut large_file_indices = vec![];
    let mut small_file_indices = vec![];
    let mut small_files = vec![];
    for (index, (location, size)) in file_locations.into_iter().enumerate() {
        if size > fast_read_bytes {
            large_files.push((location, size));
            large_file_indices.push(index);
        } else {
            small_files.push((location, size));
            small_file_indices.push(index);
        }
    }

    let (mut stats, mut partitions) = if parquet_metas.is_empty() {
        self.read_and_prune_metas_in_parallel(
            ctx,
            large_files,
            pruner,
            columns_to_read,
            Arc::new(topk),
            copy_status,
        )
        .await?
    } else {
        prune_metas_in_parallel(
            ctx,
            &parquet_metas,
            large_file_indices,
            pruner,
            columns_to_read,
            Arc::new(topk),
            copy_status,
        )
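
To make the size-based split in the excerpt above easier to follow, here is a minimal, self-contained sketch of the same comparison. `split_by_size` is a hypothetical helper written for illustration and is not part of Databend, and the sketch assumes that 16MB means 16 * 1024 * 1024 bytes. It mirrors the `size > fast_read_bytes` check, so a file exactly at the threshold counts as small.

```rust
/// Hypothetical helper (not a Databend API): splits (location, size) pairs
/// into small and large files around a byte threshold, using the same
/// `size > fast_read_bytes` comparison as the excerpt.
fn split_by_size(
    files: Vec<(String, u64)>,
    fast_read_bytes: u64,
) -> (Vec<(String, u64)>, Vec<(String, u64)>) {
    let mut small = Vec::new();
    let mut large = Vec::new();
    for (location, size) in files {
        if size > fast_read_bytes {
            // Strictly larger than the threshold: metadata is read first.
            large.push((location, size));
        } else {
            // At or below the threshold: the whole file is read directly.
            small.push((location, size));
        }
    }
    (small, large)
}

fn main() {
    // Assumption: 16MB is interpreted as 16 * 1024 * 1024 bytes.
    const FAST_READ_BYTES: u64 = 16 * 1024 * 1024;
    let files = vec![
        ("a.parquet".to_string(), 1024),
        ("b.parquet".to_string(), FAST_READ_BYTES),
        ("c.parquet".to_string(), FAST_READ_BYTES + 1),
    ];
    let (small, large) = split_by_size(files, FAST_READ_BYTES);
    println!("small: {:?}", small);
    println!("large: {:?}", large);
}
```

With a threshold of 0 (the previous default), every non-empty file would land in the large group, which matches the description of metadata being read each time.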

  • Fixes #[Link the issue here]

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-chore this PR only has small changes that no need to record, like coding styles. label Apr 11, 2024
@BohuTANG BohuTANG marked this pull request as ready for review April 11, 2024 05:42
@BohuTANG BohuTANG merged commit 4c27f29 into datafuselabs:main Apr 11, 2024
82 of 84 checks passed
@BohuTANG
Member Author

The default value for parquet_fast_read_bytes is 16MB, so there's no need to configure it. CC @yufan022

@yufan022
Contributor

Copy that. Our current setting is 1073741824.
