Convert batch_size to config option #2771

Merged: 7 commits, Jun 24, 2022

@andygrove commented Jun 22, 2022

Which issue does this PR close?

Part of #2756

This just moves a single config but I wanted to get reviews on this approach before moving all of them.

Rationale for this change

Start moving existing configs to the new config mechanism

What changes are included in this PR?

  • Move batch_size to new config mechanism
  • Change config names to snake case
  • Update docs

Are there any user-facing changes?

SessionConfig::batch_size is now a method rather than an attribute.
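
For readers updating call sites, here is a minimal sketch of what "a method rather than an attribute" means in practice. It is not the DataFusion implementation: the `SessionConfig` below is a simplified stand-in, and the key name `example.batch_size`, the default of 8192, and the helper `rows_per_batch` are all assumptions made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical option key; the real key name is not shown in this thread.
pub const OPT_BATCH_SIZE: &str = "example.batch_size";

/// Simplified stand-in for a session config backed by a key/value option store.
#[derive(Debug, Clone)]
pub struct SessionConfig {
    options: HashMap<String, u64>,
}

impl SessionConfig {
    pub fn new() -> Self {
        let mut options = HashMap::new();
        // Placeholder default chosen for this example only.
        options.insert(OPT_BATCH_SIZE.to_owned(), 8192);
        Self { options }
    }

    /// Previously this would have been a public field (`config.batch_size`);
    /// now the value is looked up in the option store on each call.
    pub fn batch_size(&self) -> usize {
        self.options[OPT_BATCH_SIZE] as usize
    }
}

/// Hypothetical call site: `config.batch_size` becomes `config.batch_size()`.
pub fn rows_per_batch(config: &SessionConfig) -> usize {
    config.batch_size()
}
```

Code that read the field directly only needs the added parentheses; code that assigned to the field would go through a setter instead.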

@andygrove requested review from yjshen and alamb June 22, 2022 15:31
@github-actions bot added the "core" label Jun 22, 2022
),
ConfigDefinition::new_u32(
OPT_BATCH_SIZE,
"Default batch size while creating new batches, it's especially useful for \
Member Author

I think this description can be improved but that seemed outside the scope of this PR

@andygrove added the "api change" and "documentation" labels Jun 22, 2022
Contributor

@alamb left a comment

I think the basic pattern looks good to me 👍

I had some questions about the type (u32 vs u64) and whether we can avoid repeating option names so many times, but otherwise 👍

Thank you @andygrove

/// Customize batch size
- pub fn with_batch_size(mut self, n: usize) -> Self {
+ pub fn with_batch_size(self, n: usize) -> Self {
Contributor

I like this approach (with a named function with_batch_size that calls into the generic implementation below). Seems like a good idea to me

👍
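
The pattern praised here, a named builder method that delegates to a generic setter, can be sketched roughly as follows. This is an illustration under assumptions rather than the code in the PR: `ConfigValue`, `set`, `get`, and the key `example.batch_size` are hypothetical names, and only `with_batch_size` and a `set_u32`-style typed setter appear in the diff itself.

```rust
use std::collections::HashMap;

/// Hypothetical option key used only in this sketch.
pub const OPT_BATCH_SIZE: &str = "example.batch_size";

/// Hypothetical value type for the generic option store.
#[derive(Debug, Clone, PartialEq)]
pub enum ConfigValue {
    U64(u64),
    Bool(bool),
}

#[derive(Debug, Clone, Default)]
pub struct SessionConfig {
    options: HashMap<String, ConfigValue>,
}

impl SessionConfig {
    /// Generic setter: the only place that mutates the option store.
    pub fn set(mut self, key: &str, value: ConfigValue) -> Self {
        self.options.insert(key.to_owned(), value);
        self
    }

    /// Typed convenience wrapper around the generic setter.
    pub fn set_u64(self, key: &str, value: u64) -> Self {
        self.set(key, ConfigValue::U64(value))
    }

    /// Named builder method. It no longer needs `mut self` because the
    /// mutation happens inside `set`.
    pub fn with_batch_size(self, n: usize) -> Self {
        // A batch size of zero would make no progress, so reject it early.
        assert!(n > 0);
        self.set_u64(OPT_BATCH_SIZE, n as u64)
    }

    pub fn get(&self, key: &str) -> Option<&ConfigValue> {
        self.options.get(key)
    }
}
```

Keeping the named method means existing callers of `with_batch_size` are unaffected while the storage moves behind the generic mechanism.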

datafusion/core/src/execution/context.rs (outdated)
// batch size must be greater than zero
assert!(n > 0);
self.batch_size = n;
self
self.set_u32(OPT_BATCH_SIZE, n as u32)
Contributor

Why not use u64 as batch size (why are we storing as a u32 and then casting back and forth to usize)?

Member Author

I changed to u64 but it still requires casting back and forth to usize.
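
To illustrate why the casting remains after switching to u64: the public API takes and returns `usize`, while the option store holds a fixed-width integer, so a conversion is needed in each direction. A small sketch using checked conversions is below; the function names `to_stored` and `from_stored` are hypothetical and not part of the PR.

```rust
use std::convert::TryFrom;

/// Convert a `usize` batch size to the `u64` held in the option store.
/// This cannot fail on targets where `usize` is at most 64 bits wide.
pub fn to_stored(n: usize) -> u64 {
    u64::try_from(n).expect("usize value does not fit in u64")
}

/// Convert a stored `u64` back to `usize`. On a 32-bit target a large
/// stored value would not fit, so the failure is surfaced as `None`.
pub fn from_stored(v: u64) -> Option<usize> {
    usize::try_from(v).ok()
}
```

Whether to use `as` or a checked conversion is a design choice; the checked form makes the narrowing direction explicit instead of silently truncating.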

Comment on lines 1134 to 1137
map.insert(
BATCH_SIZE.to_owned(),
format!("{}", self.config_options.get_u32(OPT_BATCH_SIZE)),
);
Contributor

I wonder if it is possible to avoid having to enumerate all options again in this function. What about converting all entries in self.config_options instead?
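
One way to read this suggestion is to build the property map by iterating over every stored option instead of inserting each key by hand. The sketch below assumes the options live in a `HashMap` and that values implement `Display`; `ConfigValue` and the free function `to_props` are hypothetical stand-ins, not the types used in the PR.

```rust
use std::collections::HashMap;
use std::fmt;

/// Hypothetical value type for the option store.
#[derive(Debug, Clone)]
pub enum ConfigValue {
    U64(u64),
    Bool(bool),
}

impl fmt::Display for ConfigValue {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigValue::U64(v) => write!(f, "{}", v),
            ConfigValue::Bool(v) => write!(f, "{}", v),
        }
    }
}

/// Convert every option to a string property in one pass, so that a newly
/// added option shows up without touching this function.
pub fn to_props(options: &HashMap<String, ConfigValue>) -> HashMap<String, String> {
    options
        .iter()
        .map(|(key, value)| (key.clone(), value.to_string()))
        .collect()
}
```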

Member

Should we consolidate BATCH_SIZE with OPT_BATCH_SIZE?

Member Author

We should. Ballista currently relies on this method and the keys used here. Once all the properties are converted, I will update Ballista to use the new config_option method and delete this method.

Member Author

We use these names internally as well, so I ended up removing the duplicate option. This will require Ballista to look for the new name, but that is a small change. I filed apache/datafusion-ballista#73 to track this.

andygrove and others added 2 commits June 24, 2022 06:36
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@codecov-commenter

Codecov Report

Merging #2771 (89c3c90) into master (cbb0517) will increase coverage by 0.20%.
The diff coverage is 80.00%.

❗ Current head 89c3c90 differs from pull request most recent head 2b013a0. Consider uploading reports for the commit 2b013a0 to get more accurate results

@@            Coverage Diff             @@
##           master    #2771      +/-   ##
==========================================
+ Coverage   84.95%   85.15%   +0.20%     
==========================================
  Files         272      272              
  Lines       48221    48096     -125     
==========================================
- Hits        40964    40958       -6     
+ Misses       7257     7138     -119     
Impacted Files Coverage Δ
...afusion/core/src/physical_plan/file_format/avro.rs 0.00% <ø> (ø)
datafusion/core/src/execution/context.rs 78.43% <54.54%> (+0.06%) ⬆️
datafusion/core/src/config.rs 91.80% <84.21%> (+0.69%) ⬆️
...on/core/src/physical_optimizer/coalesce_batches.rs 100.00% <100.00%> (ø)
...tafusion/core/src/physical_plan/file_format/csv.rs 93.78% <100.00%> (ø)
...afusion/core/src/physical_plan/file_format/json.rs 93.18% <100.00%> (ø)
...sion/core/src/physical_plan/file_format/parquet.rs 95.15% <100.00%> (ø)
...tafusion/core/src/physical_plan/sort_merge_join.rs 90.34% <100.00%> (ø)
datafusion/core/src/physical_plan/sorts/sort.rs 93.09% <100.00%> (ø)
...e/src/physical_plan/sorts/sort_preserving_merge.rs 92.57% <100.00%> (ø)
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cbb0517...2b013a0.

@andygrove
Member Author

Thanks for the review @alamb and @yjshen. I made some changes to address feedback. Please take another look when you can.

Contributor

@alamb left a comment

looks great to me

datafusion/core/src/config.rs (outdated)
pub fn to_props(&self) -> HashMap<String, String> {
let mut map = HashMap::new();
map.insert(BATCH_SIZE.to_owned(), format!("{}", self.batch_size));
// copy configs from config_options
Contributor

❤️

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@andygrove merged commit 7c60412 into apache:master Jun 24, 2022
@andygrove deleted the batch_size branch June 24, 2022 20:39