Expose remaining parquet config options into ConfigOptions (try 2) #4427
@@ -396,7 +396,8 @@ async fn get_table(
}
"parquet" => {
let path = format!("{}/{}", path, table);
let format = ParquetFormat::default().with_enable_pruning(true);
let format = ParquetFormat::new(ctx.config_options())
As the parquet format now reads its defaults from ConfigOptions, they need to be passed to the constructor.
One read to fetch the 8-byte parquet footer and \
another to fetch the metadata length encoded in the footer.",
DataType::UInt64,
ScalarValue::UInt64(None),
As mentioned by @thinkharderdev on #3885 (comment), we should probably change this default to something reasonable (like 64K), but I would rather do that in a follow-on PR.
Filed #4459 to track
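To make the two-read behaviour described in the option text concrete, here is a minimal, std-only sketch of how a reader might decode the 8-byte footer and decide whether a size hint saved the second read. It relies only on the Parquet footer layout (a 4-byte little-endian metadata length followed by the `PAR1` magic). The function names `metadata_len_from_footer`, `first_read_len`, and `needs_second_read` are hypothetical and are not DataFusion or parquet crate APIs.

```rust
/// Magic bytes that terminate an unencrypted Parquet file.
const PARQUET_MAGIC: &[u8; 4] = b"PAR1";

/// Decode the metadata length from the last 8 bytes of a Parquet file.
/// Returns `None` if the trailing magic bytes are missing.
fn metadata_len_from_footer(footer: &[u8; 8]) -> Option<u32> {
    if footer[4..] != PARQUET_MAGIC[..] {
        return None;
    }
    // The first 4 footer bytes hold the metadata length, little-endian.
    Some(u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]))
}

/// Hypothetical helper: how many tail bytes to request in the first read.
/// Without a hint only the 8-byte footer is fetched; with a hint the
/// reader optimistically fetches more, capped at the file size.
fn first_read_len(metadata_size_hint: Option<u64>, file_size: u64) -> u64 {
    metadata_size_hint.unwrap_or(8).max(8).min(file_size)
}

/// Hypothetical helper: a second read is needed when the metadata plus
/// the 8-byte footer did not fit into the bytes already fetched.
fn needs_second_read(metadata_len: u32, first_read: u64) -> bool {
    u64::from(metadata_len) + 8 > first_read
}
```

Under these assumptions, a default of `None` always costs two reads, whereas a hint such as 64K avoids the second read whenever the metadata is smaller than the hint.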
// Session level configuration
config_options: Arc<RwLock<ConfigOptions>>,
// local overides
enable_pruning: Option<bool>,
By changing these to Option, I think it is now clearer that if they are left at the default, the (documented) value from ConfigOptions is used.
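The override pattern described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's actual code: `SessionOptions` and `FormatSketch` are hypothetical stand-ins for `ConfigOptions` and `ParquetFormat`, and only the field names `config_options` and `enable_pruning` come from the diff above.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the session-level `ConfigOptions`.
#[derive(Debug)]
struct SessionOptions {
    enable_pruning: bool,
}

/// Hypothetical stand-in for a format that shares the session
/// configuration and keeps an optional local override.
struct FormatSketch {
    // Session level configuration (shared, not copied)
    config_options: Arc<RwLock<SessionOptions>>,
    // Local override; `None` means "use the session default"
    enable_pruning: Option<bool>,
}

impl FormatSketch {
    fn new(config_options: Arc<RwLock<SessionOptions>>) -> Self {
        Self {
            config_options,
            enable_pruning: None,
        }
    }

    /// Set a local override that takes precedence over the session value.
    fn with_enable_pruning(mut self, enable: bool) -> Self {
        self.enable_pruning = Some(enable);
        self
    }

    /// Resolve the effective value: the override if set, otherwise
    /// whatever the session configuration says right now.
    fn enable_pruning(&self) -> bool {
        self.enable_pruning
            .unwrap_or_else(|| self.config_options.read().unwrap().enable_pruning)
    }
}
```

Because the fallback reads the shared configuration at call time, a format with no override follows later changes to the session, while a format built with `with_enable_pruning` keeps its own value.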
let listing_options = options
.parquet_pruning(parquet_pruning)
.to_listing_options(target_partitions);
let listing_options = options.to_listing_options(&self.state.read().config);
In some ways I think this is cleaner, as the options now make it all the way to the parquet reader, whereas before there were places like this that copied some (but not all) of the settings around.
@@ -1183,7 +1179,6 @@ impl Default for SessionConfig {
repartition_joins: true,
repartition_aggregations: true,
repartition_windows: true,
parquet_pruning: true,
yet another copy of this setting!
/// metadata. Defaults to true.
// TODO move this into ConfigOptions
pub skip_metadata: bool,
/// Should the parquet reader use the predicate to prune row groups?
Again, this makes it clear that these settings are overrides of the defaults in the session configuration.
@@ -72,6 +72,16 @@ use super::get_output_ordering;
/// Execution plan for scanning one or more Parquet partitions
#[derive(Debug, Clone)]
pub struct ParquetExec {
/// Override for `Self::with_pushdown_filters`. If None, uses
I also added this back in so that overrides set directly on a ParquetExec do not affect the global configuration options.
@@ -707,8 +707,11 @@ async fn show_all() {
"| datafusion.execution.coalesce_batches | true |",
"| datafusion.execution.coalesce_target_batch_size | 4096 |",
"| datafusion.execution.parquet.enable_page_index | false |",
"| datafusion.execution.parquet.metadata_size_hint | NULL |",
🎉 one can now see much more explicitly both 1) what the parquet options are and 2) what their default values are
LGTM! Thanks @alamb
Thanks @alamb
Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com>
Benchmark runs are scheduled for baseline = 09aea09 and contender = fb8eeb2. fb8eeb2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This is a reworked version of #3885
Which issue does this PR close?
Closes #3821
This also helps towards #3887 and #4349
Rationale for this change
It turns out that options for reading parquet files could be set (and possibly overridden) by no fewer than three different structures! This is confusing, to say the least.
What changes are included in this PR?
Are there any user-facing changes?
Settings on ParquetExec are handled consistently as an override to session-wide defaults.
Previously, depending on which of the APIs was used to create / register / run parquet, the settings might change if you changed the session config, or they might have been a snapshot taken when you registered the reader.
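The difference between the two earlier behaviours can be illustrated with a small std-only sketch. All names here (`Options`, `SnapshotReader`, `LiveReader`) are hypothetical and exist only for illustration; they are not types from DataFusion.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical session-wide settings.
#[derive(Clone, Debug, PartialEq)]
struct Options {
    pushdown_filters: bool,
}

/// Copies the settings at registration time, so later session changes
/// are not observed (the "snapshot" behaviour).
struct SnapshotReader {
    options: Options,
}

impl SnapshotReader {
    fn register(session: &Arc<RwLock<Options>>) -> Self {
        Self {
            options: session.read().unwrap().clone(),
        }
    }

    fn pushdown_filters(&self) -> bool {
        self.options.pushdown_filters
    }
}

/// Keeps a handle to the shared settings and reads them on demand,
/// so it always sees the current session value.
struct LiveReader {
    options: Arc<RwLock<Options>>,
}

impl LiveReader {
    fn register(session: &Arc<RwLock<Options>>) -> Self {
        Self {
            options: Arc::clone(session),
        }
    }

    fn pushdown_filters(&self) -> bool {
        self.options.read().unwrap().pushdown_filters
    }
}
```

If the session flips `pushdown_filters` after both readers are registered, only the live reader observes the new value, which is the kind of inconsistency between APIs that the description refers to.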