Skip to content

Config support type conversion#3522

Merged
andygrove merged 6 commits intoapache:masterfrom
comphead:config_types_support
Sep 22, 2022
Merged

Config support type conversion#3522
andygrove merged 6 commits intoapache:masterfrom
comphead:config_types_support

Conversation

@comphead
Copy link
Copy Markdown
Contributor

@comphead comphead commented Sep 16, 2022

Which issue does this PR close?

Closes #3505 .

Rationale for this change

Avoid unexpected defaults when the type doesn't match for config value

What changes are included in this PR?

Conversion between types when reading config values

Are there any user-facing changes?

@github-actions github-actions Bot added the core Core DataFusion crate label Sep 16, 2022
Comment thread datafusion/core/src/config.rs Outdated
pub fn get_bool(&self, key: &str) -> bool {
match self.get(key) {
Some(ScalarValue::Boolean(Some(b))) => b,
Some(b) => b
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this @comphead. I think it might be better to do the conversion in the set methods rather than the get methods, and have the set methods return a Result but I haven't looked closely at this.

Copy link
Copy Markdown
Contributor

@HaoYang670 HaoYang670 Sep 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @andygrove. It would be better to make the configuration dynamically and strongly typed:

    pub fn get_bool(&self, key: &str) -> Result<bool> {
        match self.get(key) {
            Some(ScalarValue::Boolean(Some(b))) => b,
            Some(_) => Err(DatafusionError("The configuration {} is not a Boolean", key)),
            None => Err(DatafusionError("The configuration {} is not set", key)),
}}

@HaoYang670
Copy link
Copy Markdown
Contributor

HaoYang670 commented Sep 17, 2022

Why does this PR close 3500?

Comment thread datafusion/core/src/config.rs Outdated
Some(ScalarValue::Utf8(Some(s))) => s,
_ => "".into(),
Some(s) => s.to_string(),
_ => "".to_string(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the key is not set, should we return an Error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more common thing in Rust would be to return Option<String> so if the string wasn't set the caller can react to that.

@comphead
Copy link
Copy Markdown
Contributor Author

@andygrove @HaoYang670 Thank you for comments.
I like the idea to return Result<T> from all get methods.
however there are some things to discuss.

  • if they key doesn't exist should we return Err or default value?
  • if the types doesn't match but castable on get, should we fail or cast if possible? cast on set perhaps not good idea as user that sets configs usually knows the type to put to the config store and can select correct set method, whereas reader has no idea what the type by selected key is in the config. Thus I think cast on read is more useful.

@HaoYang670
Copy link
Copy Markdown
Contributor

if they key doesn't exist should we return Err or default value

Personally, I prefer returning an error, because it tells users what has happened (a get_before_set error). And users can customize the error in their ways.

whereas reader has no idea what the type by selected key is in the config

I guess this assumption is somewhat unrealistic. How could a reader use the config if he/she doesn't know the type of it?

@HaoYang670
Copy link
Copy Markdown
Contributor

Could you please remove close 3500 from your PR description? As this PR doesn't add set_string support.

@comphead
Copy link
Copy Markdown
Contributor Author

if they key doesn't exist should we return Err or default value

Personally, I prefer returning an error, because it tells users what has happened (a get_before_set error). And users can customize the error in their ways.

whereas reader has no idea what the type by selected key is in the config

I guess this assumption is somewhat unrealistic. How could a reader use the config if he/she doesn't know the type of it?

Hi @HaoYang670, thanks for pointing out on this. That makes sense.

If we consider spark config, which supports both initial and runtime config params, there all values are strings. The end user casts the config value to the type he needs on his side.
So my concern here, if we support not only documented params, but also custom/runtime params then is this scenario real, that someone sets a custom param, and a reader without knowing a correct type in config will have go through all supported types until he finds the correct one? @andygrove ^^^^

If its not the case, then making values strongly typed on set is good idea

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Sep 17, 2022

Codecov Report

Merging #3522 (360e087) into master (f30fc4e) will increase coverage by 0.11%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #3522      +/-   ##
==========================================
+ Coverage   85.79%   85.90%   +0.11%     
==========================================
  Files         300      301       +1     
  Lines       55403    56046     +643     
==========================================
+ Hits        47533    48147     +614     
- Misses       7870     7899      +29     
Impacted Files Coverage Δ
datafusion/core/src/config.rs 95.50% <100.00%> (+5.85%) ⬆️
datafusion/core/src/execution/context.rs 79.31% <100.00%> (ø)
datafusion/core/src/physical_plan/planner.rs 77.35% <100.00%> (ø)
datafusion/core/tests/config_from_env.rs 100.00% <100.00%> (ø)
datafusion/common/src/dfschema.rs 93.63% <0.00%> (-4.15%) ⬇️
datafusion/core/src/physical_plan/union.rs 96.03% <0.00%> (-0.63%) ⬇️
...sion/core/src/physical_plan/file_format/parquet.rs 94.65% <0.00%> (-0.25%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 77.75% <0.00%> (-0.10%) ⬇️
...afusion/core/src/physical_plan/coalesce_batches.rs 93.33% <0.00%> (-0.05%) ⬇️
... and 25 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@comphead
Copy link
Copy Markdown
Contributor Author

Made get methods as options, returning None if the key doesn't exist.
@HaoYang670 what is the plan for casting after all?

@HaoYang670
Copy link
Copy Markdown
Contributor

what is the plan for casting after all?

we have already had a typeless get API:

    /// get a configuration option
    pub fn get(&self, key: &str) -> Option<ScalarValue> {
        self.options.get(key).cloned()
    }

for users who don't know the type of the config.

For other get APIs (get_bool, get_u8, get_string), I prefer returning an error when the types aren't matched.
cc @andygrove @alamb

# Conflicts:
#	datafusion/core/src/execution/context.rs
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @comphead -- this is great progress. I am sorry for the late review.

I would really prefer:

  1. No panic's
  2. Do not try and parse any arbitrary ScalarValue variant (parsing ScalarValue::Utf8 makes a lot of sense to me)

I don't have a strong opinion on how to handle config values stored as strings but that can't be parsed into whatever type DataFusion expects.

I think the most complete API would be returning Result<Option<..>>

     /// return the value of the key as a `u64`. If there is no value or a null
     /// value stored for that key, returns Ok(None)
     ///
     /// If the value was stored as a string but that string is not a valid `u64` returns
     /// Err 
    pub fn get_u64(&self, key: &str) -> Result<Option<u64>> { 
...
}

But I am not sure if that would be annoying to call in the rest of the DataFusion codebase

Comment thread datafusion/core/src/config.rs Outdated
Some(b) => Some(
b.to_string()
.parse::<bool>()
.unwrap_or_else(|_| panic!("Cannot parse bool from {:?}", &b)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I feel like in other PRs like #3316 are going in the opposite direction and removing the use of panic!

I would prefer this API returned None rather than panic'd (and maybe log a warn! message)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. We don't want to introduce more panics at this point. There is an effort to remove as many of the existing panics as is practical.

Comment thread datafusion/core/src/config.rs Outdated
Some(ScalarValue::UInt64(Some(n))) => n,
_ => 0,
Some(ScalarValue::UInt64(n)) => n,
Some(n) => Some(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling to_string() on a ScalarValue calls its Display implementation: https://github.com/apache/arrow-datafusion/blob/master/datafusion/common/src/scalar.rs#L1991

I am not sure if trying to parse this as a u64 is what someone may expect. For example, if it is ScalarValue::Utf8(None) the value returned will be "None" which will fail to parse and panic.

I would recommend only trying to parse ScalarValue::Utf8(n)) rather than any ScalarValue

@andygrove
Copy link
Copy Markdown
Member

I plan on reviewing this tomorrow.

@comphead
Copy link
Copy Markdown
Contributor Author

Thanks @alamb for detailed suggestions, I have digged into couple of config implementations, namely https://docs.rs/config/latest/src/config/config.rs.html#180

They use Result with Err for both parse or key not found
Probably we can reuse this experience

@comphead
Copy link
Copy Markdown
Contributor Author

@andygrove please us know your thoughts as well, so we finally can close it

Comment thread datafusion/core/src/config.rs Outdated
Some(ScalarValue::Utf8(Some(s))) => s,
_ => "".into(),
Some(ScalarValue::Utf8(s)) => s,
Some(s) => Some(s.to_string()),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is what we want. What does this return for a u64 value? Could you add a test for this?

@andygrove
Copy link
Copy Markdown
Member

Apologies for not getting to this sooner. I have been overstretched recently. Here is my view on how we should implement this (we don't have to do all of this in this PR, though).

In the set methods, I think it makes sense to check if the config is one of the built-in (pre-defined) configs, and if so, then check that the value is the correct type and return a Result.

If we do the validation when setting configs, then there is no need to return a Result on the get methods. I like the idea of returning Option here and logging a warning if the config option is set but has the wrong datatype (this should not be possible in most cases if we are verifying the values in the set method).

@comphead
Copy link
Copy Markdown
Contributor Author

Apologies for not getting to this sooner. I have been overstretched recently. Here is my view on how we should implement this (we don't have to do all of this in this PR, though).

In the set methods, I think it makes sense to check if the config is one of the built-in (pre-defined) configs, and if so, then check that the value is the correct type and return a Result.

If we do the validation when setting configs, then there is no need to return a Result on the get methods. I like the idea of returning Option here and logging a warning if the config option is set but has the wrong datatype (this should not be possible in most cases if we are verifying the values in the set method).

Thanks @andygrove. Implemented exactly like discussed.
I will create another PR for setting values and checking if they are part of built in configs.

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me -- thank you @comphead for staying with this and making multiple changes with various inputs. 🙏

Some(ScalarValue::$TPE(v)) => v,
Some(v) => {
warn!(
"Config type mismatch for {}. Expected: {}, got: {:?}",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @comphead! LGTM

@andygrove andygrove merged commit 1e16e43 into apache:master Sep 22, 2022
@comphead comphead deleted the config_types_support branch September 22, 2022 16:34
@ursabot
Copy link
Copy Markdown

ursabot commented Sep 22, 2022

Benchmark runs are scheduled for baseline = be6ad1c and contender = 1e16e43. 1e16e43 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

What should be returned when trying to get a config in invalid format?

6 participants