-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Refactor percentile_cont to clarify support input types
#19611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| signature: Signature::coercible( | ||
| vec![ | ||
| Coercion::new_implicit( | ||
| TypeSignatureClass::Float, | ||
| vec![TypeSignatureClass::Numeric], | ||
| NativeType::Float64, | ||
| ), | ||
| Coercion::new_implicit( | ||
| TypeSignatureClass::Native(logical_float64()), | ||
| vec![TypeSignatureClass::Numeric], | ||
| NativeType::Float64, | ||
| ), | ||
| ], | ||
| Volatility::Immutable, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are much clearer that we expect all expressions to be float64's, letting type coercion handle the casting for use so we can remove internal casts
| } | ||
| } | ||
|
|
||
| fn create_accumulator(&self, args: &AccumulatorArgs) -> Result<Box<dyn Accumulator>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inlining this
| "percentile_cont does not support input type {}, must be numeric", | ||
| dt | ||
| ), | ||
| DataType::Null => Ok(DataType::Float64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null types are a little annoying to handle, see #19458
| DataType::Float16 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float16Type, | ||
| >::new(percentile))), | ||
| DataType::Float32 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float32Type, | ||
| >::new(percentile))), | ||
| DataType::Float64 => Ok(Box::new(DistinctPercentileContAccumulator::< | ||
| Float64Type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much more clear what the internal types we operate on are now
| /// in the final evaluation step so that we avoid expensive conversions and | ||
| /// allocations during `update_batch`. | ||
| struct PercentileContAccumulator<T: ArrowNumericType> { | ||
| data_type: DataType, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing data_type from all accumulators; this was only needed if T was ever a decimal type, to maintain precision/scale. However there was no valid way to have decimals come through to the accumulator anyway, and we want to cast to float anyway, so we can refactor this away
| // Build the result list array | ||
| let list_array = ListArray::new( | ||
| Arc::new(Field::new_list_field(self.data_type.clone(), true)), | ||
| Arc::new(Field::new_list_field(T::DATA_TYPE, true)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we don't need to consider decimals we can just replace usages with T::DATA_TYPE
|
|
||
| fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { | ||
| // Cast to target type if needed (e.g., integer to Float64) | ||
| let values = if values[0].data_type() != &self.data_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the internal casts we're removing; now input arguments should already be of the right types via type coercion
| let input_dt = args.expr_fields[0].data_type(); | ||
| if input_dt.is_null() { | ||
| return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None)))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let input_dt = args.expr_fields[0].data_type();
if input_dt.is_null() {
return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
}
let percentile = get_percentile(&args)?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can do early return here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to validate the percentile regardless of if datatype is null or not, otherwise could lead to permitting select percentile_cont(null, 2.0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see now, validate is part of get_percentile, makes sense, we can prob optimize it in future so function will do mandatory validation however calc percentile for non null dtypes.
| ); | ||
|
|
||
| let percentile = validate_percentile_expr(&args.exprs[1], "PERCENTILE_CONT")?; | ||
| let percentile = get_percentile(&args)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we dont need to handle null type here the same as in accumulator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simplicity I made groups_accumulator_supported return false if we have a null datatype input
comphead
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Jefffrey
|
Thanks @comphead |
Which issue does this PR close?
NUMERICS/INTEGERSindatafusion/expr-common/src/type_coercion/aggregates.rs#18092Rationale for this change
Current signature for
percentile_contis quite confusing in what types it accepts. Currently the code suggests it accepts all numeric types, where floats & decimals are maintained as is, integers are cast to float64 internally. However this is misleading asNUMERICSdoes not contain decimals, so currently everything is cast to float.Clean up the code to make this more clear and do various other refactors.
What changes are included in this PR?
Use coercion signature to have type coercion coerce types to float so we can remove internal code related to casting to float or handling non-float types.
Various other refactors.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.