Skip to content

Conversation

@Jefffrey
Copy link
Contributor

@Jefffrey Jefffrey commented Jan 2, 2026

Which issue does this PR close?

Rationale for this change

Current signature for percentile_cont is quite confusing in what types it accepts. Currently the code suggests it accepts all numeric types, where floats & decimals are maintained as is, integers are cast to float64 internally. However this is misleading as NUMERICS does not contain decimals, so currently everything is cast to float.

Clean up the code to make this more clear and do various other refactors.

What changes are included in this PR?

Use coercion signature to have type coercion coerce types to float so we can remove internal code related to casting to float or handling non-float types.

Various other refactors.

Are these changes tested?

Existing tests.

Are there any user-facing changes?

No.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 2, 2026
Comment on lines +139 to +152
signature: Signature::coercible(
vec![
Coercion::new_implicit(
TypeSignatureClass::Float,
vec![TypeSignatureClass::Numeric],
NativeType::Float64,
),
Coercion::new_implicit(
TypeSignatureClass::Native(logical_float64()),
vec![TypeSignatureClass::Numeric],
NativeType::Float64,
),
],
Volatility::Immutable,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are much clearer that we expect all expressions to be float64's, letting type coercion handle the casting for use so we can remove internal casts

}
}

fn create_accumulator(&self, args: &AccumulatorArgs) -> Result<Box<dyn Accumulator>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inlining this

"percentile_cont does not support input type {}, must be numeric",
dt
),
DataType::Null => Ok(DataType::Float64),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null types are a little annoying to handle, see #19458

Comment on lines +225 to +232
DataType::Float16 => Ok(Box::new(DistinctPercentileContAccumulator::<
Float16Type,
>::new(percentile))),
DataType::Float32 => Ok(Box::new(DistinctPercentileContAccumulator::<
Float32Type,
>::new(percentile))),
DataType::Float64 => Ok(Box::new(DistinctPercentileContAccumulator::<
Float64Type,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much more clear what the internal types we operate on are now

/// in the final evaluation step so that we avoid expensive conversions and
/// allocations during `update_batch`.
struct PercentileContAccumulator<T: ArrowNumericType> {
data_type: DataType,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing data_type from all accumulators; this was only needed if T was ever a decimal type, to maintain precision/scale. However there was no valid way to have decimals come through to the accumulator anyway, and we want to cast to float anyway, so we can refactor this away

// Build the result list array
let list_array = ListArray::new(
Arc::new(Field::new_list_field(self.data_type.clone(), true)),
Arc::new(Field::new_list_field(T::DATA_TYPE, true)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't need to consider decimals we can just replace usages with T::DATA_TYPE


fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> {
// Cast to target type if needed (e.g., integer to Float64)
let values = if values[0].data_type() != &self.data_type {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the internal casts we're removing; now input arguments should already be of the right types via type coercion

@Jefffrey Jefffrey marked this pull request as ready for review January 2, 2026 15:05
let input_dt = args.expr_fields[0].data_type();
if input_dt.is_null() {
return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    let input_dt = args.expr_fields[0].data_type();
    if input_dt.is_null() {
        return Ok(Box::new(NoopAccumulator::new(ScalarValue::Float64(None))));
    }

    let percentile = get_percentile(&args)?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do early return here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes more sense to validate the percentile regardless of if datatype is null or not, otherwise could lead to permitting select percentile_cont(null, 2.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see now, validate is part of get_percentile, makes sense, we can prob optimize it in future so function will do mandatory validation however calc percentile for non null dtypes.

);

let percentile = validate_percentile_expr(&args.exprs[1], "PERCENTILE_CONT")?;
let percentile = get_percentile(&args)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we dont need to handle null type here the same as in accumulator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simplicity I made groups_accumulator_supported return false if we have a null datatype input

@adriangb adriangb self-requested a review January 2, 2026 22:45
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Jefffrey

@Jefffrey Jefffrey added this pull request to the merge queue Jan 6, 2026
Merged via the queue into apache:main with commit ff38480 Jan 6, 2026
28 checks passed
@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 6, 2026

Thanks @comphead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants