@kumarUjjawal
Contributor

Which issue does this PR close?

Rationale for this change

Literal 0/1 percentiles don’t need percentile buffering; using min/max keeps results identical.

What changes are included in this PR?

  • Add a simplify hook so percentile_cont(..., 0|1) rewrites to min/max, preserving distinct/filter/null handling and casting ints to Float64.
  • Add targeted tests for the rewrite and for the no‑rewrite path.
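
The equivalence behind the rewrite can be checked with a small model. This is a minimal sketch in plain Rust, not DataFusion's implementation: `percentile_cont` and `min_max` here are hypothetical stand-ins that operate on an `f64` slice and ignore nulls, DISTINCT, and FILTER. With linear interpolation, a percentile of 0 lands exactly on the first sorted value and a percentile of 1 on the last, so no interpolation (and no buffering of the full input) is needed.

```rust
/// Reference percentile_cont over a slice, using linear interpolation
/// between the two nearest ranks. Returns None for empty input or for
/// a percentile outside [0, 1].
fn percentile_cont(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() || !(0.0..=1.0).contains(&p) {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Fractional rank in the sorted data: 0 maps to the first element,
    // 1 maps to the last element.
    let rank = p * (sorted.len() - 1) as f64;
    let lower = rank.floor() as usize;
    let upper = rank.ceil() as usize;
    let fraction = rank - lower as f64;
    Some(sorted[lower] + (sorted[upper] - sorted[lower]) * fraction)
}

/// Single-pass min and max, with no sorting or buffering.
fn min_max(values: &[f64]) -> Option<(f64, f64)> {
    let min = values.iter().copied().reduce(f64::min)?;
    let max = values.iter().copied().reduce(f64::max)?;
    Some((min, max))
}
```

For any non-empty input, `percentile_cont(v, 0.0)` matches the minimum from `min_max(v)` and `percentile_cont(v, 1.0)` matches the maximum.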

Are these changes tested?

Added tests

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Nov 20, 2025
}

#[cfg(test)]
mod tests {
Contributor

I suggest adding some tests in sqllogictest: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

They should run SQL queries to which this optimization applies: first check that the results are as expected, then run EXPLAIN to confirm that the optimization is actually applied.

In fact, we can move most of the test coverage to sqllogictests instead of keeping unit tests here. The reasons are:

  1. SQL tests are simpler to maintain
  2. The SQL interface is more stable, while internal APIs may change frequently. As a result, good test coverage here can easily get lost during refactoring.

Contributor Author

I kept the unit tests along with the new SQL tests in sqllogictest. Should I remove the unit tests, or is it okay to keep them?

Contributor

We should remove the unit tests if they duplicate the sqllogictests

Contributor

> We should remove the unit tests if they duplicate the sqllogictests

+1, unless there is something that can't be covered by slt tests.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 20, 2025
kumarUjjawal and others added 3 commits November 20, 2025 21:06
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Contributor

@Jefffrey Jefffrey left a comment

Thanks for picking this up. I have a few suggestions to simplify the code.


Comment on lines +396 to +398
if params.args.len() != 2 {
    return Ok(original_expr);
}
Contributor

Suggested change
- if params.args.len() != 2 {
-     return Ok(original_expr);
- }
+ let [value, percentile] = take_function_args("percentile_cont", &params.args)?;

More ergonomic this way; technically this error path should never occur as the signature should already guard us by now.
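
To illustrate why this reads better, here is a minimal sketch of the pattern in plain Rust. `take_args` is a hypothetical stand-in written for this example, not DataFusion's `take_function_args`, and it uses `String` as the error type for simplicity. The idea is to convert the argument list into a fixed-size array so the caller can destructure it in one line, with the arity check folded into the conversion.

```rust
use std::convert::TryInto;

/// Hypothetical helper: collect the arguments into a fixed-size array,
/// or return an error naming the function and the expected arity.
fn take_args<T, const N: usize>(
    name: &str,
    args: impl IntoIterator<Item = T>,
) -> Result<[T; N], String> {
    let args: Vec<T> = args.into_iter().collect();
    let found = args.len();
    // Vec<T> converts into [T; N] only when the lengths match.
    args.try_into()
        .map_err(|_| format!("{name} expects {N} arguments, got {found}"))
}
```

The array pattern at the call site fixes `N`, so `let [value, percentile] = take_args("percentile_cont", &args)?;` both checks the length and binds the two arguments.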

Comment on lines +501 to +503
Expr::Alias(alias) => extract_percentile_literal(alias.expr.as_ref()),
Expr::Cast(cast) => extract_percentile_literal(cast.expr.as_ref()),
Expr::TryCast(cast) => extract_percentile_literal(cast.expr.as_ref()),
Contributor

How strictly necessary are these other arms? Is checking only for Literal not sufficient?

(value - target).abs() < PERCENTILE_LITERAL_EPSILON
}

fn percentile_cont_result_type(input_type: &DataType) -> Option<DataType> {
Contributor

We should reuse the code from return_type if possible instead of duplicating it here

fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
    if !arg_types[0].is_numeric() {
        return plan_err!("percentile_cont requires numeric input types");
    }
    // PERCENTILE_CONT performs linear interpolation and should return a float type
    // For integer inputs, return Float64 (matching PostgreSQL/DuckDB behavior)
    // For float inputs, preserve the float type
    match &arg_types[0] {
        DataType::Float16 | DataType::Float32 | DataType::Float64 => {
            Ok(arg_types[0].clone())
        }
        DataType::Decimal32(_, _)
        | DataType::Decimal64(_, _)
        | DataType::Decimal128(_, _)
        | DataType::Decimal256(_, _) => Ok(arg_types[0].clone()),
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64
        | DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64 => Ok(DataType::Float64),
        // Shouldn't happen due to signature check, but just in case
        dt => plan_err!(
            "percentile_cont does not support input type {}, must be numeric",
            dt
        ),
    }
}

Comment on lines +469 to +471
fn nearly_equals_fraction(value: f64, target: f64) -> bool {
    (value - target).abs() < PERCENTILE_LITERAL_EPSILON
}
Contributor

I'm personally of the mind to check directly against 0.0 and 1.0 instead of doing an epsilon check; I think it's more likely a user would input an expr like SELECT percentile_cont(column1, 0.0) than doing something like SELECT percentile_cont(column1, expr) where expr might be some math that could make it 0.0000001 🤔
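
A minimal sketch of the exact-comparison approach, for illustration only. `RewriteTarget` and `classify_exact` are hypothetical names, and the handling of `is_descending` is an assumption: under a descending sort, percentile 0 selects the largest value, so the targets swap.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum RewriteTarget {
    Min,
    Max,
}

/// Decide whether a literal percentile can be rewritten to min or max.
/// Only exact 0.0 and 1.0 qualify; anything else keeps percentile_cont.
fn classify_exact(percentile: f64, is_descending: bool) -> Option<RewriteTarget> {
    let target = if percentile == 0.0 {
        RewriteTarget::Min
    } else if percentile == 1.0 {
        RewriteTarget::Max
    } else {
        return None;
    };
    // Assumption: a descending sort order flips which end each percentile hits.
    Some(match (target, is_descending) {
        (RewriteTarget::Min, true) => RewriteTarget::Max,
        (RewriteTarget::Max, true) => RewriteTarget::Min,
        (unchanged, false) => unchanged,
    })
}
```

With this version a value such as `0.0000001` is not rewritten and goes through the regular percentile path, which avoids silently changing the result for percentiles that are close to, but not exactly, 0 or 1.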

Comment on lines +417 to +420
let input_type = match info.get_data_type(&value_expr) {
    Ok(data_type) => data_type,
    Err(_) => return Ok(original_expr),
};
Contributor

Suggested change
- let input_type = match info.get_data_type(&value_expr) {
-     Ok(data_type) => data_type,
-     Err(_) => return Ok(original_expr),
- };
+ let input_type = info.get_data_type(&value_expr)?;

Comment on lines +432 to +435
let mut agg_arg = value_expr;
if expected_return_type != input_type {
    agg_arg = Expr::Cast(Cast::new(Box::new(agg_arg), expected_return_type.clone()));
}
Contributor

Can we explain why this is necessary in a comment here?
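
One reading of the cast, based on the PR description ("casting ints to Float64") and the `return_type` shown in this thread: `percentile_cont` returns Float64 for integer inputs, so a rewrite that left the argument as an integer would change the output type of the expression. The sketch below models this with a hypothetical three-variant `DataType` enum and assumes that min/max return their input type unchanged.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Float32,
    Float64,
}

/// Simplified model of percentile_cont's return type:
/// integers widen to Float64, floats are preserved.
fn percentile_cont_return_type(input: &DataType) -> DataType {
    match input {
        DataType::Int64 => DataType::Float64,
        other => other.clone(),
    }
}

/// Assumption for this sketch: min/max return the same type as their input.
fn min_max_return_type(input: &DataType) -> DataType {
    input.clone()
}

/// The argument needs a cast whenever the rewrite would otherwise
/// change the type the original expression produced.
fn needs_cast(input: &DataType) -> bool {
    percentile_cont_return_type(input) != min_max_return_type(input)
}
```

Under these assumptions an Int64 column needs the cast before min/max is applied, while Float32 and Float64 columns do not.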

Comment on lines +411 to +414
let rewrite_target = match classify_rewrite_target(percentile_value, is_descending) {
    Some(target) => target,
    None => return Ok(original_expr),
};
Contributor

I feel this should be folded directly into line 400 above, instead of splitting it like this

}
}

fn literal_scalar_to_f64(value: &ScalarValue) -> Option<f64> {
Contributor

Can we have percentiles that are not of type Float64? I thought the signature guarded us against this.

pub fn new() -> Self {
    let mut variants = Vec::with_capacity(NUMERICS.len());
    // Accept any numeric value paired with a float64 percentile
    for num in NUMERICS {
        variants.push(TypeSignature::Exact(vec![num.clone(), DataType::Float64]));
    }
    Self {
        signature: Signature::one_of(variants, Volatility::Immutable)
            .with_parameter_names(vec!["expr".to_string(), "percentile".to_string()])
            .expect("valid parameter names for percentile_cont"),
        aliases: vec![String::from("quantile_cont")],
    }
}

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Simplify percentile_cont to min/max when percentile is 0 or 1

4 participants