Natively support regr_sxx, regr_syy, regr_sxy, regr_slope, regr_intercept, regr_r2 aggregates #4552

@andygrove

Background

Spark provides the SQL standard linear-regression aggregate functions regr_*(y, x), which compute single-pass statistics over rows where both y and x are non-null. They are standard descriptive statistics (same set as PostgreSQL), not ML.
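For reference, the definitions can be sketched as a plain two-pass Python function. This is a minimal sketch using the textbook (PostgreSQL-style) formulas, not Spark's or Comet's code: the function name `regr_reference` and the result keys are chosen for illustration only, and the edge cases shown (null when `sxx` is zero, `regr_r2` of 1 when `syy` is zero) are assumptions that should be checked against the SQL file tests.

```python
from typing import Optional, Sequence, Tuple

# Each row is (y, x), matching the regr_*(y, x) argument order.
Pair = Tuple[Optional[float], Optional[float]]


def regr_reference(rows: Sequence[Pair]) -> dict:
    """Two-pass reference for the regr_* family (illustrative only)."""
    # Null-pair filter: a row counts only if both y and x are non-null.
    kept = [(y, x) for y, x in rows if y is not None and x is not None]
    n = len(kept)
    result = {"regr_count": n}
    names = ("regr_avgx", "regr_avgy", "regr_sxx", "regr_syy",
             "regr_sxy", "regr_slope", "regr_intercept", "regr_r2")
    if n == 0:
        # No qualifying pairs: every statistic except the count is null.
        result.update(dict.fromkeys(names))
        return result

    avgx = sum(x for _, x in kept) / n
    avgy = sum(y for y, _ in kept) / n
    sxx = sum((x - avgx) ** 2 for _, x in kept)
    syy = sum((y - avgy) ** 2 for y, _ in kept)
    sxy = sum((x - avgx) * (y - avgy) for y, x in kept)

    # Assumed edge cases: slope, intercept and r2 are null when x is constant.
    slope = sxy / sxx if sxx != 0 else None
    intercept = avgy - slope * avgx if slope is not None else None
    if sxx == 0:
        r2 = None
    elif syy == 0:
        r2 = 1.0
    else:
        r2 = (sxy * sxy) / (sxx * syy)

    result.update(zip(names, (avgx, avgy, sxx, syy, sxy, slope, intercept, r2)))
    return result
```

For example, calling it on rows drawn from y = 2x - 1 plus two rows with a null on one side ignores the null rows and recovers a slope of 2 and an intercept of -1.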

SQL file tests added in #4551 (spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql) establish the current state empirically:

  • Already accelerated natively: regr_count, regr_avgx, regr_avgy. Spark implements these as RuntimeReplaceableAggregates that lower to Count / Average, which Comet already supports, so they run without any new code.
  • Currently fall back to Spark (this issue): regr_sxx, regr_syy, regr_sxy, regr_slope, regr_intercept, regr_r2. The tests added in #4551 cover these with query spark_answer_only (correctness only).

Proposal

Add native Comet support for the six functions that currently fall back. None of these are greenfield: they all build on the streaming moment accumulators Comet already implements for covar_pop, var_pop, and corr. From Spark's linearRegression.scala:

  • regr_sxy extends Covariance (same accumulator as Comet's covar_pop/covar_samp), with a different final expression.
  • regr_r2 extends PearsonCorrelation (same accumulator as Comet's corr), with a different final expression.
  • regr_slope and regr_intercept are DeclarativeAggregates composing CovPopulation + VariancePop, with a null-pair guard on the variance update.
  • regr_sxx and regr_syy are RuntimeReplaceableAggregates that lower to an internal RegrReplacement declarative aggregate (count times variance over the non-null pairs).
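To illustrate why the mapping above is mostly a matter of final expressions, here is a minimal Python sketch of a streaming co-moment accumulator with several finalizers reading the same state. It is not Comet's or Spark's implementation: the class `CoMoments`, its field names, and the finalizer functions are hypothetical, the update and merge steps use the standard Welford-style pairwise formulas, and the null handling for zero variance is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoMoments:
    """Hypothetical shared state: count, means, and centered (co)moments."""
    n: float = 0.0
    x_avg: float = 0.0
    y_avg: float = 0.0
    ck: float = 0.0    # sum of (x - x_avg) * (y - y_avg)
    x_mk: float = 0.0  # sum of (x - x_avg)^2
    y_mk: float = 0.0  # sum of (y - y_avg)^2

    def update(self, y: Optional[float], x: Optional[float]) -> None:
        # Null-pair guard: skip the row unless both sides are non-null.
        if y is None or x is None:
            return
        self.n += 1.0
        dx = x - self.x_avg
        dy = y - self.y_avg
        self.x_avg += dx / self.n
        self.y_avg += dy / self.n
        self.ck += dx * (y - self.y_avg)
        self.x_mk += dx * (x - self.x_avg)
        self.y_mk += dy * (y - self.y_avg)

    def merge(self, other: "CoMoments") -> None:
        # Combine two partial states (for example, from two partitions).
        if other.n == 0:
            return
        n = self.n + other.n
        dx = other.x_avg - self.x_avg
        dy = other.y_avg - self.y_avg
        w = self.n * other.n / n
        self.ck += other.ck + dx * dy * w
        self.x_mk += other.x_mk + dx * dx * w
        self.y_mk += other.y_mk + dy * dy * w
        self.x_avg += dx * other.n / n
        self.y_avg += dy * other.n / n
        self.n = n


# Different final expressions over the same accumulator state.
def covar_pop(a: CoMoments) -> Optional[float]:
    return a.ck / a.n if a.n > 0 else None


def regr_sxy(a: CoMoments) -> Optional[float]:
    return a.ck if a.n > 0 else None


def regr_sxx(a: CoMoments) -> Optional[float]:
    return a.x_mk if a.n > 0 else None


def regr_slope(a: CoMoments) -> Optional[float]:
    # Assumed: null when there are no pairs or x has zero variance.
    return a.ck / a.x_mk if a.n > 0 and a.x_mk != 0 else None


def regr_r2(a: CoMoments) -> Optional[float]:
    if a.n == 0 or a.x_mk == 0:
        return None
    return 1.0 if a.y_mk == 0 else (a.ck * a.ck) / (a.x_mk * a.y_mk)
```

In this sketch `regr_sxy` is just `covar_pop` without the division by `n`, and `regr_sxx` equals the pair count times the population variance of x, which matches the "count times variance" description above.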

So the work is to wire these aggregate classes through QueryPlanSerde / the aggregate serde and expose the appropriate final expressions over the existing native accumulators, plus match Spark's null-pair filtering semantics exactly.

Acceptance criteria

Notes

These could be tackled incrementally (for example regr_sxy and regr_r2 first, since they map most directly onto the existing covariance and correlation accumulators).
