Background

Spark provides the SQL-standard linear-regression aggregate functions `regr_*(y, x)`, which compute single-pass statistics over rows where both `y` and `x` are non-null. They are standard descriptive statistics (the same set as PostgreSQL), not ML.

SQL file tests added in #4551 (`spark/src/test/resources/sql-tests/expressions/aggregate/regr.sql`) establish the current state empirically:

- Already accelerated natively: `regr_count`, `regr_avgx`, `regr_avgy`. Spark implements these as `RuntimeReplaceableAggregate`s that lower to `Count` / `Average`, which Comet already supports, so they run without any new code.
- Currently fall back to Spark: `regr_sxx`, `regr_syy`, `regr_sxy`, `regr_slope`, `regr_intercept`, `regr_r2`. In #4551 (test: add SQL file tests for regr_* linear-regression aggregates) these are covered with `query spark_answer_only` (correctness only).

Proposal
Add native Comet support for the six functions that currently fall back. None of these are greenfield: they all build on the streaming moment accumulators Comet already implements for `covar_pop`, `var_pop`, and `corr`. From Spark's `linearRegression.scala`:

- `regr_sxy` extends `Covariance` (same accumulator as Comet's `covar_pop` / `covar_samp`), with a different final expression.
- `regr_r2` extends `PearsonCorrelation` (same accumulator as Comet's `corr`), with a different final expression.
- `regr_slope` and `regr_intercept` are `DeclarativeAggregate`s composing `CovPopulation` + `VariancePop`, with a null-pair guard on the variance update.
- `regr_sxx` and `regr_syy` are `RuntimeReplaceableAggregate`s that lower to an internal `RegrReplacement` declarative aggregate (count times variance over the non-null pairs).

So the work is to wire these aggregate classes through `QueryPlanSerde` / the aggregate serde, expose the appropriate final expressions over the existing native accumulators, and match Spark's null-pair filtering semantics exactly.

Acceptance criteria
- The six functions execute natively in Comet and match Spark.
- Update `expressions/aggregate/regr.sql` (from #4551) to switch these queries from `query spark_answer_only` back to the default `query` mode so native execution is asserted.

Notes
These could be tackled incrementally (for example, `regr_sxy` and `regr_r2` first, since they map most directly onto the existing covariance and correlation accumulators).
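
To make the "same accumulator, different final expression" idea concrete, here is a minimal, self-contained Rust sketch. It is not Comet's or DataFusion's actual code: `RegrAccumulator` and its methods are hypothetical names used only for illustration. It keeps one set of streaming co-moments, applies the null-pair guard in `update`, and derives the six statistics from the usual SQL definitions (`regr_sxx = count * var_pop(x)`, `regr_slope = covar_pop(y, x) / var_pop(x)`, and so on). The edge cases (empty input, constant `x`, constant `y`) follow the common SQL convention and are an assumption here; Spark's exact behaviour should be checked against `linearRegression.scala`.

```rust
/// Hypothetical single-pass accumulator over (y, x) pairs, using
/// Welford-style updates for the means and (co-)moments.
#[derive(Debug, Default, Clone, Copy)]
struct RegrAccumulator {
    n: u64,      // number of pairs where both y and x are non-null
    mean_x: f64, // running mean of x over those pairs
    mean_y: f64, // running mean of y over those pairs
    m2_x: f64,   // sum of squared deviations of x (= count * var_pop(x))
    m2_y: f64,   // sum of squared deviations of y (= count * var_pop(y))
    c_xy: f64,   // sum of co-deviations (= count * covar_pop(y, x))
}

impl RegrAccumulator {
    /// Null-pair guard: a row contributes only if both y and x are non-null.
    fn update(&mut self, y: Option<f64>, x: Option<f64>) {
        let (Some(y), Some(x)) = (y, x) else { return };
        self.n += 1;
        let n = self.n as f64;
        let dx = x - self.mean_x; // deviation from the old mean of x
        let dy = y - self.mean_y; // deviation from the old mean of y
        self.mean_x += dx / n;
        self.mean_y += dy / n;
        self.m2_x += dx * (x - self.mean_x);
        self.m2_y += dy * (y - self.mean_y);
        self.c_xy += dx * (y - self.mean_y);
    }

    fn count(&self) -> u64 {
        self.n
    }

    // The finalizers below read the same state; only the final expression differs.
    // Assumption: every statistic is NULL (None) when no pair was seen.
    fn sxx(&self) -> Option<f64> {
        (self.n > 0).then_some(self.m2_x)
    }

    fn syy(&self) -> Option<f64> {
        (self.n > 0).then_some(self.m2_y)
    }

    fn sxy(&self) -> Option<f64> {
        (self.n > 0).then_some(self.c_xy)
    }

    /// covar_pop(y, x) / var_pop(x); the 1/n factors cancel.
    /// Assumption: NULL when x has zero variance.
    fn slope(&self) -> Option<f64> {
        if self.n == 0 || self.m2_x == 0.0 {
            None
        } else {
            Some(self.c_xy / self.m2_x)
        }
    }

    /// avg(y) - slope * avg(x).
    fn intercept(&self) -> Option<f64> {
        self.slope().map(|b| self.mean_y - b * self.mean_x)
    }

    /// Squared correlation. Assumption: NULL for constant x, 1 for constant y.
    fn r2(&self) -> Option<f64> {
        if self.n == 0 || self.m2_x == 0.0 {
            None
        } else if self.m2_y == 0.0 {
            Some(1.0)
        } else {
            Some(self.c_xy * self.c_xy / (self.m2_x * self.m2_y))
        }
    }
}
```

Feeding the pairs `(2, 1)`, `(4, 2)`, `(6, 3)` together with rows where either side is null leaves `count()` at 3, because the null rows are skipped. Since `y = 2x` for the remaining pairs, `slope()` evaluates to 2, `intercept()` to 0, and `r2()` to 1. A real implementation would also need a merge step for partial aggregates, which this sketch omits.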