@dtenedor
Contributor

@dtenedor dtenedor commented Nov 11, 2025

What changes were proposed in this pull request?

This PR allows aggregate functions and GROUP BY to be used in |> SELECT pipe operators. Previously, these were only allowed in |> AGGREGATE pipe operators.

Example queries now supported:

-- Aggregate in SELECT
table employees |> select sum(salary) as total_salary;

-- Aggregate with GROUP BY
table orders |> select customer_id, count(*) as order_count group by customer_id;

-- Chained operations
table data |> where status = 'active' |> select sum(value) as total;
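
For comparison, the queries above could previously be written only with the |> AGGREGATE operator. The following is a minimal sketch of those forms, assuming the same example tables (employees, orders, data) and the |> AGGREGATE syntax described in the Spark SQL pipe syntax documentation. With this change, both spellings are expected to return the same results, though that equivalence is an assumption here rather than something shown in this PR's test output:

```sql
-- Counterpart of: table employees |> select sum(salary) as total_salary;
table employees |> aggregate sum(salary) as total_salary;

-- Counterpart of the GROUP BY example. |> AGGREGATE returns the grouping
-- column (customer_id) first, followed by the aggregate.
table orders |> aggregate count(*) as order_count group by customer_id;

-- Counterpart of the chained example.
table data |> where status = 'active' |> aggregate sum(value) as total;
```

Note that in the |> AGGREGATE form the grouping column is not repeated in the aggregate list, whereas the |> SELECT form lists customer_id explicitly alongside GROUP BY.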

Why are the changes needed?

By lifting this restriction (with an opt-out mechanism), we make the SQL pipe operator syntax more intuitive while maintaining backwards compatibility.

Does this PR introduce any user-facing change?

Yes, but it is backwards compatible:

  • Previously failing queries now succeed: Queries using aggregate functions in |> SELECT will now work instead of throwing PIPE_OPERATOR_CONTAINS_AGGREGATE_FUNCTION errors
  • All previously succeeding queries continue to work: No regression; queries using |> AGGREGATE or non-aggregate pipe operators are unaffected

Backwards Compatibility Guarantee:

  • ✅ No queries that worked before will break
  • ✅ Only queries that previously failed will now succeed

How was this patch tested?

  1. Unit Tests: Added comprehensive test coverage in pipe-operators.sql:

    • Positive tests: aggregates in SELECT, with WHERE, with chaining, with GROUP BY
    • Negative tests: aggregates in WHERE (still fail as expected)
    • Regression tests: verified |> AGGREGATE still works correctly

  2. Golden Files: Regenerated and verified pipe-operators.sql.out and analyzer results

  3. Test Execution: All tests pass successfully.

Was this patch authored or co-authored using generative AI tooling?

Yes, claude-4.5-sonnet with manual editing and approval.

@github-actions github-actions bot added the SQL label Nov 11, 2025
@dtenedor dtenedor changed the title from "commit" to "[SPARK-54292][SQL] Create a configuration to support aggregation in |> SELECT pipe operators" Nov 11, 2025
@dtenedor
Contributor Author

cc @srielau @cloud-fan

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for adding this new feature, @dtenedor.

@dongjoon-hyun
Member

Could you update this PR, @dtenedor?

.doc("When true, aggregate functions can be used in |> SELECT and other pipe operator " +
  "clauses without requiring the |> AGGREGATE keyword. When false, aggregate functions " +
  "must be used exclusively with the |> AGGREGATE clause for proper aggregation semantics.")
.version("4.2.0")
Member

Got it.

@dtenedor
Contributor Author

This one is passing CI and seems self-contained and safe, do you approve?

Member

@dongjoon-hyun dongjoon-hyun left a comment

Sure, +1, LGTM (for Apache Spark 4.2.0). Thank you, @dtenedor.

@dtenedor
Contributor Author

OK, great. I will merge it to the master branch only, for Apache Spark 4.2.0 and later. Thank you for the review!

if (ctx.aggregationClause != null && !conf.pipeOperatorAllowAggregateInSelect) {
  operationNotAllowed(
    "|> SELECT with a GROUP BY clause is not allowed when " +
      "spark.sql.allowAggregateInSelectWithPipeOperator is disabled", ctx)
Contributor

@cloud-fan cloud-fan Nov 14, 2025

This makes me wonder why we need the config. If people don't use aggregate functions in pipe SELECT, then the config never matters to them. If they do use them, they will see this error message and enable the config immediately.

cc @srielau

Contributor Author

I had the chance to speak with Wenchen in person about this -- it is a good point, the config doesn't really help in any scenario here. I will remove it.

@dtenedor dtenedor changed the title from "[SPARK-54292][SQL] Create a configuration to support aggregation in |> SELECT pipe operators" to "[SPARK-54292][SQL] Support aggregate functions and GROUP BY in |> SELECT pipe operators" Nov 14, 2025
@dtenedor dtenedor requested a review from cloud-fan November 14, 2025 19:33
@dtenedor
Contributor Author

Thanks all for the reviews -- merging to master.

@dtenedor dtenedor closed this in c5c65d2 Nov 14, 2025