[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe operator token #50284

dtenedor · 2025-03-14T21:04:46Z

What changes were proposed in this pull request?

This PR supports | as an alternative to |> for the SQL pipe operator token.

For example, this is now supported:

from t
| select x, y;

as an alternative to this:

from t
|> select x, y;

The change is controlled by a new Spark configuration SQLConf.SINGLE_CHARACTER_OPERATOR_PIPE_TOKEN_ENABLED (spark.sql.singleCharacterOperatorPipeTokenEnabled).

Background:

We talked with Jeff Shute, the author of the SQL pipe syntax paper from Google.
He mentions that they went with |> because all implementing engines can support it without any issues with their parsers.
Specifically, Google uses an LALR parser [1] which makes it impossible to use | as the token due to ambiguity with bit operations. They won't be able to add support for this.

It seems like a growing consensus in the industry is that we should all support |> as the primary token, but some engines may also decide to support alternative tokens in addition. This blog [2] describes the situation and mentions how another engine decided to go this direction.

[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl

[2] https://superdb.org/docs/language/pipe-ambiguity/

Why are the changes needed?

This is a simple and safe change and provides syntax compatibility with other languages that use | for this purpose such as Splunk SPL and Kusto.

Does this PR introduce any user-facing change?

Yes, per above. Note this change is fully backwards-compatible.

How was this patch tested?

This PR provides unit tests and golden file tests.

Was this patch authored or co-authored using generative AI tooling?

No.

dtenedor · 2025-03-14T21:06:23Z

cc @cloud-fan @gengliangwang

dongjoon-hyun

I understand your motivation, and that the current syntax is due to a limitation of Google's system.

However, I'm a little negative about this approach because of compatibility with the Google pipe syntax. Can we hold off on this addition until Google supports the new syntax, @dtenedor?

dtenedor · 2025-03-17T16:47:08Z

Hi @dongjoon-hyun, this is a good question. We talked with Jeff Shute, the author of the SQL pipe syntax paper from Google. He mentions that they went with |> because all implementing engines can support it without any issues with their parsers. Specifically, Google uses an LALR parser [1] which makes it impossible to use | as the token due to ambiguity with bit operations. They won't be able to add support for this.

It seems like a growing consensus in the industry is that we should all support |> as the primary token, but some engines may also decide to support alternative tokens in addition. This blog [2] describes the situation and mentions how another engine decided to go this direction. Please let us know your thoughts on this -- we can certainly hold off on merging this PR until we figure out together what we want the plan to be.

[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl

[2] https://superdb.org/docs/language/pipe-ambiguity/

dongjoon-hyun · 2025-03-17T19:14:20Z

Thank you for the pointers. Yes, that's exactly what I was worried about.

I believe `|>` is the better choice for long-term interoperability while the ecosystem is growing. We cannot keep adding tokens one by one every time a new system highlights a new token as its feature.

However, let me remove my review comment here for now.

Since this is an Apache Spark 4.1.0 feature, we have more time to discuss. So, I'm lifting my previous change request so that this PR gets more reviews.

dongjoon-hyun · 2025-03-17T19:17:03Z

If you don't mind, please revise the PR description with #50284 (comment), @dtenedor. Thank you for making more progress in this area.

dtenedor · 2025-03-18T19:22:36Z

I found a parsing ambiguity with this syntax that the research paper mentions:

Making a note here to add testing for this. Since this example is completely ambiguous, we'll have to decide which application of the | token we would prefer to take precedence.

dongjoon-hyun · 2025-03-18T21:15:28Z

Yes, right. For the record, I also read that paper, of course, while I reviewed your original SQL Pipe Syntax PR. Thank you for considering adding them as test cases.

Making a note here to add testing for this.

dtenedor · 2025-03-21T20:47:26Z

I looked at this some more, and it seems we will have to put some effort into the parser to differentiate this case and resolve the ambiguity. It should be possible, but it will take some extra work and investigation. I'll do some experimentation, which might take some time. To help with review, I'll close this PR for now to keep it off the "open Spark PRs" list, and re-open it later when the extra testing is available.

… operator token

### What changes were proposed in this pull request?

This PR adds support for `|` as an alternative to `|>` for the SQL pipe operator token.

For example, this is now supported:

```sql
table t
| select x, y
| where x < 2;
```

as an alternative to:

```sql
table t
|> select x, y
|> where x < 2;
```

The implementation uses a semantic predicate with 2-token lookahead (`isOperatorPipeStart()`) to disambiguate between:

- **Pipe operators**: When `|` is followed by corresponding keywords like `SELECT`, `WHERE`, `EXTEND`, `JOIN`, etc.
- **Bitwise OR**: When `|` is part of an expression (e.g., `col1 | col2`)

This approach ensures that existing SQL queries using `|` for bitwise OR operations continue to work without any changes. The only exception is when column names match pipe operator keywords (e.g., `col1 | select` where `select` is a column name). An analysis of existing SQL usage found no instances of bitwise OR operations being used in this way.

The new behavior is controlled by a configuration toggle `spark.sql.parser.singleCharacterPipeOperator.enabled`.

### Why are the changes needed?

This provides syntax compatibility with other languages that use `|` for pipe operations, such as:

- Splunk SPL
- Kusto (KQL)
- Unix shell pipes

**Background:**

We previously attempted this in #50284 but abandoned that approach because it inadvertently broke bitwise OR expression usage. After further investigation, we've developed a solution using ANTLR semantic predicates that properly disambiguates the two contexts.

As discussed in that PR:

- Jeff Shute (author of the SQL pipe syntax paper from Google) confirmed that Google uses an LALR parser which makes it impossible for them to support `|` due to ambiguity with bitwise operations
- There is growing industry consensus that `|>` should be the primary/universal token, but engines may optionally support additional tokens
- This approach aligns with how other databases have addressed this (see https://superdb.org/docs/language/pipe-ambiguity/)

Spark's use of ANTLR (not LALR) enables us to support both tokens through lookahead-based disambiguation.

### Does this PR introduce _any_ user-facing change?

Yes, users can now use `|` as a more concise alternative to `|>` for pipe operators.

### How was this patch tested?

This PR includes comprehensive test coverage in `pipe-operators.sql`.

### Was this patch authored or co-authored using generative AI tooling?

Yes, `claude-4.5-sonnet` with manual review and editing.

Closes #52983 from dtenedor/pipe-syntax-single-char.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
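To make the lookahead idea in the commit message above concrete, here is a minimal Python sketch of how a `|` token might be classified by peeking at the token that follows it. This is not Spark's parser code: the tokenizer, the `classify_pipes` helper, and the Python function `is_operator_pipe_start` (named after the `isOperatorPipeStart()` predicate mentioned above) are hypothetical, and the keyword set contains only the four keywords the message names, so it is deliberately incomplete.

```python
import re
from typing import List

# Assumption: only the keywords named in the commit message; the real list is longer.
PIPE_OPERATOR_KEYWORDS = {"SELECT", "WHERE", "EXTEND", "JOIN"}

# Order matters: try the two-character token `|>` before the single `|`.
TOKEN_PATTERN = re.compile(r"\|>|\||[A-Za-z_][A-Za-z_0-9]*|\d+|[^\s\w]")


def tokenize(sql: str) -> List[str]:
    """Very rough tokenizer, good enough for this illustration only."""
    return TOKEN_PATTERN.findall(sql)


def is_operator_pipe_start(tokens: List[str], i: int) -> bool:
    """Hypothetical predicate: token i is `|` and token i+1 is a pipe keyword."""
    if tokens[i] != "|" or i + 1 >= len(tokens):
        return False
    return tokens[i + 1].upper() in PIPE_OPERATOR_KEYWORDS


def classify_pipes(sql: str) -> List[str]:
    """Label each `|>` or `|` in the input as 'pipe' or 'bitwise_or'."""
    tokens = tokenize(sql)
    labels = []
    for i, token in enumerate(tokens):
        if token == "|>":
            labels.append("pipe")
        elif token == "|":
            labels.append("pipe" if is_operator_pipe_start(tokens, i) else "bitwise_or")
    return labels


if __name__ == "__main__":
    print(classify_pipes("table t | select x, y | where x < 2;"))
    print(classify_pipes("select col1 | col2 from t"))
    # The documented exception: a column named like a pipe keyword is read as a pipe.
    print(classify_pipes("select col1 | select from t"))
```

The last call shows the exception the message describes: because the decision rests only on the token after `|`, an operand whose name matches a pipe keyword is treated as the start of a pipe operator rather than as a bitwise OR operand.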


commit

433e381

github-actions bot added the SQL label Mar 14, 2025

dongjoon-hyun previously requested changes Mar 15, 2025


dtenedor closed this Mar 21, 2025

dtenedor mentioned this pull request Nov 10, 2025

[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe operator token #52983

Closed
