[SPARK-51518][SQL] Support | as an alternative to |> for the SQL pipe operator token #50284
dongjoon-hyun
left a comment
I understand your motivation, and that the current syntax is due to a limitation of Google's system.
However, I'm a little negative on this approach because of compatibility with the Google pipe syntax. Can we hold off on this addition until Google supports this new syntax, @dtenedor?
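Hi @dongjoon-hyun, this is a good question. We talked with Jeff Shute, the author of the SQL pipe syntax paper from Google. He mentions that they went with `|>` because all implementing engines can support it without any issues with their parsers. It seems like a growing consensus in the industry is that we should all support `|>` as the primary token, but some engines may also decide to support alternative tokens in addition.
[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl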
Thank you for the pointers. Yes, that's exactly what I was worried about. I believe `|>` is a better choice for long-term interoperability while the ecosystem is growing. We cannot keep adding tokens one by one every time a new system highlights a new token as its own feature. However, let me remove my review comment here for now.
Since this is an Apache Spark 4.1.0 feature, we have more time to discuss. So, I'm lifting my previous change request so that this PR gets more reviews.
If you don't mind, please revise the PR description with #50284 (comment), @dtenedor. Thank you for making more progress in this area.
Yes, right. For the record, I also read that paper, of course, while I reviewed your original SQL Pipe Syntax PR. Thank you for considering adding them as test cases.
I looked at this some more, and it seems we will have to put some effort into the parser to differentiate this case and resolve the ambiguity. It should be possible, but it will take some extra work and investigation. I'll do some experimentation, which might take some time, so I'll close this PR for now to keep it off the "open Spark PRs" list and re-open it later when the extra testing is available, in order to help with review.
… operator token

### What changes were proposed in this pull request?

This PR adds support for `|` as an alternative to `|>` for the SQL pipe operator token. For example, this is now supported:

```sql
table t | select x, y | where x < 2;
```

as an alternative to:

```sql
table t |> select x, y |> where x < 2;
```

The implementation uses a semantic predicate with 2-token lookahead (`isOperatorPipeStart()`) to disambiguate between:

- **Pipe operators**: When `|` is followed by corresponding keywords like `SELECT`, `WHERE`, `EXTEND`, `JOIN`, etc.
- **Bitwise OR**: When `|` is part of an expression (e.g., `col1 | col2`)

This approach ensures that existing SQL queries using `|` for bitwise OR operations continue to work without any changes. The only exception is when column names match pipe operator keywords (e.g., `col1 | select` where `select` is a column name). An analysis of existing SQL usage found no instances of bitwise OR operations being used in this way.

The new behavior is controlled by a configuration toggle `spark.sql.parser.singleCharacterPipeOperator.enabled`.

### Why are the changes needed?

This provides syntax compatibility with other languages that use `|` for pipe operations, such as:

- Splunk SPL
- Kusto (KQL)
- Unix shell pipes

**Background:** We previously attempted this in #50284 but abandoned that approach because it inadvertently broke bitwise OR expression usage. After further investigation, we've developed a solution using ANTLR semantic predicates that properly disambiguates the two contexts.

As discussed in that PR:

- Jeff Shute (author of the SQL pipe syntax paper from Google) confirmed that Google uses an LALR parser which makes it impossible for them to support `|` due to ambiguity with bitwise operations
- There is growing industry consensus that `|>` should be the primary/universal token, but engines may optionally support additional tokens
- This approach aligns with how other databases have addressed this (see https://superdb.org/docs/language/pipe-ambiguity/)

Spark's use of ANTLR (not LALR) enables us to support both tokens through lookahead-based disambiguation.

### Does this PR introduce _any_ user-facing change?

Yes, users can now use `|` as a more concise alternative to `|>` for pipe operators.

### How was this patch tested?

This PR includes comprehensive test coverage in `pipe-operators.sql`.

### Was this patch authored or co-authored using generative AI tooling?

Yes, `claude-4.5-sonnet` with manual review and editing.

Closes #52983 from dtenedor/pipe-syntax-single-char.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
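
The lookahead-based disambiguation described in the commit message above can be illustrated with a small, self-contained sketch. This is not Spark's ANTLR grammar or its `isOperatorPipeStart()` implementation: the regex tokenizer, the `classify_bars` helper, and the four-keyword set are assumptions made for illustration only, and the real keyword list is longer. The sketch only shows the idea of inspecting `|` together with the token that follows it.

```python
import re

# Hypothetical subset of keywords that may start a pipe operator.
# The real grammar's list is longer and is not reproduced here.
PIPE_OPERATOR_KEYWORDS = {"SELECT", "WHERE", "EXTEND", "JOIN"}

# Illustrative tokenizer: `|>` first, then identifiers, integers, single symbols.
TOKEN_PATTERN = re.compile(r"\|>|[A-Za-z_][A-Za-z_0-9]*|\d+|\S")


def tokenize(sql):
    """Split SQL text into a flat list of token strings."""
    return TOKEN_PATTERN.findall(sql)


def is_operator_pipe_start(tokens, i):
    """Two-token lookahead: tokens[i] is `|` and tokens[i + 1] is a pipe keyword."""
    if i + 1 >= len(tokens) or tokens[i] != "|":
        return False
    return tokens[i + 1].upper() in PIPE_OPERATOR_KEYWORDS


def classify_bars(sql):
    """Label each `|>` as 'pipe' and each `|` as 'pipe' or 'bitwise_or'."""
    tokens = tokenize(sql)
    labels = []
    for i, tok in enumerate(tokens):
        if tok == "|>":
            labels.append("pipe")
        elif tok == "|":
            is_pipe = is_operator_pipe_start(tokens, i)
            labels.append("pipe" if is_pipe else "bitwise_or")
    return labels


print(classify_bars("table t | select x, y | where x < 2;"))
print(classify_bars("select col1 | col2 from t"))
# The documented exception: a column named like a pipe keyword is read as a pipe.
print(classify_bars("select col1 | select from t"))
```

The last call reproduces the one exception the commit message mentions: because the decision rests only on the token after `|`, an operand whose name matches a pipe operator keyword is classified as the start of a pipe operator rather than as a bitwise OR operand.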

### What changes were proposed in this pull request?

This PR supports `|` as an alternative to `|>` for the SQL pipe operator token.

For example, this is now supported:

as an alternative to this:

The change is controlled by a new Spark configuration, `SQLConf.SINGLE_CHARACTER_OPERATOR_PIPE_TOKEN_ENABLED` (`spark.sql.singleCharacterOperatorPipeTokenEnabled`).

**Background:**

Jeff Shute, the author of the SQL pipe syntax paper from Google, mentions that they went with `|>` because all implementing engines can support it without any issues with their parsers.

It seems like a growing consensus in the industry is that we should all support `|>` as the primary token, but some engines may also decide to support alternative tokens in addition. This blog [2] describes the situation and mentions how another engine decided to go in this direction.

[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl
[2] https://superdb.org/docs/language/pipe-ambiguity/

### Why are the changes needed?

This is a simple and safe change that provides syntax compatibility with other languages that use `|` for this purpose, such as Splunk SPL and Kusto.

### Does this PR introduce any user-facing change?

Yes, per above. Note this change is fully backwards-compatible.

### How was this patch tested?

This PR provides unit tests and golden file tests.

### Was this patch authored or co-authored using generative AI tooling?

No.