@dtenedor dtenedor commented Mar 14, 2025

What changes were proposed in this pull request?

This PR supports | as an alternative to |> for the SQL pipe operator token.

For example, this is now supported:

from t
| select x, y;

as an alternative to this:

from t
|> select x, y;

The change is controlled by a new Spark configuration SQLConf.SINGLE_CHARACTER_OPERATOR_PIPE_TOKEN_ENABLED (spark.sql.singleCharacterOperatorPipeTokenEnabled).

Background:

  • We talked with Jeff Shute, the author of the SQL pipe syntax paper from Google.
  • He mentions that they went with |> because all implementing engines can support it without any issues with their parsers.
  • Specifically, Google uses an LALR parser [1] which makes it impossible to use | as the token due to ambiguity with bit operations. They won't be able to add support for this.

It seems like a growing consensus in the industry is that we should all support |> as the primary token, but some engines may also decide to support alternative tokens in addition. This blog [2] describes the situation and mentions how another engine decided to go this direction.

[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl

[2] https://superdb.org/docs/language/pipe-ambiguity/

Why are the changes needed?

This is a simple and safe change and provides syntax compatibility with other languages that use | for this purpose such as Splunk SPL and Kusto.

Does this PR introduce any user-facing change?

Yes, per above. Note this change is fully backwards-compatible.

How was this patch tested?

This PR provides unit tests and golden file tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 14, 2025
@dtenedor
Contributor Author

cc @cloud-fan @gengliangwang

Member

@dongjoon-hyun dongjoon-hyun left a comment

I understand your motivation, and that the current syntax is due to a limitation of Google's system.

However, I'm a little negative on this approach because of compatibility with the Google pipe syntax. Can we hold off on this addition until Google supports this new syntax, @dtenedor?

@dtenedor
Contributor Author

Hi @dongjoon-hyun, this is a good question. We talked with Jeff Shute, the author of the SQL pipe syntax paper from Google. He mentions that they went with |> because all implementing engines can support it without any issues with their parsers. Specifically, Google uses an LALR parser [1] which makes it impossible to use | as the token due to ambiguity with bit operations. They won't be able to add support for this.

It seems like a growing consensus in the industry is that we should all support |> as the primary token, but some engines may also decide to support alternative tokens in addition. This blog [2] describes the situation and mentions how another engine decided to go this direction. Please let us know your thoughts on this -- we can certainly hold off on merging this PR until we figure out together what we want the plan to be.

[1] https://github.com/google/zetasql/blob/master/bazel/bison.bzl

[2] https://superdb.org/docs/language/pipe-ambiguity/

@dongjoon-hyun
Member

Thank you for the pointers. Yes, that's exactly what I was worried about.

I believe `|>` is a better choice for long-term interoperability while the ecosystem is growing. We cannot keep adding tokens one by one each time a new system highlights a new token as its own feature.

However, let me remove my review comment here for now.

@dongjoon-hyun dongjoon-hyun dismissed their stale review March 17, 2025 19:16

Since this is an Apache Spark 4.1.0 feature, we have more time to discuss. So, I'm lifting my previous change request so that this PR gets more reviews.

@dongjoon-hyun
Member

If you don't mind, please revise the PR description to include #50284 (comment), @dtenedor. Thank you for making more progress in this area.

@dtenedor
Contributor Author

I found a parsing ambiguity with this syntax that the research paper mentions:

[image: the ambiguous example from the paper]

Making a note here to add testing for this. Since this example is completely ambiguous, we'll have to decide which application of the | token we would prefer to take precedence.

@dongjoon-hyun
Member

Yes, right. For the record, I also read that paper, of course, when I reviewed your original SQL Pipe Syntax PR. Thank you for considering adding them as test cases.

Making a note here to add testing for this.

@dtenedor
Contributor Author

I looked at this some more, and it seems we will have to put some effort into the parser to differentiate this case and resolve the ambiguity. It should be possible, but it will take extra work and investigation. I'll do some experimentation, which might take some time, so I'll close this PR for now to keep it off the "open Spark PRs" list and re-open it later, when the extra testing is available, to help with review.

@dtenedor dtenedor closed this Mar 21, 2025
dtenedor added a commit that referenced this pull request Nov 14, 2025
… operator token

### What changes were proposed in this pull request?

This PR adds support for `|` as an alternative to `|>` for the SQL pipe operator token.

For example, this is now supported:

```sql
table t
| select x, y
| where x < 2;
```

as an alternative to:

```sql
table t
|> select x, y
|> where x < 2;
```

The implementation uses a semantic predicate with 2-token lookahead (`isOperatorPipeStart()`) to disambiguate between:
- **Pipe operators**: When `|` is followed by corresponding keywords like `SELECT`, `WHERE`, `EXTEND`, `JOIN`, etc.
- **Bitwise OR**: When `|` is part of an expression (e.g., `col1 | col2`)

This approach ensures that existing SQL queries using `|` for bitwise OR operations continue to work without any changes. The only exception is when column names match pipe operator keywords (e.g., `col1 | select` where `select` is a column name). An analysis of existing SQL usage found no instances of bitwise OR operations being used in this way.
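
To make the disambiguation rule concrete, here is a minimal Python sketch of the idea. It is not the ANTLR predicate from this PR: the toy tokenizer, the `classify_pipes` helper, and the short keyword set are assumptions made for illustration (the real keyword list is longer), and `is_operator_pipe_start` is only a hypothetical Python analog of `isOperatorPipeStart()`. It treats `|` as a pipe operator only when the next token is one of the listed keywords, and as bitwise OR otherwise.

```python
import re

# Hypothetical, partial keyword set for illustration only. The description
# above names SELECT, WHERE, EXTEND and JOIN and then says "etc.", so the
# real set of pipe operator keywords is larger than this.
PIPE_OPERATOR_KEYWORDS = {"SELECT", "WHERE", "EXTEND", "JOIN"}

# Toy tokenizer: `|>` is tried before `|`, then identifiers, integers,
# and finally any other single punctuation character.
TOKEN_PATTERN = re.compile(r"\|>|\||[A-Za-z_][A-Za-z_0-9]*|\d+|[^\s\w]")


def tokenize(sql):
    """Split a SQL string into a flat list of tokens."""
    return TOKEN_PATTERN.findall(sql)


def is_operator_pipe_start(tokens, i):
    """Return True if tokens[i] is `|` and the following token is a pipe keyword."""
    if tokens[i] != "|" or i + 1 >= len(tokens):
        return False
    return tokens[i + 1].upper() in PIPE_OPERATOR_KEYWORDS


def classify_pipes(sql):
    """Label every `|` or `|>` token in the query as 'pipe' or 'bitwise_or'."""
    tokens = tokenize(sql)
    kinds = []
    for i, tok in enumerate(tokens):
        if tok == "|>":
            kinds.append("pipe")
        elif tok == "|":
            kinds.append("pipe" if is_operator_pipe_start(tokens, i) else "bitwise_or")
    return kinds


if __name__ == "__main__":
    print(classify_pipes("table t | select x, y | where x < 2"))
    print(classify_pipes("select col1 | col2 from t"))
    # The exception described above: a column named `select` after `|`
    # is read as the start of a pipe operator.
    print(classify_pipes("select col1 | select from t"))
```

In this sketch the last query is classified as a pipe operator even though `select` was meant as a column name, which mirrors the one exception described above.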

The new behavior is controlled by a configuration toggle `spark.sql.parser.singleCharacterPipeOperator.enabled`.

### Why are the changes needed?

This provides syntax compatibility with other languages that use `|` for pipe operations, such as:
- Splunk SPL
- Kusto (KQL)
- Unix shell pipes

**Background:**

We previously attempted this in #50284 but abandoned that approach because it inadvertently broke bitwise OR expression usage. After further investigation, we've developed a solution using ANTLR semantic predicates that properly disambiguates the two contexts.

As discussed in that PR:
- Jeff Shute (author of the SQL pipe syntax paper from Google) confirmed that Google uses an LALR parser which makes it impossible for them to support `|` due to ambiguity with bitwise operations
- There is growing industry consensus that `|>` should be the primary/universal token, but engines may optionally support additional tokens
- This approach aligns with how other databases have addressed this (see https://superdb.org/docs/language/pipe-ambiguity/)

Spark's use of ANTLR (not LALR) enables us to support both tokens through lookahead-based disambiguation.

### Does this PR introduce _any_ user-facing change?

Yes, users can now use `|` as a more concise alternative to `|>` for pipe operators.

### How was this patch tested?

This PR includes comprehensive test coverage in `pipe-operators.sql`.

### Was this patch authored or co-authored using generative AI tooling?

Yes, `claude-4.5-sonnet` with manual review and editing.

Closes #52983 from dtenedor/pipe-syntax-single-char.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Nov 14, 2025
… operator token

huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
… operator token
