[SPARK-52709][SQL] Fix parsing of STRUCT<> #51480

Closed
ManosGEM wants to merge 3 commits into apache:master from ManosGEM:SPARK-52709-fix-empty-struct-parsing

Conversation

@ManosGEM
Contributor

What changes were proposed in this pull request?

This PR fixes an issue in Spark SQL's parser where empty or nested STRUCT<> types cause incorrect parenthesis tracking and parsing failures. Previously, the parser increased the parenthesis depth counter upon encountering the keyword STRUCT. Due to the operator precedence in the lexer (e.g., NEQ is matched before LT), a construct like STRUCT<> could incorrectly be tokenized as STRUCT and NEQ. This caused the parser to increase the nesting counter without ever decreasing it, eventually resulting in a syntax error.
For example, the following valid query fails under the current logic:
SELECT cast(null as STRUCT<>), 2 >> 1;

To fix this, we adjusted the definition of the NEQ token in the SQL lexer so that it no longer matches <> when used in a complex data type. This ensures that the parser correctly interprets the angle brackets as part of a type specification rather than as a comparison operator.
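
The tokenization problem and the guarded NEQ rule described above can be illustrated with a toy model. This is a minimal Python sketch, not Spark's actual ANTLR lexer: the function `tokenize`, its `guard_neq` flag, and the token names are hypothetical, and the `level` variable only stands in for the nesting counter mentioned in the description.

```python
import re

COMPLEX_KEYWORDS = ("STRUCT", "ARRAY", "MAP")

def tokenize(sql, guard_neq):
    """Toy tokenizer (hypothetical); returns a list of token names.

    guard_neq=False models the old behavior: '<>' is always NEQ.
    guard_neq=True models the fix: '<>' is NEQ only outside a complex type.
    """
    tokens = []
    level = 0  # stands in for the complex-type nesting counter
    i = 0
    while i < len(sql):
        ch = sql[i]
        if ch.isspace():
            i += 1
            continue
        word = re.match(r"[A-Za-z_]\w*", sql[i:])
        if word:
            name = word.group(0).upper()
            if name in COMPLEX_KEYWORDS:
                level += 1  # the counter goes up as soon as the keyword is seen
                tokens.append(name)
            else:
                tokens.append("IDENT")
            i += len(word.group(0))
            continue
        two = sql[i:i + 2]
        if two == "<>" and (not guard_neq or level == 0):
            tokens.append("NEQ")  # swallows the '>' that would lower the counter
            i += 2
            continue
        if two == ">>" and level == 0:
            tokens.append("SHIFT_RIGHT")
            i += 2
            continue
        if ch == "<":
            tokens.append("LT")
            i += 1
            continue
        if ch == ">":
            level = max(level - 1, 0)
            tokens.append("GT")
            i += 1
            continue
        number = re.match(r"\d+", sql[i:])
        if number:
            tokens.append("NUM")
            i += len(number.group(0))
            continue
        tokens.append(ch)  # any other single character, e.g. ','
        i += 1
    return tokens

print(tokenize("STRUCT<>, 2 >> 1", guard_neq=False))
print(tokenize("STRUCT<>, 2 >> 1", guard_neq=True))
```

In this model, the unguarded run yields `STRUCT, NEQ, ',', NUM, GT, GT, NUM`: the counter is still raised when `>>` is reached, so the shift operator is split into two `GT` tokens. The guarded run yields `STRUCT, LT, GT, ',', NUM, SHIFT_RIGHT, NUM`.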

Why are the changes needed?

Bug fix.
By modifying the NEQ token rule to avoid incorrectly matching <> in this context, we ensure that:

  • Empty STRUCT types like STRUCT<> are parsed correctly.
    
  • Nested and complex STRUCT types are supported without breaking parsing logic.
    
  • Queries with complex data types and bitwise operations (e.g., >>) are no longer broken by incorrect token handling.
    

This change improves Spark SQL’s compatibility with standard SQL syntax and user expectations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new test case has been added to PlanParserSuite.scala to specifically verify that queries containing CAST(null AS STRUCT<>), nested STRUCT<> (if applicable), and the >> operator now parse successfully into the expected logical plan.

Was this patch authored or co-authored using generative AI tooling?

No

Closes SPARK-52709
Reported by: @mihailom-db

@github-actions github-actions bot added the SQL label Jul 14, 2025
Contributor

@mihailomilosevic2001 mihailomilosevic2001 left a comment

@ManosGEM Thanks for working on this issue. I have left a few comments on the PR. Some of them are just to make code more durable to future errors of this type. Please feel free to ping me for another round of review when you go through them.

Contributor

In general, when we add tests to specific files, we try to reuse as much of the existing boilerplate code as possible. Could you please rewrite this test to have a similar structure to the tests above?

Contributor

By this, I mean to reuse comparePlans. Also, I would say we can move this test to SparkSqlParserSuite as the point of this test is to enable parsing of different queries.

Contributor Author

I will try my best to rewrite it in the manner you propose.

@mihailomilosevic2001
Contributor

nit: We usually like to name the PR the same way the ticket is named on Jira. Could you please align the PR name with Jira?

@ManosGEM ManosGEM changed the title from "[SPARK-52709][SQL] Fix parsing of null STRUCT<>" to "[SPARK-52709][SQL] Fix parsing of STRUCT<>" on Jul 15, 2025
@ManosGEM
Contributor Author

> nit: We usually like to name the PR the same way the ticket is named on Jira. Could you please align the PR name with Jira?

Remove the word "null", then? The only other difference is the [SQL] tag, which I added according to the guidelines.

@mihailomilosevic2001
Contributor

Yeah, I meant the main name of the PR, the tags are all good.

This commit addresses a parsing issue where the `STRUCT<>` data type
was incorrectly handled, especially when appearing before operators
like the bitwise shift (`>>`).

The problem stemmed from the parser misinterpreting the angle brackets
of `STRUCT<>` as the 'not equal' operator (`<>`), leading to syntax errors.
The fix ensures `STRUCT<>` is correctly recognized as a data type.

A new test case in `PlanParserSuite.scala` confirms that queries
with `CAST(null AS STRUCT<>)`, nested structs, and `>>` now parse correctly.
@ManosGEM ManosGEM force-pushed the SPARK-52709-fix-empty-struct-parsing branch from 88f22d9 to 81f7573 on July 18, 2025 08:23
…ta types.

This commit follows SPARK-52709 by:

- Removing  from  parsing rules.

- Relocating relevant parser tests from  to .

- Refactoring the test setup for complex data types into a reusable helper function.

- Adding comprehensive tests for valid , nested , ,  and their combinations to .

- Adding negative tests for invalid empty  and  types to confirm correct  behavior.
@ManosGEM ManosGEM force-pushed the SPARK-52709-fix-empty-struct-parsing branch from 81f7573 to 76c0544 on July 18, 2025 11:46
Contributor

@mihailomilosevic2001 mihailomilosevic2001 left a comment

LGTM. @MaxGekk @cloud-fan, could you please do a final review and merge?

@cloud-fan
Contributor

cloud-fan commented Jul 24, 2025

good catch! merging to master!

@cloud-fan cloud-fan closed this in 64cada1 Jul 24, 2025
@cloud-fan
Contributor

@ManosGEM Can you open a branch 4.0 backport PR? thanks!

@ManosGEM
Contributor Author

@cloud-fan So everything in the PR should stay the same, but I open it against branch-4.0? Sorry, this is actually my first PR on Spark.

@cloud-fan
Contributor

It has merge conflicts. You will need to git cherry-pick this commit onto branch-4.0 locally first, resolve the merge conflicts, and then open a new PR against branch-4.0.

@cloud-fan
Contributor

cloud-fan commented Aug 18, 2025

This PR introduced a regression. complex_type_level_counter is increased when the lexer sees STRUCT/ARRAY/MAP. However, STRUCT/ARRAY/MAP are also function names, and ARRAY(col1 <> col2) should be allowed.

That being said, complex_type_level_counter itself already has bugs: ARRAY(col >> 1) should be allowed but is not today. I think we can't disambiguate it on the lexer side; we should handle it in the parser instead. For example, STRUCT<> can be an empty struct type, but it can also be part of struct <> another_col, where <> means "not equal to".

Let me revert this PR first.
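
The regression described in this comment can be reproduced with a toy model. This is a minimal Python sketch under stated assumptions, not Spark's lexer: `guarded_tokens` is a hypothetical function that models only a nesting counter raised on STRUCT/ARRAY/MAP, lowered on `>`, and used to guard both `<>` and `>>`.

```python
import re

COMPLEX_KEYWORDS = {"STRUCT", "ARRAY", "MAP"}
# Longest alternatives first: words, numbers, '<>', '>>', then any single character.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+|<>|>>|\S")

def guarded_tokens(sql):
    """Toy model (hypothetical): split sql into lexemes, guarding '<>' and '>>'."""
    level = 0  # stands in for complex_type_level_counter
    out = []
    pos = 0
    while pos < len(sql):
        if sql[pos].isspace():
            pos += 1
            continue
        text = LEXEME.match(sql, pos).group(0)
        if text in ("<>", ">>") and level > 0:
            # Guard fails inside a "complex type": fall back to one character.
            text = text[0]
        if text.upper() in COMPLEX_KEYWORDS:
            # The lexer cannot tell a type name from a function name here.
            level += 1
        elif text == ">":
            level = max(level - 1, 0)
        out.append(text)
        pos += len(text)
    return out

print(guarded_tokens("ARRAY(col1 <> col2)"))
print(guarded_tokens("ARRAY(col >> 1)"))
print(guarded_tokens("a <> b"))
```

In this model, `ARRAY(col1 <> col2)` loses its "not equal" operator because `<>` is split into `<` and `>`, and `ARRAY(col >> 1)` loses its shift operator in the same way. Only `a <> b`, which has no preceding keyword, keeps `<>` as a single token. The keyword alone does not tell the lexer whether a type or a function call follows, which is the ambiguity the comment proposes resolving in the parser.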

3 participants