Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47300][SQL] quoteIfNeeded should quote identifier starts with digits #45401

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

quoteIfNeeded is used to generate pretty strings of identifiers in error message, EXPLAIN result, etc. It's mostly for humans to read but sometimes people may copy-paste the identifier string and put it in a SQL. There is a small issue in quoteIfNeeded: it does not quote number literals like 0d, and people will get parse exception if they directly put the identifier string in SQL.

This PR fixes the issue by always quoting the identifier if it does not start with alphabet or underscore. Note, there might be program trying to parse the identifier string, but this change is safe as these programs should already handle quoting and it's ok to quote more.

Why are the changes needed?

make identifier string parsable is more user-friendly.

Does this PR introduce any user-facing change?

yes, the identifier string in the error message and EXPLAIN result will be properly quoted.

How was this patch tested?

new test suite

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Mar 6, 2024
@cloud-fan
Copy link
Contributor Author

cloud-fan commented Mar 6, 2024

cc @yaooqinn @dongjoon-hyun

@yaooqinn
Copy link
Member

yaooqinn commented Mar 6, 2024

spark-sql (default)> select version();
3.5.0 ce5ddad990373636e94071e7cef2f31021add07b
spark-sql (default)> create table d (0a real) using parquet;
spark-sql (default)> desc formatted d;
0a                  	float

Do we already have it?

@cloud-fan
Copy link
Contributor Author

@yaooqinn 0a is fine but 0d is not, as it's double literal.

Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yaooqinn yaooqinn changed the title [SPARK-47300] quoteIfNeeded should quote identifier starts with digits [SPARK-47300][SQL] quoteIfNeeded should quote identifier starts with digits Mar 6, 2024
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Seems the test failure is related:

[info] - SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters *** FAILED *** (3 milliseconds)
[info]   "[`1a`]" did not equal "[1a]" (StringUtilsSuite.scala:136)
[info]   Analysis:
[info]   "[`1a`]" -> "[1a]"

@MaxGekk
Copy link
Member

MaxGekk commented Mar 6, 2024

How about deduplicate the tests:

test("SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters") {
assert(quoteIfNeeded("a b") === "`a b`")
assert(quoteIfNeeded("a*b") === "`a*b`")
assert(quoteIfNeeded("123") === "`123`")
assert(quoteIfNeeded("1a") === "1a")
assert(quoteIfNeeded("_ab_") === "_ab_")
assert(quoteIfNeeded("_") === "_")
assert(quoteIfNeeded("") === "``")
}

and maybe move them to QuotingUtilsSuite.

@dongjoon-hyun
Copy link
Member

It seems that the same failure (which @MaxGekk 's reported) still exists at the last CI.

StringUtilsSuite:
[info] - escapeLikeRegex (1 millisecond)
[info] - filter pattern (1 millisecond)
[info] - string concatenation (1 millisecond)
[info] - string concatenation with limit (1 millisecond)
[info] - string concatenation return value (1 millisecond)
[info] - SPARK-31916: StringConcat doesn't overflow on many inputs (190 milliseconds)
[info] - SPARK-31916: verify that PlanStringConcat's output shows the actual length of the plan (3 milliseconds)
[info] - SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters *** FAILED *** (2 milliseconds)
[info]   "[`1a`]" did not equal "[1a]" (StringUtilsSuite.scala:136)
[info]   Analysis:
[info]   "[`1a`]" -> "[1a]"

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (if CIs passes after adjusting the test suite)

@@ -129,16 +129,6 @@ class StringUtilsSuite extends SparkFunSuite with SQLHelper {
}
}

test("SPARK-34872: quoteIfNeeded should quote a string which contains non-word characters") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's covered by the new test

@yaooqinn yaooqinn closed this in 9b743ff Mar 7, 2024
@yaooqinn
Copy link
Member

yaooqinn commented Mar 7, 2024

Merged to master.

Thank you @cloud-fan @dongjoon-hyun @HyukjinKwon @MaxGekk

jpcorreia99 pushed a commit to jpcorreia99/spark that referenced this pull request Mar 12, 2024
…h digits

### What changes were proposed in this pull request?

`quoteIfNeeded` is used to generate pretty strings of identifiers in error message, EXPLAIN result, etc. It's mostly for humans to read but sometimes people may copy-paste the identifier string and put it in a SQL. There is a small issue in `quoteIfNeeded`: it does not quote number literals like `0d`, and people will get parse exception if they directly put the identifier string in SQL.

This PR fixes the issue by always quoting the identifier if it does not start with alphabet or underscore. Note, there might be program trying to parse the identifier string, but this change is safe as these programs should already handle quoting and it's ok to quote more.

### Why are the changes needed?

make identifier string parsable is more user-friendly.

### Does this PR introduce _any_ user-facing change?

yes, the identifier string in the error message and EXPLAIN result will be properly quoted.

### How was this patch tested?

new test suite

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#45401 from cloud-fan/quote.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants