[SPARK-34763][SQL] col(), $"<name>" and df("name") should handle quoted column names properly #31854
Conversation
cc: @cloud-fan
```scala
if (char == '`') {
  inBacktick = false
  if (i + 1 < name.length && name(i + 1) != '.') throw e
  if (i + 1 < name.length && name(i + 1) == '`') {
```
maybe we should get rid of this hand-written parser and simply call CatalystSqlParser.parseMultipartIdentifier here.
Thanks. I'll try to replace it with CatalystSqlParser.parseMultipartIdentifier.
I found that replacing it with CatalystSqlParser.parseMultipartIdentifier breaks API compatibility.
If a string passed to UnresolvedAttribute.parseAttributeName contains quoted parts, each of them is regarded as a complete name part.
Otherwise, the unquoted string is split on `.` into name parts, and none of those parts needs to be quoted.
So the following is valid for parseAttributeName but not for parseMultipartIdentifier:
UnresolvedAttribute.parseAttributeName("*# !.`abc`")
So, I restored this part.
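The behavioral difference can be illustrated with a simplified, self-contained re-implementation of the parseAttributeName contract (a sketch for illustration only, not the actual Spark code; error handling is reduced to a single exception):

```scala
// Sketch of parseAttributeName's contract: backticks quote a name part,
// a doubled backtick inside quotes is a literal backtick, and dots
// outside quotes separate name parts. Unquoted parts may contain any
// character except '.' and '`'.
def parseAttributeName(name: String): Seq[String] = {
  def err() = throw new IllegalArgumentException(
    s"syntax error in attribute name: $name")
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val tmp = new StringBuilder
  var inBacktick = false
  var i = 0
  while (i < name.length) {
    val char = name(i)
    if (inBacktick) {
      if (char == '`') {
        if (i + 1 < name.length && name(i + 1) == '`') {
          tmp.append('`'); i += 1            // doubled backtick -> literal backtick
        } else {
          inBacktick = false
          // a closing backtick must be followed by a dot or end of input
          if (i + 1 < name.length && name(i + 1) != '.') err()
        }
      } else tmp.append(char)
    } else {
      if (char == '`') {
        if (tmp.nonEmpty) err()              // a quote must start a name part
        inBacktick = true
      } else if (char == '.') {
        if (i == 0 || name(i - 1) == '.' || i == name.length - 1) err()
        parts += tmp.toString; tmp.clear()
      } else tmp.append(char)
    }
    i += 1
  }
  if (inBacktick) err()
  parts += tmp.toString
  parts.toSeq
}

// Quoted parts are taken verbatim; unquoted parts are split on dots.
assert(parseAttributeName("`a``b.c`") == Seq("a`b.c"))
assert(parseAttributeName("*# !.`abc`") == Seq("*# !", "abc"))
```

In contrast, parseMultipartIdentifier would reject unquoted special characters such as `*# !`, which is why switching parsers breaks the existing API.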
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136271 has finished for PR 31854 at commit
Refer to this link for build results (access rights to CI server needed):
```scala
checkAnswer(df3.selectExpr("`*-#&% ?`.`a``b.c`"), Row("col1"))
checkAnswer(df3.select(df3("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select(col("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select($"*-#&% ?.`a``b.c`"), Row("col1"))
```
I don't think this is intentional. How many tests are broken if we switch to parseMultipartIdentifier? We probably should make this behavior change.
How many tests are broken if we switch to parseMultipartIdentifier?
40+ tests fail.
https://github.com/apache/spark/runs/2164624369?check_suite_focus=true
I don't think this is intentional
Hmm, according to the comment on UnresolvedAttribute.quotedString, it seems to rely on this behavior.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala#L188-L193
If 40+ tests fail, we definitely shouldn't do it in this PR.
It's intentional when we write the code, but I don't think it's intentional to allow end-users to do something like df3.select($"*-#&% ?.a``b.c").
### What changes were proposed in this pull request?
This PR fixes an issue where `col()`, `$"<name>"` and `df("name")` don't handle quoted column names like ``` `a``b.c` ``` properly.
For example, suppose we have the following DataFrame.
```scala
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```
Against this DataFrame, the following query succeeds.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```
But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
... 49 elided
```
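The doubled backtick in ``` `a``b.c` ``` follows the usual backtick-quoting convention: a quoted identifier is wrapped in backticks, and a literal backtick inside it is escaped by doubling. A minimal sketch of that convention (illustrative helpers, not Spark API):

```scala
// Quote an identifier: wrap it in backticks, doubling any literal backtick.
def quote(name: String): String = "`" + name.replace("`", "``") + "`"

// Inverse of quote for a well-formed quoted identifier.
def unquote(quoted: String): String =
  quoted.stripPrefix("`").stripSuffix("`").replace("``", "`")

assert(quote("a`b.c") == "`a``b.c`")   // the column name from the example above
assert(unquote("`a``b.c`") == "a`b.c")
```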
### Why are the changes needed?
It's a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #31854 from sarutak/fix-parseAttributeName.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f7e9b6e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/3.1/3.0!