[SPARK-34763][SQL] col(), $"<name>" and df("name") should handle quoted column names properly #31854
Conversation
cc: @cloud-fan
```scala
if (char == '`') {
  inBacktick = false
  if (i + 1 < name.length && name(i + 1) != '.') throw e
  if (i + 1 < name.length && name(i + 1) == '`') {
```
maybe we should get rid of this hand-written parser and simply call CatalystSqlParser.parseMultipartIdentifier here.
Thanks. I'll try to replace it with CatalystSqlParser.parseMultipartIdentifier.
I found that replacing it with CatalystSqlParser.parseMultipartIdentifier breaks API compatibility.
If a string passed to UnresolvedAttribute.parseAttributeName contains quoted parts, each of them is regarded as a complete name part.
Otherwise, the unquoted string is split on `.` into name parts, and none of those parts needs to be quoted.
So the following is valid for parseAttributeName but not for parseMultipartIdentifier:
UnresolvedAttribute.parseAttributeName("*# !.`abc`")
So, I restored this part.
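The behavioral difference can be illustrated with a simplified, self-contained re-implementation of the parseAttributeName contract (a sketch for illustration only, not the actual Spark code; error handling is reduced to a single exception):

```scala
// Sketch of parseAttributeName's contract: backticks quote a name part,
// a doubled backtick inside quotes is a literal backtick, and dots
// outside quotes separate name parts. Unquoted parts may contain any
// character except '.' and '`'.
def parseAttributeName(name: String): Seq[String] = {
  def err() = throw new IllegalArgumentException(
    s"syntax error in attribute name: $name")
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val tmp = new StringBuilder
  var inBacktick = false
  var i = 0
  while (i < name.length) {
    val char = name(i)
    if (inBacktick) {
      if (char == '`') {
        if (i + 1 < name.length && name(i + 1) == '`') {
          tmp.append('`'); i += 1            // doubled backtick -> literal backtick
        } else {
          inBacktick = false
          // a closing backtick must be followed by a dot or end of input
          if (i + 1 < name.length && name(i + 1) != '.') err()
        }
      } else tmp.append(char)
    } else {
      if (char == '`') {
        if (tmp.nonEmpty) err()              // a quote must start a name part
        inBacktick = true
      } else if (char == '.') {
        if (i == 0 || name(i - 1) == '.' || i == name.length - 1) err()
        parts += tmp.toString; tmp.clear()
      } else tmp.append(char)
    }
    i += 1
  }
  if (inBacktick) err()
  parts += tmp.toString
  parts.toSeq
}

// Quoted parts are taken verbatim; unquoted parts are split on dots.
assert(parseAttributeName("`a``b.c`") == Seq("a`b.c"))
assert(parseAttributeName("*# !.`abc`") == Seq("*# !", "abc"))
```

In contrast, parseMultipartIdentifier would reject unquoted special characters such as `*# !`, which is why switching parsers breaks the existing API.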
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #136271 has finished for PR 31854 at commit
Refer to this link for build results (access rights to CI server needed):
```scala
checkAnswer(df3.selectExpr("`*-#&% ?`.`a``b.c`"), Row("col1"))
checkAnswer(df3.select(df3("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select(col("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select($"*-#&% ?.`a``b.c`"), Row("col1"))
```
I don't think this is intentional. How many tests are broken if we switch to parseMultipartIdentifier? We probably should make this behavior change.
How many tests are broken if we switch to parseMultipartIdentifier?
40+ tests fail.
https://github.com/apache/spark/runs/2164624369?check_suite_focus=true
I don't think this is intentional
Hmm, according to the comment on UnresolvedAttribute.quotedString, it seems to rely on this behavior.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala#L188-L193
If 40+ tests fail, we definitely shouldn't do it in this PR.
It's intentional when we write the code, but I don't think it's intentional to allow end-users to do something like df3.select($"*-#&% ?.a``b.c").
### What changes were proposed in this pull request?
This PR fixes an issue where `col()`, `$"<name>"` and `df("name")` don't handle quoted column names like ``` `a``b.c` ``` properly.
For example, suppose we have the following DataFrame.
```scala
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```
Against this DataFrame, the following query succeeds.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```
But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
... 49 elided
```
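The doubled backtick in ``` `a``b.c` ``` follows the usual backtick-quoting convention: a quoted identifier is wrapped in backticks, and a literal backtick inside it is escaped by doubling. A minimal sketch of that convention (illustrative helpers, not Spark API):

```scala
// Quote an identifier: wrap it in backticks, doubling any literal backtick.
def quote(name: String): String = "`" + name.replace("`", "``") + "`"

// Inverse of quote for a well-formed quoted identifier.
def unquote(quoted: String): String =
  quoted.stripPrefix("`").stripSuffix("`").replace("``", "`")

assert(quote("a`b.c") == "`a``b.c`")   // the column name from the example above
assert(unquote("`a``b.c`") == "a`b.c")
```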
### Why are the changes needed?
It's a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes #31854 from sarutak/fix-parseAttributeName.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f7e9b6e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/3.1/3.0!