Skip to content

Conversation

@sarutak
Copy link
Member

@sarutak sarutak commented Mar 16, 2021

What changes were proposed in this pull request?

This PR fixes an issue that col(), $"<name>" and df("name") don't handle quoted column names like `a``b.c`properly.

For example, if we have a following DataFrame.

val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")

For the DataFrame, this query is successfully executed.

scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+

But the following query will fail because df1("`a``b.c`") throws an exception.

scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
  ... 49 elided

Why are the changes needed?

It's a bug.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests.

@github-actions github-actions bot added the SQL label Mar 16, 2021
@sarutak
Copy link
Member Author

sarutak commented Mar 18, 2021

cc: @cloud-fan

if (char == '`') {
inBacktick = false
if (i + 1 < name.length && name(i + 1) != '.') throw e
if (i + 1 < name.length && name(i + 1) == '`') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should get rid of this hand-written parser and simply call CatalystSqlParser.parseMultipartIdentifier here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll try to replace it with CatalystSqlParser.parseMultipartIdentifier.

Copy link
Member Author

@sarutak sarutak Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found replacing with CatlystSqlParser.parseMultipartIdentifier breaks API compatibility.
If a string passed to UnresolvedAttribute.parseAttributeName contains quoted parts, each of them is regarded as a complete name part.
Otherwise, each unquoted part is regarded as name parts which can be separated by . and each name part doesn't need to be quoted.

So, this is valid for parseAttributeName but not for parseMultipartIdentifier.

UnresolvedAttribute.parseAttributeName("*#  !.`abc`")

So, I restore this part.

@sarutak sarutak force-pushed the fix-parseAttributeName branch from 670bb83 to 249f8cc Compare March 19, 2021 22:47
@SparkQA
Copy link

SparkQA commented Mar 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40852/

@SparkQA
Copy link

SparkQA commented Mar 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40852/

@SparkQA
Copy link

SparkQA commented Mar 20, 2021

Test build #136271 has finished for PR 31854 at commit 249f8cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136271/

checkAnswer(df3.selectExpr("`*-#&% ?`.`a``b.c`"), Row("col1"))
checkAnswer(df3.select(df3("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select(col("*-#&% ?.`a``b.c`")), Row("col1"))
checkAnswer(df3.select($"*-#&% ?.`a``b.c`"), Row("col1"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is intentional. How many tests are broken if we switch to parseMultipartIdentifier? We probably should make this behavior change.

Copy link
Member Author

@sarutak sarutak Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ow many tests are broken if we switch to parseMultipartIdentifier

40+ tests fail.
https://github.com/apache/spark/runs/2164624369?check_suite_focus=true

I don't think this is intentional

Hmm, according to the comment on UnresolvedAttribute.quotedString, it seems to rely on the behavior.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala#L188-L193

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 40+ tests fail, we definitely shouldn't do it in this PR.

It's intentional when we write the code, but I don't think it's intentional to allow end-users to do something like df3.select($"*-#&% ?.a``b.c").

@cloud-fan cloud-fan closed this in f7e9b6e Mar 24, 2021
cloud-fan pushed a commit that referenced this pull request Mar 24, 2021
…ed column names properly

### What changes were proposed in this pull request?

This PR fixes an issue that `col()`, `$"<name>"` and `df("name")` don't handle quoted column names  like ``` `a``b.c` ```properly.

For example, if we have a following DataFrame.
```
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```

For the DataFrame, this query is successfully executed.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```

But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
  ... 49 elided
```
### Why are the changes needed?

It's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #31854 from sarutak/fix-parseAttributeName.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f7e9b6e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.1/3.0!

cloud-fan pushed a commit that referenced this pull request Mar 24, 2021
…ed column names properly

This PR fixes an issue that `col()`, `$"<name>"` and `df("name")` don't handle quoted column names  like ``` `a``b.c` ```properly.

For example, if we have a following DataFrame.
```
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```

For the DataFrame, this query is successfully executed.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```

But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
  ... 49 elided
```

It's a bug.

No.

New tests.

Closes #31854 from sarutak/fix-parseAttributeName.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f7e9b6e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ed column names properly

### What changes were proposed in this pull request?

This PR fixes an issue that `col()`, `$"<name>"` and `df("name")` don't handle quoted column names  like ``` `a``b.c` ```properly.

For example, if we have a following DataFrame.
```
val df1 = spark.sql("SELECT 'col1' AS `a``b.c`")
```

For the DataFrame, this query is successfully executed.
```
scala> df1.selectExpr("`a``b.c`").show
+-----+
|a`b.c|
+-----+
| col1|
+-----+
```

But the following query will fail because ``` df1("`a``b.c`") ``` throws an exception.
```
scala> df1.select(df1("`a``b.c`")).show
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `a``b.c`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
  at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
  at org.apache.spark.sql.Dataset.col(Dataset.scala:1274)
  at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241)
  ... 49 elided
```
### Why are the changes needed?

It's a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes apache#31854 from sarutak/fix-parseAttributeName.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f7e9b6e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants