
[SPARK-10656][SQL] Fix selection failure when a column has special characters #8811

Closed · wants to merge 3 commits

Conversation

zhichao-li
Contributor

Best explained with this example:
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a.b": "c", "d": "e" }""" :: Nil))
df.select("*").show()                   // successful
df.select(df("*")).show()               // throws exception
df.withColumnRenamed("d", "f").show()   // also fails

This PR addresses this by adding backticks around the column names.
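
For context, here is how the quoting behaves from the user's side: a minimal spark-shell sketch using the same 1.x `sqlContext` API as the example above.

```scala
// Spark 1.5-era spark-shell sketch: `sqlContext` is provided by the shell.
// Backticks tell the attribute parser to treat the dot as a literal character.
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a.b": "c", "d": "e"}""" :: Nil))

df.select("`a.b`").show()   // quoted: one column named "a.b"
df.select("d").show()       // no dot, no quoting needed
// df.select("a.b").show()  // unquoted: parsed as field b of column a -> fails
```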

@SparkQA

SparkQA commented Sep 18, 2015

Test build #42648 has finished for PR 8811 at commit 2fa0aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

@zhichao-li
Contributor Author

cc @chenghao-intel @yhuai @liancheng

@chenghao-intel
Contributor

Would the class `Column` be a better place to add the backticks?

checkAnswer(df.select("`a.b`"), Row(10))
checkAnswer(df.select("*"), Row(10, 11, Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))
Contributor

For these tests, which ones will fail without your fix?

Contributor Author

I've just added more tests to cover the `if` branch you mentioned, and the items below would fail without this patch:

    checkAnswer(df.select(df("*")), Row(10, 11, Row(12)))
    checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
    checkAnswer(df.withColumnRenamed("f", "f").select("f"), Row(Row(12)))
    checkAnswer(df.withColumnRenamed("`a.b`", "s").select("s"), Row(10))
    checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42801 has finished for PR 8811 at commit d360b07.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor Author

@chenghao-intel, `UnresolvedAttribute.parseAttributeName` still lacks the ability to handle a field name that itself contains a backtick, so we cannot add backticks in all cases, which is what moving the change into `Column` would require. Here I'm assuming the original field name does not contain any backticks.
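
For readers following along, a simplified, self-contained model of the dot-splitting contract under discussion. This is not Spark's real `UnresolvedAttribute.parseAttributeName` (which also validates escapes); it only illustrates the behavior.

```scala
// Simplified model of attribute-name parsing: unquoted dots split the name
// into nested-field parts, while backticks protect a region from splitting.
def parseAttributeName(name: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var inBacktick = false
  name.foreach {
    case '`'                => inBacktick = !inBacktick      // toggle quoted region
    case '.' if !inBacktick => parts += current.toString; current.clear()
    case c                  => current.append(c)
  }
  parts += current.toString
  parts.toList
}

parseAttributeName("a.b")   // List(a, b)  -> field b of column a
parseAttributeName("`a.b`") // List(a.b)   -> one column literally named "a.b"
// A name that already contains a backtick cannot be handled by blindly
// wrapping it in backticks -- the limitation noted in the comment above.
```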

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42814 has finished for PR 8811 at commit 626a82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 20, 2015

Test build #44003 has finished for PR 8811 at commit 626a82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44024 has finished for PR 8811 at commit 626a82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think the main problem is that we interpret column names with special handling of `.` for DataFrame. This lets us write something like `df("a.b")` to get the field `b` of column `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, so no extra interpretation is needed.

How about adding a new version of `DataFrame.resolve` that uses the column name directly, without interpreting it, and calling that version from `DataFrame.apply("*")` and `DataFrame.withColumnRenamed`?
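
A hypothetical, self-contained sketch of what such a "literal" resolve could look like (all names here are invented for illustration; this is not the actual Spark code):

```scala
// Hypothetical sketch: resolve a column by comparing the requested name
// verbatim against the plan's output attributes, with no dot/backtick
// interpretation at all.
case class Attribute(name: String)

def resolveLiterally(
    output: Seq[Attribute],
    colName: String,
    resolver: (String, String) => Boolean): Option[Attribute] =
  output.find(attr => resolver(attr.name, colName))

val output = Seq(Attribute("a.b"), Attribute("d"))
val caseInsensitive = (a: String, b: String) => a.equalsIgnoreCase(b)

resolveLiterally(output, "a.b", caseInsensitive) // Some(Attribute(a.b)) -- literal match
```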

@asfgit asfgit closed this in d9e30c5 Nov 5, 2015
asfgit pushed a commit that referenced this pull request Nov 5, 2015
The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, so no extra processing is needed to interpret it.

The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`.

close #8811

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9462 from cloud-fan/special-chars.

(cherry picked from commit d9e30c5)
Signed-off-by: Yin Huai <yhuai@databricks.com>
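
To make the commit message concrete, here is a self-contained behavioral model of the rename path it describes. This is a simplification under the stated approach, not the actual `DataFrame.scala` change.

```scala
// Model of withColumnRenamed after the fix: match the existing name literally
// against the analyzed output; if nothing matches, the call is a no-op.
def withColumnRenamed(
    output: Seq[String],
    existingName: String,
    newName: String): Seq[String] =
  if (output.contains(existingName))
    output.map(n => if (n == existingName) newName else n)
  else
    output // no literal match: return the schema unchanged

withColumnRenamed(Seq("a.b", "d"), "d", "f")   // List(a.b, f)
withColumnRenamed(Seq("a.b", "d"), "a.b", "s") // List(s, d) -- dot stays literal
```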
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015