
[SPARK-10656][SQL] Fix selection failure when a column has special characters #8811

Closed · wants to merge 3 commits

Conversation

zhichao-li
Contributor

Best explained with this example:
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a.b": "c", "d": "e" }""" :: Nil))
df.select("*").show()                   // successful
df.select(df("*")).show()               // throws exception
df.withColumnRenamed("d", "f").show()   // also fails

This PR addresses this by adding backticks around the column names.
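
For context, here is how the quoting behaves from the user's side: a minimal spark-shell sketch using the same 1.x `sqlContext` API as the example above.

```scala
// Spark 1.5-era spark-shell sketch: `sqlContext` is provided by the shell.
// Backticks tell the attribute parser to treat the dot as a literal character.
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a.b": "c", "d": "e"}""" :: Nil))

df.select("`a.b`").show()   // quoted: one column named "a.b"
df.select("d").show()       // no dot, no quoting needed
// df.select("a.b").show()  // unquoted: parsed as field b of column a -> fails
```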

@SparkQA

SparkQA commented Sep 18, 2015

Test build #42648 has finished for PR 8811 at commit 2fa0aa6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Interaction(override val uid: String) extends Transformer

@zhichao-li
Contributor Author

cc @chenghao-intel @yhuai @liancheng

@chenghao-intel
Contributor

Would the class `Column` be a better place to add the backticks?

checkAnswer(df.select("`a.b`"), Row(10))
checkAnswer(df.select("*"), Row(10, 11, Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))
Contributor

For these tests, which ones will fail without your fix?

Contributor Author

I've just added more tests to cover the `if` branch you mentioned, and the items below would fail without this patch:

    checkAnswer(df.select(df("*")), Row(10, 11, Row(12)))
    checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
    checkAnswer(df.withColumnRenamed("f", "f").select("f"), Row(Row(12)))
    checkAnswer(df.withColumnRenamed("`a.b`", "s").select("s"), Row(10))
    checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42801 has finished for PR 8811 at commit d360b07.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor Author

@chenghao-intel, `UnresolvedAttribute.parseAttributeName` still lacks the ability to handle a field name that itself contains a backtick, so we cannot add backticks in all cases, which is what moving the change into `Column` would require. Here I'm assuming the original field name does not contain any backticks.
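
For readers following along, a simplified, self-contained model of the dot-splitting contract under discussion. This is not Spark's real `UnresolvedAttribute.parseAttributeName` (which also validates escapes); it only illustrates the behavior.

```scala
// Simplified model of attribute-name parsing: unquoted dots split the name
// into nested-field parts, while backticks protect a region from splitting.
def parseAttributeName(name: String): Seq[String] = {
  val parts = scala.collection.mutable.ArrayBuffer.empty[String]
  val current = new StringBuilder
  var inBacktick = false
  name.foreach {
    case '`'                => inBacktick = !inBacktick      // toggle quoted region
    case '.' if !inBacktick => parts += current.toString; current.clear()
    case c                  => current.append(c)
  }
  parts += current.toString
  parts.toList
}

parseAttributeName("a.b")   // List(a, b)  -> field b of column a
parseAttributeName("`a.b`") // List(a.b)   -> one column literally named "a.b"
// A name that already contains a backtick cannot be handled by blindly
// wrapping it in backticks -- the limitation noted in the comment above.
```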

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42814 has finished for PR 8811 at commit 626a82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 20, 2015

Test build #44003 has finished for PR 8811 at commit 626a82b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor Author

retest this please

@SparkQA

SparkQA commented Oct 21, 2015

Test build #44024 has finished for PR 8811 at commit 626a82b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

I think the main problem is that we interpret column names with special handling of `.` for DataFrame. This lets us write something like `df("a.b")` to get the field `b` of column `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, so no extra interpretation is needed.

How about adding a new version of `DataFrame.resolve` that uses the column name directly, without interpreting it, and calling that version from `DataFrame.apply("*")` and `DataFrame.withColumnRenamed`?
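
A hypothetical, self-contained sketch of what such a "literal" resolve could look like (all names here are invented for illustration; this is not the actual Spark code):

```scala
// Hypothetical sketch: resolve a column by comparing the requested name
// verbatim against the plan's output attributes, with no dot/backtick
// interpretation at all.
case class Attribute(name: String)

def resolveLiterally(
    output: Seq[Attribute],
    colName: String,
    resolver: (String, String) => Boolean): Option[Attribute] =
  output.find(attr => resolver(attr.name, colName))

val output = Seq(Attribute("a.b"), Attribute("d"))
val caseInsensitive = (a: String, b: String) => a.equalsIgnoreCase(b)

resolveLiterally(output, "a.b", caseInsensitive) // Some(Attribute(a.b)) -- literal match
```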

@asfgit asfgit closed this in d9e30c5 Nov 5, 2015
asfgit pushed a commit that referenced this pull request Nov 5, 2015
The main problem is that we interpret column names with special handling of `.` for DataFrame. This enables us to write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`: in these two cases the column name is already the final name, so no extra processing is needed to interpret it.

The solution is simple: use `queryExecution.analyzed.output` to get the resolved columns directly, instead of using `DataFrame.resolve`.

close #8811

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9462 from cloud-fan/special-chars.

(cherry picked from commit d9e30c5)
Signed-off-by: Yin Huai <yhuai@databricks.com>
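
To make the commit message concrete, here is a self-contained behavioral model of the rename path it describes. This is a simplification under the stated approach, not the actual `DataFrame.scala` change.

```scala
// Model of withColumnRenamed after the fix: match the existing name literally
// against the analyzed output; if nothing matches, the call is a no-op.
def withColumnRenamed(
    output: Seq[String],
    existingName: String,
    newName: String): Seq[String] =
  if (output.contains(existingName))
    output.map(n => if (n == existingName) newName else n)
  else
    output // no literal match: return the schema unchanged

withColumnRenamed(Seq("a.b", "d"), "d", "f")   // List(a.b, f)
withColumnRenamed(Seq("a.b", "d"), "a.b", "s") // List(s, d) -- dot stays literal
```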
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015