[SPARK-10656][SQL]fix selection fails when a column has special characters #8811
Test build #42648 has finished for PR 8811 at commit
A better place to add the backtick is in the class
checkAnswer(df.select("`a.b`"), Row(10))
checkAnswer(df.select("*"), Row(10, 11, Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))
For these tests, which ones will fail without your fix?
I just added more tests to cover the if branch you mentioned. The items below would fail without this patch:
checkAnswer(df.select(df("*")), Row(10, 11, Row(12)))
checkAnswer(df.withColumnRenamed("f", "h").select("h"), Row(Row(12)))
checkAnswer(df.withColumnRenamed("f", "f").select("f"), Row(Row(12)))
checkAnswer(df.withColumnRenamed("`a.b`", "s").select("s"), Row(10))
checkAnswer(df.withColumnRenamed("f", "h").select("`a.b`"), Row(10))
Test build #42801 has finished for PR 8811 at commit
@chenghao-intel,
Test build #42814 has finished for PR 8811 at commit
Jenkins, retest this please.
Test build #44003 has finished for PR 8811 at commit
retest this please
Test build #44024 has finished for PR 8811 at commit
I think the main problem is: we interpret column names with special handling of `.`. How about adding a new version of
The main problem is that we interpret column names with special handling of `.` for DataFrame. This lets us write something like `df("a.b")` to get the field `b` of `a`. However, we don't need this feature in `DataFrame.apply("*")` or `DataFrame.withColumnRenamed`. In these two cases the column name is already the final name, so no extra processing is needed to interpret it.
The solution is simple: use `queryExecution.analyzed.output` to get the resolved column directly, instead of using `DataFrame.resolve`.
close #8811
Author: Wenchen Fan <wenchen@databricks.com>
Closes #9462 from cloud-fan/special-chars.
(cherry picked from commit d9e30c5)
Signed-off-by: Yin Huai <yhuai@databricks.com>
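The difference between the two lookups described in the commit message can be sketched in plain Scala. This is a toy model, not Spark's code: `Col`, `resolveWithDots`, `resolveExact`, and the sample schema are hypothetical names introduced only for illustration. It shows why a dot-aware lookup cannot find a top-level column literally named `a.b`, while an exact match against the already-resolved output can.

```scala
object ColumnLookupSketch {
  // Hypothetical model of a top-level column: its name and, for struct
  // columns, the names of its nested fields.
  final case class Col(name: String, fields: Set[String] = Set.empty)

  // Dot-aware lookup: "x.y" is read as field "y" of column "x".
  def resolveWithDots(schema: Seq[Col], name: String): Option[String] =
    name.split("\\.", 2) match {
      case Array(col, field) =>
        schema
          .find(c => c.name == col && c.fields.contains(field))
          .map(c => s"${c.name}.$field")
      case _ =>
        schema.find(_.name == name).map(_.name)
    }

  // Exact lookup: the name is already final, so it is compared as-is.
  def resolveExact(schema: Seq[Col], name: String): Option[String] =
    schema.find(_.name == name).map(_.name)

  def main(args: Array[String]): Unit = {
    // Assumed schema: a column literally named "a.b" and a struct "f" with field "a".
    val schema = Seq(Col("a.b"), Col("f", Set("a")))
    println(resolveWithDots(schema, "f.a")) // nested field access works
    println(resolveWithDots(schema, "a.b")) // None: there is no column "a"
    println(resolveExact(schema, "a.b"))    // found: the name is matched verbatim
  }
}
```

In this model, operations that already hold a final column name (such as a rename or a star expansion) would call the exact lookup and skip the dot parsing entirely.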
Best explained with this example:
val df = sqlContext.read.json(sqlContext.sparkContext.makeRDD(
  """{"a.b": "c", "d": "e" }""" :: Nil))
df.select("*").show() // successful
df.select(df("*")).show() // throws exception
df.withColumnRenamed("d", "f").show() // also fails
This PR addresses this by adding backticks around the column names.
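As a rough illustration of the backtick approach, the sketch below wraps a name in backticks when it contains a dot and is not already quoted. `quoteIfNeeded` is a hypothetical helper written for this note, not a function from Spark or from this PR, and it deliberately ignores names that contain backticks themselves.

```scala
object BacktickQuoteSketch {
  // Hypothetical helper: quote a column name so that a dot-aware resolver
  // treats it as a single identifier rather than as a nested field access.
  def quoteIfNeeded(name: String): String = {
    val alreadyQuoted =
      name.length >= 2 && name.startsWith("`") && name.endsWith("`")
    if (name.contains(".") && !alreadyQuoted) "`" + name + "`" else name
  }

  def main(args: Array[String]): Unit = {
    // A dotted name gets quoted; plain and pre-quoted names pass through.
    Seq("a.b", "d", "`a.b`").foreach(n => println(quoteIfNeeded(n)))
  }
}
```

The quoted form is the same one the tests in this thread pass to `select`, for example `df.select("`a.b`")`.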