[SPARK-12988][SQL] Can't drop columns that contain dots #10943
Conversation
ok to test
Test build #50215 has finished for PR 10943 at commit
Actually this kind of problem has come up many times. I think we should distinguish between a column name and a column path. cc @rxin
@cloud-fan Thank you Wenchen for your comments. In my understanding, users need to use back-ticks to quote column names if they want them to be treated as a column name as opposed to a column path. I tried the following example: `val df = Seq((1, 2, 3)).toDF("a_b", "a.c", "b.c")`. Is this not how it is supposed to work? Can you please elaborate with a small example?
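For reference, a minimal sketch of the quoting behavior being described; it assumes a Spark shell where `spark.implicits._` is in scope:

```scala
val df = Seq((1, 2, 3)).toDF("a_b", "a.c", "b.c")

// Unquoted, "a.c" is parsed as a column path (field `c` of a top-level
// column `a`), which fails because no top-level column is named `a`:
// df.select("a.c")          // throws AnalysisException

// Back-tick quoting makes the whole string a single column name:
df.select("`a.c`").show()    // resolves the top-level column "a.c"
```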
FYI, the PR that fixed a similar problem for
@cloud-fan Thank you Wenchen.
@cloud-fan Hi Wenchen, let me know if I have interpreted your suggestion correctly, and please let me know if something is amiss. `df.resolve()` has many callers, so I have not changed its name but have added a comment. Let me know if you want me to refactor it. Thanks.
Test build #50452 has finished for PR 10943 at commit
```scala
/**
 * Resolves a column name. This is called when it is required to resolve a column by its
 * name only and not as a column path.
 */
private[sql] def resolveColName(colName: String, userSuppliedName: String): Boolean = {
```
how about:

```scala
private[sql] def indexOf(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  val index = queryExecution.analyzed.output.indexWhere(f => resolver(f.name, colName))
  if (index >= 0) Some(index) else None
}
```

then we can rewrite `withColumn` to:

```scala
indexOf(colName).map { index =>
  select(output.updated(index, col.as(colName)).map(Column(_)) : _*)
}.getOrElse {
  select(Column("*"), col.as(colName))
}
```

There may be a better name for this, like `resolveToIndex`.
@cloud-fan Thank you Wenchen. I will try your suggestion and get back.
@cloud-fan Hi Wenchen, I couldn't get the code snippet to compile, so I made a change that looks like the following:

```scala
def withColumn(colName: String, col: Column): DataFrame = {
  val output = queryExecution.analyzed.output
  indexOf(colName).map { index =>
    val columns = output.zipWithIndex.map {
      case (a, i) => if (i == index) col.as(colName) else Column(a)
    }
    select(columns: _*)
  }.getOrElse {
    select(Column("*"), col.as(colName))
  }
}
```

Does this look okay to you? Please let me know.
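If it helps, a small usage sketch of the replace-or-append behavior the snippet above implements; `lit` comes from `org.apache.spark.sql.functions`, and the column names are illustrative:

```scala
import org.apache.spark.sql.functions.lit

val df = Seq((1, 2)).toDF("a_b", "a.c")

// Name already exists: the column at that index is replaced in place.
df.withColumn("a_b", lit(0))    // still two columns; "a_b" becomes the literal 0

// Name does not exist: the new column is appended via select("*", ...).
df.withColumn("sum", lit(3))    // three columns: a_b, a.c, sum
```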
ah, change
@cloud-fan Thanks a lot. I have implemented as per your input.
Test build #50522 has finished for PR 10943 at commit
test this please
Test build #50525 has finished for PR 10943 at commit
```scala
private[sql] def resolveToIndex(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  // First remove any user supplied quotes.
  val unquotedColName = colName.stripPrefix("`").stripSuffix("`")
```
do we need to do this? I think for these methods that require a column name, the user should just pass in an exact column name string, and we don't need any extra parsing here, i.e. no resolver and no stripping of "`"
for example, what if a column is named `` a`a ``? The user should be able to just pass in `` a`a ``, and we shouldn't strip the "`"
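A self-contained sketch of why blanket unquoting is risky; the `strip` helper below mirrors the `stripPrefix`/`stripSuffix` logic in the patch under review:

```scala
// Mirrors the unquoting logic in resolveToIndex above.
val strip = (s: String) => s.stripPrefix("`").stripSuffix("`")

strip("`a.c`")   // "a.c"  -- the intended case: a user-quoted name
strip("a`a")     // "a`a"  -- interior ticks survive
strip("`a`a")    // "a`a"  -- but a literal leading tick is silently lost
```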
@cloud-fan Hi Wenchen,
Can you please go through the following comment?
I was trying to address the 3rd bullet in the list. About your second question, per bullet one this should be disallowed? Please let me know.
@cloud-fan I have incorporated your suggestions except the comment about allowing surrounding backticks in the column name. Once we have a decision, I can remove it. Please let me know.
```diff
@@ -1274,16 +1259,20 @@ class DataFrame private[sql](
    * @since 1.4.1
    */
   def drop(col: Column): DataFrame = {
```
@cloud-fan Can we retest please?
retest this please
Test build #50582 has finished for PR 10943 at commit
@cloud-fan Hi Wenchen, can you please advise on the next step for this PR? I am thinking it may require more discussion to decide whether we need to keep or remove the `df.drop(Column)` interface.
Sorry for the delay; we are discussing this design choice and will have an agreement this week or next week. Thanks for working on it, and sorry for making you wait :)
@cloud-fan No issues. Thanks for your reply :-)
ping @cloud-fan
cc @rxin, looks like we missed this one...
## What changes were proposed in this pull request?

Fixes "Can't drop top level columns that contain dots". This work is based on dilipbiswal's #10943. This PR fixes problems like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```

`drop(columnName)` can only be used to drop a top level column, so we should parse the column name literally, WITHOUT interpreting the dot ".". We should also NOT interpret the back tick "`", otherwise it is hard to understand what

````
```aaa```bbb``
````

actually means.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13306 from clockfly/fix_drop_column.
(Same commit message as above; cherry picked from commit 06514d6. Signed-off-by: Yin Huai <yhuai@databricks.com>.)
how about we close this PR since #13306 has been merged?
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
Neither of these works:

```scala
val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("a.c").collect()
// df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]

val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("`a.c`").collect()
// df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
```

Given that you can't use drop to drop subfields, it seems to me that we should treat the column name literally (i.e. as though it is wrapped in back ticks).
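Under the literal-name semantics eventually merged in #13306, a sketch of the expected behavior (assumes a Spark shell with implicits in scope):

```scala
val df = Seq((1, 1)).toDF("a_b", "a.c")

// The name is compared literally: no dot interpretation, no back-tick handling.
df.drop("a.c").columns     // the dotted column is dropped, leaving only a_b

// The back ticks are part of the literal string, so nothing matches
// and the DataFrame is returned unchanged:
df.drop("`a.c`").columns   // still a_b and a.c
```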