
[SPARK-12988][SQL] Can't drop columns that contain dots #10943

Closed
wants to merge 4 commits

Conversation

dilipbiswal
Contributor

Neither of these works:
val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("a.c").collect()

df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("`a.c`").collect()

df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
Given that you can't use drop to drop subfields, it seems to me that we should treat the column name literally (i.e. as though it is wrapped in backticks).
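A minimal sketch of the behavior being asked for (the helper name dropLiteral is made up for illustration; this is a workaround outside the DataFrame API, not the eventual fix): drop by exact name match, then re-select the remaining columns with backtick-quoted names so the dot is not re-interpreted as struct field access.

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical workaround: treat `name` as the literal column name.
def dropLiteral(df: DataFrame, name: String): DataFrame = {
  val remaining = df.schema.fieldNames.filterNot(_ == name)
  // Backtick-quote each remaining name so "." is not parsed as a field path.
  df.select(remaining.map(n => col(s"`$n`")): _*)
}

val df = Seq((1, 1)).toDF("a_b", "a.c")
dropLiteral(df, "a.c").columns   // Array(a_b)
```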

@dilipbiswal dilipbiswal changed the title [SPARK-12988] Can't drop columns that contain dots [SPARK-12988][SQL] Can't drop columns that contain dots Jan 27, 2016
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50215 has finished for PR 10943 at commit e7f30a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Actually this kind of problem has come up many times. I think we should distinguish between column name and column path (which respects "." and "`"), and have 2 methods that can parse column name and column path respectively. Currently we only have a resolve method that can parse column path; we should add one for column name and go through all DataFrame APIs to fix the ones that should take a column name but handle it as a column path.

cc @rxin

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen for your comments. In my understanding, users need to use backticks to quote a column name if they want it to be treated as a column name as opposed to a column path. I tried the following example:

val df = Seq((1, 2, 3)).toDF("a_b", "a.c", "b.c")
df.select("a.c")   => fails to resolve
df.select("`a.c`") => works fine

Is this not how it is supposed to work? Can you please elaborate with a small example? Thanks in advance.

@cloud-fan
Contributor

select is the API that is supposed to take a column path, but something like withColumn, drop, etc. is supposed to take a column name. So what I suggest is: change resolve to resolvePath and add a new method resolveName which abstracts the column name resolution logic from withColumn; then we can use resolveName in drop and other APIs that need a column name instead of a column path.

FYI, the PR that fixed a similar problem for withColumn: #10500
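For concreteness, one possible shape for such a name-only resolver, assuming it sits inside DataFrame next to the existing resolve just like the other snippets in this thread (an illustration of the suggestion, not the code this PR ended up with):

```
// Sketch only: resolve a top-level column by its literal name, never as a path.
private[sql] def resolveName(colName: String): Option[Attribute] = {
  val resolver = sqlContext.analyzer.resolver
  // Compare against the output attribute names directly, without splitting on "."
  queryExecution.analyzed.output.find(attr => resolver(attr.name, colName))
}
```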

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, let me know if I have interpreted your suggestion correctly, and please let me know if something is amiss. df.resolve() has many callers, so I have not changed its name but have added a comment. Let me know if you want me to refactor it. Thanks.

@SparkQA

SparkQA commented Jan 31, 2016

Test build #50452 has finished for PR 10943 at commit 8201994.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Resolves a column name. This is called when it is required to resolve a column by its
* name only and not as a column path.
*/
private[sql] def resolveColName(colName: String, userSuppliedName: String): Boolean = {
Contributor


how about

private[sql] def indexOf(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  val index = queryExecution.analyzed.output.indexWhere(f => resolver(f.name, colName))
  if (index >= 0) Some(index) else None
}

then we can rewrite withColumn to:

indexOf(colName).map { index =>
  select(output.updated(index, col.as(colName)).map(Column(_)) : _*)
}.getOrElse {
  select(Column("*"), col.as(colName))
}

There may be a better name for this, like resolveToIndex.

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen. I will try your suggestion and get back.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I couldn't get the code snippet to compile, so I made a change that looks like the following.

def withColumn(colName: String, col: Column): DataFrame = {
  val output = queryExecution.analyzed.output
  indexOf(colName).map { index =>
    val columns = output.zipWithIndex.map {
      case (a, i) => if (i == index) col.as(colName) else Column(a)
    }
    select(columns: _*)
  }.getOrElse {
    select(Column("*"), col.as(colName))
  }
}

Does this look okay to you? Please let me know.

@cloud-fan
Contributor

Ah, changing select(output.updated(index, col.as(colName)).map(Column(_)) : _*) to select(output.map(Column(_)).updated(index, col.as(colName)): _*) should work.
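Putting the correction together with the snippets above, the resulting withColumn would look roughly like this (a sketch assembled from the comments in this thread, using the indexOf helper that the PR later renamed resolveToIndex):

```
def withColumn(colName: String, col: Column): DataFrame = {
  val output = queryExecution.analyzed.output
  indexOf(colName).map { index =>
    // Replace the matching column in place, keeping the original column order.
    select(output.map(Column(_)).updated(index, col.as(colName)): _*)
  }.getOrElse {
    // No existing column with that name: append the new one.
    select(Column("*"), col.as(colName))
  }
}
```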

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. I have implemented it as per your input.

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50522 has finished for PR 10943 at commit dfaa13b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented Feb 2, 2016

test this please

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50525 has finished for PR 10943 at commit dfaa13b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[sql] def resolveToIndex(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  // First remove any user supplied quotes.
  val unquotedColName = colName.stripPrefix("`").stripSuffix("`")
Contributor


do we need to do this? I think for these methods that require a column name, the user should just pass in the exact column name string, and we don't need to do any extra parsing here, i.e. no resolver, no stripping of the "`".


for example, what if a column is named a`a? User should be able to just pass in a`a and we shouldn't strip the "`"
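Read together, these two comments suggest dropping both the resolver lookup and the backtick stripping, so the helper collapses to a plain equality check against the output attribute names (a sketch of that reading, not code from this PR):

```
// Sketch only: treat colName as the exact, literal column name.
private[sql] def resolveToIndex(colName: String): Option[Int] = {
  val index = queryExecution.analyzed.output.indexWhere(_.name == colName)
  if (index >= 0) Some(index) else None
}
```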

Contributor Author


@cloud-fan Hi Wenchen,

Can you please go through the following comment?

https://issues.apache.org/jira/browse/SPARK-12988?focusedCommentId=15118433&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15118433

I was trying to address the 3rd bullet in that list. About your second question: per the first bullet, shouldn't this be disallowed? Please let me know.

@dilipbiswal
Contributor Author

@cloud-fan I have incorporated your suggestions except the comment about allowing surrounding backticks in the column name. Once we have a decision, I can remove it. Please let me know.

@@ -1274,16 +1259,20 @@ class DataFrame private[sql](
* @since 1.4.1
*/
def drop(col: Column): DataFrame = {
Contributor


why do we have this method....
we can only drop top-level columns, so allowing users to pass in a Column doesn't make sense.

cc @rxin @marmbrus

@dilipbiswal
Contributor Author

@cloud-fan Can we retest, please?

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50582 has finished for PR 10943 at commit d2b373f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, can you please advise on what the next step is for this PR? I am thinking that it may require more discussion to decide whether we need to keep or remove the df.drop(Column) interface. What do you think?

@cloud-fan
Contributor

Sorry for the delay, we are discussing this design choice and will have an agreement this week or next week. Thanks for working on it and sorry for making you wait :)

@dilipbiswal
Contributor Author

@cloud-fan No issues. Thanks for your reply :-)

@HyukjinKwon
Member

ping @cloud-fan

@cloud-fan
Contributor

cc @rxin , looks like we missed this one...

asfgit pushed a commit that referenced this pull request Jun 1, 2016
## What changes were proposed in this pull request?

Fixes "Can't drop top level columns that contain dots".

This work is based on dilipbiswal's #10943.
This PR fixes problems like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```

`drop(columnName)` can only be used to drop a top-level column, so we should parse the column name literally WITHOUT interpreting dot "."

We should also NOT interpret back tick "`", otherwise it is hard to understand what

```
​```aaa```bbb``
```

actually means.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13306 from clockfly/fix_drop_column.
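For reference, with that change in place the example from the description behaves as expected; an illustrative spark-shell check (not from the original thread) would look like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b").columns
res0: Array[String] = Array(a.c)
```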
asfgit pushed a commit that referenced this pull request Jun 1, 2016

(cherry picked from commit 06514d6)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@yhuai
Contributor

yhuai commented Jun 28, 2016

How about we close this PR since #13306 has been merged?

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
@asfgit asfgit closed this in 53e766c Aug 4, 2016