
[SPARK-12988][SQL] Can't drop columns that contain dots #10943

Closed
wants to merge 4 commits

Conversation

dilipbiswal
Contributor

Neither of these works:
val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("a.c").collect()

df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
val df = Seq((1, 1)).toDF("a_b", "a.c")
df.drop("`a.c`").collect()

df: org.apache.spark.sql.DataFrame = [a_b: int, a.c: int]
Given that you can't use drop to drop subfields, it seems to me that we should treat the column name literally (i.e. as though it is wrapped in backticks).
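A minimal sketch of the behavior being asked for (the helper name dropLiteral is made up for illustration; this is a workaround outside the DataFrame API, not the eventual fix): drop by exact name match, then re-select the remaining columns with backtick-quoted names so the dot is not re-interpreted as struct field access.

```
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical workaround: treat `name` as the literal column name.
def dropLiteral(df: DataFrame, name: String): DataFrame = {
  val remaining = df.schema.fieldNames.filterNot(_ == name)
  // Backtick-quote each remaining name so "." is not parsed as a field path.
  df.select(remaining.map(n => col(s"`$n`")): _*)
}

val df = Seq((1, 1)).toDF("a_b", "a.c")
dropLiteral(df, "a.c").columns   // Array(a_b)
```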

@dilipbiswal dilipbiswal changed the title [SPARK-12988] Can't drop columns that contain dots [SPARK-12988][SQL] Can't drop columns that contain dots Jan 27, 2016
@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50215 has finished for PR 10943 at commit e7f30a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Actually this kind of problem has come up many times. I think we should distinguish between column name and column path (which respects "." and "`"), and have 2 methods that can parse column name and column path respectively. Currently we only have a resolve method that can parse column path; we should add one for column name and go through all DataFrame APIs to fix the ones that should take a column name but handle it as a column path.

cc @rxin

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen for your comments. In my understanding, users need to use backticks to quote a column name if they want it to be treated as a column name as opposed to a column path. I tried the following example:

val df = Seq((1, 2, 3)).toDF("a_b", "a.c", "b.c")
df.select("a.c")   => fails to resolve
df.select("`a.c`") => works fine

Is this not how it is supposed to work? Can you please elaborate with a small example? Thanks in advance.

@cloud-fan
Contributor

select is the API that is supposed to take a column path, but something like withColumn, drop, etc. is supposed to take a column name. So what I suggest is: change resolve to resolvePath and add a new method resolveName which abstracts the column name resolution logic from withColumn; then we can use resolveName in drop and other APIs that need a column name instead of a column path.

FYI, the PR that fixed a similar problem for withColumn: #10500
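For concreteness, one possible shape for such a name-only resolver, assuming it sits inside DataFrame next to the existing resolve just like the other snippets in this thread (an illustration of the suggestion, not the code this PR ended up with):

```
// Sketch only: resolve a top-level column by its literal name, never as a path.
private[sql] def resolveName(colName: String): Option[Attribute] = {
  val resolver = sqlContext.analyzer.resolver
  // Compare against the output attribute names directly, without splitting on "."
  queryExecution.analyzed.output.find(attr => resolver(attr.name, colName))
}
```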

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, let me know if I have interpreted your suggestion correctly, and please let me know if something is amiss. df.resolve() has many callers, so I have not changed its name but have added a comment. Let me know if you want me to refactor it. Thanks.

@SparkQA

SparkQA commented Jan 31, 2016

Test build #50452 has finished for PR 10943 at commit 8201994.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Resolves a column name. This is called when it is required to resolve a column by its
* name only and not as a column path.
*/
private[sql] def resolveColName(colName: String, userSuppliedName: String): Boolean = {
Contributor


how about

private[sql] def indexOf(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  val index = queryExecution.analyzed.output.indexWhere(f => resolver(f.name, colName))
  if (index >= 0) Some(index) else None
}

then we can rewrite withColumn to:

indexOf(colName).map { index =>
  select(output.updated(index, col.as(colName)).map(Column(_)) : _*)
}.getOrElse {
  select(Column("*"), col.as(colName))
}

There may be a better name for this, like resolveToIndex.

@dilipbiswal
Contributor Author

@cloud-fan Thank you Wenchen. I will try your suggestion and get back.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, I couldn't get the code snippet to compile, so I made a change that looks like the following.

def withColumn(colName: String, col: Column): DataFrame = {
  val output = queryExecution.analyzed.output
  indexOf(colName).map { index =>
    val columns = output.zipWithIndex.map {
      case (a, i) => if (i == index) col.as(colName) else Column(a)
    }
    select(columns: _*)
  }.getOrElse {
    select(Column("*"), col.as(colName))
  }
}

Does this look okay to you? Please let me know.

@cloud-fan
Contributor

Ah, changing select(output.updated(index, col.as(colName)).map(Column(_)) : _*) to select(output.map(Column(_)).updated(index, col.as(colName)): _*) should work.
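Putting the correction together with the snippets above, the resulting withColumn would look roughly like this (a sketch assembled from the comments in this thread, using the indexOf helper that the PR later renamed resolveToIndex):

```
def withColumn(colName: String, col: Column): DataFrame = {
  val output = queryExecution.analyzed.output
  indexOf(colName).map { index =>
    // Replace the matching column in place, keeping the original column order.
    select(output.map(Column(_)).updated(index, col.as(colName)): _*)
  }.getOrElse {
    // No existing column with that name: append the new one.
    select(Column("*"), col.as(colName))
  }
}
```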

@dilipbiswal
Contributor Author

@cloud-fan Thanks a lot. I have implemented it as per your input.

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50522 has finished for PR 10943 at commit dfaa13b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

marmbrus commented Feb 2, 2016

test this please

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50525 has finished for PR 10943 at commit dfaa13b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[sql] def resolveToIndex(colName: String): Option[Int] = {
  val resolver = sqlContext.analyzer.resolver
  // First remove any user supplied quotes.
  val unquotedColName = colName.stripPrefix("`").stripSuffix("`")
Contributor


do we need to do this? I think for these methods that require a column name, the user should just pass in the exact column name string, and we don't need to do any extra parsing here, i.e. no resolver, no stripping of the "`".


for example, what if a column is named a`a? User should be able to just pass in a`a and we shouldn't strip the "`"
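Read together, these two comments suggest dropping both the resolver lookup and the backtick stripping, so the helper collapses to a plain equality check against the output attribute names (a sketch of that reading, not code from this PR):

```
// Sketch only: treat colName as the exact, literal column name.
private[sql] def resolveToIndex(colName: String): Option[Int] = {
  val index = queryExecution.analyzed.output.indexWhere(_.name == colName)
  if (index >= 0) Some(index) else None
}
```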

Contributor Author


@cloud-fan Hi Wenchen,

Can you please go through the following comment?

https://issues.apache.org/jira/browse/SPARK-12988?focusedCommentId=15118433&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15118433

I was trying to address the 3rd bullet in that list. About your second question: per the first bullet, shouldn't this be disallowed? Please let me know.

@dilipbiswal
Contributor Author

@cloud-fan I have incorporated your suggestions except the comment about allowing surrounding backticks in the column name. Once we have a decision, I can remove it. Please let me know.

@@ -1274,16 +1259,20 @@ class DataFrame private[sql](
* @since 1.4.1
*/
def drop(col: Column): DataFrame = {
Contributor


why do we have this method....
we can only drop top-level columns, so allowing users to pass in a Column doesn't make sense.

cc @rxin @marmbrus

@dilipbiswal
Contributor Author

@cloud-fan Can we retest, please?

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Feb 2, 2016

Test build #50582 has finished for PR 10943 at commit d2b373f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor Author

@cloud-fan Hi Wenchen, can you please advise on what the next step is for this PR? I am thinking that it may require more discussion to decide whether we need to keep or remove the df.drop(Column) interface. What do you think?

@cloud-fan
Contributor

Sorry for the delay, we are discussing this design choice and will have an agreement this week or next week. Thanks for working on it and sorry for making you wait :)

@dilipbiswal
Contributor Author

@cloud-fan No issues. Thanks for your reply :-)

@HyukjinKwon
Member

ping @cloud-fan

@cloud-fan
Contributor

cc @rxin , looks like we missed this one...

asfgit pushed a commit that referenced this pull request Jun 1, 2016
## What changes were proposed in this pull request?

Fixes "Can't drop top level columns that contain dots".

This work is based on dilipbiswal's #10943.
This PR fixes problems like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b")
org.apache.spark.sql.AnalysisException: cannot resolve '`a.c`' given input columns: [a.b, a.c];
```

`drop(columnName)` can only be used to drop a top-level column, so we should parse the column name literally WITHOUT interpreting dot "."

We should also NOT interpret back tick "`", otherwise it is hard to understand what

```
​```aaa```bbb``
```

actually means.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13306 from clockfly/fix_drop_column.
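For reference, with that change in place the example from the description behaves as expected; an illustrative spark-shell check (not from the original thread) would look like:

```
scala> Seq((1, 2)).toDF("a.b", "a.c").drop("a.b").columns
res0: Array[String] = Array(a.c)
```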
asfgit pushed a commit that referenced this pull request Jun 1, 2016

(cherry picked from commit 06514d6)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@yhuai
Contributor

yhuai commented Jun 28, 2016

How about we close this PR since #13306 has been merged?

vanzin pushed a commit to vanzin/spark that referenced this pull request Aug 4, 2016
Closing the following PRs due to requests or unresponsive users.

Closes apache#13923
Closes apache#14462
Closes apache#13123
Closes apache#14423 (requested by srowen)
Closes apache#14424 (requested by srowen)
Closes apache#14101 (requested by jkbradley)
Closes apache#10676 (requested by srowen)
Closes apache#10943 (requested by yhuai)
Closes apache#9936
Closes apache#10701
@asfgit asfgit closed this in 53e766c Aug 4, 2016