
[SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR. #10201

Closed
wants to merge 9 commits

Conversation

sun-rui
Contributor

@sun-rui sun-rui commented Dec 8, 2015

No description provided.

@SparkQA

SparkQA commented Dec 8, 2015

Test build #47335 has finished for PR 10201 at commit e4372d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1324,12 +1312,16 @@ setMethod("selectExpr",
#' path <- "path/to/file.json"
#' df <- jsonFile(sqlContext, path)
#' newDF <- withColumn(df, "newCol", df$col1 * 5)
#' # Replace an existing column
#' newDF2 <- withColumn(newDF, "newCol", newDF$col1)
Member

I'm not 100% sure about the replace-existing-column behavior. I thought it was intentional that we supported multiple columns with the same name before?

Contributor Author

I don't know the reason. The original commit can be found at amplab-extras/SparkR-pkg#204.

I don't think it is related to supporting multiple columns with the same name. Spark Core itself allows multiple columns with the same name:

scala> val df=sqlContext.createDataFrame(Seq((1,2,3))).toDF("a","a","c")
df: org.apache.spark.sql.DataFrame = [a: int, a: int, c: int]

scala> df.show
+---+---+---+
|  a|  a|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
scala> df.withColumn("a", df("c")).show
+---+---+---+
|  a|  a|  c|
+---+---+---+
|  3|  3|  3|
+---+---+---+

You can see in the above example that all columns with the same name are replaced.
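Outside Spark, the replace-vs-add semantics being discussed can be modeled with a small sketch (illustrative Python, not Spark code; the `with_column` helper and the tuple-list column model are made up for the example):

```python
def with_column(columns, name, value):
    """Toy model of DataFrame.withColumn() semantics since Spark 1.4:
    if any existing column shares the given name, every matching column
    is replaced; otherwise the column is appended as a new column."""
    if any(n == name for n, _ in columns):
        return [(n, value if n == name else v) for n, v in columns]
    return columns + [(name, value)]

# Mirrors the Scala session above: columns "a", "a", "c" with row (1, 2, 3)
cols = [("a", 1), ("a", 2), ("c", 3)]
print(with_column(cols, "a", 3))  # both "a" columns replaced: [('a', 3), ('a', 3), ('c', 3)]
print(with_column(cols, "d", 4))  # no name match, appended as a new column
```

This matches the Scala session: `df.withColumn("a", df("c"))` replaced both columns named "a".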

Contributor Author

Now I know the reason. When withColumn was implemented in SparkR, withColumn() in Scala supported only adding columns, without support for replacing existing columns. Later, withColumn() in Scala was enhanced to support replacing existing columns; see #5541. However, withColumn in SparkR had not been synced with Scala until this PR :)

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47571 has finished for PR 10201 at commit aa7682d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

function(x, col) {
  stopifnot(class(col) == "character" || class(col) == "Column")

  if (class(col) == "character") {
Member

I'd flip this check, since @jc should only be called on a Column.
But that's a minor point, since it's checked on line 2245.
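For context, the `drop` method this PR implements removes every column matching the given name(s); in Spark, dropping a name that does not exist is a no-op. A toy sketch of that semantics (illustrative Python, not the SparkR implementation; the tuple-list column model is made up for the example):

```python
def drop(columns, names):
    """Toy model of DataFrame.drop() semantics: remove every column
    whose name is in `names`; unknown names are silently ignored.
    Accepts a single name or a list of names, like the SparkR method."""
    if isinstance(names, str):
        names = [names]
    return [(n, v) for n, v in columns if n not in names]

cols = [("a", 1), ("b", 2), ("c", 3)]
print(drop(cols, "b"))           # [('a', 1), ('c', 3)]
print(drop(cols, ["a", "x"]))    # unknown "x" ignored: [('b', 2), ('c', 3)]
```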

@felixcheung
Member

Looks good, though I'm concerned that replacing columns could be a breaking behavior change that we should document.
Also @shivaram, I think this shows it's good to be able to detect new/accidentally masked functions :) #10171

@sun-rui
Contributor Author

sun-rui commented Dec 11, 2015

@felixcheung, yes, this may cause a backward-compatibility issue. But it is not SparkR specific, as it is a change in the Spark SQL core. Where is the appropriate place to document it?

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47580 has finished for PR 10201 at commit 619b946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

SQL and MLlib have a "Migration Guide" section; perhaps something like that? http://spark.apache.org/docs/latest/sql-programming-guide.html#migration-guide
In fact, there is language-specific content in SQL's migration guide.

@shivaram
Contributor

@felixcheung Was there a migration guide entry for withColumn changing in Scala/Python? If so, we can also add one to a SparkR migration guide. At a high level, adding functionality that already exists in Scala seems fine to me.

@felixcheung
Member

@shivaram I checked the release notes and the programming/migration guides, and I don't see any reference to withColumn for Spark 1.4.0 or 1.4.1. Perhaps the behavior change happened before the 1.4.0 release?

@sun-rui
Contributor Author

sun-rui commented Dec 14, 2015

According to https://issues.apache.org/jira/browse/SPARK-6635 and https://issues.apache.org/jira/browse/SPARK-10073, the feature landed in Spark 1.4.0 for Scala and in 1.5.0 for Python. But it seems both just had their API docs updated, without any migration guide entry for the compatibility break. Do we need to add one specifically for SparkR?

@sun-rui
Contributor Author

sun-rui commented Dec 14, 2015

@felixcheung, @shivaram, the documentation for withColumn has been updated. Please take a look.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47643 has finished for PR 10201 at commit e6e9f10.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 14, 2015

Test build #47647 has finished for PR 10201 at commit 14215d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Dec 15, 2015

@felixcheung, refined wording:

Prior to 1.4, DataFrame.withColumn() supported adding a column only. The column was always added as a new column with its specified name in the result DataFrame, even if existing columns had the same name. Since 1.4, DataFrame.withColumn() supports adding a column with a name different from all existing column names, or replacing existing columns of the same name.

Any comments?

@felixcheung
Member

That's good, thanks.

@SparkQA

SparkQA commented Dec 15, 2015

Test build #47718 has finished for PR 10201 at commit 5eba7f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' sc <- sparkR.init()
#' sqlCtx <- sparkRSQL.init(sc)
#' path <- "path/to/file.json"
#' df <- jsonFile(sqlCtx, path)
Member

Update this to read.json?

Contributor Author

Good catch, thanks.

@felixcheung
Member

Looks good; only a minor code doc comment.

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47803 has finished for PR 10201 at commit f9659db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sun-rui
Contributor Author

sun-rui commented Dec 22, 2015

Any other comments? @shivaram, could you merge it?

@shivaram
Contributor

@sun-rui Sorry for the delay in looking at this. Could you bring this up to date with master? It looks good to me.

@@ -2073,6 +2073,8 @@ options.
--conf spark.sql.hive.thriftServer.singleSession=true \
...
{% endhighlight %}
- Since 1.6.1, withColumn method in sparkR supports adding a new column to or replacing existing columns
Contributor Author

Which version is appropriate here, 1.6.1 or 2.0?

Member

Maybe we want to put this in the R migration guide section instead of SQL's? Or both?

@sun-rui
Contributor Author

sun-rui commented Jan 20, 2016

Rebased to master.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49758 has finished for PR 10201 at commit 89657f8.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49777 has finished for PR 10201 at commit c08c1ea.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2016

Test build #49846 has finished for PR 10201 at commit 5eb3004.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

LGTM

@shivaram
Contributor

Merging this to master

@asfgit asfgit closed this in 1b2a918 Jan 21, 2016