
[SparkR-209] Remaining DataFrame methods + dropTempTable #204

Merged
merged 9 commits on Mar 9, 2015

Conversation

cafreeman
Contributor

Adds `dropTempTable` in SQLContext.

New DataFrame methods:

  • insertInto
  • unionAll
  • intersect
  • subtract
  • withColumn
  • withColumnRenamed

cafreeman added 6 commits March 5, 2015 14:26
Add `dropTempTable` function and update tests to drop the test table at the end of every test.
Added another version of `select` that takes a list as an argument (instead of just specific column names or expressions).
@cafreeman
Contributor Author

Hmmm, this was passing on my machine. I'll look at the tests and see if the merge threw something off.

@cafreeman
Contributor Author

I've run run-tests.sh several times now and it's passed every time. Is there a way to restart the Travis build?

@shivaram
Contributor

shivaram commented Mar 6, 2015

I've restarted it -- FWIW the errors on the previous build were

1. Failure(@test_sparkSQL.R#81): insertInto() on a registered table ------------
first(sql(sqlCtx, "select * from table1"))$name == "Michael" isn't true
2. Failure(@test_sparkSQL.R#484): unionAll(), subtract(), and intersect() on a DataFrame 
first(subtracted)$name == "Justin" isn't true

@cafreeman
Contributor Author

@shivaram I saw those, which is especially strange since those two tests rely solely on the order of rows in the inputs...which never changed. Curious to see how Travis goes this time.

@cafreeman
Contributor Author

Still the same errors...I'm at a loss.

@cafreeman
Contributor Author

Maybe it has to do with the temporary filepaths? I think Davies and I used the same filepath names inside some of our tests. Maybe we need to unlink the extra filepaths (basically everything that isn't either jsonPath or parquetPath, which are declared at the very beginning).

Still not sure why none of the tests would fail locally though.
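
Something along these lines for each extra path (just a sketch; jsonPathTemp is a made-up name, not one of the suite's actual variables):

# Sketch: write a per-test temp file, then remove it at the end so
# other tests can't collide on the same path.
jsonPathTemp <- tempfile(pattern = "sparkr-test", fileext = ".json")
writeLines('{"name":"Michael", "age":29}', jsonPathTemp)
# ... run the test against jsonPathTemp ...
unlink(jsonPathTemp)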

@shivaram
Contributor

shivaram commented Mar 6, 2015

I think the problem might be that the order in which things are appended or unioned is not deterministic. For example, if we say insertInto(df, "table1"), do we have guarantees that df will be placed above or below table1?

@shivaram
Contributor

shivaram commented Mar 6, 2015

BTW the reason I suspect that is that on my machine the subtract test failed, and it turned out the rows were reordered:

> collect(subtracted)
15/03/05 19:13:09 INFO FileInputFormat: Total input paths to process : 1
  age    name
1  NA Michael
2  19  Justin

@cafreeman
Contributor Author

Wow, that is very interesting, especially since it seems like the order varies from machine to machine but not from run to run (I even tried restarting my computer and running the tests, but still didn't get any errors).

I'm not aware of any internal mechanism for guaranteeing the order of the rows, but I can just sort the results prior to any expect() statements and I think that will make it consistent.
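
Something like this (a sketch using the subtracted frame from the failing test, assuming testthat's expect_true):

# Sort the collected rows by a key column before asserting, so the
# check no longer depends on partition/row order.
rows <- collect(subtracted)
rows <- rows[order(rows$name), ]
expect_true(rows$name[[1]] == "Justin")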

@shivaram
Contributor

shivaram commented Mar 6, 2015

Sorting will be good, or you can also add statements that check whether any row contains "Michael", etc.
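
e.g. a minimal sketch of the membership-style check (df here is a placeholder for the frame under test):

# Assert that the expected name appears somewhere in the result,
# regardless of row order.
rows <- collect(df)
expect_true("Michael" %in% rows$name)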

@cafreeman
Contributor Author

Travis is happy again! Looks like sorting fixed it. Good catch @shivaram

@shivaram
Contributor

shivaram commented Mar 6, 2015

@davies could you take a look?

@@ -768,6 +798,20 @@ setMethod("select", signature(x = "DataFrame", col = "Column"),
dataFrame(sdf)
})

setMethod("select",
signature(x = "DataFrame", col = "list"),
Contributor

What's the purpose of this API? I did not find one in Scala/Python to match this.

Contributor Author

There isn't one in Scala or Python that matches. I created it as a utility because it allows you to pass a list of arguments to select. Since the generic for select is currently specified as function(x, col, ...), where col is either character or Column, I don't see a good way to pass in an entire list.

In the Python/Scala API, you get around this by using *args, but that isn't quite as easy with S4.

You can see how this is used in withColumnRenamed below.
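
For example (a sketch with hypothetical column names; I'm assuming the list can hold Column objects, matching the withColumnRenamed use):

# Build a list of columns up front, then select them in one call
# via the new list-based select() method.
cols <- list(df$age, df$name)
newDF <- select(df, cols)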

Contributor

I see, we won't need this if we change withColumnRenamed to call the Java API, right?

Contributor

FWIW, in R you can use do.call to call a varargs function with a list of arguments. For example:

arglist <- list(x=runif(10), trim=0.1, na.rm=TRUE)
do.call(mean, arglist)

Contributor Author

@shivaram That's a good point, but I think having this method is still a nice built-in option. My thinking was that it gives you a very accessible way to programmatically create new DataFrames in R. For example: you've got a ton of columns, you lapply a function to grab a subset of them (based on some condition), and pass them straight through to select without needing to even think about do.call, as in the sketch below.
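
A rough sketch of that pattern (hypothetical column name, and assuming the list form also accepts column names as strings):

# Programmatically keep every column except "age", then hand the
# whole list to the list-based select() -- no do.call needed.
keep <- as.list(Filter(function(n) n != "age", columns(df)))
newDF <- select(df, keep)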

Contributor

I see -- yes, I think it is a good feature to have. I wrote up a few more APIs we should support for column selection at [1], and one of them was to include a way to specify a list of column names.

[1] https://sparkr.atlassian.net/browse/SPARKR-189?focusedCommentId=12320&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12320

Contributor Author

Also, with the way the signature for select is written, will do.call actually work that way? The signature specifies one explicit col argument prior to ..., so I was thinking that to pass a list to the current methods, you'd need something like this:

arglist <- list("age", "name", "height")
select(df, arglist[[1]], arglist[2:3])

Does do.call collapse all the other arguments down into ... for you?

Contributor

Yeah, here is a simple example:

a <- function(x1, ...) {
  first <- x1
  rest <- list(...)
  cat("Got ", length(rest) + 1, " arguments\n")
}
do.call(a, list(1, 2, 3))

Contributor Author

Oh, wow, I stand corrected :)

So should I drop/refactor this method?

@shivaram
Contributor

shivaram commented Mar 8, 2015

@cafreeman I think this looks pretty good. As I mentioned inline, I think we should extend the select(cols) API and the withColumn API in the future to make them more R-like (e.g. df[[c("a", "b")]] or df$newCol <- df$col * 2). But I think we can track that in SPARKR-189.

So this PR LGTM except for a minor inline comment I had.

@davies any other comments ?

@davies
Contributor

davies commented Mar 9, 2015

LGTM, thanks!

@cafreeman
Contributor Author

Thanks @shivaram and @davies. Agreed on extending the APIs even further to make use of the double-bracket notation. That'll be a good addition.

@shivaram
Contributor

shivaram commented Mar 9, 2015

Merging this

shivaram added a commit that referenced this pull request Mar 9, 2015
[SparkR-209] Remaining DataFrame methods + `dropTempTable`