
[SparkR-209] Remaining DataFrame methods + dropTempTable #204

Merged
merged 9 commits on Mar 9, 2015

Conversation

cafreeman
Contributor

Adds `dropTempTable` in SQLContext.

New DataFrame methods:

  • insertInto
  • unionAll
  • intersect
  • subtract
  • withColumn
  • withColumnRenamed

cafreeman added 6 commits March 5, 2015 14:26
Add `dropTempTable` function and update tests to drop the test table at the end of every test.
Added another version of `select` that takes a list as an argument (instead of just specific column names or expressions).
@cafreeman
Contributor Author

Hmmm, this was passing on my machine. I'll look at the tests and see if the merge threw something off.

@cafreeman
Contributor Author

I've run run-tests.sh several times now and it's passed every time. Is there a way to restart the Travis build?

@shivaram
Contributor

shivaram commented Mar 6, 2015

I've restarted it -- FWIW the errors on the previous build were

1. Failure(@test_sparkSQL.R#81): insertInto() on a registered table ------------
first(sql(sqlCtx, "select * from table1"))$name == "Michael" isn't true
2. Failure(@test_sparkSQL.R#484): unionAll(), subtract(), and intersect() on a DataFrame 
first(subtracted)$name == "Justin" isn't true

@cafreeman
Contributor Author

@shivaram I saw those, which is especially strange since those two tests rely solely on the order of rows in the inputs...which never changed. Curious to see how Travis goes this time.

@cafreeman
Contributor Author

Still the same errors...I'm at a loss.

@cafreeman
Contributor Author

Maybe it has to do with the temporary filepaths? I think Davies and I used the same filepath names inside some of our tests. Maybe we need to unlink the extra filepaths (basically everything that isn't either jsonPath or parquetPath, which are declared at the very beginning).

Still not sure why none of the tests would fail locally though.
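
Something along these lines for each extra path (just a sketch; jsonPathTemp is a made-up name, not one of the suite's actual variables):

# Sketch: write a per-test temp file, then remove it at the end so
# other tests can't collide on the same path.
jsonPathTemp <- tempfile(pattern = "sparkr-test", fileext = ".json")
writeLines('{"name":"Michael", "age":29}', jsonPathTemp)
# ... run the test against jsonPathTemp ...
unlink(jsonPathTemp)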

@shivaram
Contributor

shivaram commented Mar 6, 2015

I think the problem might be that the order in which things are appended or unioned is not deterministic. For example, if we say insertInto(df, "table1"), do we have guarantees that df will be placed above or below table1?

@shivaram
Contributor

shivaram commented Mar 6, 2015

BTW the reason I suspect that is that on my machine the subtract test failed, and it turned out the rows were reordered:

> collect(subtracted)
15/03/05 19:13:09 INFO FileInputFormat: Total input paths to process : 1
  age    name
1  NA Michael
2  19  Justin

@cafreeman
Contributor Author

Wow, that is very interesting, especially since it seems like the order varies from machine to machine but not from run to run (I even tried restarting my computer and running the tests, but still didn't get any errors).

I'm not aware of any internal mechanism for guaranteeing the order of the rows, but I can just sort the results prior to any expect() statements and I think that will make it consistent.
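
Something like this (a sketch using the subtracted frame from the failing test, assuming testthat's expect_true):

# Sort the collected rows by a key column before asserting, so the
# check no longer depends on partition/row order.
rows <- collect(subtracted)
rows <- rows[order(rows$name), ]
expect_true(rows$name[[1]] == "Justin")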

@shivaram
Contributor

shivaram commented Mar 6, 2015

Sorting will be good, or you can also add statements that check whether any row contains "Michael", etc.
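
e.g. a minimal sketch of the membership-style check (df here is a placeholder for the frame under test):

# Assert that the expected name appears somewhere in the result,
# regardless of row order.
rows <- collect(df)
expect_true("Michael" %in% rows$name)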

@cafreeman
Contributor Author

Travis is happy again! Looks like sorting fixed it. Good catch @shivaram

@shivaram
Contributor

shivaram commented Mar 6, 2015

@davies could you take a look?

@@ -768,6 +798,20 @@ setMethod("select", signature(x = "DataFrame", col = "Column"),
dataFrame(sdf)
})

setMethod("select",
signature(x = "DataFrame", col = "list"),
Contributor

What's the purpose of this API? I did not find one in Scala/Python to match this.

Contributor Author

There isn't one in Scala or Python that matches. I created it as a utility because it allows you to pass a list of arguments to select. Since the generic for select is currently specified as function(x, col, ...), where col is either character or Column, I don't see a good way to pass in an entire list.

In the Python/Scala API, you get around this by using *args, but that isn't quite as easy with S4.

You can see how this is used in withColumnRenamed below.
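
For example (a sketch with hypothetical column names; I'm assuming the list can hold Column objects, matching the withColumnRenamed use):

# Build a list of columns up front, then select them in one call
# via the new list-based select() method.
cols <- list(df$age, df$name)
newDF <- select(df, cols)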

Contributor

I see, we won't need this if we change withColumnRenamed to call the Java API, right?

Contributor

FWIW, in R you can use do.call to call a varargs function with a list of arguments. For example:

arglist <- list(x=runif(10), trim=0.1, na.rm=TRUE)
do.call(mean, arglist)

Contributor Author

@shivaram That's a good point, but I think having this method is still a nice built-in option. My thinking was that it gives you a very accessible way to programmatically create new DataFrames in R. For example: you've got a ton of columns, you lapply a function to grab a subset of them (based on some condition), and pass them straight through to select without needing to even think about do.call, as in the sketch below.
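
A rough sketch of that pattern (hypothetical column name, and assuming the list form also accepts column names as strings):

# Programmatically keep every column except "age", then hand the
# whole list to the list-based select() -- no do.call needed.
keep <- as.list(Filter(function(n) n != "age", columns(df)))
newDF <- select(df, keep)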

Contributor

I see -- yes, I think it is a good feature to have. I wrote up a few more APIs we should support for column selection at [1], and one of them was to include a way to specify a list of column names.

[1] https://sparkr.atlassian.net/browse/SPARKR-189?focusedCommentId=12320&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12320

Contributor Author

Also, with the way the signature for select is written, will do.call actually work that way? The signature specifies one explicit col argument prior to ..., so I was thinking that to pass a list to the current methods, you'd need something like this:

arglist <- list("age", "name", "height")
select(df, arglist[[1]], arglist[2:3])

Does do.call collapse all the other arguments down into ... for you?

Contributor

Yeah, here is a simple example:

a <- function(x1, ...) {
  first <- x1
  rest <- list(...)
  cat("Got ", length(rest) + 1, " arguments\n")
}
do.call(a, list(1, 2, 3))

Contributor Author

Oh, wow, I stand corrected :)

So should I drop/refactor this method?

@shivaram
Contributor

shivaram commented Mar 8, 2015

@cafreeman I think this looks pretty good. As I mentioned inline, I think we should extend the select(cols) API and the withColumn API in the future to make them more R-like (e.g. df[[c("a", "b")]] or df$newCol <- df$col * 2). But I think we can track that in SPARKR-189.

So this PR LGTM except for a minor inline comment I had.

@davies any other comments ?

@davies
Contributor

davies commented Mar 9, 2015

LGTM, thanks!

@cafreeman
Contributor Author

Thanks @shivaram and @davies. Agreed on extending the APIs even further to make use of the double-bracket notation. That'll be a good addition.

@shivaram
Contributor

shivaram commented Mar 9, 2015

Merging this

shivaram added a commit that referenced this pull request Mar 9, 2015
[SparkR-209] Remaining DataFrame methods + `dropTempTable`