
[SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames #7318

Closed
wants to merge 5 commits

Conversation

Contributor

@mengxr mengxr commented Jul 9, 2015

Add crosstab to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to table in R. However, table in SparkR is used for loading SQL tables as DataFrames, so the return type is data.frame instead of table, which keeps crosstab compatible with Scala/Python.
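A minimal usage sketch (not from the patch itself), assuming a SparkR session with a sqlContext and illustrative string columns a and b:

# Hypothetical local data used only for illustration.
local <- data.frame(a = c("a0", "a1", "a2"), b = c("b0", "b1", "b0"), stringsAsFactors = FALSE)
df <- createDataFrame(sqlContext, local)
ct <- crosstab(df, "a", "b")   # local R data.frame: first column "a_b", one column per value of b
head(ct)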

I couldn't run the R tests successfully on my local machine; many unit tests failed, so let's try Jenkins.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36912 has finished for PR 7318 at commit f1348d6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ct <- crosstab(df, "a", "b")
ordered <- ct[order("a_b"),]
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1))
assert_true(identical(expected, ordered))
Contributor

expect_identical(expected, ordered)

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36946 has finished for PR 7318 at commit 53f6ddd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Jenkins, slow test please

@SparkQA

SparkQA commented Jul 13, 2015

Test build #4 has finished for PR 7318 at commit 53f6ddd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

df <- toDF(rdd, list("a", "b"))
ct <- crosstab(df, "a", "b")
ordered <- ct[order("a_b"),]
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1))
Contributor

I think I figured out what is going on here -- the expected data.frame creates the a_b column as a factor. You can pass stringsAsFactors = FALSE to the data.frame constructor to avoid this.

Also, the order command above should probably be ordered <- ct[order(ct$a_b),] so it sorts by the column values rather than the literal string "a_b" (otherwise only the first row comes back). The result still carries row names, which makes it not identical to expected (you can do row.names(ordered) <- NULL).
A simpler check might be to just get one or two rows from ct and compare them with expected.
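A minimal sketch of the corrected comparison described above, keeping the all-ones counts from the snippet under review and assuming testthat is attached for expect_identical:

ct <- crosstab(df, "a", "b")
ordered <- ct[order(ct$a_b), ]          # order by the column, not the string "a_b"
row.names(ordered) <- NULL              # drop row names so identical() can match
expected <- data.frame(a_b = c("a0", "a1", "a2"),
                       b0 = c(1, 1, 1),
                       b1 = c(1, 1, 1),
                       stringsAsFactors = FALSE)
expect_identical(expected, ordered)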

Contributor Author

I finally got the unit tests to run on my local machine. I upgraded R from 3.0.2 to 3.2.0; I don't know whether that was the cause. The unit tests should work now.

@mengxr mengxr changed the title [WIP][SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames [SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames Jul 18, 2015
@mengxr
Contributor Author

mengxr commented Jul 18, 2015

@shivaram This should be ready for review after Jenkins.

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37688 has finished for PR 7318 at commit d75e894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

row.names(ordered) <- NULL
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 0, 1), "b1" = c(1, 1, 0),
stringsAsFactors = FALSE, row.names = NULL)
expect_identical(expected, ordered)
Contributor

This is minor, but it might be good to have a test case where we get NULL to just make sure that code path works correctly.

Contributor Author

I don't quite get what you mean.

Contributor

So the help message says that pairs with no occurrences will have null as their counts. I just wanted to make sure that case works correctly.

Contributor Author

Ah, we need to update the doc. The behavior changed in 6396cc0: we now output 0 instead of null. I will send a follow-up PR for this because it also touches Scala and Python code.

I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-9243
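A hedged illustration of the post-6396cc0 behavior, reusing the counts from the final test snippet in this thread:

ct <- crosstab(df, "a", "b")            # local R data.frame
ordered <- ct[order(ct$a_b), ]          # Spark does not guarantee row order
# Expected contents (mirroring the final test): the (a1, b0) and (a2, b1)
# pairs never co-occur, so those cells are 0 rather than null.
#   a_b  b0  b1
#   a0    1   1
#   a1    0   1
#   a2    1   0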

Contributor

Ah okay - sounds good. LGTM

@shivaram
Contributor

@mengxr One question about the function naming convention: what are your thoughts on using table for this use case and some other keyword like sqlTable for the SQL use case? This is one of those cases where we need to choose between R function names and Spark SQL function names.

@mengxr
Contributor Author

mengxr commented Jul 21, 2015

I don't have a strong preference about the name. We use crosstab in Scala/Python because table is already taken, and we shouldn't overload table for both loading SQL tables and computing the contingency table. Changing table to sqlTable would be a bigger change than calling R's table crosstab.

@shivaram
Contributor

Hmm, okay. Let's leave it as crosstab in this PR; before the release I'll do one more pass over the API, and we can revisit this if required. Other than the minor unit test comment, this looks good to me.

@shivaram
Contributor

@mengxr I guess we will have a new PR for the documentation update, so this PR LGTM. I will merge this unless you have anything else to add.

@mengxr
Contributor Author

mengxr commented Jul 23, 2015

Yes, please merge it. Thanks for reviewing!

@asfgit asfgit closed this in 2f5cbd8 Jul 23, 2015