
[SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames #7318

Closed
wants to merge 5 commits

Conversation

Contributor

@mengxr mengxr commented Jul 9, 2015

Add crosstab to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar to table in R. However, table in SparkR is used for loading SQL tables as DataFrames, so the return type is data.frame instead of table, which keeps crosstab compatible with Scala/Python.
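A minimal usage sketch (not from the patch itself), assuming a SparkR session with a sqlContext and illustrative string columns a and b:

# Hypothetical local data used only for illustration.
local <- data.frame(a = c("a0", "a1", "a2"), b = c("b0", "b1", "b0"), stringsAsFactors = FALSE)
df <- createDataFrame(sqlContext, local)
ct <- crosstab(df, "a", "b")   # local R data.frame: first column "a_b", one column per value of b
head(ct)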

I couldn't run the R tests successfully on my local machine; many unit tests failed, so let's try Jenkins.

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36912 has finished for PR 7318 at commit f1348d6.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ct <- crosstab(df, "a", "b")
ordered <- ct[order("a_b"),]
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1))
assert_true(identical(expected, ordered))
Contributor

expect_identical(expected, ordered)

@SparkQA

SparkQA commented Jul 9, 2015

Test build #36946 has finished for PR 7318 at commit 53f6ddd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Jenkins, slow test please

@SparkQA

SparkQA commented Jul 13, 2015

Test build #4 has finished for PR 7318 at commit 53f6ddd.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

df <- toDF(rdd, list("a", "b"))
ct <- crosstab(df, "a", "b")
ordered <- ct[order("a_b"),]
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1))
Contributor

I think I figured out what is going on here -- the expected data.frame creates the a_b column as a factor. You can pass stringsAsFactors = FALSE to the data.frame constructor to avoid this.

Also, the order command above should probably be ordered <- ct[order(ct$a_b),] so it sorts by the column values rather than the literal string "a_b" (otherwise only the first row comes back). The result still carries row names, which makes it not identical to expected (you can do row.names(ordered) <- NULL).
A simpler check might be to just get one or two rows from ct and compare them with expected.
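A minimal sketch of the corrected comparison described above, keeping the all-ones counts from the snippet under review and assuming testthat is attached for expect_identical:

ct <- crosstab(df, "a", "b")
ordered <- ct[order(ct$a_b), ]          # order by the column, not the string "a_b"
row.names(ordered) <- NULL              # drop row names so identical() can match
expected <- data.frame(a_b = c("a0", "a1", "a2"),
                       b0 = c(1, 1, 1),
                       b1 = c(1, 1, 1),
                       stringsAsFactors = FALSE)
expect_identical(expected, ordered)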

Contributor Author

I finally got the unit tests to run on my local machine. I upgraded R from 3.0.2 to 3.2.0; I don't know whether that was the cause. The unit tests should work now.

@mengxr mengxr changed the title [WIP][SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames [SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames Jul 18, 2015
@mengxr
Contributor Author

mengxr commented Jul 18, 2015

@shivaram This should be ready for review after Jenkins.

@SparkQA

SparkQA commented Jul 18, 2015

Test build #37688 has finished for PR 7318 at commit d75e894.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

row.names(ordered) <- NULL
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 0, 1), "b1" = c(1, 1, 0),
stringsAsFactors = FALSE, row.names = NULL)
expect_identical(expected, ordered)
Contributor

This is minor, but it might be good to have a test case where we get NULL to just make sure that code path works correctly.

Contributor Author

I don't quite get what you mean.

Contributor

So the help message says that pairs with no occurrences will have null as their counts. I just wanted to make sure that case works correctly.

Contributor Author

Ah, we need to update the doc. The behavior changed in 6396cc0: we now output 0 instead of null. I will send a follow-up PR for this because it also touches Scala and Python code.

I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-9243
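A hedged illustration of the post-6396cc0 behavior, reusing the counts from the final test snippet in this thread:

ct <- crosstab(df, "a", "b")            # local R data.frame
ordered <- ct[order(ct$a_b), ]          # Spark does not guarantee row order
# Expected contents (mirroring the final test): the (a1, b0) and (a2, b1)
# pairs never co-occur, so those cells are 0 rather than null.
#   a_b  b0  b1
#   a0    1   1
#   a1    0   1
#   a2    1   0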

Contributor

Ah okay - sounds good. LGTM

@shivaram
Contributor

@mengxr One question about the function naming convention: what are your thoughts on using table for this use case and some other keyword like sqlTable for the SQL use case? This is one of those cases where we need to choose between R function names and Spark SQL function names.

@mengxr
Contributor Author

mengxr commented Jul 21, 2015

I don't have a strong preference about the name. We use crosstab in Scala/Python because table is already taken, and we shouldn't overload table for both loading SQL tables and computing the contingency table. Changing table to sqlTable would be a bigger change than calling R's table crosstab.

@shivaram
Contributor

Hmm, okay. Let's leave it as crosstab in this PR; before the release I'll do one more pass over the API, and we can revisit this if required. Other than the minor unit test comment, this looks good to me.

@shivaram
Contributor

@mengxr I guess we will have a new PR for the documentation update, so this PR LGTM. I will merge this unless you have anything else to add.

@mengxr
Contributor Author

mengxr commented Jul 23, 2015

Yes, please merge it. Thanks for reviewing!

@asfgit asfgit closed this in 2f5cbd8 Jul 23, 2015