-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8364][SPARKR] Add crosstab to SparkR DataFrames #7318
Conversation
Test build #36912 has finished for PR 7318 at commit
|
ct <- crosstab(df, "a", "b") | ||
ordered <- ct[order("a_b"),] | ||
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1)) | ||
assert_true(identical(expected, ordered)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expect_identical(expected, ordered)
Test build #36946 has finished for PR 7318 at commit
|
Jents! slow test please |
Test build #4 has finished for PR 7318 at commit
|
df <- toDF(rdd, list("a", "b")) | ||
ct <- crosstab(df, "a", "b") | ||
ordered <- ct[order("a_b"),] | ||
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 1, 1), "b1" = c(1, 1, 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I figured out what is going on here -- The expected
data.frame creates the column a_b
as factors. You can pass in stringsAsFactors=F
to the data.frame constructor to avoid this.
Also I think the order
command above should probably be ordered <- ct[order(ct$a_b),]
to get all the rows back out of it. It still associates row names which makes it not identical (you can do row.names(ordered) <- NULL
).
A simpler check might be to just get one or two rows from ct
and then compare them with expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally I made the unit test ran on my local. I upgraded R to 3.2.0 from 3.0.2. I don't know whether this is the cause or not. The unit test should work now.
@shivaram This should be ready for review after Jenkins. |
Test build #37688 has finished for PR 7318 at commit
|
row.names(ordered) <- NULL | ||
expected <- data.frame("a_b" = c("a0", "a1", "a2"), "b0" = c(1, 0, 1), "b1" = c(1, 1, 0), | ||
stringsAsFactors = FALSE, row.names = NULL) | ||
expect_identical(expected, ordered) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is minor, but it might be good to have a test case where we get NULL
to just make sure that code path works correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get what you mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the help message says that Pairs that have no occurrences will have null as their counts
. I just wanted to make sure that case worked correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we need to update the doc. The behavior changes in 6396cc0. We output 0 instead of null. I will send a follow-up PR for this because it also touches Scala and Python code.
I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-9243
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay - sounds good. LGTM
@mengxr One question about the function naming convention -- What are your thoughts on using |
I don't have strong preference about the name. We use |
Hmm okay. Lets leave it as |
@mengxr I guess we will have a new PR for the documentation update, so this PR LGTM. I will merge this unless you have anything else to add |
Yes, please merge it. Thanks for reviewing! |
Add
crosstab
to SparkR DataFrames, which takes two column names and returns a local R data.frame. This is similar totable
in R. However,table
in SparkR is used for loading SQL tables as DataFrames. The return type is data.frame instead table forcrosstab
to be compatible with Scala/Python.I couldn't run R tests successfully on my local. Many unit tests failed. So let's try Jenkins.