SPARK-5270 [CORE] Elegantly check if RDD is empty #4074
Test build #25667 has started for PR 4074 at commit
(Oh, of course, if this looks good I can add this to Java / Python too.)
Test build #25667 has finished for PR 4074 at commit
Test PASSed.
LGTM. What is the use case? Is this part of a bigger PR?
This is all there is to it. It's just a convenience method that implements the check efficiently. Given several questions on the list, it seems that people do want to test for an empty RDD, and there hasn't been an accepted way to do it that is faster than http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
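To illustrate why a dedicated check can be cheaper than counting, here is a minimal sketch in plain Scala, without Spark. It models a dataset as a sequence of partitions, each opened lazily as an iterator. `PartitionedData`, `isEmptyByCount`, and `isEmptyByFirst` are hypothetical names made up for this illustration, not Spark APIs, and the sketch assumes that an efficient check only needs to find one element rather than read every record.

```scala
// Hypothetical model, not Spark's API: a dataset is a list of partitions,
// and each partition hands out a fresh iterator when it is opened.
final case class PartitionedData[T](partitions: Seq[() => Iterator[T]]) {

  // Naive check: walks every element of every partition, then compares the total to zero.
  def isEmptyByCount: Boolean =
    partitions.map(open => open().foldLeft(0L)((n, _) => n + 1)).sum == 0L

  // Short-circuiting check: forall stops at the first partition that has an element,
  // and hasNext does not consume the rest of that partition.
  def isEmptyByFirst: Boolean =
    partitions.forall(open => !open().hasNext)
}

object IsEmptySketch {
  def main(args: Array[String]): Unit = {
    val empty  = PartitionedData[Int](Seq(() => Iterator.empty, () => Iterator.empty))
    val sparse = PartitionedData[Int](Seq(() => Iterator.empty, () => Iterator(1, 2, 3)))

    println(empty.isEmptyByFirst)                           // true
    println(sparse.isEmptyByFirst)                          // false
    println(sparse.isEmptyByCount == sparse.isEmptyByFirst) // true: same answer, less work
  }
}
```

Both checks agree on the result. In this model the difference is only in how much data is touched before the answer is known.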
Seems reasonable to have since it's non-obvious how to do it - @srowen could you add this in Java and Python?
Test build #25682 has started for PR 4074 at commit
Test build #25682 has finished for PR 4074 at commit
Test FAILed.
Jenkins, retest this please.
Test build #25701 has started for PR 4074 at commit
Test build #25701 has finished for PR 4074 at commit
Test FAILed.
test("isEmpty") {
  assert(sc.emptyRDD.isEmpty())
  assert(sc.parallelize(Seq[Int]()).isEmpty())
  assert(!sc.parallelize(Seq(1)).isEmpty())
I don't think this tests the case where there are multiple partitions but no data in any of the partitions. Maybe add something like
assert(sc.parallelize(Seq(1,2,3), 3).filter(_ < 0).isEmpty())
I think the sc.parallelize(Seq[Int]()) case actually has multiple partitions, but I'll add this too. Also, I'll check the case where the first partition is empty but others aren't.
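The partition layouts discussed in this thread (several partitions with no data, everything filtered out, and an empty first partition followed by data) can be sketched in plain Scala without Spark. The `slice` helper below is a hypothetical, simplified model of splitting a collection into a fixed number of partitions. It assumes even slicing by index and is not taken from Spark's source. `PartitionLayouts` and `isEmpty` are likewise illustrative names.

```scala
object PartitionLayouts {

  // Hypothetical even slicing: always returns numSlices partitions,
  // so an empty input still yields numSlices (empty) partitions.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = (i.toLong * data.length / numSlices).toInt
      val end   = ((i + 1).toLong * data.length / numSlices).toInt
      data.slice(start, end)
    }
  }

  // The dataset is empty only if every partition is empty;
  // forall stops at the first non-empty partition.
  def isEmpty[T](partitions: Seq[Seq[T]]): Boolean =
    partitions.forall(_.isEmpty)

  def main(args: Array[String]): Unit = {
    // Empty input, several partitions.
    val noData = slice(Seq.empty[Int], 4)
    println(noData.length)    // 4
    println(isEmpty(noData))  // true

    // Data in every partition, all of it filtered out.
    val allFiltered = slice(Seq(1, 2, 3), 3).map(_.filter(_ < 0))
    println(isEmpty(allFiltered)) // true

    // First partition ends up empty, later ones do not.
    val firstEmpty = slice(Seq(1, 2, 3), 3).map(_.filter(_ > 1))
    println(firstEmpty.head.isEmpty) // true
    println(isEmpty(firstEmpty))     // false
  }
}
```

The last case is the one a check that inspects only the first partition would get wrong, which is why it is worth a dedicated test.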
Test build #25730 has started for PR 4074 at commit
Test build #25730 has finished for PR 4074 at commit
Test PASSed.
Test build #25731 has started for PR 4074 at commit
Test build #25731 has finished for PR 4074 at commit
Test PASSed.
LGTM @srowen - are you still working on it or is it good from your end? Will leave a bit of time for others to comment as well.
@pwendell No more changes from my side.
@srowen Thanks Sean, I committed this with a minor re-word of the title.
Pretty minor, but submitted for consideration -- this would at least help people make this check in the most efficient way I know.

Author: Sean Owen <sowen@cloudera.com>

Closes apache#4074 from srowen/SPARK-5270 and squashes the following commits:

66885b8 [Sean Owen] Add note that JavaRDDLike should not be implemented by user code
2e9b490 [Sean Owen] More tests, and Mima-exclude the new isEmpty method in JavaRDDLike
28395ff [Sean Owen] Add isEmpty to Java, Python
7dd04b7 [Sean Owen] Add efficient RDD.isEmpty()