[SPARK-5744] [CORE] Making RDD.isEmpty robust to empty partitions #4534
RDD.isEmpty fails when an RDD contains empty partitions.
FYI: The method was introduced in #4074
Can one of the admins verify this patch?
Thanks for doing this, but the title of this PR isn't sufficient. It will become the commit log message, so please update the PR title to adequately describe what you did so that other developers don't have to look into the details of the commit or look up the JIRA issue just to get an idea of what this PR is about.
*/
- def isEmpty(): Boolean = partitions.length == 0 || take(1).length == 0
+ def isEmpty(): Boolean = partitions.length == 0 || mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_)
I'll try the test case, sure, to investigate. The case of an empty partition should be handled already by take(), so I don't think that's it per se. (I'm worried about this logic since it will touch every partition, and the point was to not do so. The 2 changes before this line aren't necessary.) The exception looks more like funny business in handling Seq() (i.e. type Any) somewhere along the line. I'll look.
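The concern about touching every partition can be sketched outside Spark. The following is a minimal local model, not Spark code: each inner List stands in for one partition, and the names IsEmptySketch, scanAll and stopEarly are hypothetical. It only counts how many partitions each strategy inspects, and it simplifies take(1) to "stop at the first partition that yields an element".

```scala
// Local model only: each inner List stands in for one RDD partition.
object IsEmptySketch {
  // Mirrors the proposed change: evaluate every partition, then AND the results.
  // Returns (isEmpty, number of partitions visited).
  def scanAll[T](parts: List[List[T]]): (Boolean, Int) = {
    var visited = 0
    val flags = parts.map { p => visited += 1; !p.iterator.hasNext }
    (parts.isEmpty || flags.reduce(_ && _), visited)
  }

  // Simplified stand-in for the take(1) approach: stop at the first
  // partition that yields an element.
  def stopEarly[T](parts: List[List[T]]): (Boolean, Int) = {
    var visited = 0
    val found = parts.iterator.exists { p => visited += 1; p.nonEmpty }
    (!found, visited)
  }
}
```

For List(List(1, 2), List.empty[Int], List(3)), scanAll visits all three partitions while stopEarly visits only the first; both report the data as non-empty.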
You might have a point, that this is actually a bug in take(). sc.parallelize(Seq(1,2,3),1).take(1337) works fine and returns Array[Int] = Array(1, 2, 3). sc.parallelize(Seq(),1).take(1) fails on the other hand.
I am getting more convinced that this is a bug in take. It is provoked by RDD[Nothing].take(1). Take for example these three commands:
sc.parallelize(Seq[Nothing]()).take(1) // Fails
sc.parallelize(Seq[Any]()).take(1) // Array[Any] = Array()
sc.parallelize(Seq[Int]()).take(1) // Array[Int] = Array()
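The three commands differ only in the element type the compiler works with. As a rough illustration, assuming Scala 2 and using a hypothetical helper elementTag (not part of Spark), this sketch reports which ClassTag is supplied for each collection. It does not reproduce the Spark failure itself; it only shows that an unannotated Seq() is treated as Seq[Nothing].

```scala
import scala.reflect.{ClassTag, classTag}

object ElementTypeSketch {
  // Hypothetical helper: returns the element ClassTag the compiler supplies,
  // the same kind of implicit evidence that sc.parallelize asks for.
  def elementTag[T: ClassTag](xs: Seq[T]): ClassTag[T] = classTag[T]

  def main(args: Array[String]): Unit = {
    println(elementTag(Seq()))       // element type inferred as Nothing
    println(elementTag(Seq[Any]()))  // element type Any
    println(elementTag(Seq[Int]()))  // element type Int
  }
}
```

Annotating the collection, as in Seq[Int](), is what keeps the element type from being inferred as Nothing.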
@tbertelsen Better, but you still should include SPARK-5744 and add [CORE] to the PR title.
Sorry. Is it good now?
perfect
This works, so it's not quite empty partitions:
This also creates an exception, so it's to do with
I think the problem roughly boils down to this behavior in Scala:
The problem is the |
I think I have a solution. A
Witness:
It deserves a unit test for both of these and a run through all the tests. Want to try that? |
How do we treat By the way |
But there is actually still the same problem with the |
Hmph. I think we might could work around this if All of these are fairly artificial cases. An empty RDD or partition of a normal type works fine. I think it's worth making the |
It compiles if you include
I think you're right about this being fairly artificial. One solution could be simply not to accept Nothing as the type of an RDD, and to fail fast and predictably like this:
We must however be careful. Right now If we just want to provide a good error message, we could also just catch the ArrayStoreException and wrap it in an exception with a more descriptive message, perhaps at the place where it is already wrapped in the SparkDriverExecutionException.
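The snippet originally attached to this comment is not preserved here. Purely as an illustration of the two ideas it describes, and not as the code that was proposed, the sketch below assumes Scala 2 and uses hypothetical names (FailFastSketch, checked, describeArrayStore); none of them are Spark API.

```scala
import scala.reflect.{ClassTag, classTag}

object FailFastSketch {
  // Idea 1 (hypothetical): reject element type Nothing up front,
  // with a message that tells the caller how to fix it.
  def checked[T: ClassTag](xs: Seq[T]): Seq[T] = {
    require(classTag[T] != ClassTag.Nothing,
      "Element type was inferred as Nothing; annotate it, e.g. Seq[Int]()")
    xs
  }

  // Idea 2 (hypothetical): let the operation run, but rewrap an
  // ArrayStoreException in an exception with a more descriptive message.
  def describeArrayStore[A](body: => A): A =
    try body
    catch {
      case e: ArrayStoreException =>
        throw new IllegalStateException(
          "Array store failed; the element type may have been inferred as Nothing", e)
    }
}
```

The first variant fails before any work is done, while the second only changes the message of a failure that has already happened.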
The case of |
@tbertelsen Here's my proposed fix: srowen@2390a3f I discovered along the way that histogram() doesn't support an RDD with 0 partitions, so included that fix. |
It looks like it fixes the issue with If you manage to create a non-empty collection of Handling In summary: your fix seems like a good solution. I'll close this PR. |
OK, I can merge my PR. I would be fine with you adding this in to yours, with whatever additions you want to. That is, I didn't intend to hijack your change, and can credit you in JIRA. |