-
Notifications
You must be signed in to change notification settings - Fork 29k
SPARK-5500. Document that feeding hadoopFile into a shuffle operation wi... #4293
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #26422 has started for PR 4293 at commit
|
|
Test build #26422 has finished for PR 4293 at commit
|
|
Test PASSed. |
|
This is good to put. One idea that just came to my mind is ... why don't the downstream operators inspect whether they need to do copys or not? |
|
Even if we can't fix this problem, I wonder if we could override some of the HadoopRDD methods to log warnings when they're called (e.g. |
78ba008 to
6e1932a
Compare
|
Test build #26446 has started for PR 4293 at commit
|
|
Test build #26446 has finished for PR 4293 at commit
|
|
Test PASSed. |
|
@rxin @JoshRosen I like both of those ideas. Updated patch implements Josh's . Reynold's is a little more involved, but would be good to implement down the line as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throwing an exception is a bit extreme and will break existing code. How about just a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1; I had a recent PR that added some similar errors / warnings for common user mistakes, but I only raised exceptions for branches of the code that threw errors (e.g. NPE) 100% of the time. Let's make this into a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the error message might include the workaround; right now, it seems somewhat final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any conceivable situations where existing code that's doing this wouldn't have a bug?
On Jan 30, 2015, at 6:59 PM, Reynold Xin notifications@github.com wrote:
In core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:
@@ -308,6 +309,14 @@ class HadoopRDD[K, V](
// Do nothing. Hadoop RDD should not be checkpointed.
}
- override def persist(storageLevel: StorageLevel): this.type = {
- if (storageLevel.deserialized) {
throwing an exception is a bit extreme and will break existing code. How about just a warning?throw new SparkException("Can't cache HadoopRDDs as deserialized objects because Hadoop's" +—
Reply to this email directly or view it on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... maybe this is more akin to the "arrays shouldn't be used as keys when partitioning RDDs with HashPartitioner", which is virtually guaranteed to lead to a wrong answer.
|
Test build #26522 has started for PR 4293 at commit
|
|
Updated patch adds instructions on how to avoid the exception and extends behavior to My opinion is still that this deserves an Exception rather than a warning. It's not a groupByKey kind of situation where there is a more performant choice - users that cache RDDs in this way face fundamental correctness issues. If I'm missing legitimate uses for this of course I take it back. |
|
Test build #26522 has finished for PR 4293 at commit
|
|
Test FAILed. |
|
retest this please |
|
Test build #26525 has started for PR 4293 at commit
|
|
Test build #26525 has finished for PR 4293 at commit
|
|
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the import here
|
@sryza the fact that we have a test case failing I think is enough of a reason to not do something too drastic here. How about just changing it to logWarning for now? |
|
Looks like the test failures are legit |
|
Test build #26538 has started for PR 4293 at commit
|
|
Ah, so the failing tests expose the possibility that someone could write their own It might also be worthwhile to suppress the warning for |
|
Test build #26538 has finished for PR 4293 at commit
|
|
Test PASSed. |
|
I'm merging this in master. Thanks Sandy. |
...ll cause problems