SPARK-5500. Document that feeding hadoopFile into a shuffle operation wi... #4293

sryza · 2015-01-30T20:40:49Z

...ll cause problems

SparkQA · 2015-01-30T20:42:45Z

Test build #26422 has started for PR 4293 at commit 78ba008.

This patch merges cleanly.

SparkQA · 2015-01-30T21:52:35Z

Test build #26422 has finished for PR 4293 at commit 78ba008.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-30T21:52:39Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26422/

rxin · 2015-01-30T22:55:52Z

This is good to add. One idea that just came to mind: why don't the downstream operators inspect whether they need to make copies or not?

JoshRosen · 2015-01-30T23:17:09Z

Even if we can't fix this problem, I wonder if we could override some of the HadoopRDD methods to log warnings when they're called (e.g. cache).

SparkQA · 2015-01-31T01:17:47Z

Test build #26446 has started for PR 4293 at commit 6e1932a.

This patch merges cleanly.

SparkQA · 2015-01-31T02:28:46Z

Test build #26446 has finished for PR 4293 at commit 6e1932a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-31T02:28:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26446/

sryza · 2015-01-31T02:43:05Z

@rxin @JoshRosen I like both of those ideas. The updated patch implements Josh's. Reynold's is a little more involved, but would be good to implement down the line as well.

rxin · 2015-01-31T02:59:15Z

core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala

throwing an exception is a bit extreme and will break existing code. How about just a warning?

JoshRosen · 2015-01-31

+1; I had a recent PR that added some similar errors / warnings for common user mistakes, but I only raised exceptions for branches of the code that threw errors (e.g. NPE) 100% of the time. Let's make this into a warning.

aarondav · 2015-01-31

Also, the error message might include the workaround; right now, it seems somewhat final.

sryza · 2015-01-31

Are there any conceivable situations where existing code that's doing this wouldn't have a bug?

On Jan 30, 2015, at 6:59 PM, Reynold Xin notifications@github.com wrote:

> In core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:
>
>     @@ -308,6 +309,14 @@ class HadoopRDD[K, V](
>         // Do nothing. Hadoop RDD should not be checkpointed.
>       }
>
>       override def persist(storageLevel: StorageLevel): this.type = {
>         if (storageLevel.deserialized) {
>           throw new SparkException("Can't cache HadoopRDDs as deserialized objects because Hadoop's" +
>
> throwing an exception is a bit extreme and will break existing code. How about just a warning?

JoshRosen · 2015-01-31

Hmm... maybe this is more akin to the "arrays shouldn't be used as keys when partitioning RDDs with HashPartitioner" case, which is virtually guaranteed to lead to a wrong answer.
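
To make the array-key analogy concrete: JVM arrays use reference-based equality and hashing, so two arrays with identical contents are not equal as keys, and a hash-based partitioner has no reason to route them to the same partition. The following is a minimal sketch in plain Scala with no Spark dependency; `ArrayKeyDemo` and `bucket` are hypothetical names used for illustration and are not Spark's HashPartitioner implementation.

```scala
object ArrayKeyDemo {
  // Hypothetical stand-in for hash partitioning:
  // a non-negative hashCode modulo the number of partitions.
  def bucket(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)

    // Arrays compare by reference, so equal contents do not make equal keys.
    println(a == b)             // false
    println(a.sameElements(b))  // true

    // Scala sequences hash by content, so equal keys land in the same bucket.
    println(bucket(a.toSeq, 8) == bucket(b.toSeq, 8))  // true
  }
}
```

Converting array keys to a content-hashed collection before partitioning is one way to sidestep the problem in this simplified model.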

SparkQA · 2015-02-02T18:22:52Z

Test build #26522 has started for PR 4293 at commit cc46e52.

This patch merges cleanly.

sryza · 2015-02-02T18:27:48Z

Updated patch adds instructions on how to avoid the exception and extends behavior to NewHadoopRDD.

My opinion is still that this deserves an exception rather than a warning. It's not a groupByKey kind of situation where there is a more performant choice; users who cache RDDs in this way face fundamental correctness issues. If I'm missing legitimate uses for this, of course I take it back.
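
The correctness issue can be illustrated without Spark. Hadoop RecordReaders typically hand back the same mutable object for every record, so anything that holds on to references (as deserialized caching does) ends up with every element pointing at the last record read. The sketch below is a simplified model under that assumption; `MutableText` and `reusingReader` are hypothetical stand-ins for a Writable and a RecordReader, not Hadoop or Spark classes.

```scala
object ReuseDemo {
  // Hypothetical stand-in for a mutable Hadoop Writable such as Text.
  final class MutableText(var value: String) {
    override def toString: String = value
  }

  // Hypothetical reader that reuses a single object for every record.
  def reusingReader(lines: Seq[String]): Iterator[MutableText] = {
    val shared = new MutableText("")
    lines.iterator.map { line =>
      shared.value = line
      shared
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("a", "b", "c")

    // Buffering the references directly: every slot is the same object.
    val cachedRefs = reusingReader(lines).toVector
    println(cachedRefs.map(_.toString))  // Vector(c, c, c)

    // Copying each record out before buffering preserves the data.
    val cachedCopies = reusingReader(lines).map(_.value).toVector
    println(cachedCopies)                // Vector(a, b, c)
  }
}
```

The analogous workaround in Spark would be a `map` that copies each record into an immutable value before caching.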

SparkQA · 2015-02-02T19:16:33Z

Test build #26522 has finished for PR 4293 at commit cc46e52.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-02T19:16:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26522/

sryza · 2015-02-02T19:27:50Z

retest this please

SparkQA · 2015-02-02T19:32:56Z

Test build #26525 has started for PR 4293 at commit cc46e52.

This patch merges cleanly.

SparkQA · 2015-02-02T20:25:17Z

Test build #26525 has finished for PR 4293 at commit cc46e52.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-02T20:25:21Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26525/

rxin · 2015-02-02T20:46:27Z

core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala

remove the import here

rxin · 2015-02-02T20:47:25Z

@sryza I think the fact that a test case is failing is reason enough not to do anything too drastic here. How about just changing it to logWarning for now?

sryza · 2015-02-02T20:47:31Z

Looks like the test failures are legit

SparkQA · 2015-02-02T21:12:35Z

Test build #26538 has started for PR 4293 at commit e9ce742.

This patch merges cleanly.

sryza · 2015-02-02T21:13:26Z

Ah, so the failing tests expose the possibility that someone could write their own InputFormat that doesn't reuse a single object, and caching that would be fine. So changing it to logWarning sounds fine to me.
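
The warn-instead-of-throw behaviour can be sketched in plain Scala. This is a simplified model, not the code in the patch: `Level`, `BaseDataset`, `ReusingDataset`, and the `warn` callback are hypothetical stand-ins for StorageLevel, RDD, HadoopRDD, and logWarning, and the warning text is illustrative only.

```scala
object PersistWarningDemo {
  // Simplified stand-in for a storage level; only the flag discussed here is modeled.
  final case class Level(deserialized: Boolean)

  // Hypothetical base class standing in for RDD.
  class BaseDataset {
    var level: Option[Level] = None
    def persist(l: Level): this.type = { level = Some(l); this }
  }

  // Hypothetical subclass standing in for HadoopRDD: warn, then persist anyway,
  // so input formats that do not reuse objects keep working.
  class ReusingDataset(warn: String => Unit) extends BaseDataset {
    override def persist(l: Level): this.type = {
      if (l.deserialized) {
        warn("Caching as deserialized objects may be unsafe if the reader reuses " +
          "one object per record; copy records with map first.")
      }
      super.persist(l)
    }
  }

  def main(args: Array[String]): Unit = {
    val warnings = scala.collection.mutable.Buffer.empty[String]
    val ds = new ReusingDataset(msg => { warnings += msg; () })
    ds.persist(Level(deserialized = true))
    println(warnings.size)  // 1
    println(ds.level)       // Some(Level(true))
  }
}
```

Because the call still falls through to the base `persist`, existing code keeps running and only gains a log message, which is the trade-off argued for in this thread.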

It might also be worthwhile to suppress the warning for BinaryFilesRDD and WholeTextFileRDD, though it would look fairly weird, because persist would need to be overridden in these classes to call the grandparent class's version of the method. So I can add that if y'all prefer but will leave it out otherwise.

SparkQA · 2015-02-02T22:20:47Z

Test build #26538 has finished for PR 4293 at commit e9ce742.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-02T22:20:51Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26538/

rxin · 2015-02-02T22:52:26Z

I'm merging this into master. Thanks Sandy.

sryza added 2 commits January 30, 2015 17:11

SPARK-5500. Document that feeding hadoopFile into a shuffle operation will cause problems (0f6c4eb)
Throw exception on cache (6e1932a)

sryza force-pushed the sandy-spark-5500 branch from 78ba008 to 6e1932a (January 31, 2015 01:12)

rxin reviewed Jan 31, 2015

Add instructions and extend to NewHadoopRDD (cc46e52)

rxin reviewed Feb 2, 2015

Change to warning (e9ce742)

asfgit closed this in 8309349 on Feb 2, 2015

6 participants