@sryza
Contributor

@sryza sryza commented Jan 30, 2015

...ll cause problems

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26422 has started for PR 4293 at commit 78ba008.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26422 has finished for PR 4293 at commit 78ba008.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26422/

@rxin
Contributor

rxin commented Jan 30, 2015

This is good to put in. One idea that just came to mind: why don't the downstream operators inspect whether they need to make copies or not?

@JoshRosen
Contributor

Even if we can't fix this problem, I wonder if we could override some of the HadoopRDD methods to log warnings when they're called (e.g. cache).
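
A rough sketch of what a warn-on-persist override could look like. `Level`, `BaseDataset`, and `ReusingDataset` are hypothetical stand-ins invented for illustration, not Spark's `StorageLevel`, `RDD`, or `HadoopRDD`, and the warning is collected in a buffer rather than sent through a real logger:

```scala
import scala.collection.mutable.ListBuffer

object WarnOnPersistSketch {
  // Hypothetical stand-in for a storage level; only the flag we care about.
  final case class Level(deserialized: Boolean)

  // Hypothetical base class with a chainable persist, similar in shape to an RDD.
  class BaseDataset {
    private var current: Option[Level] = None
    def persist(level: Level): this.type = { current = Some(level); this }
    def storageLevel: Option[Level] = current
  }

  // Subclass whose records are assumed to come from a reader that reuses objects.
  class ReusingDataset extends BaseDataset {
    val warnings: ListBuffer[String] = ListBuffer.empty

    override def persist(level: Level): this.type = {
      if (level.deserialized) {
        // Warn instead of throwing, so existing callers keep working.
        warnings += "Caching deserialized records from a reader that reuses objects " +
          "may store many references to one instance; copy the records with map first."
      }
      super.persist(level)
    }
  }
}
```

Persisting with `Level(deserialized = true)` records one warning and still sets the storage level; a serialized level records none.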

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26446 has started for PR 4293 at commit 6e1932a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26446 has finished for PR 4293 at commit 6e1932a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26446/

@sryza
Contributor Author

sryza commented Jan 31, 2015

@rxin @JoshRosen I like both of those ideas. The updated patch implements Josh's. Reynold's is a little more involved, but would be good to implement down the line as well.

Contributor

Throwing an exception is a bit extreme and will break existing code. How about just a warning?

Contributor

+1; I had a recent PR that added some similar errors / warnings for common user mistakes, but I only raised exceptions for branches of the code that threw errors (e.g. NPE) 100% of the time. Let's make this into a warning.

Contributor

Also the error message might include the workaround; right now, it seems somewhat final.

Contributor Author

Are there any conceivable situations where existing code that's doing this wouldn't have a bug?

On Jan 30, 2015, at 6:59 PM, Reynold Xin notifications@github.com wrote:

> In core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala:
>
> @@ -308,6 +309,14 @@ class HadoopRDD[K, V](
>      // Do nothing. Hadoop RDD should not be checkpointed.
>    }
>
> +  override def persist(storageLevel: StorageLevel): this.type = {
> +    if (storageLevel.deserialized) {
> +      throw new SparkException("Can't cache HadoopRDDs as deserialized objects because Hadoop's" +
>
> throwing an exception is a bit extreme and will break existing code. How about just a warning?

Contributor

Hmm... maybe this is more akin to the "arrays shouldn't be used as keys when partitioning RDDs with HashPartitioner" case, which is virtually guaranteed to lead to a wrong answer.
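
For context on why array keys go wrong, here is a small sketch in plain Scala collections (no Spark involved; the object and value names are made up for illustration). JVM arrays compare and hash by reference, so two arrays with the same contents are treated as different keys by anything hash-based:

```scala
object ArrayKeySketch {
  val first: Array[Int] = Array(1, 2, 3)
  val second: Array[Int] = Array(1, 2, 3)

  // The contents match element by element...
  val sameContents: Boolean = first.sameElements(second)

  // ...but equals on arrays is reference equality, so they are not equal as keys.
  val equalAsKeys: Boolean = first.equals(second)

  // Grouping by the raw arrays keeps the two look-alike keys in separate groups.
  val groupsByArray: Int =
    Seq(first -> "x", second -> "y").groupBy(_._1).size

  // Converting each key to an immutable Seq groups by content instead.
  val groupsBySeq: Int =
    Seq(first.toSeq -> "x", second.toSeq -> "y").groupBy(_._1).size
}
```

Here `groupsByArray` is 2 while `groupsBySeq` is 1, which is the local analogue of records with equal array keys not meeting in the same group.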

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26522 has started for PR 4293 at commit cc46e52.

  • This patch merges cleanly.

@sryza
Contributor Author

sryza commented Feb 2, 2015

Updated patch adds instructions on how to avoid the exception and extends behavior to NewHadoopRDD.

My opinion is still that this deserves an exception rather than a warning. It's not a groupByKey kind of situation where there is merely a more performant choice; users who cache RDDs in this way face fundamental correctness issues. If I'm missing legitimate uses for this, of course, I take it back.
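
To make the correctness issue concrete, here is a minimal sketch that needs neither Spark nor Hadoop. `MutableText` and `reusingReader` are hypothetical stand-ins for a Hadoop `Writable` and a record reader that hands back the same instance for every record; retaining the references (which is roughly what caching deserialized objects amounts to) leaves every entry showing the last record read:

```scala
object ReuseSketch {
  // Hypothetical stand-in for a mutable Writable such as Text.
  final class MutableText(var value: String)

  // Hypothetical reader: mutates and returns one shared instance per record.
  def reusingReader(lines: Seq[String]): Iterator[MutableText] = {
    val shared = new MutableText("")
    lines.iterator.map { line =>
      shared.value = line
      shared
    }
  }

  val input: Seq[String] = Seq("a", "b", "c")

  // Retain the references first, read the values afterwards:
  // all three entries point at the same object, which now holds "c".
  val retainedRefs: List[String] = reusingReader(input).toList.map(_.value)

  // Copy each value out as it is produced, then retain the copies.
  val copiedValues: List[String] = reusingReader(input).map(_.value).toList
}
```

`retainedRefs` comes out as `List("c", "c", "c")`, while `copiedValues` is `List("a", "b", "c")`. Copying the data out in a `map` step before caching is the usual way to avoid the problem.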

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26522 has finished for PR 4293 at commit cc46e52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26522/

@sryza
Contributor Author

sryza commented Feb 2, 2015

retest this please

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26525 has started for PR 4293 at commit cc46e52.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26525 has finished for PR 4293 at commit cc46e52.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26525/

Contributor

remove the import here

@rxin
Contributor

rxin commented Feb 2, 2015

@sryza I think the fact that we have a test case failing is enough of a reason not to do something too drastic here. How about just changing it to logWarning for now?

@sryza
Contributor Author

sryza commented Feb 2, 2015

Looks like the test failures are legit

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26538 has started for PR 4293 at commit e9ce742.

  • This patch merges cleanly.

@sryza
Contributor Author

sryza commented Feb 2, 2015

Ah, so the failing tests expose the possibility that someone could write their own InputFormat that doesn't reuse a single object, and caching that would be fine. So changing it to logWarning sounds fine to me.

It might also be worthwhile to suppress the warning for BinaryFilesRDD and WholeTextFileRDD, though it would look fairly weird, because persist would need to be overridden in these classes to call the grandparent class's version of the method. So I can add that if y'all prefer, but will leave it out otherwise.
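
As an illustration of the safe case mentioned above, here is a sketch of a reader that allocates a new object per record. `Record` and `freshReader` are hypothetical names, not part of Spark or Hadoop. Because no instance is shared, holding on to the references does not corrupt anything:

```scala
object FreshRecordsSketch {
  // Hypothetical mutable record type, comparable to a Writable.
  final class Record(var value: String)

  // Hypothetical reader that allocates a new Record for every line.
  def freshReader(lines: Seq[String]): Iterator[Record] =
    lines.iterator.map(line => new Record(line))

  // Retain the references, as caching deserialized objects would.
  val retained: List[Record] = freshReader(Seq("a", "b", "c")).toList

  // Each retained record still holds its own value.
  val values: List[String] = retained.map(_.value)

  // Neighbouring entries are different instances (eq is reference equality).
  val neighboursDistinct: Boolean =
    retained.zip(retained.tail).forall { case (x, y) => !(x eq y) }
}
```

With such a reader the retained values stay `List("a", "b", "c")`, which is why an unconditional exception would reject code that is actually correct.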

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26538 has finished for PR 4293 at commit e9ce742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26538/

@rxin
Contributor

rxin commented Feb 2, 2015

I'm merging this into master. Thanks Sandy.

@asfgit asfgit closed this in 8309349 Feb 2, 2015