[SPARK-2251] fix concurrency issues in random sampler #1229
Conversation
@@ -54,17 +54,17 @@ trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable
 */
@DeveloperApi
class BernoulliSampler[T](lb: Double, ub: Double, complement: Boolean = false)
Could dropping this implicit break source and binary compatibility? I think we'd like to avoid asking people to make code changes to upgrade to a bug-fix release, even if the APIs are marked as developer. Can you just leave the existing argument and ignore it?
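The reviewer's suggestion above can be sketched as follows: keep the old constructor parameter so existing call sites still compile and link, but stop using it internally. This is a hypothetical, simplified class written for illustration, not Spark's actual `BernoulliSampler`.

```scala
// Sketch of "leave the existing argument and just ignore it" for
// source/binary compatibility. Names here are illustrative only.
class BernoulliSamplerSketch[T](
    lb: Double,
    ub: Double,
    complement: Boolean = false) // retained only for compatibility; no longer used
  extends Serializable {

  // The sampler's behavior now depends only on the [lb, ub) range.
  def acceptProbability: Double = ub - lb
}

object CompatDemo {
  def main(args: Array[String]): Unit = {
    // Old call sites that passed `complement` keep compiling unchanged.
    val s = new BernoulliSamplerSketch[Int](0.0, 0.1, complement = true)
    println(s.acceptProbability)
  }
}
```

The trade-off is a small amount of dead API surface in exchange for not forcing users to edit code when upgrading to a patch release.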
The following code is very likely to throw an exception:

~~~
val rdd = sc.parallelize(0 until 111, 10).sample(false, 0.1)
rdd.zip(rdd).count()
~~~

because the same random number generator is used in compute partitions.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#1229 from mengxr/fix-sample and squashes the following commits:

f1ee3d7 [Xiangrui Meng] fix concurrency issues in random sampler
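The failure mode described in the commit message can be illustrated without Spark. The sketch below uses a hypothetical `NaiveBernoulliSampler` (not Spark's class) that shares one RNG across computations, the way the pre-fix sampler shared its generator across recomputations of the same partition: computing the "same partition" twice yields different subsets, so `zip`-style alignment breaks. Giving each computation its own identically seeded RNG restores determinism.

```scala
import scala.util.Random

// Hypothetical simplified sampler mirroring the bug: one shared, stateful RNG.
class NaiveBernoulliSampler(fraction: Double, seed: Long) {
  private val rng = new Random(seed)
  def sample(items: Seq[Int]): Seq[Int] =
    items.filter(_ => rng.nextDouble() < fraction)
}

object SamplerDemo {
  def main(args: Array[String]): Unit = {
    val data = 0 until 100

    // Bug: the shared RNG's state advances between the two computations,
    // so recomputing the same input produces a different sample.
    val shared = new NaiveBernoulliSampler(0.1, seed = 42L)
    println(shared.sample(data) == shared.sample(data)) // almost surely false

    // Fix sketch: each computation gets its own sampler seeded the same way,
    // so the sample is reproducible across recomputations.
    val a = new NaiveBernoulliSampler(0.1, seed = 42L)
    val b = new NaiveBernoulliSampler(0.1, seed = 42L)
    println(a.sample(data) == b.sample(data)) // true
  }
}
```

This is why the fix clones and reseeds the sampler per partition rather than sharing one generator across tasks.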
Hi all, I'm getting a similar problem using k-means clustering with Spark 1.5.1. The stack trace is below. Any clue? Thank you in advance.
Some good links are:
My RDD comes from an HBase table, which is growing. When I suspend the row insertion, the problem doesn't happen. The RDD is cached, so should the problem still occur? Is there any way to "freeze" the RDD at some point to enable its use without trouble? Regards, PA
@pauloangelo Sounds like your RDD is not immutable then, in which case many bets are off. RDDs are generally expected to yield the same data whenever you compute them.