[SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev #13403
dongjoon-hyun wants to merge 10 commits into apache:master from dongjoon-hyun:SPARK-15660
|
Test build #59637 has finished for PR 13403 at commit
|
|
Hm, I think we probably don't want to change this behavior, so as not to surprise people ... |
|
Thank you for the review again, @rxin. Actually, I fully understand and expected your decision. My worry is that Spark will keep this inconsistency implicitly forever. As we know, if we do not do this in Spark 2.0, it can only happen in Spark 3.0, or maybe never, for the same reason. |
|
Hi, @rxin . |
|
Yeah I find this surprising too. As far as I know, databases tend to treat stddev as the sample stddev -- except Hive for some reason, AFAIK. I've never quite understood the theoretical motivation for that. Maybe the idea is that the aggregate is typically over some projection, or subset, of all data. But to me, the default should logically be population stdev, since there's no inherent reason to believe the result set is not the entire population of interest. For RDDs, it seems even clearer that the behavior should be population stddev. It does bear documentation for sure, but maybe not changing the behavior. |
|
Thank you for review, @srowen . |
|
Rebased. |
|
Test build #59793 has finished for PR 13403 at commit
|
|
Test build #59979 has finished for PR 13403 at commit
|
|
Test build #60026 has finished for PR 13403 at commit
|
|
Test build #60195 has finished for PR 13403 at commit
|
|
MLLIB |
|
Although we can not change old API, I think it's a good idea to add If everything in this PR is now allowed, what about just adding an explicit note on old http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter |
|
Test build #60496 has finished for PR 13403 at commit
|
|
Test build #60715 has finished for PR 13403 at commit
|
|
Test build #60716 has finished for PR 13403 at commit
|
Here is the only changed part of legacy code.
|
Thank you so much for your review, @srowen ! |
|
Test build #60771 has finished for PR 13403 at commit
|
|
CC @mengxr @jkbradley I'd kinda like to have this for 2.0.0 for completeness. |
|
Ping~ |
assertEquals(20/6.0, rdd.mean(), 0.01);
assertEquals(6.22222, rdd.variance(), 0.01);
assertEquals(rdd.variance(), rdd.popVariance(), 0.01);
I think these need to assert exact equality; there's no reason that they would ever be different at all.
assertEquals(rdd.variance(), rdd.popVariance());
assertEquals(7.46667, rdd.sampleVariance(), 0.01);
assertEquals(2.49444, rdd.stdev(), 0.01);
assertEquals(rdd.stdev(), rdd.popStdev(), 0.01);
This still asserts approximate equality, when it should be exactly equal.
Oops. Sorry, I'll review my code again!
I think it should be approximate equality but with a very small tolerance, e.g. 1e-14. Both calls trigger reduce jobs, and we don't have guarantees on the ordering of the reduce operations, which might lead to small numerical errors. Maybe it doesn't apply to the test case here, but it is still a best practice not to test strict equality of floating-point numbers.
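The ordering concern above can be illustrated with a minimal plain-Java sketch. It does not use Spark at all: the class and method names are hypothetical, and the left-to-right loop only stands in for one possible order in which a reduce might combine values. Summing the same three doubles in two different orders gives results that differ in the last bit, so a strict comparison fails while a 1e-14 tolerance passes.

```java
class ReduceOrderSketch {
    // Left-to-right sum; stands in for one possible reduce order (assumption).
    static double sum(double[] xs) {
        double acc = 0.0;
        for (double x : xs) {
            acc += x;
        }
        return acc;
    }

    // Tolerance-based comparison, in the spirit of the 1e-14 suggestion.
    static boolean approxEquals(double a, double b, double tol) {
        return Math.abs(a - b) < tol;
    }

    public static void main(String[] args) {
        double forward = sum(new double[] {0.1, 0.2, 0.3});
        double backward = sum(new double[] {0.3, 0.2, 0.1});
        // Same values, different order: strict equality does not hold.
        System.out.println("strict equal: " + (forward == backward));
        // A small tolerance absorbs the rounding difference.
        System.out.println("approx equal: " + approxEquals(forward, backward, 1e-14));
    }
}
```

Whether the two RDD calls in the test can actually diverge this way depends on how the values are combined, which this sketch does not model; it only shows why a tiny tolerance is the safer default.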
|
OK, I suppose it's either down then over, or over then down in the API. Either way is consistent with something. |
|
@dongjoon-hyun @srowen I made a comment just before @dongjoon-hyun updated the PR:
|
|
Hi, @srowen . |
|
Oh, thank you, @mengxr ! |
assert(abs(1.0 - rdd.stdev) < 0.01)
assert(abs(rdd.variance - rdd.popVariance) < 1e-14)
assert(abs(rdd.stdev - rdd.popStdev) < 1e-14)
assert(abs(2.0 - rdd.sampleVariance) < 0.01)
The tolerance should be smaller too.
|
Hi, @mengxr . |
|
Test build #60946 has finished for PR 13403 at commit
|
|
Test build #60938 has finished for PR 13403 at commit
|
|
Test build #60948 has finished for PR 13403 at commit
|
|
Test build #60950 has finished for PR 13403 at commit
|
|
|
/**
 * Compute the population standard deviation of this RDD's elements.
 */
Oh, I forgot: these need @Since annotations. Hm, I'm not clear whether we should merge this for 2.0.0 now. I wouldn't mind but we have an RC now. It's not super urgent. I might mark this as since 2.1.0.
I agree with you. I'll add @since 2.1.0 to all new functions in this PR.
Sorry, I fixed the previous comment.
|
Thank you, @srowen . |
|
Test build #61018 has finished for PR 13403 at commit
|
}

/**
 * Compute the population standard deviation of this RDD's elements.
We need to use the Scala @Since annotation. Have a look at how this is annotated elsewhere in Scala. For Java, yeah, we only use the @since javadoc tag because that's all that is available.
Oh, I made a stupid mistake again. Then, is the correct form @Since("2.1.0")?
By the way, some Scala code also uses the Java-style @since tag:
- SparkSession.scala, StreamingQuery.scala, StreamingQueryManager.scala
Is that by design, for Java compatibility?
It's valid, though I don't think we generally use them. I would just use the @Since annotation alone in Scala, since it is processed differently on purpose.
Sure. No doubt about that.
I'm just curious whether we need to replace them someday.
Anyway, I've almost finished the fix.
|
Test build #61106 has finished for PR 13403 at commit
|
|
Merged to master |
What changes were proposed in this pull request?
In SPARK-11490, variance/stdev are redefined as the sample variance/stdev instead of the population ones. This PR updates the remaining old documentation to prevent users from misunderstanding. This will update the following Scala/Java API docs.
Also, this PR explicitly adds the popVariance and popStdev functions.
How was this patch tested?
Pass the updated Jenkins tests.
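To make the population/sample distinction in the description concrete, here is a minimal plain-Java sketch that computes both definitions without Spark. The class name, the helper names, and the six input values are assumptions made for illustration; the values were picked so that the results line up with the 6.22222, 7.46667, and 2.49444 asserted in the Java test discussed in the review.

```java
import java.util.Arrays;
import java.util.List;

class VarianceSketch {
    // Arithmetic mean of the values.
    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    // Sum of squared deviations from the mean.
    static double sumSquaredDev(List<Double> xs) {
        double m = mean(xs);
        double acc = 0.0;
        for (double x : xs) {
            acc += (x - m) * (x - m);
        }
        return acc;
    }

    // Population variance: divide by n.
    static double popVariance(List<Double> xs) {
        return sumSquaredDev(xs) / xs.size();
    }

    // Sample variance: divide by n - 1.
    static double sampleVariance(List<Double> xs) {
        return sumSquaredDev(xs) / (xs.size() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical data set, chosen to match the numbers in the test.
        List<Double> data = Arrays.asList(1.0, 1.0, 2.0, 3.0, 5.0, 8.0);
        System.out.printf("popVariance    = %.5f%n", popVariance(data));
        System.out.printf("sampleVariance = %.5f%n", sampleVariance(data));
        System.out.printf("popStdev       = %.5f%n", Math.sqrt(popVariance(data)));
    }
}
```

The only difference between the two definitions is the divisor, so for n elements the sample variance is the population variance scaled by n / (n - 1); with six elements that factor is 1.2.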