
[SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev#13403

Closed
dongjoon-hyun wants to merge 10 commits into apache:master from dongjoon-hyun:SPARK-15660

Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 31, 2016

What changes were proposed in this pull request?

In SPARK-11490, variance/stdev were redefined as the sample variance/stdev instead of the population ones. This PR updates the remaining old documentation to prevent users from misunderstanding; it updates the Scala/Java API docs listed below.

Also, this PR explicitly adds popVariance and popStdev functions.
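As a rough illustration of the distinction this PR documents (using Python's stdlib statistics module rather than Spark, purely to show the two formulas):

```python
import statistics

data = [1.0, 2.0, 3.0]

# Sample variance divides the sum of squared deviations by n - 1;
# population variance divides by n.
sample_var = statistics.variance(data)   # 2 / (3 - 1) = 1.0
pop_var = statistics.pvariance(data)     # 2 / 3

print(sample_var, pop_var)
```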

How was this patch tested?

Pass the updated Jenkins tests.

@SparkQA

SparkQA commented May 31, 2016

Test build #59637 has finished for PR 13403 at commit 3fe0cb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 31, 2016

Hm, I think we probably don't want to change this behavior, so as not to surprise people ...

@dongjoon-hyun
Member Author

dongjoon-hyun commented May 31, 2016

Thank you for the review again, @rxin.

Actually, I fully understand and expected your decision.
The reason I filed this issue is that I think we need an explicit discussion and a conclusion on it.

I worried that Spark would carry this inconsistency forever, implicitly. As we know, if we do not address this in Spark 2.0, it will come up again in Spark 3.0, or maybe never, for the same reason.

@dongjoon-hyun
Member Author

Hi, @rxin .
I updated the example to be more practical by using SparkSession.createDataset().rdd.stdev.
If we must preserve the current behavior for backward compatibility, what about adding a note somewhere about the inconsistency for new users?

@srowen
Member

srowen commented May 31, 2016

Yeah, I find this surprising too. As far as I know, databases tend to treat stddev as the sample stddev -- except Hive, for some reason. I've never quite understood the theoretical motivation for that. Maybe the idea is that the aggregate is typically over some projection, or subset, of all data. But to me, the default should logically be population stdev, since there's no inherent reason to believe the result set is not the entire population of interest.

For RDDs, it seems even clearer that the behavior should be population stddev.
For Datasets, maybe it's less surprising to do what databases usually do.

It does bear documentation for sure, but maybe not changing the behavior.
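The size of the surprise depends on n: the two definitions differ by a factor of sqrt(n / (n - 1)), which matters mainly for small result sets. A quick sketch in plain Python (stdlib statistics module, not Spark):

```python
import math
import statistics

# stdev (sample) and pstdev (population) differ by sqrt(n / (n - 1)),
# so the gap shrinks as the data grows.
for n in (3, 30, 300):
    data = [float(i % 10) for i in range(n)]
    ratio = statistics.stdev(data) / statistics.pstdev(data)
    print(n, ratio, math.sqrt(n / (n - 1)))
```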

@dongjoon-hyun
Member Author

Thank you for the review, @srowen.

@dongjoon-hyun
Member Author

Rebased.

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59793 has finished for PR 13403 at commit 0715f08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2016

Test build #59979 has finished for PR 13403 at commit ffff801.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2016

Test build #60026 has finished for PR 13403 at commit ae5fb7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60195 has finished for PR 13403 at commit f999c8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 9, 2016

MLlib's stat.Statistics is also consistent with Dataset.

scala> import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
scala> Statistics.colStats(sc.parallelize(Seq(Vectors.dense(1.0),Vectors.dense(2.0),Vectors.dense(3.0)))).variance
res10: org.apache.spark.mllib.linalg.Vector = [1.0]

@dongjoon-hyun
Member Author

Although we cannot change the old API, I think it's a good idea to explicitly add popVariance and popStdev.

If nothing in this PR is allowed, what about just adding an explicit note to the old StatCounter.variance and StatCounter.stdev?

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter

@SparkQA

SparkQA commented Jun 14, 2016

Test build #60496 has finished for PR 13403 at commit a3baf4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15660][CORE] RDD and Dataset should show the consistent values for variance/stdev. [SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev Jun 17, 2016
@dongjoon-hyun
Member Author

Hi, @rxin and @srowen .
I have now updated this PR as follows.

  1. Update the documentation of the legacy Scala/Java API more clearly
  2. Add popVariance/popStdev functions explicitly

Could you give me some feedback?

@SparkQA

SparkQA commented Jun 17, 2016

Test build #60715 has finished for PR 13403 at commit e41561e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 17, 2016

Test build #60716 has finished for PR 13403 at commit 602e236.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

Here is the only changed part of the legacy code.

@dongjoon-hyun
Member Author

Thank you so much for your review, @srowen !
I updated the PR according to your comments.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60771 has finished for PR 13403 at commit 15938b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 18, 2016

CC @mengxr @jkbradley I'd kinda like to have this for 2.0.0 for completeness.

@dongjoon-hyun
Member Author

Ping~

assertEquals(20/6.0, rdd.mean(), 0.01);
assertEquals(6.22222, rdd.variance(), 0.01);
assertEquals(rdd.variance(), rdd.popVariance(), 0.01);
Member

I think these need to assert exact equality; there's no reason that they would ever be different at all.

Member Author

I agree!

assertEquals(rdd.variance(), rdd.popVariance());
assertEquals(7.46667, rdd.sampleVariance(), 0.01);
assertEquals(2.49444, rdd.stdev(), 0.01);
assertEquals(rdd.stdev(), rdd.popStdev(), 0.01);
Member

This still asserts approximate equality, when it should be exactly equal

Member Author

Oops. Sorry, I'll review my code again!

Contributor

I think it should be approximate equality but with a very small tolerance, e.g. 1e-14. Both calls trigger reduce jobs and we don't have guarantees on the ordering of reduce operations, which might lead to small numerical errors. Maybe it doesn't apply to the test case here, but it is still a best practice to not test strict equality of floating-point numbers.
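The underlying issue is that floating-point addition is not associative, so a reduce that combines partial sums in a different order can produce slightly different results. A minimal sketch in plain Python:

```python
# The classic non-associativity example: the same three doubles
# summed in two different orders give different bit patterns.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)               # exact equality fails
print(abs(a - b) < 1e-14)   # a small tolerance absorbs the difference
```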

@srowen
Member

srowen commented Jun 21, 2016

OK, I suppose it's either down then over, or over then down, in the API. Either way is consistent with something.

@mengxr
Contributor

mengxr commented Jun 21, 2016

@dongjoon-hyun @srowen I made a comment just before @dongjoon-hyun updated the PR:

I think it should be approximate equality but with a very small tolerance, e.g. 1e-14. Both calls trigger reduce jobs and we don't have guarantees on the ordering of reduce operations, which might lead to small numerical errors. Maybe it doesn't apply to the test case here, but it is still a best practice to not test strict equality of floating-point numbers.

@dongjoon-hyun
Member Author

Hi, @srowen .
Now, I fixed them all. Sorry for missing those.

@dongjoon-hyun
Member Author

Oh, thank you, @mengxr !
I'll update again.

@dongjoon-hyun
Member Author

Thank you for reviewing this PR, @mengxr and @srowen !

assert(abs(1.0 - rdd.stdev) < 0.01)
assert(abs(rdd.variance - rdd.popVariance) < 1e-14)
assert(abs(rdd.stdev - rdd.popStdev) < 1e-14)
assert(abs(2.0 - rdd.sampleVariance) < 0.01)
Contributor

The tolerance should be smaller too.

Member Author

Yep.

@dongjoon-hyun
Member Author

Hi, @mengxr .
I updated them to use exact expected values and small tolerances, too.

-    assert(abs(2.0 - rdd.sampleVariance) < 0.01)
-    assert(abs(1.41 - rdd.sampleStdev) < 0.01)
+    assert(abs(2.0 - rdd.sampleVariance) < 1e-14)
+    assert(abs(Math.sqrt(2.0) - rdd.sampleStdev) < 1e-14)
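The four asserted values hang together: they match any two-element dataset with population stdev 1.0, e.g. [1.0, 3.0] (hypothetical here; the suite's actual input is not shown in this thread). A plain-Python check with the stdlib statistics module:

```python
import math
import statistics

# Hypothetical two-element dataset reproducing the asserted moments.
data = [1.0, 3.0]

assert abs(statistics.pstdev(data) - 1.0) < 1e-14            # rdd.stdev
assert abs(statistics.pvariance(data) - 1.0) < 1e-14         # rdd.popVariance
assert abs(statistics.variance(data) - 2.0) < 1e-14          # rdd.sampleVariance
assert abs(statistics.stdev(data) - math.sqrt(2.0)) < 1e-14  # rdd.sampleStdev
print("all moments match")
```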

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60946 has finished for PR 13403 at commit aeb4888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60938 has finished for PR 13403 at commit 4f0fc8a.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60948 has finished for PR 13403 at commit cb10b9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60950 has finished for PR 13403 at commit 3a6396e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Compute the population standard deviation of this RDD's elements.
*/
Member

Oh, I forgot: these need @Since annotations. Hm, I'm not clear whether we should merge this for 2.0.0 now. I wouldn't mind but we have an RC now. It's not super urgent. I might mark this as since 2.1.0.

@dongjoon-hyun Jun 22, 2016
Member Author

I agree with you. I'll add @since 2.1.0 to all new functions in this PR.

Member Author

Sorry, I fixed the previous comment.

@dongjoon-hyun
Member Author

Thank you, @srowen .

@SparkQA

SparkQA commented Jun 22, 2016

Test build #61018 has finished for PR 13403 at commit a98633b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Compute the population standard deviation of this RDD's elements.
Member

We need to use the Scala @Since annotation. Have a look how this is annotated elsewhere in Scala. For Java, yeah, we only use the @since javadoc tag because that's all that is available.

Member Author

Oh, I made a silly mistake again. So the correct form is @Since("2.1.0")?

@dongjoon-hyun Jun 23, 2016
Member Author

By the way, some Scala code also uses the Java @since annotation:

  • SparkSession.scala, StreamingQuery.scala, StreamingQueryManager.scala

Were they written that way for Java compatibility?

Member

It's valid, though I don't think we generally use them. I would just use the @Since tag alone in Scala, since it is processed differently on purpose.

Member Author

Sure, no doubt about that.
I'm just curious whether we need to replace them someday.
Anyway, it's almost all fixed.

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61106 has finished for PR 13403 at commit bb12a7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 23, 2016

Merged to master

@asfgit asfgit closed this in 5eef1e6 Jun 23, 2016
@dongjoon-hyun
Member Author

Thank you for everything, @srowen , @mengxr , @rxin .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15660 branch July 20, 2016 07:40