
[SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev#13403

Closed
dongjoon-hyun wants to merge 10 commits into apache:master from dongjoon-hyun:SPARK-15660

Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented May 31, 2016

What changes were proposed in this pull request?

In SPARK-11490, variance/stdev were redefined as the sample variance/stdev instead of the population ones. This PR updates the remaining old documentation to prevent users from misunderstanding; it updates the Scala/Java API docs listed below.

Also, this PR explicitly adds popVariance and popStdev functions.
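As a rough illustration of the distinction this PR documents (using Python's stdlib statistics module rather than Spark, purely to show the two formulas):

```python
import statistics

data = [1.0, 2.0, 3.0]

# Sample variance divides the sum of squared deviations by n - 1;
# population variance divides by n.
sample_var = statistics.variance(data)   # 2 / (3 - 1) = 1.0
pop_var = statistics.pvariance(data)     # 2 / 3

print(sample_var, pop_var)
```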

How was this patch tested?

Pass the updated Jenkins tests.

@SparkQA

SparkQA commented May 31, 2016

Test build #59637 has finished for PR 13403 at commit 3fe0cb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 31, 2016

Hm, I think we probably don't want to change this behavior, so as not to surprise people ...

@dongjoon-hyun
Member Author

dongjoon-hyun commented May 31, 2016

Thank you for the review again, @rxin.

Actually, I fully understand and expected your decision.
The reason I filed this issue is that I think we need an explicit discussion and a conclusion on it.

I worried that Spark would carry this inconsistency forever, implicitly. As we know, if we do not address this in Spark 2.0, it will come up again in Spark 3.0, or maybe never, for the same reason.

@dongjoon-hyun
Member Author

Hi, @rxin .
I updated the example to be more practical by using SparkSession.createDataset().rdd.stdev.
If we must preserve the current behavior for backward compatibility, what about adding a note somewhere about the inconsistency for new users?

@srowen
Member

srowen commented May 31, 2016

Yeah, I find this surprising too. As far as I know, databases tend to treat stddev as the sample stddev -- except Hive, for some reason. I've never quite understood the theoretical motivation for that. Maybe the idea is that the aggregate is typically over some projection, or subset, of all data. But to me, the default should logically be population stdev, since there's no inherent reason to believe the result set is not the entire population of interest.

For RDDs, it seems even clearer that the behavior should be population stddev.
For Datasets, maybe it's less surprising to do what databases usually do.

It does bear documentation for sure, but maybe not changing the behavior.
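The size of the surprise depends on n: the two definitions differ by a factor of sqrt(n / (n - 1)), which matters mainly for small result sets. A quick sketch in plain Python (stdlib statistics module, not Spark):

```python
import math
import statistics

# stdev (sample) and pstdev (population) differ by sqrt(n / (n - 1)),
# so the gap shrinks as the data grows.
for n in (3, 30, 300):
    data = [float(i % 10) for i in range(n)]
    ratio = statistics.stdev(data) / statistics.pstdev(data)
    print(n, ratio, math.sqrt(n / (n - 1)))
```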

@dongjoon-hyun
Member Author

Thank you for the review, @srowen.

@dongjoon-hyun
Member Author

Rebased.

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59793 has finished for PR 13403 at commit 0715f08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2016

Test build #59979 has finished for PR 13403 at commit ffff801.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 6, 2016

Test build #60026 has finished for PR 13403 at commit ae5fb7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 8, 2016

Test build #60195 has finished for PR 13403 at commit f999c8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 9, 2016

MLlib's stat.Statistics is also consistent with Dataset.

scala> import org.apache.spark.mllib.linalg.Vectors
scala> import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
scala> Statistics.colStats(sc.parallelize(Seq(Vectors.dense(1.0),Vectors.dense(2.0),Vectors.dense(3.0)))).variance
res10: org.apache.spark.mllib.linalg.Vector = [1.0]

@dongjoon-hyun
Member Author

Although we cannot change the old API, I think it's a good idea to explicitly add popVariance and popStdev.

If nothing in this PR is allowed, what about just adding an explicit note to the old StatCounter.variance and StatCounter.stdev?

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.util.StatCounter

@SparkQA

SparkQA commented Jun 14, 2016

Test build #60496 has finished for PR 13403 at commit a3baf4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-15660][CORE] RDD and Dataset should show the consistent values for variance/stdev. [SPARK-15660][CORE] Update RDD variance/stdev description and add popVariance/popStdev Jun 17, 2016
@dongjoon-hyun
Member Author

Hi, @rxin and @srowen .
I have now updated this PR as follows.

  1. Update the documentation of the legacy Scala/Java API more clearly
  2. Add popVariance/popStdev functions explicitly

Could you give me some feedback?

@SparkQA

SparkQA commented Jun 17, 2016

Test build #60715 has finished for PR 13403 at commit e41561e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 17, 2016

Test build #60716 has finished for PR 13403 at commit 602e236.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

Here is the only changed part of the legacy code.

@dongjoon-hyun
Member Author

Thank you so much for your review, @srowen !
I updated the PR according to your comments.

@SparkQA

SparkQA commented Jun 18, 2016

Test build #60771 has finished for PR 13403 at commit 15938b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 18, 2016

CC @mengxr @jkbradley I'd kinda like to have this for 2.0.0 for completeness.

@dongjoon-hyun
Member Author

Ping~

assertEquals(20/6.0, rdd.mean(), 0.01);
assertEquals(6.22222, rdd.variance(), 0.01);
assertEquals(rdd.variance(), rdd.popVariance(), 0.01);
Member

I think these need to assert exact equality; there's no reason that they would ever be different at all.

Member Author

I agree!

assertEquals(rdd.variance(), rdd.popVariance());
assertEquals(7.46667, rdd.sampleVariance(), 0.01);
assertEquals(2.49444, rdd.stdev(), 0.01);
assertEquals(rdd.stdev(), rdd.popStdev(), 0.01);
Member

This still asserts approximate equality, when it should be exactly equal

Member Author

Oops. Sorry, I'll review my code again!

Contributor

I think it should be approximate equality but with a very small tolerance, e.g. 1e-14. Both calls trigger reduce jobs and we don't have guarantees on the ordering of reduce operations, which might lead to small numerical errors. Maybe it doesn't apply to the test case here, but it is still a best practice to not test strict equality of floating-point numbers.
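The underlying issue is that floating-point addition is not associative, so a reduce that combines partial sums in a different order can produce slightly different results. A minimal sketch in plain Python:

```python
# The classic non-associativity example: the same three doubles
# summed in two different orders give different bit patterns.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)               # exact equality fails
print(abs(a - b) < 1e-14)   # a small tolerance absorbs the difference
```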

@srowen
Member

srowen commented Jun 21, 2016

OK, I suppose it's either down then over, or over then down, in the API. Either way is consistent with something.

@mengxr
Contributor

mengxr commented Jun 21, 2016

@dongjoon-hyun @srowen I made a comment just before @dongjoon-hyun updated the PR:

I think it should be approximate equality but with a very small tolerance, e.g. 1e-14. Both calls trigger reduce jobs and we don't have guarantees on the ordering of reduce operations, which might lead to small numerical errors. Maybe it doesn't apply to the test case here, but it is still a best practice to not test strict equality of floating-point numbers.

@dongjoon-hyun
Member Author

Hi, @srowen .
Now, I fixed them all. Sorry for missing those.

@dongjoon-hyun
Member Author

Oh, thank you, @mengxr !
I'll update again.

@dongjoon-hyun
Member Author

Thank you for reviewing this PR, @mengxr and @srowen !

assert(abs(1.0 - rdd.stdev) < 0.01)
assert(abs(rdd.variance - rdd.popVariance) < 1e-14)
assert(abs(rdd.stdev - rdd.popStdev) < 1e-14)
assert(abs(2.0 - rdd.sampleVariance) < 0.01)
Contributor

The tolerance should be smaller too.

Member Author

Yep.

@dongjoon-hyun
Member Author

Hi, @mengxr .
I updated them to use exact expected values and small tolerances, too.

-    assert(abs(2.0 - rdd.sampleVariance) < 0.01)
-    assert(abs(1.41 - rdd.sampleStdev) < 0.01)
+    assert(abs(2.0 - rdd.sampleVariance) < 1e-14)
+    assert(abs(Math.sqrt(2.0) - rdd.sampleStdev) < 1e-14)
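The four asserted values hang together: they match any two-element dataset with population stdev 1.0, e.g. [1.0, 3.0] (hypothetical here; the suite's actual input is not shown in this thread). A plain-Python check with the stdlib statistics module:

```python
import math
import statistics

# Hypothetical two-element dataset reproducing the asserted moments.
data = [1.0, 3.0]

assert abs(statistics.pstdev(data) - 1.0) < 1e-14            # rdd.stdev
assert abs(statistics.pvariance(data) - 1.0) < 1e-14         # rdd.popVariance
assert abs(statistics.variance(data) - 2.0) < 1e-14          # rdd.sampleVariance
assert abs(statistics.stdev(data) - math.sqrt(2.0)) < 1e-14  # rdd.sampleStdev
print("all moments match")
```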

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60946 has finished for PR 13403 at commit aeb4888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60938 has finished for PR 13403 at commit 4f0fc8a.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60948 has finished for PR 13403 at commit cb10b9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 21, 2016

Test build #60950 has finished for PR 13403 at commit 3a6396e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Compute the population standard deviation of this RDD's elements.
*/
Member

Oh, I forgot: these need @Since annotations. Hm, I'm not clear whether we should merge this for 2.0.0 now. I wouldn't mind but we have an RC now. It's not super urgent. I might mark this as since 2.1.0.

@dongjoon-hyun Jun 22, 2016
Member Author

I agree with you. I'll add @since 2.1.0 to all new functions in this PR.

Member Author

Sorry, I fixed the previous comment.

@dongjoon-hyun
Member Author

Thank you, @srowen .

@SparkQA

SparkQA commented Jun 22, 2016

Test build #61018 has finished for PR 13403 at commit a98633b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Compute the population standard deviation of this RDD's elements.
Member

We need to use the Scala @Since annotation. Have a look how this is annotated elsewhere in Scala. For Java, yeah, we only use the @since javadoc tag because that's all that is available.

Member Author

Oh, I made a silly mistake again. So the correct form is @Since("2.1.0")?

@dongjoon-hyun Jun 23, 2016
Member Author

By the way, some Scala code also uses the Java @since annotation:

  • SparkSession.scala, StreamingQuery.scala, StreamingQueryManager.scala

Were they written that way for Java compatibility?

Member

It's valid, though I don't think we generally use them. I would just use the @Since tag alone in Scala, since it is processed differently on purpose.

Member Author

Sure, no doubt about that.
I'm just curious whether we need to replace them someday.
Anyway, it's almost all fixed.

@SparkQA

SparkQA commented Jun 23, 2016

Test build #61106 has finished for PR 13403 at commit bb12a7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jun 23, 2016

Merged to master

@asfgit asfgit closed this in 5eef1e6 Jun 23, 2016
@dongjoon-hyun
Member Author

Thank you for everything, @srowen , @mengxr , @rxin .

@dongjoon-hyun dongjoon-hyun deleted the SPARK-15660 branch July 20, 2016 07:40