[MINOR][DOC] Fix python variance() documentation #24895
Conversation
Thank you for your first contribution, @tools4origins.
ok to test
python/pyspark/sql/functions.py
Shall we add `unbiased sample variance`, like the following two sentences?
'variance': 'Aggregate function: returns the unbiased sample variance of the values in a group.',
'var_samp': 'Aggregate function: returns the unbiased sample variance of the values in a group.',
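For context, these docstrings live in a name-to-docstring mapping in python/pyspark/sql/functions.py that is expanded into the actual Python functions; a rough sketch of that shape (the dict name below is illustrative, not the real one):

```python
# Illustrative sketch only: the real mapping in python/pyspark/sql/functions.py
# uses a different, version-specific dict name; the entries look like this.
_agg_docstrings = {
    'variance': 'Aggregate function: returns the unbiased sample variance of the values in a group.',
    'var_samp': 'Aggregate function: returns the unbiased sample variance of the values in a group.',
    'var_pop': 'Aggregate function: returns the population variance of the values in a group.',
}
```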
Hello! It seems that the sample variance is an unbiased estimator of the variance, so `unbiased sample variance` might be an unnecessary repetition?
The term `unbiased sample variance` is more precise. Please see https://en.wikipedia.org/wiki/Variance.
Just say that 'variance' is an alias for 'var_samp' here rather than duplicate anything.
`var_pop` in databases is the variance of the values (e.g. divided by n). It's 'correct' if the data it's given is the whole 'population', as it is simply the value of the statistic. `var_samp` treats the data as a sample of a larger unknown population. It's still returning the population variance, but an unbiased estimate of it (e.g. divided by n-1).
Therefore the `var_samp` description is maybe best as "an unbiased estimate of the population variance given the values as a sample". "Unbiased sample variance" is OK too, though I've never liked that description. The sample variance is just a statistic of the given values and can't be 'biased'; it's only biased or not as an estimator of the population variance.
Anyway, I'm OK fixing all the docs accordingly, at least to read "unbiased sample variance".
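As a quick worked illustration of the n vs. n-1 distinction discussed above (plain Python, not part of this patch; the values match the test in the PR description):

```python
# Plain-Python illustration of the n vs. n-1 distinction (not part of the patch).
values = [1, 2, 3]
n = len(values)
mean = sum(values) / n
sq_dev = sum((x - mean) ** 2 for x in values)

var_pop = sq_dev / n         # population variance: divide by n     -> 0.666...
var_samp = sq_dev / (n - 1)  # unbiased estimate:  divide by n - 1  -> 1.0, matching var_samp(a) below

print(var_pop, var_samp)
```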
+1 for @srowen's recommendation about the alias.
Test build #106592 has finished for PR 24895 at commit
Test build #106673 has finished for PR 24895 at commit
Looks OK, but your line is too long now.
Force-pushed from 9dc9ae5 to 75db266.
Test build #106716 has finished for PR 24895 at commit
Test build #106717 has finished for PR 24895 at commit
+1, LGTM. Thank you, @tools4origins and @srowen.
Merged to master/2.4/2.3.
Thanks a lot for your time, your awesome inputs, and this amazing project!
What changes were proposed in this pull request?
The Python documentation incorrectly says that `variance()` acts as `var_pop`, whereas it actually acts like `var_samp`: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance
This was not the case in the Spark 1.6 doc, but it is in the Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-
The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786
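A minimal sketch of how the alias behaves from PySpark, assuming a running SparkSession bound to `spark`; the `var_pop` comparison is added here only for illustration and is not part of the patch:

```python
from pyspark.sql import functions as F

# Assumes an existing SparkSession named `spark`.
df = spark.createDataFrame([(1,), (2,), (3,)], "a: int")

# variance() is an alias for var_samp (divide by n - 1); var_pop divides by n.
df.select(F.variance("a"), F.var_samp("a"), F.var_pop("a")).show()
# Per the PR description, the column produced by variance("a") is labeled
# var_samp(a); both report 1.0 here, while var_pop(a) reports ~0.6667.
```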
How was this patch tested?
Using variance() in pyspark 2.4.3 returns:
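```
>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+
```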