Skip to content

Conversation

tools4origins
Copy link
Contributor

@tools4origins tools4origins commented Jun 17, 2019

What changes were proposed in this pull request?

The Python documentation incorrectly says that variance() acts as var_pop whereas it acts like var_samp here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance

It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html

The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-

The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786

How was this patch tested?

Using variance() in pyspark 2.4.3 returns:

>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+

@dongjoon-hyun dongjoon-hyun changed the title [DOC] Fix python variance() documentation [MINOR][DOC] Fix python variance() documentation Jun 17, 2019
@dongjoon-hyun
Copy link
Member

Thank you for your first contribution, @tools4origins .

@dongjoon-hyun
Copy link
Member

ok to test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add unbiased sample variance like the following two sentences?

'variance': 'Aggregate function: returns the unbiased sample variance of the values in a group.',
'var_samp': 'Aggregate function: returns the unbiased sample variance of the values in a group.',

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello! It seems that sample variance is an unbiased estimator of the variance, so unbiased sample variance might be an unnecessary repetition?

Copy link
Member

@dongjoon-hyun dongjoon-hyun Jun 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term unbiased sample variance is more precise. Please see https://en.wikipedia.org/wiki/Variance .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just say that 'variance' is an alias for 'var_samp' here rather than duplicate anything.

  • var_pop in databases is the variance of the values (e.g. divided by n). It's 'correct' if the data it's given is the whole 'population', as it is simply the value of the statistic.
  • var_samp treats the data as a sample of a larger unknown population. It's still returning the population variance, but an unbiased estimate of it (e.g. divided by n-1).

Therefore the var_samp description is maybe best as "an unbiased estimate of the population variance given the values as a sample". "Unbiased sample variance" is OK too though I've never liked that description. The sample variance is just a statistic of given values and can't be 'biased'. It's only biased or not as an estimator of population variance.

Anyway, I'm OK fixing all the docs accordingly, at least to read "unbiased sample variance".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @srowen 's recommendation about alias.

@SparkQA
Copy link

SparkQA commented Jun 17, 2019

Test build #106592 has finished for PR 24895 at commit b434a5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 19, 2019

Test build #106673 has finished for PR 24895 at commit 70f7ad3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jun 19, 2019

Looks OK, but your line is too long now.

@tools4origins tools4origins force-pushed the patch-1 branch 2 times, most recently from 9dc9ae5 to 75db266 Compare June 20, 2019 08:23
@SparkQA
Copy link

SparkQA commented Jun 20, 2019

Test build #106716 has finished for PR 24895 at commit 75db266.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 20, 2019

Test build #106717 has finished for PR 24895 at commit a894275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @tools4origins and @srowen .
Merged to master/2.4/2.3.

dongjoon-hyun pushed a commit that referenced this pull request Jun 20, 2019
## What changes were proposed in this pull request?

The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance

It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html

The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-

The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786

## How was this patch tested?
Using variance() in pyspark 2.4.3 returns:
```
>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+
```

Closes #24895 from tools4origins/patch-1.

Authored-by: tools4origins <tools4origins@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 25c5d57)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jun 20, 2019
## What changes were proposed in this pull request?

The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance

It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html

The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-

The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786

## How was this patch tested?
Using variance() in pyspark 2.4.3 returns:
```
>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+
```

Closes #24895 from tools4origins/patch-1.

Authored-by: tools4origins <tools4origins@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 25c5d57)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@tools4origins
Copy link
Contributor Author

Thanks a lot for your time, your awesome inputs and this amazing projects!

rluta pushed a commit to rluta/spark that referenced this pull request Sep 17, 2019
## What changes were proposed in this pull request?

The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance

It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html

The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-

The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786

## How was this patch tested?
Using variance() in pyspark 2.4.3 returns:
```
>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+
```

Closes apache#24895 from tools4origins/patch-1.

Authored-by: tools4origins <tools4origins@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 25c5d57)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Sep 26, 2019
## What changes were proposed in this pull request?

The Python documentation incorrectly says that `variance()` acts as `var_pop` whereas it acts like `var_samp` here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.variance

It was not the case in Spark 1.6 doc but it is in Spark 2.0 doc:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html

The Scala documentation is correct: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#variance-org.apache.spark.sql.Column-

The alias is set on this line:
https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L786

## How was this patch tested?
Using variance() in pyspark 2.4.3 returns:
```
>>> spark.createDataFrame([(1, ), (2, ), (3, )], "a: int").select(variance("a")).show()
+-----------+
|var_samp(a)|
+-----------+
|        1.0|
+-----------+
```

Closes apache#24895 from tools4origins/patch-1.

Authored-by: tools4origins <tools4origins@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 25c5d57)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants