
[SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop #10876

Closed
yanboliang wants to merge 5 commits into apache:master from yanboliang:spark-12962

Conversation

yanboliang
Contributor

PySpark support covar_samp and covar_pop.

cc @rxin @davies @marmbrus
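
(For reference, a minimal usage sketch of the two new aggregates, using the Spark 1.6-era sqlContext from this PR's doctests; the data is illustrative, and the results are checked with a tolerance since the aggregates are floating point:)

>>> from pyspark.sql.functions import covar_pop, covar_samp
>>> df = sqlContext.createDataFrame(zip(range(5), [2 * x for x in range(5)]), ["a", "b"])
>>> covDf = df.agg(covar_pop("a", "b").alias("cp"), covar_samp("a", "b").alias("cs"))
>>> covDf.selectExpr("abs(cp - 4.0) < 1e-9 as p", "abs(cs - 5.0) < 1e-9 as s").collect()
[Row(p=True, s=True)]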

@SparkQA

SparkQA commented Jan 22, 2016

Test build #49909 has finished for PR 10876 at commit 07b58d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> a = [x * x - 2 * x + 3.5 for x in range(20)]
>>> b = range(20)
>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
>>> covDf = df.agg(covar_pop("a", "b").alias('c'))
>>> covDf.select("c").collect()
[Row(c=565.25)]
Contributor

Should we maybe compare with a tolerance, as is done in the other doctests, since this is floating point?

@SparkQA

SparkQA commented Jan 27, 2016

Test build #50164 has finished for PR 10876 at commit 645a5d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 28, 2016

This looks good to me, maybe @davies or @jkbradley could take a look?

>>> a = [x * x - 2 * x + 3.5 for x in range(20)]
>>> b = range(20)
>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
>>> covDf = df.agg(covar_pop("a", "b").alias('c'))
>>> covDf.selectExpr('abs(c - 565.25) < 1e-16 as t').collect()
[Row(t=True)]
Contributor

Since these are going to be part of the API docs, I'd like them to be as simple as possible.
Also, the Python API is only a wrapper around the Scala one, so we don't need to test for correctness here.
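
(For background, the Python wrappers in python/pyspark/sql/functions.py are thin shims that forward to the JVM; a rough sketch of the pattern, not the exact merged code:)

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def covar_pop(col1, col2):
    """Returns a new Column for the population covariance of col1 and col2."""
    sc = SparkContext._active_spark_context
    # All real work happens in the JVM; Python only forwards the column expressions.
    jc = sc._jvm.functions.covar_pop(_to_java_column(col1), _to_java_column(col2))
    return Column(jc)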

Contributor

So just as a side note: it seems the other functions (including corr) do have these correctness tests in their doctests. We should probably be consistent with them one way or the other, yes?

Contributor

Yeah, we should clean them up.

Contributor Author

@davies OK, I will clean them up.

@davies
Contributor

davies commented Jan 28, 2016

@yanboliang Could you also add Python API for corr?
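
(A sketch of what a matching corr doctest could look like; values are illustrative, and a tolerance is used because the Pearson correlation of exactly linear data may not print as exactly 1.0:)

>>> a = range(20)
>>> b = [2 * x for x in a]
>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
>>> df.agg(corr("a", "b").alias('c')).selectExpr('abs(c - 1.0) < 1e-9 as t').collect()
[Row(t=True)]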

@felixcheung
Member

@davies
Contributor

davies commented Jan 28, 2016

@felixcheung I see, thanks!

@SparkQA

SparkQA commented Jan 29, 2016

Test build #50344 has finished for PR 10876 at commit ad15ff4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

ping @davies

@davies
Contributor

davies commented Feb 5, 2016

@yanboliang I'm sorry that I didn't make it clear: it's good to have a simple test that can also serve as an example demonstrating how to use these functions (it should be as simple as possible).

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51104 has finished for PR 10876 at commit 8029911.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2016

Test build #51106 has finished for PR 10876 at commit c29ff29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


>>> a = [x * x - 2 * x + 3.5 for x in range(20)]
>>> b = range(20)
>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
Contributor

How about a == b with constant values, so the result will be zero?
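
(A sketch of the simplified doctest this suggests, assuming both columns hold the same constant value so the population covariance is exactly zero:)

>>> a = [1] * 10
>>> b = [1] * 10
>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
>>> df.agg(covar_pop("a", "b").alias('c')).collect()
[Row(c=0.0)]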

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51181 has finished for PR 10876 at commit 374931a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2016

Test build #51183 has finished for PR 10876 at commit 525d1f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Contributor

davies commented Feb 12, 2016

LGTM, merging into master, thanks!

@asfgit asfgit closed this in 90de6b2 Feb 12, 2016
@yanboliang yanboliang deleted the spark-12962 branch February 14, 2016 03:15