
[SPARK-10291] [PySpark] statsByKey method for RDD #8539

Closed · wants to merge 1 commit

Conversation

@eshilts commented Aug 31, 2015

Added statsByKey() method for computing summary statistics of each key in an RDD.

>>> x = sc.parallelize([("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)])
>>> s = sorted(x.statsByKey().collect())
>>> s[0]
('key_a', (count: 2, mean: 1.5, stdev: 0.5, max: 2.0, min: 1.0))
>>> s[1]
('key_b', (count: 2, mean: 2.5, stdev: 0.5, max: 3.0, min: 2.0))

https://issues.apache.org/jira/browse/SPARK-10291
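The semantics of the proposed method can be checked against a plain-Python reference. The sketch below (`key_stats` is an illustrative name, not part of any Spark API) groups values by key and computes the same five statistics the example above prints; note that PySpark's `StatCounter.stdev` is the population standard deviation, which is what reproduces `stdev: 0.5` for two values `1.0` and `2.0`.

```python
import math
from collections import defaultdict

def key_stats(pairs):
    """Group (key, value) pairs and compute (count, mean, stdev, max, min)
    per key, mirroring the output of the proposed statsByKey()."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        n = len(vs)
        mean = sum(vs) / n
        # Population standard deviation, matching StatCounter.stdev
        stdev = math.sqrt(sum((v - mean) ** 2 for v in vs) / n)
        out[k] = (n, mean, stdev, max(vs), min(vs))
    return out

data = [("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)]
stats = key_stats(data)
# key_a -> (2, 1.5, 0.5, 2.0, 1.0); key_b -> (2, 2.5, 0.5, 3.0, 2.0)
```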

@eshilts (Author) commented Oct 28, 2015

This is ready to test.

I often manually calculate the mean, standard deviation, and other summary statistics across keys, and this RDD method would make that a lot easier.

@andrewor14 (Contributor) commented:
ok to test. What do you think @srowen @JoshRosen?

@andrewor14 (Contributor) commented:

If we want something like this, it would be good to add the Scala API first, though.

@SparkQA commented Dec 15, 2015

Test build #47706 has finished for PR 8539 at commit fdc858f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Dec 15, 2015

Yeah, this can't just exist in the Python API. It's also not simplifying much: there's already a whole class that accumulates the sufficient statistics, so this is just a call to combineByKey. I appreciate the value of utility methods, but that has to be weighed against adding another item to a core API and how often it would be used. This is also straightforward to express in Spark SQL on a DataFrame, no?

@srowen (Member) commented Jan 1, 2016

Do you mind closing this PR?

@asfgit closed this in 085f510 on Feb 4, 2016