
[SPARK-10291] [PySpark] statsByKey method for RDD #8539

Closed · wants to merge 1 commit

Conversation

@eshilts commented Aug 31, 2015

Added statsByKey() method for computing summary statistics of each key in an RDD.

>>> x = sc.parallelize([("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)])
>>> s = sorted(x.statsByKey().collect())
>>> s[0]
('key_a', (count: 2, mean: 1.5, stdev: 0.5, max: 2.0, min: 1.0))
>>> s[1]
('key_b', (count: 2, mean: 2.5, stdev: 0.5, max: 3.0, min: 2.0))

https://issues.apache.org/jira/browse/SPARK-10291
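The semantics of the proposed method can be checked against a plain-Python reference. The sketch below (`key_stats` is an illustrative name, not part of any Spark API) groups values by key and computes the same five statistics the example above prints; note that PySpark's `StatCounter.stdev` is the population standard deviation, which is what reproduces `stdev: 0.5` for two values `1.0` and `2.0`.

```python
import math
from collections import defaultdict

def key_stats(pairs):
    """Group (key, value) pairs and compute (count, mean, stdev, max, min)
    per key, mirroring the output of the proposed statsByKey()."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        n = len(vs)
        mean = sum(vs) / n
        # Population standard deviation, matching StatCounter.stdev
        stdev = math.sqrt(sum((v - mean) ** 2 for v in vs) / n)
        out[k] = (n, mean, stdev, max(vs), min(vs))
    return out

data = [("key_a", 1.0), ("key_a", 2.0), ("key_b", 2.0), ("key_b", 3.0)]
stats = key_stats(data)
# key_a -> (2, 1.5, 0.5, 2.0, 1.0); key_b -> (2, 2.5, 0.5, 3.0, 2.0)
```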

@eshilts (Author) commented Oct 28, 2015

This is ready to test.

I often manually calculate the mean, standard deviation, and other summary statistics across keys, and this RDD method would make that a lot easier.

@andrewor14 (Contributor) commented:
ok to test. What do you think @srowen @JoshRosen?

@andrewor14 (Contributor) commented:

If we want something like this, it would be good to add the Scala API first, though.

@SparkQA commented Dec 15, 2015

Test build #47706 has finished for PR 8539 at commit fdc858f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Dec 15, 2015

Yeah, this can't just exist in the Python API. It's also not simplifying much: there's already a whole class that accumulates the sufficient statistics, so this is just a call to combineByKey. I appreciate the value of utility methods, but that has to be weighed against adding another item to a core API and how often it would be used. This is also straightforward to express in Spark SQL on a DataFrame, no?

@srowen (Member) commented Jan 1, 2016

Do you mind closing this PR?

@asfgit closed this in 085f510 on Feb 4, 2016