**_pySpark Basics: Summary Statistics_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 3 Aug 2016, Spark v1.6.1_

_Abstract: Here we will cover several common ways to summarize data.  Many of these methods have been discussed in other tutorials in different contexts._

_Main operations used: `describe`, `skewness`, `kurtosis`, `collect`, `select`_

***

First we will load the same csv data we've been using in many other tutorials, then pare it down to a manageable subset for ease of use:

In [2]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [3]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|')

In [6]:
df = df[['C0', 'C2', 'C3', 'C4', 'C5', 'C6']]

In [7]:
df.show(5)

+------------+-----+-----+----+---+---+
|          C0|   C2|   C3|  C4| C5| C6|
+------------+-----+-----+----+---+---+
|100002091588|OTHER|4.125|null|  0|360|
|100002091588|     |4.125|null|  1|359|
|100002091588|     |4.125|null|  2|358|
|100002091588|     |4.125|null|  3|357|
|100002091588|     |4.125|null|  4|356|
+------------+-----+-----+----+---+---+
only showing top 5 rows



# Describe

The first thing we'll do is use the `describe` method to get some basic summary statistics.  Note that **describe will return a new dataframe** holding those statistics, so we'll assign the results to a new variable and then call `show` on it:

In [8]:
df_described = df.describe()
df_described.show()

+-------+--------------------+-------------------+------------------+------------------+-----------------+
|summary|                  C0|                 C3|                C4|                C5|               C6|
+-------+--------------------+-------------------+------------------+------------------+-----------------+
|  count|             3526154|            3526154|           1580402|           3526154|          3526154|
|   mean| 5.50388599500189E11|  4.178168090221902|234846.78065481808| 5.134865351881966|354.7084951479714|
| stddev|2.596112361975222...|0.34382335723646484|118170.68592261616|3.3833930336063456|4.011812510792076|
|    min|        100002091588|               2.75|              0.85|                -1|              292|
|    max|        999995696635|              6.125|        1193544.39|                34|              480|
+-------+--------------------+-------------------+------------------+------------------+-----------------+
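To make the five rows concrete, here is a minimal plain-Python sketch of the same statistics for a single column.  It assumes, as the lower `count` for C4 above suggests, that nulls are skipped, and that the `stddev` row is the sample standard deviation (n - 1 denominator).  The helper name `describe_column` and the sample values are made up for illustration and are not part of Spark.

```python
import math

def describe_column(values):
    """Hypothetical helper: describe-style summary of one column, ignoring None."""
    present = [v for v in values if v is not None]  # nulls are left out of every statistic
    n = len(present)
    mean = sum(present) / float(n)
    # Sample variance (n - 1 denominator); assumed to match the stddev row of describe.
    variance = sum((v - mean) ** 2 for v in present) / float(n - 1)
    return {
        'count': n,
        'mean': mean,
        'stddev': math.sqrt(variance),
        'min': min(present),
        'max': max(present),
    }

sample = [4.125, None, 3.875, 4.5, None, 4.0]  # toy values, two of them null
summary = describe_column(sample)
print(summary)
```

The sketch needs at least two non-null values, since the sample variance divides by n - 1.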



Aside from the five included in `describe`, there are a handful of other built-in aggregators that can be applied to a column:

In [9]:
from pyspark.sql.functions import skewness, kurtosis, var_pop, var_samp, stddev, stddev_pop, sumDistinct
df.select(skewness('C3')).show()

+------------------+
|  skewness(C3,0,0)|
+------------------+
|0.5197993394959969|
+------------------+
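For readers who want to see what these two aggregators measure, the following is a rough plain-Python sketch.  It assumes that `skewness` is the population skewness and that `kurtosis` is the excess kurtosis (so a normal distribution scores near 0 on both); the function names `central_moment`, `skewness_py` and `kurtosis_py` are hypothetical and only for illustration.

```python
def central_moment(values, k):
    """k-th central moment with an n denominator (population form)."""
    n = float(len(values))
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness_py(values):
    # Third central moment scaled by the variance to the power 3/2.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis_py(values):
    # Fourth central moment scaled by the squared variance, minus 3 (excess kurtosis).
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]        # toy data, mirrored around its mean
right_tailed = [1, 1, 1, 2, 10]    # toy data with one large value

print(skewness_py(symmetric), kurtosis_py(symmetric))
print(skewness_py(right_tailed))
```

Symmetric data gives a skewness of zero, while a long right tail pushes it positive, which is how the small positive value for C3 above can be read.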



# Expanding the Describe Output

One convenient thing we might want to do is put all our summary statistics together in one spot - in essence, expand the output from `describe`.  Below I'll go into a short example:

In [16]:
from pyspark.sql import Row

columns = df_described.columns  #a list of the column names: ['summary', 'C0', 'C3', 'C4', 'C5', 'C6']
funcs   = [skewness, kurtosis]  #a list of the functions we want to add to our summary statistics (imported earlier)
fnames  = ['skew', 'kurtosis']  #a list of strings describing the functions in the same order

def new_item(func, column):
    """
    This function takes in an aggregation function and a column name, then applies the aggregation to the column,
    collects it and returns a value.  The value is in string format despite being a number, because that matches
    the output of describe.
    """
    return str(df.select(func(column)).collect()[0][0])

new_data = []
for func, fname in zip(funcs, fnames):
    row_dict = {'summary':fname}  #each row object begins with an entry for "summary"
    for column in columns[1:]:
        row_dict[column] = new_item(func, column)
    new_data.append(Row(**row_dict))  #the ** format tells Python to unpack the entries of the dictionary we just built
    
print(new_data)

[Row(C0='-0.00183847089857', C3='0.519799339496', C4='0.758411576756', C5='0.286480156084', C6='-2.69765201567', summary='skew'), Row(C0='-1.19900726351', C3='0.126057726847', C4='0.576085602656', C5='0.195187780089', C6='24.7237858944', summary='kurtosis')]
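The loop above leans on two Python idioms: `zip` to walk two lists in step, and `**` to unpack a dictionary into keyword arguments.  A stripped-down sketch without Spark may help; here plain lists stand in for columns, `min` and `max` stand in for the aggregation functions, and `make_row` is a hypothetical stand-in for `Row`.

```python
def make_row(**fields):
    """Hypothetical stand-in for Row: returns its keyword arguments as a dict."""
    return fields

data = {'a': [1, 2, 3], 'b': [10, 20, 60]}  # toy "columns"
funcs = [min, max]                          # functions to apply to each column
fnames = ['lowest', 'highest']              # labels, in the same order as funcs

rows = []
for func, fname in zip(funcs, fnames):      # pairs min with 'lowest', max with 'highest'
    row_dict = {'summary': fname}
    for column in sorted(data):
        row_dict[column] = str(func(data[column]))  # strings, mirroring describe output
    rows.append(make_row(**row_dict))       # same as make_row(summary=..., a=..., b=...)

print(rows)
```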


This code iterates through the entries in `funcs` and `fnames` together, then builds a new row object following the format of the standard `describe` output.  You can see from the output that it looks nearly identical to the output of `collect` when applied to a dataframe:

In [17]:
desc_collect = df_described.collect()
print(desc_collect)

[Row(summary=u'count', C0=u'3526154', C3=u'3526154', C4=u'1580402', C5=u'3526154', C6=u'3526154'), Row(summary=u'mean', C0=u'5.50388599500189E11', C3=u'4.178168090221902', C4=u'234846.78065481808', C5=u'5.134865351881966', C6=u'354.7084951479714'), Row(summary=u'stddev', C0=u'2.5961123619752228E11', C3=u'0.34382335723646484', C4=u'118170.68592261616', C5=u'3.3833930336063456', C6=u'4.011812510792076'), Row(summary=u'min', C0=u'100002091588', C3=u'2.75', C4=u'0.85', C5=u'-1', C6=u'292'), Row(summary=u'max', C0=u'999995696635', C3=u'6.125', C4=u'1193544.39', C5=u'34', C6=u'480')]


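The one difference is that the columns are out of order within the rows.  This is because we built each row from a dictionary passed as keyword arguments, which does not preserve the column order of `describe` (here the fields come out sorted by name, with `summary` last).  We will fix that below.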
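As a rough illustration of the ordering issue: in the Spark version used here, a `Row` built from keyword arguments ends up with its fields sorted by name.  The sketch below imitates that behavior with a `namedtuple`; `sorted_row` and `reorder` are hypothetical helpers, and the values are rounded toy numbers.  The idea of the fix is simply to pick the fields back out by name in the order we want.

```python
from collections import namedtuple

def sorted_row(**fields):
    """Hypothetical stand-in for a Row whose fields are sorted by name."""
    names = sorted(fields)
    row_type = namedtuple('SortedRow', names)
    return row_type(*[fields[name] for name in names])

def reorder(row, columns):
    """Pull values out of a row by name, in the requested column order."""
    return tuple(getattr(row, name) for name in columns)

row = sorted_row(summary='skew', C3='0.52', C0='-0.0018')
print(row)                                    # fields appear as C0, C3, summary
print(reorder(row, ['summary', 'C0', 'C3']))  # back in describe-style order
```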

The next step is to join the two sets of data into one, in order to make a modified `describe` output that includes skew and kurtosis.  The same method could be used to include any other aggregations desired.

One side note: we perform the join by **first turning the `describe` output into a list of raw row objects using `collect` (as in the above codeblock), then turning it back into a dataframe.**  This works around a bug in Spark 1.6 and earlier where `describe` output wasn't serializable.  The bug was raised on the Spark Jira board and fixed with a patch, but that patch has not been deployed to Spark 1.6 on Amazon Web Services.  Breaking the output down into raw data and then rebuilding the dataframe circumvents the problem.  Once AWS upgrades to Spark 2.0 (or higher), this step can be skipped.

In [26]:
dd1 = sc.parallelize(new_data).toDF()     #turns the results from our loop into a dataframe
dd2 = sc.parallelize(desc_collect).toDF() #turns the collected results from describe back into a dataframe
dd1 = dd1.select(dd2.columns) #this forces the columns in our new_data to be in the same order as in describe

expanded_describe = dd2.unionAll(dd1)
expanded_describe.show()

+--------+--------------------+-------------------+------------------+------------------+-----------------+
| summary|                  C0|                 C3|                C4|                C5|               C6|
+--------+--------------------+-------------------+------------------+------------------+-----------------+
|   count|             3526154|            3526154|           1580402|           3526154|          3526154|
|    mean| 5.50388599500189E11|  4.178168090221902|234846.78065481808| 5.134865351881966|354.7084951479714|
|  stddev|2.596112361975222...|0.34382335723646484|118170.68592261616|3.3833930336063456|4.011812510792076|
|     min|        100002091588|               2.75|              0.85|                -1|              292|
|     max|        999995696635|              6.125|        1193544.39|                34|              480|
|    skew|   -0.00183847089857|     0.519799339496|    0.758411576756|    0.286480156084|   -2.69765201567|
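|kurtosis|      -1.19900726351|     0.126057726847|    0.576085602656|    0.195187780089|    24.7237858944|
+--------+--------------------+-------------------+------------------+------------------+-----------------+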

And now we have our expanded `describe` output.
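One detail worth spelling out is why the column reordering matters before the union: `unionAll` lines rows up by position, not by column name.  The toy sketch below shows the effect in plain Python, with tuples standing in for rows.  The helpers `select_columns` and `union_all` are hypothetical stand-ins, not Spark functions, and the values are rounded for illustration.

```python
def select_columns(rows, columns, wanted):
    """Reorder each row tuple so its values follow the `wanted` column order."""
    positions = [columns.index(name) for name in wanted]
    return [tuple(row[i] for i in positions) for row in rows]

def union_all(left_rows, right_rows):
    """Positional union: rows are stacked as-is, column names are never consulted."""
    return left_rows + right_rows

describe_cols = ['summary', 'C3']
describe_rows = [('min', '2.75'), ('max', '6.125')]
new_cols = ['C3', 'summary']          # same columns, different order
new_rows = [('0.52', 'skew')]

misaligned = union_all(describe_rows, new_rows)
aligned = union_all(describe_rows, select_columns(new_rows, new_cols, describe_cols))

print(misaligned[-1])  # the statistic lands in the summary position
print(aligned[-1])     # the label is back in the summary position
```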