**_pySpark Basics: Summary Statistics_**

_by Jeff Levy (jlevy@urban.org)_

_Last Updated: 7 July 2016, Spark v1.6.1_

_Abstract: Here we will cover several common ways to summarize data. Many of these methods have been discussed in other tutorials in different contexts._

_Main operations used:_ `describe`, `skewness`, `kurtosis`, `unionAll`

***

First we will load the same csv data we've been using in the other tutorials, then pare it down to a manageable subset for ease of use:

In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

In [2]:
df = sqlContext.read.load('s3://ui-hfpc/Performance_2015Q1.txt',
                          format='com.databricks.spark.csv',
                          header='false',
                          inferSchema='true',
                          delimiter='|')

In [3]:
df = df[['C0', 'C2', 'C3', 'C4', 'C5', 'C6']]

In [27]:
df.show(5)

+------------+-----+-----+----+---+---+
|          C0|   C2|   C3|  C4| C5| C6|
+------------+-----+-----+----+---+---+
|100002091588|OTHER|4.125|null|  0|360|
|100002091588|     |4.125|null|  1|359|
|100002091588|     |4.125|null|  2|358|
|100002091588|     |4.125|null|  3|357|
|100002091588|     |4.125|null|  4|356|
+------------+-----+-----+----+---+---+
only showing top 5 rows



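The first thing we'll do is use the `describe` method to get some basic statistics. It returns a new dataframe holding the results, so we'll assign that to a new variable and then call `show` on it. With no arguments, `describe` in this version of Spark summarizes only the numeric columns, which is why the string column `C2` is missing from the output: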

In [5]:
df_described = df.describe()
df_described.show()

+-------+--------------------+-------------------+------------------+------------------+-----------------+
|summary|                  C0|                 C3|                C4|                C5|               C6|
+-------+--------------------+-------------------+------------------+------------------+-----------------+
|  count|             3526154|            3526154|           1580402|           3526154|          3526154|
|   mean| 5.50388599500189E11|  4.178168090221902|234846.78065481802| 5.134865351881966|354.7084951479714|
| stddev|2.596112361975222...|0.34382335723646484|118170.68592261615|3.3833930336063456|4.011812510792076|
|    min|        100002091588|               2.75|              0.85|                -1|              292|
|    max|        999995696635|              6.125|        1193544.39|                34|              480|
+-------+--------------------+-------------------+------------------+------------------+-----------------+
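
To make the rows of that table concrete, here is a minimal plain-Python sketch of the same five statistics for a single column. It is not Spark's implementation: `describe_column` and the `toy` data are hypothetical names used only for illustration, and the sample (n - 1) form of the standard deviation is an assumption about what `stddev` reports. Nulls are skipped before counting, which is why `C4` shows a smaller `count` than the other columns above.

```python
import math

def describe_column(values):
    """Return count, mean, stddev, min and max for one column, skipping nulls (None)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / float(n)
    # Sample standard deviation (divides by n - 1); assumed to match Spark's `stddev`.
    variance = sum((v - mean) ** 2 for v in present) / float(n - 1)
    return {
        'count': n,
        'mean': mean,
        'stddev': math.sqrt(variance),
        'min': min(present),
        'max': max(present),
    }

# Hypothetical toy column with two nulls, standing in for something like C4.
toy = [2.0, None, 4.0, 4.0, None, 6.0]
summary = describe_column(toy)
print(summary)
```

The toy column has six entries but a `count` of four, mirroring how the null values in `C4` are left out of every statistic.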



Aside from the five statistics included in `describe`, there are a handful of other built-in aggregators that can be applied to a column. The ones imported below are all called the same way; here we try `skewness`:

In [6]:
from pyspark.sql.functions import skewness, kurtosis, var_pop, var_samp, stddev, stddev_pop, sumDistinct
df.select(skewness('C3')).show()

+------------------+
|  skewness(C3,0,0)|
+------------------+
|0.5197993394959969|
+------------------+
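
For readers who want to see what these two aggregators measure, here is a small plain-Python sketch built from central moments. It assumes the population (divide by n) definitions and that kurtosis is reported as excess kurtosis (a normal distribution scores 0); the function names and toy lists are hypothetical, and this is an illustration rather than Spark's own code.

```python
def central_moment(values, k):
    """k-th central moment, dividing by n (population form)."""
    n = float(len(values))
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness_pop(values):
    """Third central moment scaled by the variance to the power 3/2."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    return m3 / m2 ** 1.5

def kurtosis_excess(values):
    """Fourth central moment over squared variance, minus 3."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    return m4 / (m2 ** 2) - 3.0

symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]       # evenly spread: no skew
right_tailed = [1.0, 2.0, 2.0, 3.0, 10.0]   # one large value pulls the tail right

print(skewness_pop(symmetric), kurtosis_excess(symmetric))
print(skewness_pop(right_tailed), kurtosis_excess(right_tailed))
```

A positive skewness, like the one reported for `C3`, means the right tail is longer than the left; a negative value means the opposite.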



It would be convenient to have all of our summary statistics in one place; in essence, to expand the output from `describe`. Here is a short example that builds skewness and kurtosis rows in the same layout:

In [14]:
from pyspark.sql import Row
from collections import defaultdict

columns = df_described.columns  #a list of the column names: ['summary', 'C0', 'C3', 'C4', 'C5', 'C6']
funcs   = [skewness, kurtosis]  #a list of the functions we want to add to our summary statistics (imported above)
fnames  = ['skew', 'kurtosis']  #a list of strings describing the functions in the same order

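def new_item(func, column):
    #run one aggregate over one column and return the result as a string,
    #since describe stores all of its values as strings
    return str(df.select(func(column)).collect()[0][0])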

new_data = []
for func, fname in zip(funcs, fnames):
    row_dict = {'summary':fname}
    for column in columns[1:]:
        row_dict[column] = new_item(func, column)
    new_data.append(Row(**row_dict))

In [16]:
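#Row(**kwargs) sorts its fields alphabetically, so select(columns) restores the column order used by describe
df_described2 = sc.parallelize(new_data).toDF().select(columns)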

In [17]:
df_described2.show()

+--------+-----------------+--------------+--------------+--------------+--------------+
| summary|               C0|            C3|            C4|            C5|            C6|
+--------+-----------------+--------------+--------------+--------------+--------------+
|    skew|-0.00183847089857|0.519799339496|0.758411576756|0.286480156084|-2.69765201567|
|kurtosis|   -1.19900726351|0.126057726847|0.576085602656|0.195187780089| 24.7237858944|
+--------+-----------------+--------------+--------------+--------------+--------------+
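
The loop above runs a separate Spark job for every function and column pair (ten `collect` calls here). One possible alternative, sketched below under assumptions, is to request every aggregate in a single `agg` call and reshape the one-row result afterwards. The helper names `alias_for`, `reshape` and `extra_stats` are hypothetical, as is the `stat__column` alias convention; `extra_stats` expects the `df` and column names from this tutorial and has not been run against this dataset.

```python
def alias_for(stat, column):
    """Hypothetical naming convention for one aggregate, e.g. 'skew__C3'."""
    return '{0}__{1}'.format(stat, column)

def reshape(flat, stat_names, columns):
    """Turn a flat {alias: value} dict into one describe-style dict per statistic."""
    rows = []
    for stat in stat_names:
        row = {'summary': stat}
        for column in columns:
            # describe stores its values as strings, so do the same here
            row[column] = str(flat[alias_for(stat, column)])
        rows.append(row)
    return rows

def extra_stats(df, columns):
    """Compute skewness and kurtosis for every column in one Spark job."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql import functions as F
    funcs = [('skew', F.skewness), ('kurtosis', F.kurtosis)]
    exprs = [func(c).alias(alias_for(name, c))
             for name, func in funcs for c in columns]
    flat = df.agg(*exprs).collect()[0].asDict()
    return reshape(flat, [name for name, _ in funcs], columns)
```

The dictionaries returned by `extra_stats(df, columns[1:])` could then be wrapped with `Row(**row_dict)` exactly as in the loop above.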



In [18]:
expanded_describe = df_described.unionAll(df_described2)

In [20]:
df_described.collect()

[Row(summary=u'count', C0=u'3526154', C3=u'3526154', C4=u'1580402', C5=u'3526154', C6=u'3526154'),
 Row(summary=u'mean', C0=u'5.50388599500189E11', C3=u'4.178168090221902', C4=u'234846.78065481802', C5=u'5.134865351881966', C6=u'354.7084951479714'),
 Row(summary=u'stddev', C0=u'2.5961123619752225E11', C3=u'0.34382335723646484', C4=u'118170.68592261615', C5=u'3.3833930336063456', C6=u'4.011812510792076'),
 Row(summary=u'min', C0=u'100002091588', C3=u'2.75', C4=u'0.85', C5=u'-1', C6=u'292'),
 Row(summary=u'max', C0=u'999995696635', C3=u'6.125', C4=u'1193544.39', C5=u'34', C6=u'480')]

In [21]:
df_described2.collect()

[Row(summary=u'skew', C0=u'-0.00183847089857', C3=u'0.519799339496', C4=u'0.758411576756', C5=u'0.286480156084', C6=u'-2.69765201567'),
 Row(summary=u'kurtosis', C0=u'-1.19900726351', C3=u'0.126057726847', C4=u'0.576085602656', C5=u'0.195187780089', C6=u'24.7237858944')]

In [26]:
expanded_describe.show()

Py4JJavaError: An error occurred while calling o330.showString.
: org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
	at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
	at org.apache.spark.sql.execution.ConvertToUnsafe.doExecute(rowFormatConverters.scala:38)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
	at org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
	at org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
	at scala.collection.AbstractTraversable.map(Traversable.scala:105)
	at org.apache.spark.sql.execution.Union.doExecute(basicOperators.scala:144)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
	at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:56)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
	at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
	at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
	at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
	at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
	at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
	at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
	at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
	at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
	at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
	at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: scala.collection.Iterator$$anon$11
Serialization stack:
	- object not serializable (class: scala.collection.Iterator$$anon$11, value: empty iterator)
	- field (class: scala.collection.Iterator$$anonfun$toStream$1, name: $outer, type: interface scala.collection.Iterator)
	- object (class scala.collection.Iterator$$anonfun$toStream$1, <function0>)
	- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0)
	- object (class scala.collection.immutable.Stream$Cons, Stream(WrappedArray(3526154, 3526154, 1580402, 3526154, 3526154), WrappedArray(5.50388599500189E11, 4.178168090221902, 234846.78065481802, 5.134865351881966, 354.7084951479714), WrappedArray(2.5961123619752225E11, 0.34382335723646484, 118170.68592261615, 3.3833930336063456, 4.011812510792076), WrappedArray(100002091588, 2.75, 0.85, -1, 292), WrappedArray(999995696635, 6.125, 1193544.39, 34, 480)))
	- field (class: scala.collection.immutable.Stream$$anonfun$zip$1, name: $outer, type: class scala.collection.immutable.Stream)
	- object (class scala.collection.immutable.Stream$$anonfun$zip$1, <function0>)
	- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0)
	- object (class scala.collection.immutable.Stream$Cons, Stream((WrappedArray(3526154, 3526154, 1580402, 3526154, 3526154),(count,<function1>)), (WrappedArray(5.50388599500189E11, 4.178168090221902, 234846.78065481802, 5.134865351881966, 354.7084951479714),(mean,<function1>)), (WrappedArray(2.5961123619752225E11, 0.34382335723646484, 118170.68592261615, 3.3833930336063456, 4.011812510792076),(stddev,<function1>)), (WrappedArray(100002091588, 2.75, 0.85, -1, 292),(min,<function1>)), (WrappedArray(999995696635, 6.125, 1193544.39, 34, 480),(max,<function1>))))
	- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream)
	- object (class scala.collection.immutable.Stream$$anonfun$map$1, <function0>)
	- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0)
	- object (class scala.collection.immutable.Stream$Cons, Stream([count,3526154,3526154,1580402,3526154,3526154], [mean,5.50388599500189E11,4.178168090221902,234846.78065481802,5.134865351881966,354.7084951479714], [stddev,2.5961123619752225E11,0.34382335723646484,118170.68592261615,3.3833930336063456,4.011812510792076], [min,100002091588,2.75,0.85,-1,292], [max,999995696635,6.125,1193544.39,34,480]))
	- field (class: scala.collection.immutable.Stream$$anonfun$map$1, name: $outer, type: class scala.collection.immutable.Stream)
	- object (class scala.collection.immutable.Stream$$anonfun$map$1, <function0>)
	- field (class: scala.collection.immutable.Stream$Cons, name: tl, type: interface scala.Function0)
	- object (class scala.collection.immutable.Stream$Cons, Stream([count,3526154,3526154,1580402,3526154,3526154], [mean,5.50388599500189E11,4.178168090221902,234846.78065481802,5.134865351881966,354.7084951479714], [stddev,2.5961123619752225E11,0.34382335723646484,118170.68592261615,3.3833930336063456,4.011812510792076], [min,100002091588,2.75,0.85,-1,292], [max,999995696635,6.125,1193544.39,34,480]))
	- field (class: org.apache.spark.sql.execution.LocalTableScan, name: rows, type: interface scala.collection.Seq)
	- object (class org.apache.spark.sql.execution.LocalTableScan, LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], [[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221902,234846.78065481802,5.134865351881966,354.7084951479714],[stddev,2.5961123619752225E11,0.34382335723646484,118170.68592261615,3.3833930336063456,4.011812510792076],[min,100002091588,2.75,0.85,-1,292],[max,999995696635,6.125,1193544.39,34,480]]
)
	- field (class: org.apache.spark.sql.execution.ConvertToUnsafe, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
	- object (class org.apache.spark.sql.execution.ConvertToUnsafe, ConvertToUnsafe
+- LocalTableScan [summary#228,C0#229,C3#230,C4#231,C5#232,C6#233], [[count,3526154,3526154,1580402,3526154,3526154],[mean,5.50388599500189E11,4.178168090221902,234846.78065481802,5.134865351881966,354.7084951479714],[stddev,2.5961123619752225E11,0.34382335723646484,118170.68592261615,3.3833930336063456,4.011812510792076],[min,100002091588,2.75,0.85,-1,292],[max,999995696635,6.125,1193544.39,34,480]]
)
	- field (class: org.apache.spark.sql.execution.ConvertToUnsafe$$anonfun$1, name: $outer, type: class org.apache.spark.sql.execution.ConvertToUnsafe)
	- object (class org.apache.spark.sql.execution.ConvertToUnsafe$$anonfun$1, <function1>)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
	... 57 more


In [24]:
dd1 = df_described.collect()
dd2 = df_described2.collect()
dd3 = dd1 + dd2
new_df = sc.parallelize(dd3).toDF()
new_df.show()

+--------+--------------------+-------------------+------------------+------------------+-----------------+
| summary|                  C0|                 C3|                C4|                C5|               C6|
+--------+--------------------+-------------------+------------------+------------------+-----------------+
|   count|             3526154|            3526154|           1580402|           3526154|          3526154|
|    mean| 5.50388599500189E11|  4.178168090221902|234846.78065481802| 5.134865351881966|354.7084951479714|
|  stddev|2.596112361975222...|0.34382335723646484|118170.68592261615|3.3833930336063456|4.011812510792076|
|     min|        100002091588|               2.75|              0.85|                -1|              292|
|     max|        999995696635|              6.125|        1193544.39|                34|              480|
|    skew|   -0.00183847089857|     0.519799339496|    0.758411576756|    0.286480156084|   -2.69765201567|
|kurtosis|      -1.199007263

In [25]:
dd1_df = sc.parallelize(dd1).toDF()
dd2_df = sc.parallelize(dd2).toDF()
dd1_df.unionAll(dd2_df).show()

+--------+--------------------+-------------------+------------------+------------------+-----------------+
| summary|                  C0|                 C3|                C4|                C5|               C6|
+--------+--------------------+-------------------+------------------+------------------+-----------------+
|   count|             3526154|            3526154|           1580402|           3526154|          3526154|
|    mean| 5.50388599500189E11|  4.178168090221902|234846.78065481802| 5.134865351881966|354.7084951479714|
|  stddev|2.596112361975222...|0.34382335723646484|118170.68592261615|3.3833930336063456|4.011812510792076|
|     min|        100002091588|               2.75|              0.85|                -1|              292|
|     max|        999995696635|              6.125|        1193544.39|                34|              480|
|    skew|   -0.00183847089857|     0.519799339496|    0.758411576756|    0.286480156084|   -2.69765201567|
|kurtosis|      -1.199007263

NOTE: The `Task not serializable` error raised by `expanded_describe.show()` above is currently unsolved. A work-around such as the last two code blocks, which collect both dataframes to the driver and rebuild them before stacking, is necessary to stack this data right now. It was discussed, without resolution, here: http://stackoverflow.com/questions/38255145/task-not-serializable-error-in-pyspark-on-unionall?noredirect=1#comment63932376_38255145