# Aggregations

This notebook shows some basic aggregation methods provided by Spark.

More complex aggregation types (windows and cubes) exist but are not covered here.

In [62]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/work/data/retail-data/all/*.csv")\
  .coalesce(5)
df.cache()
df.printSchema()


root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)





In [63]:
df.summary().show(vertical=True)



-RECORD 0---------------------------
 summary     | count                
 InvoiceNo   | 541909               
 StockCode   | 541909               
 Description | 540455               
 Quantity    | 541909               
 InvoiceDate | 541909               
 UnitPrice   | 541909               
 CustomerID  | 406829               
 Country     | 541909               
-RECORD 1---------------------------
 summary     | mean                 
 InvoiceNo   | 559965.752026781     
 StockCode   | 27623.240210938104   
 Description | 20713.0              
 Quantity    | 9.55224954743324     
 InvoiceDate | null                 
 UnitPrice   | 4.611113626082961    
 CustomerID  | 15287.690570239585   
 Country     | null                 
-RECORD 2---------------------------
 summary     | stddev               
 InvoiceNo   | 13428.41728080439    
 StockCode   | 16799.737628427712   
 Description | NaN                  
 Quantity    | 218.08115785023486   
 InvoiceDate | null                 
 

                                                                                

A simple DataFrame aggregation example, counting all rows:

In [64]:
df.count()

541909

## Aggregation functions

### Count
Counting a single column skips null values, while `count('*')` counts every row.

In [65]:
from pyspark.sql.functions import count

df.select(count('CustomerID'), count('StockCode'), count('*')).show()

+-----------------+----------------+--------+
|count(CustomerID)|count(StockCode)|count(1)|
+-----------------+----------------+--------+
|           406829|          541909|  541909|
+-----------------+----------------+--------+
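
As a plain-Python illustration (not Spark code), the null-skipping rule can be modeled with a list in which `None` stands in for null. The `customer_ids` values are made up. The same rule explains the output above: the gap between `count(1)` and `count(CustomerID)` is the number of rows whose `CustomerID` is null.

```python
# Plain-Python model of Spark's count semantics; None stands in for null.
# The sample values are hypothetical.
customer_ids = [17850, None, 13047, None, 12583]

count_column = sum(1 for v in customer_ids if v is not None)  # like count('CustomerID')
count_star = len(customer_ids)                                # like count('*')
print(count_column, count_star)

# Applied to the figures shown above: rows whose CustomerID is null.
null_customer_ids = 541909 - 406829
print(null_customer_ids)
```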




### countDistinct

In [66]:
from pyspark.sql.functions import countDistinct
%time df.select(countDistinct('StockCode')).show()




+-------------------------+
|count(DISTINCT StockCode)|
+-------------------------+
|                     4070|
+-------------------------+

CPU times: user 10 ms, sys: 2.89 ms, total: 12.9 ms
Wall time: 2.35 s




### approx_count_distinct

In [67]:
from pyspark.sql.functions import approx_count_distinct
# the second parameter (rsd) is the maximum relative standard deviation allowed for the estimate
%time df.select(approx_count_distinct("StockCode", 0.1)).show()

+--------------------------------+
|approx_count_distinct(StockCode)|
+--------------------------------+
|                            3364|
+--------------------------------+

CPU times: user 4.9 ms, sys: 141 µs, total: 5.04 ms
Wall time: 701 ms
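
To put the trade-off in numbers, the short plain-Python calculation below compares the approximate result with the exact one from the previous cell. Both figures are copied from the outputs above. The estimate is off by more than 10% in this run, which is consistent with `rsd` describing a standard deviation of the relative error rather than a hard upper bound.

```python
# Figures copied from the two outputs above.
exact = 4070    # countDistinct('StockCode')
approx = 3364   # approx_count_distinct('StockCode', 0.1)

# Relative error of the approximation with respect to the exact count.
relative_error = abs(approx - exact) / exact
print(f"relative error: {relative_error:.1%}")
```

A smaller `rsd` should tighten the estimate at the cost of more memory and time.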


### first and last

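By default, first and last return the first or last value they see, whether or not it is null. With ignorenulls set to True, they return the first or last non-null value instead; if all values are null, null is returned. Both results depend on row order, which may be non-deterministic after a shuffle.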

In [68]:
from pyspark.sql.functions import first, last
df.select(first("Quantity"), last("Quantity")).show()

+----------------------+---------------------+
|first(Quantity, false)|last(Quantity, false)|
+----------------------+---------------------+
|                     6|                    3|
+----------------------+---------------------+



In [69]:
print(df.head()['Quantity'])
print(df.tail(1)[0]['Quantity'])

6
3
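
The effect of skipping nulls is not visible in this dataset, so here is a minimal plain-Python model of the rule. The helper names `first_value` and `last_value` and the sample values are hypothetical; `None` stands in for null.

```python
from typing import Optional, Sequence


def first_value(values: Sequence[Optional[int]], ignore_nulls: bool = False) -> Optional[int]:
    """Plain-Python model of first(); None stands in for null."""
    for v in values:
        # Without ignore_nulls the very first value wins, even if it is None.
        if v is not None or not ignore_nulls:
            return v
    return None  # empty input, or every value was null


def last_value(values: Sequence[Optional[int]], ignore_nulls: bool = False) -> Optional[int]:
    """Model of last(): the same rule applied from the end."""
    return first_value(list(reversed(values)), ignore_nulls)


quantities = [None, 6, 8, None]  # hypothetical column values
print(first_value(quantities))                        # None
print(first_value(quantities, ignore_nulls=True))     # 6
print(last_value(quantities, ignore_nulls=True))      # 8
print(first_value([None, None], ignore_nulls=True))   # None
```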



### min and max

In [70]:
from pyspark.sql.functions import min, max
df.select(min("Quantity"), max("Quantity")).show()

+-------------+-------------+
|min(Quantity)|max(Quantity)|
+-------------+-------------+
|       -80995|        80995|
+-------------+-------------+



### sum

In [72]:
from pyspark.sql.functions import sum
df.select(sum("Quantity")).show()

+-------------+
|sum(Quantity)|
+-------------+
|      5176450|
+-------------+



### sumDistinct()

In [74]:
# sumDistinct is deprecated since Spark 3.2; sum_distinct is the newer name
from pyspark.sql.functions import sumDistinct
df.select(sumDistinct("Quantity")).show()



+----------------------+
|sum(DISTINCT Quantity)|
+----------------------+
|                 29310|
+----------------------+
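
The distinct sum is much smaller than the plain sum because each distinct value is added only once, however many rows contain it. A plain-Python sketch of the difference, with made-up values:

```python
# Uses Python's built-in sum, so run it outside the notebook session,
# where the name sum has been rebound to the Spark function.
quantities = [6, 6, 8, 3, 8, -1]  # hypothetical column values

total = sum(quantities)                # like sum("Quantity")
distinct_total = sum(set(quantities))  # like sumDistinct("Quantity")
print(total, distinct_total)
```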



                                                                                

### Average
Three equivalent ways of calculating the average: sum/count, avg and mean.

In [75]:
from pyspark.sql.functions import sum, count, avg, expr

df.select(
    count("Quantity").alias("total_transactions"),
    sum("Quantity").alias("total_purchases"),
    avg("Quantity").alias("avg_purchases"),
    expr("mean(Quantity)").alias("mean_purchases"))\
  .selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases").show()

+--------------------------------------+----------------+----------------+
|(total_purchases / total_transactions)|   avg_purchases|  mean_purchases|
+--------------------------------------+----------------+----------------+
|                      9.55224954743324|9.55224954743324|9.55224954743324|
+--------------------------------------+----------------+----------------+



### Variance and Standard Deviation
Both are available in two variants: for the entire population (the `_pop` functions) and for a sample (the `_samp` functions).

In [76]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp
df.select(var_pop("Quantity"), var_samp("Quantity"),
  stddev_pop("Quantity"), stddev_samp("Quantity")).show()

+------------------+------------------+--------------------+---------------------+
| var_pop(Quantity)|var_samp(Quantity)|stddev_pop(Quantity)|stddev_samp(Quantity)|
+------------------+------------------+--------------------+---------------------+
|47559.303646609354| 47559.39140929905|  218.08095663447864|   218.08115785023486|
+------------------+------------------+--------------------+---------------------+
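
The two variants differ only in the divisor: the population variance divides by n, the sample variance by n - 1. The plain-Python check below reproduces that relationship from the figures printed above, assuming n is the non-null `Quantity` count (541909) reported by `df.summary()`.

```python
import math

# Figures copied from the outputs above.
n = 541909                        # non-null Quantity values
var_pop = 47559.303646609354
var_samp = 47559.39140929905
stddev_samp = 218.08115785023486

# Sample variance divides by n - 1 instead of n.
rescaled = var_pop * n / (n - 1)
print(rescaled, var_samp)

# Standard deviation is the square root of the matching variance.
print(math.sqrt(var_samp), stddev_samp)
```

With more than half a million rows the two variants are nearly identical; the difference matters mostly for small samples.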



### Covariance and Correlation

In [77]:
from pyspark.sql.functions import corr, covar_pop, covar_samp
df.select(corr("InvoiceNo", "Quantity"), covar_samp("InvoiceNo", "Quantity"),
    covar_pop("InvoiceNo", "Quantity")).show()


+-------------------------+-------------------------------+------------------------------+
|corr(InvoiceNo, Quantity)|covar_samp(InvoiceNo, Quantity)|covar_pop(InvoiceNo, Quantity)|
+-------------------------+-------------------------------+------------------------------+
|     4.912186085637639E-4|             1052.7280543913773|            1052.7260778752732|
+-------------------------+-------------------------------+------------------------------+





## Aggregating to Complex Types

### Collect as list and as set

In [78]:
from pyspark.sql.functions import collect_set, collect_list
df.agg(collect_set("Country"), collect_list("Country")).show(vertical=True)



-RECORD 0-------------------------------------
 collect_set(Country)  | [Portugal, Italy,... 
 collect_list(Country) | [United Kingdom, ... 



                                                                                

## Grouping

### Group by

In [79]:
df.groupBy("InvoiceNo", "CustomerId").count().show()


+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
|   541225|     14099|   19|
|   541978|     13551|    4|
|   542093|     17677|   16|
|   543188|     12567|   63|
|   543590|     17377|   19|
|  C543757|     13115|    1|
|  C544318|     12989|    1|
|   544578|     12365|    1|
|   545165|     16339|   20|
|   545289|     14732|   30|
+---------+----------+-----+
only showing top 20 rows





In [80]:
type(df.groupBy("InvoiceNo", "CustomerId"))

pyspark.sql.group.GroupedData

### Grouping with expressions

In [81]:
from pyspark.sql.functions import count, expr

df.groupBy("InvoiceNo").agg(
    count("Quantity").alias("quantity_count"),
    expr("sum(Quantity) as quantity_sum")).show()

+---------+--------------+------------+
|InvoiceNo|quantity_count|quantity_sum|
+---------+--------------+------------+
|   536596|             6|           9|
|   536938|            14|         464|
|   537252|             1|          31|
|   537691|            20|         163|
|   538041|             1|          30|
|   538184|            26|         314|
|   538517|            53|         161|
|   538879|            19|         402|
|   539275|             6|         156|
|   539630|            12|         244|
|   540499|            24|          90|
|   540540|            22|          47|
|  C540850|             1|          -1|
|   540976|            48|         505|
|   541432|             4|          49|
|   541518|           101|        2334|
|   541783|            35|         396|
|   542026|             9|          69|
|   542375|             6|          48|
|  C542604|             8|         -64|
+---------+--------------+------------+
only showing top 20 rows
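
What this grouped aggregation computes can be sketched in plain Python: rows are bucketed by key, and each bucket keeps a running count and sum. This is only a model of the result, not of how Spark executes it; the rows are hypothetical and `None` stands in for null.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, Quantity) rows; None stands in for null.
rows = [
    ("536365", 6),
    ("536365", 8),
    ("C536379", -1),
    ("536366", None),
    ("536366", 6),
]

stats = defaultdict(lambda: {"quantity_count": 0, "quantity_sum": 0})
for invoice_no, quantity in rows:
    group = stats[invoice_no]  # touching the key creates the group
    if quantity is not None:   # count and sum both skip nulls
        group["quantity_count"] += 1
        group["quantity_sum"] += quantity
# Simplification: Spark returns null, not 0, for the sum of an all-null group.

for invoice_no, agg in stats.items():
    print(invoice_no, agg["quantity_count"], agg["quantity_sum"])
```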

