# Aggregations with DataFrames

Aggregations are all wide transformations, i.e. they require shuffling data during execution. A transformation that involves a shuffle also acts as a synchronization point: the pipeline cannot move on to the next operation until all the workers have completed the shuffle. Narrow operations, by contrast, need no data exchange and therefore no coordination among worker nodes, so each worker can proceed independently.

Here is the list of the aggregation operations available for DataFrames:

* count(), sum() and other expressions used in the SELECT statement

* groupBy()

* window()
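To make the narrow versus wide distinction described above more concrete, here is a minimal plain-Python sketch. It is a conceptual model only, not how Spark is implemented: the `partitions` data, the `narrow_double`, `shuffle` and `count_by_key` helpers and the invoice numbers are all hypothetical names and values chosen for illustration.

```python
from collections import Counter, defaultdict

# Three hypothetical partitions of (invoice_no, quantity) rows.
partitions = [
    [("536365", 6), ("536366", 2), ("536365", 8)],
    [("536367", 3), ("536365", 1)],
    [("536366", 4), ("536367", 5)],
]

def narrow_double(partition):
    """Narrow transformation: a partition is processed without seeing the others."""
    return [(key, qty * 2) for key, qty in partition]

def shuffle(parts, num_reducers):
    """Wide step: route every row to the reducer that owns its key."""
    buckets = defaultdict(list)
    for part in parts:
        for key, qty in part:
            buckets[hash(key) % num_reducers].append((key, qty))
    return [buckets[i] for i in range(num_reducers)]

def count_by_key(parts, num_reducers=2):
    """Model of groupBy(key).count(): needs the complete shuffle output first."""
    counts = Counter()
    for bucket in shuffle(parts, num_reducers):
        counts.update(key for key, _ in bucket)
    return dict(counts)

doubled = [narrow_double(p) for p in partitions]   # no data exchange needed
invoice_counts = count_by_key(partitions)          # rows for one key come from several partitions
print(invoice_counts)
```

In this model the rows of invoice 536365 live in two different partitions, so no single partition can produce its count alone. That is why the grouping step has to wait for every partition to finish the shuffle.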

In [2]:
from pyspark.sql import SparkSession

# Reuse the running session (and its SparkContext) if one exists, otherwise create one.
# Calling pyspark.SparkContext() a second time in the same notebook raises
# "ValueError: Cannot run multiple SparkContexts at once".
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [3]:
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load('../data/retail.csv')

In [7]:
print(df.rdd.getNumPartitions())
df.cache()  # mark the DataFrame for caching; it is reused by the queries below
df.createOrReplaceTempView("retail")
print(df.schema)

6
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))


## Count and Count Distinct

In [14]:
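from pyspark.sql.functions import count, countDistinct
# count() on a column counts its non-null values
df.select(count('StockCode')).show()
# countDistinct() returns the exact number of distinct values
res = df.select(countDistinct('StockCode')).collect()
print(res)
exact_count = res[0][0]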

+----------------+
|count(StockCode)|
+----------------+
|          541909|
+----------------+

[Row(count(DISTINCT StockCode)=4070)]


In real-world Big Data applications we are often not interested in the precise result; a reasonable estimate is good enough. Probabilistic approximations help when performance and execution time matter more than an exact answer. For example, Spark offers approx_count_distinct(), which estimates the distinct count and can run much faster than its exact counterpart.

In [16]:
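from pyspark.sql.functions import approx_count_distinct
# second argument (rsd): maximum allowed relative standard deviation of the estimate (default 0.05)
max_estimated_error = 0.01
res = df.select(approx_count_distinct('StockCode', max_estimated_error)).collect()
print(res)
est_count = res[0][0]
# absolute difference between the exact and the estimated distinct count
print(abs(exact_count - est_count))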

[Row(approx_count_distinct(StockCode)=4079)]
9


## Grouping and Windowing

In [17]:
df.groupBy("InvoiceNo", "CustomerId").count().show()

+---------+----------+-----+
|InvoiceNo|CustomerId|count|
+---------+----------+-----+
|   536846|     14573|   76|
|   537026|     12395|   12|
|   537883|     14437|    5|
|   538068|     17978|   12|
|   538279|     14952|    7|
|   538800|     16458|   10|
|   538942|     17346|   12|
|  C539947|     13854|    1|
|   540096|     13253|   16|
|   540530|     14755|   27|
|   541225|     14099|   19|
|   541978|     13551|    4|
|   542093|     17677|   16|
|   543188|     12567|   63|
|   543590|     17377|   19|
|  C543757|     13115|    1|
|  C544318|     12989|    1|
|   544578|     12365|    1|
|   536596|      null|    6|
|   537252|      null|    1|
+---------+----------+-----+
only showing top 20 rows



In [19]:
from pyspark.sql.functions import expr
df.groupBy("InvoiceNo") \
    .agg(expr("count(Quantity)")).show()

+---------+---------------+
|InvoiceNo|count(Quantity)|
+---------+---------------+
|   536596|              6|
|   536938|             14|
|   537252|              1|
|   537691|             20|
|   538041|              1|
|   538184|             26|
|   538517|             53|
|   538879|             19|
|   539275|              6|
|   539630|             12|
|   540499|             24|
|   540540|             22|
|  C540850|              1|
|   540976|             48|
|   541432|              4|
|   541518|            101|
|   541783|             35|
|   542026|              9|
|   542375|              6|
|  C542604|              8|
+---------+---------------+
only showing top 20 rows



In [20]:
df.groupBy("InvoiceNo") \
    .agg(count("Quantity")).show()

+---------+---------------+
|InvoiceNo|count(Quantity)|
+---------+---------------+
|   536596|              6|
|   536938|             14|
|   537252|              1|
|   537691|             20|
|   538041|              1|
|   538184|             26|
|   538517|             53|
|   538879|             19|
|   539275|              6|
|   539630|             12|
|   540499|             24|
|   540540|             22|
|  C540850|              1|
|   540976|             48|
|   541432|              4|
|   541518|            101|
|   541783|             35|
|   542026|              9|
|   542375|              6|
|  C542604|              8|
+---------+---------------+
only showing top 20 rows



## Window-based Aggregations

In [23]:
from pyspark.sql.functions import to_date
# adds date column which converts invoice date string to a date type
dfWithDate = df.withColumn('date', to_date("InvoiceDate", 'MM/d/yyyy H:mm'))
dfWithDate.schema

StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true),StructField(date,DateType,true)))

In [46]:
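from pyspark.sql.window import Window
# Note: importing max from pyspark.sql.functions shadows Python's built-in max() in this notebook
from pyspark.sql.functions import desc, column, max, rank, dense_rank
# One window per customer and day, rows sorted by quantity (largest first);
# the frame runs from the first row of the partition up to the current row
windowSpec = Window \
    .partitionBy("CustomerId", "date") \
    .orderBy(desc("Quantity")) \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)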

In [47]:
maxPurchaseQuantity = max(column("Quantity")).over(windowSpec)
type(maxPurchaseQuantity)

pyspark.sql.column.Column

In [49]:
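spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
# The output below has two extra Prank columns: rank() and dense_rank() over the same window
dfWithDate.select(
    column("CustomerID"),
    column("date"),
    column("Quantity"),
    maxPurchaseQuantity.alias("MaxQ"),
    rank().over(windowSpec).alias("Prank"),        # rank: leaves gaps after ties
    dense_rank().over(windowSpec).alias("Prank")   # dense rank: no gaps after ties
).show(100)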

+----------+----------+--------+----+-----+-----+
|CustomerID|      date|Quantity|MaxQ|Prank|Prank|
+----------+----------+--------+----+-----+-----+
|     12477|2011-04-14|     100| 100|    1|    1|
|     12477|2011-04-14|      72| 100|    2|    2|
|     12477|2011-04-14|      36| 100|    3|    3|
|     12477|2011-04-14|      36| 100|    3|    3|
|     12477|2011-04-14|      36| 100|    3|    3|
|     12477|2011-04-14|      24| 100|    6|    4|
|     12477|2011-04-14|      24| 100|    6|    4|
|     12477|2011-04-14|      24| 100|    6|    4|
|     12477|2011-04-14|      20| 100|    9|    5|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
|     12477|2011-04-14|      12| 100|   10|    6|
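
The pattern in the table above (a constant MaxQ, one rank column with gaps and one without) follows from the window definition. Here is a minimal plain-Python sketch that models what happens inside a single partition that is already sorted by quantity in descending order. The helper names `running_max`, `rank_with_gaps` and `dense_rank_no_gaps` are hypothetical, and the input is just the first ten quantities from the table.

```python
def running_max(values):
    """Model of max() over a frame from the first row up to the current row."""
    result, best = [], None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result

def rank_with_gaps(values):
    """Model of rank(): ties share a rank, the next distinct value skips ahead."""
    ranks = []
    for i, v in enumerate(values):
        if i > 0 and v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)  # position in the partition, counted from 1
    return ranks

def dense_rank_no_gaps(values):
    """Model of dense_rank(): ties share a rank, the next distinct value adds one."""
    ranks = []
    for i, v in enumerate(values):
        if i == 0:
            ranks.append(1)
        elif v == values[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks

# First ten quantities of customer 12477 on 2011-04-14, sorted descending
quantities = sorted([36, 100, 24, 72, 36, 24, 12, 36, 20, 24], reverse=True)

print(running_max(quantities))
print(rank_with_gaps(quantities))
print(dense_rank_no_gaps(quantities))
```

Because the rows are ordered by descending quantity, the first row of each partition already holds the maximum, so the running maximum stays at 100 for every row of this customer and day.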
