# Spark Learning Note - Data Aggregations and Table Joining

Jia Geng | gjia0214@gmail.com

In [1]:
# check java version 
# use sudo update-alternatives --config java to switch java version if needed.
!java -version

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~19.10-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


In [2]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('Spark Learning').getOrCreate()
spark

In [3]:
data_example = '/home/jgeng/Documents/Git/SparkLearning/book_data/retail-data/all/online-retail-dataset.csv'

In [4]:
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example)
df.printSchema()
df.show(3)
df.cache()  # cache() is lazy: nothing is cached until an action runs
df.count()  # count() is an action over all rows, so calling it materializes the cache

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+

541909

## 1. Aggregation

Aggregation groups rows by one or more keys and applies an aggregation function to each group. In PySpark, `groupBy` returns a `GroupedData` object (the Scala API calls it `RelationalGroupedDataset`).

Grouping types in Spark include:
- **DataFrame-level aggregation**: aggregate over the whole DataFrame, without keys
- **group by**: aggregate using one or more keys and one or more aggregation functions
- **window**: aggregate using one or more keys and one or more aggregation functions, where the rows fed to each function are defined relative to the current row
- **grouping set**: aggregate at multiple different levels
    - **roll up**: one or more keys and one or more aggregation functions, summarized hierarchically
    - **cube**: one or more keys and one or more aggregation functions, summarized across all combinations of the key columns
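
As a rough illustration of the difference between roll up and cube, the sketch below enumerates the key combinations each one summarizes over. It is plain Python rather than Spark, and `rollup_levels` and `cube_levels` are hypothetical helper names introduced here, not part of any Spark API.

```python
from itertools import combinations


def rollup_levels(keys):
    """Grouping levels of a roll up: drop keys from the right, one at a time."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def cube_levels(keys):
    """Grouping levels of a cube: every subset of the keys."""
    levels = []
    for size in range(len(keys), -1, -1):
        levels.extend(combinations(keys, size))
    return levels


keys = ["Country", "StockCode"]
print(rollup_levels(keys))  # hierarchical: both keys, Country only, grand total
print(cube_levels(keys))    # all combinations: adds the StockCode-only level
```

For two keys, roll up yields three levels (both keys, the first key only, the grand total), while cube yields four because it also summarizes by the second key alone.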

### 1.1 DataFrame Level Aggregation

Common aggregation functions live in `pyspark.sql.functions` and operate on columns.
- `count()`: a transformation when used as a column function; `df.count()`, by contrast, is an action
- `countDistinct()`: can be slow when the data is large
- `approx_count_distinct(col, rsd)`: faster option; `rsd` is the maximum allowed relative standard deviation of the estimate
- `first()`, `last()`: get the first/last value of a column
- `min()`, `max()`, `sum()`, `sumDistinct()`, `avg()`: work as their names suggest
- `var_pop()`, `var_samp()`, `stddev_pop()`, `stddev_samp()`: population and sample variance / standard deviation
- `skewness()`, `kurtosis()`
    - skewness: a measure of the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
        - normal dist. skewness = 0 (symmetric: left and right tails are the same)
        - positive skewness: right skew, the right tail is longer
        - negative skewness: left skew, the left tail is longer
    - kurtosis: a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers; data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution would be the extreme case. Spark's `kurtosis()` reports excess kurtosis (kurtosis minus 3), so the reference point is 0.
        - normal dist. kurtosis = 0
        - positive kurtosis: heavy tailed
        - negative kurtosis: light tailed
    
- `corr()`, `covar_pop()`, `covar_samp()`
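
The sign conventions for skewness and kurtosis can be checked on small lists. This is a minimal plain-Python sketch that assumes population (divide-by-n) moments and excess kurtosis, which should match what Spark's `skewness()` and `kurtosis()` compute; `central_moment`, `pop_skewness` and `pop_excess_kurtosis` are hypothetical helper names, not Spark functions.

```python
def central_moment(values, order):
    """Population central moment: mean of (x - mean) ** order."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def pop_skewness(values):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def pop_excess_kurtosis(values):
    """Fourth central moment scaled by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]        # evenly spread, no long tail
right_tailed = [1, 1, 1, 2, 10]    # one large value stretches the right tail

print(pop_skewness(symmetric))         # zero: symmetric data
print(pop_skewness(right_tailed))      # positive: right skew
print(pop_excess_kurtosis(symmetric))  # negative: lighter tails than a normal dist.
```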

Spark also supports aggregating column values into an array using the `collect_set()` or `collect_list()` function.

In [32]:
from pyspark.sql.functions import countDistinct, approx_count_distinct, col, struct, array
# count, distinct count, count on a column
df.select(countDistinct(col('StockCode')).alias('DistinctCount')).show()

# work faster when data is very large
df.select(approx_count_distinct(col('StockCode'), 0.01).alias('DistinctCount')).show()

# can count distinct multiple columns
df.select(countDistinct(col('StockCode'), col('Quantity')).alias('DistinctCount')).show()  

# this also works on multiple columns, but is slower
df.select(approx_count_distinct(struct(col('StockCode'), col('Quantity')), 0.01).alias('Approx')).show()

# this works on multiple columns too, but is also slower
df.select(approx_count_distinct(array(col('StockCode'), col('Quantity')), 0.01).alias('Approx')).show()

# recall that distinct() returns all distinct rows
df.select(col('StockCode'), col('Quantity')).distinct().show(3)
df.select(col('StockCode'), col('Quantity')).distinct().count()  # same results as countDistinct!

+-------------+
|DistinctCount|
+-------------+
|         4070|
+-------------+

+-------------+
|DistinctCount|
+-------------+
|         4079|
+-------------+

+-------------+
|DistinctCount|
+-------------+
|        45280|
+-------------+

+------+
|Approx|
+------+
| 45378|
+------+

+------+
|Approx|
+------+
| 45314|
+------+

+---------+--------+
|StockCode|Quantity|
+---------+--------+
|    21485|       6|
|    84347|       3|
|    22454|       2|
+---------+--------+
only showing top 3 rows



45280

In [38]:
# note: this import shadows Python's built-in min, max and sum for the rest of the notebook
from pyspark.sql.functions import min, max, first, last, sum, avg, var_pop, skewness, kurtosis

# some column based stats
min_quantity = min(df.Quantity)
max_quantity = max(df.Quantity)
first_quantity = first(df.Quantity)
last_quantity = last(df.Quantity)
sum_quantity = sum(df.Quantity)
avg_quantity = avg(df.Quantity)
var_quantity = var_pop(df.Quantity)
skewness_quantity = skewness(df.Quantity)
kurtosis_quantity = kurtosis(df.Quantity)

df.select(min_quantity.alias('min'), max_quantity.alias('max'), 
          first_quantity.alias('first'), last_quantity.alias('last'),
          sum_quantity.alias('sum'), avg_quantity.alias('avg'),
          var_quantity.alias('var'), skewness_quantity.alias('skewness'),
          kurtosis_quantity.alias('kurtosis')).show()

+------+-----+-----+----+-------+----------------+------------------+-------------------+------------------+
|   min|  max|first|last|    sum|             avg|               var|           skewness|          kurtosis|
+------+-----+-----+----+-------+----------------+------------------+-------------------+------------------+
|-80995|80995|    6|   3|5176450|9.55224954743324|47559.303646609165|-0.2640755761052369|119768.05495536828|
+------+-----+-----+----+-------+----------------+------------------+-------------------+------------------+



In [44]:
from pyspark.sql.functions import corr, covar_pop

# correlation between two columns
cor_qp = corr(df.Quantity, df.UnitPrice)

# correlation is covariance normalized by the product of the two standard deviations
covar_qp = covar_pop(df.Quantity, df.UnitPrice)

# print it out
df.select(cor_qp.alias('Correlation'), covar_qp.alias('Covariance')).show()

+--------------------+-------------------+
|         Correlation|         Covariance|
+--------------------+-------------------+
|-0.00123492454487...|-26.058713170967746|
+--------------------+-------------------+
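
To make the relationship between the two numbers concrete, here is a minimal plain-Python sketch of Pearson correlation as the population covariance divided by the product of the two population standard deviations. The helper name `pearson_corr` and the sample lists are made up for illustration and are not taken from the retail dataset.

```python
import math


def pearson_corr(xs, ys):
    """Population covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# hypothetical data: price falls linearly as quantity rises
sample_quantity = [1, 2, 3, 4]
sample_price = [8.0, 6.0, 4.0, 2.0]

print(pearson_corr(sample_quantity, sample_price))  # very close to -1
```

Because the normalization removes the units, the correlation always lies between -1 and 1, whereas the covariance scales with the magnitude of both columns.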



In [47]:
from pyspark.sql.functions import collect_set, collect_list

agged_df = df.agg(collect_set(col('Quantity')), collect_list(col('Quantity')))
agged_df.show()

+---------------------+----------------------+
|collect_set(Quantity)|collect_list(Quantity)|
+---------------------+----------------------+
| [-42, 306, 256, 1...|  [6, 6, 8, 6, 6, 2...|
+---------------------+----------------------+



### 1.2 GroupBy and Aggregate

A more common task is to perform calculations on groups within the data. This is usually a two-stage process:
- group by some keys: `.groupBy(col_names, ...)`, which supports multiple columns
- aggregate with some functions: `.agg(func(col), ...)`

In [48]:
df.show(1)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
only showing top 1 row



In [55]:
from pyspark.sql.functions import count, expr, col

# group by can work with multiple columns
df.groupBy('StockCode', 'Country').agg(count(col('StockCode')).alias('Count'), 
                                       avg(col('UnitPrice')).alias('Avg')).show(3)

# expr lets you write each aggregation as a SQL string instead
df.groupBy('StockCode', 'Country').agg(expr('count(StockCode)').alias('Count'), 
                                       expr('avg(UnitPrice)').alias('Avg')).show(3)

+---------+--------------+-----+------------------+
|StockCode|       Country|Count|               Avg|
+---------+--------------+-----+------------------+
|    22154|United Kingdom|  170|0.5414117647058824|
|    22478|United Kingdom|  133|1.8110526315789475|
|    22844|United Kingdom|  402|10.921791044776118|
+---------+--------------+-----+------------------+
only showing top 3 rows

+---------+--------------+-----+------------------+
|StockCode|       Country|Count|               Avg|
+---------+--------------+-----+------------------+
|    22154|United Kingdom|  170|0.5414117647058824|
|    22478|United Kingdom|  133|1.8110526315789475|
|    22844|United Kingdom|  402|10.921791044776118|
+---------+--------------+-----+------------------+
only showing top 3 rows



### 1.3 Window Functions

A window is a specification of which rows should be used for the computation (aggregation) at each row.
- with groupBy, each row can only go into one group
- **with a window, a row can go into multiple groups (frames), e.g. for a rolling average**
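
The rolling average case can be sketched in plain Python to show one row feeding several computations. `rolling_average` is a hypothetical helper, and the frame used here (the current row plus the rows before it, up to `window_size` rows in total) is only an assumption chosen for illustration; in Spark it would roughly correspond to a window with `rowsBetween(-1, Window.currentRow)` when `window_size` is 2.

```python
def rolling_average(values, window_size):
    """Average of each row together with up to window_size - 1 preceding rows."""
    averages = []
    for i in range(len(values)):
        start = max(0, i - window_size + 1)  # frame is shorter at the beginning
        frame = values[start:i + 1]
        averages.append(sum(frame) / len(frame))
    return averages


sample_quantities = [6, 6, 8, 2]
print(rolling_average(sample_quantities, 2))  # [6.0, 6.0, 7.0, 5.0]
```

The value 8 contributes to both the third and the fourth average, which is exactly what a groupBy cannot express: there, each row would count toward one group only.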