# Spark Learning Note - Data Aggregations and Table Joining

Jia Geng | gjia0214@gmail.com

In [1]:
# check java version 
# use sudo update-alternatives --config java to switch java version if needed.
!java -version

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~19.10-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


In [2]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('Spark Learning').getOrCreate()
spark

In [3]:
data_example = '/home/jgeng/Documents/Git/SparkLearning/book_data/retail-data/all/online-retail-dataset.csv'

## Aggregation

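Aggregation groups rows by one or more keys and applies an aggregation function to each group. In Spark, `groupBy` does not return a DataFrame: it returns a `RelationalGroupedDataset` (exposed as `GroupedData` in PySpark), which turns back into a DataFrame once an aggregation is applied to it.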

Grouping types in Spark include:
- **summary of the whole DataFrame**, e.g. `df.count()`
- **group by**: aggregate using one or more keys and one or more aggregation functions
- **window**: aggregate using one or more keys and one or more aggregation functions, where the rows fed into each function are defined relative to the current row
- **grouping sets**: aggregate at multiple different levels
    - **rollup**: one or more keys and one or more aggregation functions, summarized hierarchically
    - **cube**: one or more keys and one or more aggregation functions, summarized across all combinations of the key columns

In [15]:
df = spark.read.format('csv').option('header', True).option('inferSchema', True).load(data_example)
df.printSchema()
df.show(3)
df.cache()  # cache() is lazy: nothing is cached until an action runs
df.count()  # count() is an action over all rows, so calling it materializes the cache for the whole DataFrame

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|     2.75|     17850|United Kingdom|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+

541909

### 1.1 Aggregation Functions on DataFrame

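Common aggregation functions live in `pyspark.sql.functions` and operate on columns.
- `count()`: `df.count()` is an action, while `count(col)` is a transformation that counts the non-null values of a column.
- `countDistinct()`: exact distinct count; can be slow when the data is large
- `approx_count_distinct(col, rsd)`: faster approximate option; `rsd` is the maximum relative standard deviation allowed (default 0.05)
- `first()`, `last()`: get the first/last value of a column
- `min()`, `max()`, `sum()`, `sumDistinct()`, `avg()`: work as their names suggest
- `var_pop()`, `var_samp()`, `stddev_pop()`, `stddev_samp()`: population and sample variance / standard deviation
- `skewness()`, `kurtosis()`
    - skewness: measures the asymmetry of the distribution around its mean
    - kurtosis: measures how heavy the tails of the distribution are
- `corr()`, `covar_pop()`, `covar_samp()`: Pearson correlation, population covariance, and sample covariance between two columns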