# The Right Way of Doing Aggregates

### Library Imports

In [8]:
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

Create a `SparkSession`. There is no need to create a `SparkContext` separately, because the session exposes one as `spark.sparkContext`.

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Exploring Aggregates") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sc = spark.sparkContext

### Initial Datasets

In [3]:
data = [
    (1, datetime(2018, 1, 1), 1),
    (1, datetime(2018, 1, 1), 2),
    (2, datetime(2018, 1, 2), 2),
]

df = spark.createDataFrame(data, ['id', 'date', 'value'])
df.toPandas()

Unnamed: 0,id,date,value
0,1,2018-01-01,1
1,1,2018-01-01,2
2,2,2018-01-02,2


In [4]:
groupby_columns = ['id', 'date']

## Option 1: Using a Dictionary

In [5]:
df_1 = df \
    .groupby(groupby_columns) \
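    .agg({
        "value": "sum",    # overwritten: a dict cannot hold the same key twice
        "value": "count",  # only this entry survives, so only count(value) is computed
    })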

df_1.toPandas()

Unnamed: 0,id,date,count(value)
0,1,2018-01-01,2
1,2,2018-01-02,1
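
The `sum(value)` column is missing from the output above because of how Python dictionaries work, not because of anything Spark does. A minimal plain-Python sketch of the same dictionary (the variable name `aggregations` is only illustrative) shows that the duplicate key collapses before `agg` ever sees it:

```python
# A dict literal keeps only the last value assigned to a repeated key.
aggregations = {
    "value": "sum",
    "value": "count",
}

print(aggregations)       # {'value': 'count'}
print(len(aggregations))  # 1
```

So the dictionary form can express at most one aggregate per column, which is why it does not fit this use case.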


## Option 2: Using a List of Columns

In [6]:
df_2 = df \
    .groupby(groupby_columns) \
    .agg(
        F.sum("value"),
        F.count("value"),
    )

df_2.toPandas()

Unnamed: 0,id,date,sum(value),count(value)
0,1,2018-01-01,3,2
1,2,2018-01-02,2,1


## Option 3: Using a List of Columns, with Aliases

In [7]:
df_3 = df \
    .groupby(groupby_columns) \
    .agg(
        F.sum("value").alias("sum_of_value_per_day"),
        F.count("value").alias("count_of_value_per_day"),
    )

df_3.toPandas()

Unnamed: 0,id,date,sum_of_value_per_day,count_of_value_per_day
0,1,2018-01-01,3,2
1,2,2018-01-02,2,1


# TL;DR

**I encourage using option #3.**

This creates more elegant and meaningful names for the new aggregate columns.

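You could rename the columns afterwards with `withColumnRenamed`, but why not use `alias`? It names each column right where the aggregate is defined, so you never have to refer to auto-generated names such as `sum(value)`.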
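
For comparison, here is a rough sketch of what the `withColumnRenamed` route could look like when applied to a result like `df_2` from Option 2. The helper `rename_aggregates` and the `RENAMES` mapping are hypothetical names made up for this illustration. The only PySpark call assumed is `DataFrame.withColumnRenamed(existing, new)`.

```python
# Hypothetical mapping from the auto-generated names seen in Option 2
# to the names chosen in Option 3.
RENAMES = {
    "sum(value)": "sum_of_value_per_day",
    "count(value)": "count_of_value_per_day",
}


def rename_aggregates(df, renames=RENAMES):
    """Apply one withColumnRenamed call per auto-generated aggregate column."""
    for old_name, new_name in renames.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df
```

In this notebook, `rename_aggregates(df_2)` should yield the same column names as `df_3`, but it needs an extra step and depends on knowing the generated names in advance. With `alias`, neither is necessary.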