# Groupby 3: Multiple aggregations
By the end of this lecture you will be able to:
- do multiple aggregations on multiple columns

In [None]:
import polars as pl

In [None]:
csvFile = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## General case: Aggregations in a list
We can pass a `list` of expressions to `.agg` to compute several aggregations in one call.

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        [
            pl.col('Age').mean(),
            pl.col("Fare").max()
        ]
    )
)

When there are multiple aggregations Polars calculates them in parallel.

## Multiple aggregations on a column

Calling multiple aggregations on the same column would give every output column the same name, and column names in a Polars `DataFrame` must be unique.

We use an `alias` to give each output column its own name.

For example, we get the min, mean and max of the `Age` column:

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        [
            pl.col('Age').min().alias('Age_min'),
            pl.col('Age').mean().alias('Age_mean'),
            pl.col('Age').max().alias('Age_max')
        ]
    )
)

There are more concise ways to write multiple columns and/or multiple aggregations in `agg`.

## Same aggregation on multiple columns
To do the same aggregation on multiple columns we can loop over the columns in a list comprehension.

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        [
            pl.col(colName).mean() for colName in ["Age","Fare"]
        ]
    )
)

We can also use the methods for selecting multiple columns that we met previously, including:
- using `pl.all`
- passing a dtype to `pl.col`
- passing a regex to `pl.col`

We see examples of these below and in the exercises.

## Multiple aggregations on multiple columns

Using `alias` is tedious for multiple aggregations on multiple columns.

Instead we add a prefix or suffix to the original column name.

For example, with a `suffix`:

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        [
            pl.col(pl.Float64).mean().suffix("_mean"),
            pl.col(pl.Float64).min().suffix("_min")
        ]
    )
)

# Exercises

In the exercises you will develop your understanding of:
- doing aggregations on a column
- doing aggregations on multiple columns
- renaming columns with a prefix or suffix
- re-ordering columns that have a suffix

## Exercise 1
Group by `Pclass` and `Survived` and get the youngest, average and oldest ages in each group.

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Exercise 1 cont: Round the average `Age` column to one decimal place. Sort the output by `Survived` and `Pclass`.

Exercise 1 cont: Filter the output to keep only the passengers that survived.

## Exercise 2 - aggregate multiple columns

Group by `Pclass` and get the mean of all the floating point columns

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Group by `Pclass` and get the mean of all the:
- floating point columns and
- integer columns

Hint: pass a list of dtypes to `pl.col`

Add the suffix "_mean" to the floating point and integer columns

Get the `mean` and `max` of these columns

Re-order the columns into alphabetical order after the group key column using `pipe`.

(See the lecture Transforming DataFrames in the Selecting Columns Section if you haven't come across `pipe` before).

# Solutions

## Solution to Exercise 1
Group by `Pclass` and `Survived` and get the youngest, average and oldest ages

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(["Pclass","Survived"])
    .agg(
        [
            pl.col("Age").min().alias("Age_min"),
            pl.col("Age").mean().alias("Age_mean"),
            pl.col("Age").max().alias("Age_max")
        ]
    )
)

Round the average Age column and sort by `Survived` and `Pclass`

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(["Pclass","Survived"])
    .agg(
        [
            pl.col("Age").min().alias("Age_min"),
            pl.col("Age").mean().round(1).alias("Age_mean"),
            pl.col("Age").max().alias("Age_max")
        ]
    )
    .sort(["Survived","Pclass"])
)

Filter the output to have only the passengers that survived

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(["Pclass","Survived"])
    .agg(
        [
            pl.col("Age").min().alias("Age_min"),
            pl.col("Age").mean().round(1).alias("Age_mean"),
            pl.col("Age").max().alias("Age_max")
        ]
    )
    # In eager mode we should apply the filter before the sort
    .filter(pl.col("Survived") == 1)
    .sort(["Survived","Pclass"])
)

## Solution to Exercise 2 - aggregate multiple columns

Group by `Pclass` and get the mean of all the floating point columns

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(pl.col('Pclass'))
    .agg(
        pl.col(pl.Float64).mean()
    )
)

Floating point and integer columns

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(pl.col('Pclass'))
    .agg(
        pl.col([pl.Float64,pl.Int64]).mean()
    )
)

Add a suffix to the output

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(pl.col('Pclass'))
    .agg(
        pl.col([pl.Float64,pl.Int64]).mean().suffix("_mean")
    )
)

Get the `mean` and `max` of each passenger class

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Pclass")
    .agg(
        [
            pl.col([pl.Float64,pl.Int64]).mean().suffix("_mean"),
            pl.col([pl.Float64,pl.Int64]).max().suffix("_max"),
        ]
    )
)

Re-order the columns into alphabetical order using `pipe`

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Pclass")
    .agg(
        [
            pl.col([pl.Float64,pl.Int64]).mean().suffix("_mean"),
            pl.col([pl.Float64,pl.Int64]).max().suffix("_max"),
        ]
    )
    .pipe(lambda tempDf: tempDf.select(["Pclass"] + sorted(tempDf.columns[1:])))
)