# Groupby 2: Aggregation and expressions
By the end of this section you will be able to:
- do an aggregation with `agg`
- sort the output of an aggregation
- transform a column before doing `groupby`
- filter an aggregation

In [None]:
import polars as pl

In [None]:
csvFile = '../data/titanic.csv'

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Aggregate a single column

We aggregate a single column by calling `agg` after `groupby`.

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        pl.col('Age').mean()
    )
)

We must use expressions inside `agg`.

We can sort the output `DataFrame` by the group key or by an aggregated column.

In [None]:
(
    df
    .groupby('Pclass')
    .agg(
        pl.col('Age').mean()
    )
    .sort('Pclass')
)

## Groupby on an expression
We can use expressions inside the `groupby` method to transform the column before grouping.

We want to group by `Age` in decades instead of individual years. 

To do this we must:
- divide the `Age` column by 10 to convert years to decades
- round the result and cast it to integer
- group by the decades

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(
        (pl.col("Age")/10).round(0).cast(pl.Int64).alias("Decade")
    )
    .agg(
        pl.col("Fare").mean()
    )
    .sort("Decade", descending=True)
)

## Apply a filter on an aggregation

We may want to filter the results after doing the aggregation so that only some of the aggregates appear in the output.

> In SQL this is done using a `HAVING` clause

We do this in Polars with an additional `filter` after calling `agg`.

In this example we get the average fare by passenger class, keeping only the classes where the average fare is greater than £20.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby("Pclass")
    .agg(
        pl.col("Fare").mean()
    )
    .filter(
        pl.col("Fare") > 20
    )
    .sort("Fare")
)

## Exercises

In the exercises you will develop your understanding of:
- doing `groupby` on a column
- using an expression in a `groupby`
- doing `groupby` on multiple columns
- applying a `filter` on an aggregation

### Exercise 1: Group by a single column
Get the average fare by `Age`.

In [None]:
(
    pl.read_csv(csvFile)
    .<blank>
)

Round the `Age` column to the nearest year before doing the groupby. Sort the output by age.

Continuing from the previous cell, output the rows where the average fare is greater than 30.

### Exercise 2: Group by multiple columns

Group by the `Pclass` and `Survived` columns. Count the number of passengers in each group in a column called `counts`.

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Add a column with the percentage of the total passengers in each group. Do this by dividing the values in `counts` by the sum of the values in `counts`. Round the resulting fractions to 2 decimal places before converting them to percentages.

In [None]:
# Continue with your code from the previous section using `with_columns` after `agg`


## Solutions

### Solution to Exercise 1
Get the average fare by `Age`.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby('Age')
    .agg(
        pl.col("Fare").mean()
    )
    .head(2)
)

Round the `Age` column to the nearest year before doing the groupby. Sort the output by age.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(
        pl.col('Age').round(0)
    )
    .agg(
        pl.col("Fare").mean()
    )
    .sort('Age')
)

Continuing from the previous cell, output the rows where the average fare is greater than 30.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(
        pl.col('Age').round(0)
    )
    .agg(
        pl.col("Fare").mean()
    )
    .sort("Age")
    .filter(pl.col("Fare") > 30)
    .head()
)

### Solution to Exercise 2: Group by multiple columns

Group by the `Pclass` and `Survived` columns. Count the number of passengers in each group in a column called `counts`.

In [None]:
(
    pl.read_csv(csvFile)
    .groupby(
        ["Pclass","Survived"]
    )
    .agg(
        pl.col("Name").count().alias("counts")
    )    
)

Add a column with the percentage of the total passengers in each group. Do this by dividing the values in `counts` by the sum of the values in `counts`. Round the resulting fractions to 2 decimal places before converting them to percentages.

In [None]:
# Continue with your code from the previous section using `with_columns` after `agg`
(
    pl.read_csv(csvFile)
    .groupby(
        ["Pclass","Survived"]
    )
    .agg(
        pl.col("Name").count().alias("counts")
    )
    .with_columns(
        (100*(
            pl.col("counts")/pl.col("counts").sum()
        ).round(2)).alias("percent")
    )
    
)