# Statistics
By the end of this lecture you will be able to:
- calculate statistics on a `DataFrame` or expression
- get a summary with `describe`
- calculate cumulative statistics

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Statistics on a `DataFrame`

We can call statistical methods directly on a `DataFrame`. Each method returns one value per column.

In [None]:
df.mean()

## Summary of a `DataFrame`

We can get an overview of the `DataFrame` with `describe`.

In [None]:
df.describe()

For string columns the summary values (such as the count) are cast to the dtype of that column, so they appear as strings.

## Statistics in an expression
We can also calculate statistics in an expression.

In [None]:
(
    df
    .select(
        pl.col('Fare').mean()
    )
)

The statistics available include:
- count
- sum
- product
- min
- median
- mean
- max
- std (standard deviation)
- var (variance)
- skew
- kurtosis
- entropy

### Multiple statistics
We can use `suffix` to give each output a distinct name when calculating multiple statistics on the same column or columns.

In [None]:
(
    df.select(
        [
            pl.col(pl.Float64).min().suffix("_min"),
            pl.col(pl.Float64).max().suffix("_max"),
        ]
    )
)

## Cumulative statistics
We can also calculate cumulative statistics.

In [None]:
(
    df
    .select(
        pl.col("Fare").cummax()
    )
    .head(3)
)
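
A cumulative statistic returns one value per row: the statistic of all rows up to and including that row. The plain-Python sketch below illustrates the idea with `itertools.accumulate` on made-up fares. It is not how Polars computes the result, only a way to check the output by hand.

```python
from itertools import accumulate

# Hypothetical fares in row order
fares = [10.0, 60.0, 20.0, 80.0]

# Running maximum: the largest value seen so far at each row
running_max = list(accumulate(fares, max))

# Running sum: the total of all values up to each row
running_sum = list(accumulate(fares))

print(running_max)
print(running_sum)
```

The output has the same length as the input, unlike `mean` or `max`, which reduce a column to a single value.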

# Exercises

In the exercises you will develop your understanding of:
- calculating statistics
- calculating cumulative statistics

### Exercise 1 - calculating multiple statistics
Calculate the mean and median of the `Age` column for passengers in 1st class

In [None]:
df = pl.read_csv(csvFile)
(
    df
    <blank>
)

Calculate the mean and median of all of the floating point columns for passengers in 1st class

### Exercise 2
Calculate the cumulative sum of the `Age` and `Fare` columns

In [None]:
df = pl.read_csv(csvFile)
(
    df
    <blank>
)

# Solutions

### Solution to Exercise 1 
Calculate the mean and median of the `Age` column for passengers in 1st class

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .filter(
        pl.col('Pclass') == 1
    )
    .select(
        [
            pl.col('Age').mean().alias('Age_mean'),
            pl.col('Age').median().alias('Age_median')
        ]
    )
)

Calculate the mean and median of all of the floating point columns for passengers in 1st class

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .filter(
        pl.col('Pclass') == 1
    )
    .select(
        [
            pl.col(pl.Float64).mean().suffix('_mean'),
            pl.col(pl.Float64).median().suffix('_median')
        ]
    )
)

### Solution to Exercise 2 
Calculate the cumulative sum of the `Age` and `Fare` columns

In [None]:
df = pl.read_csv(csvFile)
(
    df
    .select(
        pl.col(["Age","Fare"]).cumsum()
    )
)