# Statistics
By the end of this section you will know how to:
- calculate statistics on a `DataFrame` or expression
- get a summary with `describe`
- calculate cumulative, rolling and exponentially-weighted statistics

In [None]:
import polars as pl

In [None]:
csvFile = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csvFile)
df.head(3)

## Statistics on a `DataFrame`

We can call statistical methods on a `DataFrame`

In [None]:
df.mean()

## Summary of a `DataFrame`

We can get an overview of the `DataFrame` with `describe`

In [None]:
df.describe()

Each column of the `describe` output has a single dtype, so the statistics for a string column are returned as strings.

## Statistics in an expression
We can calculate statistics in an expression

In [None]:
(
    df
    .select(
        pl.col('Fare').mean()
    )
)

The statistics available include:
- count
- sum
- product
- min
- median
- mean
- max
- std (standard deviation)
- var (variance)
- skew
- kurtosis
- entropy

## Rolling statistics
We can calculate rolling statistics in an expression.

We first create a simple `DataFrame` with sequential values

In [None]:
df = (
    pl.DataFrame(
        {
            "value":range(12),
        }
    )
)
df.head()

We take the rolling mean over 4 values by setting the `window_size` to be 4

In [None]:
(
    df
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=4)
    )
    .head(5)
)

Note that by default the first non-`null` value is on the 4th row, because a full window of 4 values is required before a result is produced.

We can calculate the statistic with fewer values than the `window_size` by setting the `min_periods` argument

In [None]:
(
    df
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=4),
        rolling_mean_value_min_periods = pl.col("value").rolling_mean(window_size=4, min_periods=1),
    )
    .head()
)

In the examples above the statistics are *backward-looking*. That is, the value on the 4th row is the average of the first four rows. We can instead center the statistic with the `center` argument (note that we use a window size of 5 here)

In [None]:
(
    df
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=5),
        rolling_mean_value_center = pl.col("value").rolling_mean(window_size=5,center=True)
    ).head(5)
)

In this case the value on the third row is the mean of the first five rows.

See the full range of rolling statistics here: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/computation.html

## Weighted statistics
We can supply a custom weighting to a rolling statistic with the `weights` argument.

Polars also provides exponentially-weighted statistics as expressions.

Use the `span` parameter to control how quickly the weights decay: the smoothing factor is `alpha = 2 / (span + 1)`, so a larger `span` gives older rows more influence

In [None]:
(
    df
    .with_columns(
        rolling_mean_value = pl.col("value").rolling_mean(window_size=4),
        ewm_mean_value = pl.col("value").ewm_mean(span=4)
    ).head(5)
)

For the `ewm_mean` the `min_periods` is 1 by default.

Exponentially-weighted statistics available are:
- `ewm_mean`
- `ewm_std`
- `ewm_var`

## Multiple statistics
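We can use `name.suffix` to give each output a distinct name when calculating multiple statistics on the same column or columns (`name.suffix` replaces the older `suffix` method, which recent Polars releases no longer provide)

In [None]:
(
    # Reload the Titanic data: `df` currently holds the 12-row example DataFrame
    pl.read_csv(csvFile)
    .select(
        [
            pl.col(pl.Float64).min().name.suffix("_min"),
            pl.col(pl.Float64).max().name.suffix("_max"),
        ]
    )
)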

We can also do arithmetic with statistics.

In this example we apply min-max scaling to the floating point columns, which maps each column onto the range 0 to 1

In [None]:
(
    # Reload the Titanic data: `df` currently holds the 12-row example DataFrame
    pl.read_csv(csvFile)
    .with_columns(
        (
            (pl.col(pl.Float64) - pl.col(pl.Float64).min())
            / (pl.col(pl.Float64).max() - pl.col(pl.Float64).min())
        ).name.suffix("_scaled")
    )
    .select(pl.col(pl.Float64))
    .head()
)

# Exercises

In the exercises you will develop your understanding of:
- calculating statistics on a column
- calculating statistics on multiple columns of the same dtype
- calculating cumulative statistics

### Exercise 1 - calculating multiple statistics
Calculate the mean and median of the `Age` column for passengers in 1st class

In [None]:
(
    pl.read_csv(csvFile)
    <blank>
)

Add a new column called `Age_delta` that is the difference between the age and the average age of all passengers

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        <blank>
    )
    .select(
        'Age','Age_delta'
    )
    .head(10)
)

Add another column called `Age_z` that holds the z-score of `Age`, that is, (age - average age of the column) divided by the standard deviation of the age column

Create these new columns for all floating point columns in the CSV. Add a `pipe` command if you want to sort the columns alphabetically

### Exercise 2
We have the following `DataFrame` with values that occur in sequences in the `values` column

In [None]:
records = (
    pl.DataFrame(
        {
            "values":['A','A','A','B','B','A','A']
        }
    )
)
records

We want to identify groups of rows with the same consecutive values in the `values` column to get the following output

In [None]:
(
    pl.DataFrame(
        {
            "values":['A','A','A','B','B','A','A'],
            "groups":[0,0,0,1,1,2,2]
        }
    )
)

Try this yourself or follow the step-by-step guide below if you need help. 

Note that one way to do this involves the `shift` expression that we haven't met before

In [None]:
(
    records
    .with_columns(
        pl.col("values").shift(1).alias("shifted")
    )
)

Step-by-step approach:

Check if the value in each row is **not** equal to the value in the previous row and store the result in a column called `equalsPrevious`. The first row has no previous value, so decide how it should be treated

Use a cumulative function on `equalsPrevious` to increment an integer value whenever a row that is not equal to the previous value is encountered. 

You may need to do some arithmetic to get the same result as set out in the `groups` column above

# Solutions

### Solution to Exercise 1 
Calculate the mean and median of the `Age` column for passengers in 1st class

In [None]:
(
    pl.read_csv(csvFile)
    .filter(
        pl.col('Pclass') == 1
    )
    .select(
        [
            pl.col('Age').mean().alias('Age_mean'),
            pl.col('Age').median().alias('Age_median')
        ]
    )
)

Add a new column called `Age_delta` that is the difference between the age and the average age of all passengers

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        (pl.col('Age') - pl.col('Age').mean()).alias('Age_delta')
    )
    .select(
        'Age','Age_delta'
    )
    .head(10)
)

Add a further column called `Age_z` that has the z-score for the `Age`: this is the (age - average age of the column) divided by the standard deviation of the age column

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        [
            (pl.col('Age') - pl.col('Age').mean()).alias('Age_delta'),
            ((pl.col('Age') - pl.col('Age').mean())/pl.col('Age').std()).alias('Age_z')
        ]
    )
    .select(
        'Age','Age_delta','Age_z'
    )
    .head(10)
)

Create these new columns for all floating point columns in the CSV. Add a `pipe` command if you want to sort the columns alphabetically

In [None]:
(
    pl.read_csv(csvFile)
    .with_columns(
        [
            # `name.suffix` replaces the older `suffix` method
            (pl.col(pl.Float64) - pl.col(pl.Float64).mean()).name.suffix('_delta'),
            ((pl.col(pl.Float64) - pl.col(pl.Float64).mean())/pl.col(pl.Float64).std()).name.suffix('_z')
        ]
    )
    .select(
        pl.col(pl.Float64)
    )
    .pipe(lambda df:df.select(sorted(df.columns)))
    .head(10)
)

### Solution to Exercise 2
We have the following `DataFrame` with values that occur in sequences in the `values` column

In [None]:
records = (
    pl.DataFrame(
        {
            "values":['A','A','A','B','B','A','A']
        }
    )
)
records

We want to identify groups of rows with the same consecutive values to get the following output. The `groups` column holds an integer id for the run of consecutive values that each row belongs to.

In [None]:
(
    pl.DataFrame(
        {
            "values":['A','A','A','B','B','A','A'],
            "groups":[0,0,0,1,1,2,2]
        }
    )
)

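Check if the value in each row is **not** equal to the value in the previous row and store the result in a column called `equalsPrevious`. The first row has no previous value, so `shift(1)` gives `null` there; depending on the Polars version the comparison may then also be `null`, and `fill_null(True)` makes the first row start a new group either way

In [None]:
(
    records
    .with_columns(
        (pl.col('values') != pl.col('values').shift(1)).fill_null(True).alias('equalsPrevious')
    )
)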

Use a cumulative function on `equalsPrevious` to increment an integer value whenever a row that is not equal to the previous value is encountered. 

You may need to do some arithmetic to get the same result as set out in the `groups` column above

In [None]:
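(
    records
    .with_columns(
        # The first row has no previous value: treat it as the start of a new group
        (pl.col('values') != pl.col('values').shift(1)).fill_null(True).alias('equalsPrevious')
    )
    .with_columns(
        # `cum_sum` is the current name of the method older Polars releases called `cumsum`
        (pl.col('equalsPrevious').cum_sum() - 1).alias('groups')
    )
)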
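The expected `groups` values can be cross-checked outside Polars. This sketch uses `itertools.groupby` from the Python standard library, which is a different technique from the cumulative-sum approach in the solution; it splits the list into runs of equal consecutive values and numbers the runs from 0.

```python
from itertools import groupby

values = ["A", "A", "A", "B", "B", "A", "A"]

groups = []
# groupby yields one (key, run) pair per run of equal consecutive values
for run_id, (_, run) in enumerate(groupby(values)):
    groups.extend(run_id for _ in run)

print(groups)  # [0, 0, 0, 1, 1, 2, 2]
```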