## Replacing missing values
By the end of this section you will know how to:
- replace missing values with a constant
- replace missing values with a strategy

In [None]:
from datetime import date

import polars as pl

We create a simple `DataFrame` for this lecture

In [None]:
df = pl.DataFrame(
    {
        "col1":[0,None,2,3],
        "col2":[0,None,None,3],
        "strings":["a",None,"c","d"]
    }
)
df

## Replace missing values with a constant
We replace missing values in an expression using `fill_null`

In [None]:
(
    df
    .with_columns(
        pl.all().fill_null(0).name.suffix("_new")
    )
)

Note that `fill_null` replaced `null` with a string `"0"` in `strings_new`

We can also replace the missing values with a string 

In [None]:
(
    df
    .with_columns(
        pl.all().fill_null("missing").name.suffix("_new")
    )
)

In this case `fill_null` has `cast` the columns from integer to string dtype!



## Replace missing values with a strategy
We can also replace missing values with a strategy. The available strategies are:
- forward: replace with the previous non-`null` value
- backward: replace with the next non-`null` value
- min: replace with the smallest value in the `Series`
- max: replace with the largest value in the `Series`
- mean: replace with the mean value in the `Series`
- zero: replace with `0`
- one: replace with `1`

### Forward strategy
In the forward strategy the missing values are replaced with the previous non-`null` values

In [None]:
(
    df
    .with_columns(
        pl.all().fill_null(strategy="forward").name.suffix("_new")
    )
)

With the `limit` argument we can cap how many consecutive `null` values are filled forward or backward

In [None]:
(
    df
    .with_columns(
        pl.all().fill_null(strategy="forward", limit=1).name.suffix("_new")
    )
)
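
To make the effect of `limit` concrete, here is a plain-Python sketch of a forward fill with a cap on consecutive fills. It only illustrates the semantics and is not how Polars implements them. `forward_fill` is a hypothetical helper written for this example.

```python
from typing import Optional


def forward_fill(values: list, limit: Optional[int] = None) -> list:
    """Forward-fill None entries, filling at most `limit` consecutive gaps."""
    result = []
    last_seen = None      # most recent non-null value
    filled_in_a_row = 0   # consecutive nulls filled since last_seen was set
    for value in values:
        if value is not None:
            # A real value resets the counter and becomes the new fill value
            last_seen = value
            filled_in_a_row = 0
            result.append(value)
        elif last_seen is not None and (limit is None or filled_in_a_row < limit):
            # Still within the limit: reuse the previous non-null value
            filled_in_a_row += 1
            result.append(last_seen)
        else:
            # No previous value yet, or the limit is exhausted: keep the gap
            result.append(None)
    return result


col2 = [0, None, None, 3]
print(forward_fill(col2))           # [0, 0, 0, 3]
print(forward_fill(col2, limit=1))  # [0, 0, None, 3]
```

With `limit=1` only the first of the two consecutive missing values in `col2` is filled, and the second stays missing.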

## Replacing missing values by group
In this example we have missing values in `col1` and we want to fill them with a fill-forward strategy. 

However, we want to fill forward with respect to the groups in the `group` column.

In [None]:
df = pl.DataFrame(
    {
        "group":["A","B","A","B","A","B"],
        "col1":[0,1,None,1,2,None],
    }
)
df

We can do this using a *window expression* with `over`

In [None]:
(
    df
    .with_columns(
        pl.col("col1").fill_null(strategy="forward").over("group").name.suffix("_filled")
    )
)
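
Conceptually, filling forward by group means keeping a separate "previous value" for each group. The plain-Python sketch below mimics that idea on the same data. It is an illustration only, not Polars' implementation, and `forward_fill_by_group` is a hypothetical helper.

```python
def forward_fill_by_group(groups: list, values: list) -> list:
    """Forward-fill None entries with the last non-null value from the same group."""
    last_seen = {}  # group key -> most recent non-null value in that group
    result = []
    for group, value in zip(groups, values):
        if value is not None:
            last_seen[group] = value
        # Missing values fall back to the group's last value (None if there is none yet)
        result.append(value if value is not None else last_seen.get(group))
    return result


groups = ["A", "B", "A", "B", "A", "B"]
col1 = [0, 1, None, 1, 2, None]
print(forward_fill_by_group(groups, col1))  # [0, 1, 0, 1, 2, 1]
```

A forward fill that ignored the groups would instead give `[0, 1, 1, 1, 2, 2]`, because each missing value would take the value from the row directly above it regardless of group.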

We cover window expressions in more detail in the statistics and grouping section.

## Exercises
In the exercises you will develop your understanding of:
- replacing missing values with a constant
- replacing missing values with a strategy
- replacing missing values by group

### Exercise 1
Filter the `DataFrame` to have only the two rows with missing values in the `Embarked` column and then replace the missing values in the `Embarked` column with the string `"unknown"`

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
   <blank>
)

### Exercise 2
Add a new column called `Age_filled` where missing values are replaced with the value from the following row.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
    .select(["Age","Age_filled"])
)

Do the same, but this time fill from the following row within the same passenger class (`Pclass`).

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
    .select(["Pclass","Age","Age_filled"])
)

Add three new columns called `Age_mean`, `Age_median` and `Age_interpolated` where missing values are replaced with the:
- mean
- median and
- interpolated values

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    <blank>
    .select(["Age","Age_mean","Age_median","Age_interpolated"])
    .filter(pl.col("Age").is_null())
)

## Solutions

### Solution to Exercise 1
Filter the `DataFrame` to have only the two rows with missing values in the `Embarked` column and then replace the missing values in the `Embarked` column with `"unknown"`

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .filter(
        pl.col("Embarked").is_null()
    )
    .with_columns(
        pl.col("Embarked").fill_null("unknown")
    )
)

### Solution to Exercise 2
Add a new column called `Age_filled` where missing values are replaced with the value from the following row.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col("Age").fill_null(strategy="backward").alias("Age_filled")
    )
    .select(["Age","Age_filled"])
)

Do the same, but this time fill from the following row within the same passenger class (`Pclass`).

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col("Age").fill_null(strategy="backward").over("Pclass").alias("Age_filled")
    )
    .select(["Pclass","Age","Age_filled"])
)

Add three new columns called `Age_mean`, `Age_median` and `Age_interpolated` where missing values are replaced with the mean, median and interpolated values.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        [
            pl.col("Age").fill_null(strategy="mean").alias("Age_mean"),
            pl.col("Age").fill_null(pl.col("Age").median()).alias("Age_median"),
            pl.col("Age").interpolate().alias("Age_interpolated"),
        ]
    )
    .select(["Age","Age_mean","Age_median","Age_interpolated"])
    .filter(pl.col("Age").is_null())
)