## Replacing missing values with expressions
By the end of this section, you will know how to:
- replace missing values with an expression on the same column
- replace missing values based on other columns
- replace missing values based on a condition
- replace missing values with interpolation

In [2]:
import polars as pl

We create a simple `DataFrame` for this lecture.

In [None]:
df = pl.DataFrame(
    {
        'col1':[0,None,2,3],
        "col2":[0,None,None,3],
        "col3":[4,5,6,7]
    }
)
df

## Replace missing values with an expression
We are not limited to the built-in fill strategies: `fill_null` also accepts an expression, so the replacement value can be computed from the data.

### Using an expression from the same column
For example, we can replace the missing values with the median of the non-`null` values in that column.

In [None]:
(
    df
    .with_columns(
        # Expr.suffix was deprecated in favour of Expr.name.suffix in newer Polars releases
        pl.col("col1").fill_null(pl.median("col1")).name.suffix("_new"),
    )
)


### Using an expression from a different column

We can also replace missing values with the values from another column in the same row.

In [None]:
(
    df
    .with_columns(
        # Name the output explicitly because the expression reads from two columns
        pl.col("col2").fill_null(pl.col("col3")).alias("col2_new"),
    )
)


## Replace missing values based on a condition
We can replace missing values based on a condition on other columns using a `pl.when(...).then(...).otherwise(...)` expression.

In this example we replace missing values in `col2` with the value from `col1` as long as `col1` is not `null`. Otherwise we use the value from `col3`.

In [None]:
(
    df
    .with_columns(
        pl.when(
            pl.col("col1").is_not_null()
        )
        .then(
            pl.col("col2").fill_null(pl.col("col1"))
        )
        .otherwise(
            pl.col("col2").fill_null(pl.col("col3"))
        )   
        .alias("col2_new"),
    )
)


## Interpolation
We can replace missing values with linear interpolation between the neighbouring non-`null` values.

In [None]:
(
    df
    .with_columns(
        # Expr.suffix was deprecated in favour of Expr.name.suffix in newer Polars releases
        pl.all().interpolate().name.suffix("_new"),
    )
)

## Replacing missing values based on a sequence of columns
We can replace missing values based on a sequence of columns with `coalesce`.

In this example we have 3 columns: `a`, `b` and `c`.

In [None]:
dfCoalesce = pl.DataFrame(
    data=[
        (None, 1.0, 1.0),
        (None, 2.0, 2.0),
        (None, None, 3.0),
        (None, None, None),
    ],
    schema=[("a", pl.Float64), ("b", pl.Float64), ("c", pl.Float64)],
)
dfCoalesce

We want to create a new column that holds the first non-`null` value in each row as we go through these columns in order.

We do this with `pl.coalesce`. Adding a literal as the last entry gives a fill value for rows where all of the columns are `null`.

In [None]:
(
    dfCoalesce
    .with_columns(
        pl.coalesce(["a", "b", "c", 9.0]).alias("d")
    )
)

## Exercises
In the exercises you will develop your understanding of:
- replacing missing values with an expression
- replacing missing values with interpolation
- replacing missing values with `coalesce`

### Exercise 1
Replace `null` values in the `Age` column with the `median` of the `Age` column.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
   <blank>
)

Replace `null` values in the `Age` column with the `median` of the `Age` column **based on whether the passenger is `male` or `female` in the `Sex` column**.

Expand the following cell if you want a hint

In [None]:
#Hint: in each fill_null call you need to apply a filter to the `Sex` column before you can call median

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        <blank>
        .alias("Age_filled")
    )
    .select(["Sex","Age","Age_filled"])
    .filter(pl.col("Age").is_null())
    .head()
)

### Exercise 2
We have the following `DataFrame` with 3 columns

In [None]:
df = pl.DataFrame(
    {
        "a":[10,None,22,1],
        "b":[8,12,19,None],        
        "c":[5,None,19,None],
    }
)

Add a new column with values from column `c`. If `c` is `null`, use the value from column `b`, and if `b` is also `null`, use the value from column `a`.

In [None]:
(
    df
    <blank>
)

## Solutions

### Solution to Exercise 1
Replace `null` values in the `Age` column with the `median` of the `Age` column.

In [None]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age").median())
    )
    .head(10)
)

Replace `null` values in the `Age` column with the `median` of the `Age` column **based on whether the passenger is `male` or `female` in the `Sex` column**.

Expand the following cell if you want a hint

In [None]:
#Hint: in each fill_null call you need to apply a filter to the `Sex` column before you can call median

In [3]:
csvFile = "../data/titanic.csv"
(
    pl.read_csv(csvFile)
    .with_columns(
        pl.when(
            pl.col("Sex") == "female"
        )
        .then(
            pl.col("Age").fill_null(pl.col("Age").filter(pl.col("Sex") == "female").median())
        )
        .otherwise(
            pl.col("Age").fill_null(pl.col("Age").filter(pl.col("Sex") == "male").median())
        )
        .alias("Age_filled")
    )
    .select(["Sex","Age","Age_filled"])
    .filter(pl.col("Age").is_null())
    .head()
)

| Sex | Age | Age_filled |
| --- | --- | --- |
| str | f64 | f64 |
| "male" | null | 29.0 |
| "male" | null | 29.0 |
| "female" | null | 27.0 |
| "male" | null | 29.0 |
| "female" | null | 27.0 |


### Solution to Exercise 2
We have the following `DataFrame` with 3 columns

In [4]:
df = pl.DataFrame(
    {
        "a":[10,None,22,1],
        "b":[8,12,19,None],        
        "c":[5,None,19,None],
    }
)

Add a new column with values from column `c`. If `c` is `null`, use the value from column `b`, and if `b` is also `null`, use the value from column `a`.

In [5]:
(
    df
    .with_columns(
        pl.coalesce(["c","b","a"]).alias("d")
    )
)

| a | b | c | d |
| --- | --- | --- | --- |
| i64 | i64 | i64 | i64 |
| 10 | 8 | 5 | 5 |
| null | 12 | null | 12 |
| 22 | 19 | 19 | 19 |
| 1 | null | null | 1 |
