## Use an expression in another `DataFrame`
By the end of this lecture you will be able to:
- use an expression in another `DataFrame`
- use data from a training `DataFrame` in a testing `DataFrame`

In [None]:
import polars as pl

In this simple example we have a `DataFrame` `df` with a column `a`. 

We require a `LazyFrame`, so we call `.lazy()` at the end.

In [None]:
df = (
    pl.DataFrame(
        {
            "a":[0,1,2],
        }
    )
    .lazy()
)

We also have another `DataFrame` `df_other` with a column `b`. We again convert this to a `LazyFrame` at the end

In [None]:
df_other = (
    pl.DataFrame(
        {
            "b":[3,4,5]
        }
    )
    .lazy()
)

We want to add the values in column `a` of `df` to the values in column `b` of `df_other`.

We do this by calling `with_context(df_other)` on `df`. This allows us to use columns from `df_other` in expressions on `df`

In [None]:
(
    df
    .with_context(
        df_other
    )
    .with_columns(
        [
            pl.col("b"),
            (pl.col("a") + pl.col("b")).alias("sum")
        ]
    )
    .collect()
)

In this example `df` and `df_other` have the same length, so we can add the entire column.

In general the lengths of `df` and `df_other` do not need to match - the output of the expression on `df_other` just needs to fit the space in `df`, for example a single value such as a median that is broadcast to every row. We see an example of this in the exercises.

## Column names
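If the column names overlap between `df` and `df_other` we need to rename the columns in `df_other` in the `with_context` statement. Otherwise an expression such as `pl.col("Age")` refers to the column in `df` rather than the one in `df_other`. We also see an example of this in the exercises.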

## Exercises

In the exercises you will develop your understanding of:
- using a column from another `DataFrame` in an expression
- dealing with column name overlaps

### Exercise 1
We read the Titanic CSV and split it into a train and test `DataFrame`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
train_df = df[:720]
test_df = df[720:]

We print out the median values of `Age` in the train and test sets

In [None]:
print(
    f"Train median: {train_df['Age'].median()}",
    f"Test median: {test_df['Age'].median()}"
)

We want to fill the `null` values in `test_df` with the median value from `train_df` which is 28.0.
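To make the goal concrete, here is a small sketch in plain Python rather than Polars. The lists `train_ages` and `test_ages` are made-up values, not rows from the Titanic data. The point is that the fill value is computed from the training values only and then applied to the test values.

```python
from statistics import median

# Made-up ages, with None standing in for null
train_ages = [22.0, None, 28.0, 35.0]
test_ages = [19.0, None, 40.0]

# The statistic comes from the training values only
train_median = median(a for a in train_ages if a is not None)

# Missing test values are replaced with the training median
test_filled = [train_median if a is None else a for a in test_ages]
print(train_median, test_filled)
```

The exercise below asks for the same outcome, but expressed with Polars expressions on `LazyFrames`.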

Convert the train and test `DataFrames` into `LazyFrames`

In [None]:
train_df = <blank>
test_df = <blank>

Fill the `null` values in the `Age` column of the training `DataFrame` with the median value of the `Age` column

In [None]:
train_df_mod = (
    train_df
    .with_columns(
        pl.col("Age").<blank>
    )
)

In `test_df` fill the `null` values with the median value of the `Age` from the training `DataFrame`. Do this as a new column called `Age_filled`

In [None]:
test_df_mod = (
    test_df
    <blank>
    .collect()
)

The median of the `Age` column from `train_df` is 28.0 - check if `null` values in `test_df_mod` have been filled with 28.0

In [None]:
(
    test_df_mod
    .filter(pl.col("Age").is_null())
    .select(
        ["Age","Age_filled"]
    )
    .head(3)
)

We see that the `null` values have been filled with 27.5 - this is the median value from the `test_df` and not the `train_df`

Try to fill the `null` values again - but this time rename all the columns in `train_df` with the suffix `_train`

In [None]:
test_df_mod = (
    test_df
    <blank>
    .collect()
)

Check again to see that the `null` values in `Age_filled` are 28.0

In [None]:
(
    test_df_mod
    .filter(pl.col("Age").is_null())
    .select(
        ["Age","Age_filled"]
    )
    .head(3)
)

## Solutions

### Solution to exercise 1
We read the Titanic CSV and split it into a train and test `DataFrame`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
train_df = df[:720]
test_df = df[720:]

We print out the median values of `Age` in the train and test sets

In [None]:
print(
    f"Train median: {train_df['Age'].median()}",
    f"Test median: {test_df['Age'].median()}"
)

Convert the train and test `DataFrames` into `LazyFrames`

In [None]:
train_df = train_df.lazy()
test_df = test_df.lazy()

Fill the `null` values in the `Age` column of the training `DataFrame` with the median value of the `Age` column

In [None]:
train_df_mod = (
    train_df
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age").median())
    )
)

In `test_df` fill the `null` values with the median value of the `Age` from the training `DataFrame`. Do this as a new column called `Age_filled`

In [None]:
test_df_mod = (
    test_df
    .with_context(
        train_df
    )
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age").median()).alias("Age_filled")
    )
    .collect()
)

The median of the `Age` column from `train_df` is 28.0 - check if `null` values in `test_df_mod` have been filled with 28.0

In [None]:
(
    test_df_mod
    .filter(pl.col("Age").is_null())
    .select(
        ["Age","Age_filled"]
    )
    .head(3)
)

We see that the `null` values have been filled with 27.5 - this is the median value from the `test_df` and not the `train_df`. Both frames have a column called `Age`, so `pl.col("Age")` refers to the column in `test_df`

Try to fill the `null` values again - but this time rename all the columns in `train_df` with the suffix `_train`

In [None]:
test_df_mod = (
    test_df
    .with_context(
        train_df.select(pl.all().suffix("_train"))
    )
    .with_columns(
        pl.col("Age").fill_null(pl.col("Age_train").median()).alias("Age_filled")
    )
    .collect()
)

Check again to see that the `null` values in `Age_filled` are 28.0

In [None]:
(
    test_df_mod
    .filter(pl.col("Age").is_null())
    .select(
        ["Age","Age_filled"]
    )
    .head(3)
)