## Filtering one `DataFrame` by another `DataFrame`
By the end of this lecture you will be able to:
- filter a `DataFrame` to include values present in another `DataFrame`
- filter a `DataFrame` to include values not present in another `DataFrame`
- compare the performance of these `join` operations with a `filter` operation

In [None]:
import polars as pl

In [None]:
csvFile = "../data/cites_extract.csv"

We create a `DataFrame` from the CITES dataset introduced in previous lectures in this Section.

Each row of the `DataFrame` records an international trade in an endangered species.

In [None]:
dfCITES = pl.read_csv(csvFile)
dfCITES

In [None]:
isoCSVFile = "../data/countries_extract.csv"

We create a second `DataFrame` from an extract of the ISO country data.

In [None]:
dfISO = pl.read_csv(isoCSVFile)
dfISO

## Keep rows with values that are present in another `DataFrame`

We keep the rows of one `DataFrame` whose join-key values are present in another `DataFrame` with a `semi` join.

In this example we keep rows from the CITES data if the `Importer` country code appears in the `alpha-2` column of the ISO extract.

In [None]:
(
    dfCITES
    .join(
        dfISO,
        how="semi",
        left_on="Importer",
        right_on="alpha-2"
    )
)

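A `semi` join is like an inner join, but we do not add any columns from the right `DataFrame` and each row of the left `DataFrame` is returned at most once, even if it matches several rows on the right. It is purely a filtering operation.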

## Keep rows with values that are **not** present in another `DataFrame`
We keep the rows of one `DataFrame` whose join-key values are not present in another `DataFrame` with an `anti` join.

In this example we keep rows from the CITES data if the `Importer` country code **is not in the ISO extract**.

In [None]:
(
    dfCITES
    .join(
        dfISO,
        how="anti",
        left_on="Importer",
        right_on="alpha-2"
    )
)

Again, we do not add any columns from the right `DataFrame`.

## Comparing a `semi` join with `is_in`
A `semi` join does a similar job to using `is_in` within `filter`: both keep the rows whose value in a column appears in a set of values.

In the exercises we compare the performance of a `semi` join with that of `filter` and `is_in`.

## Exercises
In the exercises you will develop your understanding of:
- filtering a `DataFrame` with a `semi` join
- filtering a `DataFrame` with an `anti` join
- the relative performance of `filter` with `is_in` and a `semi`/`anti` join

### Exercise 1
We create a `DataFrame` from the Titanic data

In [None]:
pl.Config.set_fmt_str_lengths(80)
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)

Create a `DataFrame` that has the Name, Sex, Age and Survival status of **all the passengers** from the ship's manifest

In [None]:
dfManifesto = (
    <blank>
)
dfManifesto.head(3)

Create another `DataFrame` that only has the Name of the passengers that survived

In [None]:
dfSurvival = (
    <blank>
)
dfSurvival.head(3)

Filter `dfManifesto` to create a `DataFrame` with the details of the passengers that did not survive - all values in `Survived` should be 0

In [None]:
(
    <blank>
    .head(3)
)
        

Filter `dfManifesto` to create a `DataFrame` with the details of the passengers that did survive - all values in `Survived` should be 1

In [None]:
(
    <blank>
    .head(3)
)
        

### Exercise 2
We create a left `DataFrame` with `N` rows and up to `cardinality` distinct values in the string `id` column that we join on. The values are drawn at random, so some of the possible values may not appear.

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
# Cardinality is half of N
cardinality = N // 2
# Create the random array of values for the join column
stringArray = [f"id{i}" for i in np.random.randint(0,cardinality,N)]

dfLeft = pl.DataFrame(
    {
        "id":stringArray
    }
)
dfLeft.head(3)

Create the right `DataFrame` with a single row for each `id`.

The right `DataFrame` only has rows for half of the possible `id` values, so we can use it to filter `dfLeft`

In [None]:
dfRight = pl.DataFrame(
    {"id" : [f"id{i}" for i in np.arange(0,cardinality // 2)]}
)
dfRight.head(3)

Filter `dfLeft` by `dfRight` using `filter` and `is_in`. We use the `timeit` cell magic with one loop per run (`-n1`) and 3 repeated runs (`-r3`)

In [None]:
%%timeit -n1 -r3 
(
    <blank>
)

Filter `dfLeft` by `dfRight` with a `semi` join

In [None]:
%%timeit -n1 -r3 
(
    <blank>
)

- Vary `N` to see if the relative difference changes with scale
- Vary `cardinality` (e.g. set equal to `N // 8` or a small number like `10`) to see how changes in cardinality affect relative performance

## Solutions

### Solution to exercise 1

We create a `DataFrame` from the Titanic data

In [None]:
pl.Config.set_fmt_str_lengths(80)
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)

Create a `DataFrame` that has the Name, Sex, Age and Survival status of **all the passengers** from the ship's manifest

In [None]:
dfManifesto = (
    df
    .select(["Name","Sex","Age","Survived"])
)
dfManifesto.head(3)

Create another `DataFrame` that only has the Name of the passengers that survived

In [None]:
dfSurvival = (
    df
    .filter(
        pl.col("Survived") == 1
    )
    .select("Name")
)
dfSurvival.head(3)

Filter `dfManifesto` to create a `DataFrame` with the details of the passengers that did not survive (all values in `Survived` should be 0)

In [None]:
(
    dfManifesto
    .join(
        dfSurvival,
        on="Name",
        how="anti"
    )
    .head(3)
)

Filter `dfManifesto` to create a `DataFrame` with the details of the passengers that did survive - all values in `Survived` should be 1

In [None]:
(
    dfManifesto
    .join(
        dfSurvival,
        on="Name",
        how="semi"
    )
    .head(3)
)

### Solution to exercise 2
We create a left `DataFrame` with `N` rows and up to `cardinality` distinct values in the string `id` column that we join on. The values are drawn at random, so some of the possible values may not appear.

In [None]:
import numpy as np
np.random.seed(0)

N = 1_000_000
# Cardinality is half of N
cardinality = N // 2
# Create the random array of values for the join column
stringArray = [f"id{i}" for i in np.random.randint(0,cardinality,N)]

dfLeft = pl.DataFrame(
    {
        "id":stringArray
    }
)
dfLeft.head(3)

Create the right `DataFrame` with a single row for each `id`.

The right `DataFrame` only has rows for half of the possible `id` values, so we can use it to filter `dfLeft`

In [None]:
dfRight = pl.DataFrame(
    {"id" : [f"id{i}" for i in np.arange(0,cardinality // 2)]}
)
dfRight.head(3)

Filter `dfLeft` by `dfRight` using `filter` and `is_in`

In [None]:
%%timeit -n1 -r3 
(
    dfLeft
    .filter(
        pl.col("id").is_in(dfRight["id"])
    )
)

Filter `dfLeft` by `dfRight` with a `semi` join

In [None]:
%%timeit -n1 -r3 
(
    dfLeft
    .join(
        dfRight,
        on="id",
        how="semi"
    )
)

In this case we see that the `semi` join is faster, although the size of the gap may differ with your machine and Polars version.

- Vary `N` to see if the relative difference changes with scale
- Vary `cardinality` (e.g. set equal to a small number like `10`) to see if changes in cardinality affect relative performance