## Filtering one `DataFrame` by another `DataFrame`

In [3]:
import polars as pl

In [4]:
csv_file = "data/cites_extract.csv"

In [5]:
df_CITES = pl.read_csv(csv_file)
df_CITES

Year,Importer,Exporter,Taxon,Quantity
i64,str,str,str,f64
2021,"""KR""","""DE""","""Python reticulatus""",12.0
2021,"""TR""","""DE""","""Python reticulatus""",2.0
2021,"""NZ""","""DE""","""Python bivittatus""",2.0
2021,"""TH""","""BJ""","""Python regius""",200.0
2021,"""KR""","""CZ""","""Python bivittatus""",28.0
2021,"""TW""","""DE""","""Python reticulatus""",1.0
2021,"""UA""","""DE""","""Python reticulatus""",4.0


In [6]:
iso_csv_file = "data/countries_extract.csv"

In [7]:
df_ISO = pl.read_csv(iso_csv_file)
df_ISO

alpha-2,name,region
str,str,str
"""BJ""","""Benin""","""Africa"""
"""CZ""","""Czechia""","""Europe"""
"""KR""","""Korea, Republic of""","""Asia"""
"""NZ""","""New Zealand""","""Oceania"""
"""TW""","""Taiwan, Province of China""","""Asia"""
"""TH""","""Thailand""","""Asia"""
"""TR""","""Turkey""","""Asia"""


## Keep rows with values that are present in another `DataFrame`

We keep the rows whose join-key value is present in another `DataFrame` with a `semi` join.

In [8]:
df_CITES.join(
    df_ISO,
    how="semi",
    left_on="Importer",
    right_on="alpha-2"
)

Year,Importer,Exporter,Taxon,Quantity
i64,str,str,str,f64
2021,"""KR""","""DE""","""Python reticulatus""",12.0
2021,"""TR""","""DE""","""Python reticulatus""",2.0
2021,"""NZ""","""DE""","""Python bivittatus""",2.0
2021,"""TH""","""BJ""","""Python regius""",200.0
2021,"""KR""","""CZ""","""Python bivittatus""",28.0
2021,"""TW""","""DE""","""Python reticulatus""",1.0


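A `semi` join is like an `inner` join, but it does not add any columns from the right `DataFrame`. Here the row with importer `UA` is dropped because `UA` does not appear in the `alpha-2` column of `df_ISO`.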

## Keep rows with values that are **not** present in another `DataFrame`

We keep the rows whose join-key value is not present in another `DataFrame` with an `anti` join.

In [9]:
df_CITES.join(
    df_ISO,
    how="anti",
    left_on="Importer",
    right_on="alpha-2"
)

Year,Importer,Exporter,Taxon,Quantity
i64,str,str,str,f64
2021,"""UA""","""DE""","""Python reticulatus""",4.0


## Exercises

### Exercise 1

In [10]:
pl.Config.set_fmt_str_lengths(80)
csv_file = "data/titanic.csv"
df = pl.read_csv(csv_file)

Create a `DataFrame` that has the Name, Sex, Age and Survival status of **all the passengers** from the ship's manifest.

In [11]:
dfManifesto = df.select("Name", "Sex", "Age", "Survived")

dfManifesto.head(3)

Name,Sex,Age,Survived
str,str,f64,i64
"""Braund, Mr. Owen Harris""","""male""",22.0,0
"""Cumings, Mrs. John Bradley (Florence Briggs Thayer)""","""female""",38.0,1
"""Heikkinen, Miss. Laina""","""female""",26.0,1


Create another `DataFrame` that has only the Name of the passengers who survived.

In [None]:
dfSurvival = df.filter(
    pl.col("Survived") == 1
).select("Name")

dfSurvival.head(3)

Name
str
"""Cumings, Mrs. John Bradley (Florence Briggs Thayer)"""
"""Heikkinen, Miss. Laina"""
"""Futrelle, Mrs. Jacques Heath (Lily May Peel)"""


Filter `dfManifesto` to create a `DataFrame` with the details of the passengers who did not survive. All values in `Survived` should be 0.

In [16]:
dfManifesto.join(
    dfSurvival,
    on="Name",
    how="anti"
).head(3)

Name,Sex,Age,Survived
str,str,f64,i64
"""Braund, Mr. Owen Harris""","""male""",22.0,0
"""Allen, Mr. William Henry""","""male""",35.0,0
"""Moran, Mr. James""","""male""",,0


Filter `dfManifesto` to create a `DataFrame` with the details of the passengers who did survive. All values in `Survived` should be 1.

In [18]:
dfManifesto.join(
    dfSurvival,
    on="Name",
    how="semi"
).head(3)

Name,Sex,Age,Survived
str,str,f64,i64
"""Cumings, Mrs. John Bradley (Florence Briggs Thayer)""","""female""",38.0,1
"""Heikkinen, Miss. Laina""","""female""",26.0,1
"""Futrelle, Mrs. Jacques Heath (Lily May Peel)""","""female""",35.0,1


### Exercise 2

In [19]:
import numpy as np
np.random.seed(0)

N = 1_000_000
# Cardinality is half of N
cardinality = N // 2
# Create the random array of values for the join column
stringArray = [f"id{i}" for i in np.random.randint(0,cardinality,N)]

df_left = pl.DataFrame(
    {
        "id":stringArray
    }
)
df_left.head(3)

id
str
"""id461484"""
"""id305711"""
"""id435829"""


Create the right `DataFrame` with a single row per `id`, covering only the lower half of the `id` values that can occur in `df_left`.

In [20]:
df_right = pl.DataFrame(
    {"id" : [f"id{i}" for i in np.arange(0,cardinality // 2)]}
)
df_right.head(3)

id
str
"""id0"""
"""id1"""
"""id2"""


Filter `df_left` by `df_right` using `filter` and `is_in`

In [None]:
%%timeit -n1 -r3 

df_left.filter(
    pl.col("id").is_in(df_right["id"].implode())  # implode gathers the Series into a single list value
)

145 ms ± 6.59 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)


Filter `df_left` by `df_right` with a `semi` join

In [25]:
%%timeit -n1 -r3 

df_left.join(
    df_right,
    on="id",
    how="semi"
)

61.5 ms ± 12.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
