# Left, inner, outer, cross and fast-track joins
By the end of this lecture you will be able to:
- do left, inner, outer and cross joins between `DataFrames` in eager mode
- do joins between `LazyFrames` in lazy mode and streaming mode
- do fast-track joins on sorted integer columns

In [None]:
import polars as pl
pl.Config.set_tbl_rows(10)

## CITES Dataset
For this lecture we use an extract from a database on international trade in endangered species gathered by the CITES organisation.

The CSV holds CITES trade records for 2021.

In [None]:
csvFile = "../data/cites_extract.csv"

In [None]:
dfCITES = pl.read_csv(csvFile)
dfCITES

The `DataFrame` shows:
- the `Year` in which the trade occurred
- the `Importer` and `Exporter` country as 2-letter ISO country codes
- the scientific name for the `Taxon` and
- the `Quantity` of items in the trade

For importers and exporters we would like to have:
- the full country name
- the region of that country

We will join the trade data to the ISO country metadata in the following CSV

In [None]:
isoCSVFile = "../data/countries_extract.csv"

In [None]:
dfISO = pl.read_csv(isoCSVFile)
dfISO

This `DataFrame` has:
- `alpha-2`: the 2-letter country code
- `name`: the full name of the country
- `region`: the region of the country

## Left join
In a left join we keep every row of the left `DataFrame` and add the columns from the right `DataFrame` wherever the values in the join columns match. Rows in the left `DataFrame` with no match get `null` in the added columns.

In [None]:
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="left")
)

We join on:
- the `Importer` column for the left `DataFrame` and 
- `alpha-2` for the right `DataFrame`

In this case:
- we now have the `name` and `region` columns from `dfISO` that give the name and region for the importing country
- we have `null` in the last row because the country code "UA" is missing from `dfISO`.

We want to rename `name` and `region` columns to reflect that these are the importer values

In [None]:
(dfCITES
 .join(
     dfISO,
     left_on="Importer",
     right_on="alpha-2", 
     how="left"
     )
 .rename(
     {
         "name":"name_importer",
         "region":"region_importer"
     }
 )
)

We will join the exporter values in the exercises.

The `join` method also has a `suffix` option. This adds a suffix to the column names in the right `DataFrame` *for column names that occur in both `DataFrames`*.

## Inner join
In an inner join we only retain the rows in the left `DataFrame` where we can join to a value in the right `DataFrame`


In [None]:
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="inner"
    )
)

The final row that had `null` values for `name` and `region` is not present with an inner join.

## Outer join
In an outer join we return all rows from both `DataFrames` with `null` values where the value in the join column is not present in both `DataFrames`

In [None]:
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="outer"
    )
)

In the first row there are `null` values for the `dfCITES` columns as `BJ` is present in `dfISO` but not in `dfCITES`

## Cross join
In a cross join we pair every row of the left `DataFrame` with every row of the right `DataFrame`. A cross join does not use join columns, so we do not pass `left_on` or `right_on`.

In [None]:
(
    dfCITES
    .join(
        dfISO,
        how="cross"
    )
)

## Joins in lazy mode
We can do joins in lazy mode by joining on `LazyFrames` instead of `DataFrames`

Join operations - and cross joins in particular - can be memory intensive. The streaming feature in Polars can help reduce the memory pressure by running the operation in batches.

In this example we do the `join` in streaming mode by:
- converting `dfCITES` and `dfISO` to `LazyFrames` before joining
- calling `collect(streaming=True)` at the end to evaluate in streaming mode

In [None]:
(
    dfCITES
    .lazy()
    .join(
        dfISO.lazy(),
        how="cross"
    )
    .collect(streaming=True)
)

## Joining on sorted columns
When we join on sorted **integer** columns Polars uses a fast-track algorithm.

To use the fast-track algorithm Polars needs to know the join columns are sorted. See the lecture on Sorting and fast-track algorithms in Section 3 if you want a reminder of how this works.

We explore the performance effect of joining on sorted columns in the exercises.

## Exercises

In the exercises you will develop your understanding of:
- doing a left join of two `DataFrames`
- doing an inner join of two `DataFrames`
- doing fast-track joins on sorted integer columns

### Exercise 1
Do a left join of the CITES trade extract with the country data on the importer column

In [None]:
dfCITES = pl.read_csv(csvFile)
dfISO = pl.read_csv(isoCSVFile)
(
    <blank>
)

Now add a left join with the country data on the **exporter** column

In [None]:
(
    <blank>
)

Do an inner join with the country data for both importer and exporter

In [None]:
(
    <blank>
)

### Exercise 2
In this exercise we see the effect of joins on sorted integers

We first create a pre-sorted array of `N` integers to be the join keys.

The `cardinality` variable sets an upper bound on the number of unique join keys.

In [None]:
import numpy as np
np.random.seed(0)

N = 100_000
cardinality = N // 2

We create a left-hand `DataFrame` with:
- a sorted `id` column and
- a random `values` column

We create a right-hand `DataFrame` with
- a sorted `id` column
- a metadata column (equal to the `id` column in this case)

In [None]:
def createDataFrames(N:int,cardinality:int):
    # Create a random array with values up to cardinality and then sort it to be the `id` column
    sortedArray = np.sort(np.random.randint(0,cardinality,N))
    dfLeft = (
        pl.DataFrame(
            {
                "id":[i for i in sortedArray],
                "values":np.random.standard_normal(N)
            }
        )
    )
    # Create the right-hand `DataFrame` with the `id` column and arbitrary metadata
    dfRight = (
        pl.DataFrame(
            {
                "id":[i for i in range(cardinality)],
                "meta":[i for i in range(cardinality)]
            }
        )
    )
    return dfLeft, dfRight
dfLeft,dfRight = createDataFrames(N = N, cardinality = cardinality)
dfLeft.head()

In [None]:
dfRight.head()

Check the flags to see if Polars knows the `id` column is sorted on the left and right `DataFrames`

In [None]:
print(<blank>)
print(<blank>)

Time the performance for an unsorted join

In [None]:
%%timeit -n1 -r3
(
  <blank>  
)

Create new left and right `DataFrames` where Polars knows the `id` column is sorted

In [None]:
dfLeftSorted = (
    <blank>
)
                
dfRightSorted = (
    <blank>
)


Check the flags to see if Polars knows the `id` column is sorted on these new `DataFrames`

In [None]:
print(<blank>)
print(<blank>)

Time the sorted join performance

In [None]:
%%timeit -n1 -r3
(
  <blank>  
)

Compare performance if only the left `DataFrame` is sorted. Hint: use `dfLeftSorted` and `dfRight`

In [None]:
%%timeit -n1 -r3
(
  <blank>  
)

Compare the relative performance between sorted and unsorted joins when `cardinality` is low (say `cardinality = 100`)

## Solutions

### Solution to Exercise 1
Do a left join of the CITES trade extract with the country data on the importer column

In [None]:
dfCITES = pl.read_csv(csvFile)
dfISO = pl.read_csv(isoCSVFile)
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="left"
    )
    .rename(
        {"name":"name_importer","region":"region_importer"}
    )
)

Now add a left join with the country data on the **exporter** column

In [None]:
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="left"
    )
    .rename(
        {"name":"name_importer","region":"region_importer"}
    )
    .join(
        dfISO,
        left_on="Exporter",
        right_on="alpha-2", 
        how="left"
    )
    .rename({"name":"name_exporter","region":"region_exporter"})
)

Do an inner join with the country data for both importer and exporter

In [None]:
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="inner"
    )
    .rename(
        {"name":"name_importer","region":"region_importer"}
    )
    .join(
        dfISO,
        left_on="Exporter",
        right_on="alpha-2", 
        how="inner"
    )
    .rename(
        {"name":"name_exporter","region":"region_exporter"}
    )
)

### Solution to Exercise 2

In [None]:
import numpy as np
np.random.seed(0)
N = 10_000_000
cardinality = N // 2
dfLeft,dfRight = createDataFrames(N = N, cardinality = cardinality)

Check the flags to see if Polars knows the `id` column is sorted on the left and right `DataFrames`

In [None]:
print(dfLeft["id"].flags)
print(dfRight["id"].flags)

Time the performance for an unsorted join

In [None]:
%%timeit -n1 -r3
(
    dfLeft.join(dfRight,on="id")
)

Create new `DataFrames` where Polars knows the `id` column is sorted

In [None]:
dfLeftSorted = (
    dfLeft
    .with_columns(pl.col("id").set_sorted())
)
                
dfRightSorted = (
    dfRight
    .with_columns(pl.col("id").set_sorted())
)


Check to see if Polars knows the `id` columns are sorted

In [None]:
print(dfLeftSorted["id"].flags)
print(dfRightSorted["id"].flags)

Time the sorted join performance

In [None]:
%%timeit -n1 -r3

(
    dfLeftSorted.join(dfRightSorted,left_on="id",right_on="id")
)

This is much faster than the unsorted join.

Compare performance if only the left `DataFrame` is sorted

In [None]:
%%timeit -n1 -r3
(
    dfLeftSorted.join(dfRight,left_on="id",right_on="id")
)

There is still a benefit if just the left `DataFrame` is sorted.

Now time the join when only the right `DataFrame` is sorted

In [None]:
%%timeit -n1 -r3
(
    dfLeft.join(dfRightSorted,left_on="id",right_on="id")
)

So there is no performance benefit from just the right `DataFrame` being sorted 

Compare the relative performance when `cardinality` is low (say `cardinality = 100`)

With low cardinality the joins are much faster overall, but the difference in performance from sorting is much smaller.