# Left, inner and fast-track joins
By the end of this lesson you will be able to:
- do inner joins between two dataframes in eager mode
- do left joins between two dataframes in eager mode
- do fast-track joins on sorted integer columns

In [None]:
import polars as pl

## CITES Dataset

The CITES organisation tracks international trade in endangered species.

This CSV has an extract of CITES trade data for 2021

In [None]:
csvFile = "../data/cites_extract.csv"

In [None]:
dfCITES = pl.read_csv(csvFile)
dfCITES

The `DataFrame` shows:
- the `Year` in which the trade occurred
- the `Importer` and `Exporter` country in 2-letter ISO country codes
- the scientific name for the `Taxon` and
- the `Quantity` of items in the trade

For importers and exporters we would like to have:
- the full country name
- the region of that country

In the following CSV we have an extract from the ISO country data

In [None]:
isoCSVFile = "../data/countries_extract.csv"

In [None]:
dfISO = pl.read_csv(isoCSVFile)
dfISO

This `DataFrame` has:
- `alpha-2`: the 2-letter country code
- `name`: the full name of the country
- `region`: the region of the country

## Left join
In a left join we go through the left `DataFrame` row-by-row and try to add the data from the right `DataFrame` based on a join column in each `DataFrame`

In [None]:
dfCITES = pl.read_csv(csvFile)

(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="left")
)

We join on:
- the `Importer` column for the left `DataFrame` and 
- `alpha-2` for the right `DataFrame`

In this case:
- we now have the `name` and `region` columns from `dfISO` that give the name and region for the importing country
- we have `null` in the last row because the country code "UA" is missing from `dfISO`.

We want to rename `name` and `region` to reflect that these are the importer values

In [None]:
dfCITES = pl.read_csv(csvFile)
(dfCITES
 .join(
     dfISO,
     left_on="Importer",
     right_on="alpha-2", 
     how="left"
     )
 .rename(
     {
         "name":"name_importer",
         "region":"region_importer"
     }
 )
)

We will join the exporter values in the exercises.

The `join` method also has a `suffix` option. This adds a suffix to the column names in the right `DataFrame` *for column names that occur in both `DataFrames`*.

## Inner join
In an inner join we only retain the rows in the left `DataFrame` where we can join to a value in the right `DataFrame`


In [None]:
dfCITES = pl.read_csv(csvFile)
(
    dfCITES
    .join(
        dfISO,
        left_on="Importer",
        right_on="alpha-2", 
        how="inner"
    )
)

The final row that had `null` values for `name` and `region` is not present with an inner join.

## Joining on sorted columns
When we join on **integer** columns that are sorted, Polars uses a fast-track algorithm.

To use the fast-track algorithm Polars needs to know the join columns are sorted. See the lecture on Sorting and fast-track algorithms in Section 3 if you want a reminder on this.

We explore the performance effect of joining on sorted columns in the exercises.


## Exercises

In the exercises you will develop your understanding of:
- doing a left join of two `DataFrames`
- doing an inner join of two `DataFrames`
- doing fast-track joins on sorted integer columns

## Exercise 1
Do a left join of the CITES trade extract with the country data on the importer column

In [None]:
dfCITES = pl.read_csv(csvFile)
(
    <blank>
)

Now add a left join with the country data on the **exporter** column

In [None]:
dfCITES = pl.read_csv(csvFile)
(
    <blank>
)

Do an inner join with the country data for both importer and exporter

In [None]:
dfCITES = pl.read_csv(csvFile)
(
    <blank>
)

## Exercise 2
In this exercise we see the effect of joins on sorted integers

We first create a pre-sorted array of `N` integers to be the join keys

In [None]:
# Set for only 4 rows to be printed
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)

N = 100_000
# Create a random array with values up to N/2 and then sort it
sortedArray = np.sort(np.random.randint(0,N //2,N))

We create the left-hand `DataFrame` with the sorted array

In [None]:
dfLeft = (
    pl.DataFrame(
        {
            "id":[i for i in sortedArray],
            "values":np.random.standard_normal(N)
        }
    )
)
dfLeft

We create the right-hand `DataFrame` with some metadata on the `id` column

In [None]:
dfRight = (
    pl.DataFrame(
        {
            "id":[i for i in range(N // 2)],
            "meta":[i for i in range(N //2)]
        }
    )
)
dfRight

Check the flags to see if Polars knows the `id` column is sorted

In [None]:
print(<blank>)
print(<blank>)

Time the performance for an unsorted join

In [None]:
%%timeit -n1 -r3
(
  <blank>  
)

Create new `DataFrames` where Polars knows the `id` column is sorted

In [None]:
dfLeftSorted = (
    <blank>
)
                
dfRightSorted = (
    <blank>
)


Check the flags to see if Polars knows the `id` column is sorted

In [None]:
print(<blank>)
print(<blank>)

Time the sorted join performance

In [None]:
%%timeit -n1 -r3
(
  <blank>  
)

Compare performance if only the left `DataFrame` is sorted. Hint: use `dfRight`

Change `N` to see how the relative performance differs with size

## Solutions

## Solution to Exercise 1
Do a left join of the CITES trade extract with the country data on the importer column

In [None]:
dfCITES = pl.read_csv(csvFile)
(dfCITES
 .join(dfISO,left_on="Importer",right_on="alpha-2", how="left")
 .rename({"name":"name_importer","region":"region_importer"})
)

Now add a left join with the country data on the **exporter** column

In [None]:
dfCITES = pl.read_csv(csvFile)
(dfCITES
 .join(dfISO,left_on="Importer",right_on="alpha-2", how="left")
 .rename({"name":"name_importer","region":"region_importer"})
 .join(dfISO,left_on="Exporter",right_on="alpha-2", how="left")
 .rename({"name":"name_exporter","region":"region_exporter"})
)

Do an inner join with the country data for both importer and exporter

In [None]:
dfCITES = pl.read_csv(csvFile)
(dfCITES
 .join(dfISO,left_on="Importer",right_on="alpha-2", how="inner")
 .rename({"name":"name_importer","region":"region_importer"})
 .join(dfISO,left_on="Exporter",right_on="alpha-2", how="inner")
 .rename({"name":"name_exporter","region":"region_exporter"})
)

## Solution to Exercise 2

In [None]:
pl.Config.set_tbl_rows(4)
import numpy as np
np.random.seed(0)
N = 10_000_000
sortedArray = np.sort(np.random.randint(0,N //2,N))
dfLeft = (
    pl.DataFrame(
        {
            "id":[i for i in sortedArray],
            "values":np.random.standard_normal(N)
        }
    )
)
dfLeft

In [None]:
dfRight = (
    pl.DataFrame(
        {
            "id":[i for i in range(N // 2)],
            "meta":[i for i in range(N //2)]
        }
    )
)
dfRight

Check the flags to see if Polars knows the `id` column is sorted

In [None]:
print(dfLeft["id"].flags)
print(dfRight["id"].flags)

Time the performance for an unsorted join

In [None]:
%%timeit -n1 -r3
(
    dfLeft.join(dfRight,on="id")
)

Create new `DataFrames` where Polars knows the `id` column is sorted

In [None]:
dfLeftSorted = (
    dfLeft
    .with_columns(pl.col("id").set_sorted())
)
                
dfRightSorted = (
    dfRight
    .with_columns(pl.col("id").set_sorted())
)


Check to see if Polars knows the `id` columns are sorted

In [None]:
print(dfLeftSorted["id"].flags)
print(dfRightSorted["id"].flags)

Time the sorted join performance

In [None]:
%%timeit -n1 -r3

(
    dfLeftSorted.join(dfRightSorted,left_on="id",right_on="id")
)

Compare performance if only the left `DataFrame` is sorted

In [None]:
%%timeit -n1 -r3

(
    dfLeftSorted.join(dfRight,left_on="id",right_on="id")
)