## Concatenation
By the end of this lecture you will be able to:
- vertically concatenate a list of `DataFrames`
- horizontally concatenate a list of `DataFrames`
- diagonally concatenate a list of `DataFrames`


In [None]:
import polars as pl

We create a first `DataFrame` with fake trade records from 2020

In [None]:
df2020 = pl.DataFrame(
    [
        {"year":2020,"exporter":"India","importer":"USA","quantity":0},
        {"year":2020,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df2020

We now create a second `DataFrame` with fake trade records from 2021

In [None]:
df2021 = pl.DataFrame(
    [
        {"year":2021,"exporter":"India","importer":"USA","quantity":2},
        {"year":2021,"exporter":"India","importer":"USA","quantity":3},
    ]
)
df2021

## Vertical concatenation

We combine the 2020 and 2021 `DataFrames` into a single `DataFrame` with `pl.concat`

In [None]:
dfVertical = (
    pl.concat(
        [df2020,df2021]
    )
)
dfVertical

Vertical concatenation fails when:
- the `DataFrames` do not have the same column names
- the matching columns do not have the same dtypes

## Horizontal concatenation
We create another `DataFrame` that has more details about each of the trades in 2020

In [None]:
df2020Details = pl.DataFrame(
    [
        {"item":"Clothes","value":10},
        {"item":"Machinery","value":100},
    ]
)
df2020Details

We combine these details with the original records using a horizontal concatenation.

In [None]:
dfHorizontal = pl.concat(
    [
        df2020,df2020Details
    ]
    ,how="horizontal"
)
dfHorizontal

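Horizontal concatenation fails when:
- the `DataFrames` have overlapping column names or
- the `DataFrames` have a different number of rows (recent Polars versions of `pl.concat` instead pad the shorter `DataFrame` with `null` values)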

We can also concatenate horizontally with `hstack`

In [None]:
(
    df2020.hstack(df2020Details)
)

With `hstack` we can also add a `list` of `Series` as columns

In [None]:
(
    df2020.hstack([df2020Details["value"]])
)

## Diagonal concatenation

We are now looking at new fake trade records for 2020 and 2021 between China and the USA.

In 2020 the schema of the trade records is the same as we saw above with:
- `year`
- `exporter`
- `importer` and
- `quantity`

In 2021 the schema changed and also includes:
- `item` and 
- `value`

In [None]:
df2020 = pl.DataFrame(
    [
        {"year":2020,"exporter":"China","importer":"USA","quantity":0},
        {"year":2020,"exporter":"China","importer":"USA","quantity":1},
    ]
)
df2021 = pl.DataFrame(
    [
        {"year":2021,"exporter":"China","importer":"USA","quantity":2,"item":"Clothes","value":10},
        {"year":2021,"exporter":"China","importer":"USA","quantity":3,"item":"Machinery","value":100},
    ]
)

We want to combine these records into a single `DataFrame`. As the column names are not the same we cannot do a vertical concatenation.

Instead we do a diagonal concatenation.

In [None]:
pl.concat([df2020,df2021],how="diagonal")

A diagonal concatenation stacks the rows vertically for the columns that match and fills in `null` values where a `DataFrame` does not have a column.

Diagonal concatenation can be a quick way to work with multiple CSVs or other files where:
- the columns are not the same in all files
- the order of the columns is not the same in all files

A hypothetical example of this is provided here

In [None]:
# df_list = []
# for file_path in list_of_file_paths:
#     df_list.append(
#         pl.read_csv(file_path)
#     )
# df = pl.concat(df_list,how="diagonal")

The disadvantage of diagonal concatenation is that the output has every column that appears in any of the inputs.

If this is an issue you can inspect the `DataFrame` to see which columns you actually want and then pass these as a list to the `columns` argument of `read_csv`, or select them after `scan_csv`

## Exercises

### Exercise 1
We split the Titanic dataset into `dfLeft` and `dfRight`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
dfLeft = df[:,:6]
dfRight = df[:,5:]

Horizontally concatenate `dfLeft` and `dfRight`

In [None]:
pl.concat(<blank>)

### Exercise 2

You are given the following data from the sales of a bike shop. 

In [None]:
sales2020 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales2021 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]

Combine the full set of data into a single `DataFrame`

In [None]:
<blank>

Combine the overlapping columns into a single `DataFrame` with vertical concatenation

### Exercise 3
In the lecture on quantiles in the Statistics section we learned how to calculate quantiles.

In this exercise we will combine multiple quantiles into a single `DataFrame`.

As a reminder, this is how we calculate a single quantile on the floating point columns

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
q = 0.25
(
    df
    .select(
            pl.col(pl.Float64).quantile(q)
        )
)

We want to produce a `DataFrame` that has:
- the 0.25, 0.5 and 0.75 quantiles of the floating point columns on separate rows
- a column called `percentiles` to show the quantile for each row

Create this `DataFrame` using vertical concatenation.

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
<blank>

Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

Concatenate the outputs

## Solutions

### Solution to Exercise 1

Horizontally concatenate `dfLeft` and `dfRight`. Both contain the `Age` column, so we drop it from `dfRight` to avoid overlapping column names.

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
dfLeft = df[:,:6]
dfRight = df[:,5:]

In [None]:
pl.concat([dfLeft,dfRight.drop("Age")],how="horizontal")

### Solution to Exercise 2

In [None]:
sales2020 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales2021 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]
dfSales2020 = pl.DataFrame(sales2020)
dfSales2021 = pl.DataFrame(sales2021)

Combine the full set of data into a single `DataFrame`

In [None]:
pl.concat([dfSales2020,dfSales2021],how="diagonal")

Combine the data with overlapping columns into a single `DataFrame`

In [None]:
pl.concat(
    [
        dfSales2020,
        dfSales2021.select(["make","model","quantity"]),
    ]
)

### Solution to Exercise 3

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
    )

Repeat this operation but this time on each iteration add a column called `percentiles` that captures the quantile on that iteration.

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )

Concatenate the outputs

In [None]:
csvFile = "../data/titanic.csv"
df = pl.read_csv(csvFile)
quantiles = [0.25,0.5,0.75]
dfList = []
for q in quantiles:
    dfList.append(
        df
        .select(
            pl.col(pl.Float64).quantile(q)
        )
        .with_columns(
            pl.lit(q).alias("percentiles")
        )
    )
pl.concat(dfList)