# Concatenation

In [2]:
import polars as pl

In [3]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"India","importer":"USA","quantity":0},
        {"year":2000,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df_2000

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1


In [4]:
df_2001 = pl.DataFrame(
    [
        {"year":2001,"exporter":"India","importer":"USA","quantity":2},
        {"year":2001,"exporter":"India","importer":"USA","quantity":3},
    ]
)
df_2001

year,exporter,importer,quantity
i64,str,str,i64
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


## Combining `DataFrames` vertically

To combine two `DataFrames` into a new one, Polars can manage the data in memory in three different ways:
- keeping the data in the original two locations in memory and pointing the new `DataFrame` at these original locations
- copying the data to a single location in memory and pointing the new `DataFrame` at this single location
- appending the data from the second `DataFrame` to the location of the first `DataFrame`

### `vstack`

`vstack` keeps the data from both `DataFrames` in their current locations in memory and points the new `DataFrame` to those locations.

In [5]:
df_2000.vstack(
    df_2001
)

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


### Rechunk
`vstack` is computationally very cheap because no data is copied.

However, subsequent operations (e.g. `group_by`) are slower on the chunked data than they would be if the data had been *rechunked* (i.e. copied from the original two chunks to a single new location in memory).

In [6]:
df_2000.vstack(
    df_2001
).rechunk()

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


### Extend
Append one `DataFrame` to another with `extend`, which:

- copies the data from the second `DataFrame` and appends it to the first `DataFrame`
- modifies the first `DataFrame` in place, like the `inplace` parameter in pandas

In [7]:
df_2000.extend(
    df_2001
)

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


In [8]:
df_2000

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


Because `extend` modified `df_2000` in place, recreate `df_2000` with its original two rows.

In [9]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"India","importer":"USA","quantity":0},
        {"year":2000,"exporter":"India","importer":"USA","quantity":1},
    ]
)
df_2000

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1


### Use case of `vstack`, `rechunk` and `extend`
- When combining `DataFrames` before further transformations, group-bys, joins etc., it is normally best to use `vstack` followed by `rechunk` so that all of the data sits together in memory. In practice it is simpler to use `pl.concat`.
- When combining two `DataFrames` without doing further operations on the result, use `vstack`.
- When adding a small `DataFrame` to a large `DataFrame`, use `extend`, as it only copies the data in the small `DataFrame`.

### Vertically concatenating `DataFrames`

Combine a `list` of `DataFrames` with `pl.concat`.

For clarity, set the `how="vertical"` argument explicitly, even though it is the default.

In [12]:
pl.concat(
    [df_2000, df_2001],
    how="vertical"
)

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


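`pl.concat` does the `vstack` and, when rechunking is enabled, the `rechunk` as well. We can turn the `rechunk` off by setting `rechunk=False`. The default value of this argument has changed between Polars versions, so it is safest to set it explicitly when it matters.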

In [13]:
df_vertical = pl.concat(
    [df_2000, df_2001],
    how="vertical",
    rechunk=False
)

df_vertical

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


### Handling different dtypes in vertical concatenation

Polars expects the column names and dtypes to match when doing vertical concatenation.

In [14]:
df_2001_float = (
    df_2001
    .with_columns(
        pl.col("quantity").cast(pl.Float64)
    )
)
df_2001_float

year,exporter,importer,quantity
i64,str,str,f64
2001,"""India""","""USA""",2.0
2001,"""India""","""USA""",3.0


When concatenating `DataFrames` whose dtypes differ, Polars raises a `SchemaError`.

In [15]:
pl.concat(
    [df_2000, df_2001_float]
)

SchemaError: type Float64 is incompatible with expected type Int64

In [16]:
pl.concat(
    [
        df_2000,
        df_2001_float.with_columns(
            pl.col("quantity").cast(pl.Int64)
        )
    ]
)

year,exporter,importer,quantity
i64,str,str,i64
2000,"""India""","""USA""",0
2000,"""India""","""USA""",1
2001,"""India""","""USA""",2
2001,"""India""","""USA""",3


Instead of casting manually, we can set the `how` parameter to `vertical_relaxed`, which casts mismatched columns to their common supertype.

In [17]:
pl.concat(
    [df_2000, df_2001_float],
    how="vertical_relaxed"
)

year,exporter,importer,quantity
i64,str,str,f64
2000,"""India""","""USA""",0.0
2000,"""India""","""USA""",1.0
2001,"""India""","""USA""",2.0
2001,"""India""","""USA""",3.0


## Horizontal concatenation
`DataFrames` that are concatenated horizontally must have:
- the same number of rows
- different column names

In [18]:
df_2000_details = pl.DataFrame(
    [
        {"item":"Clothes","value":10},
        {"item":"Machinery","value":100},
    ]
)
df_2000_details

item,value
str,i64
"""Clothes""",10
"""Machinery""",100


### `hstack`

In [19]:
df_2000.hstack(
    df_2000_details
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2000,"""India""","""USA""",0,"""Clothes""",10
2000,"""India""","""USA""",1,"""Machinery""",100


### Horizontal concatenation

In [20]:
pl.concat(
    [
        df_2000,
        df_2000_details
    ],
    how="horizontal"
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2000,"""India""","""USA""",0,"""Clothes""",10
2000,"""India""","""USA""",1,"""Machinery""",100


If the `DataFrames` share one or more columns and the values in those columns overlap, we can use an alternative concatenation strategy, `how="align"`.

Polars identifies the common columns and aligns the rows on them.

In [22]:
(
    pl.concat(
        [
            pl.DataFrame(
                [
                    {"year":2000,"exporter":"India","item":"Clothes"},
                    {"year":2000,"exporter":"India","item":"Machinery"},
                ]
            ),
            pl.DataFrame(
                [
                    {"item":"Machinery","value":100},  # "Machinery" also appears in the first DataFrame
                ]
            )
        ],
        how="align"
    )
)

year,exporter,item,value
i64,str,str,i64
2000,"""India""","""Clothes""",
2000,"""India""","""Machinery""",100


## Diagonal concatenation

When the two `DataFrames` have different schemas, we can set `how` to `diagonal` so that the result contains the union of all columns.

Rows from a `DataFrame` that lacks a given column get a null value in that column.

In [23]:
df_2000 = pl.DataFrame(
    [
        {"year":2000,"exporter":"China","importer":"USA","quantity":0},
        {"year":2000,"exporter":"China","importer":"USA","quantity":1},
    ]
)
df_2001 = pl.DataFrame(
    [
        {"year":2001,"exporter":"China","importer":"USA","quantity":2,"item":"Clothes","value":10},
        {"year":2001,"exporter":"China","importer":"USA","quantity":3,"item":"Machinery","value":100},
    ]
)

In [24]:
df_2000

year,exporter,importer,quantity
i64,str,str,i64
2000,"""China""","""USA""",0
2000,"""China""","""USA""",1


In [25]:
df_2001

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2001,"""China""","""USA""",2,"""Clothes""",10
2001,"""China""","""USA""",3,"""Machinery""",100


Concatenate `df_2000` & `df_2001` with `diagonal`

In [26]:
pl.concat(
    [df_2000, df_2001],
    how="diagonal"
)

year,exporter,importer,quantity,item,value
i64,str,str,i64,str,i64
2000,"""China""","""USA""",0,,
2000,"""China""","""USA""",1,,
2001,"""China""","""USA""",2,"""Clothes""",10
2001,"""China""","""USA""",3,"""Machinery""",100


## Exercises

### Exercise 1


In [31]:
sales_2000 = [
    {"make":"Giant","model":"Roam","quantity":100},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales_2000_df = pl.DataFrame(sales_2000)
sales_2000_df

make,model,quantity
str,str,i64
"""Giant""","""Roam""",100
"""Giant""","""Contend""",200
"""Trek""","""FX""",300


In [32]:
sales_2001 = [
    {"make":"Giant","model":"Roam","quantity":100.0},
    {"make":"Giant","model":"Contend","quantity":200},
    {"make":"Trek","model":"FX","quantity":300},
]
sales_2001_df = pl.DataFrame(sales_2001)
sales_2001_df

make,model,quantity
str,str,f64
"""Giant""","""Roam""",100.0
"""Giant""","""Contend""",200.0
"""Trek""","""FX""",300.0


Combine the 2000 and 2001 data into a single `DataFrame`

In [33]:
pl.concat(
    [sales_2000_df, sales_2001_df],
    how="vertical_relaxed"
)

make,model,quantity
str,str,f64
"""Giant""","""Roam""",100.0
"""Giant""","""Contend""",200.0
"""Trek""","""FX""",300.0
"""Giant""","""Roam""",100.0
"""Giant""","""Contend""",200.0
"""Trek""","""FX""",300.0


Now add a third year of data to the `DataFrame`

In [34]:
sales_2002 = [
    {"make":"Giant","model":"Roam","type":"Hybrid","quantity":100},
    {"make":"Giant","model":"Contend","type":"Gravel","quantity":200},
    {"make":"Trek","model":"FX","type":"Hybrid","quantity":300},
]
sales_2002_df = pl.DataFrame(sales_2002)
sales_2002_df

make,model,type,quantity
str,str,str,i64
"""Giant""","""Roam""","""Hybrid""",100
"""Giant""","""Contend""","""Gravel""",200
"""Trek""","""FX""","""Hybrid""",300


In [37]:
pl.concat(
    [sales_2000_df, sales_2001_df, sales_2002_df],
    how="diagonal_relaxed"
)

make,model,quantity,type
str,str,f64,str
"""Giant""","""Roam""",100.0,
"""Giant""","""Contend""",200.0,
"""Trek""","""FX""",300.0,
"""Giant""","""Roam""",100.0,
"""Giant""","""Contend""",200.0,
"""Trek""","""FX""",300.0,
"""Giant""","""Roam""",100.0,"""Hybrid"""
"""Giant""","""Contend""",200.0,"""Gravel"""
"""Trek""","""FX""",300.0,"""Hybrid"""


### Exercise 2

We want to produce a `DataFrame` that has:
- the 0.25, 0.5 and 0.75 quantiles (the 25th, 50th and 75th percentiles) of the floating-point columns on separate rows
- a column called `percentiles` to show the quantile for each row

Begin by iterating over the list `quantiles`.

On each iteration calculate the quantile for the `Age` and `Fare` columns.

Append this output to the list `dfList`

In [40]:
csv_file = "data/titanic.csv"
df = pl.read_csv(csv_file)

df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Owen Harris""","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. John Bradley (Fl…","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
3,1,3,"""Heikkinen, Miss. Laina""","""female""",26.0,0,0,"""STON/O2. 3101282""",7.925,,"""S"""
4,1,1,"""Futrelle, Mrs. Jacques Heath (…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
5,0,3,"""Allen, Mr. William Henry""","""male""",35.0,0,0,"""373450""",8.05,,"""S"""


In [41]:
quantiles = [0.25,0.5,0.75]
dfList = [df.select(pl.col(pl.Float64)).quantile(q) for q in quantiles]

dfList

[shape: (1, 2)
 ┌──────┬───────┐
 │ Age  ┆ Fare  │
 │ ---  ┆ ---   │
 │ f64  ┆ f64   │
 ╞══════╪═══════╡
 │ 20.0 ┆ 7.925 │
 └──────┴───────┘,
 shape: (1, 2)
 ┌──────┬─────────┐
 │ Age  ┆ Fare    │
 │ ---  ┆ ---     │
 │ f64  ┆ f64     │
 ╞══════╪═════════╡
 │ 28.0 ┆ 14.4542 │
 └──────┴─────────┘,
 shape: (1, 2)
 ┌──────┬──────┐
 │ Age  ┆ Fare │
 │ ---  ┆ ---  │
 │ f64  ┆ f64  │
 ╞══════╪══════╡
 │ 38.0 ┆ 31.0 │
 └──────┴──────┘]

Repeat this operation, but this time on each iteration add a column called `percentiles` that captures the quantile used on that iteration.

In [None]:
quantiles = [0.25, 0.5, 0.75]
dfList = [
    df.select(pl.col(pl.Float64))
    .quantile(q)
    .with_columns(pl.lit(q).alias("percentiles"))
    for q in quantiles
]

dfList

[shape: (1, 3)
 ┌──────┬───────┬─────────────┐
 │ Age  ┆ Fare  ┆ percentiles │
 │ ---  ┆ ---   ┆ ---         │
 │ f64  ┆ f64   ┆ f64         │
 ╞══════╪═══════╪═════════════╡
 │ 20.0 ┆ 7.925 ┆ 0.25        │
 └──────┴───────┴─────────────┘,
 shape: (1, 3)
 ┌──────┬─────────┬─────────────┐
 │ Age  ┆ Fare    ┆ percentiles │
 │ ---  ┆ ---     ┆ ---         │
 │ f64  ┆ f64     ┆ f64         │
 ╞══════╪═════════╪═════════════╡
 │ 28.0 ┆ 14.4542 ┆ 0.5         │
 └──────┴─────────┴─────────────┘,
 shape: (1, 3)
 ┌──────┬──────┬─────────────┐
 │ Age  ┆ Fare ┆ percentiles │
 │ ---  ┆ ---  ┆ ---         │
 │ f64  ┆ f64  ┆ f64         │
 ╞══════╪══════╪═════════════╡
 │ 38.0 ┆ 31.0 ┆ 0.75        │
 └──────┴──────┴─────────────┘]

Concatenate the outputs

In [43]:
pl.concat(dfList)

Age,Fare,percentiles
f64,f64,f64
20.0,7.925,0.25
28.0,14.4542,0.5
38.0,31.0,0.75
