## Extending, stacking and concatenating
By the end of this lecture you will be able to:
- combine two `DataFrames` with `vstack`, `extend` and vertical `concat`
- explain the advantages and disadvantages of each approach

In [None]:
import polars as pl
import numpy as np

np.random.seed(0)

In [None]:
df1 = (
    pl.DataFrame(
        {
            "id":[0,1],
            "values":["a","b"]
        }
    )
)
df1

In [None]:
df2 = (
    pl.DataFrame(
        {
            "id":[2,3],
            "values":["c","d"]
        }
    )
)
df2

## Combining `DataFrames`
If we have data in two different `DataFrames` then we can combine them as a new `DataFrame` while treating the data for the original `DataFrames` in three different ways:
- keeping the data in the original two locations in memory and linking to these
- copying the data to a single location in memory
- appending the data from the second `DataFrame` to the location of the first `DataFrame`

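Each approach has trade-offs:
- keeping the data in their original locations is cheap, but subsequent operations on the combined `DataFrame` can be slower
- copying the data to a single new location costs more up front, but gives more consistent performance in subsequent operations
- appending to the location of the first `DataFrame` only copies the second `DataFrame`, unless there is not enough space and all the data has to be copied to a new location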

## Methods
We cover three methods for combining `DataFrames`: `pl.concat`, `df.vstack` and `df.extend`. The output of each method is the same from a user perspective, but the methods differ in where the data sits in memory under the hood.

Later we examine the performance implications of the methods for some simple operations.

### Concatenation
In the first lecture of this section we saw how to combine `DataFrames` with a vertical concatenation

In [None]:
print(pl.concat(
    [
        df1,df2
    ]
))

A vertical concatenation:
- combines a `list` of `DataFrames` into a single `DataFrame`
- rechunks (copies) the data into a single location in memory

We can tell Polars in `pl.concat` not to copy the data to a single location in memory with `rechunk = False`

In [None]:
pl.concat(
    [
        df1,df2
    ],
    rechunk=False
)

### Vstack
We can combine two `DataFrames` with `vstack`

In [None]:
(
    df1
    .vstack(
        df2
    )
)

A `vstack`:
- keeps the data in the original locations in memory

A `vstack` is computationally cheap but subsequent operations are slower than if the data has been rechunked (i.e. copied to a single location).

### Extend
We can append one `DataFrame` to another with `extend`

In [None]:
(
    df1
    .extend(
        df2
    )
)

An `extend`:
- copies the data from `df2` and appends it to the memory location of `df1`
- modifies `df1` in place, so `df1` itself now contains the rows of `df2`
- can lead to all the data being copied to a new location if there is not enough space in the existing location of `df1`

### Rechunk
We can manually cause two `DataFrames` linked by `vstack` to be copied to a single location in memory with `rechunk`

In [None]:
(
    df1
    .vstack(
        df2
    )
    .rechunk()
)

If we have done a `vstack` (or a series of `vstacks`) we can call `rechunk` to copy all the data to a single location in memory

The official API docs provide the following advice:

> Prefer `extend` over `vstack` when you want to do a query after a single append.
> For instance during online operations where you add `n` rows and rerun a query.

> Prefer `vstack` over `extend` when you want to append many times before doing a
> query. For instance when you read in multiple files and want to store them in a
> single `DataFrame`. In the latter case, finish the sequence of `vstack`
> operations with a `rechunk`.

I would also say:
- use `extend` whenever adding a small `DataFrame` to a big one to avoid copying the big one
- use `vstack` if you aren't going to do a computationally intensive query on the output - say you just want to count the length of the data
- test the three options on your own data (see below)


## Horizontal stacking
We can also grow a `DataFrame` from a `Series` or another `DataFrame` with `hstack`. This is always a cheap operation as a new `DataFrame` is created without copying data

In [None]:
df1 = pl.DataFrame(
    {
        "a":[0,1,2]
    }
)
df2 = pl.DataFrame(
    {
        "b":[0,1,2]
    }
)
df1.hstack(df2)

## Exploring performance of different strategies
There are no exercises here as `concat`, `vstack` and `extend` do the same thing and we have already seen exercises for `concat`.

Instead we look at the relative performance of different methods.

We begin with a function to make a `DataFrame` with an integer `id` column and many floating point columns

In [None]:
def makeDataFrame(N:int,K:int,cardinality:int):
    return (
        pl.DataFrame(
            {
                "id":np.random.randint(0,cardinality,N)
            }
        ).hstack(
            pl.DataFrame(
                np.random.standard_normal((N,K))
            )
        )
    )
N = 100_000
K = 100
cardinality = 1000

df = makeDataFrame(N=N,K=K,cardinality=cardinality)
df.head(2)

We now make another large `DataFrame` and a small `DataFrame`

In [None]:
# Make another large DataFrame
dfOther = makeDataFrame(N=N,K=K,cardinality=cardinality)
# Make another small DataFrame
dfOtherSmall = makeDataFrame(N=100,K=K,cardinality=cardinality)

As there are many functions to compare we create a wrapper function to time execution below

In [None]:
from functools import wraps
import time

def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        print(f'Function {func.__name__} Took {1000*total_time:.2f} milliseconds')
        return result
    return timeit_wrapper


We define the functions for the different strategies, each wrapped by the `timeit` decorator

In [None]:
@timeit
def concatRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=True)
@timeit
def concatNoRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=False)
@timeit
def vstack(df1,df2):
    return df1.vstack(df2)
@timeit
def vstackRechunk(df1,df2):
    return df1.vstack(df2).rechunk()
@timeit
def extend(df1,df2):
    # Note: extend modifies df1 in place, so df1 grows on every call
    return df1.extend(df2)

We first test the timings for combining the two large `DataFrames` to a new `DataFrame`. Note that:
- relative timings may vary between your machine and my machine due to hardware differences
- timings often get slower the longer the kernel has been running, so it may be worth restarting it periodically
- it is worth running any set of timings a few times as they do vary
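
Since individual timings vary, one option is to repeat each call and report the median. The helper below is a hypothetical sketch that is not part of the lecture code; it uses only the standard library and is demonstrated on the built-in `sorted` rather than on the Polars functions.

```python
import statistics
import time

def time_repeated(func, *args, repeats=5, **kwargs):
    """Return the median wall-clock time of func in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(1000 * (time.perf_counter() - start))
    return statistics.median(timings)

# Example: time sorting a reversed list of integers five times
median_ms = time_repeated(sorted, list(range(100_000, 0, -1)), repeats=5)
print(f"median: {median_ms:.2f} milliseconds")
```

Take care when repeating a function that calls `extend`, because each repeat appends more rows to the same `DataFrame`.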

In [None]:
concatRechunk(df,dfOther)
concatNoRechunk(df,dfOther)
vstack(df,dfOther)
vstackRechunk(df,dfOther)
extend(df,dfOther)
# Add a line here to stop the wrapped function printing a DataFrame
a = 1

On my machine `concatRechunk` takes about 100 milliseconds. If you get values much larger than this I recommend restarting your kernel and trying again.

We see that combining `DataFrames` without copying any data (`concatNoRechunk`,`vstack`) is very fast - especially `vstack`.

### Combining a large and small `DataFrame`

Compare how long it takes to combine `df` with `dfOtherSmall`

In [None]:
concatRechunk(df,dfOtherSmall)
concatNoRechunk(df,dfOtherSmall)
vstack(df,dfOtherSmall)
vstackRechunk(df,dfOtherSmall)
extend(df,dfOtherSmall)
a = 1

In this case we find that `extend` is also very fast: it does copy data, but it only copies the smaller second `DataFrame`, which it appends to the large `DataFrame`. On some runs for me `extend` can be as slow as `concatRechunk`, though.

### Combining and doing a groupby
We now want to do a `groupby` on the combined `DataFrame`. We want to see if the strategies that do a `rechunk` make up time with a faster `groupby` 

In [None]:
@timeit
def concatRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=True).groupby("id").agg(
        pl.col(pl.Float64).mean()
    )
@timeit
def concatNoRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=False).groupby("id").agg(
        pl.col(pl.Float64).mean()
    )
@timeit
def vstack(df1,df2):
    return df1.vstack(df2).groupby("id").agg(
        pl.col(pl.Float64).mean()
    )
@timeit
def vstackRechunk(df1,df2):
    return df1.vstack(df2).rechunk().groupby("id").agg(
        pl.col(pl.Float64).mean()
    )
@timeit
def extend(df1,df2):
    return df1.extend(df2).groupby("id").agg(
        pl.col(pl.Float64).mean()
    )

In [None]:
concatRechunk(df,dfOther)
concatNoRechunk(df,dfOther)
vstack(df,dfOther)
vstackRechunk(df,dfOther)
extend(df,dfOther)
# Add a line here to stop the wrapped function printing a DataFrame
a = 1

On my machine the differences between strategies are much smaller in this case, but `vstack` is still fastest

### Combining and doing a sort
We now combine and sort by the `id` column

In [None]:
@timeit
def concatRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=True).sort("id")
@timeit
def concatNoRechunk(df1,df2):
    return pl.concat([df1,df2,],rechunk=False).sort("id")
@timeit
def vstack(df1,df2):
    return df1.vstack(df2).sort("id")
@timeit
def vstackRechunk(df1,df2):
    return df1.vstack(df2).rechunk().sort("id")
@timeit
def extend(df1,df2):
    return df1.extend(df2).sort("id")

In [None]:
concatRechunk(df,dfOther)
concatNoRechunk(df,dfOther)
vstack(df,dfOther)
vstackRechunk(df,dfOther)
extend(df,dfOther)
# Add a line here to stop the wrapped function printing a DataFrame
a = 1

Again in this case the `vstack` strategy is fastest (for me at least) but the relative differences are smaller.

The timings presented here cannot provide definitive conclusions about the different methods, but they do show that it may be worth experimenting with the various approaches in your own queries.

## Horizontal combinations
Both horizontal combinations, `pl.concat` with `how="horizontal"` and `hstack`, are similarly fast as neither rechunks the data

In [None]:
@timeit
def concat_horizontal(df1,df2):
    return pl.concat([df1,df2.select(pl.all().suffix("_other"))],how="horizontal")
@timeit
def hstack(df1,df2):
    return df1.hstack(df2.select(pl.all().suffix("_other")))

In [None]:
# Make more large DataFrames
dfHorizontal1 = makeDataFrame(N=N,K=K,cardinality=cardinality)
dfHorizontal2 = makeDataFrame(N=N,K=K,cardinality=cardinality)

In [None]:
concat_horizontal(dfHorizontal1,dfHorizontal2)
hstack(dfHorizontal1,dfHorizontal2)
