# Abstract

**Objective**: Find the fastest way to concatenate pandas DataFrames by comparing three approaches.

**Methods**: Time each approach with the IPython magic command `%timeit`, and repeat the timing 3 times.

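**Conclusion**: Based on the average time over the three runs, building the DataFrames with a generator function and then calling `pandas.concat` once is about as fast as appending the DataFrames to a list and then calling `pandas.concat` once (roughly 10 to 11 ms each). Calling `pandas.concat` inside a `for` loop is the slowest (roughly 31 ms, about three times as long).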

# Initialize

In [1]:
import sys
import pandas as pd

In [2]:
print('Python version')
print(sys.version)
print('---------------')
print('pandas version')
print(pd.__version__)

Python version
3.6.2 |Anaconda custom (64-bit)| (default, Jul 20 2017, 13:51:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
---------------
pandas version
0.20.3


# Define functions to test

In [3]:
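import numpy as np  # `pd.np` is deprecated in later pandas versions; use NumPy directly

def random_array(seed, size, low, high):
    # Seeded, reproducible array of integers in [low, high)
    return (np
            .random
            .RandomState(seed=seed)
            .randint(low=low,
                     high=high,
                     size=size))

def random_df(seed, size, low, high):
    # Wrap the seeded random array in a DataFrame
    return pd.DataFrame(random_array(seed, size, low, high))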

def loop_concat_df(N, size, low, high):
    df = pd.DataFrame()
    for ii in range(N):
        df = pd.concat([df, random_df(ii, size, low, high)])
    return df

def loop_concat_list(N, size, low, high):
    df_list = []
    for ii in range(N):
        df_list.append(random_df(ii, size, low, high))
    return pd.concat(df_list)

def loop_generator(N, size, low, high):
    for ii in range(N):
        yield random_df(ii, size, low, high)

def generator_concat(N, size, low, high):
    return pd.concat(loop_generator(N, size, low, high))
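
One plausible reason `loop_concat_df` does more work than the other two functions is that every `pd.concat` call inside the loop builds a new DataFrame containing all rows accumulated so far. The sketch below is a simplified cost model, not a measurement: it assumes each `pd.concat` call writes every input row into a fresh object and ignores all per-call overhead. The function names `rows_copied_incremental` and `rows_copied_single` are hypothetical helpers introduced for illustration.

```python
def rows_copied_incremental(n_frames, rows_per_frame):
    """Rows written when concatenating onto a growing frame, one frame at a time."""
    total = 0
    accumulated = 0
    for _ in range(n_frames):
        accumulated += rows_per_frame  # size of the frame produced by this concat
        total += accumulated           # assumption: each concat writes a full new frame
    return total


def rows_copied_single(n_frames, rows_per_frame):
    """Rows written when all frames are concatenated in one call."""
    return n_frames * rows_per_frame


# 50 frames of 100 rows each, the sizes used in this notebook
print(rows_copied_incremental(50, 100))  # 127500
print(rows_copied_single(50, 100))       # 5000
```

Under this model the loop writes about 25 times as many rows, while the measured gap in this notebook is only about 3x. The row count is therefore best read as an indication of how the cost grows with the number of frames, not as a prediction of the timings at this size.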

## Check if functions return the same value

In [4]:
N = 50  # number of DataFrames
size = (10**2, 4)  # shape of each DataFrame (rows, columns)
low = 0  # lower bound (inclusive): smallest possible number
high = 10  # upper bound (exclusive): biggest possible number is 9

df1 = loop_concat_df(N, size, low, high)
df2 = loop_concat_list(N, size, low, high)
df3 = generator_concat(N, size, low, high)

assert df1.equals(df2), '`loop_concat_df` and `loop_concat_list` are different'
assert df1.equals(df3), '`loop_concat_df` and `generator_concat` are different'

print('`loop_concat_df` = `loop_concat_list` = `generator_concat`')

`loop_concat_df` = `loop_concat_list` = `generator_concat`


# Run tests

In [5]:
print('loop_concat_df:')
%timeit loop_concat_df(N, size, low, high)
print('--------------------------------------------------------------')
print('loop_concat_list:')
%timeit loop_concat_list(N, size, low, high)
print('--------------------------------------------------------------')
print('generator_concat:')
%timeit generator_concat(N, size, low, high)

loop_concat_df:
38 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
--------------------------------------------------------------
loop_concat_list:
12 ms ± 939 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
--------------------------------------------------------------
generator_concat:
10.9 ms ± 656 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
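
The `%timeit` magic is only available in IPython and Jupyter. Outside of them, a rough equivalent can be sketched with the standard-library `timeit` module. The helper `time_per_loop` is a hypothetical name introduced here, and the fixed `repeat=7, number=10` values are assumptions chosen to mirror the "7 runs, 10 loops each" output above (`%timeit` itself picks the loop count automatically). Only `loop_concat_list` is redefined so that the block is self-contained.

```python
import statistics
import timeit

import numpy as np
import pandas as pd


def random_df(seed, size, low, high):
    # Seeded random integer DataFrame, same idea as the notebook's helper
    rng = np.random.RandomState(seed=seed)
    return pd.DataFrame(rng.randint(low=low, high=high, size=size))


def loop_concat_list(N, size, low, high):
    # Collect the frames in a list, then concatenate once
    df_list = [random_df(ii, size, low, high) for ii in range(N)]
    return pd.concat(df_list)


def time_per_loop(func, repeat=7, number=10):
    """Return (mean, std. dev.) of seconds per call over `repeat` runs."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)


mean_s, std_s = time_per_loop(lambda: loop_concat_list(50, (10**2, 4), 0, 10))
print(f"{mean_s * 1e3:.2f} ms ± {std_s * 1e3:.2f} ms per loop")
```

The absolute numbers depend on the machine and on the installed pandas version, so they are not expected to match the timings reported in this notebook.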


In [6]:
print('loop_concat_df:')
%timeit loop_concat_df(N, size, low, high)
print('--------------------------------------------------------------')
print('loop_concat_list:')
%timeit loop_concat_list(N, size, low, high)
print('--------------------------------------------------------------')
print('generator_concat:')
%timeit generator_concat(N, size, low, high)

loop_concat_df:
30.8 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
--------------------------------------------------------------
loop_concat_list:
11.5 ms ± 943 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
--------------------------------------------------------------
generator_concat:
11.7 ms ± 1.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
print('loop_concat_df:')
%timeit loop_concat_df(N, size, low, high)
print('--------------------------------------------------------------')
print('loop_concat_list:')
%timeit loop_concat_list(N, size, low, high)
print('--------------------------------------------------------------')
print('generator_concat:')
%timeit generator_concat(N, size, low, high)

loop_concat_df:
24.7 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
--------------------------------------------------------------
loop_concat_list:
8.56 ms ± 82.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
--------------------------------------------------------------
generator_concat:
8.59 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
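
The conclusion in the abstract is based on the average time, so the arithmetic can be made explicit with a short sketch. The values are the mean times per loop copied from the three `%timeit` outputs above; the variable names (`reported_ms`, `averages`, `fastest`) are introduced here for illustration.

```python
from statistics import mean

# Mean time per loop in ms, copied from the outputs of In [5], In [6] and In [7]
reported_ms = {
    "loop_concat_df": [38.0, 30.8, 24.7],
    "loop_concat_list": [12.0, 11.5, 8.56],
    "generator_concat": [10.9, 11.7, 8.59],
}

# Average over the three runs, then compare each function to the fastest one
averages = {name: mean(times) for name, times in reported_ms.items()}
fastest = min(averages.values())

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} ms ({avg / fastest:.1f}x the fastest)")
# loop_concat_df: 31.2 ms (3.0x the fastest)
# loop_concat_list: 10.7 ms (1.0x the fastest)
# generator_concat: 10.4 ms (1.0x the fastest)
```

The difference between `loop_concat_list` and `generator_concat` (about 0.3 ms) is smaller than the run-to-run variation reported by `%timeit`, which is why the two are treated as equally fast.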
