In [83]:
import pandas as pd

%load_ext memory_profiler
%load_ext line_profiler



# Create a DataFrame by concatenating other DataFrames

What is a good strategy for growing a pandas DataFrame incrementally, one slice at a time?

Let us try the following ideas:

- Store the partial results in a list, then build a single DataFrame from that list of DataFrames

- Append each partial result directly to a growing DataFrame and return it

Conclusion: **It is much better to store partial DataFrame results in a list and then concatenate them once with pd.concat()**

In [73]:
def append_many_df_as_list(initial_df, n=1_000):
    res = []
    for i in range(n):
        res.append(initial_df)
    return pd.concat(res)

def append_many_df(initial_df, n=1_000):
    df_res = pd.DataFrame()
    for i in range(n):
        df_res = pd.concat((df_res, initial_df))
    return df_res.reset_index(drop=True)
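
Note that the two functions above do not return identical frames: append_many_df resets the index, while append_many_df_as_list keeps the repeated row labels of each chunk. A minimal sketch of one way to align them, assuming a fresh 0..N-1 index is what you want, is to pass ignore_index=True to pd.concat. The helper name concat_chunks and the small sample frame are hypothetical and only for illustration.

```python
import pandas as pd

def concat_chunks(chunks):
    """Hypothetical helper: concatenate once and renumber rows 0..N-1."""
    return pd.concat(list(chunks), ignore_index=True)

# Small illustrative data (not the 10_000-row frame used in the timings).
small_df = pd.DataFrame({'a': [1] * 3, 'b': ['a'] * 3})
chunks = [small_df] * 4

with_flag = concat_chunks(chunks)
with_reset = pd.concat(chunks).reset_index(drop=True)

# Both routes should give the same frame with a clean, unique index.
pd.testing.assert_frame_equal(with_flag, with_reset)
print(pd.concat(chunks).index.is_unique, with_flag.index.is_unique)
```

Without ignore_index=True the concatenated frame carries duplicate index labels, which can surprise later label-based lookups.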

In [74]:
initial_df = pd.DataFrame({'a':[1]*10_000,
                           'b':['a']*10_000})

In [75]:
%%time
res_1 = append_many_df(initial_df)

CPU times: user 34.3 s, sys: 25.5 s, total: 59.7 s
Wall time: 1min 1s


In [76]:
%%time
res_2 = append_many_df_as_list(initial_df)

CPU times: user 51.3 ms, sys: 2.99 ms, total: 54.3 ms
Wall time: 53.8 ms


Collecting the partial DataFrames in a list and concatenating once is much faster: about 54 ms versus roughly 1 minute for repeated concatenation.
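
A rough way to see where the gap comes from is to count how many rows each strategy has to write. This is a simplified model rather than a measurement: it assumes every pd.concat call builds a brand-new frame containing all rows so far, and it ignores allocation and dtype overhead. The function names are hypothetical.

```python
def rows_written_repeated_concat(n_chunks: int, chunk_rows: int) -> int:
    """Rows written when the result is re-concatenated on every iteration."""
    total = 0
    accumulated = 0
    for _ in range(n_chunks):
        accumulated += chunk_rows  # size of the new result frame
        total += accumulated       # assumed: the whole frame is rebuilt
    return total

def rows_written_single_concat(n_chunks: int, chunk_rows: int) -> int:
    """Rows written when all chunks are concatenated once at the end."""
    return n_chunks * chunk_rows

# Same sizes as in the experiment: 1_000 chunks of 10_000 rows.
n, m = 1_000, 10_000
repeated = rows_written_repeated_concat(n, m)
single = rows_written_single_concat(n, m)
print(repeated, single, repeated / single)  # 5005000000 10000000 500.5
```

Under this model the repeated approach grows quadratically with the number of chunks, while the single concatenation grows linearly, which is consistent in order of magnitude with the timings above.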

In [78]:
%lprun -f append_many_df append_many_df(initial_df)

Timer unit: 1e-06 s

Total time: 62.3616 s
File: /var/folders/83/gfk1v5cj3hb66glpkyr_wmtc0000gr/T/ipykernel_19478/3730665614.py
Function: append_many_df at line 7

Line #      Hits         Time  Per Hit   % Time  Line Contents
     7                                           def append_many_df(initial_df, n=1_000):
     8         1       1607.0   1607.0      0.0      df_res = pd.DataFrame()
     9      1001       4218.0      4.2      0.0      for i in range(n):
    10      1000   62263271.0  62263.3     99.8          df_res = pd.concat((df_res, initial_df))
    11         1      92461.0  92461.0      0.1      return df_res.reset_index(drop=True)

In [79]:
%lprun -f append_many_df_as_list append_many_df_as_list(initial_df)

Timer unit: 1e-06 s

Total time: 0.116299 s
File: /var/folders/83/gfk1v5cj3hb66glpkyr_wmtc0000gr/T/ipykernel_19478/3730665614.py
Function: append_many_df_as_list at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
     1                                           def append_many_df_as_list(initial_df, n=1_000):
     2         1          2.0      2.0      0.0      res = []
     3      1001        234.0      0.2      0.2      for i in range(n):
     4      1000        258.0      0.3      0.2          res.append(initial_df)
     5         1     115805.0 115805.0     99.6      return pd.concat(res)

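Now let us profile memory. Both approaches reach a similar peak (about 9,284 MiB versus 8,862 MiB in the runs below). Keep in mind that %memit reports the peak of the whole process, which at this point likely already includes res_1 and res_2, so the baseline is high.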

In [81]:
%memit append_many_df(initial_df)

peak memory: 9284.47 MiB, increment: 427.57 MiB


In [82]:
%memit append_many_df_as_list(initial_df)

peak memory: 8861.65 MiB, increment: -3.05 MiB


# Read only 