# Pandas: File format benchmark

We will compare the read and write performance of several file formats:
 * CSV
 * Feather
 * Parquet

The metrics are:
 * size_mb: the size (in MiB) of the file holding the serialized data frame
 * save_time: the time required to save the data frame to disk
 * load_time: the time required to load the previously saved data frame back into memory
 * save_ram_delta_mb: the maximum growth in memory consumption while saving the data frame
 * load_ram_delta_mb: the maximum growth in memory consumption while loading the data frame
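
The five metrics above can be collected in one place. The following is a minimal sketch, not the method used in this notebook: the helper names `measure` and `benchmark` are hypothetical, and memory is tracked with the standard-library `tracemalloc` module, which only sees allocations made through Python's allocator. The `%memit` magic used in the cells below reports process memory instead, so the numbers are not directly comparable. A tiny text payload stands in for the data frame so that the sketch runs on its own.

```python
import os
import tempfile
import time
import tracemalloc


def measure(func, *args):
    """Run func(*args); return (result, elapsed seconds, peak traced MiB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak / 1024 / 1024


def benchmark(save, load, path):
    """Collect the five metrics for one save/load pair (hypothetical helper)."""
    _, save_time, save_ram = measure(save, path)
    size_mb = os.path.getsize(path) / 1024 / 1024
    _, load_time, load_ram = measure(load, path)
    return {
        "size_mb": size_mb,
        "save_time": save_time,
        "load_time": load_time,
        "save_ram_delta_mb": save_ram,
        "load_ram_delta_mb": load_ram,
    }


# Stand-in payload (assumption): plain text rows instead of a data frame.
rows = [f"{i},{i * i}\n" for i in range(10_000)]


def save_lines(path):
    with open(path, "w") as f:
        f.writelines(rows)


def load_lines(path):
    with open(path) as f:
        return f.readlines()


with tempfile.TemporaryDirectory() as tmp:
    metrics = benchmark(save_lines, load_lines, os.path.join(tmp, "demo.csv"))
print(metrics)
```

With the notebook's data frame, the same helper could be called as `benchmark(lambda p: df.to_csv(p, index=False), pd.read_csv, "pandas.csv")`. Tracing allocations slows execution, so timing and memory are better measured in separate runs, as the `%%time` and `%%memit` cells below do.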

In [1]:
import random
import string
import numpy as np
import pandas as pd
from datetime import datetime
import pathlib
%load_ext memory_profiler

Create a large dataset

In [2]:
%%time
def gen_random_string(length:int=32) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
    
def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):

    dt = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(dt)
    # Seed both generators so the generated data is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    columns = {
        'date': dt,
        'cat': np.random.choice(['cat1','cat2','cat3','cat4','cat5'],n),
        'str1':[gen_random_string() for _ in range(n)],
        'str2':[gen_random_string() for _ in range(n)],
        'a': np.random.rand(n),
        'b': np.random.rand(n),
        'c': np.random.randint(1,100,n),
    }

    df = pd.DataFrame(columns)
    if df.index[-1] == end:
        df = df.iloc[:-1]
    return df

df = make_timeseries(start=datetime(2020,1,1), end=datetime(2023,12,31), freq='1min', seed=10)
df["cat"] = df["cat"].astype("category")

CPU times: user 7.86 s, sys: 242 ms, total: 8.1 s
Wall time: 8.1 s


Print the first rows to see what the data looks like.

In [3]:
df.head()

|   | date | cat | str1 | str2 | a | b | c |
|---|------|-----|------|------|---|---|---|
| 0 | 2020-01-01 00:00:00 | cat5 | FYY2NJI5GNI43JLU2BHMLB7BR21EPRGC | TJKDJ0DYO6WGJXFNZFJI7K7T93I8U6PC | 0.324755 | 0.481147 | 87 |
| 1 | 2020-01-01 00:01:00 | cat2 | IED441OT3A1M6N3I1DR7RK34QWAHGX1P | Q329243NA8HKXSC6JR2FMDW8DLUV9XG3 | 0.235484 | 0.915106 | 55 |
| 2 | 2020-01-01 00:02:00 | cat4 | U6AQRGQDDBQ40CN8WJSDPGXYKHRN4EAC | OZ6QUUJT5851KBAT23R4BR2KI1F7TQUO | 0.231799 | 0.802818 | 39 |
| 3 | 2020-01-01 00:03:00 | cat3 | PRS76E9WDQFYO28P0KAQ14S5XXH9INVT | BNTCIFJ37IDR4240PVH64HDJDJ09UIKH | 0.792647 | 0.51872 | 86 |
| 4 | 2020-01-01 00:04:00 | cat1 | 682L5BFCDZK9U670DRP81U6Q3JTKSPWQ | G4VR2HF9PRXRTQLQV7VM1EYLX34GDYMH | 0.945993 | 0.308536 | 87 |


Print the shape of the dataframe

In [4]:
df.shape

(2102401, 7)

Print memory usage of the dataframe 

In [5]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2102401 entries, 0 to 2102400
Data columns (total 7 columns):
 #   Column  Dtype         
---  ------  -----         
 0   date    datetime64[ns]
 1   cat     category      
 2   str1    object        
 3   str2    object        
 4   a       float64       
 5   b       float64       
 6   c       int64         
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 391.0 MB


## Write and read the dataframe in CSV format 

First we time reading and writing the dataframe in CSV format. This serves as a baseline for the pyarrow-backed formats (Parquet and Feather) measured afterwards.

Measure the time and the memory to write the dataframe

In [7]:
%%time
df.to_csv("pandas.csv",index=False)

CPU times: user 8.45 s, sys: 146 ms, total: 8.6 s
Wall time: 8.64 s


In [8]:
%%memit
df.to_csv("pandas.csv",index=False)

peak memory: 821.58 MiB, increment: 0.50 MiB


Check the size of the file on disk (in MiB)

In [10]:
pathlib.Path("pandas.csv").stat().st_size / 1024 /1024

265.56116771698

Measure the time and the memory to read the dataframe

In [12]:
%%time
df1 = pd.read_csv("pandas.csv")

CPU times: user 3.53 s, sys: 243 ms, total: 3.78 s
Wall time: 3.78 s


In [13]:
%%memit
df1 = pd.read_csv("pandas.csv")

peak memory: 2026.89 MiB, increment: 644.98 MiB


## Write and read the dataframe in Parquet format

Measure the time and the memory to write the dataframe

In [15]:
%%time
df.to_parquet("pandas.parquet",index=False)

CPU times: user 870 ms, sys: 314 ms, total: 1.18 s
Wall time: 1.16 s


In [16]:
%%memit
df.to_parquet("pandas.parquet",index=False)

peak memory: 1733.83 MiB, increment: 182.97 MiB


Check the size of the file on disk (in MiB)

In [18]:
pathlib.Path("pandas.parquet").stat().st_size / 1024 /1024

187.42490577697754

Measure the time and the memory to read the dataframe

In [20]:
%%time
df1 = pd.read_parquet("pandas.parquet")

CPU times: user 1.07 s, sys: 764 ms, total: 1.83 s
Wall time: 1.34 s


In [21]:
%%memit
df1 = pd.read_parquet("pandas.parquet")

peak memory: 2715.60 MiB, increment: 1016.17 MiB


## Write and read the dataframe in Feather format

Measure the time and the memory to write the dataframe

In [23]:
%%time
df.to_feather("pandas.feather")

CPU times: user 414 ms, sys: 334 ms, total: 748 ms
Wall time: 629 ms


In [24]:
%%memit
df.to_feather("pandas.feather")

peak memory: 2199.72 MiB, increment: 170.55 MiB


Check the size of the file on disk (in MiB)

In [26]:
pathlib.Path("pandas.feather").stat().st_size / 1024 /1024

199.9052219390869

Measure the time and the memory to read the dataframe

In [28]:
%%time
df1 = pd.read_feather("pandas.feather")

CPU times: user 617 ms, sys: 465 ms, total: 1.08 s
Wall time: 953 ms


In [29]:
%%memit
df1 = pd.read_feather("pandas.feather")

peak memory: 2898.86 MiB, increment: 750.05 MiB


## More exercises

### 1. Plot the benchmark results
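
One possible starting point, offered as a sketch rather than a reference solution: gather the numbers printed by the cells above into a single data frame and express each metric relative to the CSV baseline. The values are copied by hand from the outputs in this notebook (wall times in seconds, sizes and memory increments in MiB); the variable names `results` and `relative` are illustrative.

```python
import pandas as pd

# Values copied from the cell outputs above
# (wall time in seconds, file size and memory increment in MiB).
results = pd.DataFrame(
    {
        "size_mb": [265.56, 187.42, 199.91],
        "save_time": [8.64, 1.16, 0.629],
        "load_time": [3.78, 1.34, 0.953],
        "save_ram_delta_mb": [0.50, 182.97, 170.55],
        "load_ram_delta_mb": [644.98, 1016.17, 750.05],
    },
    index=["csv", "parquet", "feather"],
)

# Express every metric relative to the CSV baseline (1.0 = same as CSV).
relative = results / results.loc["csv"]
print(relative.round(2))
```

If matplotlib is installed, `results.plot.bar(subplots=True, layout=(2, 3), legend=False)` should draw one bar chart per metric, which avoids mixing seconds and MiB on a shared axis.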

### 2. Analyze the influence of compression on reading and writing
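
As a hedged illustration of how such a comparison could be set up, the sketch below times CSV writes and reads with and without compression, using the `compression` argument of `to_csv` and `read_csv`. A small random frame (`small`, an assumption made to keep the run short) stands in for the notebook's `df`, so the absolute numbers are not meaningful; only the structure of the loop is the point.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Small stand-in frame (assumption); the notebook's `df` could be used instead.
rng = np.random.default_rng(0)
small = pd.DataFrame({"a": rng.random(50_000), "c": rng.integers(1, 100, 50_000)})

rows = []
with tempfile.TemporaryDirectory() as tmp:
    for codec in [None, "gzip", "bz2"]:
        path = os.path.join(tmp, f"small_{codec}.csv")

        # Time the write with the chosen codec (None means uncompressed).
        t0 = time.perf_counter()
        small.to_csv(path, index=False, compression=codec)
        save_time = time.perf_counter() - t0

        # Time the read, telling pandas which codec was used.
        t0 = time.perf_counter()
        loaded = pd.read_csv(path, compression=codec)
        load_time = time.perf_counter() - t0

        rows.append(
            {
                "compression": str(codec),
                "size_mb": os.path.getsize(path) / 1024 / 1024,
                "save_time": save_time,
                "load_time": load_time,
                "n_rows": len(loaded),
            }
        )

comparison = pd.DataFrame(rows).set_index("compression")
print(comparison)
```

`DataFrame.to_parquet` accepts a `compression` argument as well (for example `"snappy"`, `"gzip"`, `"zstd"` or `None`), and `DataFrame.to_feather` forwards extra keyword arguments to pyarrow, so the same loop can be adapted to those formats when pyarrow is available.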