# Pandas: File format benchmark

We will compare the read and write performance of several file formats:
 * CSV
 * Feather
 * Parquet

The metrics are:
 * size_mb: the size of the file holding the serialized data frame (computed as bytes / 1024 / 1024, i.e. MiB)
 * save_time: the time required to save the data frame to disk
 * load_time: the time required to load the previously saved data frame back into memory
 * save_ram_delta_mb: the peak growth in memory consumption while saving the data frame
 * load_ram_delta_mb: the peak growth in memory consumption while loading the data frame
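
The five metrics can also be collected by a small helper instead of separate notebook cells. The following is a minimal sketch, not the method used in this notebook: the `benchmark` function and its parameters are hypothetical names, and it uses the standard-library `tracemalloc` in place of `memory_profiler`. `tracemalloc` only sees allocations made through Python's allocators, so memory held by native libraries may be undercounted and the numbers are not directly comparable to `%%memit`.

```python
import tempfile
import time
import tracemalloc
from pathlib import Path

import numpy as np
import pandas as pd

MIB = 1024 * 1024


def _timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def _peak_mib(func, *args):
    """Run func(*args) under tracemalloc and return the peak traced memory in MiB."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / MIB


def benchmark(df, save, load, path):
    """Hypothetical helper: collect the five metrics for one save/load pair.

    Time and memory are measured in separate runs, as the notebook does with
    %%time and %%memit, so that tracing overhead does not distort the timings.
    """
    _, save_time = _timed(save, df, path)
    loaded, load_time = _timed(load, path)
    return {
        "size_mb": Path(path).stat().st_size / MIB,
        "save_time": save_time,
        "load_time": load_time,
        "save_ram_delta_mb": _peak_mib(save, df, path),
        "load_ram_delta_mb": _peak_mib(load, path),
        "rows_loaded": len(loaded),
    }


# Usage on a small frame written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    small = pd.DataFrame(
        {"a": np.random.rand(1_000), "c": np.random.randint(1, 100, 1_000)}
    )
    metrics = benchmark(
        small,
        save=lambda d, p: d.to_csv(p, index=False),
        load=pd.read_csv,
        path=Path(tmp) / "small.csv",
    )
print(metrics)
```

Passing `save` and `load` as callables keeps the helper independent of the file format, so the same function could be reused for each format compared here.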

In [1]:
import random
import string
import numpy as np
import pandas as pd
from datetime import datetime
import pathlib
%load_ext memory_profiler

Create a large dataset

In [2]:
%%time
def gen_random_string(length:int=32) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
    
def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):

    dt = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(dt)
    # Seed both generators: NumPy for the numeric columns, `random` for the strings.
    np.random.seed(seed)
    random.seed(seed)
    columns = {
        'date': dt,
        'cat': np.random.choice(['cat1','cat2','cat3','cat4','cat5'],n),
        'str1':[gen_random_string() for _ in range(n)],
        'str2':[gen_random_string() for _ in range(n)],
        'a': np.random.rand(n),
        'b': np.random.rand(n),
        'c': np.random.randint(1,100,n),
    }

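    df = pd.DataFrame(columns)
    # Note: the index is a RangeIndex of integers, so this comparison with
    # `end` never matches and the last row is always kept.
    if df.index[-1] == end:
        df = df.iloc[:-1]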
    return df

df = make_timeseries(start=datetime(2020,1,1), end=datetime(2023,12,31), freq='1min', seed=10)
df["cat"] = df["cat"].astype("category")

CPU times: user 23.3 s, sys: 465 ms, total: 23.8 s
Wall time: 23.8 s


Print the first rows to see what the data looks like.

In [3]:
df.head()

|   | date | cat | str1 | str2 | a | b | c |
|---|------|-----|------|------|---|---|---|
| 0 | 2020-01-01 00:00:00 | cat4 | KIIHL0VEUZR48V2DS7M537A9ZQCEPQ7C | YRIEWCQLFZHX6FE6OJ1QV51PS9HNSRTD | 0.582354 | 0.92828 | 20 |
| 1 | 2020-01-01 00:01:00 | cat2 | U446Z57ITE5U9ZMAE59AQ3ICHKZH7BQC | JS6YSS5HU7HSYMKV3ZZ2L45F0LP0KL19 | 0.917722 | 0.207994 | 20 |
| 2 | 2020-01-01 00:02:00 | cat5 | MP742U2IJ5T3XKYULEHD7SVG09M6X152 | 4QDNQ4ZA768WCYSJOTIXU6YFD64O6GDH | 0.344823 | 0.228929 | 39 |
| 3 | 2020-01-01 00:03:00 | cat3 | WZ4FQQAMBUGXFDEN7BQ2AZNKPWWP9JBL | YQRIW2CPN3TRIEWVQK8TVWEZB1BNAK1N | 0.466455 | 0.683289 | 37 |
| 4 | 2020-01-01 00:04:00 | cat1 | PNNFUI4L19PAZZ9J68TXHSENP65CYFJJ | YNFLFMT3JHJVZ8D3K0299PIEK1RLXU8M | 0.654103 | 0.035013 | 94 |


Print the shape of the dataframe

In [4]:
df.shape

(2102401, 7)

Print memory usage of the dataframe 

In [5]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2102401 entries, 0 to 2102400
Data columns (total 7 columns):
 #   Column  Dtype         
---  ------  -----         
 0   date    datetime64[ns]
 1   cat     category      
 2   str1    object        
 3   str2    object        
 4   a       float64       
 5   b       float64       
 6   c       int64         
dtypes: category(1), datetime64[ns](1), float64(2), int64(1), object(2)
memory usage: 423.1 MB
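
To see which columns account for the memory, `DataFrame.memory_usage(deep=True)` reports a per-column figure. The sketch below uses a small made-up frame (the `demo` name and its columns are illustrative, not part of the benchmark) to compare the same labels stored as `object` and as `category`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = rng.choice(["cat1", "cat2", "cat3", "cat4", "cat5"], 10_000)

# Same values, two storage strategies: one Python string per row versus
# small integer codes pointing into a table of five categories.
demo = pd.DataFrame(
    {
        "cat_object": pd.Series(labels, dtype=object),
        "cat_category": pd.Series(labels, dtype="category"),
    }
)

# deep=True inspects the Python objects behind object columns.
per_column = demo.memory_usage(deep=True, index=False)
print(per_column)
```

This is one reason the notebook converts `cat` to a categorical column before benchmarking.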


## Write and read the dataframe in CSV format 

First, we time writing and reading the dataframe in CSV format. This gives a baseline against which to compare the pyarrow-based formats, Parquet and Feather.

Measure the time and memory needed to write the dataframe

In [6]:
# TODO

In [7]:
%%time
df.to_csv("pandas.csv",index=False)

CPU times: user 11.3 s, sys: 488 ms, total: 11.8 s
Wall time: 11.9 s


In [8]:
%%memit
df.to_csv("pandas.csv",index=False)

peak memory: 916.26 MiB, increment: 2.33 MiB


Check the size of the file on disk (in MiB)

In [9]:
# TODO

In [10]:
pathlib.Path("pandas.csv").stat().st_size / 1024 /1024

265.5627746582031

Measure the time and the memory to read the dataframe

In [11]:
# TODO

In [12]:
%%time
df1 = pd.read_csv("pandas.csv")

CPU times: user 4.76 s, sys: 373 ms, total: 5.13 s
Wall time: 5.15 s


In [13]:
%%memit
df1 = pd.read_csv("pandas.csv")

peak memory: 2292.89 MiB, increment: 767.73 MiB
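
CSV stores plain text, so column types are not preserved: unless told otherwise, `read_csv` does not restore the `datetime64` and `category` dtypes. The sketch below illustrates this on a tiny made-up frame, using the `parse_dates` and `dtype` parameters of `read_csv` to restore them. It is an illustration only and was not part of the measurements above.

```python
import io

import pandas as pd

original = pd.DataFrame(
    {
        "date": pd.date_range("2020-01-01", periods=3, freq="1min"),
        "cat": pd.Categorical(["cat1", "cat2", "cat1"]),
        "a": [0.1, 0.2, 0.3],
    }
)
csv_text = original.to_csv(index=False)

# Default read: dates and categories come back as plain strings.
plain = pd.read_csv(io.StringIO(csv_text))

# Explicit read: ask pandas to rebuild the original dtypes.
typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    dtype={"cat": "category"},
)

print(plain.dtypes)
print(typed.dtypes)
```

When comparing formats, it is worth keeping in mind that the loaded frames may therefore differ in dtypes, and hence in memory footprint, from the frame that was saved.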


## Write and read the dataframe in Parquet format

Measure the time and memory needed to write the dataframe

In [14]:
# TODO

In [15]:
%%time
df.to_parquet("pandas.parquet",index=False)

CPU times: user 1.04 s, sys: 610 ms, total: 1.65 s
Wall time: 1.58 s


In [16]:
%%memit
df.to_parquet("pandas.parquet",index=False)

peak memory: 2032.72 MiB, increment: 286.93 MiB


Check the size of the file on disk (in MiB)

In [17]:
# TODO

In [18]:
pathlib.Path("pandas.parquet").stat().st_size / 1024 /1024

187.54562950134277

Measure the time and the memory to read the dataframe

In [19]:
# TODO

In [20]:
%%time
df1 = pd.read_parquet("pandas.parquet")

CPU times: user 1.43 s, sys: 1.67 s, total: 3.1 s
Wall time: 2.29 s


In [21]:
%%memit
df1 = pd.read_parquet("pandas.parquet")

peak memory: 2650.42 MiB, increment: 626.09 MiB


## Write and read the dataframe in Feather format

Measure the time and memory needed to write the dataframe

In [22]:
# TODO

In [25]:
%%time
df.to_feather("pandas.feather")

CPU times: user 673 ms, sys: 544 ms, total: 1.22 s
Wall time: 1.08 s


In [26]:
%%memit
df.to_feather("pandas.feather")

peak memory: 1924.81 MiB, increment: 290.80 MiB


Check the size of the file on disk (in MiB)

In [27]:
# TODO

In [28]:
pathlib.Path("pandas.feather").stat().st_size / 1024 /1024

199.90581703186035

Measure the time and the memory to read the dataframe

In [29]:
# TODO

In [30]:
%%time
df1 = pd.read_feather("pandas.feather")

CPU times: user 1.02 s, sys: 979 ms, total: 2 s
Wall time: 1.67 s


In [31]:
%%memit
df1 = pd.read_feather("pandas.feather")

peak memory: 2695.30 MiB, increment: 469.24 MiB


## More exercises

### 1. Plot the benchmark results
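
One possible starting point, offered as a sketch rather than a reference solution: collect the figures printed by the cells above (wall times in seconds, sizes and memory increments in MiB, rounded) into a data frame, then draw one bar chart per metric. The `plot_results` helper is a hypothetical name and assumes matplotlib is installed.

```python
import pandas as pd

# Values copied from the cell outputs above (rounded).
results = pd.DataFrame(
    {
        "size_mb": [265.56, 187.55, 199.91],
        "save_time": [11.9, 1.58, 1.08],
        "load_time": [5.15, 2.29, 1.67],
        "save_ram_delta_mb": [2.33, 286.93, 290.80],
        "load_ram_delta_mb": [767.73, 626.09, 469.24],
    },
    index=["CSV", "Parquet", "Feather"],
)


def plot_results(results):
    """Draw one bar chart per metric (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt

    n_metrics = len(results.columns)
    fig, axes = plt.subplots(1, n_metrics, figsize=(4 * n_metrics, 3))
    for ax, metric in zip(axes, results.columns):
        results[metric].plot.bar(ax=ax, title=metric, rot=0)
    fig.tight_layout()
    return fig


print(results)
```

Calling `plot_results(results)` in a notebook cell displays the figure. Separate axes are used because the metrics have different units and scales.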

### 2. Analyze the influence of compression on reading and writing
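
As a hedged starting point, the sketch below loops over a few codecs supported by the `compression` parameter of `to_csv` and `read_csv`, and records file size and timings on a small made-up frame. The frame, the codec list, and the `compression_results` name are illustrative choices. Timings on such a small sample are noisy and will not match the full dataset.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sample = pd.DataFrame(
    {"a": rng.random(20_000), "c": rng.integers(1, 100, 20_000)}
)

rows = []
with tempfile.TemporaryDirectory() as tmp:
    # None means "no compression"; the other codecs use the standard library.
    for codec in [None, "gzip", "bz2", "xz"]:
        path = Path(tmp) / f"sample_{codec or 'none'}.csv"

        start = time.perf_counter()
        sample.to_csv(path, index=False, compression=codec)
        save_time = time.perf_counter() - start

        start = time.perf_counter()
        loaded = pd.read_csv(path, compression=codec)
        load_time = time.perf_counter() - start

        rows.append(
            {
                "codec": codec or "none",
                "size_mb": path.stat().st_size / 1024 / 1024,
                "save_time": save_time,
                "load_time": load_time,
                "rows": len(loaded),
            }
        )

compression_results = pd.DataFrame(rows).set_index("codec")
print(compression_results)
```

`DataFrame.to_parquet` also accepts a `compression` argument (the default is `"snappy"`), so the same loop could be adapted to the Parquet file by swapping the save and load calls.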