# Pandas: File format benchmark

We will compare the read and write performance of several file formats:
 * CSV
 * Feather
 * Parquet

The metrics are:
 * size_mb: the size (in MB) of the file containing the serialized data frame
 * save_time: the time required to save the data frame to disk
 * load_time: the time required to load the previously saved data frame back into memory
 * save_ram_mb: the memory consumed while saving the data frame
 * load_ram_mb: the memory consumed while loading the data frame
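
As a rough illustration of how these five metrics could be collected outside of notebook magics, here is a minimal sketch using only the standard library. The helper names `measure` and `benchmark` are hypothetical, and `tracemalloc` is a stand-in for the `memory_profiler` extension loaded below: it only sees allocations made through Python's allocator, so memory used internally by native libraries may be under-reported, and tracing adds some overhead to the timings.

```python
import time
import tracemalloc
from pathlib import Path


def measure(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # Peak size of Python-tracked allocations since tracemalloc.start()
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 1e6


def benchmark(save, load, path):
    """Collect the five metrics for one file format.

    save(path) must write the file; load(path) must read it back.
    """
    path = Path(path)
    _, save_time, save_ram_mb = measure(save, path)
    size_mb = path.stat().st_size / 1e6
    _, load_time, load_ram_mb = measure(load, path)
    return {
        "size_mb": size_mb,
        "save_time": save_time,
        "load_time": load_time,
        "save_ram_mb": save_ram_mb,
        "load_ram_mb": load_ram_mb,
    }
```

With a data frame `df`, a call could look like `benchmark(lambda p: df.to_csv(p), lambda p: pd.read_csv(p), "df.csv")`, where the file name is only an example.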

In [None]:
import random
import string
import numpy as np
import pandas as pd
from datetime import datetime
import pathlib
%load_ext memory_profiler

Create a large dataset

In [None]:
%%time
def gen_random_string(length:int=32) -> str:
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=length))
    
def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):

    dt = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(dt)
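    np.random.seed(seed)  # seed NumPy's global RNG for reproducibility
    random.seed(seed)  # seed the stdlib RNG used by gen_random_string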
    columns = {
        'date': dt,
        'cat': np.random.choice(['cat1','cat2','cat3','cat4','cat5'],n),
        'str1':[gen_random_string() for _ in range(n)],
        'str2':[gen_random_string() for _ in range(n)],
        'a': np.random.rand(n),
        'b': np.random.rand(n),
        'c': np.random.randint(1,100,n),
    }

    df = pd.DataFrame(columns)
    if df.index[-1] == end:
        df = df.iloc[:-1]
    return df

df = make_timeseries(start=datetime(2020,1,1), end=datetime(2023,12,31), freq='1min', seed=10)
df["cat"] = df["cat"].astype("category")

Print the first rows to see what the data looks like.

In [None]:
df.head()

Print the shape of the dataframe

In [None]:
df.shape

Print memory usage of the dataframe 

In [None]:
df.info(memory_usage="deep")

## Write and read the dataframe in CSV format 

First, we will time reading and writing the dataframe in CSV format. This gives a baseline against which to compare the pyarrow-backed formats (Parquet and Feather).

Measure the time and the memory to write the dataframe

In [None]:
# TODO

Check the file on the disk

In [None]:
# TODO

Measure the time and the memory to read the dataframe

In [None]:
# TODO
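
One possible shape for the three CSV steps above is sketched here. It is not the reference solution: it uses `time.perf_counter` instead of the `%%time` and `%memit` magics so that it also runs outside a notebook, and the stand-in frame `df_small` and the file location are hypothetical. In the notebook you would use the large `df` instead.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical); in the notebook, use the large `df`.
df_small = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=1000, freq="1min"),
    "cat": pd.Categorical(np.random.choice(["cat1", "cat2"], 1000)),
    "a": np.random.rand(1000),
})

csv_path = Path(tempfile.mkdtemp()) / "benchmark.csv"  # hypothetical location

# Write: time the call to to_csv
start = time.perf_counter()
df_small.to_csv(csv_path, index=False)
save_time = time.perf_counter() - start

# Check the file on disk
size_mb = csv_path.stat().st_size / 1e6

# Read: dates must be parsed explicitly because CSV stores plain text
start = time.perf_counter()
df_csv = pd.read_csv(csv_path, parse_dates=["date"])
load_time = time.perf_counter() - start

print(f"size={size_mb:.3f} MB, save={save_time:.3f} s, load={load_time:.3f} s")
print(df_csv.dtypes)
```

CSV does not store column types, so the `cat` column comes back as plain strings rather than as a categorical, which is worth keeping in mind when comparing memory usage after loading.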

## Write and read the dataframe in Parquet format

Measure the time and the memory to write the dataframe

In [None]:
# TODO

Check the file on the disk

In [None]:
# TODO

Measure the time and the memory to read the dataframe

In [None]:
# TODO

## Write and read the dataframe in Feather format

Measure the time and the memory to write the dataframe

In [None]:
# TODO

Check the file on the disk

In [None]:
# TODO

Measure the time and the memory to read the dataframe

In [None]:
# TODO

## More exercises

### 1. Plot the benchmark results

### 2. Analyze the influence of compression on reading and writing