## Feather

In [2]:
import pandas as pd

#### Read to Pandas

In [3]:
output_directory = "../data"

In [4]:
df = pd.read_csv(f"{output_directory}/combined_data.csv", parse_dates=True)


#### Write to Feather

In [5]:
%%time

df.to_feather(f"{output_directory}/combined_data.feather")

CPU times: user 7.82 s, sys: 6.87 s, total: 14.7 s
Wall time: 11.3 s


In [6]:
%%sh
du -sh "../data/combined_data.csv"
du -sh "../data/combined_data.feather"


5.6G	../data/combined_data.csv
1.2G	../data/combined_data.feather


#### Read Feather File

In [7]:
%%time

df = pd.read_feather(f"{output_directory}/combined_data.feather")
print(df["model"].value_counts())

MPI-ESM1-2-HR       5154240
CMCC-CM2-HR4        3541230
CMCC-ESM2           3541230
CMCC-CM2-SR5        3541230
NorESM2-MM          3541230
TaiESM1             3541230
SAM0-UNICON         3541153
GFDL-ESM4           3219300
FGOALS-f3-L         3219300
GFDL-CM4            3219300
MRI-ESM2-0          3037320
EC-Earth3-Veg-LR    3037320
BCC-CSM2-MR         3035340
MIROC6              2070900
ACCESS-CM2          1932840
ACCESS-ESM1-5       1610700
INM-CM4-8           1609650
INM-CM5-0           1609650
FGOALS-g3           1287720
KIOST-ESM           1287720
AWI-ESM-1-1-LR       966420
MPI-ESM1-2-LR        966420
NESM3                966420
MPI-ESM-1-2-HAM      966420
NorESM2-LM           919800
BCC-ESM1             551880
CanESM5              551880
Name: model, dtype: int64
CPU times: user 8.53 s, sys: 8.54 s, total: 17.1 s
Wall time: 13.6 s


### Discussion

- According to the lecture notes:
> "feather is how to store arrow in-memory data to disk, and feather was initially developed as a file format to exchange data between R and python quickly. People use feather for short term storage in situations like dataframe transfers. But the use of feather as long-term storage is not recommended (don’t know if things will change in the future), and file formats like parquet are still considered defacto for efficient long-term file storage."

- `to_feather` expects a default index and does not support a `pandas.core.indexes.datetimes.DatetimeIndex` as the index. We therefore need to reset the index before writing, and we can set it again after loading.

- Writing Feather was faster than writing Parquet: on my laptop, `to_parquet` took 18.3 s, whereas `to_feather` took 11.3 s.

- Parquet is more space efficient on disk: the Parquet file took 544M, whereas the Feather file took 1.2G (both down from 5.6G for the CSV).

- Reading the file and counting the rows per model took about the same time for both formats (13.6 s wall time for Feather).

Conclusion: if we care more about write speed, we should use Feather; if we care more about file size on disk, we should use Parquet.