## Data Storage

These are the ways I have found to store data so that the files are as light as possible.

### A. CSV (with or without GZIP)

- This is very slow, but it works out of the box.
- I see no reason to use it except for quickly inspecting the data when it is small.

In [79]:
df.to_csv('database.csv', compression='gzip', index=True, index_label='index')
pd.read_csv('database.csv', compression='gzip')

   index  one  two
0      2  1.0  1.0
1      6  2.0  2.0
2      7  3.0  3.0
3      9  4.0  4.0
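
Note that the plain `pd.read_csv` call returns the saved index as an ordinary column next to a fresh 0..3 index. A minimal sketch of a round trip that restores it, assuming the same `index_label='index'` as in the cell above (the temporary path is only there for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.],
                   'two': [1., 2., 3., 4.]}, index=[2, 6, 7, 9])

# Temporary location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'database.csv.gz')

# Write the index as a named column, gzip-compressed.
df.to_csv(path, compression='gzip', index=True, index_label='index')

# index_col tells pandas to use that column as the index again.
restored = pd.read_csv(path, compression='gzip', index_col='index')
print(restored)
```

The only change on the reading side is `index_col`; without it the index comes back as data.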


### B. NPZ

This one saves one or more arrays (multidimensional ones included) into a single compressed file.
- It's the best solution I know of so far.
- **It works for both dataframes and series.**
- It ignores indexes (and column names, since only the underlying array is saved).

In [83]:
np.savez_compressed('df.npz', df=df)
np.load('df.npz')['df']

array([[1., 1.],
       [2., 2.],
       [3., 3.],
       [4., 4.]])

In [84]:
np.savez_compressed('ts.npz', ts=ts)
np.load('ts.npz')['ts']

array([1., 2., 3., 4.])
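
If the index and column names matter, one possible workaround is to store them as extra arrays in the same `.npz` file. This is only a sketch: the key names `values`, `index` and `columns` are my own choice, and it assumes a dataframe whose columns all share one numeric dtype.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3., 4.],
                   'two': [1., 2., 3., 4.]}, index=[2, 6, 7, 9])

path = os.path.join(tempfile.mkdtemp(), 'df.npz')

# Store values, index and column names as three separate arrays.
# The column names are saved as a fixed-width string array so that
# loading does not require pickle.
np.savez_compressed(path,
                    values=df.to_numpy(),
                    index=df.index.to_numpy(),
                    columns=np.array(df.columns, dtype=str))

# Rebuild the dataframe from the three arrays.
with np.load(path) as archive:
    restored = pd.DataFrame(archive['values'],
                            index=archive['index'],
                            columns=archive['columns'])
print(restored)
```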

### C. PARQUET

It has lots of dependencies (pandas needs `pyarrow` or `fastparquet` installed), but once you have them, it outperforms npz.
- **It only works with dataframes**.
- In general, it produces the smallest files.
- It keeps the index.
- It's the fastest of the three options.
- **It keeps the column names**.

In [85]:
df.to_parquet('database.parquet', compression='gzip')
pd.read_parquet('database.parquet')

   one  two
2  1.0  1.0
6  2.0  2.0
7  3.0  3.0
9  4.0  4.0
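
Which format is lightest depends on the data, so it can be worth measuring on your own dataframe. The sketch below writes the same random sample (hypothetical data, not from any competition) in the three formats and prints the file sizes. The Parquet step is skipped when no Parquet engine is installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical sample data; replace with your own dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(10_000, 4)).astype('float64'),
                  columns=['a', 'b', 'c', 'd'])

tmp = tempfile.mkdtemp()
paths = {name: os.path.join(tmp, name)
         for name in ('data.csv.gz', 'data.npz', 'data.parquet')}

df.to_csv(paths['data.csv.gz'], compression='gzip')
np.savez_compressed(paths['data.npz'], df=df.to_numpy())
try:
    # Needs pyarrow or fastparquet; skipped if neither is installed.
    df.to_parquet(paths['data.parquet'], compression='gzip')
except ImportError:
    del paths['data.parquet']

# Size on disk of every file that was actually written, smallest first.
sizes = {name: os.path.getsize(p) for name, p in paths.items()}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f'{name}: {size} bytes')
```

No expected output is shown because the ranking changes with the data, the dtypes and the library versions.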


## Data Loading

Competitors usually **store their data after preprocessing and feature engineering**, but the result can be too big, and kernels may restart often, so they rely on the following tricks.

hdf5 and npy files load faster than csv/txt files.

### A. Data is stored in 64-bit arrays. Downcast it to 32 bits

In [2]:
import pandas as pd
import numpy as np

def downcast_dtypes(df):
    '''
        Changes column types in the dataframe (in place):
                `float64` type to `float32`
                `int64`   type to `int32`
        Make sure int64 values fit in the int32 range before casting.
    '''
    # Select columns to downcast
    float_cols = [c for c in df if df[c].dtype == "float64"]
    int_cols =   [c for c in df if df[c].dtype == "int64"]
    # Downcast
    df[float_cols] = df[float_cols].astype(np.float32)
    df[int_cols]   = df[int_cols].astype(np.int32)
    return df

data = {'col1' : [1., 2., 3., 4.], 'col2' : [4., 3., 2., 1.]}
df = pd.DataFrame(data)
print('Memory usage before {}'.format(df.memory_usage(index=True).sum()))
downcast_dtypes(df)
print('Memory usage after {}'.format(df.memory_usage(index=True).sum()))

Memory usage before 144
Memory usage after 112
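
`downcast_dtypes` always targets 32 bits, whether or not the values fit. As an alternative sketch, `pd.to_numeric` with its `downcast` argument picks the smallest dtype that can hold each column. The function name `downcast_safely` and the sample columns are my own, for illustration only.

```python
import numpy as np
import pandas as pd

def downcast_safely(df):
    """Return a copy with int64/float64 columns shrunk to the smallest dtype that fits."""
    out = df.copy()
    for col in out.select_dtypes(include=['int64']).columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include=['float64']).columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out

# 'big' holds a value that does not fit in int32, so it must stay int64.
df = pd.DataFrame({'small': [1, 2, 3],
                   'big': [1, 2, 2**40],
                   'x': [0.5, 1.5, 2.5]})
small_df = downcast_safely(df)
print(small_df.dtypes)
```

Here `small` becomes `int8`, `big` stays `int64`, and `x` becomes `float32` (the smallest float type `to_numeric` will use). The original `df` is left untouched because the function works on a copy.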


### B. Converting csv/txt to hdf5 (for pandas dataframes) or npy (for NumPy arrays) for faster loading

- **HDF5 (Hierarchical Data Format):** Use it only for large datasets that don't fit in memory. The files are heavier than those of the other formats.
- **NPY:** NumPy binary file. I'm not sure whether it can be read in chunks as HDF can (TODO).
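
On the open question about NPY: `np.load` accepts a `mmap_mode` argument for plain `.npy` files, which memory-maps the file so that only the slices you touch are read. A minimal sketch of chunk-wise processing built on that, with a hypothetical file name and chunk size:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.arange(1_000_000, dtype=np.float32))

# mmap_mode='r' maps the file instead of reading it all into memory.
arr = np.load(path, mmap_mode='r')

chunk_size = 250_000  # hypothetical chunk size, tune to your memory budget
total = 0.0
for start in range(0, arr.shape[0], chunk_size):
    # Only this slice is materialised as a regular in-memory array.
    chunk = np.asarray(arr[start:start + chunk_size])
    total += float(chunk.sum(dtype=np.float64))

print(total)
```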

In [54]:
data = {'one' : [1., 2., 3., 4.],
        'two' : [1., 2., 3., 4.]}
df = pd.DataFrame(data, index=[2,6,7,9])

data = {'one' : [3., 4.],
        'two' : [3., 4.]}
df2 = pd.DataFrame(data)

# WRITE
df.to_hdf('database.h5', key='df', mode='a')
df2.to_hdf('database.h5', key='df2', mode='a')

# READ
df_ = pd.read_hdf('database.h5', 'df')
df2_ = pd.read_hdf('database.h5', 'df2')
print(df_)
print(df2_)

   one  two
2  1.0  1.0
6  2.0  2.0
7  3.0  3.0
9  4.0  4.0
   one  two
0  3.0  3.0
1  4.0  4.0


In [86]:
data = [1., 2., 3., 4.]
ts = pd.Series(data)

# WRITE
np.save(arr=ts, file='ts.npy')
# READ
ts_ = np.load('ts.npy')
ts_

array([1., 2., 3., 4.])

### C. Large datasets can be processed in chunks (TODO)
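
For CSV input, one common way to do this is the `chunksize` argument of `pd.read_csv`, which returns an iterator of dataframes instead of one big dataframe. A minimal sketch, using a tiny made-up file and a chunk size of 4 so the behaviour is easy to follow:

```python
import os
import tempfile

import pandas as pd

# Small made-up file standing in for a dataset that is too big for memory.
path = os.path.join(tempfile.mkdtemp(), 'big.csv')
pd.DataFrame({'one': range(10), 'two': range(10, 20)}).to_csv(path, index=False)

total = 0
n_chunks = 0
# With chunksize set, read_csv yields dataframes of at most 4 rows each.
for chunk in pd.read_csv(path, chunksize=4):
    total += int(chunk['two'].sum())
    n_chunks += 1

print(total, n_chunks)  # 145 3
```

Each chunk is an ordinary dataframe, so any per-chunk aggregation or filtering can go inside the loop.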