<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#df.memory_usage():--Column-memory" data-toc-modified-id="df.memory_usage():--Column-memory-1"><span class="toc-item-num">1&nbsp;&nbsp;</span><code>df.memory_usage()</code>:  Column memory</a></span></li><li><span><a href="#Sparse-Dataframes" data-toc-modified-id="Sparse-Dataframes-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sparse Dataframes</a></span><ul class="toc-item"><li><span><a href="#pd.arrays.SparseArray:-Convert-a-column-to-a-sparse-column" data-toc-modified-id="pd.arrays.SparseArray:-Convert-a-column-to-a-sparse-column-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><code>pd.arrays.SparseArray</code>: Convert a column to a sparse column</a></span></li><li><span><a href="#pd.to_numeric:-Downcasting-dtypes" data-toc-modified-id="pd.to_numeric:-Downcasting-dtypes-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span><code>pd.to_numeric</code>: Downcasting dtypes</a></span></li><li><span><a href="#sparsify:-Cast-columns-as-sparse-arrays" data-toc-modified-id="sparsify:-Cast-columns-as-sparse-arrays-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span><code>sparsify</code>: Cast columns as sparse arrays</a></span></li><li><span><a href="#Convert-DataFrame-to-sparse-matrix" data-toc-modified-id="Convert-DataFrame-to-sparse-matrix-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Convert DataFrame to sparse matrix</a></span></li></ul></li></ul></div>

# `df.memory_usage()`:  Column memory

Pandas provides `df.memory_usage()`, which returns the number of bytes used by each column of a dataframe (plus its index). We can use it to build a `df_MB` function that reports the overall memory usage of the dataframe by summing over all columns.



In [135]:

def df_MB(df,str=True):
    BYTES_TO_MB_DIV = 0.000001
    mem = round(df.memory_usage().sum() * BYTES_TO_MB_DIV, 2) 
    if str:
        return f"{mem} MB"
    else:
        return mem
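
As a quick sanity check of the per-column numbers, the sketch below builds a small synthetic frame (not the MovieLens data; the column names `a` and `b` are made up for illustration) and sums `memory_usage` by hand. Each `int64` value takes 8 bytes, so two columns of 1000 rows should add up to 16000 bytes.

```python
import numpy as np
import pandas as pd

# Small synthetic frame: two int64 columns, 1000 rows each.
toy = pd.DataFrame({
    "a": np.arange(1000, dtype=np.int64),
    "b": np.zeros(1000, dtype=np.int64),
})

per_column = toy.memory_usage(index=False)  # bytes per column, index excluded
total_bytes = int(per_column.sum())
total_mb = total_bytes / 1_000_000          # decimal megabytes, as in df_MB

print(per_column)
print(total_bytes, "bytes =", total_mb, "MB")
```

Passing `index=False` leaves out the index, which `df_MB` includes; for a default `RangeIndex` the difference is small.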


In this notebook we will use the movielens dataset.

`https://grouplens.org/datasets/movielens/100k/`



In [138]:
import pandas as pd
from os.path import expanduser

home = expanduser("~")
path_datasets = f"{home}/Documents/Datasets/movielens"
df = pd.read_csv(f'{path_datasets}/ml-100k/u.data', sep="\t", header=None)
df.columns = ["user_id", "item_id", "rating", "timestamp"]
df.drop(["timestamp"], axis=1, inplace=True)
display(df.head())

|   | user_id | item_id | rating |
|---|---------|---------|--------|
| 0 | 196 | 242 | 3 |
| 1 | 186 | 302 | 3 |
| 2 | 22 | 377 | 1 |
| 3 | 244 | 51 | 2 |
| 4 | 166 | 346 | 1 |


In [142]:
df.memory_usage()

Index         128
user_id    800000
item_id    800000
rating     800000
dtype: int64

The overall memory usage in MB is

In [141]:
df_MB(df)

'2.4 MB'

# Sparse Dataframes


Sparse datasets can use a lot of memory if not properly processed. In some cases, non-sparse approaches become infeasible. Moreover, using sparse structures can reduce processing time, not just memory usage.




A classical transformation applied to the categorical values in `user_id` and `item_id` is one-hot encoding.

In [611]:
df_onehot = pd.get_dummies(df, columns=['user_id', 'item_id']) 
display(df_onehot.head())

Unnamed: 0,rating,user_id_1,user_id_2,user_id_3,user_id_4,user_id_5,user_id_6,user_id_7,user_id_8,user_id_9,...,item_id_1673,item_id_1674,item_id_1675,item_id_1676,item_id_1677,item_id_1678,item_id_1679,item_id_1680,item_id_1681,item_id_1682
0,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The issue with this dataframe is that the memory needed has increased substantially.

In [612]:
print("df mem: ", df_MB(df))
print("df_onehot mem: ", df_MB(df_onehot))
print("memory increase:", df_MB(df_onehot,str=False)/df_MB(df,str=False))

df mem:  2.4 MB
df_onehot mem:  263.3 MB
memory increase: 109.70833333333334


The amount of memory increases by a factor of about 110, even though most entries are 0.
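
To make "most entries are 0" concrete, here is a minimal sketch on synthetic data (50 hypothetical users and 40 hypothetical items, not the MovieLens file). One-hot encoding two categorical columns leaves exactly two nonzero entries per row, so the fraction of nonzero entries is 2 divided by the number of dummy columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "user_id": rng.integers(1, 51, size=1000),  # 50 hypothetical users
    "item_id": rng.integers(1, 41, size=1000),  # 40 hypothetical items
})

# dtype is set explicitly because the default differs across pandas versions.
toy_onehot = pd.get_dummies(toy, columns=["user_id", "item_id"], dtype=np.uint8)

values = toy_onehot.to_numpy()
nonzero_per_row = (values != 0).sum(axis=1)  # one user flag + one item flag
density = (values != 0).mean()               # fraction of nonzero entries

print("nonzeros per row:", set(nonzero_per_row.tolist()))
print("fraction of nonzero entries:", density)
```

The more distinct users and items there are, the lower this fraction gets, which is why the dense one-hot frame wastes so much memory.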

## `pd.arrays.SparseArray`: Convert a column to a sparse column

We create a `SparseArray` object with the function `pd.arrays.SparseArray`.

In [146]:
import numpy as np

maxval = 100
X = np.random.randint(0, maxval, 1000)  # integers in [0, maxval)
X_sp = pd.arrays.SparseArray(X)
type(X_sp)

pandas.core.arrays.sparse.array.SparseArray

Note that this object can be used to create a DataFrame

In [147]:
pd.DataFrame(X_sp).head()

|   | 0 |
|---|---|
| 0 | 60 |
| 1 | 79 |
| 2 | 55 |
| 3 | 58 |
| 4 | 38 |


Nevertheless, this dataframe might be using more memory than needed because the element type of the nonzero elements might be larger than necessary. In this case the values in `X_sp` are below 100 (by construction). Therefore, we could use an `int8` to store any nonzero value.

In [201]:
pd.DataFrame(X_sp).dtypes

0    Sparse[int64, 0]
dtype: object
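
To see what a `SparseArray` actually stores, the following sketch uses a synthetic array with only 10 nonzero entries out of 1000 (the values are made up for illustration). It inspects the `density`, `npoints` and `nbytes` attributes and compares the footprint with that of the dense numpy array.

```python
import numpy as np
import pandas as pd

dense = np.zeros(1000, dtype=np.int64)
dense[::100] = 7                    # 10 nonzero entries out of 1000

sp = pd.arrays.SparseArray(dense)   # the fill value for integers defaults to 0

print(sp.dtype)                     # element dtype and fill value
print(sp.npoints, sp.density)       # stored entries and their fraction
print(dense.nbytes, "bytes dense vs", sp.nbytes, "bytes sparse")
```

Only the non-fill values and their positions are stored, so the saving depends on how many entries equal the fill value. An array like `X` above, where almost every entry is nonzero, gains little or nothing from this representation.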

The function `np.iinfo` returns the minimum and maximum values that an integer `dtype` can represent in numpy.

In [215]:
display(np.iinfo(np.int8))
display(np.iinfo(np.uint8))
display(np.iinfo(np.int16))
display(np.iinfo(np.uint16))

iinfo(min=-128, max=127, dtype=int8)

iinfo(min=0, max=255, dtype=uint8)

iinfo(min=-32768, max=32767, dtype=int16)

iinfo(min=0, max=65535, dtype=uint16)
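
The ranges above can be turned into a small lookup. The helper below, `smallest_int_dtype`, is a hypothetical name introduced only for illustration: it walks through candidate dtypes from narrowest to widest and returns the first one whose `np.iinfo` range contains the requested interval.

```python
import numpy as np

# Candidates ordered by width; at each width the signed type is tried first.
CANDIDATES = [np.int8, np.uint8, np.int16, np.uint16,
              np.int32, np.uint32, np.int64, np.uint64]

def smallest_int_dtype(lo, hi):
    """Return the first candidate dtype whose range contains [lo, hi]."""
    for dt in CANDIDATES:
        info = np.iinfo(dt)
        if info.min <= lo and hi <= info.max:
            return dt
    raise ValueError(f"no integer dtype can hold [{lo}, {hi}]")

print(smallest_int_dtype(0, 99))      # fits in int8
print(smallest_int_dtype(0, 200))     # needs uint8
print(smallest_int_dtype(-1, 200))    # negative value forces int16
print(smallest_int_dtype(0, 40000))   # needs uint16
```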

## `pd.to_numeric`: Downcasting dtypes 

The function `pd.to_numeric` can cast a numpy array to the dtype that uses the least memory while still representing every element in the array.

We can use `X_downcast = pd.to_numeric(X, downcast='signed')` to generate a numpy array
`X_downcast` whose `dtype` is the smallest signed integer type that can represent all elements in `X`.

The argument `downcast` can take values in `['integer', 'signed', 'unsigned', 'float']`.



- 'integer' or 'signed': smallest signed int dtype (min.: np.int8)

- 'unsigned' : smallest unsigned int dtype (min.: np.uint8)

- 'float': smallest float dtype (min.: np.float32)
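
A short sketch of the options listed above, run on small hand-made arrays (the values are arbitrary). Note that `downcast='unsigned'` leaves the dtype unchanged when the data contains negative numbers.

```python
import numpy as np
import pandas as pd

ints = np.array([0, 50, 99], dtype=np.int64)
mixed = np.array([-1, 300], dtype=np.int64)
floats = np.array([1.5, 2.5], dtype=np.float64)

print(pd.to_numeric(ints, downcast="signed").dtype)     # int8
print(pd.to_numeric(ints, downcast="unsigned").dtype)   # uint8
print(pd.to_numeric(mixed, downcast="signed").dtype)    # int16
print(pd.to_numeric(mixed, downcast="unsigned").dtype)  # int64, negatives block the downcast
print(pd.to_numeric(floats, downcast="float").dtype)    # float32
```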




If `X` is made of integers between 0 and 100, then `np.int8` or `np.uint8` is enough, since these datatypes can hold values up to 127 and 255 respectively.

In [613]:
import numpy as np
maxval = 100
X = np.random.randint(0,maxval,1000)
X_downcast = pd.to_numeric(X, downcast='signed')
df_aux = pd.DataFrame(X_downcast)
df_aux.dtypes

0    int8
dtype: object


If `X` is made of integers between 0 and 10000, then `np.int16` or `np.uint16` is enough, since these datatypes can hold values up to 32767 and 65535 respectively.

In [701]:
import numpy as np
maxval = 10000
X = np.random.randint(0,maxval,10000)
df_aux = pd.DataFrame(pd.to_numeric(X,downcast='signed'))
df_aux.dtypes

0    int16
dtype: object

With this in mind, we can define a function `find_downcast_strategy` that, using the following sets
```
int_set_dtypes   = set([np.int8, np.int16, np.int32, np.int64])
uint_set_dtypes  = set([np.uint8, np.uint16, np.uint32, np.uint64])
float_set_dtypes = set([np.float16, np.float32, np.float64])
```

returns 


- 'signed' (equivalent to 'integer'): if the dtype is in `int_set_dtypes`

- 'unsigned': if the dtype is in `uint_set_dtypes` (or 'signed' when `avoid_unsigned=True`)

- 'float':  if the dtype is in `float_set_dtypes`



In [725]:
int_dtypes   = [np.int8, np.int16, np.int32, np.int64]
uint_dtypes  = [np.uint8, np.uint16, np.uint32, np.uint64]
float_dtypes = [np.float16, np.float32, np.float64]

    
def find_downcast_strategy(x, avoid_unsigned=False):
    """
    Return the correct downcast term depending on x.dtype
    """
    x_dtype = x.dtype
    if x_dtype in int_dtypes:
        return 'signed'
    elif x_dtype in uint_dtypes:
        if avoid_unsigned:
            return 'signed'
        else:
            return 'unsigned'
    elif x_dtype in float_dtypes:
        return 'float'

    

In [726]:
find_downcast_strategy(np.random.rand(10))

'float'
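
As an alternative to explicit dtype lists, the same decision can be sketched with `np.issubdtype`, which tests the dtype family directly. The function name `downcast_strategy_by_kind` is hypothetical, and this is a variant for comparison rather than the function defined above.

```python
import numpy as np

def downcast_strategy_by_kind(x, avoid_unsigned=False):
    """Hypothetical variant of find_downcast_strategy based on np.issubdtype."""
    if np.issubdtype(x.dtype, np.signedinteger):
        return "signed"
    if np.issubdtype(x.dtype, np.unsignedinteger):
        return "signed" if avoid_unsigned else "unsigned"
    if np.issubdtype(x.dtype, np.floating):
        return "float"
    return None  # e.g. bool or object arrays: no downcast keyword applies

print(downcast_strategy_by_kind(np.random.rand(10)))
print(downcast_strategy_by_kind(np.arange(3, dtype=np.uint8)))
print(downcast_strategy_by_kind(np.arange(3, dtype=np.uint8), avoid_unsigned=True))
```

Returning `None` for other dtypes is compatible with `pd.to_numeric`, whose `downcast` argument defaults to `None` (no downcasting).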

## `sparsify`: Cast columns as sparse arrays

Now we can determine the downcast strategy of each column and create a new dataframe that uses less memory

In [760]:
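def sparsify(df, exclude_cols=(), fixed_dtype=False, downcast=False):
    """
    Convert columns of `df` into SparseArrays.
    Columns listed in `exclude_cols` are left untouched.

    df:            pandas DataFrame
    exclude_cols:  iterable of column names to keep dense
    fixed_dtype:   numpy dtype applied to every sparse column, or False to
                   keep each column's dtype (ignored when downcast=True)
    downcast:      bool, downcast each column with pd.to_numeric first

    return: pandas DataFrame
    """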
    df = df.copy()
    exclude_cols = set(exclude_cols)
    sparse_cols = set(df.columns) - exclude_cols
    
    for col in sparse_cols:
        df_col = df[col].values

        if downcast:
            dtype = find_downcast_strategy(df_col)
            df_col = pd.to_numeric(df_col, downcast=dtype)
            df[col] = pd.arrays.SparseArray(df_col)
        else:
            if fixed_dtype:
                df[col] = pd.arrays.SparseArray(df_col, dtype=fixed_dtype)
            else:
                dtype = df[col].values.dtype
                df[col] = pd.arrays.SparseArray(df_col, dtype=dtype)
    return df

In [761]:
df_onehot_sp = sparsify(df_onehot,  exclude_cols=['rating'], downcast=True)
display(df_onehot_sp.dtypes)
print("df_onehot mem:", df_MB(df_onehot))
print("df_onehot_sp mem:", df_MB(df_onehot_sp))

rating                     int64
user_id_1       Sparse[uint8, 0]
user_id_2       Sparse[uint8, 0]
user_id_3       Sparse[uint8, 0]
user_id_4       Sparse[uint8, 0]
                      ...       
item_id_1678    Sparse[uint8, 0]
item_id_1679    Sparse[uint8, 0]
item_id_1680    Sparse[uint8, 0]
item_id_1681    Sparse[uint8, 0]
item_id_1682    Sparse[uint8, 0]
Length: 2626, dtype: object

df_onehot mem: 263.3 MB
df_onehot_sp mem: 1.8 MB


In [762]:
df_onehot_sp = sparsify(df_onehot,  exclude_cols=['rating'], downcast=False)
display(df_onehot_sp.dtypes)
print("df_onehot mem:", df_MB(df_onehot))
print("df_onehot_sp mem:", df_MB(df_onehot_sp))

rating                     int64
user_id_1       Sparse[uint8, 0]
user_id_2       Sparse[uint8, 0]
user_id_3       Sparse[uint8, 0]
user_id_4       Sparse[uint8, 0]
                      ...       
item_id_1678    Sparse[uint8, 0]
item_id_1679    Sparse[uint8, 0]
item_id_1680    Sparse[uint8, 0]
item_id_1681    Sparse[uint8, 0]
item_id_1682    Sparse[uint8, 0]
Length: 2626, dtype: object

df_onehot mem: 263.3 MB
df_onehot_sp mem: 1.8 MB


Sometimes we might want all columns to share the same element type.
We can use `fixed_dtype=np.int32` to ensure this. Since `exclude_cols` is not passed in this call, the `rating` column is converted as well.

In [763]:
df_onehot_sp = sparsify(df_onehot, fixed_dtype=np.int32)
display(df_onehot_sp.dtypes)
print("df_onehot mem:", df_MB(df_onehot))
print("df_onehot_sp mem:", df_MB(df_onehot_sp))

rating          Sparse[int32, 0]
user_id_1       Sparse[int32, 0]
user_id_2       Sparse[int32, 0]
user_id_3       Sparse[int32, 0]
user_id_4       Sparse[int32, 0]
                      ...       
item_id_1678    Sparse[int32, 0]
item_id_1679    Sparse[int32, 0]
item_id_1680    Sparse[int32, 0]
item_id_1681    Sparse[int32, 0]
item_id_1682    Sparse[int32, 0]
Length: 2626, dtype: object

df_onehot mem: 263.3 MB
df_onehot_sp mem: 2.4 MB


## Convert DataFrame to sparse matrix

This step matters when passing the data to other packages, such as scikit-learn, that accept scipy sparse matrices.

In [790]:
from scipy.sparse import lil_matrix
import numpy as np

def data_frame_to_scipy_sparse_matrix(df):
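    """
    Converts a pandas data frame to a sparse scipy csr_matrix.
    Every nonzero entry is stored as 1.0, so the result is a binary
    indicator matrix rather than a copy of the original values.
    :param df: pandas data frame
    :return: csr_matrix
    """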
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        ix = df[col] != 0
        arr[np.where(ix), i] = 1

    return arr.tocsr()

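def get_csr_memory_usage(matrix):
    BYTES_TO_MB_DIV = 0.000001  # same conversion factor as in df_MB
    # A CSR matrix stores three arrays: values, row pointers and column indices.
    mem = (matrix.data.nbytes + matrix.indptr.nbytes + matrix.indices.nbytes) * BYTES_TO_MB_DIV
    print("Memory usage is " + str(mem) + " MB")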

X_csr = data_frame_to_scipy_sparse_matrix(df_onehot_sp)
get_csr_memory_usage(X_csr)


Memory usage is 2.800004 MB
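
When every column of the frame is sparse, pandas also offers the `DataFrame.sparse.to_coo()` accessor, which returns a scipy COO matrix that can then be converted to CSR. The sketch below uses a tiny hand-made matrix for illustration. Unlike the loop-based function above, it keeps the original values instead of writing 1 for every nonzero entry.

```python
import numpy as np
import pandas as pd

dense = np.zeros((6, 4), dtype=np.uint8)
dense[0, 1] = dense[2, 3] = dense[5, 0] = 1

# Every column must be sparse for the .sparse accessor to be available.
df_sp = pd.DataFrame(dense).astype(pd.SparseDtype(np.uint8, 0))

X = df_sp.sparse.to_coo().tocsr()

# Same three arrays that get_csr_memory_usage adds up.
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(X.shape, X.nnz, csr_bytes, "bytes")
```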


In [807]:
df_X_onehot_sp = df_onehot_sp[df_onehot_sp.columns.difference(["rating"])]
df_y_onehot_sp = df_onehot_sp[["rating"]]

In [None]:
import sklearn
from sklearn import linear_model
m = linear_model.LogisticRegression(max_iter=1000)

Note that fitting the model directly on the dataframe takes a lot of time

In [865]:
n = 10000
m.fit(df_X_onehot_sp[0:n], df_y_onehot_sp[0:n].values.ravel())



LogisticRegression(max_iter=1000)

Nevertheless, converting the data to a `scipy.sparse` CSR matrix and fitting the model yields much faster training

In [866]:
n = 10000
X_csr = data_frame_to_scipy_sparse_matrix(df_X_onehot_sp[0:n])
y = df_y_onehot_sp[0:n].values.ravel()
m = linear_model.LogisticRegression(max_iter=1000)

In [867]:
m.fit(X_csr, y)

LogisticRegression(max_iter=1000)