## 1. How much memory does a pandas DataFrame use?

In [1]:
import pandas as pd
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)

In [2]:
diamonds = pd.read_csv('diamonds.csv')
print(diamonds.head())

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75


##### Memory used by each column of the DataFrame

In [3]:
diamonds.memory_usage(deep=True)

Index          128
carat       431520
cut        3413674
color      3128520
clarity    3242590
depth       431520
table       431520
price       431520
x           431520
y           431520
z           431520
dtype: int64
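
The `deep=True` flag matters mainly for object (string) columns such as `cut`, `color` and `clarity`. The following is a minimal sketch on a small made-up frame (`df` is a hypothetical stand-in, not the diamonds data) that compares the shallow and deep counts. The string column is forced to `object` dtype here as an assumption, to match the dtype shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one object (string) column, one int64 column.
df = pd.DataFrame({
    "cut": pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair"] * 1000,
                     dtype=object),
    "price": np.arange(5000, dtype="int64"),
})

# Shallow: for object columns, only the array of references is counted.
shallow = df.memory_usage(deep=False)
# Deep: the Python string objects behind those references are counted too.
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)
```

For the numeric column both counts agree; only the object column grows when `deep=True` is used.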

##### Total memory used by the DataFrame

In [4]:
diamonds.memory_usage(deep=True).sum()

12805552
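
The total is reported in bytes; 12805552 bytes is roughly 12.21 MiB. A small helper can make such figures easier to read. This is only a sketch: `total_memory_mib` is a hypothetical name introduced here, and `df` is made-up data rather than the diamonds file.

```python
import pandas as pd


def total_memory_mib(df: pd.DataFrame) -> float:
    """Return the deep memory footprint of df in mebibytes (1 MiB = 1024**2 bytes)."""
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Hypothetical example frame.
df = pd.DataFrame({"a": range(1000), "b": [1.5] * 1000})
print(f"{total_memory_mib(df):.3f} MiB")

# df.info can print a similar summary, including per-column dtypes.
df.info(memory_usage="deep")
```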

In [10]:
import numpy as np

## 2. Save memory with pandas

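Let's try to reduce the memory usage of the `cut` column. First, let's check how much memory it uses now. Note that `Series.memory_usage` includes the index by default, so the result is 128 bytes higher than the per-column figure shown earlier.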

In [5]:
diamonds['cut'].memory_usage(deep=True)

3413802

#### 2.1 Saving memory using categories
As you can see below, the `cut` column contains only 5 unique values, which makes it a good candidate for the `category` dtype.

In [6]:
diamonds['cut'].unique()

array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)

##### See the difference in memory usage with Category

In [7]:
print(diamonds['cut'].memory_usage(deep=True))
print(diamonds['cut'].astype('category').memory_usage(deep=True))

3413802
54554
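
The saving comes from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code pointing at its label. A minimal sketch on made-up data (the `cuts` series below is a hypothetical stand-in for `diamonds['cut']`):

```python
import pandas as pd

# Hypothetical stand-in for the `cut` column, stored as Python strings.
cuts = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair"] * 10_000,
                 dtype=object, name="cut")
as_cat = cuts.astype("category")

print(as_cat.cat.categories)   # the unique labels, stored once
print(as_cat.cat.codes.dtype)  # dtype of the per-row integer codes
print(cuts.memory_usage(deep=True), as_cat.memory_usage(deep=True))
```

The benefit shrinks as the number of unique values approaches the number of rows, so this is mostly worth doing for low-cardinality columns.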


#### 2.2 Saving memory using smaller integer types

By default, pandas stores integer columns as `int64`, which is wider than many columns need.

In [8]:
diamonds['price'].min(), diamonds['price'].max()

(326, 18823)

The `price` column has values between 326 and 18823, which fit comfortably in `int16`.

In [11]:
np.iinfo('int16')

iinfo(min=-32768, max=32767, dtype=int16)

In [12]:
print('int16 -->', diamonds['price'].astype('int16').memory_usage(deep=True))
print('int64 -->', diamonds['price'].memory_usage(deep=True))

int16 --> 108008
int64 --> 431648
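
Since a plain `astype` cast may silently wrap values that do not fit the target type, it can be worth checking the range first or letting pandas pick the type. The sketch below uses a few made-up prices (the `prices` series is hypothetical, built from values seen in the output above) and shows both approaches.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of prices, stored as int64 like the original column.
prices = pd.Series([326, 327, 334, 335, 18823], dtype="int64", name="price")

# Option 1: check the range explicitly, then cast.
limits = np.iinfo(np.int16)
fits = limits.min <= prices.min() and prices.max() <= limits.max
small = prices.astype("int16") if fits else prices

# Option 2: let pandas choose the smallest signed integer dtype that fits.
auto = pd.to_numeric(prices, downcast="integer")

print(small.dtype, auto.dtype)
```

Both options keep the values unchanged; only the storage width differs.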

