## Efficient Memory Usage in Pandas

In [1]:
import numpy as np
import pandas as pd

## Creating our DataFrame

In [11]:
def get_data(size):
    """Build a DataFrame of `size` random rows with mixed column types."""
    df = pd.DataFrame()
    df['position'] = np.random.choice(['left', 'right', 'middle'], size)
    df['age'] = np.random.randint(22, 44, size)  # upper bound is exclusive: ages 22 to 43
    df['team'] = np.random.choice(['Hawks', 'Bulls', 'Demons', 'Truckers', 'Vultures'], size)
    df['win'] = np.random.choice(['Yes', 'No'], size)
    df['prob'] = np.random.uniform(0, 1, size)
    return df

In [19]:
df_10m = get_data(10_000_000)
df_10m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
 #   Column    Dtype  
---  ------    -----  
 0   position  object 
 1   age       int32  
 2   team      object 
 3   win       object 
 4   prob      float64
dtypes: float64(1), int32(1), object(3)
memory usage: 343.3+ MB


As `info()` shows, 10 million rows take up at least 343.3 MB of memory.

We can store the same data more efficiently.

To keep the experiments quick, we'll work with a smaller dataset of 1 million rows.

In [20]:
df_1m = get_data(1_000_000) 

In [21]:
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   position  1000000 non-null  object 
 1   age       1000000 non-null  int32  
 2   team      1000000 non-null  object 
 3   win       1000000 non-null  object 
 4   prob      1000000 non-null  float64
dtypes: float64(1), int32(1), object(3)
memory usage: 34.3+ MB


This takes about 34.3 MB, but we can still reduce it and make our code more efficient.
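The `+` after the figure reported by `info()` means the number is a lower bound: for `object` columns pandas counts only the references it stores, not the Python strings they point to. A minimal sketch of how the full footprint could be measured, using a small stand-in frame (the name `demo` and its size are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Small stand-in for df_1m; the name `demo` and the size are illustrative only.
demo = pd.DataFrame({
    'position': np.random.choice(['left', 'right', 'middle'], 1_000),
    'age': np.random.randint(22, 44, 1_000),
})

shallow = demo.memory_usage()         # per-column bytes, strings not inspected
deep = demo.memory_usage(deep=True)   # also counts the string objects themselves

print(shallow['position'], deep['position'])
demo.info(memory_usage='deep')        # the same deep measurement, reported by info()
```

Numeric columns report the same size either way; only string-like columns change.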

#### Performing some group-by operations on our data
Let's see how long they take to execute.

In [23]:
%timeit df_1m['age_rank'] = df_1m.groupby(['team','position'])['age'].rank()
%timeit df_1m['prob_rank'] = df_1m.groupby(['team','position'])['prob'].rank()
%timeit df_1m['win_prob_rank'] = df_1m.groupby(['team','position','win'])['prob'].rank()

465 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
676 ms ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
769 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
df_1m.head()

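|   | position | age | team | win | prob | age_rank | prob_rank | win_prob_rank |
|---|----------|-----|------|-----|------|----------|-----------|---------------|
| 0 | left | 40 | Vultures | No | 0.69498 | 56528.5 | 46792.0 | 23441.0 |
| 1 | right | 29 | Bulls | Yes | 0.674097 | 22777.0 | 44986.0 | 22753.0 |
| 2 | right | 32 | Hawks | Yes | 0.742484 | 31602.0 | 49483.0 | 24702.0 |
| 3 | middle | 28 | Demons | Yes | 0.244515 | 19862.5 | 16401.0 | 8183.0 |
| 4 | left | 27 | Truckers | Yes | 0.858566 | 16791.5 | 57277.0 | 28733.0 |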


In [28]:
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   position       1000000 non-null  object 
 1   age            1000000 non-null  int32  
 2   team           1000000 non-null  object 
 3   win            1000000 non-null  object 
 4   prob           1000000 non-null  float64
 5   age_rank       1000000 non-null  float64
 6   prob_rank      1000000 non-null  float64
 7   win_prob_rank  1000000 non-null  float64
dtypes: float64(4), int32(1), object(3)
memory usage: 57.2+ MB


- The position column is currently stored as generic Python objects (strings). It holds only three distinct values, so it can be cast to the category dtype. Let's do that first!

In [29]:
df_1m['position'] = df_1m['position'].astype('category')
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int32   
 2   team           1000000 non-null  object  
 3   win            1000000 non-null  object  
 4   prob           1000000 non-null  float64 
 5   age_rank       1000000 non-null  float64 
 6   prob_rank      1000000 non-null  float64 
 7   win_prob_rank  1000000 non-null  float64 
dtypes: category(1), float64(4), int32(1), object(2)
memory usage: 50.5+ MB


Memory usage dropped to around 50.5 MB.
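The saving comes from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code pointing at it. A rough sketch on an illustrative Series (the names `s` and `cat` are not from the notebook):

```python
import pandas as pd

# Illustrative data: 5,000 rows drawn from three labels.
s = pd.Series(['left', 'right', 'middle', 'left', 'right'] * 1000)
cat = s.astype('category')

print(cat.cat.categories)     # the distinct labels, stored once
print(cat.cat.codes.dtype)    # small integer dtype used for the per-row codes
print(cat.cat.codes.head())   # one code per row instead of one string per row

# Compare the full footprint of both representations.
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```

The fewer distinct values a column has relative to its length, the larger the saving; a column of mostly unique strings gains little from this conversion.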

The same conversion can be applied to the team column.

In [32]:
df_1m['team'] = df_1m['team'].astype('category')
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int32   
 2   team           1000000 non-null  category
 3   win            1000000 non-null  object  
 4   prob           1000000 non-null  float64 
 5   age_rank       1000000 non-null  float64 
 6   prob_rank      1000000 non-null  float64 
 7   win_prob_rank  1000000 non-null  float64 
dtypes: category(2), float64(4), int32(1), object(1)
memory usage: 43.9+ MB


Memory usage drops further, to about 43.9 MB.
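The group-by timings earlier were taken while the keys were still `object` columns. The sketch below shows one way to check that ranking on categorical keys gives the same result, so the operations can be re-timed with `%timeit` after the conversion. The frame `small` and the variables `as_obj` and `as_cat` are hypothetical names, and `observed=True` is passed so that only key combinations that actually occur are grouped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
# Illustrative frame; `small` is not part of the notebook.
small = pd.DataFrame({
    'team': rng.choice(['Hawks', 'Bulls', 'Demons'], n),
    'position': rng.choice(['left', 'right', 'middle'], n),
    'prob': rng.uniform(0, 1, n),
})

# Rank within each (team, position) group using the original string keys.
as_obj = small.groupby(['team', 'position'])['prob'].rank()

# Same ranking after converting both keys to category.
converted = small.astype({'team': 'category', 'position': 'category'})
as_cat = converted.groupby(['team', 'position'], observed=True)['prob'].rank()

print((as_obj == as_cat).all())
```

Grouping on categorical keys is often faster than grouping on strings, but the difference depends on the data and the pandas version, so it is worth measuring rather than assuming.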

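Next, we work on the age column, which holds integers. **By default**, pandas stores integers as **int64** values; here the column is already **int32** because that is the dtype `np.random.randint` returned on this machine.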

In [33]:
df_1m['age']

0         40
1         29
2         32
3         28
4         27
          ..
999995    41
999996    23
999997    37
999998    27
999999    38
Name: age, Length: 1000000, dtype: int32

### Int downcasting value ranges
- int8 can store values from -128 to 127
- int16 can store values from -32768 to 32767
- int32 can store values from -2147483648 to 2147483647
- int64 can store values from -9223372036854775808 to 9223372036854775807
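These limits do not need to be memorised: `np.iinfo` reports them, and `pd.to_numeric` with `downcast='integer'` picks the smallest integer dtype that can hold the data. A short sketch, using a made-up `ages` Series as a stand-in for the age column:

```python
import numpy as np
import pandas as pd

# Print the representable range of each signed integer type.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    limits = np.iinfo(dtype)
    print(dtype.__name__, limits.min, limits.max)

ages = pd.Series([22, 30, 43], dtype='int64')   # illustrative values

# Check the range before casting, so no value can fall outside int8.
int8_limits = np.iinfo(np.int8)
fits_int8 = int8_limits.min <= ages.min() and ages.max() <= int8_limits.max

# Let pandas choose the smallest integer dtype that fits the data.
smallest = pd.to_numeric(ages, downcast='integer')
print(fits_int8, smallest.dtype)
```

Checking the range first matters because a plain `astype` to a type that is too small does not preserve out-of-range values.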

In [35]:
# The age column is int32; ages 22 to 43 fit comfortably in int8, so downcast it
df_1m['age'] = df_1m['age'].astype('int8')
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int8    
 2   team           1000000 non-null  category
 3   win            1000000 non-null  object  
 4   prob           1000000 non-null  float64 
 5   age_rank       1000000 non-null  float64 
 6   prob_rank      1000000 non-null  float64 
 7   win_prob_rank  1000000 non-null  float64 
dtypes: category(2), float64(4), int8(1), object(1)
memory usage: 41.0+ MB


In [37]:
df_1m['prob']

0         0.694980
1         0.674097
2         0.742484
3         0.244515
4         0.858566
            ...   
999995    0.435224
999996    0.600249
999997    0.040977
999998    0.177359
999999    0.238080
Name: prob, Length: 1000000, dtype: float64

In [39]:
df_1m['prob'] = df_1m['prob'].astype('float32')
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int8    
 2   team           1000000 non-null  category
 3   win            1000000 non-null  object  
 4   prob           1000000 non-null  float32 
 5   age_rank       1000000 non-null  float64 
 6   prob_rank      1000000 non-null  float64 
 7   win_prob_rank  1000000 non-null  float64 
dtypes: category(2), float32(1), float64(3), int8(1), object(1)
memory usage: 37.2+ MB


In [41]:
df_1m['age_rank'] = df_1m['age_rank'].astype('float32')
df_1m['prob_rank'] = df_1m['prob_rank'].astype('float32')
df_1m['win_prob_rank'] = df_1m['win_prob_rank'].astype('float32')
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int8    
 2   team           1000000 non-null  category
 3   win            1000000 non-null  object  
 4   prob           1000000 non-null  float32 
 5   age_rank       1000000 non-null  float32 
 6   prob_rank      1000000 non-null  float32 
 7   win_prob_rank  1000000 non-null  float32 
dtypes: category(2), float32(4), int8(1), object(1)
memory usage: 25.7+ MB
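Halving the width of a float is not free: float32 keeps roughly 7 significant decimal digits, against about 15 to 16 for float64. The rank columns are safe here because `rank()` with its default averaging produces whole numbers or halves, which float32 represents exactly at this magnitude, while the `prob` values are rounded slightly. A small sketch of the trade-off (the sample values are illustrative):

```python
import numpy as np

# Approximate number of reliable decimal digits for each type.
print(np.finfo(np.float32).precision, np.finfo(np.float64).precision)

# A rank-like value (a half-integer of moderate size) survives the cast exactly.
exact = np.float32(56528.5)
print(float(exact) == 56528.5)

# Whole numbers above 2**24 can no longer all be represented.
big = np.float32(16_777_217)          # 2**24 + 1
print(float(big) == 16_777_217)

# A probability with many digits is rounded; the error is small but non-zero.
p = 0.694980123456789
print(abs(float(np.float32(p)) - p))
```

Whether that rounding is acceptable depends on what the column is used for downstream.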


### Casting bool types

In [43]:
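# Map 'Yes'/'No' to booleans. Run this cell only once: on a second run the
# values are already True/False, match no key in the dict, and all become NaN.
df_1m['win'] = df_1m['win'].map({'Yes': True, 'No': False})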
df_1m.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 8 columns):
 #   Column         Non-Null Count    Dtype   
---  ------         --------------    -----   
 0   position       1000000 non-null  category
 1   age            1000000 non-null  int8    
 2   team           1000000 non-null  category
 3   win            0 non-null        object  
 4   prob           1000000 non-null  float32 
 5   age_rank       1000000 non-null  float32 
 6   prob_rank      1000000 non-null  float32 
 7   win_prob_rank  1000000 non-null  float32 
dtypes: category(2), float32(4), int8(1), object(1)
memory usage: 25.7+ MB
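`Series.map` returns NaN for every value that is missing from the dictionary, so a conversion like this can be fragile when a cell is executed more than once. One possible guard is sketched below; `yes_no_to_bool` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def yes_no_to_bool(col):
    """Convert a 'Yes'/'No' column to bool (hypothetical helper, not part of pandas)."""
    if col.dtype == bool:
        return col                      # already converted, leave it untouched
    mapped = col.map({'Yes': True, 'No': False})
    if mapped.isna().any():
        # Fail loudly instead of silently filling the column with NaN.
        raise ValueError("column contains values other than 'Yes' and 'No'")
    return mapped.astype(bool)

win = pd.Series(['Yes', 'No', 'Yes'])   # illustrative data
once = yes_no_to_bool(win)
twice = yes_no_to_bool(once)            # a second call is a no-op
print(once.dtype, twice.equals(once))
```

Used as `df_1m['win'] = yes_no_to_bool(df_1m['win'])`, the cell could be re-run without changing the result.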


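- With all eight columns in place, the reported memory usage dropped from **57.2 MB** to **25.7 MB**, which is below the **34.3 MB** of the original five columns.
- In the last output, `win` shows 0 non-null values and is still `object`, which suggests the mapping cell was executed on a column that had already been converted, turning every value into NaN. Applied once to the original 'Yes'/'No' strings, the mapping produces a `bool` column that needs one byte per row.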
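The individual steps in this notebook could be collected into one function. The sketch below is one possible arrangement, not a definitive recipe: `shrink` and its `max_categories` threshold are hypothetical, and it assumes that float32 precision is acceptable for every float column.

```python
import numpy as np
import pandas as pd
from pandas.api import types

def shrink(df, max_categories=50):
    """Return a copy of df with smaller dtypes (hypothetical helper)."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if types.is_integer_dtype(col):
            # Smallest integer type that can hold the observed values.
            out[name] = pd.to_numeric(col, downcast='integer')
        elif types.is_float_dtype(col):
            # Assumption: about 7 significant digits are enough.
            out[name] = col.astype('float32')
        elif (types.is_object_dtype(col) or types.is_string_dtype(col)) \
                and col.nunique() <= max_categories:
            # Few distinct labels: store them once and keep integer codes per row.
            out[name] = col.astype('category')
    return out

# Illustrative frame with the same kinds of columns as the notebook's data.
frame = pd.DataFrame({
    'team': ['Hawks', 'Bulls', 'Demons', 'Hawks'] * 500,
    'age': np.arange(2000, dtype='int64') % 22 + 22,
    'prob': np.linspace(0, 1, 2000),
})
small = shrink(frame)
print(small.dtypes)
print(frame.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

The threshold for converting to category is a judgment call; columns with many distinct values are left as they are.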