<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/Reducing_Pandas_Memory_Usage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reducing Pandas Memory Usage

## Technique 1: Don’t load all the columns

Quite often the CSV you’re loading will include columns you don’t actually use when processing the data. If you don’t use them, there’s no point in loading them!

In the following example, I am loading the full CSV, checking how much memory is used by the DataFrame, and then shrinking down to just the columns I’m interested in:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv")
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 14 entries, Region to Total Profit
dtypes: float64(5), int64(2), object(7)
memory usage: 489.9 MB


In [None]:
df.head()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,South Africa,Fruits,Offline,M,7/27/2012,443368995,7/28/2012,1593,9.33,6.92,14862.69,11023.56,3839.13
1,Middle East and North Africa,Morocco,Clothes,Online,M,9/14/2013,667593514,10/19/2013,4611,109.28,35.84,503890.08,165258.24,338631.84
2,Australia and Oceania,Papua New Guinea,Meat,Offline,M,5/15/2015,940995585,6/4/2015,360,421.89,364.69,151880.4,131288.4,20592.0
3,Sub-Saharan Africa,Djibouti,Clothes,Offline,H,5/17/2017,880811536,7/2/2017,562,109.28,35.84,61415.36,20142.08,41273.28
4,Europe,Slovakia,Beverages,Offline,L,10/26/2016,174590194,12/4/2016,3973,47.45,31.79,188518.85,126301.67,62217.18


In [None]:
df = df[["Region", "Country"]]
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 2 entries, Region to Country
dtypes: object(2)
memory usage: 132.3 MB

Selecting columns after loading still reads the whole file into memory first. To avoid that, tell `read_csv` which columns to load with the `usecols` argument:


In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv", usecols=["Region", "Country"])
df.info(verbose=False, memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Columns: 2 entries, Region to Country
dtypes: object(2)
memory usage: 132.3 MB
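
The comparison above can be reproduced without the Drive file. The sketch below is a minimal illustration that assumes a small, made-up in-memory CSV (`csv_text` is hypothetical, not the sales dataset) and compares a full load against a `usecols` load:

```python
import io

import pandas as pd

# Hypothetical stand-in for the sales CSV: three columns, 1000 identical rows.
csv_text = "Region,Country,Units Sold\n" + "Europe,Slovakia,3973\n" * 1000

# Load every column, then load only the two columns of interest.
full = pd.read_csv(io.StringIO(csv_text))
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Region", "Country"])

# Total bytes used by the column data (index excluded).
full_bytes = full.memory_usage(index=False, deep=True).sum()
subset_bytes = subset.memory_usage(index=False, deep=True).sum()
print(list(subset.columns), full_bytes, subset_bytes)
```

With `usecols`, the unused column is never materialized, so the saving applies while loading and not only afterwards.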


## Technique 2: Shrink numerical columns with smaller dtypes

Another technique can help reduce the memory used by columns that contain only numbers.

Each column in a Pandas DataFrame is a particular data type (dtype). For example, for integers there is the int64 dtype, int32, int16, and more.

Why does the dtype matter? First, because it affects what values you can store in that column:

* int8 can store integers from -128 to 127.
* int16 can store integers from -32768 to 32767.
* int64 can store integers from -9223372036854775808 to 9223372036854775807.

Second, the larger the range, the more memory is used. For example, int64 uses 4× as much memory as int16, and 8× as much as int8.
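
The ranges and sizes listed above can be checked directly with NumPy, which supplies the integer dtypes Pandas uses by default. This is a small sketch; the `sizes` dictionary is just an illustrative helper:

```python
import numpy as np

sizes = {}
for name in ("int8", "int16", "int32", "int64"):
    dtype = np.dtype(name)
    info = np.iinfo(dtype)          # smallest and largest storable values
    sizes[name] = dtype.itemsize    # bytes used per value
    print(name, info.min, info.max, dtype.itemsize, "byte(s) per value")
```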

By default, when Pandas loads a CSV it guesses the dtypes. If it decides a column's values are all integers, it assigns that column the int64 dtype.

In [None]:
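df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv")
# Unit Price is loaded as float64, which uses 8 bytes per value
df["Unit Price"].memory_usage(index=False, deep=True)
df["Unit Price"].max()
# A cell displays only its last expression, so the output below is the minimum
df["Unit Price"].min()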

9.33

In [None]:
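# Unit Price holds decimals such as 9.33, so an integer dtype cannot store it.
# float32 uses 4 bytes per value instead of 8, at the cost of some precision.
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv", dtype={"Unit Price": "float32"})
df["Unit Price"].memory_usage(index=False, deep=True)

4000000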
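
When the value range of an integer column is not known in advance, one option is to let Pandas choose the smallest integer dtype that fits, using `pd.to_numeric` with `downcast="integer"`. The sketch below assumes a synthetic column (the `units` series is made up for illustration and is not read from the sales file):

```python
import numpy as np
import pandas as pd

# Hypothetical integer column with values 1..10000, stored as int64.
units = pd.Series(np.arange(1, 10001, dtype="int64"), name="Units Sold")
before = units.memory_usage(index=False, deep=True)

# Downcast to the smallest signed integer dtype that can hold every value.
small = pd.to_numeric(units, downcast="integer")
after = small.memory_usage(index=False, deep=True)

print(small.dtype, before, after)
```

Because the downcast is based on the values currently in the column, it is worth re-checking the dtype if larger values may be added later.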

## Technique 3: Shrink categorical data using Categorical dtypes

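What about non-numerical data? When a string column contains only a small number of distinct values, you can shrink it as well: the category dtype stores each distinct value once and represents every row as a small integer code.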

In [None]:
set(df["Item Type"])

{'Baby Food',
 'Beverages',
 'Cereal',
 'Clothes',
 'Cosmetics',
 'Fruits',
 'Household',
 'Meat',
 'Office Supplies',
 'Personal Care',
 'Snacks',
 'Vegetables'}

In [None]:
df["Item Type"].memory_usage(index=False, deep=True)

65583558

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv", dtype={"Item Type": "category"})
df["Item Type"].head()

0       Fruits
1      Clothes
2         Meat
3      Clothes
4    Beverages
Name: Item Type, dtype: category
Categories (12, object): ['Baby Food', 'Beverages', 'Cereal', 'Clothes', ..., 'Office Supplies',
                          'Personal Care', 'Snacks', 'Vegetables']

In [None]:
df["Item Type"].memory_usage(index=False, deep=True)

1001087
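
The same effect can be seen on a column that is already loaded by converting it with `astype("category")`. This is a minimal sketch on made-up data (the `items` series is hypothetical), and it also shows the integer codes that replace the strings:

```python
import pandas as pd

# Hypothetical column: 100,000 rows but only four distinct strings.
items = pd.Series(["Fruits", "Clothes", "Meat", "Beverages"] * 25_000, name="Item Type")
as_strings = items.memory_usage(index=False, deep=True)

# Each distinct string is stored once; rows hold small integer codes.
categorical = items.astype("category")
as_category = categorical.memory_usage(index=False, deep=True)

print(categorical.cat.codes.dtype, as_strings, as_category)
```

The saving shrinks as the number of distinct values grows, so this works best for columns with few unique values relative to their length.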

## Technique 4: Sparse series

If you have a column with lots of empty values, usually represented as NaNs, you can save memory by using a sparse column representation. It won’t waste memory storing all those empty values.

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv")
series = df["Ship Date"]
series.memory_usage(index=False, deep=True)

65936993

In [None]:
df.isna()

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,False,False,False,False,False,False,False,False,False,False,False,False,False,False
999996,False,False,False,False,False,False,False,False,False,False,False,False,False,False
999997,False,False,False,False,False,False,False,False,False,False,False,False,False,False
999998,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [None]:
len(series)

1000000

In [None]:
len(series.dropna())

1000000

In [None]:
sparse_series = series.astype("Sparse[str]")

In [None]:
len(sparse_series)

1000000

In [None]:
sparse_series.memory_usage(index=False, deep=True)

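KeyboardInterrupt: ignored

The cell was interrupted before it finished, so no memory figure is available for this column. Note that `Ship Date` has no missing values (`len(series.dropna())` equals `len(series)`), so a sparse representation has nothing to skip here; the technique pays off on columns that are mostly NaN.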
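
To see the sparse representation save memory, the column needs to be mostly empty. The sketch below assumes a synthetic float column in which only 1% of the values are present (the `values` array is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100,000 rows, all NaN except every 100th value.
values = np.full(100_000, np.nan)
values[::100] = 1.0
dense = pd.Series(values)

# Sparse storage keeps only the non-NaN values and their positions.
sparse = dense.astype("Sparse[float64]")

dense_bytes = dense.memory_usage(index=False, deep=True)
sparse_bytes = sparse.memory_usage(index=False, deep=True)
print(dense_bytes, sparse_bytes, sparse.sparse.density)
```

The `density` value is the fraction of entries actually stored; the closer it is to 1, the less a sparse column helps.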