<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/Strategies_deal_with_large_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Strategies for dealing with large datasets in Pandas

[Dataset: 1000000 Sales Records](https://github.com/drshahizan/dataset/raw/main/1000000%20Sales%20Records.rar).


**Instructions**:
* Extract the file *1000000 Sales Records.rar*.
* Enter the location of the extracted file, *1000000 Sales Records.csv*, in the code cells that read it.

As datasets keep growing, using Pandas efficiently becomes increasingly tricky. In this post I share four strategies for dealing with large datasets in Pandas.
Every data scientist knows that data pre-processing and feature engineering are paramount for a successful data science project. Often, however, these steps are time-consuming and leave you waiting for computations to finish, keeping you from creating that awesome model. In this post we will look at a few tricks that speed up your Pandas data-crunching workflows by letting Pandas use your machine more efficiently.

Pandas is a powerful, versatile and easy-to-use Python library for manipulating data structures. For many data scientists like me, it has become the go-to tool for exploring and pre-processing data, as well as for engineering the best predictive features. Even though Pandas is still rapidly improving, we see Pandas users switching to alternative tools like Spark once datasets become too large to fit in RAM. It is unfortunate to have to learn and use a different tool only because you have too much data. Therefore, I looked into four strategies for handling datasets that are too large, all without leaving the comfort of Pandas:

## Sampling
The simplest option is to sample your dataset. This approach can be especially powerful during the exploration phase: what does the data look like? What features can I create? In other words, what works and what does not? Often a random sample of 10% of such a large dataset will already contain a lot of information. That raises the first question: do you actually need to process your entire dataset to train an adequate model?

In [None]:
!pip install -U feature-engine

Installing collected packages: feature-engine
Successfully installed feature-engine-1.5.2


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np
import random
from sklearn.linear_model import LogisticRegression


df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv")
df

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,South Africa,Fruits,Offline,M,7/27/2012,443368995,7/28/2012,1593,9.33,6.92,14862.69,11023.56,3839.13
1,Middle East and North Africa,Morocco,Clothes,Online,M,9/14/2013,667593514,10/19/2013,4611,109.28,35.84,503890.08,165258.24,338631.84
2,Australia and Oceania,Papua New Guinea,Meat,Offline,M,5/15/2015,940995585,6/4/2015,360,421.89,364.69,151880.40,131288.40,20592.00
3,Sub-Saharan Africa,Djibouti,Clothes,Offline,H,5/17/2017,880811536,7/2/2017,562,109.28,35.84,61415.36,20142.08,41273.28
4,Europe,Slovakia,Beverages,Offline,L,10/26/2016,174590194,12/4/2016,3973,47.45,31.79,188518.85,126301.67,62217.18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999995,Sub-Saharan Africa,Senegal,Baby Food,Offline,L,11/6/2010,575470578,12/11/2010,3387,255.28,159.42,864633.36,539955.54,324677.82
999996,Central America and the Caribbean,Panama,Office Supplies,Offline,C,1/12/2015,766942107,3/1/2015,4068,651.21,524.96,2649122.28,2135537.28,513585.00
999997,Europe,Norway,Office Supplies,Online,M,10/25/2011,685472047,12/5/2011,5266,651.21,524.96,3429271.86,2764439.36,664832.50
999998,Europe,Montenegro,Beverages,Offline,M,10/31/2010,946734225,12/8/2010,8551,47.45,31.79,405744.95,271836.29,133908.66


In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv"
with open(filename) as f:
    n = sum(1 for _ in f) - 1  # Number of data rows in the file (header excluded)
s = n // 10  # Sample size of 10%
# Row numbers to skip; data rows are numbered 1..n because line 0 is the header
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pd.read_csv(filename, skiprows=skip)
df

Unnamed: 0,Region,Country,Item Type,Sales Channel,Order Priority,Order Date,Order ID,Ship Date,Units Sold,Unit Price,Unit Cost,Total Revenue,Total Cost,Total Profit
0,Sub-Saharan Africa,Tanzania,Cosmetics,Offline,L,5/23/2016,739008080,5/24/2016,7768,437.20,263.33,3396169.60,2045547.44,1350622.16
1,Asia,Taiwan,Fruits,Offline,M,2/9/2014,732588374,2/23/2014,8034,9.33,6.92,74957.22,55595.28,19361.94
2,Central America and the Caribbean,The Bahamas,Personal Care,Online,C,1/19/2011,246248090,2/21/2011,9137,81.73,56.67,746767.01,517793.79,228973.22
3,Middle East and North Africa,Oman,Cosmetics,Online,H,11/29/2010,358570849,12/28/2010,7937,437.20,263.33,3470056.40,2090050.21,1380006.19
4,Europe,Switzerland,Office Supplies,Offline,C,7/29/2014,830410039,8/27/2014,5639,651.21,524.96,3672173.19,2960249.44,711923.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,Asia,Tajikistan,Household,Offline,L,9/30/2011,509657323,10/3/2011,7782,668.27,502.54,5200477.14,3910766.28,1289710.86
99996,Europe,Austria,Baby Food,Online,M,10/28/2015,476508653,11/20/2015,8354,255.28,159.42,2132609.12,1331794.68,800814.44
99997,Central America and the Caribbean,Panama,Office Supplies,Offline,C,1/12/2015,766942107,3/1/2015,4068,651.21,524.96,2649122.28,2135537.28,513585.00
99998,Europe,Montenegro,Beverages,Offline,M,10/31/2010,946734225,12/8/2010,8551,47.45,31.79,405744.95,271836.29,133908.66
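
If counting the lines of the file first is too slow, a similar sample can be drawn in a single pass by passing a function to `skiprows`. The sketch below is a minimal illustration on a small in-memory CSV; the column values are made up for the example, and the resulting sample size is only approximately 10% because each row is kept or dropped independently.

```python
import io
import random

import pandas as pd

# Small in-memory stand-in for the sales file (hypothetical data)
csv_text = "Order ID,Units Sold\n" + "\n".join(
    f"{i},{i % 50}" for i in range(1, 1001)
)

random.seed(42)  # fixed seed so the sample is reproducible
keep_fraction = 0.1

sample = pd.read_csv(
    io.StringIO(csv_text),
    # Row 0 is the header and is always kept;
    # every data row is skipped with probability 1 - keep_fraction
    skiprows=lambda i: i > 0 and random.random() > keep_fraction,
)
print(len(sample), "of 1000 rows kept")
```

With a real file, `io.StringIO(csv_text)` would simply be replaced by the file path.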


## Chunking
If you do need to process all of the data, you can split it into a number of chunks (each of which does fit in memory) and perform your data cleaning and feature engineering on each chunk individually. Moreover, depending on the type of model you want to use, you have two options:

* If the model of your choosing allows for partial fitting, you can incrementally train a model on the data of each chunk;
* Train a model on each individual chunk. Subsequently, to score new unseen data, make a prediction with each model and take the average or majority vote as the final prediction.
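
The first option can be sketched with a scikit-learn estimator that supports `partial_fit`. This is a minimal illustration under stated assumptions, not the setup used in this notebook: the columns `x1`, `x2` and `label` are synthetic, the CSV lives in memory, and `SGDClassifier` is just one of the estimators that allow incremental training.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a large CSV file (hypothetical columns x1, x2, label)
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
demo = pd.DataFrame({"x1": x[:, 0], "x2": x[:, 1]})
demo["label"] = (demo["x1"] + demo["x2"] > 0).astype(int)
csv_file = io.StringIO(demo.to_csv(index=False))

features = ["x1", "x2"]
model = SGDClassifier(random_state=0)

for chunk in pd.read_csv(csv_file, chunksize=100):
    # classes lists every possible label, because a single chunk
    # is not guaranteed to contain all of them
    model.partial_fit(chunk[features], chunk["label"], classes=np.array([0, 1]))

accuracy = model.score(demo[features], demo["label"])
print("Accuracy after one pass over all chunks:", accuracy)
```

Only one chunk is held in memory at a time, and the same model object keeps learning from each new chunk.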

In [None]:
datafile = "/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv"

# Sketch of the "one model per chunk" option. It stays commented out because
# pre_process_and_feature_engineer, features and the 'label' column are
# placeholders that are not defined for this dataset.
# chunksize = 100000
# models = []
# for chunk in pd.read_csv(datafile, chunksize=chunksize):
#     # A function to clean my data and create my features
#     chunk = pre_process_and_feature_engineer(chunk)
#     model = LogisticRegression()
#     model.fit(chunk[features], chunk['label'])
#     models.append(model)
# df = pd.read_csv(datafile)
# df = pre_process_and_feature_engineer(df)
# predictions = np.mean([model.predict(df[features]) for model in models], axis=0)

# Count the rows per region chunk by chunk, so the full file is never in memory at once
result = None
for chunk in pd.read_csv(datafile, chunksize=1000):
    chunk_result = chunk["Region"].value_counts()
    if result is None:
        result = chunk_result
    else:
        # fill_value=0 handles regions that are missing from one of the two sides
        result = result.add(chunk_result, fill_value=0)

result = result.sort_values(ascending=False)
print(result)

Sub-Saharan Africa                   259953
Europe                               259036
Asia                                 146017
Middle East and North Africa         124344
Central America and the Caribbean    108042
Australia and Oceania                 80837
North America                         21771
Name: Region, dtype: int64


In [None]:
def get_counts(chunk):
    # Count how many rows fall into each region
    return chunk["Region"].value_counts()

# For comparison: the same counts computed on the whole file in one go
result = get_counts(pd.read_csv("/content/drive/MyDrive/Colab Notebooks/1000000 Sales Records.csv"))
result

Sub-Saharan Africa                   259953
Europe                               259036
Asia                                 146017
Middle East and North Africa         124344
Central America and the Caribbean    108042
Australia and Oceania                 80837
North America                         21771
Name: Region, dtype: int64

## Optimise data types
When loading data from a file, Pandas automatically infers the datatypes. This is very convenient, but the inferred datatypes are often not optimal and take up more memory than needed. We will go over the three most common datatypes used by Pandas (int, float and object) and show how to decrease their memory footprint while looking at an example.

By default, Pandas stores integers as int64. This datatype takes 8 bytes and can represent humongous integers, from -9223372036854775808 to 9223372036854775807. Often, however, integers represent countable entities, like the number of cars or visitors per day. Such numbers can easily be represented in the four times smaller dtype int16. If your data fits in the range -32768 to 32767, convert it to int16 to achieve a memory reduction of 75%! If your data is non-negative and at most 65535, go for the unsigned variant, uint16.
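
The ranges quoted above can be checked with `np.iinfo`, and `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer type that fits the data. The snippet below is a small sketch; the `visitors` series is made-up example data.

```python
import numpy as np
import pandas as pd

# Print the representable range of a few integer dtypes
for dtype in (np.int8, np.int16, np.int32, np.int64, np.uint16):
    info = np.iinfo(dtype)
    print(f"{dtype.__name__:>6}: {info.min} to {info.max}")

# Hypothetical daily visitor counts, stored as int64 by default
visitors = pd.Series([120, 3500, 28000], dtype="int64")

# Let Pandas choose the smallest signed integer dtype that holds all values
small = pd.to_numeric(visitors, downcast="integer")

print(visitors.dtype, visitors.nbytes, "bytes")
print(small.dtype, small.nbytes, "bytes")
```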

In the same way, the float family consists of float16, float32 and float64, where the latter is Pandas' default. Float64 can represent both very small and very large numbers with high precision, which makes it suitable for accurate calculations. Often, however, you find yourself working with data that is already noisy, like sensor data, or that has inherently limited precision, such as currency. In many such use cases the smaller datatypes float32 or float16 will do, reducing the memory footprint by 50% and 75% respectively. Keep in mind that float16 holds only about three significant decimal digits and cannot represent values above 65504, so check that the rounding is acceptable before downcasting.
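
Smaller float types trade precision for memory, so it is worth checking how much a typical value changes before downcasting. The sketch below stores one unit price from the sales data (437.20) in each float type and reports the rounding error; the variable names are illustrative only.

```python
import numpy as np

price = 437.20  # a unit price that appears in the sales data

stored = {}
for dtype in (np.float16, np.float32, np.float64):
    # Round-trip the value through the dtype and convert back to a Python float
    stored[dtype.__name__] = float(np.array([price], dtype=dtype)[0])

for name, value in stored.items():
    print(f"{name}: stored as {value!r}, error {abs(value - price):.6f}")

# Largest finite value that float16 can hold
print("float16 max:", float(np.finfo(np.float16).max))
```

In float16 the price is stored as 437.25, because neighbouring float16 values in this range are 0.25 apart. For currency columns, float32 is therefore usually the safer choice.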

Another way to drastically reduce the size of your Pandas DataFrame is to convert columns of dtype object to category. Rather than keeping copies of the same string at many positions in your dataframe, Pandas stores each distinct string once and uses small integer codes under the hood that refer to these strings. Note, however, that if every row has a different string, this approach will not save any memory.
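
The effect of the category dtype can be seen on a small example. The sketch below uses a made-up column with three distinct region names repeated many times, and compares memory with `deep=True` so that the strings themselves are counted.

```python
import pandas as pd

# Hypothetical column: three distinct strings repeated 10,000 times each
regions = pd.Series(["Europe", "Asia", "North America"] * 10000)
as_category = regions.astype("category")

# deep=True includes the memory taken by the strings themselves
print("as strings :", regions.memory_usage(deep=True), "bytes")
print("as category:", as_category.memory_usage(deep=True), "bytes")

# Each distinct string is stored once; the rows hold small integer codes
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
```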

In the cells below, I demonstrate a reduction of the dataframe's memory footprint by 69.8%, only by changing the datatypes.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Region          100000 non-null  object 
 1   Country         100000 non-null  object 
 2   Item Type       100000 non-null  object 
 3   Sales Channel   100000 non-null  object 
 4   Order Priority  100000 non-null  object 
 5   Order Date      100000 non-null  object 
 6   Order ID        100000 non-null  int64  
 7   Ship Date       100000 non-null  object 
 8   Units Sold      100000 non-null  int64  
 9   Unit Price      100000 non-null  float64
 10  Unit Cost       100000 non-null  float64
 11  Total Revenue   100000 non-null  float64
 12  Total Cost      100000 non-null  float64
 13  Total Profit    100000 non-null  float64
dtypes: float64(5), int64(2), object(7)
memory usage: 10.7+ MB


In [None]:
df['Region'].memory_usage()/(1024*1024)

0.7630615234375

In [None]:
def reduce_mem_usage(df):
    # Note: the columns are converted in place, so the dataframe passed in is modified
    start_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage of dataframe is {:.1f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest signed integer type that can hold every value;
                # columns that need the full range stay int64
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                # Caution: float16 and float32 keep fewer significant digits,
                # so the stored values may be rounded
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Repeated strings are stored once and referenced by integer codes
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2  # bytes -> MB
    print('Memory usage after optimization is: {:.1f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [None]:
df_new = reduce_mem_usage(df)

Memory usage of dataframe is 10.7 MB
Memory usage after optimization is: 3.2 MB
Decreased by 69.8%


In [None]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype   
---  ------          --------------   -----   
 0   Region          100000 non-null  category
 1   Country         100000 non-null  category
 2   Item Type       100000 non-null  category
 3   Sales Channel   100000 non-null  category
 4   Order Priority  100000 non-null  category
 5   Order Date      100000 non-null  category
 6   Order ID        100000 non-null  int32   
 7   Ship Date       100000 non-null  category
 8   Units Sold      100000 non-null  int16   
 9   Unit Price      100000 non-null  float16 
 10  Unit Cost       100000 non-null  float16 
 11  Total Revenue   100000 non-null  float32 
 12  Total Cost      100000 non-null  float32 
 13  Total Profit    100000 non-null  float32 
dtypes: category(7), float16(2), float32(3), int16(1), int32(1)
memory usage: 3.2 MB
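
Once suitable datatypes are known, they can also be passed to `pd.read_csv` through its `dtype` parameter, so the data never has to be loaded at full size first. The sketch below is a minimal illustration on a three-row in-memory CSV; the chosen dtypes are assumptions for this example and should be checked against the value ranges of the real file.

```python
import io

import pandas as pd

# Tiny in-memory excerpt with three of the sales columns
csv_text = (
    "Region,Units Sold,Total Profit\n"
    "Sub-Saharan Africa,1593,3839.13\n"
    "Europe,3973,62217.18\n"
    "Europe,8551,133908.66\n"
)

# Datatypes chosen up front instead of relying on inference (assumed to fit the data)
dtypes = {"Region": "category", "Units Sold": "int16", "Total Profit": "float32"}

small_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small_df.dtypes)
```

The same `dtype` dictionary can be combined with `chunksize` when reading in chunks.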
