<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/Lab_2_3_technique_handle_large_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to handle large datasets in Python with Pandas

**Dataset**: [NYC Yellow Taxi Trip Data](https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data)

The New York City (NYC) Taxi & Limousine Commission (TLC) keeps data from all its cabs and makes it freely available for download from its official website. The TLC primarily keeps and manages data for the following types of vehicles:

1. Yellow Taxi: Yellow Medallion Taxicabs: These are the famous NYC yellow taxis that provide transportation exclusively through street hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.
2. Green Taxi: Street Hail Livery: The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.
3. For-Hire Vehicles (FHVs): FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

**Important Points**

* This dataset covers only the Yellow Taxi data, for January 2015 and January to March 2016.
* If you download any of the CSV files from the NYC TLC website, you will find that they use a different format. This is because the TLC regularly adds more data and updates the existing files.
* One of the key changes is that, instead of providing pickup and dropoff coordinates, the TLC has divided NYC into regions, indexed those regions, and now provides the region indices in the CSV files.
* For this reason, I have built this dataset from the previous version of the CSV files. It allows me to practice my clustering knowledge alongside my time-series knowledge.
* If you want to leave out the clustering part, go to the TLC website and download the new CSV files.

# Techniques to handle large datasets

We will be using the NYC Yellow Taxi Trip Data for January 2016 (`yellow_tripdata_2016-01.csv`). Once loaded, the DataFrame occupies around 1.5 GB of memory, which is large enough to demonstrate the techniques below.

In [None]:
import pandas as pd
df = pd.read_csv('yellow_tripdata_2016-01.csv')
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RatecodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
dtype: object

In [None]:
df.memory_usage().sum()/(1024*1024*1024)

1.5439861416816711

In [None]:
df['store_and_fwd_flag'].memory_usage()/(1024*1024)

83.21279907226562
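
By default, `memory_usage()` counts only the 8-byte references held by an object column, not the Python strings they point to, so the figure above is a lower bound. A minimal sketch of the difference, using a small made-up Series (not the taxi file) and forcing `dtype=object` so the behaviour matches the column above:

```python
import pandas as pd

# Illustrative values only; the real column has millions of rows.
flags = pd.Series(["N", "Y", "N", "N"] * 1000, dtype=object)

shallow = flags.memory_usage(index=False)             # references only
deep = flags.memory_usage(index=False, deep=True)     # references + string objects

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

Passing `deep=True` is slower on large frames because pandas has to inspect every object, but it gives a more realistic picture for string columns.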

## 1. Use efficient data types
When you load a dataset into a pandas DataFrame, the default data types assigned to each column are not memory efficient.

If we convert these data types into memory-efficient ones, we can save a lot of memory. For example, int64 can be downcast to int8, int16, or int32, depending on the minimum and maximum values the column holds. The code below takes care of this downcasting for all numeric columns and converts object columns to the category type.

In [None]:
import pandas as pd
import numpy as np

def reduce_mem_usage(df):
    """Downcast numeric columns and convert object columns to category.

    Note: columns are modified in place, so the returned DataFrame is the
    same object as the one passed in.
    """
    # memory_usage() returns bytes; dividing by 1024**3 gives GB.
    start_mem = df.memory_usage().sum() / 1024**3
    print('Memory usage of dataframe is {:.2f} GB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the smallest integer type whose range covers the column.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Floats are chosen by range only. A narrower float also keeps
                # fewer significant digits (float16 keeps about 3), so values
                # such as coordinates may be rounded.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # Object (string) columns become categoricals.
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**3
    print('Memory usage after optimization is: {:.2f} GB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

As the result below shows, downcasting the column data types shrinks the DataFrame drastically, by about 60%. The `dtypes` listing further down shows that the columns were changed to int8, float16, or float32.

In [None]:
df_new = reduce_mem_usage(df)

Memory usage of dataframe is 1.54 GB
Memory usage after optimization is: 0.63 GB
Decreased by 59.3%


What actually happened here? Take the column `passenger_count` as an example. It holds the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, yet its default data type was int64. Do you need 64 bits to store these 10 values? No. 8 bits (1 byte) is enough, so the column is downcast to int8. The same logic applies to the other numeric data types.
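
The same idea can be tried on a single column with `pd.to_numeric` and its `downcast` argument. This is a small sketch on a made-up Series holding the ten values mentioned above, not on the real column:

```python
import numpy as np
import pandas as pd

# Ten values, stored with the default 64-bit integer type.
passenger_count = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype="int64")

# int8 covers -128..127, which is more than enough for 0..9.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)

# downcast="integer" picks the smallest signed integer type that fits.
small = pd.to_numeric(passenger_count, downcast="integer")

before = passenger_count.memory_usage(index=False)  # 10 values * 8 bytes
after = small.memory_usage(index=False)             # 10 values * 1 byte
print(small.dtype, before, after)
```

Each value drops from 8 bytes to 1 byte, which is where the savings on the integer columns come from.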

What about the “object” data type? Converting an object column to category can also save a lot of memory. For the given dataset, store_and_fwd_flag was converted to the category type.
As the output below shows, the size of the column dropped from about 83 MB to about 10 MB.
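
To see why the category type is so compact, here is a minimal sketch on a made-up two-value Series (the values mimic a Y/N flag; the sizes are not those of the real column). A categorical stores each distinct value once and keeps a small integer code per row:

```python
import pandas as pd

# 100,000 rows but only two distinct values.
flags = pd.Series(["N", "Y"] * 50_000, dtype=object)
as_category = flags.astype("category")

object_bytes = flags.memory_usage(index=False)           # one 8-byte reference per row
category_bytes = as_category.memory_usage(index=False)   # one small code per row + the categories

print(object_bytes, category_bytes)
print(list(as_category.cat.categories))   # the distinct values, stored once
print(as_category.cat.codes.dtype)        # integer codes, one per row
```

The saving depends on the number of distinct values: a column where almost every row is unique, such as a timestamp stored as text, gains little or nothing from this conversion.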

In [None]:
df_new.dtypes

VendorID                     int8
tpep_pickup_datetime     category
tpep_dropoff_datetime    category
passenger_count              int8
trip_distance             float32
pickup_longitude          float16
pickup_latitude           float16
RatecodeID                   int8
store_and_fwd_flag       category
dropoff_longitude         float16
dropoff_latitude          float16
payment_type                 int8
fare_amount               float32
extra                     float16
mta_tax                   float16
tip_amount                float16
tolls_amount              float16
improvement_surcharge     float16
total_amount              float32
dtype: object

In [None]:
df['store_and_fwd_flag'].memory_usage()/(1024*1024)

In [None]:
df_new['store_and_fwd_flag'].memory_usage()/(1024*1024)

10.401758193969727

## 2. Remove unwanted columns
Sometimes you don’t need all the columns/features for your analysis. In such situations, you don’t have to load the whole dataset into a pandas DataFrame and then delete the columns you don’t want.

Instead, you can exclude those columns while loading the data. Combined with efficient data types, this method can reduce the size of the DataFrame significantly.
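
A minimal sketch of this, using the `usecols` and `dtype` parameters of `read_csv`. The inline CSV text, the choice of columns, and the chosen dtypes are assumptions made for illustration; with the real file you would pass the file path instead of the `StringIO` object:

```python
import io
import pandas as pd

# Tiny stand-in for the taxi CSV (made-up rows, real column names).
csv_text = """VendorID,passenger_count,trip_distance,fare_amount,store_and_fwd_flag
1,2,1.10,7.5,N
2,1,4.90,18.0,N
1,5,0.54,5.0,Y
"""

# Only these columns are parsed; the rest are never loaded.
wanted = ["passenger_count", "trip_distance", "fare_amount"]

# Compact types are assigned while reading, so no int64/float64 copy is made.
dtypes = {"passenger_count": "int8", "trip_distance": "float32", "fare_amount": "float32"}

df_small = pd.read_csv(io.StringIO(csv_text), usecols=wanted, dtype=dtypes)
print(df_small.dtypes)
```

Choosing the dtypes up front requires knowing the value ranges in advance, for example from a first pass over a sample of the file.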

## 3. Chunking
Did you know that pandas readers such as read_csv and read_table accept a ***chunksize*** parameter that can be used to read larger-than-memory datasets? (read_excel does not support it.)

When you pass the ***chunksize*** parameter, read_csv returns an iterable object of the type TextFileReader instead of a DataFrame. As with any other iterable, you can iterate over this object until the data is exhausted.

In the example below, we use a chunksize of 100,000. This means that pandas reads 100,000 rows at a time and yields each batch as a DataFrame, bound here to the loop variable reader. You can perform any operation on this object. Once the processing of one batch is done, pandas reads the next 100,000 rows, and the process continues until all the rows have been processed.

In [None]:
fare_amount_max = []

for reader in pd.read_csv('yellow_tripdata_2016-01.csv', chunksize=100000):
    # do any processing on reader
    fare_amount_max.append(reader['fare_amount'].max())

In [None]:
max(fare_amount_max)

111270.85

* Note that this method of using chunksize is useful only when the operation you are performing doesn’t require coordination between the chunks. 
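
Some aggregates can still be computed chunk by chunk if you carry a small amount of state between chunks. The sketch below computes a mean from a running sum and a running count; the inline CSV values and the tiny chunk size are made up for illustration:

```python
import io
import pandas as pd

# Made-up fares standing in for the real file.
csv_text = "fare_amount\n" + "\n".join(str(v) for v in [5.0, 7.5, 10.0, 12.5, 20.0])

total = 0.0
count = 0

# Each chunk is a DataFrame with at most 2 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["fare_amount"].sum()
    count += chunk["fare_amount"].count()

# Averaging the per-chunk means would be wrong when chunks differ in size;
# combining sums and counts gives the exact overall mean.
mean_fare = total / count
print(mean_fare)
```

Operations such as a median or a global sort do not decompose this way and generally need all the data at once or a different algorithm.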

