<a href="https://colab.research.google.com/github/drshahizan/Python_Tutorial/blob/main/big%20data/Lab_4_NYC_Large_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Strategies to Deal With Large Datasets Using Pandas
[Source: Guido Tournois](https://www.codementor.io/@guidotournois/4-strategies-to-deal-with-large-datasets-using-pandas-qdw3an95k)

In [None]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
import numpy as np
import pandas as pd
import random
from sys import getsizeof

[Dataset: Yellow Tripdata 2015](https://data.cityofnewyork.us/dataset/Yellow-Tripdata-2015-January-June/2yzn-sicd) or [Kaggle](https://www.kaggle.com/datasets/elemento/nyc-yellow-taxi-trip-data)

In [None]:
df = pd.read_csv('data/yellow_tripdata_2015-01.csv')

To demonstrate the benefits of pandas' `category` dtype, let's add a random pickup neighbourhood to each row.

In [None]:
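# Read one neighbourhood name per line; the context manager closes the file
with open('nyc_neighbourhoods.txt') as f:
    nyc_neighbourhoods = [line.rstrip() for line in f]

# Draw one random neighbourhood per row in a single call
df['pickup_neighbourhood'] = random.choices(nyc_neighbourhoods, k=len(df))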

In [None]:
start_size = getsizeof(df)/(1024.0**3)
print('Dataframe size: %2.2f GB'%start_size)

Dataframe size: 4.93 GB
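
To see which columns account for the total, `DataFrame.memory_usage(deep=True)` gives a per-column breakdown. The sketch below uses a tiny made-up frame (the name `sample` and its values are assumptions, not part of the taxi data) to show that object columns usually dominate, because every string is a separate Python object.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the taxi data
sample = pd.DataFrame({
    'VendorID': np.array([1, 2, 2, 1], dtype='int64'),
    'trip_distance': np.array([1.2, 0.8, 3.4, 2.0], dtype='float64'),
    'store_and_fwd_flag': pd.Series(['N', 'N', 'Y', 'N'], dtype=object),
})

# deep=True also counts the Python string objects behind object columns;
# index=False leaves the index out of the breakdown
per_column = sample.memory_usage(deep=True, index=False)
print(per_column.sort_values(ascending=False))
```

Running the same call on `df` is a quick way to decide which columns are worth converting first.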


In [None]:
df.dtypes

VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RateCodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
pickup_neighbourhood      object
dtype: object

## Integers

In [None]:
# VendorID is either 1 or 2, so a boolean suffices (True means vendor 2)
df.VendorID = df.VendorID == 2

# passenger_count, RateCodeID and payment_type hold small non-negative
# integers (0 <= x <= 255), so uint8 suffices
df.passenger_count = df.passenger_count.astype('uint8')
df.RateCodeID = df.RateCodeID.astype('uint8')
df.payment_type = df.payment_type.astype('uint8')
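
Casting to a narrower integer type is only safe when every value fits, since out-of-range integers can wrap silently. A minimal sketch of a range check before the cast, assuming made-up sample values and a hypothetical helper named `fits`:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real columns hold millions of rows
sample = pd.DataFrame({
    'passenger_count': [1, 2, 6, 0],
    'RateCodeID': [1, 2, 99, 5],
    'payment_type': [1, 2, 3, 4],
}, dtype='int64')

def fits(series, dtype):
    """Return True if every value in series is representable in dtype."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

for name in sample.columns:
    assert fits(sample[name], 'uint8'), name
    sample[name] = sample[name].astype('uint8')

# pd.to_numeric can also pick the smallest unsigned type automatically
auto = pd.to_numeric(pd.Series([0, 300, 70000]), downcast='unsigned')
print(sample.dtypes)
print(auto.dtype)
```

The explicit check fails loudly on unexpected values, while `downcast='unsigned'` simply chooses a wider type when one is needed.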

### Convert Dollars to cents

In [None]:
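monetary_columns = ['fare_amount','tip_amount',
                    'total_amount','tolls_amount','extra']
# Caution: uint8 only holds 0..255 cents, i.e. amounts up to $2.55. Larger or
# negative amounts do not survive this cast; use a wider integer type
# (e.g. int32) if the values must be preserved.
df[monetary_columns] = \
    df[monetary_columns].apply(lambda col: (col*100).astype('uint8'))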
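
Storing cents as integers avoids float rounding, but the integer type has to cover the largest amount. The sketch below uses three made-up fares (an assumption, not values from the dataset) to show how integers above 255 wrap when forced into `uint8`, and how `pd.to_numeric` can choose a type that fits.

```python
import numpy as np
import pandas as pd

fares = pd.Series([2.5, 9.99, 52.0])  # hypothetical dollar amounts

# Round before converting so 9.99 * 100 becomes exactly 999
cents = (fares * 100).round().astype('int64')

# Forcing integers into uint8 wraps modulo 256: [250, 231, 80]
wrapped = cents.to_numpy().astype('uint8')

# Let pandas pick the smallest signed integer type that holds every value
safe = pd.to_numeric(cents, downcast='integer')

print(wrapped.tolist())
print(safe.tolist(), safe.dtype)
```

A signed type also leaves room for negative amounts such as refunds, which an unsigned type cannot represent.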

In [None]:
print('Dataframe size: %2.2f GB'%(getsizeof(df)/(1024.0**3)))

Dataframe size: 4.18 GB


## Floats

In [None]:
# float32 precision is sufficient for latitude and longitude
location_columns = ['pickup_latitude','pickup_longitude',
                    'dropoff_latitude','dropoff_longitude']
df[location_columns] = df[location_columns].astype('float32') 

In [None]:
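# 0.0 < trip_distance < 1.54e+07 (the maximum is an outlier), so scale by
# 1/1000 to fit within float16 (max 65504). Caveats: the TLC data dictionary
# lists this column in miles, so the result is in thousands of miles rather
# than km, and float16 keeps only about 3 significant digits.
df.trip_distance = (df.trip_distance/1000).astype('float16')

# only 0.0 and 0.3 occur, so store whether the surcharge was applied
df.improvement_surcharge = df.improvement_surcharge == 0.3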

In [None]:
print('Dataframe size: %2.2f GB'%(getsizeof(df)/(1024.0**3)))

Dataframe size: 3.83 GB
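
How much precision each float width gives up can be checked directly. This sketch takes one made-up latitude (an assumption) and measures the rounding error of `float32` and `float16` in metres, using the rough figure of 111 km per degree of latitude.

```python
import numpy as np

lat = 40.750111  # hypothetical Manhattan latitude, in degrees

# Absolute rounding error, in degrees, after storing in each type
err32 = abs(float(np.float32(lat)) - lat)
err16 = abs(float(np.float16(lat)) - lat)

metres_per_degree = 111_000  # rough conversion for latitude
print('float32 error: %.3f m' % (err32 * metres_per_degree))
print('float16 error: %.1f m' % (err16 * metres_per_degree))

# float16 cannot represent anything above this value
print(np.finfo('float16').max)
```

Near 40 degrees, `float32` values are spaced about 4e-6 degrees apart (under half a metre), while `float16` values are spaced about 0.03 degrees apart (a few kilometres), which is why `float16` is reserved for columns where coarse values are acceptable.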


## Object

In [None]:
# store_and_fwd_flag contains Y or N, so store it as a boolean
df.store_and_fwd_flag = df.store_and_fwd_flag == 'Y'

# Convert string to datetime64[ns]
date_time_columns = ['tpep_pickup_datetime','tpep_dropoff_datetime']
for col in date_time_columns:
    df[col] = pd.to_datetime(df[col])

In [None]:
print('Dataframe size: %2.2f GB'%(getsizeof(df)/(1024.0**3)))

Dataframe size: 1.45 GB
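
The conversions so far were applied after loading, so the full-size frame still had to fit in memory once. As a hedged alternative, `pd.read_csv` accepts `dtype` and `parse_dates` arguments that apply the narrower types while reading. The sketch below runs on a two-row, made-up extract with a subset of the columns (the rows and the name `df_small` are assumptions).

```python
import io
import pandas as pd

# Hypothetical two-row extract with a subset of the taxi columns
csv_text = """VendorID,tpep_pickup_datetime,passenger_count,pickup_latitude,store_and_fwd_flag
2,2015-01-15 19:05:39,1,40.750111,N
1,2015-01-10 20:33:38,3,40.724243,Y
"""

df_small = pd.read_csv(
    io.StringIO(csv_text),
    # Narrow types are applied while parsing, not afterwards
    dtype={
        'passenger_count': 'uint8',
        'pickup_latitude': 'float32',
        'store_and_fwd_flag': 'category',
    },
    # Parse the timestamp column into a datetime64 dtype
    parse_dates=['tpep_pickup_datetime'],
)
print(df_small.dtypes)
```

For the real file, the same arguments would be passed alongside the path used at the top of the notebook.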


## Categories

In [None]:
df.mta_tax = df.mta_tax.astype('category')
df.payment_type = df.payment_type.astype('category')
df.pickup_neighbourhood = df.pickup_neighbourhood.astype('category')

In [None]:
final_size = getsizeof(df)/(1024.0**3)
print('Dataframe size: %2.2f GB'%final_size)

Dataframe size: 0.57 GB
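
A category column stores each distinct value once and keeps a small integer code per row, which is why it pays off when few values repeat many times. A minimal sketch on synthetic data, assuming four made-up neighbourhood names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical neighbourhood names, repeated at random over 10,000 rows
names = np.array(['Harlem', 'SoHo', 'Chelsea', 'Astoria'], dtype=object)
as_object = pd.Series(rng.choice(names, size=10_000), dtype=object)
as_category = as_object.astype('category')

object_bytes = as_object.memory_usage(deep=True, index=False)
category_bytes = as_category.memory_usage(deep=True, index=False)

print('object:   %d bytes' % object_bytes)
print('category: %d bytes' % category_bytes)
print('codes dtype:', as_category.cat.codes.dtype)
```

With only four distinct values the codes fit in `int8`, so each row costs one byte instead of a pointer plus a string object. The saving shrinks as the number of distinct values approaches the number of rows.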


# Total reduction: 88.4%!

In [None]:
print('total size reduction: %2.1f'%((1-final_size/start_size)*100))

total size reduction: 88.4


In [None]:
df.dtypes

VendorID                           bool
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   uint8
trip_distance                   float16
pickup_longitude                float32
pickup_latitude                 float32
RateCodeID                        uint8
store_and_fwd_flag                 bool
dropoff_longitude               float32
dropoff_latitude                float32
payment_type                   category
fare_amount                       uint8
extra                             uint8
mta_tax                        category
tip_amount                        uint8
tolls_amount                      uint8
improvement_surcharge              bool
total_amount                      uint8
pickup_neighbourhood           category
dtype: object