# Ice Breaking on Dataset

### Investigation & Optimization


## Investigation
* The full dataset for this project consists of 12 `csv` files.
  * Each file holds one month of taxi trip records for Chicago.
* We start by exploring `chicago_taxi_trips_2016_01.csv` to find out what it contains.

### Build pre-processing functions to reduce the memory footprint of all 12 data files.

In [1]:
import pandas as pd

trips_2016_01 = pd.read_csv('data/chicago_taxi_trips_2016_01.csv')

In [2]:
trips_2016_01.head()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016-1-13 06:15:00,2016-1-13 06:15:00,180.0,0.4,,,24.0,24.0,4.5,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,2776.0,2016-1-22 09:30:00,2016-1-22 09:45:00,240.0,0.7,,,,,4.45,4.45,0.0,0.0,8.9,Credit Card,,,,,
2,3168.0,2016-1-31 21:30:00,2016-1-31 21:30:00,0.0,0.0,,,,,42.75,5.0,0.0,0.0,47.75,Credit Card,119.0,,,,
3,4237.0,2016-1-23 17:30:00,2016-1-23 17:30:00,480.0,1.1,,,6.0,6.0,7.0,0.0,0.0,0.0,7.0,Cash,,686.0,500.0,686.0,500.0
4,5710.0,2016-1-14 05:45:00,2016-1-14 06:00:00,480.0,2.71,,,32.0,,10.25,0.0,0.0,0.0,10.25,Cash,,385.0,478.0,,


In [3]:
trips_2016_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705805 entries, 0 to 1705804
Data columns (total 20 columns):
taxi_id                   float64
trip_start_timestamp      object
trip_end_timestamp        object
trip_seconds              float64
trip_miles                float64
pickup_census_tract       float64
dropoff_census_tract      float64
pickup_community_area     float64
dropoff_community_area    float64
fare                      float64
tips                      float64
tolls                     float64
extras                    float64
trip_total                float64
payment_type              object
company                   float64
pickup_latitude           float64
pickup_longitude          float64
dropoff_latitude          float64
dropoff_longitude         float64
dtypes: float64(17), object(3)
memory usage: 260.3+ MB


In [4]:
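# _data is a private attribute exposing the internal BlockManager
# (newer pandas versions name it _mgr)
trips_2016_01._data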

BlockManager
Items: Index(['taxi_id', 'trip_start_timestamp', 'trip_end_timestamp', 'trip_seconds',
       'trip_miles', 'pickup_census_tract', 'dropoff_census_tract',
       'pickup_community_area', 'dropoff_community_area', 'fare', 'tips',
       'tolls', 'extras', 'trip_total', 'payment_type', 'company',
       'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
       'dropoff_longitude'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=1705805, step=1)
FloatBlock: [0, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19], 17 x 1705805, dtype: float64
ObjectBlock: [1, 2, 14], 3 x 1705805, dtype: object

In [5]:
print('total size: {}'.format(trips_2016_01.size))
print('column length: {}'.format(len(trips_2016_01.columns)))
print('row length: {}'.format(len(trips_2016_01)))

total size: 34116100
column length: 20
row length: 1705805


In [6]:
obj_cols = trips_2016_01.select_dtypes(include=['object'])
float_cols = trips_2016_01.select_dtypes(include=['float'])

In [7]:
obj_cols_mem = obj_cols.memory_usage(deep=True)
float_cols_mem = float_cols.memory_usage(deep=True)

In [8]:
obj_cols_mem

Index                          80
trip_start_timestamp    127468137
trip_end_timestamp      127462927
payment_type            109573614
dtype: int64

In [9]:
float_cols_mem

Index                           80
taxi_id                   13646440
trip_seconds              13646440
trip_miles                13646440
pickup_census_tract       13646440
dropoff_census_tract      13646440
pickup_community_area     13646440
dropoff_community_area    13646440
fare                      13646440
tips                      13646440
tolls                     13646440
extras                    13646440
trip_total                13646440
company                   13646440
pickup_latitude           13646440
pickup_longitude          13646440
dropoff_latitude          13646440
dropoff_longitude         13646440
dtype: int64

### Note.

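* Every `float64` column uses the same amount of memory: 1,705,805 rows × 8 bytes = 13,646,440 bytes.
* We can check the value range of each column and downcast it to a smaller type where the range allows.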
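The identical figure for every float column follows from the fixed width of `float64`: each value takes 8 bytes whatever its magnitude. Below is a minimal sketch of that arithmetic, using the row count reported by `info()` above. The `fits_in_float32` helper is a hypothetical name introduced only for illustration; it shows one possible way to check a column's range before downcasting, and it checks range only, not precision.

```python
import numpy as np

N_ROWS = 1_705_805  # row count reported by trips_2016_01.info()

# Fixed-width dtypes cost itemsize bytes per row, whatever the values are.
bytes_float64 = N_ROWS * np.dtype("float64").itemsize
bytes_float32 = N_ROWS * np.dtype("float32").itemsize
print(bytes_float64)  # 13646440, matching the per-column figure above
print(bytes_float32)  # 6823220, half as much


def fits_in_float32(values):
    """Return True if every non-NaN value lies within float32's finite range.

    Hypothetical helper: it checks range only, not precision.
    """
    arr = np.asarray(values, dtype="float64")
    finite = arr[~np.isnan(arr)]
    if finite.size == 0:
        return True
    limit = np.finfo(np.float32).max
    return bool(np.abs(finite).max() <= limit)


print(fits_in_float32([180.0, 240.0, np.nan]))  # True
print(fits_in_float32([1e39]))                  # False
```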

In [10]:
# total memory usage in megabytes (MiB)
trips_2016_01.memory_usage(deep=True).sum()/2**20

568.86123466491699

In [11]:
# only 3 of the columns are object type,
# but together they use more memory than the 17 float columns combined
obj_cols_mem.sum() / 2**20

347.61882591247559

In [12]:
float_cols_mem.sum() / 2**20

221.24248504638672
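One likely reason the three object columns dominate: an object column stores a pointer per row, and `deep=True` also counts the Python string each pointer refers to. The sketch below measures one timestamp string in the format shown by `head()` and compares it with the average implied by the figures above. The 8-byte pointer size assumes a 64-bit build, and the exact string size varies by Python version, so no fixed value is claimed for it.

```python
import sys

sample = "2016-1-13 06:15:00"  # timestamp string in the format shown by head()

# An object column stores one pointer per row (8 bytes on a 64-bit build);
# deep=True additionally counts the Python string each pointer refers to.
pointer_bytes = 8
string_bytes = sys.getsizeof(sample)  # exact value depends on the Python version

print(len(sample), string_bytes, pointer_bytes + string_bytes)

# Average deep bytes per row for trip_start_timestamp, from the output above.
avg_timestamp_bytes = 127_468_137 / 1_705_805
print(round(avg_timestamp_bytes, 1))  # 74.7
```

By contrast, a fixed-width numeric column costs 8 bytes per row at most, which is why converting the string columns is where most of the savings come from.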

## Optimize `Float` type columns using `pandas.to_numeric` with `downcast` parameter

In [13]:
for fc in float_cols.columns:
    trips_2016_01[fc] = pd.to_numeric(trips_2016_01[fc], downcast='float')
    print(fc, trips_2016_01[fc].dtype)

taxi_id float32
trip_seconds float32
trip_miles float32
pickup_census_tract float32
dropoff_census_tract float32
pickup_community_area float32
dropoff_community_area float32
fare float32
tips float32
tolls float32
extras float32
trip_total float32
company float32
pickup_latitude float32
pickup_longitude float32
dropoff_latitude float32
dropoff_longitude float32
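A caveat worth keeping in mind when downcasting identifier-like columns: `float32` has a 24-bit significand, so whole numbers are represented exactly only up to 2**24. The sketch below illustrates the limit with NumPy. The sample IDs are taken from the `head()` output above; that every ID in the full file stays below the limit is an assumption, not something verified here.

```python
import numpy as np

# float32 has a 24-bit significand, so integers are exact only up to 2**24.
LIMIT = 2 ** 24  # 16777216

at_limit = int(np.float32(LIMIT))
past_limit = int(np.float32(LIMIT + 1))
print(at_limit)    # 16777216
print(past_limit)  # 16777216: 16777217 is not representable and gets rounded

# IDs of the size shown in head() survive the float64 -> float32 round trip.
ids = np.array([85.0, 2776.0, 7418.0], dtype="float64")
roundtrip_ok = np.array_equal(ids.astype("float32").astype("float64"), ids)
print(roundtrip_ok)  # True
```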


## Which `float` type columns can be converted to `int` type columns?

In [14]:
# check the header part again
trips_2016_01.head()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016-1-13 06:15:00,2016-1-13 06:15:00,180.0,0.4,,,24.0,24.0,4.5,0.0,0.0,0.0,4.5,Cash,107.0,199.0,510.0,199.0,510.0
1,2776.0,2016-1-22 09:30:00,2016-1-22 09:45:00,240.0,0.7,,,,,4.45,4.45,0.0,0.0,8.9,Credit Card,,,,,
2,3168.0,2016-1-31 21:30:00,2016-1-31 21:30:00,0.0,0.0,,,,,42.75,5.0,0.0,0.0,47.75,Credit Card,119.0,,,,
3,4237.0,2016-1-23 17:30:00,2016-1-23 17:30:00,480.0,1.1,,,6.0,6.0,7.0,0.0,0.0,0.0,7.0,Cash,,686.0,500.0,686.0,500.0
4,5710.0,2016-1-14 05:45:00,2016-1-14 06:00:00,480.0,2.71,,,32.0,,10.25,0.0,0.0,0.0,10.25,Cash,,385.0,478.0,,


In [24]:
# taxi_id holds only whole-number values,
# but it also contains nulls, so it cannot be cast to a NumPy int dtype.
trips_2016_01['taxi_id'].isnull().sum()

23
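If integer storage is still wanted despite the nulls, one option is the nullable integer extension dtype, available in pandas 0.24 and later. This is a sketch on a small stand-in Series, not part of the original workflow; whether it suits the pandas version used in this notebook is an assumption.

```python
import numpy as np
import pandas as pd

# Small stand-in for the taxi_id column: whole-number floats plus a missing value.
taxi_id = pd.Series([85.0, 2776.0, np.nan, 4237.0])

# A NumPy integer dtype cannot hold NaN, so astype("int32") would raise here.
# The nullable extension dtype "Int32" (capital I) stores missing values as <NA>.
taxi_id_int = taxi_id.astype("Int32")

print(taxi_id_int.dtype)              # Int32
print(taxi_id_int.isna().sum())       # 1
print(taxi_id_int.dropna().tolist())  # [85, 2776, 4237]
```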

## Converting to DateTime

In [25]:
trips_2016_01.columns

Index(['taxi_id', 'trip_start_timestamp', 'trip_end_timestamp', 'trip_seconds',
       'trip_miles', 'pickup_census_tract', 'dropoff_census_tract',
       'pickup_community_area', 'dropoff_community_area', 'fare', 'tips',
       'tolls', 'extras', 'trip_total', 'payment_type', 'company',
       'pickup_latitude', 'pickup_longitude', 'dropoff_latitude',
       'dropoff_longitude'],
      dtype='object')

In [26]:
# the 'timestamp' columns are stored as object (string) dtype
print(trips_2016_01.trip_start_timestamp.dtype)
print(trips_2016_01.trip_end_timestamp.dtype)

object
object


In [27]:
trips_2016_01['trip_start_timestamp'] = pd.to_datetime(trips_2016_01['trip_start_timestamp'])
trips_2016_01['trip_end_timestamp'] = pd.to_datetime(trips_2016_01['trip_end_timestamp'])

In [28]:
trips_2016_01[['trip_start_timestamp', 'trip_end_timestamp']].memory_usage(deep=True)

Index                         80
trip_start_timestamp    13646440
trip_end_timestamp      13646440
dtype: int64
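The drop to 13,646,440 bytes per column is the same 8 bytes per row seen for the `float64` columns, because each `datetime64` value is a fixed-width integer internally. Here is a small sketch on stand-in data; zero-padded months are used to keep parsing unambiguous, which differs slightly from the raw file's format.

```python
import pandas as pd

# Stand-in data; zero-padded months are used here to keep parsing unambiguous.
raw = pd.Series([
    "2016-01-13 06:15:00",
    "2016-01-22 09:30:00",
    "2016-01-31 21:30:00",
])
parsed = pd.to_datetime(raw)

# Each datetime64 value is a fixed-width 8-byte integer under the hood.
bytes_per_value = parsed.memory_usage(index=False, deep=True) / len(parsed)
print(parsed.dtype.kind, bytes_per_value)  # M 8.0

# Parsed timestamps also unlock the .dt accessor for date-based features.
hours = parsed.dt.hour.tolist()
weekdays = parsed.dt.dayofweek.tolist()  # Monday is 0
print(hours)     # [6, 9, 21]
print(weekdays)  # [2, 4, 6]
```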

## Converting to Categorical
* The `category` data type uses an integer subtype to represent the unique values in a column.
* TRADEOFF: we can't do arithmetic on `category` columns, and methods like `Series.min()` or `Series.max()` fail on an unordered `category` column unless we convert it to a true numerical dtype first.
  * **We should stick to using the category type primarily for object columns where less than 50% of the values are unique.**
  * If all of the values in a column are unique, the category type will end up using more memory.
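The points above can be sketched on synthetic data. The Series below are stand-ins, not the real `payment_type` column, and the sizes are chosen only for illustration.

```python
import pandas as pd

# Low-cardinality stand-in for payment_type: two labels repeated many times.
payments = pd.Series(["Cash", "Credit Card"] * 5_000)
as_cat = payments.astype("category")

codes_dtype = as_cat.cat.codes.dtype
print(codes_dtype)  # int8: one small integer code per row
saves_memory = payments.memory_usage(deep=True) > as_cat.memory_usage(deep=True)
print(saves_memory)  # True

# min() is rejected on an unordered categorical.
try:
    as_cat.min()
except TypeError as exc:
    print("TypeError:", str(exc).splitlines()[0])

# High-cardinality counterexample: every value is unique.
unique_ids = pd.Series([f"id_{i}" for i in range(10_000)])
costs_more = (unique_ids.astype("category").memory_usage(deep=True)
              > unique_ids.memory_usage(deep=True))
print(costs_more)  # True: the codes are stored on top of all the labels
```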

In [34]:
trips_2016_01['payment_type'].dtype

dtype('O')

In [31]:
trips_2016_01['payment_type'].memory_usage(deep=True)

109573694

In [33]:
trips_2016_01['payment_type'].value_counts()

Cash           912334
Credit Card    781271
No Charge        7555
Unknown          3139
Dispute           845
Pcard             437
Prcard            224
Name: payment_type, dtype: int64

In [36]:
# are fewer than 50% of the values unique?
(len(trips_2016_01['payment_type'].unique()) / len(trips_2016_01)) < .5

True

In [37]:
trips_2016_01['payment_type'] = trips_2016_01['payment_type'].astype('category')
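The 50% check can be generalized to scan every object column at once. `category_candidates` is a hypothetical helper written for this illustration, and the demo frame is synthetic; only the column names `payment_type` and `fare` come from the dataset.

```python
import pandas as pd


def category_candidates(df, threshold=0.5):
    """List object columns whose share of unique values is below threshold.

    Hypothetical helper; the 0.5 default mirrors the rule of thumb above.
    """
    if len(df) == 0:
        return []
    candidates = []
    for col in df.select_dtypes(include=["object"]).columns:
        unique_ratio = df[col].nunique(dropna=False) / len(df)
        if unique_ratio < threshold:
            candidates.append(col)
    return candidates


demo = pd.DataFrame({
    "payment_type": pd.Series(
        ["Cash", "Credit Card", "Cash", "Cash", "Credit Card", "Cash"],
        dtype="object",
    ),
    # every value is unique, so this column is not a candidate
    "trip_note": pd.Series(["a", "b", "c", "d", "e", "f"], dtype="object"),
    "fare": [4.5, 4.45, 42.75, 7.0, 10.25, 17.75],
})
candidates = category_candidates(demo)
print(candidates)  # ['payment_type']
```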

## Compare the memory usages vs. the original dataset

* Deep memory usage decreased from `568.9 MB` to `138.3 MB`, a reduction of about 76%.

In [38]:
trips_2016_01_origin = pd.read_csv('data/chicago_taxi_trips_2016_01.csv')
trips_2016_01_origin.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705805 entries, 0 to 1705804
Data columns (total 20 columns):
taxi_id                   float64
trip_start_timestamp      object
trip_end_timestamp        object
trip_seconds              float64
trip_miles                float64
pickup_census_tract       float64
dropoff_census_tract      float64
pickup_community_area     float64
dropoff_community_area    float64
fare                      float64
tips                      float64
tolls                     float64
extras                    float64
trip_total                float64
payment_type              object
company                   float64
pickup_latitude           float64
pickup_longitude          float64
dropoff_latitude          float64
dropoff_longitude         float64
dtypes: float64(17), object(3)
memory usage: 568.9 MB


In [39]:
trips_2016_01.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705805 entries, 0 to 1705804
Data columns (total 20 columns):
taxi_id                   float32
trip_start_timestamp      datetime64[ns]
trip_end_timestamp        datetime64[ns]
trip_seconds              float32
trip_miles                float32
pickup_census_tract       float32
dropoff_census_tract      float32
pickup_community_area     float32
dropoff_community_area    float32
fare                      float32
tips                      float32
tolls                     float32
extras                    float32
trip_total                float32
payment_type              category
company                   float32
pickup_latitude           float32
pickup_longitude          float32
dropoff_latitude          float32
dropoff_longitude         float32
dtypes: category(1), datetime64[ns](2), float32(17)
memory usage: 138.3 MB
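To see where the savings come from column by column, the two frames can be compared directly. `memory_report` is a hypothetical helper for illustration, shown on a one-column stand-in frame; the final line reproduces the overall reduction from the two `info()` summaries above.

```python
import pandas as pd


def memory_report(before, after):
    """Per-column deep memory usage (bytes) before and after optimization.

    Hypothetical helper; both frames must share the same column names.
    """
    report = pd.DataFrame({
        "before": before.memory_usage(index=False, deep=True),
        "after": after.memory_usage(index=False, deep=True),
    })
    report["saved_pct"] = (100 * (1 - report["after"] / report["before"])).round(1)
    return report


before = pd.DataFrame({"fare": [4.5, 4.45, 42.75, 7.0]})
after = before.astype({"fare": "float32"})
report = memory_report(before, after)
print(report)

# Overall reduction implied by the two info() summaries in this section.
overall_saved_pct = round(100 * (1 - 138.3 / 568.9), 1)
print(overall_saved_pct)  # 75.7
```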


## Now we can write a function that optimizes the memory usage of a DataFrame.

In [52]:
def optimize_memory_usage(df):
    """Return a memory-optimized copy of df and print before/after summaries."""
    data = df.copy()

    obj_cols = data.select_dtypes(include=['object'])
    float_cols = data.select_dtypes(include=['float'])

    # downcast float type columns
    for fc in float_cols.columns:
        data[fc] = pd.to_numeric(data[fc], downcast='float')

    for oc in obj_cols.columns:

        # convert timestamp columns to datetime type columns
        if 'timestamp' in oc:
            data[oc] = pd.to_datetime(data[oc])

        # convert low-cardinality 'type' columns to categorical
        if 'type' in oc:
            if len(data[oc].unique()) / len(data) < .5:
                data[oc] = data[oc].astype('category')

    # info() prints its summary itself and returns None,
    # so it is called directly instead of being wrapped in print()
    print('original data')
    df.info(memory_usage='deep')
    print('optimized data')
    data.info(memory_usage='deep')

    return data

In [53]:
# test the function above with the csv file we have used so far.

optimize_memory_usage(trips_2016_01_origin)

original data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705805 entries, 0 to 1705804
Data columns (total 20 columns):
taxi_id                   float64
trip_start_timestamp      object
trip_end_timestamp        object
trip_seconds              float64
trip_miles                float64
pickup_census_tract       float64
dropoff_census_tract      float64
pickup_community_area     float64
dropoff_community_area    float64
fare                      float64
tips                      float64
tolls                     float64
extras                    float64
trip_total                float64
payment_type              object
company                   float64
pickup_latitude           float64
pickup_longitude          float64
dropoff_latitude          float64
dropoff_longitude         float64
dtypes: float64(17), object(3)
memory usage: 568.9 MB
optimized data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705805 entries, 0 to 1705804
Data columns (total 20 columns):
taxi_id  

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,85.0,2016-01-13 06:15:00,2016-01-13 06:15:00,180.0,0.40,,,24.0,24.0,4.50,0.00,0.0,0.0,4.500000,Cash,107.0,199.0,510.0,199.0,510.0
1,2776.0,2016-01-22 09:30:00,2016-01-22 09:45:00,240.0,0.70,,,,,4.45,4.45,0.0,0.0,8.900000,Credit Card,,,,,
2,3168.0,2016-01-31 21:30:00,2016-01-31 21:30:00,0.0,0.00,,,,,42.75,5.00,0.0,0.0,47.750000,Credit Card,119.0,,,,
3,4237.0,2016-01-23 17:30:00,2016-01-23 17:30:00,480.0,1.10,,,6.0,6.0,7.00,0.00,0.0,0.0,7.000000,Cash,,686.0,500.0,686.0,500.0
4,5710.0,2016-01-14 05:45:00,2016-01-14 06:00:00,480.0,2.71,,,32.0,,10.25,0.00,0.0,0.0,10.250000,Cash,,385.0,478.0,,
5,1987.0,2016-01-08 18:15:00,2016-01-08 18:45:00,1080.0,6.20,,,8.0,3.0,17.75,0.00,0.0,0.0,17.750000,Cash,,599.0,346.0,660.0,120.0
6,4986.0,2016-01-14 04:30:00,2016-01-14 05:00:00,1500.0,18.40,,,,,45.00,12.00,0.0,0.0,57.000000,Credit Card,,,,,
7,6400.0,2016-01-26 04:15:00,2016-01-26 04:15:00,60.0,0.20,,,16.0,16.0,3.75,0.00,0.0,0.0,3.750000,Cash,107.0,527.0,24.0,527.0,24.0
8,7418.0,2016-01-22 11:30:00,2016-01-22 11:45:00,180.0,0.00,,504.0,8.0,32.0,5.00,2.00,0.0,1.5,8.500000,Credit Card,82.0,210.0,470.0,744.0,605.0
9,6450.0,2016-01-07 21:15:00,2016-01-07 21:15:00,0.0,0.00,,,,,3.25,0.00,0.0,1.5,4.750000,Cash,,,,,
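With the function working on January, the remaining step is to run it over all 12 files. The sketch below is one way to do that. The file-name pattern is assumed from `chicago_taxi_trips_2016_01.csv`, and `monthly_csv_paths` and `load_optimized` are hypothetical names introduced here.

```python
from pathlib import Path

import pandas as pd


def monthly_csv_paths(data_dir="data", year=2016):
    """Build the 12 monthly file paths.

    The naming pattern is assumed from chicago_taxi_trips_2016_01.csv.
    """
    return [
        Path(data_dir) / f"chicago_taxi_trips_{year}_{month:02d}.csv"
        for month in range(1, 13)
    ]


def load_optimized(paths, optimize):
    """Read each CSV and pass it through optimize, keyed by file stem.

    `optimize` is any function mapping a DataFrame to a DataFrame,
    for example the optimize_memory_usage function from this notebook.
    """
    return {path.stem: optimize(pd.read_csv(path)) for path in paths}


paths = monthly_csv_paths()
print(len(paths), paths[0].name, paths[-1].name)
```

Calling `load_optimized(monthly_csv_paths(), optimize_memory_usage)` would keep all twelve optimized frames in memory at once. Whether that fits depends on the size of the other months, which has not been checked here.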
