# 3. Preparing Datasets for Prediction Modelling
In this notebook, the dataset that has been cleaned and analysed is prepared so that we can run our models on it.

## Objectives:
 - Encode categorical features
 - Create train and validation sets
 - Extract the required "POLYLINE" coordinates from test and validation sets

In [1]:
import pandas as pd
import numpy as np
import json
pd.options.display.max_columns = 999

Importing both train and test datasets:

In [2]:
taxi = pd.read_pickle('./Pickles/taxi_EDA_completed')

In [3]:
test = pd.read_pickle('./Pickles/test_EDA_completed')

## Encoding categorical columns

Encoding categorical columns is good practice, as most algorithm packages do not accept str values. The categorical features in our dataset are TRIP_ID, CALL_TYPE, ORIGIN_CALL, TIMESTAMP, TAXI_ID, WEEK, DAY and Q_HOUR.

Of these, only CALL_TYPE needs to be encoded, as its values are in str format. The label encoder from sklearn is chosen because it encodes classes with values between 0 and n_classes-1. In addition, it is important to fit the encoder on the train and test datasets together, as some classes might be present in only one of them.

In [4]:
from sklearn import preprocessing

#concatenating both train and test "CALL_TYPE" to ensure that all classes are captured in our encoder
total_call_type = pd.DataFrame(pd.concat([taxi["CALL_TYPE"], test["CALL_TYPE"]])) 
call_type_encoder = preprocessing.LabelEncoder().fit(total_call_type["CALL_TYPE"])

#encoding class labels for both train and test sets
taxi["CALL_TYPE"] = call_type_encoder.transform(taxi["CALL_TYPE"])
test["CALL_TYPE"] = call_type_encoder.transform(test["CALL_TYPE"])
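The reason for fitting on both sets can be seen on a toy example. The sketch below is only an illustration: the class labels "A", "B" and "C" are assumed values, not taken from the dataset. It shows that an encoder fitted on train alone raises a `ValueError` when it meets a label it has never seen, while an encoder fitted on the concatenation covers every class.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for taxi["CALL_TYPE"] and test["CALL_TYPE"] (assumed values)
train_calls = pd.Series(["A", "B", "A"])
test_calls = pd.Series(["B", "C"])  # "C" never appears in train

# Fitting on the concatenation captures every class
encoder = LabelEncoder().fit(pd.concat([train_calls, test_calls]))
mapping = {str(c): int(i) for c, i in zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)

# Fitting on train only fails as soon as test contains an unseen label
train_only = LabelEncoder().fit(train_calls)
try:
    train_only.transform(test_calls)
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True
print(unseen_label_failed)
```

Classes are sorted before being numbered, so the integer codes follow alphabetical order of the labels.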

## Creating a train and validation set
As the given test set contains only 320 truncated trips, we will split our training data into train and validation sets. The ratio chosen is train (99.9% - 1,580,976 trips) and validation (0.1% - 1,581 trips).

In [5]:
from sklearn.utils import shuffle
def train_val_split(df):
    '''
    Splitting up the training set into 99.9% train/0.1% validation.
    '''
    df = shuffle(df, random_state = 0).reset_index(drop=True) #shuffle the dataset before splitting
    split_idx = df.shape[0] - int(0.001 * df.shape[0]) + 1 #index of the first validation row
    print("{} {}".format(split_idx, df.shape[0] - split_idx)) #number of train rows, number of val rows
    train = df.iloc[:split_idx].reset_index(drop=True)
    val = df.iloc[split_idx:].reset_index(drop=True)
    return train , val

In [6]:
train , val = train_val_split(taxi) 

1580976 1581
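The two printed numbers can be reproduced by hand from the split formula. The short check below assumes the total row count is the sum of the two printed values; the variable names are illustrative only.

```python
# Total number of trips, taken as the sum of the two printed sizes
n_rows = 1580976 + 1581

# Same arithmetic as in train_val_split: drop 0.1% (rounded down) and add 1
n_val_rounded = int(0.001 * n_rows)
split_at = n_rows - n_val_rounded + 1

print(n_rows, n_val_rounded, split_at, n_rows - split_at)
```

Because of the `+ 1`, the validation set ends up one row smaller than the rounded-down 0.1% of the data.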


After splitting the training data into train and val sets, the polyline coordinates of the val set have to be truncated.

In [7]:
def random_truncate(df, polyline):
    """
    Randomly truncate the end of the trip's polyline points to simulate partial trips.
    This is only to create a validation dataset.
    Note: If there is only one coordinate, no truncation will be carried out
    """
    np.random.seed(0)
    return df[polyline].map(lambda x: x[:np.random.randint(1, len(x))] if len(x) > 1 else x)

In [8]:
val["POLYLINE"] = random_truncate(val, "POLYLINE")
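To see what the truncation does, here is a small sketch that applies the same function to a made-up frame of three trips (the coordinates are placeholders, not real data). Since `np.random.randint(1, len(x))` excludes its upper bound, a trip with more than one point always loses at least its final point and always keeps its first one.

```python
import numpy as np
import pandas as pd

def random_truncate(df, polyline):
    # Same logic as the notebook function: keep a random non-empty prefix
    np.random.seed(0)
    return df[polyline].map(lambda x: x[:np.random.randint(1, len(x))] if len(x) > 1 else x)

# Toy trips with 1, 2 and 4 points (placeholder coordinates)
toy = pd.DataFrame({"POLYLINE": [
    [[0, 0]],
    [[0, 0], [1, 1]],
    [[0, 0], [1, 1], [2, 2], [3, 3]],
]})

original_lengths = toy["POLYLINE"].map(len).tolist()
truncated = random_truncate(toy, "POLYLINE")
truncated_lengths = truncated.map(len).tolist()
print(original_lengths, truncated_lengths)
```

The single-point trip is returned unchanged, and the two-point trip is always cut down to its starting point.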

## Extracting required "POLYLINE" coordinates from test and validation sets

After extensive research, we concluded that only the first and last few coordinates are essential to determine the taxi's destination. According to research done by the University of South Florida, five coordinates are required to conduct trajectory prediction analysis. Therefore, we shall use only the first five and last five coordinates for our destination prediction.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.391.8313&rep=rep1&type=pdf

In our train, val and test sets, the start and end coordinates have already been extracted. Therefore, only the subsequent 4 coordinates after start and last 4 coordinates before end are required to be extracted. 

For the train set, the last coordinate in "POLYLINE" is our y-label. For the val and test sets, however, the "POLYLINE" coordinates are already truncated, which means the last coordinate is not our y-label. To handle this, two separate functions are written for extraction.

In [9]:
def extract_train(df):
    """
    Extract next 4 coordinates after start 
    and last 4 coordinates before end for train set
    """
    train_lat_2 = []
    train_lat_3 = []
    train_lat_4 = []
    train_lat_5 = []
    
    train_long_2 = []
    train_long_3 = []
    train_long_4 = []
    train_long_5 = []

    for x in df["POLYLINE"]:
        if len(x) >= 5: 
            train_lat_2.append(x[1][0])
            train_lat_3.append(x[2][0])
            train_lat_4.append(x[3][0])
            train_lat_5.append(x[4][0])
                              
            train_long_2.append(x[1][1])
            train_long_3.append(x[2][1])
            train_long_4.append(x[3][1])
            train_long_5.append(x[4][1])
            
        
        if len(x) == 4: 
            train_lat_2.append(x[1][0])
            train_lat_3.append(x[2][0])
            train_lat_4.append(x[3][0])
            train_lat_5.append(x[3][0])
                              
            train_long_2.append(x[1][1])
            train_long_3.append(x[2][1])
            train_long_4.append(x[3][1])
            train_long_5.append(x[3][1])
                                
        if len(x) == 3: 
            train_lat_2.append(x[1][0])
            train_lat_3.append(x[2][0])
            train_lat_4.append(x[2][0])
            train_lat_5.append(x[2][0])
                              
            train_long_2.append(x[1][1])
            train_long_3.append(x[2][1])
            train_long_4.append(x[2][1])
            train_long_5.append(x[2][1])
        
        if len(x) == 2: 
            train_lat_2.append(x[1][0])
            train_lat_3.append(x[1][0])
            train_lat_4.append(x[1][0])
            train_lat_5.append(x[1][0])
                              
            train_long_2.append(x[1][1])
            train_long_3.append(x[1][1])
            train_long_4.append(x[1][1])
            train_long_5.append(x[1][1]) 
            
        if len(x) == 1: 
            train_lat_2.append(x[0][0])
            train_lat_3.append(x[0][0])
            train_lat_4.append(x[0][0])
            train_lat_5.append(x[0][0])
                              
            train_long_2.append(x[0][1])
            train_long_3.append(x[0][1])
            train_long_4.append(x[0][1])
            train_long_5.append(x[0][1])
        
                                
    df["LAT_2"] = pd.DataFrame(train_lat_2)
    df["LAT_3"] = pd.DataFrame(train_lat_3)
    df["LAT_4"] = pd.DataFrame(train_lat_4)
    df["LAT_5"] = pd.DataFrame(train_lat_5)  
                              
    df["LONG_2"] = pd.DataFrame(train_long_2)
    df["LONG_3"] = pd.DataFrame(train_long_3)
    df["LONG_4"] = pd.DataFrame(train_long_4)
    df["LONG_5"] = pd.DataFrame(train_long_5)
    
    print("First LATs, LONGs extracted")
    
    train_lat_2_last = []
    train_lat_3_last = []
    train_lat_4_last = []
    train_lat_5_last = []
                              
    train_long_2_last = []
    train_long_3_last = []
    train_long_4_last = []
    train_long_5_last = []

    for x in df["POLYLINE"]:
        if len(x) >= 5: 
            train_lat_2_last.append(x[-2][0])
            train_lat_3_last.append(x[-3][0])
            train_lat_4_last.append(x[-4][0])
            train_lat_5_last.append(x[-5][0])
                                                         
            train_long_2_last.append(x[-2][1])
            train_long_3_last.append(x[-3][1])
            train_long_4_last.append(x[-4][1])
            train_long_5_last.append(x[-5][1])
        
        if len(x) == 4: 
            train_lat_2_last.append(x[-2][0])
            train_lat_3_last.append(x[-3][0])
            train_lat_4_last.append(x[-4][0])
            train_lat_5_last.append(x[-4][0])
                                   
            train_long_2_last.append(x[-2][1])
            train_long_3_last.append(x[-3][1])
            train_long_4_last.append(x[-4][1])
            train_long_5_last.append(x[-4][1])
                                
        if len(x) == 3:  
            train_lat_2_last.append(x[-2][0])
            train_lat_3_last.append(x[-3][0])
            train_lat_4_last.append(x[-3][0])
            train_lat_5_last.append(x[-3][0])
                                   
            train_long_2_last.append(x[-2][1])
            train_long_3_last.append(x[-3][1])
            train_long_4_last.append(x[-3][1])
            train_long_5_last.append(x[-3][1])
        
        if len(x) == 2: 
            train_lat_2_last.append(x[-2][0])
            train_lat_3_last.append(x[-2][0])
            train_lat_4_last.append(x[-2][0])
            train_lat_5_last.append(x[-2][0])
                                   
            train_long_2_last.append(x[-2][1])
            train_long_3_last.append(x[-2][1])
            train_long_4_last.append(x[-2][1])
            train_long_5_last.append(x[-2][1])
        
        if len(x) == 1: 
            train_lat_2_last.append(x[-1][0])
            train_lat_3_last.append(x[-1][0])
            train_lat_4_last.append(x[-1][0])
            train_lat_5_last.append(x[-1][0])
                                   
            train_long_2_last.append(x[-1][1])
            train_long_3_last.append(x[-1][1])
            train_long_4_last.append(x[-1][1])
            train_long_5_last.append(x[-1][1])
                                   
                              
    df["LAT_2_last"] = pd.DataFrame(train_lat_2_last)
    df["LAT_3_last"] = pd.DataFrame(train_lat_3_last)
    df["LAT_4_last"] = pd.DataFrame(train_lat_4_last)
    df["LAT_5_last"] = pd.DataFrame(train_lat_5_last)
                                   
    df["LONG_2_last"] = pd.DataFrame(train_long_2_last)
    df["LONG_3_last"] = pd.DataFrame(train_long_3_last)
    df["LONG_4_last"] = pd.DataFrame(train_long_4_last)
    df["LONG_5_last"] = pd.DataFrame(train_long_5_last)
    
    print("Last LATs, LONGs extracted")
    print("Train set extraction completed!")
    return df

In [10]:
def extract_test(df):
    """
    Extract next 4 coordinates after start 
    and last 4 coordinates (ending at the last recorded point) for val/test sets
    """
    test_lat_2 = []
    test_lat_3 = []
    test_lat_4 = []
    test_lat_5 = []
    
    test_long_2 = []
    test_long_3 = []
    test_long_4 = []
    test_long_5 = []

    for x in df["POLYLINE"]:
        if len(x) >= 5:
            test_lat_2.append(x[1][0])
            test_lat_3.append(x[2][0])
            test_lat_4.append(x[3][0])
            test_lat_5.append(x[4][0])
                              
            test_long_2.append(x[1][1])
            test_long_3.append(x[2][1])
            test_long_4.append(x[3][1])
            test_long_5.append(x[4][1])
            
        if len(x) == 4:
            test_lat_2.append(x[1][0])
            test_lat_3.append(x[2][0])
            test_lat_4.append(x[3][0])
            test_lat_5.append(x[3][0])
                              
            test_long_2.append(x[1][1])
            test_long_3.append(x[2][1])
            test_long_4.append(x[3][1])
            test_long_5.append(x[3][1])
                                
        if len(x) == 3:
            test_lat_2.append(x[1][0])
            test_lat_3.append(x[2][0])
            test_lat_4.append(x[2][0])
            test_lat_5.append(x[2][0])
                              
            test_long_2.append(x[1][1])
            test_long_3.append(x[2][1])
            test_long_4.append(x[2][1])
            test_long_5.append(x[2][1])
        
        if len(x) == 2:
            test_lat_2.append(x[1][0])
            test_lat_3.append(x[1][0])
            test_lat_4.append(x[1][0])
            test_lat_5.append(x[1][0])
                              
            test_long_2.append(x[1][1])
            test_long_3.append(x[1][1])
            test_long_4.append(x[1][1])
            test_long_5.append(x[1][1])
        
        if len(x) ==1:
            test_lat_2.append(x[0][0])
            test_lat_3.append(x[0][0])
            test_lat_4.append(x[0][0])
            test_lat_5.append(x[0][0])
                              
            test_long_2.append(x[0][1])
            test_long_3.append(x[0][1])
            test_long_4.append(x[0][1])
            test_long_5.append(x[0][1])
                                
    df["LAT_2"] = pd.DataFrame(test_lat_2)
    df["LAT_3"] = pd.DataFrame(test_lat_3)
    df["LAT_4"] = pd.DataFrame(test_lat_4)
    df["LAT_5"] = pd.DataFrame(test_lat_5)
                              
    df["LONG_2"] = pd.DataFrame(test_long_2)
    df["LONG_3"] = pd.DataFrame(test_long_3)
    df["LONG_4"] = pd.DataFrame(test_long_4)
    df["LONG_5"] = pd.DataFrame(test_long_5) 
    
    print("First LATs, LONGs extracted")
    
    test_lat_2_last = []
    test_lat_3_last = []
    test_lat_4_last = []
    test_lat_5_last = []
    
    test_long_2_last = []
    test_long_3_last = []
    test_long_4_last = []
    test_long_5_last = []

    for x in df["POLYLINE"]:
        if len(x) >= 5:
            test_lat_2_last.append(x[-1][0])
            test_lat_3_last.append(x[-2][0])
            test_lat_4_last.append(x[-3][0])
            test_lat_5_last.append(x[-4][0])
                                   
            test_long_2_last.append(x[-1][1])
            test_long_3_last.append(x[-2][1])
            test_long_4_last.append(x[-3][1])
            test_long_5_last.append(x[-4][1])
        
        if len(x) == 4:
            test_lat_2_last.append(x[-1][0])
            test_lat_3_last.append(x[-2][0])
            test_lat_4_last.append(x[-3][0])
            test_lat_5_last.append(x[-4][0])
                                   
            test_long_2_last.append(x[-1][1])
            test_long_3_last.append(x[-2][1])
            test_long_4_last.append(x[-3][1])
            test_long_5_last.append(x[-4][1])
                                
        if len(x) == 3:
            test_lat_2_last.append(x[-1][0])
            test_lat_3_last.append(x[-2][0])
            test_lat_4_last.append(x[-3][0])
            test_lat_5_last.append(x[-3][0])
                                   
            test_long_2_last.append(x[-1][1])
            test_long_3_last.append(x[-2][1])
            test_long_4_last.append(x[-3][1])
            test_long_5_last.append(x[-3][1])
        
        if len(x) == 2:
            test_lat_2_last.append(x[-1][0])
            test_lat_3_last.append(x[-2][0])
            test_lat_4_last.append(x[-2][0])
            test_lat_5_last.append(x[-2][0])
                                   
            test_long_2_last.append(x[-1][1])
            test_long_3_last.append(x[-2][1])
            test_long_4_last.append(x[-2][1])
            test_long_5_last.append(x[-2][1])
        
        if len(x) == 1:
            test_lat_2_last.append(x[-1][0])
            test_lat_3_last.append(x[-1][0])
            test_lat_4_last.append(x[-1][0])
            test_lat_5_last.append(x[-1][0])
                                   
            test_long_2_last.append(x[-1][1])
            test_long_3_last.append(x[-1][1])
            test_long_4_last.append(x[-1][1])
            test_long_5_last.append(x[-1][1])
                                   
                              
    df["LAT_2_last"] = pd.DataFrame(test_lat_2_last)
    df["LAT_3_last"] = pd.DataFrame(test_lat_3_last)
    df["LAT_4_last"] = pd.DataFrame(test_lat_4_last)
    df["LAT_5_last"] = pd.DataFrame(test_lat_5_last)
                                   
    df["LONG_2_last"] = pd.DataFrame(test_long_2_last)
    df["LONG_3_last"] = pd.DataFrame(test_long_3_last)
    df["LONG_4_last"] = pd.DataFrame(test_long_4_last)
    df["LONG_5_last"] = pd.DataFrame(test_long_5_last)
    
    print("Last LATs, LONGs extracted")
    print("Val/Test set extraction completed!")
    return df
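The per-length branches in both functions follow one pattern: take the k-th point from the start (or from the end) and, when the trip is too short, fall back to the nearest available point. A minimal sketch of that pattern as a clamped index is shown below. The names `clamped_points` and `extract_compact` are hypothetical helpers, not part of the notebook, and the sketch assumes every polyline has at least one point.

```python
import pandas as pd

def clamped_points(polyline, from_end, last_is_label):
    """Return the 4 points used for columns *_2 .. *_5, clamping short trips."""
    n = len(polyline)
    points = []
    for k in range(2, 6):
        if not from_end:
            idx = min(k - 1, n - 1)   # 2nd..5th point, clamped to the last one
        elif last_is_label:
            idx = -min(k, n)          # train: skip the final point (the y-label)
        else:
            idx = -min(k - 1, n)      # val/test: start at the final point
        points.append(polyline[idx])
    return points

def extract_compact(df, last_is_label):
    """Add LAT_k/LONG_k and LAT_k_last/LONG_k_last columns for k = 2..5."""
    df = df.copy()
    for from_end, suffix in [(False, ""), (True, "_last")]:
        pts = df["POLYLINE"].map(lambda p: clamped_points(p, from_end, last_is_label))
        for pos, k in enumerate(range(2, 6)):
            df["LAT_%d%s" % (k, suffix)] = pts.map(lambda q: q[pos][0])
            df["LONG_%d%s" % (k, suffix)] = pts.map(lambda q: q[pos][1])
    return df

# Toy trip with three points (placeholder values)
toy = pd.DataFrame({"POLYLINE": [[[10, 0], [11, 1], [12, 2]]]})
as_train = extract_compact(toy, last_is_label=True)
as_test = extract_compact(toy, last_is_label=False)

first_vals = [as_train["LAT_%d" % k][0] for k in range(2, 6)]
train_last = [as_train["LAT_%d_last" % k][0] for k in range(2, 6)]
test_last = [as_test["LAT_%d_last" % k][0] for k in range(2, 6)]
print(first_vals, train_last, test_last)
```

With `last_is_label=True` the sketch mirrors the index choices in `extract_train`, and with `last_is_label=False` those in `extract_test`, so the only difference between the two is whether the final point is skipped.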

In [11]:
train = extract_train(train)
val = extract_test(val)
test = extract_test(test)

First LATs, LONGs extracted
Last LATs, LONGs extracted
Train set extraction completed!
First LATs, LONGs extracted
Last LATs, LONGs extracted
Val/Test set extraction completed!
First LATs, LONGs extracted
Last LATs, LONGs extracted
Val/Test set extraction completed!


Removing "END_LAT" & "END_LONG" from the test set, as they are now represented by "LAT_2_last" & "LONG_2_last".

In [12]:
test.drop(['END_LAT','END_LONG'], axis = 1, inplace = True)

To ensure there is no leakage of data, the distance and duration must be recalculated for the val set.

In [13]:
val.drop(['DURATION','DISTANCE','DURATION_LOG','DISTANCE_LOG'], axis=1 ,inplace=True)  

def haversine(start_long, start_lat, end_long, end_lat):
    '''
    Using haversine formula to calculate distance in km
    between two lat,long coordinates.
    '''
    EARTH_RADIUS = 6371 #6,371 km is the approximate distance from Earth's center to its surface 
    start_long = np.radians(start_long)
    start_lat = np.radians(start_lat)
    end_long = np.radians(end_long)
    end_lat = np.radians(end_lat)

    dlong = end_long - start_long
    dlat = end_lat - start_lat

    a = (np.sin(dlat / 2)**2 + np.cos(start_lat) * np.cos(end_lat) *
         np.sin(dlong / 2)**2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    return np.nan_to_num(c * EARTH_RADIUS)

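#use the last point of the truncated polyline ("LAT_2_last"/"LONG_2_last" from extract_test);
#"END_LAT"/"END_LONG" hold the true destination and would leak the label into DISTANCE
val["DISTANCE"] = haversine(val['START_LONG'],val['START_LAT'], val['LONG_2_last'], val['LAT_2_last'])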


def duration(df):
    """    
    The time difference between each coordinate 
    in "POLYLINE" is 15s. We get the duration by
    taking the len of list * 15s - 15(subtract 15 as first 
    coordinate is start of trip)
    """
    return df["POLYLINE"].map(lambda x: len(x)*15 - 15)

val["DURATION"] = duration(val)

def replace_zero(df,col,value=1):
    '''
    Changes 0 to 1 as log 0 is undefined
    '''
    #return the replaced column; an inplace replace on df[col] may act on a copy in recent pandas
    return df[col].replace(to_replace = 0, value = value)

val["DURATION"] = replace_zero(val,"DURATION")
val["DISTANCE"] = replace_zero(val,"DISTANCE")

val['DURATION_LOG'] = np.log(val['DURATION'])
val['DISTANCE_LOG'] = np.log(val['DISTANCE'])
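A quick sanity check of the two formulas can be done on values whose answer is known in advance. In the sketch below, the haversine function is copied from the cell above, `duration_seconds` is a hypothetical stand-alone version of the duration rule, and the coordinates are made-up inputs. One degree of latitude along a meridian should come out as 6371 × π / 180 ≈ 111.19 km.

```python
import numpy as np

def haversine(start_long, start_lat, end_long, end_lat):
    # Great-circle distance in km between two (long, lat) points given in degrees
    EARTH_RADIUS = 6371
    start_long, start_lat = np.radians(start_long), np.radians(start_lat)
    end_long, end_lat = np.radians(end_long), np.radians(end_lat)
    dlong = end_long - start_long
    dlat = end_lat - start_lat
    a = (np.sin(dlat / 2)**2 + np.cos(start_lat) * np.cos(end_lat) *
         np.sin(dlong / 2)**2)
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    return np.nan_to_num(c * EARTH_RADIUS)

def duration_seconds(polyline):
    # One point every 15s; the first point marks the start of the trip
    return len(polyline) * 15 - 15

one_degree_km = float(haversine(0.0, 0.0, 0.0, 1.0))      # about 111.19 km
same_point_km = float(haversine(-8.6, 41.15, -8.6, 41.15))  # identical points

toy_polyline = [[-8.60, 41.15], [-8.61, 41.15], [-8.62, 41.16]]  # made-up trip
toy_duration = duration_seconds(toy_polyline)              # 3 points -> 30s

print(round(one_degree_km, 2), same_point_km, toy_duration)
```

A trip truncated to a single point has a duration of 0 and a distance of 0, which is why both columns are passed through `replace_zero` before taking the log.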

Pickling train, val and test sets for Part 4:
- [Capstone_Taxi_4.1-Tree_Based_Models](./Capstone_Taxi_4.1-Tree_Based_Models)
- [Capstone_Taxi_Model4.2_Artificial_Neural_Network](./Capstone_Taxi_Model4.2_Artificial_Neural_Network.ipynb)

In [18]:
train.to_pickle('./Pickles/train')
val.to_pickle('./Pickles/val')
test.to_pickle('./Pickles/test')