# New York Cab Fare

### Preface

This notebook presents my solutions to the "New York City Taxi Fare" Kaggle competition hosted by Google in 2018. I used a variety of preprocessing techniques and several ML Methods including Random Forests, LightGBM, and Sequential Neural Nets. 

A fundamental difference between industry and competition became evident during this process. There was an issue with cab rides that started and ended in the same location. Presumably, people go somewhere in a cab, like a grocery store, and return to the same location. If predicting cab fare, it would make sense to input the destination, and double the amount, or sum two round trips. But in this dataset, the goal is to predict fare based on only the timestamp, and the starting and ending coordinates. Since these fares travel no distance according to the dataset, I initially eliminated them during wrangling.

To my surpise, these tricky fares remained in the Kaggle competition. So I had to place them back in the data and account for them by segmenting the data. I tried a few approaches like using the mean fare, and averaging it with other models, but they continued to cause problems.

Any realistic taxi fare prediction methods would eliminate these fares. Upon elimination, my RMSE was approximately 2.70. Without, it was approximately 3.25. I placed in the top third of the competition, though I could have performed better with a little more preprocessing, and a better cloud connection to utilize deep learning on all 50+ million rows.

The pipeline and tests presented below remain as they were on the last day of the competition.

# NYC_Pipeline_Tests

In [320]:
ROWS = 2875000
NODES = [100,100,50]

## Import Libraries

In [321]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from keras.models import model_from_json
from sklearn.externals import joblib
from keras.layers import Dropout
from keras.constraints import maxnorm
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import Ridge

pd.set_option('display.max_columns', 50)

## Data Preparation

In [322]:
def file_to_dataFrame(file_name, subset=True, nrows=ROWS):
    if subset:
        df = pd.read_csv(file_name, nrows=nrows, parse_dates=['pickup_datetime'])
    else:
        df = pd.read_csv(file_name, parse_dates=['pickup_datetime'])
    return df

In [323]:
def make_Xtest_ytest(df, split=False):
    y_test = df['key']
    y_test = pd.DataFrame(y_test)
    X_test = df.drop('key', axis=1)
    return X_test, y_test

## Clean Data

In [324]:
def clean_data(df):
    x = len(df)
    print('Length of df:', x)
    df = df.dropna(axis=0, subset=['dropoff_latitude'])
    df = df.drop('key', axis=1)
    y = len(df)
    print('NaN dropped:', x-y)
    return df

In [325]:
def lat_lon_US(df):
    x = len(df)
    # Choose cab rides whose pickup and dropoff are the US Mainland
    # Declare constants
    latmin = 5.496100
    latmax = 71.538800
    longmin = -124.482003
    longmax = -66.885417

    # Create dataframe with correct coordinates
    df = df[((((df['pickup_longitude']<=longmax) & (df['pickup_longitude']>=longmin)) & ((df['pickup_latitude']<=latmax) & (df['pickup_latitude']>=latmin)))) & ((((df['dropoff_longitude']<=longmax) & (df['dropoff_longitude']>=longmin)) & ((df['dropoff_latitude']<=latmax) & (df['dropoff_latitude']>=latmin))))]
    
    print('US Mainland Only dropped:', x-len(df))

    return df

In [326]:
def lat_lon_NYC(df):
    x = len(df)
    # Find cab rides whose pickup or dropoff are within NYC boundaries
    # Declare constants
    latmin = 40.477399
    latmax = 40.917577
    longmin = -74.259090
    longmax = -73.700272

    # Create dataframe with correct coordinates
    df = df[((((df['pickup_longitude']<=longmax) & (df['pickup_longitude']>=longmin)) & ((df['pickup_latitude']<=latmax) & (df['pickup_latitude']>=latmin)))) | ((((df['dropoff_longitude']<=longmax) % (df['dropoff_longitude']>=longmin)) & ((df['dropoff_latitude']<=latmax) & (df['dropoff_latitude']>=latmin))))]
    
    print('NYC Taxis Only dropped:', x-len(df))

    return df

In [327]:
def max_Riders(df, num=6):
    x = len(df)
    # Only choose cabs between 1 and num riders
    df = df[(df['passenger_count'] <= num) & (df['passenger_count'] > 0)]
    print('Max Passengers 6 dropped:',  x-len(df))
    return df

In [328]:
from geopy.distance import vincenty

def add_distance(df):

    # Define coordinates (x,y)
    y1 = df['pickup_latitude']
    x1 = df['pickup_longitude']
    y2 = df['dropoff_latitude']
    x2 = df['dropoff_longitude']
    
    #df['vincenty_distance'] = df.apply(lambda x: vincenty((x['pickup_longitude'], x['pickup_latitude']), (x['dropoff_longitude'], x['dropoff_latitude'])).miles, axis = 1)
    
    # Create Euclidean Distrance column
    df['euclidean_distance'] = np.sqrt((y2-y1)**2 + (x2-x1)**2)

    #Create Taxicab Distance column
    #df['taxicab_distance'] = np.abs(y2-y1) + np.abs(x2-x1)

    # Convert to miles
    df['euclidean_distance'] = df['euclidean_distance'] * 69
    #df['taxicab_distance'] = df['taxicab_distance'] * 69
    
    print('Distance Columns added...')
    
    return df

I tried Vincenty, Euclidean and Taxicab Distances. Interestingly, Euclidean gave the best results. (Note that Vincenty is considered most accurate.)

In [329]:
def min_Fare(df):
    # Eliminate unrealistic plots
    #df = df[df['fare_amount'] >= (df['vincenty_distance'] * 2 + 2.5)]
    df = df[df['fare_amount'] >= (df['euclidean_distance'] * 2 + 2.5)]

    print('Min fares dropped:', len(df))

    return df

The min fare rates are given here http://nymag.com/nymetro/urban/features/taxi/n_20286/. The 2.50 base charge is confirmed by histograms.

In [330]:
def max_Fare(df):
    #df = df[(df['fare_amount'] <= (df['vincenty_distance'] * 48 + 16)) | (df['fare_amount'] <= 56)]
    df = df[(df['fare_amount'] <= (df['euclidean_distance'] * 48 + 16)) | (df['fare_amount'] <= 56)]

    print('Max fares dropped:', len(df))
    return df

Some fares are way too high. The max_fare is my attempt at eliminating unrealistic fares. They would entail people sitting in taxis and going literally nowhere for an hour. (Data wrangling revealed unrealistic cab fares of tens of thousands of dollars. Although someone could keep one running for days, it's not helpful for data analysis.) 

In [331]:
def no_distance(df):
    # Elminate fares that traveled no distance
    #df = df[df['vincenty_distance']>0]
    df = df[df['euclidean_distance']>0]

    print('No distance dropped:', len(df))
    return df

In [332]:
def distance_cap(df, cap=75):
    df = df[df['vincenty_distance'] < cap]
    print('Distance cap dropped:', len(df))
    return df

In [333]:
def row_elimination(df):
    df = clean_data(df)
    df = lat_lon_US(df)
    df = lat_lon_NYC(df)
    df = max_Riders(df)
    df = add_distance(df)
    #df = min_Fare(df)
    #df = max_Fare(df)
    #df = no_distance(df)
    return df

Some preprocessing is commented out because of the no_distance problem in the kaggle dataset. I had to segment the data and apply these functions later.

## X_train, y_train Columns

In [334]:
def make_X_y(df, split=False):
    X = df.drop('fare_amount', axis=1)
    y = df['fare_amount'].copy()
    return X,y

## Garbage Removal

In [335]:
# Get rid of accumulated garbage
import gc
gc.collect()

284

AWESOME trick. 

## Add Attributes

### Time

In [336]:
def add_Time_units(df):
    
    df['month'] = df['pickup_datetime'].dt.month
    df['year'] = df['pickup_datetime'].dt.year
    df['hour'] = df['pickup_datetime'].dt.hour
    df['minute'] = df['pickup_datetime'].dt.minute
    df['second'] = df['pickup_datetime'].dt.second
    df['dayofweek'] = df['pickup_datetime'].dt.dayofweek
    
    from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
    dr = pd.date_range(start='2009-01-01', end='2015-12-31')
    cal = calendar()
    holidays = cal.holidays(start=dr.min(), end=dr.max())
    df['holiday'] = df['pickup_datetime'].dt.date.astype('datetime64').isin(holidays)
    
    df = df.drop('pickup_datetime', axis=1)

    df['total_seconds'] = 3600 * df['hour'] + 60 * df['minute'] + df['second']
        
    return df

In [337]:
def add_Time_columns(df):
    
    def morning_rush(row):
        if ((row['hour'] in [6,7,8,9]) & (row['dayofweek'] in [0,1,2,3,4])) & (not row['holiday']):
            return 1
        else:
            return 0

    df['morning_rush'] = df.apply(morning_rush, axis=1)

    def night_charge(row):
        if row['hour'] in [20,21,22,23,24,1,2,3,4,5,6]:
            return 1
        else:
            return 0

    df['night_charge'] = df.apply(night_charge, axis=1)

    def weekday_surcharge(row):
        if ((row['hour'] in [16,17,18,19,20]) & (row['dayofweek'] in [0,1,2,3,4])) & (not row['holiday']):
            return 1
        else:
            return 0

    df['weekday_surcharge'] = df.apply(weekday_surcharge, axis=1)
        
    return df

In [338]:
def add_Time(df):
    df = add_Time_units(df)
    df = add_Time_columns(df)
    return df

### Manhattan

In [339]:
# Define line from two points and a provided column
def two_points_line(a, b, column):
        
    # Case when y-values are the same
    if b[1]==a[1]:
        
        # Slope defaults to 0
        slope = 0
        
    # Case when x-values are the same
    elif b[0]==a[0]:
        
        # Case when max value is less than 999999999
        if column.max() < 999999999:
            
            # Add 999999999 to max value
            slope = column.max() + 999999999
        
        # All other cases
        else:
            
            # Multiply max value by itself (greater than 999999999)
            slope = column.max() * column.max()
    
    # When x-values and y-values are not 0
    else:
        
        # Use standard slope formula
        slope = (b[1] - a[1])/(b[0]-a[0])
    
    
    # Equation for y-intercept (solving y=mx+b for b)
    y_int = a[1] - slope * a[0]
    
    # Return slope and y-intercept
    return slope, y_int

In [340]:
def manhattan_cols(df):
    
    upper_right = (-73.929224, 40.804328)
    bottom_right = (-73.980036, 40.710706)
    bottom_left = (-74.054880, 40.681292)
    upper_left = (-73.966303, 40.830050)

    m_top, b_top = two_points_line(upper_right, upper_left, df.pickup_latitude)
    m_left, b_left = two_points_line(bottom_left, upper_left, df.pickup_latitude)
    m_right, b_right = two_points_line(bottom_right, upper_right, df.pickup_latitude)
    m_bottom, b_bottom = two_points_line(bottom_right, bottom_left, df.pickup_latitude)

    def manhattan_pickup(row):
        if (((row['pickup_latitude'] <= (row['pickup_longitude'] * m_top + b_top)) &
        (row['pickup_latitude'] >= (row['pickup_longitude'] * m_bottom + b_bottom))) &
        ((row['pickup_latitude'] >= (row['pickup_longitude'] * m_right + b_right)) &
        (row['pickup_latitude'] <= (row['pickup_longitude'] * m_left + b_left)))):
            return 1
        else:
            return 0
    
    df['manhattan_pickup'] = df.apply(manhattan_pickup, axis=1)
    
    
    def manhattan_dropoff(row):
        if (((row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_top + b_top)) &
        (row['dropoff_latitude'] >= (row['dropoff_longitude'] * m_bottom + b_bottom))) &
        ((row['dropoff_latitude'] >= (row['dropoff_longitude'] * m_right + b_right)) &
        (row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_left + b_left)))):
            return 1
        else:
            return 0
        
    df['manhattan_dropoff'] = df.apply(manhattan_dropoff, axis=1)
    
    
    def manhattan(row):
        if (row['manhattan_pickup']) & (row['manhattan_dropoff']):
            return 1
        else:
            return 0
    
    df['manhattan'] = df.apply(manhattan, axis=1)
    
    
    def manhattan_one_way(row):
        if (not row['manhattan']) & (row['manhattan_pickup']) | (row['manhattan_dropoff']):
            return 1
        else: 
            return 0

    df['manhattan_one_way'] = df.apply(manhattan_one_way, axis=1)
     
        
    return df

Not the best method, I eyeballed 4 points on a digital map and drew a quadrilateral around them to bound Manhattan. Since this method was relatively time-consuming, I only applied it to a couple other geographic areas with known surchages, trips to Newark and the JFK Airport.

In [341]:
def newark_cols(df):
    
    upper_right = (-74.107867, 40.718282)
    bottom_right = (-74.143665, 40.654673)
    bottom_left = (-74.250524, 40.698436)
    upper_left = (-74.171983, 40.792347)

    m_top, b_top = two_points_line(upper_right, upper_left, df.pickup_latitude)
    m_left, b_left = two_points_line(bottom_left, upper_left, df.pickup_latitude)
    m_right, b_right = two_points_line(bottom_right, upper_right, df.pickup_latitude)
    m_bottom, b_bottom = two_points_line(bottom_right, bottom_left, df.pickup_latitude)

    def newark(row):
        if (((row['pickup_latitude'] <= (row['pickup_longitude'] * m_top + b_top)) &
        (row['pickup_latitude'] >= (row['pickup_longitude'] * m_bottom + b_bottom))) &
        ((row['pickup_latitude'] >= (row['pickup_longitude'] * m_right + b_right)) &
        (row['pickup_latitude'] <= (row['pickup_longitude'] * m_left + b_left)))) | (((row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_top + b_top)) &
        (row['dropoff_latitude'] >= (row['dropoff_longitude'] * m_bottom + b_bottom))) &
        ((row['dropoff_latitude'] >= (row['dropoff_longitude'] * m_right + b_right)) &
        (row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_left + b_left)))):
            return 1
        else:
            return 0
        
    df['newark'] = df.apply(newark, axis=1)
    
    return df

In [342]:
def jkf_cols(df):
    
    upper_right = (-73.789700, 40.663781)
    bottom_right = (-73.762112, 40.633567)
    bottom_left = (-73.818920, 40.642250)
    upper_left = (-73.804656, 40.664858)

    m_top, b_top = two_points_line(upper_right, upper_left, df.pickup_latitude)
    m_left, b_left = two_points_line(bottom_left, upper_left, df.pickup_latitude)
    m_right, b_right = two_points_line(bottom_right, upper_right, df.pickup_latitude)
    m_bottom, b_bottom = two_points_line(bottom_right, bottom_left, df.pickup_latitude)

    def jfk(row):
        if (((row['pickup_latitude'] <= (row['pickup_longitude'] * m_top + b_top)) &
        (row['pickup_latitude'] >= (row['pickup_longitude'] * m_bottom + b_bottom))) &
        ((row['pickup_latitude'] <= (row['pickup_longitude'] * m_right + b_right)) &
        (row['pickup_latitude'] <= (row['pickup_longitude'] * m_left + b_left)))) | (((row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_top + b_top)) &
        (row['dropoff_latitude'] >= (row['dropoff_longitude'] * m_bottom + b_bottom))) &
        ((row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_right + b_right)) &
        (row['dropoff_latitude'] <= (row['dropoff_longitude'] * m_left + b_left)))):
            return 1
        else:
            return 0
        
    df['jfk'] = df.apply(jfk, axis=1)
            
    return df

In [343]:
def add_locations(df):
    df = manhattan_cols(df)
    df = jkf_cols(df)
    df = newark_cols(df)
    return df

In [344]:
def add_cols(df):
    df = add_Time(df)
    df = add_locations(df)
    return df

## Choose Columns

In [345]:
def choose_predictor_cols(df, no_dist=False): 
    if no_dist:
        cols=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'year', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'total_seconds', 'morning_rush', 'night_charge', 'weekday_surcharge', 'manhattan', 'manhattan_one_way', 'jfk', 'newark', 'passenger_count', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    else: 
        cols=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'year', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'total_seconds', 'morning_rush', 'night_charge', 'weekday_surcharge', 'manhattan', 'manhattan_one_way', 'jfk', 'newark', 'passenger_count','euclidean_distance', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
    X = df[cols]
    return X

## Scaler

In [346]:
def standard_scaler(X):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_df = pd.DataFrame(X_scaled, columns=X.columns)
    return X_df

## One Hot Encoder

In [347]:
def one_hot_cols(X):
    X = one_Hot_Encoder(X, X['month'])
    del X['month']
    X = one_Hot_Encoder(X, X['dayofweek'], month=False)
    del X['dayofweek']
    return X

In [348]:
def one_Hot_Encoder(X, col, month=True): 
    encoder = OneHotEncoder()
    hot_array = encoder.fit_transform(np.array(col).reshape(-1,1)).toarray()
    hot_df = pd.DataFrame(hot_array)
    if month:
        hot_df.columns = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    else:
        hot_df.columns = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
    new_df = X.join(hot_df)
    return new_df

## ML Tests

### Linear Regression

In [349]:
def linear_regression_split(X_train, y_train):
            
    y = y_train.median()
    mse = np.sum((y_train-y)**2)
    score = mse/len(y_train)
    rmse = np.sqrt(score)
    print('Lin reg train rmse:', rmse)
    print('Lin reg train mean:', rmse.mean())
    print('Lin reg train std:', rmse.std())
    
    return rmse

In [350]:
def linear_regression(X_train, y_train, distance_none=False, distance_high=False):
        
    print('Length of X:', len(X_train))
    lr_model = LinearRegression(fit_intercept=False)
    lr_model.fit(X_train, y_train)
    scores = cross_val_score(lr_model, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
    rmse = np.sqrt(-scores)
    print('Lin reg train rmse:', rmse)
    print('Lin reg train mean:', rmse.mean())
    print('Lin reg train std:', rmse.std())
    
    if distance_none:
        joblib.dump(lr_model, 'lr_distance_none_model.pkl')
        print('Linear Regression model saved as "lr_distance_none_model.pkl"')
    elif distance_high:
        joblib.dump(lr_model, 'lr_distance_high_model.pkl')
        print('Linear Regression model saved as "lr_distance_high_model.pkl"')
    else:
        joblib.dump(lr_model, 'lr_model.pkl') 
        print('Linear Regression model saved as "lr_model.pkl"')

    return lr_model

### Ridge

In [351]:
def ridge(X_train, y_train, distance_none=False, distance_high=False):
        
    print('Length of X:', len(X_train))
    ri_model = Ridge()
    ri_model.fit(X_train, y_train)
    scores = cross_val_score(ri_model, X_train, y_train, scoring='neg_mean_squared_error', cv=5)
    rmse = np.sqrt(-scores)
    print('Lin reg train rmse:', rmse)
    print('Lin reg train mean:', rmse.mean())
    print('Lin reg train std:', rmse.std())
    
    if distance_none:
        joblib.dump(ri_model, 'ri_distance_none_model.pkl')
        print('Linear Regression model saved as "ri_distance_none_model.pkl"')
    elif distance_high:
        joblib.dump(ri_model, 'ri_distance_high_model.pkl')
        print('Linear Regression model saved as "ri_distance_high_model.pkl"')
    else:
        joblib.dump(ri_model, 'ri_model.pkl') 
        print('Linear Regression model saved as "ri_model.pkl"')

    return ri_model

### Random Forests

In [352]:
def random_random_forest_tuner(X_train, y_train):
        
    param_grid = [
        {'max_features': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 
         'min_samples_leaf': [3, 5, 7, 9],
        'min_samples_split': [2, 5, 10],
        'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
        }, 
    ]
    
    forest_reg = RandomForestRegressor(n_jobs=-1)
    
    forest_reg_tuned = RandomizedSearchCV(forest_reg, param_grid, cv=3, n_iter=10, 
                                    scoring='neg_mean_squared_error')
    
    forest_reg_tuned.fit(X,y)
    
    # Print the tuned parameters and score
    print("Tuned Random Forest Parameters: {}".format(forest_reg_tuned.best_params_))
    
    return forest_reg_tuned

In [354]:
def display_scores(title, scores):
    rmse = np.sqrt(-scores)
    print(title, ' rmse scores:', rmse)
    print(title, ' mean score:', rmse.mean())
    print(title, ' std:', rmse.std())

In [396]:
def random_forest(X_train, y_train, distance_none=False, distance_high=False):
    
    rf_model = RandomForestRegressor(max_features=16, n_estimators=500, min_samples_leaf=3, min_samples_split=12, max_depth=None, n_jobs=-1)
    
    rf_model.fit(X_train, y_train)
    
    scores = cross_val_score(rf_model, X_train, y_train, scoring='neg_mean_squared_error', cv=2)
    
    display_scores('Random Forest', scores)
    
    if distance_none:
        joblib.dump(rf_model, 'rf_distance_none_500_model.pkl')
        print('Random Forest model saved as "rf_distance_none_500_model.pkl"')
    else:
        #joblib.dump(rf_model, 'rf_model.pkl') 
        #print('Random Forest model saved as "rf_model.pkl"')
        joblib.dump(rf_model, 'rf_sum_model.pkl') 
        print('Random Forest model saved as "rf_sum_model.pkl"')
        
    return rf_model

### Deep Learning (Sequential)

In [356]:
# keras_regression_test requires "from sklearn.model_selection import train_test_split"
def deep_learning(X_train, y_train, nodes=NODES, batch_size=32, activation='relu', optimizer='adam', loss='mean_squared_error', keras_distance_high=False, keras_distance_none=False):
        
    X, X_check, y, y_check = train_test_split(X_train, y_train, test_size=0.05)
    
    # Save the number of columns in predictors: n_cols
    n_cols = X.shape[1]

    # Set up the model: model
    model = Sequential()
    
    # Add the first layer
    model.add(Dense(nodes[0], activation=activation, input_shape=(n_cols,)))
    
    # Add addition layers
    for i in range(len(nodes)-1):
        model.add(Dense(nodes[i+1], activation=activation, kernel_constraint=maxnorm(3)))
        model.add(Dropout(0.2))

    # Add the output layer
    model.add(Dense(1))

    # Compile the model
    model.compile(optimizer=optimizer, loss=loss)

    # Define early_stopping_monitor
    early_stopping_monitor = EarlyStopping(patience=3)

    # Fit the model
    model.fit(X, y, validation_split=0.05, epochs=1000, batch_size=batch_size, callbacks=[early_stopping_monitor])

    # Get score for predictions
    score = model.evaluate(X_check, y_check)
    
    # Get root mean squared error
    rmse = np.sqrt(score)
    
    # Return root mean squared error
    print(rmse)
    
    save_keras_model(model, keras_distance_high=keras_distance_high, keras_distance_none=keras_distance_none)
    
    return model

## Reset Index

In [357]:
def reset_index(X):
    X = X.reset_index(drop=True)
    return X

## Pipeline

In [358]:
def data_frame_split(df):
    
    #df_distance_none = df[df['vincenty_distance']==0]
    df_distance_none = df[df['euclidean_distance']==0]

    print('New dataframe "df_distance_none" created with length:', len(df_distance_none))
    
    #df_distance_high = df[df['vincenty_distance']>30]
    df_distance_high = df[df['euclidean_distance']>30]

    print('New dataframe "df_distance_high" created with length:', len(df_distance_high))

    #df = df[df['vincenty_distance']>0]
    #df = df[df['vincenty_distance']<=30]
    
    df = df[df['euclidean_distance']>0]
    df = df[df['euclidean_distance']<=30]

    print('New length of original dataframe:', len(df))
    return df, df_distance_none, df_distance_high

I tried to split the data to improve results. If there were no Kaggle competition with unhelpful data, df_distance_none and df_distance_high would not exist. The test points that traveled far distances posed problems, presumably because they did not have to deal with New York traffic. I tried this idea late in the game, and had minimal improvements at best.

In [359]:
def df_pipeline(df, no_dist=False):
    df = reset_index(df)
    df = add_cols(df)
    df = one_hot_cols(df)
    return df

In [360]:
def X_pipeline(X, no_dist=False):
    X = choose_predictor_cols(X, no_dist=no_dist)
    X = standard_scaler(X)
    return X

In [361]:
def test_pipeline(test_set=False, max_scaler=True):
    df = file_to_dataFrame('test.csv')
    print('Length of test_df:)', len(df))
    df = add_distance(df)
    df = df_pipeline(df)
    df, df_distance_none, df_distance_high = data_frame_split(df)
    
    X_test, y_test = make_Xtest_ytest(df)
    X_test_distance_none, y_test_distance_none = make_Xtest_ytest(df_distance_none)
    X_test_distance_high, y_test_distance_high = make_Xtest_ytest(df_distance_high)
    
    X_test = X_pipeline(df)
    X_test_distance_none = X_pipeline(X_test_distance_none, no_dist=True)
    X_test_distance_high = X_pipeline(X_test_distance_high)
    
    return X_test, y_test, X_test_distance_none, y_test_distance_none, X_test_distance_high, y_test_distance_high

In [362]:
def pipeline():
    
    df = file_to_dataFrame('train.csv')
    df = row_elimination(df)
    df = df_pipeline(df)
    df, df_distance_none, df_distance_high = data_frame_split(df)
    df = min_Fare(df)
    df = max_Fare(df)
    
    X, y = make_X_y(df)
    X_distance_none, y_distance_none = make_X_y(df_distance_none)
    X_distance_high, y_distance_high = make_X_y(df_distance_high)
    
    X = X_pipeline(X)
    X_distance_none = X_pipeline(X_distance_none, no_dist=True)
    X_distance_high = X_pipeline(X_distance_high)
    
    return X, y, X_distance_none, y_distance_none, X_distance_high, y_distance_high

In [363]:
def save_keras_model(model, keras_distance_none=False, keras_distance_high=False):
    # serialize model to JSON
    model_json = model.to_json()
    
    if keras_distance_none:
        with open("dl_distance_none_model.json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        model.save_weights("dl_distance_none_model.h5")
        print("Saved deep learning model as 'dl_distance_none_model.json'")
    
    elif keras_distance_high:
        with open("dl_distance_high_model.json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        model.save_weights("dl_distance_high_model.h5")
        print("Saved deep learning model as 'dl_distance_high_model.json'")
    
    else:
        with open("dl_model.json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5
        model.save_weights("dl_model.h5")
        print("Saved deep learning model as 'dl_model.json'")
    return model
  
def open_keras_model(file, keras_distance_none=False, keras_distance_high=False):
    # load json and create model
    json_file = open(file, 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)
    # load weights into new model
    if keras_distance_none:
        loaded_model.load_weights("dl_distance_none_model.h5")
    elif keras_distance_high:
        loaded_model.load_weights("dl_distance_high_model.h5")
    else:
        loaded_model.load_weights("dl_model.h5")
    print("Loaded model from disk")
    return loaded_model

In [463]:
mean_fare = y.median()

In [466]:
def min_val(row):
    if row['fare_amount'] < 2.5:
        return 2.5
    else:
        return row['fare_amount']

def open_model(saved_model, keras=False, keras_distance_none=False, keras_distance_high=False):
    if keras:
        model = open_keras_model(saved_model)
    elif keras_distance_none:
        model = open_keras_model(saved_model, keras_distance_none=keras_distance_none)
    elif keras_distance_high:
        model = open_keras_model(saved_model, keras_distance_high=keras_distance_high)
    else:
        model = joblib.load(saved_model)
    return model

def kaggle_submit(y_test, saved_model, saved_model_distance_none, saved_model_distance_high, keras=False, keras_distance_none=False, keras_distance_high=False):
    saved_model = open_model(saved_model, keras=keras)
    saved_model_distance_none = open_model(saved_model_distance_none, keras_distance_none=keras_distance_none)
    saved_model_distance_high = open_model(saved_model_distance_high, keras_distance_high=keras_distance_high)
        
    y_test['fare_amount'] = saved_model.predict(X_test)
    
    y_est = 0.775*(saved_model_distance_none.predict(X_test_distance_none))
    y_mean = 0.225*(mean_fare)    
    y_test_distance_none['fare_amount'] = y_est + y_mean
    
    y_test_distance_high['fare_amount'] = saved_model_distance_high.predict(X_test_distance_high)
    
    y_test = pd.concat([y_test,y_test_distance_none,y_test_distance_high])
        
    y_test['fare_amount'] = y_test.apply(min_val, axis=1)
    
    y_test.to_csv('my_submission.csv', index=False)
    return y_test

In [365]:
X, y, X_distance_none, y_distance_none, X_distance_high, y_distance_high = pipeline()

Length of df: 2875000
NaN dropped: 23
US Mainland Only dropped: 59423
NYC Taxis Only dropped: 3079
Max Passengers 6 dropped: 9912
Distance Columns added...
New dataframe "df_distance_none" created with length: 29021
New dataframe "df_distance_high" created with length: 509
New length of original dataframe: 2773033
Min fares dropped: 2692680
Max fares dropped: 2691545


In [366]:
X_test, y_test, X_test_distance_none, y_test_distance_none, X_test_distance_high, y_test_distance_high = test_pipeline()

Length of test_df:) 9914
Distance Columns added...
New dataframe "df_distance_none" created with length: 85
New dataframe "df_distance_high" created with length: 3
New length of original dataframe: 9826


## Tests

### LR Test

In [383]:
X_sum = pd.concat([X, X_distance_high])

In [384]:
y_sum = pd.concat([y, y_distance_high])

In [439]:
ridge(X_sum, y_sum)

Length of X: 2692054
Lin reg train rmse: [3.62689569 3.63029866 3.64863394 3.5993213  4.65208551]
Lin reg train mean: 3.831447019811648
Lin reg train std: 0.4106220697799635
Linear Regression model saved as "ri_model.pkl"


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [443]:
ridge(X_distance_none, y_distance_none, distance_none=True)

Length of X: 29021
Lin reg train rmse: [13.74402807 12.12454211 12.15907035 11.3554581  12.41577521]
Lin reg train mean: 12.359774768520234
Lin reg train std: 0.7776325483568045
Linear Regression model saved as "ri_distance_none_model.pkl"


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [369]:
ridge(X_distance_high, y_distance_high, distance_high=True)

Length of X: 509
Lin reg train rmse: [51.05656482 45.74069735 50.12550284 67.8563549  53.73753999]
Lin reg train mean: 53.703331978952406
Lin reg train std: 7.530192773741098
Linear Regression model saved as "ri_distance_high_model.pkl"


Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

### keras Tests

In [370]:
dl_model_distance = deep_learning(X_distance_none, y_distance_none, keras_distance_none=True)

Train on 26190 samples, validate on 1379 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
11.161159377587223
Saved deep learning model as 'dl_distance_none_model.json'


In [371]:
dl_model_distance = deep_learning(X_distance_high, y_distance_high, keras_distance_high=True)

Train on 458 samples, validate on 25 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
38.46130646151483
Saved deep learning model as 'dl_distance_high_model.json'


In [372]:
dl_model = deep_learning(X, y)

Train on 2429118 samples, validate on 127849 samples
Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
3.0603275437869937
Saved deep learning model as 'dl_model.json'


In [467]:
kaggle_submit(y_test, 'rf_model.pkl', 'dl_distance_none_model.json', 'lg_model_2.pkl', keras=False, keras_distance_none=True, keras_distance_high=False)

Loaded model from disk


Unnamed: 0,key,fare_amount
0,2015-01-27 13:08:24.0000002,10.025081
1,2015-01-27 13:08:24.0000003,10.261865
2,2011-10-08 11:53:44.0000002,4.292730
3,2012-12-01 21:12:12.0000002,9.042953
4,2012-12-01 21:12:12.0000003,16.544141
5,2012-12-01 21:12:12.0000005,10.167219
6,2011-10-06 12:10:20.0000001,5.374145
7,2011-10-06 12:10:20.0000003,48.807804
8,2011-10-06 12:10:20.0000002,11.000858
9,2014-02-18 15:22:20.0000002,6.555713


In [422]:
X_sub = X[0:150000]
y_sub = y[0:150000]

## Add LightGBM

In [434]:
import lightgbm as lgb

def lightgbm(X, y, distance_none=False):
    
    X_b, X_test_b, y_b, y_test_b = train_test_split(X,y)
    
    d_train = lgb.Dataset(X_b, label=y_b)
    params = {}
    params['learning_rate'] = 0.03
    params['boosting_type'] = 'gbdt'
    params['objective'] = 'regression'
    params['metric'] = 'rmse'
    params['sub_feature'] = 0.8
    params['num_leaves'] = 31
    params['min_data'] = 18
    params['max_depth'] = -1
    #params['early_stopping'] = 500
    params['subsample_for_bins'] = 200
    params['subsample'] = 1,
    params['subsample_freq'] = 1
    params['reg_alpha'] = 5
    params['reg_lambda'] = 10
    params['min_split_gain' ]=0.5
    params['min_child_weight']=1
    params['min_child_samples']= 10
    params['scale_pos_weight']=1
    params['num_threads']=4
    params['eval_freq']=50
    params['colsample_bytree']=0.6
    params['min_data']= 18
            
    lg_model = lgb.train(params, d_train, 27500)

    y_pred = lg_model.predict(X_test_b)
    rms = np.sqrt(mean_squared_error(y_test_b, y_pred))
    print(rms)
    
    if distance_none:
        joblib.dump(lg_model, 'lg_distance_none_model_2.pkl')
        print('LightGBM model saved as "lg_distance_none_model.pkl"')
    else:
        joblib.dump(lg_model, 'lg_model_2.pkl') 
        print('LightGBM model saved as "lg_model_2.pkl"')
        
    return lg_model

In [435]:
lightgbm(X, y)

2.7358241871912417
Linear Regression model saved as "lg_model_2.pkl"


<lightgbm.basic.Booster at 0x1a3fa64e48>

In [406]:
lightgbm(X_distance_none, y_distance_none, distance_none=False)

12.926970467670571
Linear Regression model saved as "lg_model.pkl"


<lightgbm.basic.Booster at 0x106e63358>

In [448]:
lightgbm(X_sum, y_sum)

2.8254572341080313
Linear Regression model saved as "lg_model_2.pkl"


<lightgbm.basic.Booster at 0x1a75e33710>

### RF Test

In [374]:
random_forest(X_distance_none, y_distance_none, distance_none=True)

Random Forest  rmse scores: [12.02188153 11.36308207]
Random Forest  mean score: 11.692481801368658
Random Forest  std: 0.32939972786696803
Random Forest model saved as "rf_distance_none_model.pkl"


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=16, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=3,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=375, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [375]:
random_forest(X_distance_high, y_distance_high, distance_high=True)

Random Forest  rmse scores: [37.25236114 50.27343311]
Random Forest  mean score: 43.76289712268644
Random Forest  std: 6.510535986444008
Random Forest model saved as "rf_model.pkl"


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=16, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=3,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=375, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [376]:
random_forest(X, y)

Random Forest  rmse scores: [2.82355583 2.80782918]
Random Forest  mean score: 2.8156925063124
Random Forest  std: 0.007863325819166267
Random Forest model saved as "rf_model.pkl"


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=16, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=3,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=375, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [397]:
random_forest(X_sum, y_sum)

Random Forest  rmse scores: [2.82402096 3.35434872]
Random Forest  mean score: 3.089184840584704
Random Forest  std: 0.2651638777896661
Random Forest model saved as "rf_sum_model.pkl"


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=16, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=3,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [398]:
random_forest(X_distance_none, y_distance_none, distance_none=True)

Random Forest  rmse scores: [12.01854581 11.35106257]
Random Forest  mean score: 11.684804191997772
Random Forest  std: 0.3337416186447584
Random Forest model saved as "rf_distance_none_500_model.pkl"


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=16, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=3,
           min_samples_split=12, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=-1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

## Conclusion

I would have liked more time in the competition to run hyperparamater tests with LightGBM, and to add more features. LightGBM and Random Forests delivered the best overall scores, while Ridge did very well with the high distance dataset and Deep Learning did quite well overall. I still want to run tests on all the data, and resample the distance_high and distance_none phenomenon to obtain better results. Although nowhere near commercial, approaching an RMSE of 2.50 is not bad. New Yorkers, who would likely have a lot more detail as what can be feature engineered, and more time and rides to run through deep learning tests would likely produce better resuls.