### Model Building

The client is launching a new ride-sharing program in New York, similar to Uber or Lyft. At the end of each trip they want their app to suggest a tip amount to the rider. The company has not acquired any data of its own yet, so they have tasked you with producing a model based on the taxi data. This model should predict the likely tip amount for a trip from the other trip attributes. You can assume that the ride-sharing company can provide data with the same attributes as the taxi data for each trip.

In building the model consider the following requirements:

- The model should be built from the taxi dataset. You can supplement the taxi data with external datasets, but this is not a requirement.
- Document your choice of model / algorithm, discussing why you chose it over alternatives.
    - Document how you assess your model's performance.
    - Discuss any limitations or caveats of the model which might be an issue in implementing it.
    - Discuss how you might improve your model going forward.
    - Discuss how you might turn this model into an API the company can use.

In [3]:
import numpy as np
import pandas as pd
import datetime as dt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

In [4]:
df = pd.read_csv('data/data_new-vars.csv')

In [5]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'PUdate', 'PUhour', 'DOdate', 'DOhour',
       'trip_duration'],
      dtype='object')

In [6]:
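# NOTE: total_amount already includes tip_amount, so keeping it as a feature
# leaks the target; drop it before relying on the reported error.
X = df[['RatecodeID', 'trip_distance',
        'PULocationID', 'DOLocationID', 'fare_amount', 'extra', 'tolls_amount',
        'total_amount', 'PUhour']]
y = df.tip_amount
# Hold out 33% of the trips for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)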

In [7]:
X_train.head()

| | RatecodeID | trip_distance | PULocationID | DOLocationID | fare_amount | extra | tolls_amount | total_amount | PUhour |
|---|---|---|---|---|---|---|---|---|---|
| 10285523 | 1 | 1.12 | 164 | 234 | 7.0 | 0.5 | 0.0 | 9.96 | 23 |
| 24081110 | 1 | 5.3 | 88 | 162 | 17.5 | 1.0 | 0.0 | 23.15 | 17 |
| 28986573 | 1 | 0.6 | 239 | 238 | 4.5 | 0.0 | 0.0 | 5.3 | 10 |
| 7993506 | 5 | 0.0 | 264 | 132 | 63.0 | 0.0 | 0.0 | 76.56 | 19 |
| 11080193 | 1 | 7.43 | 13 | 140 | 29.5 | 0.0 | 0.0 | 36.36 | 12 |


In [8]:
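rfc = RandomForestRegressor(random_state=42)
# scikit-learn >= 1.0 names: 'squared_error' / 'absolute_error' replace
# 'mse' / 'mae', and max_features=1.0 (all features) replaces 'auto'.
param_grid = {
    'n_estimators': [10, 20, 30],
    'max_features': [1.0, 'sqrt', 'log2'],
    'min_samples_split': [2, 4, 8],
    'bootstrap': [True, False],
    'criterion': ['squared_error', 'absolute_error'],
}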

In [None]:
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
CV_rfc.fit(X_train, y_train)

In [None]:
CV_rfc.best_params_

In [11]:
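# Refit with the selected parameters; 'absolute_error' and max_features=1.0
# are the scikit-learn >= 1.0 names for the former 'mae' and 'auto'.
rfc_params = RandomForestRegressor(bootstrap=True,
                                   criterion='absolute_error',
                                   max_features=1.0,
                                   min_samples_split=2,
                                   n_estimators=20,
                                   random_state=42)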

In [None]:
rfc_params.fit(X_train, y_train)

In [None]:
pred = rfc_params.predict(X_test)

In [None]:
# mean_squared_error returns the MSE; take the square root to report RMSE.
print("RMSE for Random Forest on test data:", np.sqrt(mean_squared_error(y_test, pred)))
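An RMSE is easier to judge next to a naive reference. The sketch below, which is not part of the original notebook, compares against always predicting the mean training tip. The helper names `rmse` and `baseline_rmse` are hypothetical, and the toy numbers are illustrative only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def baseline_rmse(y_train, y_test):
    """RMSE obtained by always predicting the mean tip seen in training."""
    mean_tip = float(np.mean(y_train))
    return rmse(y_test, np.full(len(y_test), mean_tip))


# Toy numbers for illustration only (not taken from the taxi data).
toy_train = [0.0, 2.0, 4.0]
toy_test = [1.0, 3.0]
print(baseline_rmse(toy_train, toy_test))  # 1.0
```

In the notebook this would be called as `baseline_rmse(y_train, y_test)` and compared with `rmse(y_test, pred)`; the model is only useful if its RMSE is clearly below the baseline.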

I chose a Random Forest model because it performs well as a regressor with little tuning, needs no feature normalization, is less prone to overfitting than a single decision tree, and trains relatively fast. Combined with a grid search over its hyperparameters, it yields a good regression model without requiring thorough knowledge of the data.

That said, XGBoost or neural networks often perform better, but they are slower to train and require more model engineering. Random Forests are also hard to interpret and usually perform worse than XGBoost on larger datasets.

Next steps to improve this model:

- Do feature engineering.
- Move from sklearn to TensorFlow or Caffe to customize the model and increase accuracy.
- Switch to a different model, such as XGBoost, if the Random Forest performs worse, as noted above.
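As one possible starting point for the feature engineering step, the sketch below derives a few extra columns from attributes already present in the dataset. The function name `add_trip_features`, the new column names, and the assumption that `trip_duration` is measured in minutes are all hypothetical; the toy rows are made up for illustration.

```python
import pandas as pd


def add_trip_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `trips` with a few derived columns added."""
    out = trips.copy()
    pickup = pd.to_datetime(out["tpep_pickup_datetime"])
    out["PUweekday"] = pickup.dt.dayofweek            # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["PUweekday"] >= 5).astype(int)
    # Assumption: trip_duration is in minutes; adjust the divisor if not.
    hours = out["trip_duration"] / 60.0
    # Trips with a non-positive duration get a speed of 0.0 instead of inf/NaN.
    out["avg_speed_mph"] = (out["trip_distance"] / hours).where(hours > 0, 0.0)
    return out


# Toy rows for illustration only.
toy = pd.DataFrame({
    "tpep_pickup_datetime": ["2017-06-03 23:10:00", "2017-06-05 08:30:00"],
    "trip_distance": [1.12, 5.3],
    "trip_duration": [6.0, 0.0],
})
print(add_trip_features(toy)[["PUweekday", "is_weekend", "avg_speed_mph"]])
```

Whether such columns help should be checked with the same train/test split used above rather than assumed.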
    
To make an API, we would expose the trained model behind an endpoint that consumes trip data, ideally as a stream. Each time trip data is uploaded, the model would predict a tip and return that prediction. The model should be retrained periodically (daily, weekly, etc.) to maintain accuracy, depending on how often new data enters the system.
This workflow also needs a data cleaning and wrangling step so that incoming data arrives in the form the model expects.
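The request handling described above could look roughly like the framework-agnostic sketch below. Everything here is an assumption for illustration: the `handle_request` function, the JSON field names, the `suggested_tip` response key, and the `ConstantTipModel` stand-in, which only mimics the `predict` method of the fitted regressor. `total_amount` is left out of the inputs because it includes the tip, which is not known when the suggestion is made.

```python
import json

# Hypothetical feature order; it must match the columns used for training.
FEATURES = ["RatecodeID", "trip_distance", "PULocationID", "DOLocationID",
            "fare_amount", "extra", "tolls_amount", "PUhour"]


class ConstantTipModel:
    """Hypothetical stand-in for a fitted model exposing .predict(rows)."""

    def predict(self, rows):
        return [2.0 for _ in rows]


def handle_request(model, body: str) -> str:
    """Validate a JSON trip payload and return a JSON tip suggestion."""
    try:
        trip = json.loads(body)
    except json.JSONDecodeError:
        return json.dumps({"error": "body is not valid JSON"})
    if not isinstance(trip, dict):
        return json.dumps({"error": "body must be a JSON object"})
    missing = [name for name in FEATURES if name not in trip]
    if missing:
        return json.dumps({"error": "missing fields", "fields": missing})
    try:
        row = [float(trip[name]) for name in FEATURES]
    except (TypeError, ValueError):
        return json.dumps({"error": "all fields must be numeric"})
    tip = float(model.predict([row])[0])
    # A suggested tip should never be negative.
    return json.dumps({"suggested_tip": round(max(tip, 0.0), 2)})


example = {"RatecodeID": 1, "trip_distance": 1.12, "PULocationID": 164,
           "DOLocationID": 234, "fare_amount": 7.0, "extra": 0.5,
           "tolls_amount": 0.0, "PUhour": 23}
print(handle_request(ConstantTipModel(), json.dumps(example)))
```

A web framework would wrap `handle_request` in a POST route, and the stand-in would be replaced by the trained model loaded once at startup.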