### Model Building

The client is launching a new ride sharing program in New York similar to Uber or Lyft. At the end of each trip they want their app to suggest a tip amount to the rider. The company has not acquired any of their own data yet, so they have tasked you with producing a model based off of the taxi data. This model should predict the likely tip amount for a trip based on the other trip attributes. You can assume that the ride sharing company can provide data that has the same attributes as the taxi data for each trip.

In building the model consider the following requirements:

- The model should be built from the taxi dataset. You can supplement the taxi data with external datasets, but this is not a requirement.
- Document your choice of model / algorithm, discussing why you chose it over alternatives.
    - Document how you assess your model's performance.
    - Discuss any limitations or caveats of the model which might be an issue in implementing it.
    - Discuss how you might improve your model going forward.
    - Discuss how you might turn this model into an API the company can use.

In [1]:
import numpy as np
import datetime as dt

In [2]:
import dask.dataframe as dd
df = dd.read_csv('data/data_new-vars.csv')

In [3]:
df.columns

Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'RatecodeID', 'store_and_fwd_flag',
       'PULocationID', 'DOLocationID', 'payment_type', 'fare_amount', 'extra',
       'mta_tax', 'tip_amount', 'tolls_amount', 'improvement_surcharge',
       'total_amount', 'PUdate', 'PUhour', 'DOdate', 'DOhour',
       'trip_duration'],
      dtype='object')

In [4]:
# Random 80/20 split of the Dask dataframe into train and test sets
train, test = df.random_split([0.80, 0.20], random_state=42)
y_train = train.tip_amount
y_test = test.tip_amount
# NOTE: total_amount already includes card tips (tip_amount), so keeping it as a
# feature leaks the target; it is also unknown at the moment a tip is suggested.
features = ['RatecodeID', 'trip_distance',
       'PULocationID', 'DOLocationID', 'fare_amount', 'extra', 'tolls_amount',
       'total_amount', 'PUhour']
X_train = train[features]
X_test = test[features]
X_test.dtypes

RatecodeID         int64
trip_distance    float64
PULocationID       int64
DOLocationID       int64
fare_amount      float64
extra            float64
tolls_amount     float64
total_amount     float64
PUhour             int64
dtype: object
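
One caveat about this feature list: `total_amount` is the sum of the individual charges plus the (card) tip, so a model that sees it can largely back out the target. The sketch below illustrates this on a single made-up trip. The amounts are hypothetical, and it assumes the total is exactly the sum of the charge columns listed in `df.columns`.

```python
# Hypothetical single trip; the amounts are made up for illustration.
trip = {
    "fare_amount": 12.5,
    "extra": 0.5,
    "mta_tax": 0.5,
    "tolls_amount": 0.0,
    "improvement_surcharge": 0.3,
    "tip_amount": 2.75,
    "total_amount": 16.55,
}

# Charge columns other than the tip (assumed to add up to total_amount - tip_amount).
NON_TIP_CHARGES = [
    "fare_amount", "extra", "mta_tax", "tolls_amount", "improvement_surcharge",
]


def implied_tip(row):
    """Recover the tip from total_amount and the other charges."""
    other_charges = sum(row[col] for col in NON_TIP_CHARGES)
    return round(row["total_amount"] - other_charges, 2)


print(implied_tip(trip), trip["tip_amount"])
```

If the two printed values match on real rows as well, dropping `total_amount` from the features would give a more honest estimate of how the model behaves when the tip is not yet known.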

In [5]:
from dask_ml.model_selection import GridSearchCV
from dask_ml.xgboost import XGBRegressor

from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)

In [6]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [7]:
pred = xgb.predict(X_test)

In [22]:
from dask_ml.metrics import mean_squared_error
# mean_squared_error returns the MSE; take its square root to get the RMSE
print('MSE for XGBoost: ', mean_squared_error(y_test, pred, multioutput='raw_values').compute())

MSE for XGBoost:  1.5988745755291442
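
Because `mean_squared_error` returns the mean squared error by default, the value printed above is in squared dollars, and the RMSE is its square root. The sketch below does that conversion and shows how a naive "always predict the mean tip" baseline could be scored for comparison. The helper `rmse_score` and the toy tips are hypothetical and not part of the notebook's data.

```python
import math

# Metric value printed by the evaluation cell (a mean squared error).
mse = 1.5988745755291442
rmse = math.sqrt(mse)  # roughly 1.26 dollars
print(f"RMSE for XGBoost: {rmse:.2f}")


def rmse_score(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical toy tips, only to show a baseline that always predicts the mean.
toy_tips = [0.0, 1.5, 2.0, 3.5, 5.0]
mean_tip = sum(toy_tips) / len(toy_tips)
baseline_rmse = rmse_score(toy_tips, [mean_tip] * len(toy_tips))
print(f"Baseline RMSE on toy data: {baseline_rmse:.2f}")
```

Scoring the same baseline on the real test set would show how much of the error the model actually removes.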


I chose an XGBoost model because it performs well as a regressor, needs no feature normalization, has built-in regularization that helps limit overfitting, and trains relatively fast. I didn't do any feature engineering, which is one of the reasons this model leaves room (much room) for improvement.

That said, neural networks (e.g. an LSTM) may perform better, but they are slower to train and the model engineering needs more work.

Next steps to improve this model:

- Do feature engineering.
- Move from sklearn to TensorFlow or Caffe to customize the model and increase accuracy.
- Move to a parallel model.
- Change the model if performance doesn't improve much, as mentioned above.
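
As an illustration of the feature engineering step, here is a minimal sketch that derives a few extra features from a single trip record. The derived feature names (`PUweekday`, `is_weekend`, `is_rush_hour`, `fare_per_mile`), the rush-hour window, and the timestamp format are assumptions made for the example, not something taken from the dataset documentation.

```python
from datetime import datetime

# Assumed rush-hour pickup hours; adjust after looking at the data.
RUSH_HOURS = {7, 8, 9, 16, 17, 18, 19}


def engineer_features(trip):
    """Derive simple calendar and ratio features from one trip record."""
    # Assumed timestamp format for tpep_pickup_datetime.
    pickup = datetime.strptime(trip["tpep_pickup_datetime"], "%Y-%m-%d %H:%M:%S")
    distance = trip["trip_distance"]
    return {
        "PUweekday": pickup.weekday(),             # 0 = Monday ... 6 = Sunday
        "is_weekend": int(pickup.weekday() >= 5),
        "is_rush_hour": int(pickup.hour in RUSH_HOURS),
        # Guard against zero-distance trips to avoid dividing by zero.
        "fare_per_mile": trip["fare_amount"] / distance if distance > 0 else 0.0,
    }


# Hypothetical trip record, made up for illustration.
example = {
    "tpep_pickup_datetime": "2017-06-03 18:15:00",
    "trip_distance": 2.5,
    "fare_amount": 12.5,
}
features = engineer_features(example)
print(features)
```

On the full dataset the same logic would be expressed as column operations on the Dask dataframe rather than row by row.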
    
To make an API, we would use a stream that computes predictions from the trained model as input arrives. Basically, each time data is uploaded, the model would predict and return those predictions. The model should be retrained on a schedule (daily, weekly, etc.) to maintain accuracy, depending on how often new data enters the system.
This workflow also needs a data cleaning and wrangling step so that the model can interpret the incoming data.
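
A minimal sketch of what the prediction step behind such an API could look like is shown below. Everything in it is an assumption for illustration: `suggest_tip`, `FEATURES`, and `FlatRateModel` are hypothetical names, the stand-in model simply returns 15% of the fare, and `total_amount` is left out of the features because it is not known before the rider tips. A real service would load the trained model and sit behind a web framework.

```python
# Features the service expects for each trip (hypothetical selection).
FEATURES = [
    "RatecodeID", "trip_distance", "PULocationID", "DOLocationID",
    "fare_amount", "extra", "tolls_amount", "PUhour",
]


class FlatRateModel:
    """Hypothetical stand-in for the trained model: predicts 15% of the fare."""

    def predict(self, rows):
        fare_index = FEATURES.index("fare_amount")
        return [0.15 * row[fare_index] for row in rows]


def suggest_tip(trip, model):
    """Validate and clean one trip record, then return a suggested tip in dollars."""
    missing = [name for name in FEATURES if name not in trip]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Cleaning step: coerce every feature to float in a fixed column order.
    row = [float(trip[name]) for name in FEATURES]
    prediction = model.predict([row])[0]
    # A suggested tip should never be negative.
    return round(max(0.0, prediction), 2)


# Hypothetical request payload, made up for illustration.
request = {
    "RatecodeID": 1, "trip_distance": 3.2, "PULocationID": 161,
    "DOLocationID": 237, "fare_amount": 20.0, "extra": 0.5,
    "tolls_amount": 0.0, "PUhour": 18,
}
print(suggest_tip(request, FlatRateModel()))
```

Keeping validation and cleaning inside the same function as the prediction call means the API and the training pipeline can share one definition of the expected input.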