# Modeling Notebook

In this notebook, we train a regression model on the data produced by the `Data_Prep` notebook. We will try two different models:
- Simple Linear Regression using sklearn
- Simple Neural Network using keras

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense

## Modeling

### Load Train and Test Data
First we load the train and test data from the current directory. Then we split the training data into train/validation sets with an 80/20 split, respectively. We also scale the features so that every column is on a comparable range, which generally helps gradient-based models such as the neural network below. sklearn's `scale` function standardizes each column to zero mean and unit variance.

In [3]:
train_data = pd.read_csv('cleaned_train_data.csv')
test_data = pd.read_csv('formatted_test_data.csv')

In [4]:
X = train_data.drop(['key', 'fare_amount'], axis=1)
y = train_data['fare_amount']

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [24]:
X_val = scale(X_val)

In [13]:
X_scale = scale(X_train)
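
Note that `scale` standardizes each array using that array's own statistics, so the cells above standardize the training and validation sets with different means and standard deviations. A minimal sketch of one alternative, assuming sklearn's `StandardScaler` is acceptable here: fit the scaler on the training split only and reuse it for the validation (and later test) data. The toy arrays below are hypothetical stand-ins for `X_train` and `X_val`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, scale

# Hypothetical stand-ins for X_train and X_val (two feature columns).
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_val_demo = np.array([[2.0, 50.0], [6.0, 70.0]])

# Fit on the training split only, then reuse the same mean/std everywhere.
scaler = StandardScaler().fit(X_train_demo)
X_train_std = scaler.transform(X_train_demo)
X_val_std = scaler.transform(X_val_demo)

# On the training data this matches scale(); on the validation data it does
# not, because scale() would recompute the statistics from X_val_demo itself.
same_on_train = np.allclose(X_train_std, scale(X_train_demo))
same_on_val = np.allclose(X_val_std, scale(X_val_demo))
print(same_on_train, same_on_val)
```

Reusing one fitted scaler keeps a given feature value mapped to the same standardized value in every split, which is what the fitted model expects at prediction time.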

In [40]:
def predict_score(model, train_data, train_labels, val_data, val_labels, test_data):
    '''
    Predict outcomes for a given model.
    Print RMSE for both train and validation sets.
    Build the submission csv from the test data and save it to the current directory.
    '''
    # np.ravel flattens the (n, 1) arrays returned by keras models
    train_preds = np.ravel(model.predict(train_data))
    train_rmse = np.sqrt(mean_squared_error(train_labels, train_preds))
    val_preds = np.ravel(model.predict(val_data))
    val_rmse = np.sqrt(mean_squared_error(val_labels, val_preds))
    print(f'Train RMSE: {train_rmse}')
    print(f'Validation RMSE: {val_rmse}')

    # Note: scale() standardizes the test features with their own mean/std
    keys = test_data['key']
    X_test = scale(test_data.drop('key', axis=1))
    test_df = pd.DataFrame({'key': keys, 'fare_amount': np.ravel(model.predict(X_test))})
    test_df.to_csv('submission.csv', index=False)
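
To make the metric concrete: the RMSE reported by `predict_score()` is the square root of the mean squared difference between labels and predictions. A small worked example with made-up fares (the numbers are illustrative only and do not come from the dataset):

```python
import numpy as np

# Made-up fares and predictions, in dollars (illustrative only).
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_pred = np.array([11.0, 10.0, 8.0, 18.0])

# Differences are -1, 2, 0, -3 -> squared errors 1, 4, 0, 9 -> mean 3.5.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse:.3f}')
```

Because the errors are squared before averaging, the single 3-dollar miss contributes more to the score than the other three predictions combined.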

### Linear Model

Create a simple Linear Regression model to find out where our base estimate is at. We can do this using sklearn, along with the `predict_score()` function we created above.

In [39]:
lr_model = LinearRegression().fit(X_scale, y_train)
predict_score(lr_model, X_scale, y_train, X_val, y_val, test_data)

Train RMSE: 3.847982441777744
Validation RMSE: 3.842296488675296


NOTE: Although the RMSE looks good on the train and validation sets, the submission scored an RMSE of 5.498 on Kaggle. This could mean that the model is not generalizing well to new data, even though it performed well on the validation set. It could also be because the test data is cleaner than the validation set, so our model does not predict as accurately as we would want.

### Neural Network

We will now create a simple neural network (NN) to predict the fare amount. The competition suggested a NN as the model architecture, so we expect an improvement over the Linear Regression model. The network has 5 Dense layers: four hidden layers with 256, 128, 64, and 8 ReLU units, followed by 1 output node. It takes the 10 input features, with the input size set from `X.shape[1]`. We use mean squared error as the loss and metric, along with the Adam optimizer.
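
As a quick sanity check on the size of this architecture, the number of trainable parameters in a stack of fully connected layers can be computed by hand: each Dense layer has (inputs × units) weights plus one bias per unit. The sketch below assumes 10 input features (the real count is `X.shape[1]`) and uses the layer widths from the model cell; `dense_param_count` is a hypothetical helper, not part of keras.

```python
# Layer widths of the network: 256, 128, 64, 8 hidden units and 1 output unit.
layer_units = [256, 128, 64, 8, 1]
n_features = 10  # assumption: stands in for X.shape[1]


def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


total = 0
n_inputs = n_features
for units in layer_units:
    total += dense_param_count(n_inputs, units)
    n_inputs = units  # this layer's output feeds the next layer

print(total)
```

Under that assumption the network has 44,497 trainable parameters, most of them in the 256-to-128 layer.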

In [16]:
model = Sequential()
model.add(Dense(256, activation='relu', input_dim=X.shape[1]))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1))

model.compile(loss='mse', optimizer='adam', metrics=['mse'])
model.fit(X_scale, y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1a9019165c8>

In [37]:
predict_score(model, X_scale, y_train, X_val, y_val, test_data)

submission = pd.read_csv('submission.csv')
submission.head()

Train RMSE: 3.0538828973591214
Validation RMSE: 3.062841607273841


|   | key | fare_amount |
|---|---|---|
| 0 | 2015-01-27 13:08:24.0000002 | 10.224867 |
| 1 | 2015-01-27 13:08:24.0000003 | 11.021889 |
| 2 | 2011-10-08 11:53:44.0000002 | 5.21767 |
| 3 | 2012-12-01 21:12:12.0000002 | 8.711488 |
| 4 | 2012-12-01 21:12:12.0000003 | 14.700125 |


## Conclusion

Beginning with the Linear Regression model, our RMSE was around 5.498, which was on the lower end of the leaderboard. By using a simple neural network with 5 layers, we were able to lower the RMSE to around 3.91. This was a large improvement over the previous model and moved us up about 300 places on the leaderboard. There is still room for improvement (possibly by training longer than we did here), but overall we were able to achieve a desirable score.

By discovering trends in our data and engineering new features, such as the trip distance in kilometers and information extracted from the date/time, we were able to predict the fare amount expected for a NYC taxi rider within a reasonable range. Eliminating most outliers in the data was a major part of the model's success, although more cleaning and analysis could most likely be done.
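
The distance feature mentioned above was built in the `Data_Prep` notebook, which is not shown here. As an illustration only, one common way to get a trip distance in kilometers from pickup and dropoff coordinates is the haversine (great-circle) formula. The function name, arguments, and coordinates below are hypothetical and may differ from what `Data_Prep` actually does.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates: roughly Times Square to roughly JFK airport.
trip_km = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
print(round(trip_km, 1))
```

This gives a straight-line distance, so it will understate the distance actually driven on city streets.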