# Model training & evaluation

## Abstract

This notebook trains the model chosen for production and then evaluates it.

The chosen model is a wide & deep neural network regressor.
<br>For the wide part of the model, the following features are used:
- `TripStartYear`
- `TripStartMonth`
- `TripStartDay`
- `TripStartHour`
- `TripStartMinute`
- `month_day`: feature cross of `TripStartMonth` & `TripStartDay`
- `day_hour`: feature cross of `TripStartDay` & `TripStartHour`
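
To make the idea of a feature cross concrete, here is a minimal sketch in plain Python. It is not the training code: the cardinalities (12 months, 7 day values) and the helper names `cross_index` and `one_hot` are assumptions for illustration, and the notebook does not state how `TripStartDay` is encoded. TensorFlow's crossed columns typically hash the combination into a fixed number of buckets instead of enumerating every pair as done here.

```python
def cross_index(a, b, b_cardinality):
    """Map a pair of 0-based categorical indices to one crossed index."""
    if not 0 <= b < b_cardinality:
        raise ValueError("b is out of range")
    return a * b_cardinality + b


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


# Assumed cardinalities, for illustration only.
N_MONTHS, N_DAYS = 12, 7

month_idx, day_idx = 9, 5  # hypothetical 0-based values for one trip
idx = cross_index(month_idx, day_idx, N_DAYS)
month_day = one_hot(idx, N_MONTHS * N_DAYS)

print(idx, len(month_day))  # 68 84
```

Each (month, day) pair gets its own weight in the wide part, which lets the linear model learn an effect for a specific combination that the two features cannot express separately.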

As for the deep part of the model, the following features are used:
- `historical_tripDuration`
- `histOneWeek_tripDuration`
- `historical_tripDistance`
- `histOneWeek_tripDistance`
- `rawDistance`
- `pickup_census_tract` embedded
- `dropoff_census_tract` embedded

Once the model is trained, we evaluate it with [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) (TFMA).
<br>More specifically, TFMA runs the model on the test set for the final evaluation and provides a visual interface that exposes its predictive weaknesses.

## Training

In [6]:
%%bash

sh train.sh

jobId: chicago_taxi_ml_train_model_20191009_204849
state: QUEUED


Job [chicago_taxi_ml_train_model_20191009_204849] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ai-platform jobs describe chicago_taxi_ml_train_model_20191009_204849

or continue streaming the logs with the command

  $ gcloud ai-platform jobs stream-logs chicago_taxi_ml_train_model_20191009_204849


## Run TFMA

In [None]:
%%bash

sh model_analysis/tfma_model_dataflow.sh

_Running the above command from the notebook caused some crashes, so we ran it directly from a console instead._

TFMA needs notebook extensions in order to render its visualizations.
<br>To enable these extensions on AI Platform, switch from the standard JupyterLab interface to the classic Jupyter Notebook.
<br>To do so, go to `Help > Launch Classic Notebook`.

Furthermore, TFMA visuals cannot be saved, either in the notebook or in an HTML export.
<br>We need to re-run the cell every time we want to visualize the metrics.
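
Since the rendered widgets do not persist, one workaround is to save the underlying numbers. The sketch below is only a suggestion under assumptions: it supposes that the loaded result exposes a list of `(slice_key, metrics)` pairs, where `slice_key` is a tuple of `(column, value)` pairs and each metric is a dict with a `doubleValue` entry. The exact structure depends on the TFMA version, so check it before relying on this. The helper `metrics_to_rows` and the sample values are hypothetical.

```python
import json


def metrics_to_rows(slicing_metrics):
    """Flatten (slice_key, metrics) pairs into JSON-serialisable rows."""
    rows = []
    for slice_key, metrics in slicing_metrics:
        if slice_key:
            label = ", ".join("{}={}".format(c, v) for c, v in slice_key)
        else:
            label = "Overall"  # an empty slice key means no slicing
        row = {"slice": label}
        for name, value in metrics.items():
            row[name] = value.get("doubleValue")
        rows.append(row)
    return rows


# Placeholder data standing in for the real metrics (values are made up).
example = [
    ((), {"label/mean": {"doubleValue": 1.0}}),
    ((("TripStartMonth", 1),), {"label/mean": {"doubleValue": 2.0}}),
]

rows = metrics_to_rows(example)
serialized = json.dumps(rows, sort_keys=True)
print(serialized)
```

The resulting JSON string can be written to a file and reloaded later without re-running the TFMA cells.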

In [1]:
import tensorflow_model_analysis as tfma

print('TFMA version: {}'.format(tfma.version.VERSION_STRING))

  'You are using Apache Beam with Python 2. '


TFMA version: 0.14.0


In [2]:
train_result = tfma.load_eval_result(output_path='gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfma_model/train/')
eval_result = tfma.load_eval_result(output_path='gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfma_model/eval/')
test_result = tfma.load_eval_result(output_path='gs://szilard_aliz_sandbox/pierre_tasks/demo1/tfma_model/test/')

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`


### Train set

In [3]:
tfma.view.render_slicing_metrics(train_result)


### Eval set

In [4]:
tfma.view.render_slicing_metrics(eval_result)


### Test set

In [5]:
tfma.view.render_slicing_metrics(test_result)


In [3]:
tfma.view.render_slicing_metrics(test_result, slicing_column='TripStartMonth')


In [4]:
tfma.view.render_slicing_metrics(test_result, slicing_column='TripStartDay')


In conclusion, there is no noticeable discrepancy in model performance across the three data sets.

The model's performance in terms of __RMSE__ is:
- training: __2.529__
- evaluation: __2.538__
- test: __2.647__
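
As a quick check on these figures, the relative gap between the training RMSE and the other two can be computed directly. This is plain arithmetic on the numbers listed above; the helper name `relative_gap` is only illustrative.

```python
rmse = {"training": 2.529, "evaluation": 2.538, "test": 2.647}


def relative_gap(reference, value):
    """Relative increase of `value` over `reference`."""
    return (value - reference) / reference


gaps = {}
for name in ("evaluation", "test"):
    gaps[name] = relative_gap(rmse["training"], rmse[name])
    print("{} vs training: {:+.2%}".format(name, gaps[name]))

# evaluation vs training: +0.36%
# test vs training: +4.67%
```

A gap of a few percent between training and held-out data is small, which is consistent with the absence of strong overfitting.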

The model behaves well and shows no sign of overfitting.
<br>Furthermore, on the training, evaluation and test sets alike, performance is very stable across the slices of `TripStartMonth` & `TripStartDay` - except for __Saturday__, where the __RMSE__ rises from __~2.60__ to __3.073__.

Compared to the baseline model, the performance improvement on the holdout test set is about __8.5%__.
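
The notebook reports neither the baseline RMSE nor the exact definition behind the 8.5% figure. Assuming the improvement is the relative reduction in RMSE, `(baseline - model) / baseline`, the sketch below shows how the figure would be computed and what baseline RMSE it would imply. The implied value is a back-computation under that assumption, not a measured result, and both function names are hypothetical.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Relative RMSE reduction of the model over the baseline (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def implied_baseline(model_rmse, improvement):
    """Invert the definition above to recover the baseline RMSE."""
    return model_rmse / (1.0 - improvement)


test_rmse = 2.647    # test RMSE reported in this notebook
improvement = 0.085  # "about 8.5%" as stated above

baseline = implied_baseline(test_rmse, improvement)
print(round(baseline, 3))  # 2.893
```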