# Simple example for using a premade tf.estimator

In this notebook we create a couple of premade estimators and use them for a prediction problem.

Note: The contents of this notebook are inspired by [the GCP tutorials](https://github.com/GoogleCloudPlatform/training-data-analyst) and the [TensorFlow](https://www.tensorflow.org/) documentation. (Kudos to the authors!)

In [1]:
#First of all, some imports and general stuff (old C/C++ habits)
import tensorflow as tf
import numpy as np
import pandas as pd
import shutil
import os
from sodapy import Socrata  # pip install sodapy
import json
import datetime

OUTPUT_BASE_DIR = '../outputs/Simple-TF.estimator-example'
DATASET_DIR = '../datasets'

tf.logging.set_verbosity(tf.logging.INFO)

print("This notebook was tested with the Tensorflow version: {}".format(tf.__version__))

This notebook was tested with the Tensorflow version: 1.10.0


## Estimators
Let's start by quickly reviewing the concept of TensorFlow's Estimator API. The Estimator API provides methods to train ML models, to judge a model's accuracy, and to generate predictions. An Estimator is TensorFlow's high-level representation of a complete model. A premade Estimator is one of the common Estimators, a linear regressor for instance, that TensorFlow provides for out-of-the-box use.<br>
To write a TensorFlow program based on premade Estimators, one should:<br>
**(1)** Define the model's feature columns.<br>
**(2)** Create the input function(s).<br>
**(3)** Instantiate an Estimator, specifying the feature columns and various hyperparameters.<br>
**(4)** Use the Estimator (training, evaluation, prediction).<br>
This notebook covers these four points.

## The dataset

Before starting with the first step, let's quickly review the dataset used for this example: the [TLC taxi dataset](https://data.cityofnewyork.us/Transportation/TLC-Taxi-Data/gkne-dk5s). This dataset contains information about taxi rides in NYC, such as pickup location, number of passengers and fare amount.<br>
We are going to download this data directly from the official site. **Important:** Please download the dataset only ONCE. To do this, set the *downloadData* variable to True for a single execution of the following cell. This avoids downloading the same data multiple times, which would unnecessarily load the server. Note that the full dataset contains over 156M rows (as of Sept-18), but we download only the first 200K entries here.

In [2]:
downloadData = False

if downloadData:
    # Example authenticated client (needed for non-public datasets):
    # client = Socrata("data.cityofnewyork.us",
    #                  MyAppToken,
    #                  username="user@example.com",
    #                  password="AFakePassword")

    # Unauthenticated client only works with public data sets. Note 'None'
    # in place of application token, and no username or password:
    client = Socrata("data.cityofnewyork.us", None)
    
    results = client.get("gkne-dk5s", limit=200000) # returned as JSON from API/converted to Python list of dictionaries by sodapy
    ##results_df = pd.DataFrame.from_records(results) # if we were to use directly without saving a local copy
    with open(os.path.join(DATASET_DIR, 'taxi-temp.json'), 'w') as outfile:
        json.dump(results, outfile)
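
The "download only once" rule can also be enforced by checking whether the local copy already exists, instead of toggling a flag by hand. The following is a minimal sketch under that assumption; `load_or_fetch` and its `fetch` argument are hypothetical names introduced here for illustration, not part of sodapy.

```python
import json
import os


def load_or_fetch(path, fetch):
    """Return cached records from `path`, calling `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path) as infile:
            return json.load(infile)
    records = fetch()
    # Create the target directory if needed, then cache the records as JSON.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as outfile:
        json.dump(records, outfile)
    return records


# Possible use with the client from the cell above (not executed here):
# results = load_or_fetch(os.path.join(DATASET_DIR, 'taxi-temp.json'),
#                         lambda: client.get("gkne-dk5s", limit=200000))
```

With this helper the server is contacted only on the first run; later runs read the cached file.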

Let's load the data to see what it contains. A pretty straightforward way of reading data is to load it from the source, e.g., a .csv or .json file, into an in-memory representation, e.g., a pandas DataFrame. Keep in mind that sufficient memory is required, as the entire contents of the file are loaded at once; in this case it should be OK.

In [55]:
df_all = pd.read_json(os.path.join(DATASET_DIR, 'taxi-temp.json'))  # same path the download cell writes to

Let's see some of the data:

In [56]:
df_all.head()

Unnamed: 0,dropoff_datetime,dropoff_latitude,dropoff_longitude,fare_amount,imp_surcharge,mta_tax,passenger_count,payment_type,pickup_datetime,pickup_latitude,pickup_longitude,rate_code,store_and_fwd_flag,tip_amount,tolls_amount,total_amount,trip_distance,vendor_id
0,2014-02-26T17:34:00.000,40.767107,-73.982538,6.0,1.0,0.5,2,CSH,2014-02-26T17:28:00.000,40.757502,-73.973135,1,,0.0,0.0,7.5,1.07,VTS
1,2014-02-15T10:08:00.000,40.739422,-73.978727,6.5,0.0,0.5,2,CSH,2014-02-15T10:01:00.000,40.74735,-73.986125,1,,0.0,0.0,7.0,0.92,VTS
2,2014-12-06T05:03:10.000,40.648633,-73.781826,52.0,0.0,0.5,1,CRD,2014-12-06T04:42:16.000,40.754909,-73.971853,2,N,11.55,5.33,69.38,16.7,CMT
3,2014-01-23T01:17:00.000,40.673935,-73.964673,8.5,0.5,0.5,1,CSH,2014-01-23T01:08:00.000,40.683642,-73.977057,1,,0.0,0.0,9.5,1.85,VTS
4,2014-01-31T08:33:00.000,40.78992,-73.952352,17.0,0.0,0.5,1,CRD,2014-01-31T08:13:00.000,40.738062,-73.983817,1,,2.5,0.0,20.0,4.26,VTS


## The model: 
The ML model will be specified a little later. Since we are using premade estimators, we can choose among several available options, ideally selecting the best match for our problem: the simplest model capable of adequately fitting our dataset. Keep in mind that *simplicity is the ultimate sophistication*, or *the KISS principle*, depending on your preferences :)<br>

Two important parts of any ML model are the inputs and the outputs, which are directly related to the problem we are trying to solve. Let's now identify the outputs (i.e., the objective of our model) and the inputs (the features).
### Objective (outputs)
The objective of the Estimator that we will use in this example is to predict the fare amount of a cab ride (the *fare_amount* column). This already defines the type of problem we are trying to solve: a regression problem (the output is a single continuous value). The corresponding column becomes the objective (the label).
### Features (inputs)
All the other columns could be used directly as inputs of the model (features). However, we are not going to do this, because not all columns have an impact on the fare amount or are available when predicting it (e.g., the payment type). Moreover, the objective of this notebook is to show a simple way of implementing an estimator, not to produce a performant model. So let's drop some columns and then show some statistics of the dataset, which is always a good place to start for understanding the data.

In [57]:
df_all = df_all.drop(columns=['imp_surcharge', 'mta_tax', 'passenger_count', 
                                'payment_type', 'rate_code', 'store_and_fwd_flag', 'tip_amount', 
                                'tolls_amount', 'total_amount', 'trip_distance', 'vendor_id'])
col_names = list(df_all.columns.values)

label = 'fare_amount'
features = col_names
features.remove(label)

df_all.describe()

Unnamed: 0,dropoff_latitude,dropoff_longitude,fare_amount,pickup_latitude,pickup_longitude
count,200000.0,200000.0,200000.0,200000.0,200000.0
mean,39.932246,-72.490769,12.651451,39.937156,-72.50123
std,5.734806,10.374468,10.448343,5.716558,10.341616
min,-180.0,-180.0,0.0,-180.0,-180.0
25%,40.73334,-73.991407,6.5,40.734567,-73.99219
50%,40.752772,-73.97995,9.5,40.752372,-73.981875
75%,40.767867,-73.96298,14.5,40.766825,-73.966936
max,45.341914,0.0,358.21,45.341914,0.0


The describe method shows some statistics about the continuous variables, but what about the *dropoff_datetime* and *pickup_datetime* columns? These seem to hold datetime values, so let's output the types of the columns.

In [59]:
df_all.dtypes

dropoff_datetime      object
dropoff_latitude     float64
dropoff_longitude    float64
fare_amount          float64
pickup_datetime       object
pickup_latitude      float64
pickup_longitude     float64
dtype: object

These columns are classed as *object*: the timestamp strings were not parsed as datetimes when the file was loaded. Let's cast these columns into float values (seconds since the epoch).

In [60]:
df_all[['pickup_datetime','dropoff_datetime']] = df_all[['pickup_datetime','dropoff_datetime']].applymap(
            lambda x: datetime.datetime.strptime(x, "%Y-%m-%dT%H:%M:%S.%f").timestamp() )
print(df_all.dtypes)
df_all.describe()

dropoff_datetime     float64
dropoff_latitude     float64
dropoff_longitude    float64
fare_amount          float64
pickup_datetime      float64
pickup_latitude      float64
pickup_longitude     float64
dtype: object


Unnamed: 0,dropoff_datetime,dropoff_latitude,dropoff_longitude,fare_amount,pickup_datetime,pickup_latitude,pickup_longitude
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,1404001000.0,39.932246,-72.490769,12.651451,1404000000.0,39.937156,-72.50123
std,9068242.0,5.734806,10.374468,10.448343,9068208.0,5.716558,10.341616
min,1388532000.0,-180.0,-180.0,0.0,1388531000.0,-180.0,-180.0
25%,1396106000.0,40.73334,-73.991407,6.5,1396105000.0,40.734567,-73.99219
50%,1403608000.0,40.752772,-73.97995,9.5,1403607000.0,40.752372,-73.981875
75%,1411993000.0,40.767867,-73.96298,14.5,1411992000.0,40.766825,-73.966936
max,1420067000.0,45.341914,0.0,358.21,1420067000.0,45.341914,0.0
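
One caveat about the conversion above: `strptime` returns naive datetimes, and `.timestamp()` interprets a naive datetime in the local timezone of the machine, so the resulting floats can differ between computers. The sketch below pins the interpretation to UTC. Treating the strings as UTC is an assumption made here only for reproducibility (the source does not state the timezone of the records), and `to_epoch_seconds_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def to_epoch_seconds_utc(text):
    """Parse a timestamp string and return seconds since the epoch, read as UTC."""
    parsed = datetime.strptime(text, TIME_FORMAT)
    # Attach an explicit timezone so the result does not depend on the machine.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


# First row of df_all.head(): pickup and dropoff of the same ride.
pickup = to_epoch_seconds_utc("2014-02-26T17:28:00.000")
dropoff = to_epoch_seconds_utc("2014-02-26T17:34:00.000")
print(dropoff - pickup)  # ride duration in seconds: 360.0
```

Differences between two timestamps, such as the ride duration, are unaffected by the choice of timezone as long as both values are converted the same way.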


However, we can now see that the values in these transformed columns are huge, which may cause stability problems in our model. Besides, it is generally better to normalize all inputs (normalizing the outputs is not strictly necessary).

In [61]:
df_all[features] = (df_all[features] - df_all[features].mean() ) / df_all[features].std()
df_all.describe()

Unnamed: 0,dropoff_datetime,dropoff_latitude,dropoff_longitude,fare_amount,pickup_datetime,pickup_latitude,pickup_longitude
count,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0,200000.0
mean,-1.733351e-15,1.345292e-13,-1.37362e-13,12.651451,8.333068e-16,-7.267381e-14,1.308183e-13
std,1.0,1.0,1.0,10.448343,1.0,1.0,1.0
min,-1.705902,-38.35043,-10.36287,0.0,-1.705888,-38.4737,-10.39477
25%,-0.8706775,0.1396898,-0.1446472,6.5,-0.8706638,0.1394914,-0.144171
50%,-0.04335732,0.1430782,-0.1435428,9.5,-0.04336924,0.1426061,-0.1431735
75%,0.8813426,0.1457104,-0.1419071,14.5,0.8813003,0.1451343,-0.1417289
max,1.771663,0.9433045,6.98742,358.21,1.771737,0.9454566,7.010629
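
The normalization above is a z-score: subtract the mean and divide by the standard deviation. If the model is later used on new rows, the same mean and standard deviation have to be applied to them, so it can help to keep these statistics around. A minimal sketch in plain Python, assuming the sample standard deviation (the default of pandas' `.std()`); `fit_standardizer` and `standardize` are hypothetical helper names.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) using the sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def standardize(values, mean, std):
    """Apply a z-score transform with previously computed statistics."""
    return [(v - mean) / std for v in values]


# pickup_latitude values of the five rows shown by df_all.head()
latitudes = [40.757502, 40.74735, 40.754909, 40.683642, 40.738062]
mean, std = fit_standardizer(latitudes)
scaled = standardize(latitudes, mean, std)

# An unseen value is scaled with the stored statistics, not with its own.
new_scaled = standardize([40.75], mean, std)
```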


Now that the data is ready, the next step is to prepare the datasets for the different stages of a typical ML problem: training, validation and testing. For simplicity we are going for a 70/20/10 split.

In [62]:
# Split into train, eval and test
np.random.seed(seed=1984) #makes split reproducible
rands = np.random.rand(len(df_all))
df_train = df_all[rands < 0.7]
df_eval = df_all[ ( (rands >= 0.7) & (rands < 0.9) ) ]
df_test = df_all[rands >= 0.9]

print('Rows count: Train dataset -> {}, Eval dataset -> {}, Test dataset -> {}'.format(len(df_train), len(df_eval), len(df_test)) )

Rows count: Train dataset -> 140168, Eval dataset -> 39536, Test dataset -> 20296


### (1) Defining the feature columns
A feature column is an object describing how the model should use raw input data from the features dictionary. When you build an Estimator model, you pass it a list of feature columns that describes each of the features you want the model to use. The tf.feature_column module provides many options for representing data to the model. For this example a numeric_column will suffice, but keep in mind that there are other types of [feature columns](https://www.tensorflow.org/guide/feature_columns).

In [63]:
# This function creates the feature columns of the model
def make_feature_cols():
    input_columns = [tf.feature_column.numeric_column(k) for k in features] 
    return input_columns

### (2) Creating the input function(s)
Input functions supply data for training, evaluation, and prediction. More precisely, an input function is a function that returns a tf.data.Dataset object which outputs the following two-element tuple:
* Features - A Python dictionary in which:
   * Each key is the name of a feature.
   * Each value is an array containing all of that feature's values.
* Label - An array containing the values of the label for every example.
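
To make the shape of this tuple concrete, here is a plain-Python stand-in that yields `(features, labels)` batches from a dictionary of columns. It is only an illustration of the data layout, not how TensorFlow implements input functions; `iterate_batches` is a hypothetical name, and the toy values come from the first rows shown earlier.

```python
def iterate_batches(columns, label_name, batch_size, num_epochs):
    """Yield (features, labels): a dict of feature lists plus a list of labels."""
    num_rows = len(columns[label_name])
    for _ in range(num_epochs):
        for start in range(0, num_rows, batch_size):
            stop = start + batch_size
            # Every column except the label is a feature.
            features = {name: values[start:stop]
                        for name, values in columns.items() if name != label_name}
            labels = columns[label_name][start:stop]
            yield features, labels


toy = {
    "pickup_latitude": [40.757502, 40.74735, 40.754909],
    "pickup_longitude": [-73.973135, -73.986125, -73.971853],
    "fare_amount": [6.0, 6.5, 52.0],
}
batches = list(iterate_batches(toy, "fare_amount", batch_size=2, num_epochs=1))
first_features, first_labels = batches[0]
```

Each key of `first_features` is a feature name and each value holds that feature's values for the batch, while `first_labels` holds the matching fares.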

Note that the way in which the input function is written depends heavily on how our data is stored, its size and how we would like to read it. For this simple example we loaded the entire dataset using *pandas dataframes*, but keep in mind that there are other, more sophisticated ways of loading data, in particular for gargantuan amounts of it.<br>
It is also possible to define several different input functions, e.g., one for training, another for evaluation and a third one for prediction. In this example, we will use a single function with some parameters:

In [64]:
def make_input_fn(df, numEpochs, predictionMode=False):
    return tf.estimator.inputs.pandas_input_fn(
        x = df,
        y = None if predictionMode else df[label],
        batch_size = 128,#1 if predictionMode else 128,
        num_epochs = numEpochs,
        shuffle = True,
        queue_capacity = 1000,
        num_threads = 1
      )
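
The `batch_size` and `num_epochs` arguments together determine how many training steps a run takes. The sketch below assumes that rows from consecutive epochs are batched back to back, so that only the very last batch can be smaller; `expected_steps` is a hypothetical helper. For the 140168 training rows, 10 epochs and a batch size of 128, this gives 10951, which is consistent with the checkpoint name `model.ckpt-10951` that appears in the evaluation logs further down.

```python
import math


def expected_steps(num_rows, num_epochs, batch_size):
    """Number of batches when rows from consecutive epochs are batched back to back."""
    return math.ceil(num_rows * num_epochs / batch_size)


print(expected_steps(num_rows=140168, num_epochs=10, batch_size=128))  # 10951
```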

### (3) Instantiation of the Estimator
As stated before, the problem at hand is a regression problem. Fortunately, there are [several premade estimators](https://www.tensorflow.org/api_docs/python/tf/estimator) for regression problems. Our first Estimator will be the LinearRegressor. To create this particular estimator we only need to provide the feature columns; the [rest of the parameters](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor) are optional. We are also setting the output directory, where the model checkpoints and other data will be stored.

In [65]:
model = tf.estimator.LinearRegressor(
            feature_columns = make_feature_cols(), 
            model_dir = OUTPUT_BASE_DIR)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_session_config': None, '_keep_checkpoint_max': 5, '_evaluation_master': '', '_num_worker_replicas': 1, '_tf_random_seed': None, '_master': '', '_save_summary_steps': 100, '_is_chief': True, '_device_fn': None, '_train_distribute': None, '_task_type': 'worker', '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_save_checkpoints_secs': 600, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002D452B44208>, '_service': None, '_num_ps_replicas': 0, '_model_dir': '../outputs/Simple-TF.estimator-example', '_save_checkpoints_steps': None, '_global_id_in_cluster': 0, '_task_id': 0}


### (4) Using the Estimator
The most common operations performed by an estimator are:
* Training
* Evaluation
* Prediction

#### Training


In [66]:
shutil.rmtree(OUTPUT_BASE_DIR, ignore_errors = True) # start fresh each time
model.train(input_fn = make_input_fn(df_train, numEpochs = 10))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ../outputs/Simple-TF.estimator-example\model.ckpt.
INFO:tensorflow:loss = 38451.5, step = 1
INFO:tensorflow:global_step/sec: 233.273
INFO:tensorflow:loss = 26013.86, step = 101 (0.434 sec)
INFO:tensorflow:global_step/sec: 271.889
INFO:tensorflow:loss = 23387.113, step = 201 (0.365 sec)
INFO:tensorflow:global_step/sec: 276.009
INFO:tensorflow:loss = 17613.441, step = 301 (0.361 sec)
INFO:tensorflow:global_step/sec: 272.628
INFO:tensorflow:loss = 12952.629, step = 401 (0.369 sec)
INFO:tensorflow:global_step/sec: 279.083
INFO:tensorflow:loss = 14641.863, step = 501 (0.356 sec)
INFO:tensorflow:global_step/sec: 272.259
INFO:tensorflow:loss = 14640.592, step = 601 (0.368 sec)
INFO:tensorflow:global_step/sec: 279.8

INFO:tensorflow:loss = 12900.455, step = 8001 (0.358 sec)
INFO:tensorflow:global_step/sec: 289.154
INFO:tensorflow:loss = 8653.774, step = 8101 (0.347 sec)
INFO:tensorflow:global_step/sec: 262.624
INFO:tensorflow:loss = 32829.695, step = 8201 (0.379 sec)
INFO:tensorflow:global_step/sec: 285.037
INFO:tensorflow:loss = 14086.114, step = 8301 (0.351 sec)
INFO:tensorflow:global_step/sec: 299.079
INFO:tensorflow:loss = 13691.841, step = 8401 (0.334 sec)
INFO:tensorflow:global_step/sec: 277.539
INFO:tensorflow:loss = 24889.527, step = 8501 (0.361 sec)
INFO:tensorflow:global_step/sec: 284.23
INFO:tensorflow:loss = 9968.984, step = 8601 (0.352 sec)
INFO:tensorflow:global_step/sec: 270.055
INFO:tensorflow:loss = 9525.364, step = 8701 (0.370 sec)
INFO:tensorflow:global_step/sec: 301.779
INFO:tensorflow:loss = 7213.026, step = 8801 (0.331 sec)
INFO:tensorflow:global_step/sec: 287.907
INFO:tensorflow:loss = 10762.988, step = 8901 (0.347 sec)
INFO:tensorflow:global_step/sec: 269.331
INFO:tensorflow

<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x2d452a1f6a0>

#### Evaluating

In [68]:
def print_rmse(model, name, df):
    # Evaluate on the dataframe passed in (previously df_eval was hard-coded here)
    metrics = model.evaluate(input_fn = make_input_fn(df, numEpochs=1))
    print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))

print_rmse(model, 'validation', df_eval)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-09-06-16:21:07
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ../outputs/Simple-TF.estimator-example\model.ckpt-10951
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-09-06-16:21:08
INFO:tensorflow:Saving dict for global step 10951: average_loss = 108.52657, global_step = 10951, label/mean = 12.739655, loss = 13885.781, prediction/mean = 12.624673
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10951: ../outputs/Simple-TF.estimator-example\model.ckpt-10951
RMSE on validation dataset = 10.417609214782715
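
The RMSE printed above is the square root of the logged `average_loss`, which `print_rmse` treats as the mean squared error. A useful reference point is a model that always predicts the mean fare: its RMSE equals the population standard deviation of the labels. The sketch below illustrates both with plain Python; the `rmse` helper is a hypothetical name and the toy fares are the five values shown by `df_all.head()`.

```python
import math
import statistics


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The reported metric: square root of the logged average_loss.
reported_rmse = math.sqrt(108.52657)

# Baseline: predict the mean fare for every ride.
fares = [6.0, 6.5, 52.0, 8.5, 17.0]
baseline = rmse(fares, [statistics.mean(fares)] * len(fares))
```

A trained model is only useful if its RMSE is clearly below this kind of constant-prediction baseline.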


#### Predicting

Note that prediction would normally run on unlabelled data, which often comes in a different format (hence another input function would usually exist). For the sake of simplicity, we are going to reuse the validation data.

In [69]:
predictions = model.predict( input_fn = make_input_fn(df_eval, numEpochs=1, predictionMode=True) )
for i in range(5):
    print(next(predictions))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ../outputs/Simple-TF.estimator-example\model.ckpt-10951
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'predictions': array([12.518064], dtype=float32)}
{'predictions': array([12.089642], dtype=float32)}
{'predictions': array([12.102652], dtype=float32)}
{'predictions': array([12.807324], dtype=float32)}
{'predictions': array([12.625249], dtype=float32)}
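
The loop above calls `next` a fixed number of times, which raises `StopIteration` if fewer predictions are available. A slightly more defensive sketch uses `itertools.islice`, which simply stops when the generator is exhausted. `take_predictions` is a hypothetical helper, and `fake_predict` is a stand-in generator that only mimics the dictionaries printed above; it is not the estimator's own method.

```python
from itertools import islice


def take_predictions(prediction_iter, count):
    """Return up to `count` predicted values as plain floats."""
    return [float(item["predictions"][0]) for item in islice(prediction_iter, count)]


def fake_predict():
    # Hypothetical stand-in for model.predict(...), yielding dicts of the same shape.
    for value in [12.518064, 12.089642, 12.102652]:
        yield {"predictions": [value]}


print(take_predictions(fake_predict(), 5))  # only 3 items exist, so 3 are returned
```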


## Another model?
Reviewing the results, we can see that the performance of the model is not great; it is pretty bad, actually. The predictions above all lie between roughly 12.1 and 12.8, so the model essentially predicts the same amount for every trip. This explains why the RMSE (about 10.4) is nearly as large as the standard deviation of *fare_amount* (about 10.45). Would a more complex model help? Let's try using a deep neural network.<br>
The code to do this is quite straightforward as well.
In fact, we don't need to redo steps (1) and (2); we just define a new premade estimator and use it. Let's quickly define a DNNRegressor with 3 hidden layers containing 32, 8 and 2 neurons, respectively.

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTPUT_BASE_DIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTPUT_BASE_DIR)
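
To get a feel for how much larger this network is than the linear model, the trainable parameters can be counted by hand. The sketch assumes plain fully connected layers with one bias per unit, six numeric input features and a single output; `count_dense_parameters` is a hypothetical helper, and the estimator's internals may differ in detail.

```python
def count_dense_parameters(num_inputs, hidden_units, num_outputs=1):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [num_inputs] + list(hidden_units) + [num_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weights plus one bias per unit
    return total


print(count_dense_parameters(num_inputs=6, hidden_units=[32, 8, 2]))  # 509
print(count_dense_parameters(num_inputs=6, hidden_units=[]))          # 7 (linear model)
```

Under these assumptions the network has 224 + 264 + 18 + 3 = 509 parameters, compared with 7 for a linear model on the same six features.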

### TensorBoard
To open TensorBoard, copy and paste the following command into your terminal.

In [None]:
print("$ tensorboard --logdir {} --host=127.0.0.1".format(os.path.abspath(OUTPUT_BASE_DIR)) )

Launch the training, then click [here](http://localhost:6006).

In [None]:
model.train(input_fn = make_input_fn(df_train, numEpochs = 10));

In [None]:
print_rmse(model, 'validation', df_eval)  # df_valid was undefined; the validation split is df_eval

Well, the results didn't really improve. However, since the objective of this notebook is to show a simple guide to implementing a premade estimator, we will leave the production of better models for another notebook.