<h1>2b. Machine Learning using tf.estimator </h1>

In this notebook, we will create a machine learning model using tf.estimator and evaluate its performance. The dataset is rather small (about 7,700 samples), so we can do everything in memory. We will also pass the raw data in as-is.

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print(tf.__version__)

1.14.0


Read data created in the previous chapter.

In [5]:
# In the CSV, the label is the first column, followed by the features and then the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)
df_test = pd.read_csv('./taxi-test.csv', header = None, names = CSV_COLUMNS)
df_test.tail()

| | fare_amount | pickuplon | pickuplat | dropofflon | dropofflat | passengers | key |
|---|---|---|---|---|---|---|---|
| 1661 | 13.0 | -73.974001 | 40.73749 | -73.949836 | 40.782886 | 1 | 1661 |
| 1662 | 3.3 | -74.013494 | 40.707932 | -74.010228 | 40.711958 | 1 | 1662 |
| 1663 | 16.0 | -74.00109 | 40.731757 | -73.95064 | 40.78912 | 6 | 1663 |
| 1664 | 8.5 | -73.993035 | 40.749641 | -73.974519 | 40.751438 | 1 | 1664 |
| 1665 | 33.5 | -73.973316 | 40.750992 | -73.808302 | 40.758884 | 2 | 1665 |
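
As a quick check of the slicing used in the cell above, the following sketch (plain Python, no CSV files needed) shows which names end up in `FEATURES` and `LABEL`. The shorter slice `CSV_COLUMNS[1:-1]` is shown only as an equivalent spelling; it is not what the notebook uses.

```python
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon',
               'dropofflat', 'passengers', 'key']

# Drop the label (first column) and the key (last column).
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

print(LABEL)
print(FEATURES)

# The negative-index slice selects the same five columns.
print(FEATURES == CSV_COLUMNS[1:-1])
```

Only the five coordinate and passenger columns are treated as model inputs; the key is carried along in the DataFrame but never becomes a feature column.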


In [6]:
def make_train_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000
  )

In [7]:
def make_eval_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = df[LABEL],
    batch_size = 128,
    shuffle = False,
    queue_capacity = 1000
  )

Our input function for predictions is the same, except that we don't provide a label.

In [8]:
def make_prediction_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    x = df,
    y = None,
    batch_size = 128,
    shuffle = False,
    queue_capacity = 1000
  )

### Create feature columns for estimator

In [9]:
def make_feature_cols():
  input_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]
  return input_columns

<h3> Linear Regression with the tf.estimator framework </h3>

In [10]:
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.LinearRegressor(
      feature_columns = make_feature_cols(), model_dir = OUTDIR)

model.train(input_fn = make_train_input_fn(df_train, num_epochs = 10))

I0801 01:56:38.866556 140209688872704 estimator.py:1790] Using default config.
I0801 01:56:38.868912 140209688872704 estimator.py:209] Using config: {'_train_distribute': None, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f84ba2dc160>, '_protocol': None, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_service': None, '_is_chief': True, '_master': '', '_experimental_distribute': None, '_task_type': 'worker', '_num_ps_replicas': 0, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_eval_distribute': None, '_global_id_in_cluster': 0, '_task_id': 0, '_experimental_max_worker_delay_secs': None, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_evaluation_master': '', '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_model_dir': 'taxi_trained'}
W0801 01:56:38.8997

<tensorflow_estimator.python.estimator.canned.linear.LinearRegressor at 0x7f84ba2dc080>
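
The number of training steps follows from the dataset size, the number of epochs, and the batch size. A minimal sketch of that arithmetic, assuming the input function emits a smaller final batch rather than dropping leftover rows (the helper name `expected_steps` is hypothetical, not part of TensorFlow):

```python
import math

def expected_steps(num_rows, num_epochs, batch_size):
    """Batches needed to pass over num_rows rows num_epochs times,
    assuming the last batch is allowed to be smaller than batch_size."""
    return math.ceil(num_rows * num_epochs / batch_size)

# 7700 is the approximate dataset size quoted in the introduction.
print(expected_steps(7700, num_epochs=10, batch_size=128))
```

The checkpoints in this notebook's logs end at steps 608 (10 epochs) and 6071 (100 epochs). Under the assumption above, those counts would be consistent with about 7,770 training rows; that row count is an inference from the logs, not a figure stated in the notebook.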

Evaluate on the validation data (we should defer using the test data until after we have selected a final model).

In [11]:
def print_rmse(model, df):
  metrics = model.evaluate(input_fn = make_eval_input_fn(df))
  print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))
print_rmse(model, df_valid)

I0801 01:57:22.622717 140209688872704 estimator.py:1145] Calling model_fn.
I0801 01:57:23.313694 140209688872704 estimator.py:1147] Done calling model_fn.
I0801 01:57:23.338406 140209688872704 evaluation.py:255] Starting evaluation at 2019-08-01T01:57:23Z
I0801 01:57:23.454712 140209688872704 monitored_session.py:240] Graph was finalized.
W0801 01:57:23.456446 140209688872704 deprecation.py:323] From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1276: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
I0801 01:57:23.459549 140209688872704 saver.py:1280] Restoring parameters from taxi_trained/model.ckpt-608
I0801 01:57:23.525096 140209688872704 session_manager.py:500] Running local_init_op.
I0801 01:57:23.556308 140209688872704 session_manager.py:502] Done running local_init_op.
I0801 01:57:24.02

RMSE on dataset = 10.576610565185547


This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction.

In [12]:
predictions = model.predict(input_fn = make_prediction_input_fn(df_test))
for items in predictions:
  print(items)

I0801 01:57:36.041475 140209688872704 estimator.py:1145] Calling model_fn.
I0801 01:57:36.299868 140209688872704 estimator.py:1147] Done calling model_fn.
I0801 01:57:36.412351 140209688872704 monitored_session.py:240] Graph was finalized.
I0801 01:57:36.416234 140209688872704 saver.py:1280] Restoring parameters from taxi_trained/model.ckpt-608
I0801 01:57:36.471588 140209688872704 session_manager.py:500] Running local_init_op.
I0801 01:57:36.478801 140209688872704 session_manager.py:502] Done running local_init_op.


{'predictions': array([9.887442], dtype=float32)}
{'predictions': array([9.885124], dtype=float32)}
{'predictions': array([9.885907], dtype=float32)}
{'predictions': array([9.883687], dtype=float32)}
{'predictions': array([9.887203], dtype=float32)}
{'predictions': array([9.887005], dtype=float32)}
{'predictions': array([9.885714], dtype=float32)}
{'predictions': array([9.885741], dtype=float32)}
{'predictions': array([9.887341], dtype=float32)}
{'predictions': array([9.885398], dtype=float32)}
{'predictions': array([9.887444], dtype=float32)}
{'predictions': array([9.887593], dtype=float32)}
{'predictions': array([9.881483], dtype=float32)}
{'predictions': array([9.885046], dtype=float32)}
{'predictions': array([9.935465], dtype=float32)}
{'predictions': array([9.885782], dtype=float32)}
{'predictions': array([9.886547], dtype=float32)}
{'predictions': array([9.936393], dtype=float32)}
{'predictions': array([9.886843], dtype=float32)}
{'predictions': array([9.884271], dtype=float32)}


This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well.
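
To see why near-constant predictions lead to a large error, here is a small sketch that computes the RMSE of a single constant guess. The fares are the five rows from the `df_test.tail()` output earlier in this notebook; the constant 9.886 is an assumed stand-in for the nearly identical predictions printed above, and five rows are far too few to reproduce the validation RMSE.

```python
import math

# Fares from the df_test.tail() output shown earlier in this notebook.
fares = [13.0, 3.3, 16.0, 8.5, 33.5]

# Hypothetical constant prediction, close to what the linear model printed.
constant_prediction = 9.886

squared_errors = [(constant_prediction - fare) ** 2 for fare in fares]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print('RMSE of constant prediction = {:.2f}'.format(rmse))
```

A single long trip, such as the 33.5 fare, dominates the squared error, which is why a model that ignores its inputs cannot do much better than the spread of the fares themselves.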

<h3> Deep Neural Network regression </h3>

In [13]:
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.DNNRegressor(hidden_units = [32, 8, 2],
      feature_columns = make_feature_cols(), model_dir = OUTDIR)
model.train(input_fn = make_train_input_fn(df_train, num_epochs = 100));
print_rmse(model, df_valid)

I0801 01:57:52.848378 140209688872704 estimator.py:1790] Using default config.
I0801 01:57:52.850056 140209688872704 estimator.py:209] Using config: {'_train_distribute': None, '_device_fn': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f84b22a3a90>, '_protocol': None, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_service': None, '_is_chief': True, '_master': '', '_experimental_distribute': None, '_task_type': 'worker', '_num_ps_replicas': 0, '_save_checkpoints_secs': 600, '_save_summary_steps': 100, '_keep_checkpoint_max': 5, '_eval_distribute': None, '_global_id_in_cluster': 0, '_task_id': 0, '_experimental_max_worker_delay_secs': None, '_keep_checkpoint_every_n_hours': 10000, '_num_worker_replicas': 1, '_evaluation_master': '', '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_model_dir': 'taxi_trained'}
I0801 01:57:52.8864

RMSE on dataset = 10.443714141845703


We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!

But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model.
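
The selection rule described here can be sketched in a few lines. The RMSE values are copied from the two validation runs above; the dictionary and variable names are illustrative only.

```python
# Validation RMSEs copied from the two runs in this notebook.
validation_rmse = {
    'LinearRegressor': 10.576610565185547,
    'DNNRegressor': 10.443714141845703,
}

# Pick the model with the lowest validation error.
best_model_name = min(validation_rmse, key=validation_rmse.get)
print(best_model_name)
```

Only after this choice is made would the test data be used, once, to report the final RMSE of the chosen model.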

<h2> Benchmark dataset </h2>

Let's do this on the benchmark dataset.

In [14]:
from google.cloud import bigquery
import numpy as np
import pandas as pd

def create_query(phase, EVERY_N):
  """
  phase: 1 = train, 2 = valid
  """
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  EXTRACT(DAYOFWEEK FROM pickup_datetime) * 1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime) * 1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  CONCAT(CAST(pickup_datetime AS STRING), CAST(pickup_longitude AS STRING), CAST(pickup_latitude AS STRING), CAST(dropoff_latitude AS STRING), CAST(dropoff_longitude AS STRING)) AS key
FROM
  `nyc-tlc.yellow.trips`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  if EVERY_N is None:
    if phase < 2:
      # Training
      query = "{0} AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 4) < 2".format(base_query)
    else:
      # Validation
      query = "{0} AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 4) = {1}".format(base_query, phase)
  else:
    query = "{0} AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), {1}) = {2}".format(base_query, EVERY_N, phase)
    
  return query

query = create_query(2, 100000)
df = bigquery.Client().query(query).to_dataframe()
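
The `WHERE` clause built by `create_query` samples rows repeatably: it hashes `pickup_datetime` and keeps a row only when the hash modulo `EVERY_N` equals `phase`. A rough local analogue of that idea is sketched below. It uses `hashlib.md5` as a stand-in for BigQuery's `FARM_FINGERPRINT` (an assumption made for illustration, so the selected rows will not match BigQuery's), and the timestamps and helper names are made up.

```python
import hashlib

def bucket(key, every_n):
    """Map a string key to a stable bucket in [0, every_n)."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % every_n

def in_sample(key, every_n, phase):
    """Keep the row only if its bucket equals the requested phase."""
    return bucket(key, every_n) == phase

# Made-up pickup timestamps, used only to exercise the helpers.
timestamps = ['2014-05-0{} 12:00:00 UTC'.format(day) for day in range(1, 10)]

sample = [t for t in timestamps if in_sample(t, every_n=4, phase=2)]
print(len(sample), 'of', len(timestamps), 'rows selected')
```

Because the bucket depends only on the key, rerunning the selection returns the same rows, and rows that share a timestamp always land in the same split.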

In [15]:
print_rmse(model, df)

I0801 02:00:01.046935 140209688872704 estimator.py:1145] Calling model_fn.
I0801 02:00:01.739599 140209688872704 estimator.py:1147] Done calling model_fn.
I0801 02:00:01.767324 140209688872704 evaluation.py:255] Starting evaluation at 2019-08-01T02:00:01Z
I0801 02:00:01.912871 140209688872704 monitored_session.py:240] Graph was finalized.
I0801 02:00:01.916168 140209688872704 saver.py:1280] Restoring parameters from taxi_trained/model.ckpt-6071
I0801 02:00:01.995145 140209688872704 session_manager.py:500] Running local_init_op.
I0801 02:00:02.037075 140209688872704 session_manager.py:502] Done running local_init_op.
I0801 02:00:03.221067 140209688872704 evaluation.py:275] Finished evaluation at 2019-08-01-02:00:03
I0801 02:00:03.228214 140209688872704 estimator.py:2039] Saving dict for global step 6071: average_loss = 88.91338, global_step = 6071, label/mean = 11.232336, loss = 11297.169, prediction/mean = 11.163781
I0801 02:00:03.233803 140209688872704 estimator.py:2099] Saving 'check

RMSE on dataset = 9.429388999938965
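
The reported RMSE is simply the square root of the `average_loss` value in the evaluation log, which is the computation `print_rmse` performs. A minimal check using the logged number:

```python
import math

# average_loss copied from the evaluation log line above.
average_loss = 88.91338

rmse = math.sqrt(average_loss)
print('RMSE on dataset = {:.3f}'.format(rmse))
```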


RMSE on the benchmark dataset is <b>9.43</b> in the run shown above (your results will vary because of random seeds).

This is not only well above our original benchmark of 6.00; it doesn't even beat our distance-based rule's RMSE of 8.02.

Fear not: you have learned how to write a TensorFlow model, but not yet how to do all the things needed to make your ML model performant. We will do that in the next chapters. In this chapter, though, we will get our TensorFlow model ready for these improvements.

In a software sense, the rest of the labs in this chapter will be about refactoring the code so that we can improve it.

## Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Simulate the necessary training dataset.
<p>
Hint (highlight to see):
<p style='color:white'>
The input features will be r and h, and the label will be $\pi r^2 h$.
Create random values for r and h and compute V.
Your dataset will consist of r, h and V.
Then, use a DNN regressor.
Make sure to generate enough data.
</p>

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License