<h2> Manual tuning of hyperparameters</h2>

- Objective: Predict the median housing price at the city-block level
    - Evaluate the accuracy of the LinearRegressor class in TensorFlow using Root Mean Squared Error (RMSE)
    - Improve model accuracy by manually tuning hyperparameters
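
As a quick reference for the metric named above, here is a minimal sketch of RMSE in plain Python. The `rmse` helper and the predicted values are assumptions made for illustration; the notebook itself reads the metric from the TensorFlow estimator.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Median house values in USD, paired with hypothetical model predictions.
actual = [66900.0, 80100.0, 85700.0]
predicted = [70000.0, 75000.0, 90000.0]
print('RMSE = {:.1f}'.format(rmse(actual, predicted)))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as the target.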

<h3> Step 1: Environment setup </h3>

In [1]:
# Load libraries
import math
import shutil
import numpy as np
import pandas as pd
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format



In [2]:
# Load data
df = pd.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")
df.to_csv('cal_housing.csv')

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.3,34.2,15.0,5612.0,1283.0,1015.0,472.0,1.5,66900.0
1,-114.5,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.8,80100.0
2,-114.6,33.7,17.0,720.0,174.0,333.0,117.0,1.7,85700.0
3,-114.6,33.6,14.0,1501.0,337.0,515.0,226.0,3.2,73400.0
4,-114.6,33.6,20.0,1454.0,326.0,624.0,262.0,1.9,65500.0


In [4]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0


Looking at the statistics for total_rooms, the column reflects the number of rooms per city block. To use this data to predict the price of a single house, we need to create a feature that describes a single household.

In [5]:
df['num_rooms_per_households'] = df['total_rooms'] / df['households']
df.describe()
# target variable is median_house_value


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,num_rooms_per_households
count,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0,17000.0
mean,-119.6,35.6,28.6,2643.7,539.4,1429.6,501.2,3.9,207300.9,5.4
std,2.0,2.1,12.6,2179.9,421.5,1147.9,384.5,1.9,115983.8,2.5
min,-124.3,32.5,1.0,2.0,1.0,3.0,1.0,0.5,14999.0,0.8
25%,-121.8,33.9,18.0,1462.0,297.0,790.0,282.0,2.6,119400.0,4.4
50%,-118.5,34.2,29.0,2127.0,434.0,1167.0,409.0,3.5,180400.0,5.2
75%,-118.0,37.7,37.0,3151.2,648.2,1721.0,605.2,4.8,265000.0,6.1
max,-114.3,42.0,52.0,37937.0,6445.0,35682.0,6082.0,15.0,500001.0,141.9


<h2> Build first model</h2>

In [11]:
# model: LinearRegressor
# target variable, also called the label, is median_house_value
# selected feature: "num_rooms_per_households"

# step 1: preprocess data for tensorflow based model 
input_fn = tf.estimator.inputs.pandas_input_fn(x= df[["num_rooms_per_households"]],
                                              y = df["median_house_value"],
                                              num_epochs = 1,
                                              shuffle = True)
# step 2: create features
features = [tf.feature_column.numeric_column('num_rooms_per_households')]
# step 3: create model output directory
outdir = './housing_trained'
# step 4: restart after cleaning folder
shutil.rmtree(outdir, ignore_errors=True)
# step 5: instantiate linear model -- predicting a continuous variable
model = tf.estimator.LinearRegressor(model_dir=outdir,
                                    feature_columns=features)
# step 6: train model
model.train(input_fn = input_fn, steps=100)
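# step 7: evaluate model

def print_rmse(model, name, input_fn):
  # steps=1 evaluates a single batch only
  metrics = model.evaluate(input_fn = input_fn, steps=1)
  # 'average_loss' is the mean squared error per example; 'loss' is summed over the batch
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'training', input_fn)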

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_service': None, '_task_id': 0, '_model_dir': './housing_trained', '_log_step_count_steps': 100, '_task_type': 'worker', '_is_chief': True, '_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_session_config': None, '_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f62343b7b70>, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_master': '', '_evaluation_master': '', '_tf_random_seed': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_save_checkpoints_secs': 600}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./housing_trained/model.ckpt.
INFO:tensorflow:step = 1, 

<h2>Scale the output</h2>

- Scale the target value so the default parameters are more appropriate.
- Note that the RMSE is now in units of 100,000, so an RMSE of 0.9 really means an RMSE of 90,000.
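
The conversion in the note above works because RMSE is linear in the scale of the target. The following sketch checks this on a few values; the `rmse` helper and the predictions are hypothetical, and only `SCALE` is taken from the notebook.

```python
import math

SCALE = 100000  # same constant the notebook divides the target by

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

actual_usd = [66900.0, 80100.0, 85700.0]
predicted_usd = [70000.0, 75000.0, 90000.0]  # hypothetical predictions

rmse_usd = rmse(actual_usd, predicted_usd)
rmse_scaled = rmse([v / SCALE for v in actual_usd],
                   [v / SCALE for v in predicted_usd])

# Multiplying the scaled RMSE by SCALE recovers the RMSE in dollars.
print('scaled RMSE = {:.4f}'.format(rmse_scaled))
print('rescaled to USD = {:.1f}, computed in USD = {:.1f}'.format(rmse_scaled * SCALE, rmse_usd))
```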

In [12]:
SCALE = 100000
input_fn = tf.estimator.inputs.pandas_input_fn(x= df[["num_rooms_per_households"]],
                                              y = df["median_house_value"] / SCALE,
                                              num_epochs = 1,
                                              shuffle = True)
features = [tf.feature_column.numeric_column('num_rooms_per_households')]
outdir = './housing_trained'
shutil.rmtree(outdir, ignore_errors=True)
model = tf.estimator.LinearRegressor(model_dir=outdir, feature_columns=features)
model.train(input_fn=input_fn, steps=100)
print_rmse(model, 'training', input_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_service': None, '_task_id': 0, '_model_dir': './housing_trained', '_log_step_count_steps': 100, '_task_type': 'worker', '_is_chief': True, '_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_session_config': None, '_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f62343b7748>, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_master': '', '_evaluation_master': '', '_tf_random_seed': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_save_checkpoints_secs': 600}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./housing_trained/model.ckpt.
INFO:tensorflow:step = 1, 

<h2>Tune parameters</h2>

- Tune the learning rate and batch size.
- Evaluation is commonly done on a test/validation set rather than on the training data used here.
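
The cells in this notebook evaluate on the training data. A held-out split could be sketched as follows; the function name, the 20% validation fraction, and the seed are assumptions for illustration, not part of the original tutorial. Shuffling before splitting matters when the file is ordered, and the first rows shown earlier appear to be sorted by longitude.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows and hold out a fraction of them for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Row indices stand in for the 17000 examples in the dataset.
row_indices = list(range(17000))
train_rows, validation_rows = train_validation_split(row_indices)
print(len(train_rows), len(validation_rows))
```

With a pandas DataFrame, the two index lists could then be passed to `df.iloc[...]` to build separate training and validation input functions.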

In [99]:
SCALE = 100000
input_fn = tf.estimator.inputs.pandas_input_fn(x= df[["num_rooms_per_households"]],
                                              y = df["median_house_value"] / SCALE,
                                              num_epochs = 1,
                                              batch_size = 8,
                                              shuffle = True)
features = [tf.feature_column.numeric_column('num_rooms_per_households')]
outdir = './housing_trained'
shutil.rmtree(outdir, ignore_errors=True)
# set the learning rate through a custom optimizer
my_optimizer = tf.train.FtrlOptimizer(learning_rate=0.05)
model = tf.estimator.LinearRegressor(model_dir = outdir,
                                   feature_columns = features,
                                   optimizer = my_optimizer)
# relationship: batch_size = sample_size [17000] * epochs [1] / steps
model.train(input_fn=input_fn, steps=500)
print_rmse(model, 'training', input_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_service': None, '_task_id': 0, '_model_dir': './housing_trained', '_log_step_count_steps': 100, '_task_type': 'worker', '_is_chief': True, '_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_session_config': None, '_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f62340a47b8>, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_master': '', '_evaluation_master': '', '_tf_random_seed': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_save_checkpoints_secs': 600}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./housing_trained/model.ckpt.
INFO:tensorflow:step = 1, 
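
The comment in the cell above relates batch size, sample size, epochs, and steps. A small sketch of that arithmetic, using hypothetical helper names, shows how much of the data a run with `steps=500` and `batch_size=8` would cover, assuming training stops at the step limit or when the input is exhausted, whichever comes first.

```python
import math

def steps_per_epoch(sample_size, batch_size):
    """Number of batches needed for one full pass over the data."""
    return math.ceil(sample_size / batch_size)

def examples_seen(steps, batch_size, sample_size, num_epochs):
    """Examples consumed, capped by what the input function can supply."""
    return min(steps * batch_size, sample_size * num_epochs)

sample_size = 17000
batch_size = 8

full_pass = steps_per_epoch(sample_size, batch_size)
seen = examples_seen(steps=500, batch_size=batch_size,
                     sample_size=sample_size, num_epochs=1)
print('steps for one epoch:', full_pass)
print('examples seen in 500 steps:', seen)
```

Under these assumptions, 500 steps cover less than a quarter of one epoch, so raising `steps` or `batch_size` lets the model see more of the data.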

<h3>A few rules of thumb for parameter tuning</h3>

 * Training error should steadily decrease, steeply at first, and should eventually plateau as training converges.
 * If the training has not converged, try running it for longer.
 * If the training error decreases too slowly, increasing the learning rate may help it decrease faster.
   * But sometimes the exact opposite may happen if the learning rate is too high.
 * If the training error varies wildly, try decreasing the learning rate.
   * Lower learning rate plus larger number of steps or larger batch size is often a good combination.
 * Very small batch sizes can also cause instability.  First try larger values like 100 or 1000, and decrease until you see degradation.
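
The learning-rate rules above can be illustrated with a toy problem. The sketch below uses plain full-batch gradient descent on made-up data rather than the FTRL optimizer from the notebook, so the specific rates are only meaningful for this toy example.

```python
def train_loss_history(learning_rate, steps=50):
    """Fit y = w * x by full-batch gradient descent; return the MSE after each step."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # generated with the true weight w = 2
    w = 0.0
    history = []
    for _ in range(steps):
        # Gradient of mean((w * x - y)^2) with respect to w.
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
        history.append(sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return history

too_low = train_loss_history(learning_rate=0.001)   # decreases, but slowly
reasonable = train_loss_history(learning_rate=0.05) # converges quickly
too_high = train_loss_history(learning_rate=0.2)    # overshoots and diverges

for name, history in [('too low', too_low), ('reasonable', reasonable), ('too high', too_high)]:
    print('{:>10}: first loss = {:.3g}, last loss = {:.3g}'.format(name, history[0], history[-1]))
```

In this toy setting the loss shrinks slowly at the lowest rate, converges at the middle rate, and grows without bound at the highest one, which mirrors the first four rules in the list.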


<h2> Add more features</h2>


In [14]:
SCALE = 100000
input_fn = tf.estimator.inputs.pandas_input_fn(x= df[["num_rooms_per_households","housing_median_age"]],
                                              y = df["median_house_value"] / SCALE,
                                              num_epochs = 1,
                                              batch_size = 10,
                                              shuffle = True)
features = [tf.feature_column.numeric_column('num_rooms_per_households'),
            tf.feature_column.numeric_column('housing_median_age')]
outdir = './housing_trained'
shutil.rmtree(outdir, ignore_errors=True)
# set the learning rate through a custom optimizer
my_optimizer = tf.train.FtrlOptimizer(learning_rate=0.001)
model = tf.estimator.LinearRegressor(model_dir = outdir,
                                   feature_columns = features,
                                   optimizer = my_optimizer)
# relationship: batch_size = sample_size [17000] * epochs [1] / steps
model.train(input_fn=input_fn, steps=100)
print_rmse(model, 'training', input_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_service': None, '_task_id': 0, '_model_dir': './housing_trained', '_log_step_count_steps': 100, '_task_type': 'worker', '_is_chief': True, '_global_id_in_cluster': 0, '_keep_checkpoint_max': 5, '_session_config': None, '_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f622dd5b828>, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_steps': None, '_master': '', '_evaluation_master': '', '_tf_random_seed': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_save_checkpoints_secs': 600}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./housing_trained/model.ckpt.
INFO:tensorflow:step = 1, 

This tutorial is based on the Google Cloud ML course on Coursera.