## Example 2
## Linear Regression with Batch Gradient Descent

Another way to train the model is the Batch Gradient Descent method. Recall the following facts:
- **Gradient**: the vector composed of all partial derivatives. For example, if $L(x, y, z) = x^2 + xy + yz$, then its partial derivatives are

$$
\begin{aligned}
\frac{\partial L}{\partial x} &= 2x + y\\
\frac{\partial L}{\partial y} &= x + z\\
\frac{\partial L}{\partial z} &= y
\end{aligned}
$$

Therefore if $(x, y, z) = (1, 2, 3)$, its gradient vector is $(2x + y, x + z, y)=(4, 4, 2)$.
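
The worked example above can be checked numerically. The sketch below evaluates the three partial derivatives at $(1, 2, 3)$ and compares them with central finite differences. The helper names `L`, `grad_L`, and `numerical_grad` are illustrative and are not part of the original notebook.

```python
def L(x, y, z):
    """The example function L(x, y, z) = x^2 + xy + yz."""
    return x**2 + x*y + y*z


def grad_L(x, y, z):
    """Analytic gradient: (dL/dx, dL/dy, dL/dz)."""
    return (2*x + y, x + z, y)


def numerical_grad(f, point, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grads = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grads.append((f(*forward) - f(*backward)) / (2 * h))
    return tuple(grads)


point = (1.0, 2.0, 3.0)
analytic = grad_L(*point)
approx = numerical_grad(L, point)
print(analytic)  # (4.0, 4.0, 2.0)
```

The finite-difference values in `approx` should agree with `analytic` up to rounding error, which is a quick sanity check for any hand-derived gradient.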

- **Gradient Descent**: a method to find a *local minimum* of a function. It is based on the observation that the value of the function decreases fastest if the point goes in the direction of the negative gradient.

![](Data/BGD_1.png)
- **Gradient Descent algorithm**: 

repeat until convergence {

    parameter <-- parameter - (learning_rate) * (partial derivative with respect to this parameter)
    
}
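
As a minimal sketch of this loop, the code below minimizes the toy function $f(x) = (x - 3)^2$, whose derivative is $2(x - 3)$. The function `gradient_descent`, the stopping tolerance `tol`, and the iteration cap `max_iters` are assumptions made for illustration; the pseudocode above does not specify a convergence test.

```python
def gradient_descent(grad, start, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Repeat `parameter <- parameter - learning_rate * gradient` until the
    update step is smaller than `tol` (or `max_iters` is reached)."""
    param = start
    for _ in range(max_iters):
        step = learning_rate * grad(param)
        param = param - step
        if abs(step) < tol:
            break
    return param


# Toy function f(x) = (x - 3)^2 with derivative f'(x) = 2(x - 3); minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(x_min, 4))  # 3.0
```

With a learning rate that is too large (here, anything above 1.0) the iterates would overshoot and diverge, which is why the learning rate usually has to be tuned.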

- **Batch vs. Stochastic**: Since the cost function for a machine learning task is usually the average of the costs caused by each training instance, its gradient vector is the average of the gradients of the individual costs. Batch Gradient Descent (BGD) uses the gradient of the full cost function, computed over the whole training set, at every step. Stochastic Gradient Descent (SGD) uses the gradient of one randomly chosen instance at every step, hoping that this gradient is *close* to the average gradient. Each SGD step is much cheaper than a BGD step, but the updates are noisier, so its progress is less stable.

- **Mini-Batch Stochastic Gradient Descent**: a variation of SGD that splits the training set into small batches and, in each iteration, uses one batch to calculate the gradient.
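
The three variants differ only in which instances are used to compute the gradient. The NumPy sketch below illustrates this on a small synthetic data set (an assumption made for illustration, not the housing data) with a mean squared error cost. All names ending in `_toy` and the helper `mse_gradient` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): bias column plus 2 features.
m_toy = 1000
X_toy = np.c_[np.ones((m_toy, 1)), rng.normal(size=(m_toy, 2))]
y_toy = X_toy @ np.array([[1.0], [2.0], [-3.0]]) + 0.1 * rng.normal(size=(m_toy, 1))
theta_toy = np.zeros((3, 1))


def mse_gradient(X_part, y_part, theta):
    """Gradient of the mean squared error over the given instances."""
    return 2 / len(X_part) * X_part.T @ (X_part @ theta - y_part)


# Batch: gradient of the full cost, i.e. the average over all instances.
batch_grad = mse_gradient(X_toy, y_toy, theta_toy)

# The same value, computed as the average of the per-instance gradients.
per_instance = [mse_gradient(X_toy[i:i + 1], y_toy[i:i + 1], theta_toy)
                for i in range(m_toy)]
avg_of_instances = np.mean(per_instance, axis=0)

# Stochastic: gradient of one randomly chosen instance (a noisy estimate).
i = rng.integers(m_toy)
sgd_grad = mse_gradient(X_toy[i:i + 1], y_toy[i:i + 1], theta_toy)

# Mini-batch: gradient averaged over a small random batch of 100 instances.
batch_idx = rng.choice(m_toy, size=100, replace=False)
minibatch_grad = mse_gradient(X_toy[batch_idx], y_toy[batch_idx], theta_toy)

print(np.allclose(batch_grad, avg_of_instances))  # True
```

Comparing `sgd_grad` and `minibatch_grad` with `batch_grad` shows the trade-off: fewer instances make the gradient cheaper to compute but a noisier estimate of the batch gradient.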

In this example we will first conduct gradient descent using manually computed gradients, then we will use TensorFlow's autodiff feature to let TensorFlow compute the gradients automatically.

In [1]:
# Load California housing data
import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

# Feature scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]



### 1. Manually computing the gradients
- Cost generated from each instance $(\textbf{x}^{(i)}, y^{(i)})$ is: 

$(\theta\cdot\textbf{x}^{(i)} - y^{(i)})^2 = (\theta_0\cdot 1 + \theta_1x^{(i)}_1 + \theta_2x^{(i)}_2 + \cdots + \theta_nx^{(i)}_n - y^{(i)})^2$.
- Its partial derivative with respect to $\theta_j$ (using the convention $x^{(i)}_0 = 1$ for the bias term) is:

$2x^{(i)}_j(\theta_0\cdot 1 + \theta_1x^{(i)}_1 + \theta_2x^{(i)}_2 + \cdots + \theta_nx^{(i)}_n - y^{(i)})$.
- Average cost is: $\frac{1}{m}\sum_{i=1}^m(\theta\cdot\textbf{x}^{(i)} - y^{(i)})^2$.
- The partial derivative of the average cost with respect to $\theta_j$ is:

$\frac{1}{m}\sum_{i=1}^m2x^{(i)}_j(\theta_0\cdot 1 + \theta_1x^{(i)}_1 + \theta_2x^{(i)}_2 + \cdots + \theta_nx^{(i)}_n - y^{(i)})$.
- The update rule of gradient descent is: 

$\theta_j = \theta_j - \textit{(learning_rate)}\cdot\textit{partial derivative}$.

Substituting the partial derivative of the average cost computed above, the update for each $\theta_j$ is

$\theta_j = \theta_j - \textit{learning_rate}\cdot\frac{1}{m}\sum_{i=1}^m2x^{(i)}_j(\theta_0\cdot 1 + \theta_1x^{(i)}_1 + \theta_2x^{(i)}_2 + \cdots + \theta_nx^{(i)}_n - y^{(i)})$.
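
The summation over instances can be written in matrix form as $\frac{2}{m}X^T(X\theta - y)$, which gives all partial derivatives at once. The NumPy sketch below checks that the two forms agree for one coordinate and then applies the update rule repeatedly. It uses a small noise-free synthetic data set (an assumption for illustration); the names ending in `_demo` and `eta` are hypothetical and chosen so that they do not clash with the notebook's own variables.

```python
import numpy as np

rng = np.random.default_rng(42)

# Small synthetic data set (assumed); column 0 is the bias feature x_0 = 1.
m_demo, n_demo = 200, 3
X_demo = np.c_[np.ones((m_demo, 1)), rng.normal(size=(m_demo, n_demo))]
true_theta_demo = np.array([[0.5], [1.0], [-2.0], [3.0]])
y_demo = X_demo @ true_theta_demo  # noise-free targets

theta_demo = np.zeros((n_demo + 1, 1))
eta = 0.1  # learning rate

# Partial derivative w.r.t. theta_j, written as in the summation formula.
j = 2
residuals = X_demo @ theta_demo - y_demo  # theta . x^(i) - y^(i), shape (m, 1)
partial_j = (1 / m_demo) * np.sum(2 * X_demo[:, [j]] * residuals)

# All partial derivatives at once, in matrix form.
grad_matrix = 2 / m_demo * X_demo.T @ residuals
print(np.isclose(partial_j, grad_matrix[j, 0]))  # True

# Apply the update rule repeatedly (batch gradient descent).
for _ in range(500):
    step_grad = 2 / m_demo * X_demo.T @ (X_demo @ theta_demo - y_demo)
    theta_demo = theta_demo - eta * step_grad

print(np.round(theta_demo.ravel(), 3))
```

Because the targets are noise-free, `theta_demo` should end up very close to `true_theta_demo`.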

In [3]:
import tensorflow as tf

# Reset dataflow graph
tf.reset_default_graph()

n_epochs = 1000  # each training instance will be used 1000 times during training phase
learning_rate = 0.01

# Training data
X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")

# Define theta variables with random initialization
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")

# Use y_pred to store the prediction theta*x
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y

# Use tf.square() and tf.reduce_mean() to construct the average cost
mse = tf.reduce_mean(tf.square(error), name="mse")

# Calculate gradients using formula
gradients = 2/m * tf.matmul(tf.transpose(X), error)

# Use tf.assign() to define the update rule
training_op = tf.assign(theta, theta - learning_rate * gradients)

# initializer
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    
    best_theta = theta.eval()
    print('best_theta:')
    print(best_theta)



### 2. Using autodiff

Use tf.gradients(ys, xs) to have TensorFlow automatically compute the derivatives of the sum of ys with respect to xs.

In [None]:
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")

gradients = tf.gradients(mse, [theta])[0]

training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    
    best_theta = theta.eval()

print("Best theta:")
print(best_theta)

### 3. Using a TensorFlow Optimizer
TensorFlow also provides a number of optimizers out of the box. 

In [None]:
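# Use tf.train.GradientDescentOptimizer() and
# its minimize() method to perform Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)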

## Use TensorBoard to Visualize Dataflow Graph
So far we have relied on the print() function to track progress during training. There is a better way: TensorBoard. If you feed it some training stats, TensorBoard will display interactive visualizations of these stats in a web browser. It can also display the dataflow graph, which is very useful for identifying errors in the graph, finding bottlenecks, and so on.

In [3]:
# Let's visualize a simple graph on TensorBoard
tf.reset_default_graph()

a = tf.constant(10, name='a')
b = tf.constant(20, name='b')
c = a * b + b + 2

# location of log directory
logdir = './test_log'

# add a summary node
c_summary = tf.summary.scalar('c_summary', c)

with tf.Session() as sess:
    # create a writer object
    writer = tf.summary.FileWriter(logdir, sess.graph)
#     result = sess.run(c)
#     summary_value = c_summary.eval()
#     writer.add_summary(summary_value)
#     print('outcome: ', result)

    for k in range(10):
        result = sess.run(c)
        summary_value = c_summary.eval()
        writer.add_summary(summary_value, k)
        print('outcome: ', result)    
 
# Close FileWriter
writer.close()

outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222
outcome:  222


The example above follows these steps:

- Create **summary nodes** to store values that you want to visualize.
- Create a directory to save the log files.
- Create a **FileWriter** and use it to save the summary values.
- Close the FileWriter at the end of the program.

Then open a command line (Anaconda Prompt is recommended) and execute the following command:

    tensorboard --logdir (enter directory here)

- The Graphs tab will visualize the dataflow graph.
- The Scalars tab will visualize the summary stats that you saved.

In [None]:
# Use TensorBoard to visualize gradient descent.
tf.reset_default_graph()

# Use date and time to name the log directory
from datetime import datetime
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

In [None]:
n_epochs = 1000
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
gradients = tf.gradients(mse, [theta])[0]

training_op = tf.assign(theta, theta - learning_rate * gradients)

# optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

In [None]:
mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

In [None]:
X_data = scaled_housing_data_plus_bias
y_data = housing.target.reshape(-1, 1)
with tf.Session() as sess:                                                        
    sess.run(init)                                                               

    for epoch in range(n_epochs):                      
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval(feed_dict={X:X_data, y:y_data}))
            summary_str = mse_summary.eval(feed_dict={X:X_data, y:y_data})
            file_writer.add_summary(summary_str, epoch)
        sess.run(training_op, feed_dict={X:X_data, y:y_data})
    
    best_theta = theta.eval()
    
print('best theta:', best_theta)
file_writer.close()                                                   

## Saving and Restoring Models

To save a model:
- Create a **Saver** node after all variable nodes are created.
- In the execution phase, call its save() method whenever you want to save the model, passing it the session and the path of the checkpoint file.

To restore a model:
- Create a Saver node at the end of the construction phase.
- At the beginning of the execution phase, instead of initializing the variables using the init node, call the restore() method of the Saver object.

In [None]:
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0, seed=42), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
            # Save a checkpoint every 100 epochs
            save_path = saver.save(sess, "/tmp/my_model.ckpt")
        sess.run(training_op)
    
    best_theta = theta.eval()
    save_path = saver.save(sess, "/tmp/my_model_final.ckpt")

In [None]:
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    best_theta_restored = theta.eval() 
    print('best theta:')
    print(best_theta_restored)

# Homework (optional)
Perform Mini-Batch SGD on the California housing data with a mini-batch size of 100, and visualize the learning curve on TensorBoard.