##### MA755 Machine Learning - Chapter 9. Up and Running with TensorFlow - 29 Apr 2017 

These notes are based on, and include images from, [_Hands-On Machine Learning with Scikit-Learn and TensorFlow_](http://shop.oreilly.com/product/0636920052289.do)
- by Aurélien Géron
- Published by O'Reilly Media, Inc., 2017

In [None]:
!pip3 install tensorflow

Import the TensorFlow library.

In [1]:
import tensorflow as tf

Create a sample TensorFlow compute graph.

In [2]:
x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

Run this graph in a session:
1. Create the session
1. Initialize `x` and `y`
1. Run the graph / evaluate `f`
1. Close the session

In [3]:
sess = tf.Session()
sess.run(x.initializer)
sess.run(y.initializer)
result = sess.run(f)
sess.close()

print(result)

42
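The key idea above is that building `f` and evaluating `f` are separate steps. The following is a toy, pure-Python sketch of that idea only. The `Node` class and the `const`, `add` and `mul` helpers are hypothetical names made up for illustration; they are not TensorFlow's API and say nothing about how TensorFlow is implemented.

```python
# Toy illustration of deferred evaluation; hypothetical classes, not TensorFlow.
class Node:
    """A deferred computation: nothing is calculated until evaluate() is called."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def evaluate(self):
        # Evaluate the input nodes first, then apply this node's operation.
        return self.fn(*(node.evaluate() for node in self.inputs))


def const(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda u, v: u + v, a, b)


def mul(a, b):
    return Node(lambda u, v: u * v, a, b)


# Build the graph for f = x*x*y + y + 2. No arithmetic happens here.
x_node = const(3)
y_node = const(4)
f_node = add(add(mul(mul(x_node, x_node), y_node), y_node), const(2))

# Evaluation is a separate, explicit step, loosely analogous to sess.run(f).
toy_result = f_node.evaluate()
print(toy_result)  # 3*3*4 + 4 + 2 = 42
```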


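Using the `with` statement produces code that is easier to read and less error-prone: the session becomes the default session inside the block and is closed automatically when the block exits.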

In [4]:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()

result

42

Rather than initializing each variable separately, you can initialize all of them.

In [5]:
init = tf.global_variables_initializer() 

with tf.Session() as sess:
    init.run()  
    result = f.eval()

result

42

Every node you create is added to the default graph unless you explicitly state otherwise. The nodes above were added to the default graph, and so is the new node below.

In [6]:
x1 = tf.Variable(1)
x1.graph is tf.get_default_graph()

True

You might want to create multiple independent graphs. (I'd rather not.)

Create a new graph.

In [7]:
graph = tf.Graph()

Notice that `x2` belongs to the new graph.

In [8]:
with graph.as_default():
    x2 = tf.Variable(2)
x2.graph is graph

True

But the graph that `x2` belongs to is not the default graph.

In [9]:
x2.graph is tf.get_default_graph()

False

Every variable/node/expression you create is by default added to the default graph. 

To reset/clear/empty the default graph:

In [10]:
tf.reset_default_graph()

> All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs (queues and readers also maintain some state, as we will see in Chapter 12). A variable starts its life when its initializer is run, and it ends when the session is closed.

### Linear Regression with TensorFlow

In [11]:
import numpy   as np
from sklearn.datasets import fetch_california_housing 

In [12]:
housing =  fetch_california_housing()
m, n = housing.data.shape
(m, n)

(20640, 8)

In [13]:
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]
housing_data_plus_bias.shape

(20640, 9)

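Standardize the independent variables (zero mean, unit variance per column) and store the result in `housing_data_plus_bias_scaled`. Because the all-ones bias column is scaled too, it becomes all zeros (its mean is subtracted), which is why the first coefficient stays at 0 in the gradient descent results later on.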

In [14]:
import sklearn.preprocessing  # import the submodule explicitly so sklearn.preprocessing resolves
housing_data_plus_bias_scaled = sklearn.preprocessing.scale(housing_data_plus_bias)
housing_data_plus_bias_scaled.shape

(20640, 9)
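To make the scaling step concrete, here is a minimal NumPy sketch of column-wise standardization on a tiny made-up matrix. The `standardize` helper and the `toy` data are hypothetical; the helper treats a zero standard deviation as 1, which (to my understanding) matches how `sklearn.preprocessing.scale` handles constant columns, but that is an assumption rather than a statement about its internals.

```python
import numpy as np


def standardize(data):
    """Hypothetical helper: column-wise (value - mean) / std."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    std[std == 0.0] = 1.0  # avoid dividing by zero for constant columns
    return (data - mean) / std


# First column plays the role of the all-ones bias column.
toy = np.array([[1.0, 10.0, 200.0],
                [1.0, 20.0, 400.0],
                [1.0, 30.0, 900.0]])
toy_scaled = standardize(toy)

print(toy_scaled.mean(axis=0))  # column means after scaling
print(toy_scaled.std(axis=0))   # column standard deviations after scaling
print(toy_scaled[:, 0])         # the constant column after scaling
```

In this sketch the constant first column ends up as all zeros, so a model fitted on the scaled matrix has no usable intercept term unless the bias column is added after scaling.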

Create constants `X` and `y` which are set to the independent and dependent matrices of values.

In [15]:
X = tf.constant(housing_data_plus_bias       , dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
(X, y)

(<tf.Tensor 'X:0' shape=(20640, 9) dtype=float32>,
 <tf.Tensor 'y:0' shape=(20640, 1) dtype=float32>)

Include a computation to calculate the transpose of `X` and store in `XT`. (GO TO BOARD)

In [16]:
XT = tf.transpose(X)
(X.shape, 
 XT.shape
)

(TensorShape([Dimension(20640), Dimension(9)]),
 TensorShape([Dimension(9), Dimension(20640)]))

Short diversion to demonstrate the transpose and inverse of a matrix.

In [18]:
mat = np.array([[1., 2.], [3., 4.]])
print(mat,"mat")
print(mat.transpose(),"mat transpose")
matinv = np.linalg.inv(mat)
print(matinv, "mat inverse")
print(mat.dot(matinv), "dot product of mat and matinv")

[[ 1.  2.]
 [ 3.  4.]] mat
[[ 1.  3.]
 [ 2.  4.]] mat transpose
[[-2.   1. ]
 [ 1.5 -0.5]] mat inverse
[[  1.00000000e+00   0.00000000e+00]
 [  8.88178420e-16   1.00000000e+00]] dot product of mat and matinv


The closed-form equation for the vector $\theta$ that minimizes the sum of squares of the residuals is
$$\theta = \left(X^T X\right)^{-1} X^T y
$$
Add this to the computation graph. 

In [19]:
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)
theta

<tf.Tensor 'MatMul_2:0' shape=(9, 1) dtype=float32>

In [20]:
with tf.Session() as sess:
    theta_value = theta.eval()

theta_value

array([[ -3.74651413e+01],
       [  4.35734153e-01],
       [  9.33829229e-03],
       [ -1.06622010e-01],
       [  6.44106984e-01],
       [ -4.25131839e-06],
       [ -3.77322501e-03],
       [ -4.26648885e-01],
       [ -4.40514028e-01]], dtype=float32)

Note that we have not scaled the `X` matrix. 
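The closed-form solution can be checked on a small synthetic problem where the true coefficients are known. This is a NumPy sketch, not the notebook's TensorFlow graph; the data, the `_toy` names, and `true_theta` are made up for illustration. It also compares the explicit inverse with `np.linalg.solve`, which avoids forming the inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a bias column of ones plus two random features.
m_toy = 200
X_toy = np.c_[np.ones((m_toy, 1)), rng.normal(size=(m_toy, 2))]
true_theta = np.array([[4.0], [3.0], [-2.0]])
y_toy = X_toy @ true_theta + 0.01 * rng.normal(size=(m_toy, 1))

# Normal equation, written as in the formula: (X^T X)^{-1} X^T y
theta_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy

# Same solution without an explicit inverse: solve (X^T X) theta = X^T y
theta_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

print(theta_inv.ravel())
print(theta_solve.ravel())
```

Both estimates should land close to `true_theta`, up to the small noise added to `y_toy`.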

### Implementing Gradient Descent

In [21]:
tf.reset_default_graph()

In [22]:
learning_rate = 0.01
import sklearn

### Create a compute graph that implements gradient descent

Create nodes `X` and `y` that are set to the independent variable values and the dependent variable values, respectively.

In [23]:
X = tf.constant(housing_data_plus_bias_scaled, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")

Create node `theta` that is initially set to zero, though it might be initialized to random numbers between `-1` and `+1`.

It will contain the coefficients that minimize the mean square error. 

In [24]:
theta  = tf.Variable(tf.zeros([n + 1, 1])                    , name="theta")
#theta  = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta") 

Create node `y_pred` to contain the predictions for $y$, which is a matrix multiplication of the `X` matrix of independent variable values by the coefficients. 

In [25]:
y_pred    = tf.matmul(X, theta, name="predictions")

Calculate the error and store it in node `error`. Calculate the mean squared error and store it in `mse`.

In [26]:
error     = y_pred - y
mse       = tf.reduce_mean(tf.square(error), name="mse")

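Calculate the gradient of the MSE with respect to `theta`. The gradient is the vector that points in the direction of steepest increase, so the update step moves `theta` in the opposite direction.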

In [27]:
gradients = 2/m * tf.matmul(tf.transpose(X), error)
#gradients = tf.gradients(mse, [theta])[0]
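The hand-written gradient `2/m * X^T (X theta - y)` can be sanity-checked against a finite-difference estimate of the MSE. The NumPy sketch below does this on small random data; all `_toy` names and the `mse_toy` helper are hypothetical and exist only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
m_toy, n_toy = 50, 3
X_toy = rng.normal(size=(m_toy, n_toy))
y_toy = rng.normal(size=(m_toy, 1))
theta_toy = rng.normal(size=(n_toy, 1))


def mse_toy(theta):
    """Mean squared error of the linear model X_toy @ theta."""
    return float(np.mean((X_toy @ theta - y_toy) ** 2))


# Analytic gradient, same formula as the `gradients` node.
analytic = 2 / m_toy * X_toy.T @ (X_toy @ theta_toy - y_toy)

# Central finite differences, one coordinate of theta at a time.
eps = 1e-6
numeric = np.zeros_like(theta_toy)
for j in range(n_toy):
    step = np.zeros_like(theta_toy)
    step[j, 0] = eps
    numeric[j, 0] = (mse_toy(theta_toy + step) - mse_toy(theta_toy - step)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small
```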

Calculate a new value for `theta`.

In [28]:
training_op = tf.assign(theta, theta - learning_rate * gradients)
#optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
#training_op = optimizer.minimize(mse)

### The compute graph is complete with respect to gradient descent

Initialize the variables (the only variable is `theta`), then run the training op for `n_epochs` iterations, printing the MSE every 100 epochs.

In [29]:
n_epochs = 500

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch: ", epoch, 
                  "MSE:   ", mse.eval()
                 )
        sess.run(training_op)

    best_theta = theta.eval()
    
best_theta

Epoch:  0 MSE:    5.61049
Epoch:  100 MSE:    4.91507
Epoch:  200 MSE:    4.87571
Epoch:  300 MSE:    4.85614
Epoch:  400 MSE:    4.84216


array([[ 0.        ],
       [ 0.81662363],
       [ 0.17688054],
       [-0.1273783 ],
       [ 0.14135358],
       [ 0.0166393 ],
       [-0.04392168],
       [-0.48614213],
       [-0.44977275]], dtype=float32)
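For comparison, the same batch gradient descent loop can be written in plain NumPy. This is a sketch on synthetic, noise-free data, not a reproduction of the housing run; the `_toy` names, `true_theta`, and the learning rate of 0.1 are assumptions chosen so the loop converges quickly. Here the bias column is left as ones, so the intercept can be learned.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: bias column of ones plus two standard-normal features.
m_toy = 500
X_toy = np.c_[np.ones((m_toy, 1)), rng.normal(size=(m_toy, 2))]
true_theta = np.array([[2.0], [0.5], [-1.5]])
y_toy = X_toy @ true_theta

lr_toy = 0.1
theta_gd = np.zeros((3, 1))

for epoch in range(500):
    error = X_toy @ theta_gd - y_toy            # predictions minus targets
    gradients = 2 / m_toy * X_toy.T @ error     # gradient of the MSE
    theta_gd = theta_gd - lr_toy * gradients    # step against the gradient

final_mse = float(np.mean((X_toy @ theta_gd - y_toy) ** 2))
print(theta_gd.ravel())
print(final_mse)
```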

### Feeding values to the computation graph

Instead of creating a _constant_ node and setting its values when the compute graph is created, 
we create a _placeholder_ node whose values will be set when the graph is run. 

Following is a toy example. First create the nodes and compute graph.

In [30]:
A = tf.placeholder(tf.float32, shape=(None, 3))
B = A + 5

Then evaluate node `B` (which depends on node `A`) after providing values for node `A`.

In [31]:
with tf.Session() as sess:
    B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
    B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})

print(B_val_1)
print(B_val_2)

[[ 6.  7.  8.]]
[[  9.  10.  11.]
 [ 12.  13.  14.]]
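A rough NumPy analogy for the placeholder example: the computation is defined once, and the data arrives only when it is run. The `add_five` function is a hypothetical stand-in, not a TensorFlow call; its shape check mimics `shape=(None, 3)`, where any number of rows is accepted but each row must have 3 columns.

```python
import numpy as np


def add_five(a_values):
    """Hypothetical stand-in for evaluating B = A + 5 with A supplied at run time."""
    a = np.asarray(a_values, dtype=np.float32)
    # Mimic shape=(None, 3): any number of rows, exactly 3 columns.
    if a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected shape (None, 3), got %s" % (a.shape,))
    return a + 5


print(add_five([[1, 2, 3]]))
print(add_five([[4, 5, 6], [7, 8, 9]]))
```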


This technique (setting values when the graph is run) is used in mini-batch gradient descent.

### Mini-batch Gradient Descent

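Replace the constant nodes `X` and `y` with placeholder nodes (also called `X` and `y`). Note that rebinding the Python names does not rewire nodes that already exist: `y_pred`, `error`, `mse`, `gradients` and `training_op` from the previous section still read from the old constant nodes, so they must be re-created from the placeholders for the fed batches to affect training.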

In [32]:
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None,     1), name="y")

The `fetch_batch` function retrieves the `batch_index`-th set of `batch_size` records from the datasets.

In [34]:
def fetch_batch(epoch, batch_index, batch_size):
    start = batch_index * batch_size
    end   = start       + batch_size
    X_batch = housing_data_plus_bias_scaled[start:end,]
    y_batch =                housing.target[start:end,].reshape((-1,1))
    return X_batch, y_batch
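The slicing arithmetic in `fetch_batch` can be traced with plain Python. The sketch below uses the dataset size reported earlier (20640 rows) and an assumed batch size of 1000; `m_rows`, `batch_size_toy` and `bounds` are illustrative names. It shows that the last batch is shorter than the others, because slicing past the end of an array is clipped.

```python
import math

m_rows = 20640          # number of rows in the housing data
batch_size_toy = 1000   # assumed batch size for this illustration
n_batches_toy = math.ceil(m_rows / batch_size_toy)

bounds = []
for batch_index in range(n_batches_toy):
    start = batch_index * batch_size_toy
    # Array slicing clips at the end of the data, which min() reproduces here.
    end = min(start + batch_size_toy, m_rows)
    bounds.append((start, end))

print(n_batches_toy)              # number of batches per epoch
print(bounds[0], bounds[-1])      # first and last (start, end) pairs
print(bounds[-1][1] - bounds[-1][0])  # size of the last batch
```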

Create a node to initialize `theta`.

In [35]:
init = tf.global_variables_initializer()

Define the function `mbgd`, which runs mini-batch gradient descent for `n_epochs` epochs with batches of `batch_size` rows and returns the resulting `theta`.

In [36]:
def mbgd(n_epochs, batch_size):

    n_batches = int(np.ceil(m / batch_size))

    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(n_epochs):
            for batch_index in range(n_batches):
                X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

        return theta.eval()

In [37]:
mbgd(1000, 1000)

array([[ 0.        ],
       [ 0.82963377],
       [ 0.11875467],
       [-0.26555124],
       [ 0.30571574],
       [-0.0045021 ],
       [-0.0393269 ],
       [-0.89984936],
       [-0.87050635]], dtype=float32)
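As a reference point, here is a minimal NumPy sketch of mini-batch gradient descent in which every update is computed from the current batch only. It is not the notebook's TensorFlow graph: `minibatch_gd`, the synthetic noise-free data, and the default learning rate of 0.05 are assumptions for illustration. The gradient is normalized by the number of rows in the batch rather than by the full dataset size.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: bias column of ones plus two standard-normal features.
m_toy = 1000
X_toy = np.c_[np.ones((m_toy, 1)), rng.normal(size=(m_toy, 2))]
true_theta = np.array([[1.0], [2.0], [-3.0]])
y_toy = X_toy @ true_theta


def minibatch_gd(X_data, y_data, n_epochs, batch_size, lr=0.05):
    """Hypothetical helper: mini-batch gradient descent for linear regression."""
    m_rows, n_cols = X_data.shape
    n_batches = int(np.ceil(m_rows / batch_size))
    theta = np.zeros((n_cols, 1))
    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            start = batch_index * batch_size
            X_batch = X_data[start:start + batch_size]
            y_batch = y_data[start:start + batch_size]
            error = X_batch @ theta - y_batch
            # Normalize by the rows in this batch, not by the full dataset size.
            gradients = 2 / len(X_batch) * X_batch.T @ error
            theta = theta - lr * gradients
    return theta


theta_mb = minibatch_gd(X_toy, y_toy, n_epochs=100, batch_size=100)
print(theta_mb.ravel())
```

With `batch_size` equal to the number of rows, the loop reduces to ordinary batch gradient descent with one update per epoch.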

In [None]:
mbgd(1000, 
     housing_data_plus_bias_scaled.shape[0])


