# Linear Regression with TensorFlow

TensorFlow operations (also called **ops** for short) can take any number of inputs and produce any number of outputs. For example, the addition and multiplication ops each take two inputs and produce one output. 

Constants and variables take no input (they are called **source ops**).

The inputs and outputs are **multidimensional arrays**, called `tensors` (hence the name **“tensor flow”**).

Just like NumPy arrays, tensors have a type and a shape. In fact, in the Python API the values of evaluated tensors are returned as NumPy ndarrays. They typically contain floats, but you can also use them to carry strings (arbitrary byte arrays).

For example, the following code manipulates 2D arrays to perform Linear Regression on the California housing dataset.


It starts by fetching the dataset; then it adds an extra bias input feature (`x0 = 1`) to all training instances (it does so using NumPy, so it runs immediately).

In [None]:
import tensorflow as tf
import numpy as np
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
m,n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m,1)), housing.data]

Then it creates two TensorFlow constant nodes, `X` and `y`, to hold this data and the targets, and it uses some of the matrix operations provided by TensorFlow to define theta. 

These matrix functions—`transpose()`, `matmul()`, and `matrix_inverse()`— are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. 

You may recognize that the definition of `theta` corresponds to the Normal Equation:
$$\theta = (X^{T} X)^{-1}X^{T}y$$

In [None]:
X = tf.constant(housing_data_plus_bias, dtype = tf.float32, name = 'X')
y = tf.constant(housing.target.reshape(-1,1), dtype = tf.float32, name = 'y')
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT,X)),XT), y)  
y_pred = tf.matmul(X, theta)

**Note** that `housing.target` is a 1D array, but we need to reshape it to a column vector to compute `theta`. 
Recall that NumPy’s `reshape()` function accepts `-1` (meaning “unspecified”) for one of the dimensions: that dimension will be computed based on the array’s length and the remaining dimensions.

In [None]:
print(housing.target.shape)
print(housing.target.reshape(-1,1).shape)

Finally, the code creates a session and uses it to evaluate `theta` and `y_pred`.

In [None]:
with tf.Session() as sess:
    theta_value = theta.eval()
    y_pred_value = y_pred.eval()
    
print(theta_value)

In [None]:
import matplotlib.pyplot as plt
# This IPython magic command displays plots inline:
%matplotlib inline 

id_sample = range(100)
plt.figure(figsize=(16, 4))
plt.plot(housing.target[id_sample], 'b', label = 'housing.target')
plt.plot(y_pred_value[id_sample], 'r', label = 'y_pred_values')
plt.ylabel('House price')
plt.xlabel('Samples')
plt.legend()
plt.grid(True)
plt.show()

The main benefit of this code versus computing the Normal Equation directly using NumPy is that TensorFlow will automatically run this on your GPU card if you have one (and if you installed TensorFlow with GPU support).

The same parameters can also be computed with NumPy or Scikit-Learn:

In [None]:
# Using NumPy
X = housing_data_plus_bias
y = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

print(theta_numpy)
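
Explicitly inverting $X^{T}X$ can be numerically fragile when that matrix is ill-conditioned (for example, when features are strongly correlated). As a hedged alternative that is not part of the original notebook, `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The sketch below uses a small synthetic dataset so that it runs on its own; the names `X_demo`, `y_demo`, and `true_theta` are illustrative assumptions:

```python
import numpy as np

# Synthetic data: an illustrative stand-in for the housing arrays.
rng = np.random.RandomState(42)
m_demo, n_demo = 200, 3
features = rng.rand(m_demo, n_demo)
X_demo = np.c_[np.ones((m_demo, 1)), features]   # add the bias column x0 = 1
true_theta = np.array([[4.0], [3.0], [-2.0], [0.5]])
y_demo = X_demo.dot(true_theta) + 0.01 * rng.randn(m_demo, 1)

# Normal Equation with an explicit inverse, as in the NumPy cell above.
theta_inv = np.linalg.inv(X_demo.T.dot(X_demo)).dot(X_demo.T).dot(y_demo)

# Least-squares solver: same minimizer, no explicit inverse.
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(
    X_demo, y_demo, rcond=None)

print(np.allclose(theta_inv, theta_lstsq, atol=1e-6))
print(theta_lstsq)
```

On well-conditioned data the two results should agree closely; the solver mainly pays off when the inverse becomes unreliable.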

In [None]:
# Using Scikit-Learn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing.data, housing.target.reshape(-1, 1))

print(np.r_[lin_reg.intercept_.reshape(-1, 1), lin_reg.coef_.T])

# Implementing Gradient Descent

Let’s try using `Batch Gradient Descent` instead of the `Normal Equation`. 
First we will do this by manually computing the gradients, then we will use `TensorFlow’s autodiff` feature to let TensorFlow compute the gradients automatically, and finally we will use a couple of TensorFlow’s out-of-the-box optimizers.

**Note**: When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower. You can do this using TensorFlow, NumPy, Scikit-Learn’s StandardScaler, or any other solution you prefer. The cell below does it with `StandardScaler`; the training code in the following sections assumes that this normalization has already been done.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_housing_data = scaler.fit_transform(housing.data)
scaled_housing_data_plus_bias = np.c_[np.ones((m, 1)), scaled_housing_data]

print(scaled_housing_data_plus_bias.mean(axis=0))
print(scaled_housing_data_plus_bias.mean(axis=1))
print(scaled_housing_data_plus_bias.mean())
print(scaled_housing_data_plus_bias.shape)
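
For readers who prefer plain NumPy, the same standardization can be sketched by hand: subtract each column's mean and divide by its standard deviation. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The array `raw` below is an illustrative assumption, not the housing data:

```python
import numpy as np

rng = np.random.RandomState(0)
# Illustrative features on very different scales.
raw = np.c_[rng.rand(50) * 1000.0, rng.rand(50) * 0.01, rng.randn(50) + 5.0]

mean = raw.mean(axis=0)
std = raw.std(axis=0)                 # population std (ddof=0)
scaled = (raw - mean) / std           # each column: mean 0, std 1

# Add the bias column after scaling so that it stays equal to 1.
scaled_plus_bias = np.c_[np.ones((raw.shape[0], 1)), scaled]

print(scaled_plus_bias.mean(axis=0))
print(scaled_plus_bias.std(axis=0))
```

Because the bias column is added after scaling, its mean stays at 1 and its standard deviation at 0, while every other column has mean 0 and standard deviation 1.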

## Manually Computing the Gradients

The following code should be fairly self-explanatory, except for a few new elements:
- The `random_uniform()` function creates a node in the graph that will generate a tensor containing random values, given its shape and value range, much like NumPy’s `rand()` function.
- The `assign()` function creates a node that will assign a new value to a variable. In this case, it implements the Batch Gradient Descent step $\theta^{(\text{next step})} = \theta - \eta \nabla_{\theta}\,\text{MSE}(\theta)$, where $\eta$ is the learning rate.
- The main loop executes the training step over and over again (`n_epochs` times), and every 100 iterations it prints out the current Mean Squared Error (`mse`). You should see the MSE go down at every iteration.
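
For reference, the gradient that the code below computes by hand follows from the definition of the MSE. With $m$ training instances, and writing $\eta$ for the learning rate (`learning_rate` in the code):

$$\text{MSE}(\theta) = \frac{1}{m}\,(X\theta - y)^{T}(X\theta - y)$$

$$\nabla_{\theta}\,\text{MSE}(\theta) = \frac{2}{m}\,X^{T}(X\theta - y)$$

$$\theta^{(\text{next step})} = \theta - \eta\,\nabla_{\theta}\,\text{MSE}(\theta)$$

The second line is what the `gradients` node evaluates, with `error` standing for $X\theta - y$.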

In [None]:
# reset the tensorflow graph
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype = tf.float32, name = 'X')
y = tf.constant(housing.target.reshape(-1,1), dtype = tf.float32, name = 'y')
theta  =  tf.Variable(tf.random_uniform([n+1, 1], -1.0, 1.0), name = 'theta')
y_pred = tf.matmul(X, theta, name = 'predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name = 'mse')

gradients = 2.0/m*tf.matmul(tf.transpose(X), error)
training_op = tf.assign(theta, theta - learning_rate*gradients)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)        
    y_pred_value = y_pred.eval()    
    best_theta = theta.eval()

print("best_theta:")    
print(best_theta)
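
The same update rule can be sketched in plain NumPy, which makes it easy to inspect the MSE at every step. This is a minimal sketch on synthetic, already standardized data; `X_demo`, `y_demo`, `true_theta`, and `mse_history` are illustrative names, not part of the notebook:

```python
import numpy as np

rng = np.random.RandomState(42)
m_demo, n_demo = 500, 3
# Synthetic, already standardized features (illustrative assumption).
features = rng.randn(m_demo, n_demo)
X_demo = np.c_[np.ones((m_demo, 1)), features]
true_theta = np.array([[2.0], [1.5], [-0.5], [0.3]])
y_demo = X_demo.dot(true_theta) + 0.1 * rng.randn(m_demo, 1)

n_epochs = 1000
learning_rate = 0.01
# Random start in [-1, 1), in the spirit of tf.random_uniform above.
theta = rng.uniform(-1.0, 1.0, size=(n_demo + 1, 1))

mse_history = []
for epoch in range(n_epochs):
    error = X_demo.dot(theta) - y_demo
    mse_history.append(float(np.mean(np.square(error))))
    gradients = 2.0 / m_demo * X_demo.T.dot(error)   # same formula as the graph
    theta = theta - learning_rate * gradients        # Batch Gradient Descent step

print("first MSE:", mse_history[0], "last MSE:", mse_history[-1])
print(theta)
```

A run in which the MSE grows to `inf` or `nan` in a loop like this usually points to unscaled inputs or a learning rate that is too large.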

**Note**: something went wrong here; I still need to check why the MSE is `nan`.

In [None]:
id_sample = range(100)
plt.figure(figsize=(16, 4))
plt.plot(housing.target[id_sample], 'b', label = 'housing.target')
plt.plot(y_pred_value[id_sample], 'r', label = 'y_pred_values')
plt.ylabel('House price')
plt.xlabel('Samples')
plt.legend()
plt.grid(True)
plt.show()

## Using autodiff

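The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE). In the case of Linear Regression, it is reasonably easy, but if you had to do this with deep neural networks you would get quite a headache: it would be tedious and error-prone. TensorFlow’s autodiff feature can compute the gradients for you: in the code below, `tf.gradients(mse, [theta])` takes an op (here `mse`) and a list of variables (here just `theta`) and creates a list of ops, one per variable, that compute the gradients of the op with respect to each variable.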

In [None]:
# reset the tensorflow graph
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype = tf.float32, name = 'X')
y = tf.constant(housing.target.reshape(-1,1), dtype = tf.float32, name = 'y')
theta  =  tf.Variable(tf.random_uniform([n+1, 1], -1.0, 1.0), name = 'theta')
y_pred = tf.matmul(X, theta, name = 'predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name = 'mse')

gradients = tf.gradients(mse, [theta])[0]
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 200 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    y_pred_value = y_pred.eval()
    best_theta = theta.eval()
print("best_theta:")     
print(best_theta)
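
When a gradient is derived by hand, as in the previous section, one way to catch mistakes is to compare it with a finite-difference estimate. This is only a sanity check and not how TensorFlow's autodiff works (TensorFlow uses reverse-mode autodiff on the graph). The helper names and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of the linear model X.dot(theta)."""
    return float(np.mean(np.square(X.dot(theta) - y)))

def analytic_gradient(theta, X, y):
    """Hand-derived gradient: 2/m * X^T (X theta - y)."""
    m = X.shape[0]
    return 2.0 / m * X.T.dot(X.dot(theta) - y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        step = np.zeros_like(theta)
        step[i, 0] = eps
        grad[i, 0] = (mse(theta + step, X, y) - mse(theta - step, X, y)) / (2 * eps)
    return grad

rng = np.random.RandomState(0)
X_demo = np.c_[np.ones((100, 1)), rng.randn(100, 2)]
y_demo = rng.randn(100, 1)
theta_demo = rng.uniform(-1.0, 1.0, size=(3, 1))

max_diff = np.max(np.abs(analytic_gradient(theta_demo, X_demo, y_demo)
                         - numerical_gradient(theta_demo, X_demo, y_demo)))
print("max abs difference:", max_diff)
```

A large difference would suggest an error in the derivation. Finite differences need two cost evaluations per parameter, which is one reason they are used for checking rather than for training.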

In [None]:
id_sample = range(100)
plt.figure(figsize=(16, 4))
plt.plot(housing.target[id_sample], 'b', label = 'housing.target')
plt.plot(y_pred_value[id_sample], 'r', label = 'y_pred_values')
plt.ylabel('House price')
plt.xlabel('Samples')
plt.legend()
plt.grid(True)
plt.show()

## Using an Optimizer

So TensorFlow computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a Gradient Descent optimizer.

In [None]:
# reset the tensorflow graph
tf.reset_default_graph()

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype = tf.float32, name = 'X')
y = tf.constant(housing.target.reshape(-1,1), dtype = tf.float32, name = 'y')
theta  =  tf.Variable(tf.random_uniform([n+1, 1], -1.0, 1.0), name = 'theta')
y_pred = tf.matmul(X, theta, name = 'predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name = 'mse')

optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate)
#optimizer = tf.train.MomentumOptimizer(learning_rate = learning_rate, momentum = 0.9)

training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 200 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)
    y_pred_value = y_pred.eval()
    best_theta = theta.eval()
print("best_theta:")   
print(best_theta)
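
The commented-out `MomentumOptimizer` line hints at a variant of Gradient Descent that accumulates past gradients. The NumPy sketch below illustrates the classical momentum update on synthetic data; it is meant to convey the idea only and does not claim to reproduce TensorFlow's implementation. The `train` helper and the variable `velocity` are illustrative names:

```python
import numpy as np

rng = np.random.RandomState(1)
m_demo = 500
X_demo = np.c_[np.ones((m_demo, 1)), rng.randn(m_demo, 3)]
true_theta = np.array([[2.0], [1.5], [-0.5], [0.3]])
y_demo = X_demo.dot(true_theta)      # noise-free targets, so the optimum has MSE 0

def train(momentum, n_epochs=100, learning_rate=0.01):
    """Run gradient descent with a momentum term and return the final MSE."""
    theta = np.zeros((4, 1))
    velocity = np.zeros_like(theta)  # running accumulation of gradients
    for _ in range(n_epochs):
        error = X_demo.dot(theta) - y_demo
        gradients = 2.0 / m_demo * X_demo.T.dot(error)
        velocity = momentum * velocity + gradients
        theta = theta - learning_rate * velocity
    return float(np.mean(np.square(X_demo.dot(theta) - y_demo)))

mse_plain = train(momentum=0.0)      # momentum 0 reduces to plain Gradient Descent
mse_momentum = train(momentum=0.9)
print("plain GD MSE:", mse_plain)
print("momentum MSE:", mse_momentum)
```

With the same learning rate and number of epochs, the momentum run should end at a noticeably lower MSE on this problem, which is the usual motivation for trying it.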

In [None]:
id_sample = range(100)
plt.figure(figsize=(16, 4))
plt.plot(housing.target[id_sample], 'b', label = 'housing.target')
plt.plot(y_pred_value[id_sample], 'r', label = 'y_pred_values')
plt.ylabel('House price')
plt.xlabel('Samples')
plt.legend()
plt.grid(True)
plt.show()

# Feeding Data to the Training Algorithm

Let’s try to modify the previous code to implement `Mini-batch Gradient Descent`. For this, we need a way to replace `X` and `y` at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes. 

These nodes are special because they don’t actually perform any computation; they just output the data you tell them to output at runtime. They are typically used to pass the training data to TensorFlow during training. If you don’t specify a value at runtime for a placeholder, you get an exception.

In [None]:
tf.reset_default_graph()

To create a placeholder node, you must call the `placeholder()` function and specify the output tensor’s data type. Optionally, you can also specify its shape, if you want to enforce it. If you specify `None` for a dimension, it means “any size.” For example, the following code creates a placeholder node `A`, and also a node `B = A + 5`.

In [None]:
A = tf.placeholder(tf.float32, shape = (None, 3))
B = A + 5
with tf.Session() as sess:
    B_val_1 = B.eval(feed_dict = {A: [[1,2,3]]})
    B_val_2 = B.eval(feed_dict = {A: [[1,2,3], [11,12,13]]})
print(B_val_1)
print(B_val_2)

When we evaluate `B`, we pass a `feed_dict` to the `eval()` method that specifies the value of `A`. Note that `A` must have rank 2 (i.e., it must be two-dimensional) and there must be three columns (or else an exception is raised), but it can have any number of rows.
```python
with tf.Session() as sess:
    B_val_3 = B.eval(feed_dict = {A: [[1,2]]}) # this is wrong, feed must be three columns
print(B_val_3)
```
<span style="color:red"> ValueError: Cannot feed value of shape (1, 2) for Tensor 'Placeholder_1:0', which has shape '(?, 3)'</span>


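To implement Mini-batch Gradient Descent, we only need to tweak the existing code slightly. First, change the definition of `X` and `y` in the construction phase to make them placeholder nodes. Then define the batch size and compute the total number of batches. Finally, in the execution phase, fetch the mini-batches one by one and provide their values via the `feed_dict` parameter when evaluating a node that depends on them: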

In [None]:
tf.reset_default_graph() # reset the graph

n_epochs = 1000
learning_rate = 0.01
batch_size = 5000
n_batches = int(np.ceil(m/batch_size))

X = tf.placeholder(tf.float32, shape=(None, n+1), name = 'X') # <-- change
y = tf.placeholder(tf.float32, shape=(None,1), name = 'y')  # <-- change

theta  =  tf.Variable(tf.random_uniform([n+1, 1], -1.0, 1.0), name = 'theta')
y_pred = tf.matmul(X, theta, name = 'predictions')
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name = 'mse')

optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate)
#optimizer = tf.train.MomentumOptimizer(learning_rate = learning_rate, momentum = 0.9)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()


def fetch_batch(epoch, batch_index, batch_size):
    np.random.seed(epoch * n_batches + batch_index)  
    indices = np.random.randint(m, size=batch_size)  
    X_batch = scaled_housing_data_plus_bias[indices] 
    y_batch = housing.target.reshape(-1, 1)[indices] 
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval(feed_dict = {X:scaled_housing_data_plus_bias, y:housing.target.reshape(-1, 1)}))        
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)    
            sess.run(training_op, feed_dict = {X:X_batch, y:y_batch})            
    y_pred_value = y_pred.eval(feed_dict = {X:scaled_housing_data_plus_bias, y:housing.target.reshape(-1, 1)})
    best_theta = theta.eval()

In [None]:
print("best_theta:")   
print(best_theta)
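
The `fetch_batch()` function above draws indices with `np.random.randint`, that is, with replacement, so within one epoch some instances may be drawn several times and others not at all. A common alternative (an assumption here, not what the notebook does) is to shuffle the indices once per epoch and slice them into consecutive batches. `iterate_minibatches` is a hypothetical helper name, and the tiny arrays are for illustration only:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover every row exactly once."""
    indices = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.RandomState(42)
X_demo = np.arange(24, dtype=np.float64).reshape(12, 2)   # 12 rows, 2 features
y_demo = np.arange(12, dtype=np.float64).reshape(-1, 1)

batch_sizes = []
seen_targets = []
for X_batch, y_batch in iterate_minibatches(X_demo, y_demo, batch_size=5, rng=rng):
    batch_sizes.append(X_batch.shape[0])
    seen_targets.extend(y_batch.ravel().tolist())

print(batch_sizes)                                        # the last batch is smaller
print(sorted(seen_targets) == y_demo.ravel().tolist())    # every row seen once
```

In the training loop, each yielded pair would be passed through `feed_dict` in the same way as the output of `fetch_batch()`.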


