# Neural networks for regression via Tensorflow

### Agenda:
1. What is nonlinear regression?


2. Learn how to do linear and nonlinear regression using Tensorflow. 

    a. How do you set up the computation graph? 
    
    b. How do you train and test the model?
   
   
3. Explain the role of activation functions.

    a. No activation function means you are doing linear regression.
    
    b. What activation function would you choose?

### What is nonlinear regression?

Given $X$ and $Y$ data, find a function $F$ that best explains the relationship between $X$ and $Y$, i.e., find $F$ such that $Y \approx F(X)$. Regression is the act of finding a suitable $F$. When $F$ is restricted to be linear, you are performing linear regression. It is the multivariate form of your familiar question: "fit the best line to given data". When $F$ can be nonlinear, finding that function $F$ is called nonlinear regression.

Given: $({X}_1, {Y}_1), \ldots, ({X}_n, {Y}_n)$. 

Objective: Find $F$ that minimizes the error between $Y$ and $F(X)$ on the data, i.e., solve
$$ \underset{F}{\text{minimize}}\ \ \frac{1}{n}\sum_{i=1}^n \| {Y}_i - F({X}_i) \|^2. $$
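
The objective above can be evaluated for any candidate $F$. The following is a minimal NumPy sketch, assuming the data are stored as arrays with one row per data point; the helper name `mean_squared_error` and the two candidate functions are hypothetical and chosen only for illustration.

```python
import numpy as np

def mean_squared_error(F, X, Y):
    """Average of ||Y_i - F(X_i)||^2 over the n data points (rows of X and Y)."""
    residuals = Y - F(X)
    return np.mean(np.sum(residuals ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(100, 1))
Y = np.sin(X)  # data generated by a known function, for illustration only

# Two hypothetical candidates for F: the zero function and the true function.
candidate_zero = lambda x: np.zeros_like(x)
candidate_sine = lambda x: np.sin(x)

err_zero = mean_squared_error(candidate_zero, X, Y)
err_sine = mean_squared_error(candidate_sine, X, Y)
print("error of zero function:", err_zero)
print("error of sine function:", err_sine)
```

Regression amounts to searching over candidates for the one with the smallest such error.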

 
### What is a neural network and how can it do nonlinear regression?
A neural network defines a sequence of operations on the input that produces an output. In the diagram below, your data $X$ passes through the neural network to produce the output $Y$. What does each layer represent? Each layer takes the output of the previous layer, applies an affine transformation to it, and then passes the result through an **activation function** to compute its own output.

<img src="neuralNet.png"  width="700">

In essence, a neural network **parameterizes** the function $F$ in terms of the weights and biases in the neural network. In the above figure, the parameters are given by
$$\theta := ( W^1, b^1, W^2, b^2 ).$$
Call this parametric representation $F_\theta$. Then, the nonlinear regression via a neural network seeks $\theta$ that solves
$$ \underset{\theta}{\text{minimize}}\ \ J(\theta) := \frac{1}{n}\sum_{i=1}^n \| {Y}_i - F_\theta({X}_i) \|^2. $$
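
To make the parameterization concrete, here is a minimal NumPy sketch of $F_\theta$ and $J(\theta)$ for a network with one hidden layer. It assumes the row-vector convention $Z = X W^1 + b^1$, a relu activation in the hidden layer, and no activation on the output; the hidden width of 4 and the names `F_theta` and `J` are illustrative choices, not taken from the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def F_theta(X, theta):
    """Forward pass: affine map, activation, then a second affine map."""
    W1, b1, W2, b2 = theta
    Z = relu(X @ W1 + b1)   # hidden layer
    return Z @ W2 + b2      # output layer (no activation, by assumption)

def J(theta, X, Y):
    """Mean squared error of the network's predictions on the data."""
    residuals = Y - F_theta(X, theta)
    return np.mean(np.sum(residuals ** 2, axis=1))

rng = np.random.default_rng(0)
n_hidden = 4  # illustrative width
theta = (rng.normal(size=(1, n_hidden)), rng.normal(size=n_hidden),
         rng.normal(size=(n_hidden, 1)), rng.normal(size=1))

X = rng.uniform(-5, 5, size=(10, 1))
Y = np.sin(X)
predictions = F_theta(X, theta)
cost = J(theta, X, Y)
print("J(theta) at a random theta:", cost)
```

Training changes the entries of `theta` so that this cost decreases.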


### How does a neural network optimize over $\theta$?

A variety of optimizer routines are available; the simplest among them is **stochastic gradient descent**. At each iteration $k$, you sample one of your $n$ data points, say the $i$-th, and update the parameters in $\theta$ as follows.
$$ \theta_{k+1} := \theta_{k} - \alpha_k \nabla_\theta \| {Y}_i - F_\theta({X}_i) \|^2, $$
starting from a possibly random initial parameter vector $\theta_0$. Here, the $\alpha_k$'s define a sequence of stepsizes, and $\nabla_\theta$ denotes the gradient with respect to $\theta$, evaluated at $\theta_k$. The gradient of this single-sample error serves as a noisy estimate of $\nabla J$, the gradient of $J$.

1. Can you use one data point at a time to perform an update? If so, how should you sample the data?
2. Can you update $\theta$ using a batch of gradients computed on a batch of data? 
3. Can you have different step-sizes for different parameters within $\theta$?
4. How fast does it converge?

Machine learning research has examined each of these questions using both theoretical analysis and extensive simulation studies.
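
The update rule above can be sketched in NumPy. To keep the gradient short enough to write by hand, this sketch swaps the neural network for a hypothetical one-dimensional model $F_\theta(x) = wx + b$ with $\theta = (w, b)$, and assumes a constant stepsize $\alpha_k = 0.01$ and uniform sampling of the data index; none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise-free data from a known line, so the best parameters are w = 2, b = 1.
n = 200
X = rng.uniform(-5, 5, size=n)
Y = 2.0 * X + 1.0

w, b = 0.0, 0.0   # initial parameter vector theta_0
alpha = 0.01      # constant stepsize (an assumption of this sketch)

for k in range(5000):
    i = rng.integers(n)                  # sample one data point uniformly
    residual = Y[i] - (w * X[i] + b)
    # Gradient of (Y_i - (w X_i + b))^2 with respect to w and b.
    grad_w = -2.0 * residual * X[i]
    grad_b = -2.0 * residual
    w -= alpha * grad_w
    b -= alpha * grad_b

print("learned w = %.4f, b = %.4f" % (w, b))
```

For a neural network, the hand-written gradient is replaced by one computed through automatic differentiation, but the structure of the loop is the same.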


### In the parametric description of $F$, what is an activation function?

Without an activation function, the output of the neural network is a linear (more precisely, affine) function of $X$. To see why, assume that the activation function in the above network is the identity. Then $Z = X W^1 + b^1$, and 
$$ Y = Z W^2 + b^2 = ( X W^1 + b^1 ) W^2 + b^2 = X (W^1 W^2) + b^1 W^2 + b^2, $$
a linear function of $X$. By the same logic, you can show that no matter how many layers you have or how many neurons you have in these layers, the output of a neural network without an activation function is linear in $X$. You need a nonlinear activation function to learn complicated nonlinear functions.
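
The identity above can be checked numerically. This is a small NumPy sketch with arbitrarily chosen layer sizes (3 inputs, 4 hidden units, 2 outputs) and random weights, all of which are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two layers with identity activation.
two_layer = (X @ W1 + b1) @ W2 + b2

# The equivalent single affine map: X (W1 W2) + (b1 W2 + b2).
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

# With a relu in the hidden layer, the collapse no longer holds.
with_relu = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

print("identity activation collapses to one layer:", np.allclose(two_layer, one_layer))
print("relu activation collapses to one layer:", np.allclose(with_relu, one_layer))
```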

## Example: Nonlinear regression via Tensorflow.

Construct $n=1000$ data points $({X}_1, {Y}_1), \ldots, ({X}_n, {Y}_n)$, where ${Y}_i = F({X}_i)$. Obtain the ${X}$'s by sampling uniformly between $-5$ and $5$, $n$ times.

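Start with the customary imports. The code in this notebook uses the TensorFlow 1.x graph API ('tf.placeholder', 'tf.Session'), so it needs a 1.x installation, or the 'tf.compat.v1' module with v2 behavior disabled.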

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 

### Prepare the training data.

In [None]:
# Prepare training data in trainX and trainY.

n = 1000
trainX = np.random.uniform(-5, 5, size=[n, 1])
F = lambda x: np.sin(x) + 2 * np.cos(2 * x)
trainY = F(trainX)  # F is built from NumPy ufuncs, so it applies elementwise

# Draw a scatter plot of the sampled points.
plt.figure(1)
plt.scatter(trainX, trainY)
plt.show()

### Define the test data.

Create a data set of $m=200$ points from $[-5, 5]$ on which you will test the accuracy of the regressor. Always keep the training and testing data separate. If you evaluate on the same samples you trained on and seek only to minimize training error, you may overfit your regressor to the training samples without noticing. Overfitting means that you fit the regressor so closely to the training set that it performs markedly worse on samples it has never seen before. That is, your learnt function does not accurately capture the true relationship between $X$ and $Y$. 
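
Overfitting is easier to see on a small example. The sketch below is not part of the neural network workflow: it swaps in polynomial least-squares fits of degree 1 and degree 12 on a hypothetical training set of only 15 points, with added noise that is purely an assumption for illustration. The same target function as in the training cell is redefined here so that the block runs on its own.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
F = lambda x: np.sin(x) + 2 * np.cos(2 * x)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Small, noisy training set (illustrative assumption) and a separate test set.
x_train = rng.uniform(-5, 5, size=15)
y_train = F(x_train) + 0.3 * rng.normal(size=15)
x_test = rng.uniform(-5, 5, size=200)
y_test = F(x_test)

errors = {}
for degree in (1, 12):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    errors[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))
    print("degree %2d: train RMSE = %.3f, test RMSE = %.3f" % (degree, *errors[degree]))
```

The more flexible model always achieves a training error at least as low as the simpler one. Whether it also does better on the test points is exactly what the held-out data is there to reveal.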

In [None]:
m = 200
testX = np.random.uniform(-5, 5, size=[m, 1])
testY = F(testX)

plt.figure(1)
plt.scatter(testX, testY)
plt.show()

### Use Tensorflow to learn the function. 

Tensorflow allows you to construct a neural network as a computation graph. When you define the computation graph, it does not compute anything, but rather defines the sequence of operations that you will perform on the data. The data you will supply should be defined as placeholders, and the parameters defining the neural network should be defined as variables.

In [None]:
# Define the structure of the neural network as a computation graph in Tensorflow.
X = tf.placeholder(dtype=tf.float32, shape=[None, 1])

nHidden = 15

W1 = tf.Variable(tf.truncated_normal(shape=[1, nHidden]))
b1 = tf.Variable(tf.truncated_normal(shape=[nHidden]))
Z1 = tf.nn.relu(tf.matmul(X, W1) + b1)

W2 = tf.Variable(tf.truncated_normal(shape=[nHidden, 1]))
b2 = tf.Variable(tf.truncated_normal(shape=[1]))
Yhat = tf.matmul(Z1, W2) + b2

A few comments about the above code:

1. $X$ will only be defined at runtime, and is left unspecified, except possibly its size. Notice that the first among the shape parameters is 'None'. This indicates to Tensorflow that we may pass multiple $X$'s to train or test the neural network. The second shape parameter being 1, our input data is scalar.

2. How many hidden layers does this neural network have? There is one hidden layer with 15 neurons.

3. What constitutes the parameters $\theta$ for this neural network that we aim to optimize over? Weights W1, W2 and biases b1, b2.

4. The activation function used here is 'relu'. We will learn more about activation functions.
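
The shapes in these comments can be checked without Tensorflow. The following NumPy sketch mirrors the layer shapes of the graph above (it is an illustration, not the Tensorflow graph itself; the dictionary `params` and the function `forward` are hypothetical names). It counts the entries of $\theta$ and shows that the same network accepts any number of input rows, which is what the 'None' in the placeholder shape allows.

```python
import numpy as np

n_hidden = 15
rng = np.random.default_rng(0)

# Parameters with the same shapes as W1, b1, W2, b2 in the graph above.
params = {
    "W1": rng.normal(size=(1, n_hidden)),
    "b1": rng.normal(size=n_hidden),
    "W2": rng.normal(size=(n_hidden, 1)),
    "b2": rng.normal(size=1),
}
n_params = sum(p.size for p in params.values())

def forward(X):
    Z1 = np.maximum(X @ params["W1"] + params["b1"], 0.0)  # relu hidden layer
    return Z1 @ params["W2"] + params["b2"]

out_single = forward(np.zeros((1, 1)))  # one scalar input
out_batch = forward(np.zeros((7, 1)))   # seven scalar inputs at once
print(n_params, out_single.shape, out_batch.shape)  # 46 (1, 1) (7, 1)
```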

### Define the optimization routine for training the neural network.

We aim to find $\theta$ that minimizes the discrepancy between 'trainY' and $F_\theta$ applied on 'trainX'. Define the target $Y$ as a placeholder and define an optimizer to minimize the mean squared distance between the neural network predictions $\hat{Y}$ and its target $Y$. Similar to $X$, the placeholder $Y$ can have variable shape, depending upon how many data points are used to train or test the neural network. Therefore, define the first argument in its shape parameter as 'None', and the second argument as one. 

**Question.** Why is the second argument in the shape of $Y$ equal to 1?

In [None]:
Y = tf.placeholder(dtype=tf.float32, shape=[None, 1])

# Define a loss function.
loss = tf.losses.mean_squared_error(labels=Y, predictions=Yhat)

# Define the optimizer.
optimizer = tf.train.AdamOptimizer(learning_rate=0.35).minimize(loss)

**Question.** Can you relate 'learning_rate' to our iterative description of how $\theta$'s are updated?

### Define a metric to judge the accuracy of prediction via neural network.

To evaluate the performance of your regressor, you need to define an accuracy metric. A commonly used metric for nonlinear regression is "root mean squared error" (RMSE). On any test data, if the neural network produces $\hat{Y}_1, \ldots, \hat{Y}_m$, and the true values are $Y_1, \ldots, Y_m$, then the RMSE of the prediction is given by
$$ \text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^m (Y_i - \hat{Y}_i)^2}.$$
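
As a quick check of the formula, here is a minimal NumPy sketch for scalar targets. The function name `rmse` and the three sample values are illustrative and independent of the Tensorflow metric used in the next cell.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Errors are 0, 0 and 2, so the RMSE is sqrt((0 + 0 + 4) / 3).
value = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print("RMSE = %.4f" % value)
```

RMSE is measured in the same units as $Y$, which makes it easier to interpret than the mean squared error.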


In [None]:
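# tf.metrics.root_mean_squared_error returns the pair (rmse, update_op). Running
# update_op accumulates the fed data and returns the RMSE over everything seen so far.
_, accuracy = tf.metrics.root_mean_squared_error(labels=Y, predictions=Yhat)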

**Question.** Why don't we use the function 'accuracy' as the 'loss' function and seek to minimize the RMSE?

### Train the regressor in a Tensorflow session.

Having defined the computation graph, we now train the neural network. In a session, you initialize an instance of the computation graph, pass the data ${X}, {Y}$ to it multiple times (each pass is called an 'epoch'), and run the optimizer on each pass. 

In [None]:
sess = tf.Session()
with sess.as_default():

    # Initialize the computation graph. 
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())
    
    print("Started the training module.")

    # Define the number of epochs.
    nEpochs = 4000
    
    for epoch in range(nEpochs):

        lossEpoch = 0
        # In each epoch, use 'optimizer' to reduce the 'loss' over the entire data. Make sure to pass
        # the appropriate data to the placeholders.
        _, lossEpoch = sess.run([optimizer, loss], feed_dict={X: trainX, Y: trainY})
         
        print("Epoch: %d, Loss: %1.1f" % (epoch + 1, lossEpoch))

    print("End of training process...")

If you don't think your neural network has converged, play with the learning rate, number of epochs, activation functions, etc.

#### Test the accuracy on test data and visualize it.

In [None]:
    # You are still inside the session. Notice the indentation!
    
    # Output the accuracy of the regressor on the test data.
    predictedY, accuracyOfPrediction = sess.run([Yhat, accuracy], feed_dict={X: testX, Y: testY})
    print("RMSE of regressor on test data = %1.2f" % accuracyOfPrediction)
    
    plt.figure(2)
    plt.scatter(testX, testY)
    plt.scatter(testX, np.array(predictedY))
    plt.show()

### Exercises: Check the result after making each of these changes. Make a note of your observations. Change back to the original setting for the next exercise.

1. Change the number of neurons in the hidden layer to 3.
2. Change back the number of hidden neurons to 15 and then remove the activation function in the hidden layer.
3. Keep the activation in the hidden layer as 'relu', and additionally use a relu activation on the output layer.
4. Remove activation in the output layer, but use a 'tanh' activation in the hidden layer.
5. Vary the learning rate and check the quality of the training.