# Neural networks

At this point in the semester, you have learned about neural networks, how they can be used to find nonlinear decision boundaries for classification, and how they can be used for regression. In completing this notebook, you will become more familiar with some important concepts in machine learning:

- __Backpropagation algorithm__
    - Implement the forward and backward pass for the neural network
- __Batch gradient descent__
    - Choose the correct step size for gradient descent
    - Understand the role of step size in convergence of the backprop learning algorithm
- __Decision boundary geometry__
    - Understand how multi-layer perceptrons create nonlinear decision boundaries
- __Regularization and generalization__
    - Understand the connection between parameter count and dataset size
    - Regularize a network so that it can generalize well
- __Applications__
    - Train your own neural network on digits data, or a dataset of your choice
    - Evaluate the network

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import util
import runClassifier
import datasets
import matplotlib.pyplot as plt
import sklearn
from sklearn.datasets import make_moons

import neuralnet

## Noisy, linearly inseparable data

For the first part of this project, we will use the dataset generated below. It consists of points randomly sampled from two interleaving half-circles. As you can see, there is significant noise in the data, and the classes are not linearly separable.

In [None]:
# Generate a dataset and plot it
np.random.seed(0)
X, Y = make_moons(200, noise=0.20)
plt.scatter(X[:,0], X[:,1], s=40, c=Y, cmap=plt.cm.Spectral)

## Forward pass

We have provided a partial implementation of a neural network for binary classification. In this project we will work on binary classification using three-layer networks. Our network will take in a number of inputs that we define, and it will produce an output in 2 dimensions. One output is the probability of the positive class, and the other is the probability of the negative class.
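
To make "an output in 2 dimensions" concrete, here is a minimal sketch of one common way to turn two raw scores into two class probabilities that sum to 1, the softmax function. That the provided network uses softmax is an assumption on our part, and `softmax` and `example_scores` are hypothetical names with made-up numbers; this is not code from `neuralnet`.

```python
import numpy as np

def softmax(scores):
    """Map each row of raw scores to probabilities that sum to 1."""
    scores = np.atleast_2d(scores)
    # Subtracting the row maximum avoids overflow in exp without changing the result.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical raw scores for two inputs (made-up numbers, not from the notebook).
example_scores = np.array([[2.0, 0.0], [-1.0, 1.0]])
probs = softmax(example_scores)
print(probs)              # one row per input, one column per class
print(probs.sum(axis=1))  # each row sums to 1
```

Because each row sums to 1, knowing the probability of one class determines the probability of the other.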

__TASK 1:__ Given an input matrix `X` containing our input features, write code to execute the forward pass for each input and return the neural network's outputs for the probability of each class. Your code will be in the method `forward_prop`.

Now we will test out your `forward_prop` implementation to make sure it works. You should see `array([[0.56129696, 0.43870304]])` after running this cell.

In [None]:
nn = neuralnet.ThreeLayerNet()
nn.init_weights_task1()
nn.forward_prop(X[0])

## Backpropagation

Now our network can produce an output given an input. We can then compare the predicted output with the target output and calculate the loss. The error is back-propagated through the network so that we can update our weights by batch gradient descent, where the function we are minimizing is the loss of our network on a given dataset.

__TASK 2:__ Now that you know how to compute forward propagation, your next task is to perform backprop. Compute the change in the weights and biases in the `train` method. You should define `dW1`, `db1`, `dW2`, and `db2` so that our network correctly backpropagates and performs an update of the parameters. Check your implementation below. Your loss after 19000 iterations should be around 0.07.
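
Before trusting hand-derived gradients such as `dW1` or `dW2`, it can help to compare them against a finite-difference estimate. The sketch below shows the idea on a toy loss whose gradient is known exactly. `numerical_gradient` and `toy_loss` are hypothetical helpers written for this illustration and are not part of `neuralnet`.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        # Perturb one entry of w at a time, in both directions.
        step = np.zeros_like(w)
        step[idx] = eps
        grad[idx] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

def toy_loss(w):
    """Toy loss f(w) = sum(w**2), whose exact gradient is 2 * w."""
    return np.sum(w ** 2)

w0 = np.array([[1.0, -2.0], [0.5, 3.0]])
analytic = 2 * w0
numeric = numerical_gradient(toy_loss, w0)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be tiny when the analytic gradient is correct
```

The same comparison could be applied to a network's loss as a function of a single weight matrix, at the cost of two loss evaluations per weight.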

In [None]:
nn = neuralnet.ThreeLayerNet()
iterations,losses = nn.train(X,Y,hdim=3,print_loss=True)

So now we can minimize the loss on a dataset of our choice. Let's see how our neural network separated the classes by visualizing the decision boundary.

In [None]:
nn.plot_decision_boundary(X,Y)

Our neural network was able to find a decision boundary that successfully separates the classes.

## Learning rate and convergence

Experiment with a few different values for the learning rate `eta`. Each cell below makes a simple plot of loss versus the number of iterations.
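
As a rough intuition for what to look for, the sketch below runs gradient descent on a one-dimensional toy loss, f(w) = 0.5 * w**2, with three step sizes. The function `run_gradient_descent` and the specific values of `eta` are illustrative assumptions for this toy problem only; they are not the values to use with `nn.train`.

```python
def run_gradient_descent(eta, w0=1.0, steps=50):
    """Minimize f(w) = 0.5 * w**2 (whose gradient is w) and return the loss history."""
    w = w0
    history = []
    for _ in range(steps):
        history.append(0.5 * w ** 2)
        w = w - eta * w  # gradient descent update: w <- w - eta * f'(w)
    return history

too_large = run_gradient_descent(eta=2.5)    # each step multiplies w by -1.5
too_small = run_gradient_descent(eta=0.001)  # each step multiplies w by 0.999
reasonable = run_gradient_descent(eta=0.5)   # each step multiplies w by 0.5

for name, hist in [("too large", too_large),
                   ("too small", too_small),
                   ("reasonable", reasonable)]:
    print(name, hist[0], hist[-1])  # first and last loss for each step size
```

On this toy loss the update simply rescales `w` by `1 - eta` at every step, which is why the three runs grow, barely move, and shrink quickly, respectively. A real network's loss surface is not this simple, so treat the picture as a guide rather than a rule.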

__TASK 3:__ Set `eta` to a (nonzero) value for which the neural network does not converge.

In [None]:
nn = neuralnet.ThreeLayerNet()
iterations,losses = nn.train(X,Y,eta=0, output_dim=2, hdim=2,print_loss=False)
plt.plot(iterations,losses)

__TASK 4:__ Set `eta` to a (nonzero) value for which the neural network converges too slowly.

In [None]:
nn = neuralnet.ThreeLayerNet()
iterations,losses = nn.train(X,Y,eta=0, output_dim=2, hdim=2,print_loss=False)
plt.plot(iterations,losses)

__TASK 5:__ Set `eta` to a (nonzero) value for which the neural network converges at an appropriate rate.

In [None]:
nn = neuralnet.ThreeLayerNet()
iterations,losses = nn.train(X,Y,eta=0, output_dim=2, hdim=2,print_loss=False)
plt.plot(iterations,losses)

__QUESTION 1:__ For each of the 3 curves, explain how `eta` affected the shape of the curve in terms of the gradient descent update.

__ANSWER 1:__ 

## Decision boundary geometry

Let's try to classify our dataset with fewer hidden nodes, and see whether we really needed 3 units in the hidden layer.

In [None]:
nn = neuralnet.ThreeLayerNet()
iterations,losses = nn.train(X,Y,hdim=2,print_loss=True)
nn.plot_decision_boundary(X,Y)

__QUESTION 2:__ You should find that this network does not split the data. Our inputs were in 2 dimensions, and our hidden units were also in 2 dimensions. Explain the geometry of the decision boundary, why it has this shape, and why the algorithm does not find a decision boundary that separates the data.

__ANSWER 2:__ 

## Number of hidden units

Let's now get a sense of how varying the size of the hidden layer affects the result.
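
It is also worth keeping the parameter count in mind as the hidden layer grows. The sketch below counts weights and biases for a network with one fully connected hidden layer (one weight matrix and one bias vector per layer, matching the `W1`, `b1`, `W2`, `b2` naming used earlier). `count_parameters` is a hypothetical helper, not part of `neuralnet`.

```python
def count_parameters(input_dim, hidden_dim, output_dim):
    """Weights plus biases for a network with one fully connected hidden layer."""
    hidden_params = input_dim * hidden_dim + hidden_dim     # W1 and b1
    output_params = hidden_dim * output_dim + output_dim    # W2 and b2
    return hidden_params + output_params

n_examples = 200  # number of points drawn by make_moons in this notebook
for hdim in [1, 2, 3, 4, 5, 20, 50]:
    n_params = count_parameters(input_dim=2, hidden_dim=hdim, output_dim=2)
    # Print hidden size, parameter count, and parameters per training example.
    print(hdim, n_params, n_params / n_examples)
```

With 2 inputs and 2 outputs the count works out to `5 * hdim + 2`, so the largest setting has more parameters than there are training points.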

In [None]:
nn = neuralnet.ThreeLayerNet()

plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % hdim)
    iterations,losses = nn.train(X,Y,hdim=hdim)
    nn.plot_decision_boundary(X,Y)
plt.show()

__QUESTION 3:__ As we increase the number of hidden units, what happens to our ability to generalize? Keep in mind the data-generating distribution that describes our data, from the invocation of `make_moons` above.

__ANSWER 3:__ 

## Applications

__TASK 6:__ In the cell below, write a routine for training a neural network on `DigitData`. The goal is to train a network that can generalize well. Your network may be any architecture. Plot the loss curve.

- If you like, you may choose a different problem, with a dataset of your choice.
- You may use any technique to optimize your network.
- We don't expect anything fancy, but if you do something impressive, you may earn extra credit.
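
Since the goal is generalization, one simple option is to hold out part of the data and report the loss or accuracy on it. The sketch below is a minimal example of such a split using NumPy only. `train_validation_split`, the 20% fraction, and the stand-in arrays are assumptions made for illustration, and nothing here depends on the layout of `DigitData`.

```python
import numpy as np

def train_validation_split(X, Y, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for validation."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))  # random ordering of the example indices
    n_val = int(round(validation_fraction * len(X)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], Y[train_idx], X[val_idx], Y[val_idx]

# Made-up stand-in data: 10 examples with 2 features each.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
Y_demo = np.array([0, 1] * 5)

X_tr, Y_tr, X_va, Y_va = train_validation_split(X_demo, Y_demo)
print(X_tr.shape, X_va.shape)  # sizes of the training and validation sets
```

Train only on the training part, and use the held-out part to compare architectures or regularization settings.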

In [None]:
data = datasets.DigitData
X,Y = data.X,data.Y
# Relabel the targets: -1 becomes 0, and every other label becomes 1
Y = np.where(np.asarray(Y) == -1, 0, 1)

# your code here

__QUESTION 4:__ Describe your network. How does it perform?

- You can earn some extra credit if you include an error analysis. This of course requires that the problem you choose is difficult for your network to learn.

__ANSWER 4:__ 