# Discussion 06

### Neural Networks in Python

Welcome to Discussion 06. In this discussion, we'll see how to train neural networks in Python using the `keras` package. To run this discussion notebook, you will need to `pip install tensorflow`. Make sure that you have TensorFlow version 2 or greater.
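One way to confirm the installed version before going further is sketched below. The helper name `is_at_least_v2` is our own and is not part of TensorFlow. The lookup uses the standard library's `importlib.metadata` and assumes the package was installed under the distribution name `tensorflow`.

```python
from importlib.metadata import PackageNotFoundError, version


def is_at_least_v2(version_string):
    """Return True if a version string such as '2.15.0' has major version >= 2."""
    major = int(version_string.split(".")[0])
    return major >= 2


try:
    # Assumption: the distribution is registered as "tensorflow".
    installed = version("tensorflow")
    status = "OK" if is_at_least_v2(installed) else "too old, upgrade to 2.x"
    print(installed, status)
except PackageNotFoundError:
    print("TensorFlow is not installed; run `pip install tensorflow`.")
```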

There are multiple deep learning packages in Python. You may have heard of some of them: Keras, TensorFlow, Theano, Caffe, PyTorch, etc. In this discussion, we'll get started with TensorFlow and Keras.

<img src="./tf-keras.jpg" width=50%>

Keras is a high-level deep learning framework for Python developed by François Chollet (who happens to be an engineer at Google). Keras requires a low-level "backend" to handle the heavy lifting. Supported backends are TensorFlow, Theano, and CNTK.

TensorFlow is a low-level machine learning library published by the Google Brain project. It is the most popular choice of backend for Keras. Recently, the TensorFlow team officially adopted Keras as its high-level API. Likewise, all future Keras development will be for the TensorFlow backend only. You can therefore view them as part of the same project (even though they have different histories).

As long as you have version 2 or greater of TensorFlow installed, you can import the Keras API as follows:

In [None]:
from tensorflow import keras

## A First NN

### Creating the Model

As a start, let's try to train a neural network to solve the `xor` problem described in lecture. In this problem, we have four training examples: (0, 0) with label 0, (0, 1) with label 1, (1, 0) with label 1, and (1, 1) with label 0.

In [None]:
import numpy as np

X = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0]
])

y = np.array([1, 1, 0, 0])

No linear prediction function is able to learn this without a non-linear feature transformation. But a neural network with a hidden layer will be able to learn a suitable feature transformation and be able to classify all four points correctly. The network architecture discussed in class is shown below:

<img src="./nn.png" width=50%>
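To see concretely why a linear model cannot fit the XOR data, the sketch below solves the least-squares problem for a linear predictor with a bias term. The variable names (`X_xor`, `X_aug`, and so on) are our own. The best linear fit predicts 0.5 for every point, so it cannot separate the two classes.

```python
import numpy as np

X_xor = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
y_xor = np.array([1, 1, 0, 0], dtype=float)

# Append a column of ones so the linear model can learn a bias term.
X_aug = np.column_stack([X_xor, np.ones(len(X_xor))])

# Least-squares solution for the parameters (w1, w2, b).
w, *_ = np.linalg.lstsq(X_aug, y_xor, rcond=None)
linear_predictions = X_aug @ w

print(np.round(w, 3))
print(np.round(linear_predictions, 3))
```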

Let's see how to build this with Keras. The first step is to create an *input* layer.

In [None]:
inputs = keras.Input(shape=(2,))

This tells Keras that there will be two numbers as input. Next, we create a hidden layer that takes the input layer as input. We do this with Keras as follows:

In [None]:
hidden_layer = keras.layers.Dense(2, activation='relu')(inputs)

Notice that we've chosen the ReLU as the activation function. By using a "Dense" layer, all four edges from the inputs to the hidden nodes have been created, as have two edges for the biases $b_1^{(1)}$ and $b^{(1)}_2$. Also note that we're using Keras's "functional" API by treating the `Dense` layer object as a function and calling it with the inputs created above.
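As a rough picture of what a dense layer with ReLU computes, here is a NumPy sketch of `activation(x @ W + b)`. The helper `dense_forward` and the weight values are hypothetical and chosen only for illustration; Keras initializes its weights randomly.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU: negative values are clipped to zero."""
    return np.maximum(z, 0.0)


def dense_forward(x, W, b, activation):
    """Compute activation(x @ W + b) for a batch of inputs x."""
    return activation(x @ W + b)


X_xor = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)

# Hypothetical weights: W1 has one row per input and one column per hidden node,
# so its four entries correspond to the four edges; b1 holds the two biases.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

hidden = dense_forward(X_xor, W1, b1, relu)
print(hidden)  # one row of two hidden activations per training example
```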

Next we create the outputs. This, too, is a dense layer, but we'll use a linear activation function for now:

In [None]:
outputs = keras.layers.Dense(1, activation='linear')(hidden_layer)

Lastly, we need to put everything together into a `Model`:

In [None]:
model = keras.Model(inputs=inputs, outputs=outputs)

Note that the `inputs` are connected to the `outputs` through the `hidden_layer`.

We can get a summary of the model as follows:

In [None]:
model.summary()

Notice that this says there are 9 parameters. Check to make sure this matches what we expect.
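The count can be checked by hand: a dense layer has one weight per incoming edge plus one bias per node. A small sketch of that arithmetic (the helper name `dense_param_count` is our own):

```python
def dense_param_count(n_inputs, n_units):
    """Number of parameters in a dense layer: one weight per edge plus one bias per unit."""
    return n_inputs * n_units + n_units


hidden_params = dense_param_count(2, 2)  # 4 weights + 2 biases
output_params = dense_param_count(2, 1)  # 2 weights + 1 bias
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)  # 6 3 9
```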

### Training the Model

Now we need to train the model. This requires choosing a loss function and an optimization algorithm. In practice, gradient descent itself is rarely used. Instead, variations on gradient descent are employed. For example, a popular optimization algorithm is RMSprop -- it is famous for being documented nowhere but used almost everywhere. Like gradient descent, RMSprop requires a learning rate.
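For intuition, here is a minimal sketch of the RMSprop update on a one-dimensional problem. It follows the commonly described form of the rule (divide the gradient by a running root-mean-square of recent gradients); the function name, the values `rho=0.9` and `eps=1e-7`, and the test function are assumptions for illustration, not a reproduction of Keras's implementation.

```python
import numpy as np


def rmsprop_minimize(grad_fn, w0, learning_rate=0.01, rho=0.9, eps=1e-7, n_steps=2000):
    """Minimize a 1-D function with RMSprop, given its gradient function."""
    w = float(w0)
    avg_sq_grad = 0.0  # running average of squared gradients
    for _ in range(n_steps):
        g = grad_fn(w)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g ** 2
        # Scale the step by the recent gradient magnitude.
        w -= learning_rate * g / (np.sqrt(avg_sq_grad) + eps)
    return w


# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = rmsprop_minimize(lambda w: 2 * (w - 3.0), w0=0.0)
print(w_final)  # close to the minimizer w = 3
```

Because the step is normalized, its size stays near the learning rate regardless of how large or small the raw gradient is.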

We also need to choose a loss function. For now, let's use the square loss.
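As a reminder of what the square loss measures, the sketch below computes the mean squared error by hand for two sets of predictions on the XOR labels. The helper name `mean_squared_error` is our own.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)


y_xor = [1, 1, 0, 0]

perfect_loss = mean_squared_error(y_xor, [1, 1, 0, 0])
constant_loss = mean_squared_error(y_xor, [0.5, 0.5, 0.5, 0.5])

print(perfect_loss)   # 0.0
print(constant_loss)  # 0.25
```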

We implement these choices by compiling the model:

In [None]:
model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
    loss=keras.losses.MeanSquaredError()
)

Next, we fit the model by providing it with a training set. The optimization algorithm requires that we specify a number of "epochs". This is the number of times it will pass through the data. The larger we set this number, the more times the training algorithm sees each data point.
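To make "epochs" concrete, here is a sketch of a training loop for a simple linear model using plain mini-batch gradient descent. It is not how Keras is implemented; the function `train_linear`, the toy data, and the learning rate are assumptions for illustration. Each epoch is one full pass over the data, and the number of parameter updates per epoch depends on the batch size.

```python
import numpy as np


def train_linear(X, y, epochs, batch_size, learning_rate=0.05, seed=0):
    """Fit y ~ X @ w with mini-batch gradient descent on the square loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))  # visit the data in a new order each epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            gradient = 2 * X[batch].T @ residual / len(batch)
            w -= learning_rate * gradient
            n_updates += 1
    return w, n_updates


# Toy data following y = 2x exactly.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = 2.0 * X_demo[:, 0]

w_full, updates_full = train_linear(X_demo, y_demo, epochs=50, batch_size=4)
w_mini, updates_mini = train_linear(X_demo, y_demo, epochs=50, batch_size=2)

print(updates_full, updates_mini)  # 50 100
```

With only four training points and a batch that holds all of them, one epoch corresponds to a single update.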

In [None]:
model.fit(X, y, epochs=1000, verbose=0)

We can treat our model like a function. Calling it on `X` evaluates the neural network at every training point. We *should* see something close to `[1, 1, 0, 0]` when we evaluate the cell below:

In [None]:
model(X)

Unless you were lucky, you probably don't see anything close to the expected answers. Why is this? Even though this neural network is capable of obtaining 100% accuracy on the data set (as verified in lecture), it requires a specific choice of weights that is hard to find. More precisely, the optimization algorithm is stuck in a local minimum that is far from the global minimum.
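To confirm that suitable weights do exist for this architecture, the sketch below runs a forward pass in NumPy with hand-picked weights. These are one of many valid choices and not necessarily the ones shown in lecture.

```python
import numpy as np

X_xor = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
y_xor = np.array([1, 1, 0, 0], dtype=float)

# Hand-picked weights (an assumption for illustration).
# Hidden node 1 computes relu(x1 - x2); hidden node 2 computes relu(x2 - x1).
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.zeros(2)

# The linear output node adds the two hidden activations.
W2 = np.array([[1.0], [1.0]])
b2 = np.zeros(1)

hidden = np.maximum(X_xor @ W1 + b1, 0.0)
predictions = (hidden @ W2 + b2).ravel()

print(predictions)  # [1. 1. 0. 0.]
```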

For the following, it might be useful to have all of the above code in one cell for the purpose of copying and pasting. Note that this version trains for 3000 epochs rather than 1000:

In [None]:
inputs = keras.Input(shape=(2,))
hidden_layer = keras.layers.Dense(2, activation='relu')(inputs)
outputs = keras.layers.Dense(1, activation='linear')(hidden_layer)
model = keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=.001),
    loss=keras.losses.MeanSquaredError()
)

model.fit(X, y, epochs=3000, verbose=0)

## Adding Hidden Nodes

One way to make the training easier is to increase the number of hidden nodes.

**Question 01**. Change the number of hidden nodes in the cell below and re-train the model until it predicts something close to `[1, 1, 0, 0]` on `X`.

*Note*: try re-training your model several times. Sometimes you may get stuck in a local minimum, but most of the time you should find a good solution.

In [None]:
...

In [None]:
model(X)

## Visualizing the Decision Boundary

The function below plots the decision boundary of the current `model`, that is, the curve along which the network's output equals 0.5:

In [None]:
import matplotlib.pyplot as plt

In [None]:
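def plot_decision_boundary(xlim, ylim):
    """Plot the 0.5-level decision boundary of the global `model` over the given window."""

    # Build a 100 x 100 grid of points covering the plotting window.
    xx, yy = np.meshgrid(
        np.linspace(xlim[0], xlim[1], 100),
        np.linspace(ylim[0], ylim[1], 100),
    )

    # Evaluate the model at every grid point, then restore the grid shape.
    grid_points = np.column_stack((xx.ravel(), yy.ravel()))
    zz = np.asarray(model(grid_points)).reshape(xx.shape)

    # Draw the boundary where the output equals 0.5 and shade each side.
    plt.contour(xx, yy, zz, levels=[0.5])
    plt.contourf(xx, yy, np.sign(zz - .5), cmap='RdYlGn', alpha=.2)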
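The grid evaluation inside the plotting function can be tried on its own, without a trained network or any plotting. In the sketch below, `evaluate_on_grid` and `stand_in_model` are hypothetical names; the stand-in is just the function |x1 - x2|, which happens to agree with XOR at the four corners.

```python
import numpy as np


def evaluate_on_grid(predict, xlim, ylim, n=5):
    """Evaluate `predict` on an n x n grid and return the grid and the outputs."""
    xx, yy = np.meshgrid(
        np.linspace(xlim[0], xlim[1], n),
        np.linspace(ylim[0], ylim[1], n),
    )
    # Flatten the grid into an (n * n, 2) array of points, one point per row.
    grid_points = np.column_stack((xx.ravel(), yy.ravel()))
    zz = np.asarray(predict(grid_points)).reshape(xx.shape)
    return xx, yy, zz


def stand_in_model(points):
    """Stand-in for a trained model: returns |x1 - x2| for each point."""
    return np.abs(points[:, 0] - points[:, 1])


xx, yy, zz = evaluate_on_grid(stand_in_model, xlim=[0, 1], ylim=[0, 1], n=5)

# Thresholding at 0.5 gives the predicted class on each side of the boundary.
predicted_class = (zz > 0.5).astype(int)
print(predicted_class)
```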

Here is the decision boundary of the model learned above:

In [None]:
plot_decision_boundary(xlim=[-.5, 1.5], ylim=[-.5, 1.5])
plt.scatter([0, 1], [1, 0], color='green')
plt.scatter([1, 0], [1, 0], color='red')

In the coming weeks, we will explore using different activation functions. For instance, instead of ReLU, we can use sigmoidal activation functions.
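For reference, here is a NumPy sketch comparing the two activation functions on a few inputs. The function names are our own. ReLU clips negative inputs to zero, while the sigmoid squashes every input into the interval (0, 1).

```python
import numpy as np


def relu(z):
    """ReLU: max(z, 0), applied element-wise."""
    return np.maximum(z, 0.0)


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.array([-2.0, 0.0, 2.0])

print(relu(z))     # [0. 0. 2.]
print(sigmoid(z))  # approximately 0.119, 0.5, 0.881
```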

**Question 02**. Explore using different numbers of hidden neurons with sigmoidal activations and plot the decision boundary. Try adding another hidden layer. What effec

In [None]:
...

In [None]:
model(X)

In [None]:
plot_decision_boundary(xlim=[-.5, 1.5], ylim=[-.5, 1.5])
plt.scatter([0, 1], [1, 0], color='green')
plt.scatter([1, 0], [1, 0], color='red')

## A More Complex Pattern

Here is our favorite "moons" data set:

In [None]:
import sklearn.datasets

In [None]:
X_moons, y_moons = sklearn.datasets.make_moons(n_samples=400, noise=.1)

In [None]:
plt.scatter(*X_moons.T, c=y_moons, cmap='RdYlGn')

**Question 03**. Train a deep neural network that obtains close to 100% training accuracy on this data.
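Training accuracy can be measured by thresholding the network's outputs at 0.5, the same level used for the decision boundary, and comparing with the labels. A minimal sketch follows; the helper `accuracy` and the example scores are made up for illustration. With a trained model, one would pass something like `np.asarray(model(X_moons))` as the scores.

```python
import numpy as np


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predicted = (np.asarray(scores).ravel() > threshold).astype(int)
    return np.mean(predicted == np.asarray(y_true).ravel())


example_labels = np.array([0, 1, 1, 0])
# Made-up outputs with one column, shaped like the output of a one-node output layer.
example_scores = np.array([[0.1], [0.9], [0.4], [0.2]])

example_accuracy = accuracy(example_labels, example_scores)
print(example_accuracy)  # 0.75
```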

In [None]:
...

In [None]:
plot_decision_boundary(xlim=[-1.5, 2.5], ylim=[-1.5, 1.5])
plt.scatter(*X_moons.T, c=y_moons, cmap='RdYlGn')