[View in Colaboratory](https://colab.research.google.com/github/chokkan/deeplearningclass/blob/master/mlp-binary.ipynb)

# Feedforward Neural Networks

This Jupyter notebook explains several ways of implementing single-layer and multi-layer neural networks. The implementations are arranged from concrete (explicit) to abstract so that one can understand the processing that deep learning frameworks hide as a black box.

In order to focus on understanding the internals of training, this notebook uses a simple and classic example: *threshold logic units*.
Supposing $x=0$ as *false* and $x=1$ as *true*, single-layer neural networks can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and NAND ($|$). Multi-layer neural networks can realize logical compounds such as XOR.

| $x_1$ | $x_2$ | AND | OR | NAND | XOR |
| :---: |:-----:|:---:|:--:|:----:|:---:|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |


## Using numpy

### Single-layer perceptron

A single-layer perceptron predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ ($d$ denotes the number of input dimensions) by using the following formula,
$$
\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + ... + w_d x_d + b)
$$

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a weight vector; $b \in \mathbb{R}$ is a bias weight; and $g(.)$ denotes a Heaviside step function (we assume $g(0)=0$).
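
As a concrete instance of this formula, the sketch below evaluates a perceptron with hand-picked (not trained) parameters for AND, OR, and NAND, and then composes two of them to obtain XOR. The function name `perceptron` and the specific weight values are assumptions made for illustration; many other values realize the same gates.

```python
import numpy as np

def perceptron(w, b, x):
    """Single-layer perceptron: g(w . x + b) with g(0) = 0."""
    return np.heaviside(np.dot(x, w) + b, 0)

# All four input combinations, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked (not trained) parameters; many other choices work too.
y_and = perceptron(np.array([1.0, 1.0]), -1.5, X)
y_or = perceptron(np.array([1.0, 1.0]), -0.5, X)
y_nand = perceptron(np.array([-1.0, -1.0]), 1.5, X)

# XOR is not linearly separable, but XOR(x) = AND(NAND(x), OR(x)),
# i.e. a second perceptron applied to the outputs of two others.
hidden = np.stack([y_nand, y_or], axis=1)
y_xor = perceptron(np.array([1.0, 1.0]), -1.5, hidden)

print(y_and, y_or, y_nand, y_xor)
```

The composition in the last step is the reason a second layer is needed for XOR: no single choice of `w` and `b` separates its positive and negative inputs.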

Let's train a NAND gate with two inputs ($d = 2$). More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias weight $b$ of a single-layer perceptron that realizes the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

| $x_1$ | $x_2$ | $y$  |
| :---: |:-----:|:----:|
| 0 | 0 | 1|
| 0 | 1 | 1|
| 1 | 0 | 1|
| 1 | 1 | 0|

We convert the truth table into a training set consisting of all mappings of the NAND gate,
$$
\boldsymbol{x}_1 = (0, 0), y_1 = 1 \\
\boldsymbol{x}_2 = (0, 1), y_2 = 1 \\
\boldsymbol{x}_3 = (1, 0), y_3 = 1 \\
\boldsymbol{x}_4 = (1, 1), y_4 = 0 \\
$$

In order to train the weight vector and the bias weight with the same code, we include the bias term as an additional input dimension. More concretely, we append $1$ to each input,
$$
\boldsymbol{x}'_1 = (0, 0, 1), y_1 = 1 \\
\boldsymbol{x}'_2 = (0, 1, 1), y_2 = 1 \\
\boldsymbol{x}'_3 = (1, 0, 1), y_3 = 1 \\
\boldsymbol{x}'_4 = (1, 1, 1), y_4 = 0 \\
$$

Then, the formula of the single-layer perceptron becomes,
$$
\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)
$$
Therefore, $w_1$ and $w_2$ are the weights for $x_1$ and $x_2$, respectively, and $w_3$ serves as the bias weight.
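
The following sketch checks numerically that appending a constant $1$ to the inputs gives the same pre-activation values as keeping a separate bias. The variable names (`X_aug`, `w_aug`) and the parameter values are illustrative assumptions.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
w = np.array([-1.0, -0.5])   # illustrative weights
b = 1.5                      # illustrative bias

# Append a constant 1 to every input and move the bias into the weights.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)

explicit = np.dot(X, w) + b        # w . x + b
augmented = np.dot(X_aug, w_aug)   # (w_1, w_2, w_3) . x'
print(explicit, augmented)
```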

The code below implements Rosenblatt's perceptron algorithm with a fixed number of iterations (100 times). We use a constant learning rate 0.5 for simplicity.
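
For reference, the per-instance update performed in the inner loop can be written as follows. This is a restatement of the standard perceptron rule in the notation of this notebook, where $\eta$ is the learning rate (`eta` in the code):

$$
\boldsymbol{w} \leftarrow \boldsymbol{w} + \eta (y_i - \hat{y}_i) \boldsymbol{x}'_i
$$

The weights change only when the prediction is wrong. For example, starting from $\boldsymbol{w} = (0, 0, 0)$, the first instance $\boldsymbol{x}'_1 = (0, 0, 1)$ with $y_1 = 1$ is predicted as $g(0) = 0$, so the update yields $\boldsymbol{w} = (0, 0, 0.5)$.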


In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    for i in range(len(y)):
        y_pred = np.heaviside(np.dot(x[i], w), 0)
        w += (y[i] - y_pred) * eta * x[i]

In [2]:
w

array([-1. , -0.5,  1.5])

In [3]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])

### Single-layer perceptron with mini-batch

It is desirable to reduce the amount of computation executed by the Python interpreter because it is relatively slow. A common technique to speed up machine-learning code written in Python is to execute computations inside a matrix library (e.g., numpy).

The single-layer perceptron makes predictions for four inputs,
$$
\hat{y}_1 = g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\
\hat{y}_2 = g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\
\hat{y}_3 = g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\
\hat{y}_4 = g(\boldsymbol{x}_4 \cdot \boldsymbol{w}) \\
$$

Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
X = \begin{pmatrix} 
  \boldsymbol{x}_1 \\ 
  \boldsymbol{x}_2 \\ 
  \boldsymbol{x}_3 \\ 
  \boldsymbol{x}_4 \\ 
\end{pmatrix}
$$

Then, we can write the four predictions in one dot-product computation,
$$
\hat{Y} = g(X \cdot \boldsymbol{w})
$$

The code below implements this idea. The function `np.heaviside()` yields a vector corresponding to the four predictions, applying the step function for every element of the argument.

This technique is frequently used in mini-batch training.
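
As a small sanity check of this idea, the sketch below compares the looped and the batched computation. The weight values and the names `looped`, `batched`, and `summed` are illustrative assumptions.

```python
import numpy as np

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.array([-1.0, -1.0, 2.0])   # illustrative weights

# One prediction per instance, computed in a Python loop.
looped = np.array([np.heaviside(np.dot(x[i], w), 0) for i in range(len(y))])

# The same four predictions from a single matrix-vector product.
batched = np.heaviside(np.dot(x, w), 0)

# np.dot(errors, x) adds up the per-instance terms (y_i - y_pred_i) * x_i,
# shown here for the all-zero initial weights.
w0 = np.zeros(3)
errors = y - np.heaviside(np.dot(x, w0), 0)
summed = sum(errors[i] * x[i] for i in range(len(y)))
print(batched, np.dot(errors, x), summed)
```

Note that the batched update uses predictions computed from the same weights for all four instances, whereas the per-instance loop updates the weights between predictions, so the two training procedures can end with different weights.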

In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    y_pred = np.heaviside(np.dot(x, w), 0)
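    # Sum the per-instance updates in one dot product. Note that eta is
    # not applied here, so the effective learning rate is 1.
    w += np.dot((y - y_pred), x)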

In [5]:
w

array([-1., -1.,  2.])

In [6]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])

### Stochastic gradient descent (SGD) with mini-batch

In [0]:
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    y_pred = sigmoid(np.dot(x, w))
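    # Gradient step on the negative log-likelihood. Note that eta is
    # not applied here, so the effective learning rate is 1.
    w -= np.dot((y_pred - y), x)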

In [8]:
w

array([-5.59504346, -5.59504346,  8.57206068])

In [9]:
sigmoid(np.dot(x, w))

array([0.99981071, 0.95152498, 0.95152498, 0.06798725])
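
The update in the training cell above subtracts `np.dot((y_pred - y), x)` from the weights. The sketch below checks, under the assumption that the objective is the negative log-likelihood of the sigmoid model summed over the four instances, that this expression agrees with a finite-difference estimate of the gradient. The function name `nll` and the test point `w` are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

def nll(w, x, y):
    """Negative log-likelihood summed over all instances."""
    p = sigmoid(np.dot(x, w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 1, 1, 0], dtype=float)
w = np.array([0.5, -0.3, 0.1])   # arbitrary test point

# Analytic gradient: the same expression the training loop subtracts from w.
analytic = np.dot(sigmoid(np.dot(x, w)) - y, x)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(w)
for k in range(len(w)):
    step = np.zeros_like(w)
    step[k] = eps
    numeric[k] = (nll(w + step, x, y) - nll(w - step, x, y)) / (2 * eps)

print(analytic, numeric)
```

Deriving such gradients by hand becomes tedious for larger models, which is the motivation for automatic differentiation.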

## Automatic differentiation

### Using autograd

Install [autograd](https://github.com/HIPS/autograd) (this only needs to be done once).

In [10]:
!pip install autograd

Collecting autograd
  Downloading https://files.pythonhosted.org/packages/08/7a/1ccee2a929d806ba3dbe632a196ad6a3f1423d6e261ae887e5fef2011420/autograd-1.2.tar.gz
Building wheels for collected packages: autograd
  Running setup.py bdist_wheel for autograd ... done
  Stored in directory: /content/.cache/pip/wheels/72/6f/c2/40f130cca2c91f31d354bf72de282922479c09ce0b7853c4c5
Successfully built autograd
Installing collected packages: autograd
Successfully installed autograd-1.2


In [11]:
import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))

0.47407698418010663
[-0.37754067 -0.37754067 -0.37754067]
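
The values printed above can be cross-checked by hand. For the loss $-\log \sigma(\boldsymbol{w} \cdot \boldsymbol{x})$, the chain rule gives the gradient $-(1 - \sigma(\boldsymbol{w} \cdot \boldsymbol{x})) \boldsymbol{x}$. The sketch below evaluates this expression with plain numpy; the names `loss_value` and `grad_value` are hypothetical.

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])

p = 1.0 / (1 + np.exp(-np.dot(x, w)))   # sigmoid(w . x) = sigmoid(0.5)
loss_value = -np.log(p)
grad_value = -(1 - p) * x               # d/dw of -log(sigmoid(w . x))

print(loss_value)
print(grad_value)
```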


### Using pytorch

In [12]:
!pip install torch torchvision

Collecting torch
  Downloading https://files.pythonhosted.org/packages/69/43/380514bd9663f1bf708abeb359b8b48d3fabb1c8e95bb3427a980a064c57/torch-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (484.0MB)
Collecting torchvision
  Downloading https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl (54kB)
Collecting pillow>=4.1.1 (from torchvision)
  Downloading https://files.pythonhosted.org/packages/d1/24/f53ff6b61b3d728b9

In [13]:
import torch

dtype = torch.float

x = torch.tensor([1, 1, 1], dtype=dtype)
w = torch.tensor([1.0, 1.0, -1.5], dtype=dtype, requires_grad=True)

loss = -torch.dot(x, w).sigmoid().log()
loss.backward()
print(loss.item())
print(w.grad)

0.4740769565105438
tensor([-0.3775, -0.3775, -0.3775])


## Implementing neural networks with pytorch

### Single-layer neural network using automatic differentiation

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
w = torch.randn(3, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = x.mm(w).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()      # The loss value.
    #print(t, loss.item())
    loss.backward()             # Compute the gradients of the loss.

    with torch.no_grad():
        w -= eta * w.grad       # Update weights using SGD.        
        w.grad.zero_()          # Clear the gradients for the next iteration.

In [15]:
w

tensor([[-4.2076],
        [-4.2069],
        [ 6.5032]])

In [16]:
x.mm(w).sigmoid()

tensor([[ 0.9985],
        [ 0.9086],
        [ 0.9085],
        [ 0.1288]])

### Multi-layer neural network using automatic differentiation

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
w1 = torch.randn(3, 2, dtype=dtype, requires_grad=True)
w2 = torch.randn(2, 1, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
    y_pred = x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()
    #print(t, loss.item())
    loss.backward()
    
    with torch.no_grad():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad
        
        # Clear the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()
        b2.grad.zero_()

In [18]:
print(w1)
print(w2)
print(b2)

tensor([[ 6.6428,  6.5513],
        [-6.8549, -6.2793],
        [-3.5480,  3.1245]])
tensor([[ 11.2418],
        [-10.7852]])
tensor([[ 5.0735]])


In [19]:
x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()

tensor([[ 0.0071],
        [ 0.9904],
        [ 0.9936],
        [ 0.0060]])
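
To see what the trained network computes without any framework, the sketch below repeats the forward pass in plain numpy. The parameters are copied from the tensors printed above, so they are rounded to four decimals and the results match the PyTorch output only approximately.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)

# Parameters copied from the printed tensors (rounded to four decimals).
w1 = np.array([[6.6428, 6.5513], [-6.8549, -6.2793], [-3.5480, 3.1245]])
w2 = np.array([[11.2418], [-10.7852]])
b2 = 5.0735

hidden = sigmoid(np.dot(x, w1))             # hidden layer, shape (4, 2)
y_pred = sigmoid(np.dot(hidden, w2) + b2)   # output layer, shape (4, 1)
print(hidden.round(3))
print(y_pred.round(3))
```

With these weights the hidden activations are all close to 0 or 1, so each hidden unit behaves roughly like a logic gate whose outputs the second layer combines.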

### Single-layer neural network with high-level NN modules

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

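# Binary cross-entropy loss after sigmoid function.
# size_average=False sums the losses over the instances (newer PyTorch
# releases spell this reduction='sum').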
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [21]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2608, -4.2610]])),
             ('0.bias', tensor([ 6.5830]))])

In [22]:
model(x).sigmoid()

tensor([[ 0.9986],
        [ 0.9107],
        [ 0.9107],
        [ 0.1258]])
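
The loss object used in this section takes the raw outputs of the linear layer (logits) and applies the sigmoid internally, which is why the model has no final sigmoid layer and `model(x).sigmoid()` is called only when inspecting predictions. The numpy sketch below illustrates that, for binary targets, the hand-written log-likelihood of the earlier cells, the textbook binary cross-entropy, and a numerically stable formulation on logits give the same value. The logits `z` are hypothetical, and the stable form is one common formulation, not necessarily the exact code path inside PyTorch.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

z = np.array([2.0, -1.0, 0.5, -3.0])   # hypothetical logits (pre-sigmoid outputs)
y = np.array([1.0, 1.0, 1.0, 0.0])     # NAND targets
p = sigmoid(z)

# Hand-written form used in the automatic-differentiation cells.
manual = -np.sum(np.log(y * p + (1 - y) * (1 - p)))

# Textbook binary cross-entropy, summed over instances.
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Numerically stable form that works on the logits directly.
stable = np.sum(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

print(manual, bce, stable)
```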

### Multi-layer neural network with high-level NN modules

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [24]:
model.state_dict()

OrderedDict([('0.weight', tensor([[ 5.1500,  5.1430],
                      [ 7.0787,  7.0394]])),
             ('0.bias', tensor([-7.8774, -3.1887])),
             ('2.weight', tensor([[-12.1440,  11.3434]])),
             ('2.bias', tensor([-5.2512]))])

In [25]:
model(x).sigmoid()

tensor([[ 0.0081],
        [ 0.9940],
        [ 0.9941],
        [ 0.0063]])

### Single-layer neural network with an optimizer

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [27]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2332, -4.2330]])),
             ('0.bias', tensor([ 6.5417]))])

In [28]:
model(x).sigmoid()

tensor([[ 0.9986],
        [ 0.9096],
        [ 0.9096],
        [ 0.1274]])

### Multi-layer neural networks using an optimizer

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [30]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-3.9487,  8.3311],
                      [ 4.9970,  8.4306]])),
             ('0.bias', tensor([ 2.4715, -1.4012])),
             ('2.weight', tensor([[-6.8466,  7.5803]])),
             ('2.bias', tensor([-0.7328]))])

In [31]:
model(x).sigmoid()

tensor([[ 0.0039],
        [ 0.4986],
        [ 0.9954],
        [ 0.5020]])