<a href="https://colab.research.google.com/github/chokkan/deeplearning/blob/master/notebook/binary.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feedforward Neural Networks

This Jupyter notebook explains various ways of implementing single-layer and multi-layer neural networks. The implementations are ordered from concrete (explicit) to abstract, so that readers can see what deep learning frameworks do behind the scenes.

To keep the focus on the internals of training, this notebook uses a simple, classic example: *threshold logic units*.
Treating $x=0$ as *false* and $x=1$ as *true*, a single-layer neural network can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and NAND ($|$). A multi-layer neural network can realize compound functions such as XOR, which is not linearly separable and is therefore out of reach for a single layer.

| $x_1$ | $x_2$ | AND | OR | NAND | XOR |
| :---: |:-----:|:---:|:--:|:----:|:---:|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |
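
As a minimal sketch of the claim above, the snippet below hard-codes threshold units and composes them into XOR as AND(NAND, OR). The weights are picked by hand rather than learned, and the helper names `unit`, `and_`, `or_`, `nand`, and `xor` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def unit(w, b):
    """Return a threshold logic unit computing g(w . x + b), with g(0) = 0."""
    w = np.asarray(w, dtype=float)
    return lambda x: int(np.dot(w, x) + b > 0)

# Hand-picked (not learned) parameters; many other choices work as well.
and_ = unit([1, 1], -1.5)
or_ = unit([1, 1], -0.5)
nand = unit([-1, -1], 1.5)

def xor(x):
    # Two layers: NAND and OR feed a final AND unit.
    return and_([nand(x), or_(x)])

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Each row: (input, AND, OR, NAND, XOR), matching the truth table columns.
table = [(x, and_(x), or_(x), nand(x), xor(x)) for x in inputs]
for row in table:
    print(row)
```

The single-layer columns each need one unit, while XOR needs the output of two units as the input of a third.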


## Using numpy

### Single-layer perceptron

A single-layer perceptron predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ (where $d$ is the number of input dimensions) by using the following formula,
$$
\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + ... + w_d x_d + b)
$$

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a weight vector, $b \in \mathbb{R}$ is a bias weight, and $g(\cdot)$ denotes the Heaviside step function (we assume $g(0)=0$).

Let's train a NAND gate with two inputs ($d = 2$). More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias weight $b$ of a single-layer perceptron that realizes the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

| $x_1$ | $x_2$ | $y$  |
| :---: |:-----:|:----:|
| 0 | 0 | 1|
| 0 | 1 | 1|
| 1 | 0 | 1|
| 1 | 1 | 0|

We convert the truth table into a training set consisting of all mappings of the NAND gate,
$$
\boldsymbol{x}_1 = (0, 0), y_1 = 1 \\
\boldsymbol{x}_2 = (0, 1), y_2 = 1 \\
\boldsymbol{x}_3 = (1, 0), y_3 = 1 \\
\boldsymbol{x}_4 = (1, 1), y_4 = 0 \\
$$

To train the weight vector and the bias weight with the same code, we treat the bias as an additional input dimension. Concretely, we append $1$ to each input,
$$
\boldsymbol{x}'_1 = (0, 0, 1), y_1 = 1 \\
\boldsymbol{x}'_2 = (0, 1, 1), y_2 = 1 \\
\boldsymbol{x}'_3 = (1, 0, 1), y_3 = 1 \\
\boldsymbol{x}'_4 = (1, 1, 1), y_4 = 0 \\
$$

Then, the formula of the single-layer perceptron becomes,
$$
\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)
$$
In other words, $w_1$ and $w_2$ are the weights for $x_1$ and $x_2$, respectively, and $w_3$ is the bias weight.
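
The equivalence of the two formulations can be checked numerically. The sketch below uses hand-picked NAND parameters (an assumption for illustration, not values learned in this notebook) and compares the explicit-bias form with the form where the bias is folded into the weight vector.

```python
import numpy as np

# Illustrative (hand-picked) NAND parameters: weights w and bias b.
w = np.array([-1.0, -1.0])
b = 1.5
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Explicit bias: g(w . x + b).
explicit = np.heaviside(X @ w + b, 0)

# Bias folded into the weights: append a constant 1 to every input.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)
folded = np.heaviside(X_aug @ w_aug, 0)

print(explicit)
print(folded)
```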

The code below implements Rosenblatt's perceptron algorithm with a fixed number of passes over the training set (100). We use a constant learning rate of 0.5 for simplicity.


In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    for i in range(len(y)):
        y_pred = np.heaviside(np.dot(x[i], w), 0)
        w += (y[i] - y_pred) * eta * x[i]

In [2]:
w

array([-1. , -0.5,  1.5])

In [3]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])
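
The fixed count of 100 passes is more than this tiny problem needs. As a hedged variation (the function name `train_perceptron` and the early-stopping rule are assumptions added for illustration), the sketch below applies the same update rule but stops after the first pass that makes no mistakes.

```python
import numpy as np

def train_perceptron(x, y, eta=0.5, max_epochs=100):
    """Rosenblatt's perceptron; stops early after a pass with no mistakes."""
    w = np.zeros(x.shape[1])
    for epoch in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(x, y):
            y_pred = np.heaviside(np.dot(x_i, w), 0)
            if y_pred != y_i:
                w += (y_i - y_pred) * eta * x_i
                mistakes += 1
        if mistakes == 0:
            return w, epoch + 1
    return w, max_epochs

# NAND inputs with a constant 1 appended for the bias.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y_nand = np.array([1, 1, 1, 0])

w_nand, epochs = train_perceptron(x, y_nand)
pred_nand = np.heaviside(np.dot(x, w_nand), 0)
print(w_nand, epochs)
print(pred_nand)
```

Because the training set is linearly separable, the perceptron convergence theorem guarantees that such a mistake-free pass is eventually reached.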

### Single-layer perceptron with mini-batch

It is desirable to reduce the amount of work executed by the Python interpreter, which is relatively slow. A common technique for speeding up machine-learning code written in Python is to execute computations inside a matrix library (e.g., numpy).

The single-layer perceptron makes predictions for four inputs,
$$
\hat{y}_1 = g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\
\hat{y}_2 = g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\
\hat{y}_3 = g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\
\hat{y}_4 = g(\boldsymbol{x}_4 \cdot \boldsymbol{w}) \\
$$

Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
X = \begin{pmatrix} 
  \boldsymbol{x}_1 \\ 
  \boldsymbol{x}_2 \\ 
  \boldsymbol{x}_3 \\ 
  \boldsymbol{x}_4 \\ 
\end{pmatrix}
$$

Then, we can write the four predictions as a single matrix-vector product, with $g$ applied element-wise,
$$
\hat{Y} = g(X \cdot \boldsymbol{w})
$$

The code below implements this idea. The function `np.heaviside()` applies the step function to every element of its argument and yields a vector holding the four predictions.

This technique is frequently used in mini-batch training.

In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

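eta = 0.5   # Defined as before but not applied below: the effective step size is 1.
for t in range(100):
    # Predict all four instances at once, then sum the per-instance updates.
    y_pred = np.heaviside(np.dot(x, w), 0)
    w += np.dot((y - y_pred), x)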

In [5]:
w

array([-1., -1.,  2.])

In [6]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])
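
As a small sanity check of the batched formulation (a sketch, with variable names such as `looped`, `batched`, and `summed` chosen for illustration), the snippet below confirms that one matrix-vector product gives the same predictions as a Python-level loop, and that `np.dot(err, x)` is the sum of the per-instance perceptron updates.

```python
import numpy as np

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.array([-1.0, -1.0, 2.0])   # Weights reported by the mini-batch run.

# One prediction per Python-level iteration.
looped = np.array([np.heaviside(np.dot(x_i, w), 0) for x_i in x])

# All four predictions in a single matrix-vector product.
batched = np.heaviside(np.dot(x, w), 0)

# The batch update is the sum of the per-instance updates (shown at w = 0).
w0 = np.zeros(3)
err = y - np.heaviside(np.dot(x, w0), 0)
summed = sum(e * x_i for e, x_i in zip(err, x))
batch_update = np.dot(err, x)

print(looped, batched)
print(summed, batch_update)
```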

### Stochastic gradient descent (SGD) with mini-batch

In [0]:
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

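eta = 0.5   # Defined as before but not applied below: the effective step size is 1.
for t in range(100):
    # The gradient of the summed cross-entropy loss is (y_pred - y) . x.
    y_pred = sigmoid(np.dot(x, w))
    w -= np.dot((y_pred - y), x)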

In [8]:
w

array([-5.59504346, -5.59504346,  8.57206068])

In [9]:
sigmoid(np.dot(x, w))

array([0.99981071, 0.95152498, 0.95152498, 0.06798725])
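
The update direction `np.dot((y_pred - y), x)` is the gradient of the summed cross-entropy loss $-\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$ with respect to $\boldsymbol{w}$. The sketch below checks this against central finite differences; the function names `loss`, `analytic_grad`, and `numeric_grad` and the test point are assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

def loss(w, x, y):
    """Summed cross-entropy: -sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    p = sigmoid(np.dot(x, w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, x, y):
    # Same expression as the update direction in the training loop.
    return np.dot(sigmoid(np.dot(x, w)) - y, x)

def numeric_grad(w, x, y, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for k in range(len(w)):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (loss(w + d, x, y) - loss(w - d, x, y)) / (2 * eps)
    return g

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.array([0.3, -0.2, 0.1])   # Arbitrary test point.

g_analytic = analytic_grad(w, x, y)
g_numeric = numeric_grad(w, x, y)
print(g_analytic)
print(g_numeric)
```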

## Automatic differentiation

### Using autograd

In [10]:
import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))

0.47407698418010663
[-0.37754067 -0.37754067 -0.37754067]


### Using pytorch

In [11]:
import torch

dtype = torch.float

x = torch.tensor([1, 1, 1], dtype=dtype)
w = torch.tensor([1.0, 1.0, -1.5], dtype=dtype, requires_grad=True)

loss = -torch.dot(x, w).sigmoid().log()
loss.backward()
print(loss.item())
print(w.grad)

0.4740769565105438
tensor([-0.3775, -0.3775, -0.3775])
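
Both results can be reproduced by hand. For the loss $-\log \sigma(\boldsymbol{x} \cdot \boldsymbol{w})$, the chain rule gives the gradient $-(1 - \sigma(\boldsymbol{x} \cdot \boldsymbol{w}))\,\boldsymbol{x}$. The sketch below evaluates that closed form with the standard library only, as a cross-check on the automatic differentiation rather than a replacement for it.

```python
import math

x = [1.0, 1.0, 1.0]
w = [1.0, 1.0, -1.5]

z = sum(w_k * x_k for w_k, x_k in zip(w, x))   # x . w = 0.5
p = 1.0 / (1.0 + math.exp(-z))                 # sigmoid(x . w)
loss = -math.log(p)

# d/dw [-log sigmoid(x . w)] = -(1 - sigmoid(x . w)) * x
grad = [-(1.0 - p) * x_k for x_k in x]

print(loss)
print(grad)
```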


## Implementing neural networks with pytorch

### Single-layer neural network using automatic differentiation

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
w = torch.randn(3, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = x.mm(w).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()      # The loss value.
    #print(t, loss.item())
    loss.backward()             # Compute the gradients of the loss.

    with torch.no_grad():
        w -= eta * w.grad       # Update weights using SGD.        
        w.grad.zero_()          # Clear the gradients for the next iteration.

In [13]:
w

tensor([[-4.3418],
        [-4.3396],
        [ 6.7016]], requires_grad=True)

In [14]:
x.mm(w).sigmoid()

tensor([[0.9988],
        [0.9139],
        [0.9137],
        [0.1213]], grad_fn=<SigmoidBackward>)
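
The expression `y * y_pred + (1 - y) * (1 - y_pred)` is the likelihood of the correct label, so its negative log summed over instances is the binary cross-entropy. The sketch below checks that equivalence in numpy for 0/1 labels, reusing the predictions printed above; the variable names are illustrative.

```python
import numpy as np

y = np.array([1.0, 1.0, 1.0, 0.0])
p = np.array([0.9988, 0.9139, 0.9137, 0.1213])   # Predictions printed above.

# Likelihood form used in the training loop.
ll = y * p + (1 - y) * (1 - p)
loss_ll = -np.log(ll).sum()

# Textbook binary cross-entropy.
loss_bce = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

print(loss_ll, loss_bce)
```

The two forms agree only when every label is exactly 0 or 1, which holds for the truth tables used here.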

### Multi-layer neural network using automatic differentiation

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
w1 = torch.randn(3, 2, dtype=dtype, requires_grad=True)
w2 = torch.randn(2, 1, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
    y_pred = x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()
    #print(t, loss.item())
    loss.backward()
    
    with torch.no_grad():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad
        
        # Clear the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()
        b2.grad.zero_()

In [16]:
print(w1)
print(w2)
print(b2)

tensor([[-8.1681, -8.9206],
        [ 2.7621, -5.8603],
        [-1.3929,  1.5767]], requires_grad=True)
tensor([[ 6.7857],
        [-8.6870]], requires_grad=True)
tensor([[0.0030]], requires_grad=True)


In [17]:
x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()

tensor([[0.0029],
        [0.9950],
        [0.4995],
        [0.5026]], grad_fn=<SigmoidBackward>)

### Single-layer neural network with high-level NN modules

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [19]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2303, -4.2308]])),
             ('0.bias', tensor([6.5379]))])

In [20]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9095],
        [0.9095],
        [0.1275]], grad_fn=<SigmoidBackward>)
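
`BCEWithLogitsLoss` takes raw scores (logits), which is why the model has no sigmoid layer and `model(x).sigmoid()` is applied only when inspecting predictions. The numpy sketch below shows the quantity being computed under `reduction='sum'`, in a numerically stable form and in the naive form. It is an illustration of the mathematics, not PyTorch's actual implementation, and the function names are assumptions.

```python
import numpy as np

def bce_with_logits_sum(z, y):
    """Summed binary cross-entropy on logits z, in a numerically stable form."""
    return np.sum(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive_sum(z, y):
    """Same loss via an explicit sigmoid; can overflow for large |z|."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1.0, 1.0, 1.0, 0.0])

# Parameters copied from the state_dict printed above.
weight = np.array([-4.2303, -4.2308])
bias = 6.5379
z = x @ weight + bias

stable = bce_with_logits_sum(z, y)
naive = bce_naive_sum(z, y)
print(stable, naive)
```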

### Multi-layer neural network with high-level NN modules

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

eta = 0.5
for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [22]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-0.1494, -0.0430],
                      [-0.1074,  0.0421]])),
             ('0.bias', tensor([-0.6541, -0.5074])),
             ('2.weight', tensor([[-0.0254, -0.0099]])),
             ('2.bias', tensor([0.0118]))])

In [23]:
model(x).sigmoid()

tensor([[0.4998],
        [0.4999],
        [0.5001],
        [0.5001]], grad_fn=<SigmoidBackward>)

### Single-layer neural network with an optimizer

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [25]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2643, -4.2638]])),
             ('0.bias', tensor([6.5877]))])

In [26]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9108],
        [0.9108],
        [0.1256]], grad_fn=<SigmoidBackward>)
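
With its default settings (no momentum, no weight decay), `torch.optim.SGD` performs the same update as the hand-written `param -= eta * param.grad` loop used earlier. The sketch below takes one step both ways from identical initial parameters; the names `manual` and `auto` and the fixed seed are assumptions for illustration.

```python
import torch

torch.manual_seed(0)   # Fixed seed so the example is repeatable.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float)
y = torch.tensor([[1], [1], [1], [0]], dtype=torch.float)
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

manual = torch.nn.Linear(2, 1)
auto = torch.nn.Linear(2, 1)
auto.load_state_dict(manual.state_dict())   # Identical initial parameters.
optimizer = torch.optim.SGD(auto.parameters(), lr=0.5)

# Manual update: param -= lr * grad.
manual.zero_grad()
loss_fn(manual(x), y).backward()
with torch.no_grad():
    for param in manual.parameters():
        param -= 0.5 * param.grad

# The same step through the optimizer.
optimizer.zero_grad()
loss_fn(auto(x), y).backward()
optimizer.step()

same = all(torch.allclose(a, b)
           for a, b in zip(manual.parameters(), auto.parameters()))
print(same)
```

Switching to another optimizer such as Adam changes only the `optimizer = ...` line, which is the main benefit of this abstraction.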

### Multi-layer neural network with an optimizer

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [28]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-5.5683, -5.5688],
                      [ 7.3365,  7.3404]])),
             ('0.bias', tensor([ 8.3403, -3.3190])),
             ('2.weight', tensor([[9.4095, 9.2770]])),
             ('2.bias', tensor([-13.7123]))])

In [29]:
model(x).sigmoid()

tensor([[0.0183],
        [0.9860],
        [0.9860],
        [0.0200]], grad_fn=<SigmoidBackward>)

### Single-layer neural network with a customizable NN class

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class SingleLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super(SingleLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.linear1(x)

model = SingleLayerNN(2, 1)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [31]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[-4.2721, -4.2722]])),
             ('linear1.bias', tensor([6.5998]))])

In [32]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9111],
        [0.9111],
        [0.1251]], grad_fn=<SigmoidBackward>)

### Multi-layer neural network with a customizable NN class



In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class ThreeLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super(ThreeLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_hidden, bias=True)
        self.linear2 = torch.nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        return self.linear2(self.linear1(x).sigmoid())

model = ThreeLayerNN(2, 2, 1)

# Binary cross-entropy loss; the sigmoid function is applied inside the loss.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [34]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[ 7.5323, -5.1217],
                      [ 6.7442,  2.6708]])),
             ('linear1.bias', tensor([-5.8752, -8.3220])),
             ('linear2.weight', tensor([[ 7.0725, -6.3992]])),
             ('linear2.bias', tensor([-0.0026]))])

In [35]:
model(x).sigmoid()

tensor([[0.5039],
        [0.4938],
        [0.9922],
        [0.0101]], grad_fn=<SigmoidBackward>)