
# Feedforward Neural Networks

This Jupyter notebook explains various ways of implementing single-layer and multi-layer neural networks. The implementations are ordered from concrete (explicit) to abstract so that one can understand the processing that deep learning frameworks hide as a black box.

To focus on the internals of training, this notebook uses a simple, classic example: *threshold logic units*.
Treating $x=0$ as *false* and $x=1$ as *true*, single-layer neural networks can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and NAND ($|$). Multi-layer neural networks can realize compound logic functions such as XOR, which a single-layer network cannot represent.

| $x_1$ | $x_2$ | AND | OR | NAND | XOR |
| :---: |:-----:|:---:|:--:|:----:|:---:|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |
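
As a quick sanity check of the table, the sketch below hard-codes one possible choice of weights and bias for AND, OR, and NAND, and composes them to obtain XOR. The parameter values are assumptions picked by hand (they are not learned), and the names `PARAMS`, `gate`, and `xor` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

# Hand-picked (not learned) weights and biases; many other choices also work.
PARAMS = {
    'AND': (np.array([1.0, 1.0]), -1.5),
    'OR': (np.array([1.0, 1.0]), -0.5),
    'NAND': (np.array([-1.0, -1.0]), 1.5),
}

def gate(name, x):
    """Threshold logic unit: 1 if w . x + b > 0, else 0."""
    w, b = PARAMS[name]
    return int(np.dot(w, x) + b > 0)

def xor(x):
    # XOR needs two layers: AND applied to the outputs of OR and NAND.
    return gate('AND', np.array([gate('OR', x), gate('NAND', x)]))

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
table = {name: [gate(name, x) for x in inputs] for name in PARAMS}
table['XOR'] = [xor(x) for x in inputs]
print(table)
```

Each list in `table` follows the row order of the truth table, so it can be compared column by column.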


## Using numpy

In [14]:
import numpy as np

### Single-layer perceptron

A single-layer perceptron predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ ($d$ is the number of input dimensions) by using the following formula,
$$
\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + ... + w_d x_d + b)
$$

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a weight vector, $b \in \mathbb{R}$ is a bias weight, and $g(\cdot)$ denotes the Heaviside step function (we assume $g(0)=0$).

For simplicity, let us consider examples with two-dimensional inputs ($d=2$).
We can represent an input vector $\boldsymbol{x} \in \mathbb{R}^2$ and weight vector $\boldsymbol{w} \in \mathbb{R}^2$ with `numpy.array`. We also define the bias term $b$.

In [21]:
x = np.array([0, 1])
w = np.array([1.0, 1.0])
b = 1.0

The following code computes $\boldsymbol{w} \cdot \boldsymbol{x} + b$,


In [22]:
np.dot(x, w) + b

2.0

Applying the Heaviside step function $g$ to the result yields a binary label $\hat{y}$,

In [24]:
np.heaviside(np.dot(x, w) + b, 0)

1.0

#### Including the bias term into the weight vector

To simplify the implementation, we fold the bias term `b` into the weight vector `w` as an additional dimension. More concretely, we append an element with the value of $1$ to each input,
$$
\boldsymbol{x} = (0, 1) \rightarrow \boldsymbol{x}' = (0, 1, 1)
$$
and expand the dimension of the weight vector $\boldsymbol{w} \in \mathbb{R}^{3}$.

Then, the formula of the single-layer perceptron becomes,
$$
\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)
$$
In other words, $w_1$ and $w_2$ are the weights for $x_1$ and $x_2$, respectively, and $w_3$ acts as the bias weight.

In [25]:
x = np.array([0, 1, 1])
w = np.array([1.0, 1.0, 1.0])

We can simplify the code to predict a binary label $\hat{y}$,

In [27]:
np.heaviside(np.dot(x, w), 0)

1.0
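
The equivalence between the two formulations can be checked numerically. The sketch below appends a constant 1 to every input with `np.hstack` and compares the result with the explicit-bias form. The weight and bias values are arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary illustrative parameters (not taken from a trained model).
w = np.array([1.0, 1.0])
b = -1.5

# All four two-dimensional binary inputs, one per row.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Explicit bias: g(w . x + b) for every row of X.
y_explicit = np.heaviside(X @ w + b, 0)

# Folded bias: append a column of ones to X and append b to w.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)
y_folded = np.heaviside(X_aug @ w_aug, 0)

print(y_explicit, y_folded)
```

Both arrays hold the same four predictions, because the extra column of ones multiplies the appended weight and reproduces the bias term.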

#### Training a NAND gate

Let's train a NAND gate with two inputs. More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias weight $b$ of a single-layer perceptron that realizes the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

| $x_1$ | $x_2$ | $y$  |
| :---: |:-----:|:----:|
| 0 | 0 | 1|
| 0 | 1 | 1|
| 1 | 0 | 1|
| 1 | 1 | 0|

We convert the truth table into a training set consisting of all mappings of the NAND gate,
$$
\boldsymbol{x}_1 = (0, 0), y_1 = 1 \\
\boldsymbol{x}_2 = (0, 1), y_2 = 1 \\
\boldsymbol{x}_3 = (1, 0), y_3 = 1 \\
\boldsymbol{x}_4 = (1, 1), y_4 = 0 \\
$$

As explained earlier, we fold the bias term into the inputs by appending a constant $1$ as the last dimension.
$$
\boldsymbol{x}'_1 = (0, 0, 1), y_1 = 1 \\
\boldsymbol{x}'_2 = (0, 1, 1), y_2 = 1 \\
\boldsymbol{x}'_3 = (1, 0, 1), y_3 = 1 \\
\boldsymbol{x}'_4 = (1, 1, 1), y_4 = 0 \\
$$

The code below implements Rosenblatt's perceptron algorithm for a fixed number of iterations (50). We use a constant learning rate of 0.5 for simplicity.


In [56]:
import random
import numpy as np

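# Training data. NOTE: the labels below encode AND (y = 1 only for input (1, 1)),
# which is what the logged run that follows was trained on.
# For the NAND gate described above, use y = np.array([1, 1, 1, 0]) instead.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([0, 0, 0, 1])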
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(50):
    # Pick an instance index (i) at random.
    i = random.choice(range(len(y)))
    # Predict the label for the instance x[i] with the current parameter w.
    y_pred = np.heaviside(np.dot(x[i], w), 0)
    # Show the detail of the instance and the current parameter.
    print(f'#{t}: i={i}, x={x[i]}, w={w}, y={y[i]}, y_pred={y_pred}, y_err={y[i] - y_pred}')
    # Update the parameter.
    w += (y[i] - y_pred) * eta * x[i]

#0: i=1, x=[0 1 1], w=[0. 0. 0.], y=0, y_pred=0.0, y_err=0.0
#1: i=2, x=[1 0 1], w=[0. 0. 0.], y=0, y_pred=0.0, y_err=0.0
#2: i=2, x=[1 0 1], w=[0. 0. 0.], y=0, y_pred=0.0, y_err=0.0
#3: i=2, x=[1 0 1], w=[0. 0. 0.], y=0, y_pred=0.0, y_err=0.0
#4: i=3, x=[1 1 1], w=[0. 0. 0.], y=1, y_pred=0.0, y_err=1.0
#5: i=2, x=[1 0 1], w=[0.5 0.5 0.5], y=0, y_pred=1.0, y_err=-1.0
#6: i=1, x=[0 1 1], w=[0.  0.5 0. ], y=0, y_pred=1.0, y_err=-1.0
#7: i=2, x=[1 0 1], w=[ 0.   0.  -0.5], y=0, y_pred=0.0, y_err=0.0
#8: i=3, x=[1 1 1], w=[ 0.   0.  -0.5], y=1, y_pred=0.0, y_err=1.0
#9: i=3, x=[1 1 1], w=[0.5 0.5 0. ], y=1, y_pred=1.0, y_err=0.0
#10: i=3, x=[1 1 1], w=[0.5 0.5 0. ], y=1, y_pred=1.0, y_err=0.0
#11: i=3, x=[1 1 1], w=[0.5 0.5 0. ], y=1, y_pred=1.0, y_err=0.0
#12: i=0, x=[0 0 1], w=[0.5 0.5 0. ], y=0, y_pred=0.0, y_err=0.0
#13: i=1, x=[0 1 1], w=[0.5 0.5 0. ], y=0, y_pred=1.0, y_err=-1.0
#14: i=3, x=[1 1 1], w=[ 0.5  0.  -0.5], y=1, y_pred=0.0, y_err=1.0
#15: i=2, x=[1 0 1], w=[1.  0.5 0. ], 

We can confirm the learned parameter and classification results.

In [43]:
w

array([ 0.5,  0.5, -0.5])

In [44]:
np.heaviside(np.dot(x, w), 0)

array([0., 0., 0., 1.])
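
The run above uses labels that are 1 only for the input (1, 1). As a complementary sketch, the variant below trains on the NAND labels from the truth table, visits the instances in a fixed order instead of sampling them, and stops once a full pass produces no errors. The stopping rule and the cap of 100 passes are assumptions made for this illustration.

```python
import numpy as np

# NAND training data with the bias folded in as a constant third input.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.zeros(3)

eta = 0.5
for epoch in range(100):            # Upper bound on passes (an assumption).
    n_errors = 0
    for i in range(len(y)):         # Fixed order instead of random sampling.
        y_pred = np.heaviside(np.dot(x[i], w), 0)
        if y_pred != y[i]:
            # Perceptron update: shift w by +/- eta * x[i] on a mistake.
            w += eta * (y[i] - y_pred) * x[i]
            n_errors += 1
    if n_errors == 0:               # A clean pass: every instance is correct.
        break

print(epoch, w, np.heaviside(np.dot(x, w), 0))
```

Because NAND is linearly separable, the loop ends with an error-free pass well before the cap is reached.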

### Single-layer perceptron with mini-batch

It is desirable to minimize the amount of computation executed by the Python interpreter, which is extremely slow. A common technique for speeding up machine-learning code written in Python is to execute computations inside a matrix library (e.g., numpy).

The single-layer perceptron makes predictions for four inputs,
$$
\hat{y}_1 = g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\
\hat{y}_2 = g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\
\hat{y}_3 = g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\
\hat{y}_4 = g(\boldsymbol{x}_4 \cdot \boldsymbol{w}) \\
$$

Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
X = \begin{pmatrix} 
  \boldsymbol{x}_1 \\ 
  \boldsymbol{x}_2 \\ 
  \boldsymbol{x}_3 \\ 
  \boldsymbol{x}_4 \\ 
\end{pmatrix}
$$

Then, we can compute the four predictions with a single matrix-vector product, applying $g$ element-wise,
$$
\hat{Y} = g(X \cdot \boldsymbol{w})
$$

The code below implements this idea. The function `np.heaviside()` yields a vector corresponding to the four predictions, applying the step function for every element of the argument.

This technique is frequently used in mini-batch training.

In [57]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

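eta = 0.5   # NOTE: defined but not applied below; the logged run uses a step size of 1.
for t in range(10):
    # Predict the labels of all four instances with one matrix-vector product.
    y_pred = np.heaviside(np.dot(x, w), 0)
    print(f'#{t}: w={w}, Y={y}, Ypred={y_pred}, Yerr={y-y_pred}, dw={np.dot((y - y_pred), x)}')
    # Sum the perceptron updates of all instances in the batch.
    w += np.dot((y - y_pred), x)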

#0: w=[0. 0. 0.], Y=[1 1 1 0], Ypred=[0. 0. 0. 0.], Yerr=[1. 1. 1. 0.], dw=[1. 1. 3.]
#1: w=[1. 1. 3.], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#2: w=[0. 0. 2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#3: w=[-1. -1.  1.], Y=[1 1 1 0], Ypred=[1. 0. 0. 0.], Yerr=[0. 1. 1. 0.], dw=[1. 1. 2.]
#4: w=[0. 0. 3.], Y=[1 1 1 0], Ypred=[1. 1. 1. 1.], Yerr=[ 0.  0.  0. -1.], dw=[-1. -1. -1.]
#5: w=[-1. -1.  2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#6: w=[-1. -1.  2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#7: w=[-1. -1.  2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#8: w=[-1. -1.  2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]
#9: w=[-1. -1.  2.], Y=[1 1 1 0], Ypred=[1. 1. 1. 0.], Yerr=[0. 0. 0. 0.], dw=[0. 0. 0.]


We can confirm the learned parameter and classification results.

In [51]:
w

array([-1., -1.,  2.])

In [52]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])
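
To see that the batched form computes the same predictions as four separate dot products, the sketch below compares the two using the weights reported above, $\boldsymbol{w} = (-1, -1, 2)$.

```python
import numpy as np

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
w = np.array([-1.0, -1.0, 2.0])     # Weights reported by the run above.

# One prediction at a time, looping in the Python interpreter.
y_loop = np.array([np.heaviside(np.dot(x_i, w), 0) for x_i in x])

# All predictions at once: a single matrix-vector product inside numpy.
y_batch = np.heaviside(np.dot(x, w), 0)

print(y_loop, y_batch)
```

The two results are identical. The batched version simply moves the loop over instances from the interpreter into numpy.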

### Stochastic gradient descent (SGD) with mini-batch
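
The code below replaces the step function with the sigmoid function $\sigma$, which makes the predictions differentiable with respect to $\boldsymbol{w}$. As a sketch of the reasoning (assuming the cross-entropy loss, which the code does not name explicitly but whose gradient matches the update it applies), the loss over the four instances and its gradient are,
$$
L(\boldsymbol{w}) = -\sum_{i=1}^{4} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right], \quad \hat{y}_i = \sigma(\boldsymbol{x}_i \cdot \boldsymbol{w})
$$
$$
\frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}} = \sum_{i=1}^{4} (\hat{y}_i - y_i) \boldsymbol{x}_i
$$

In matrix form the gradient is $(\hat{Y} - Y)^\top X$, which is what `np.dot((y_pred - y), x)` computes.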

In [58]:
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

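eta = 0.5   # NOTE: defined but not applied below; the logged run uses a step size of 1.
for t in range(100):
    # Predict probabilities for all four instances at once.
    y_pred = sigmoid(np.dot(x, w))
    print(f'#{t}: w={w}, Y={y}, Ypred={y_pred}, Yerr={y-y_pred}, dw={np.dot((y - y_pred), x)}')
    # Gradient-descent step: subtract the gradient (y_pred - y) . x from w.
    w -= np.dot((y_pred - y), x)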

#0: w=[0. 0. 0.], Y=[1 1 1 0], Ypred=[0.5 0.5 0.5 0.5], Yerr=[ 0.5  0.5  0.5 -0.5], dw=[0. 0. 1.]
#1: w=[0. 0. 1.], Y=[1 1 1 0], Ypred=[0.73105858 0.73105858 0.73105858 0.73105858], Yerr=[ 0.26894142  0.26894142  0.26894142 -0.73105858], dw=[-0.46211716 -0.46211716  0.07576569]
#2: w=[-0.46211716 -0.46211716  1.07576569], Y=[1 1 1 0], Ypred=[0.74569184 0.64877263 0.64877263 0.53781052], Yerr=[ 0.25430816  0.35122737  0.35122737 -0.53781052], dw=[-0.18658315 -0.18658315  0.41895238]
#3: w=[-0.64870031 -0.64870031  1.49471806], Y=[1 1 1 0], Ypred=[0.81678537 0.69973111 0.69973111 0.54916993], Yerr=[ 0.18321463  0.30026889  0.30026889 -0.54916993], dw=[-0.24890104 -0.24890104  0.23458248]
#4: w=[-0.89760135 -0.89760135  1.72930054], Y=[1 1 1 0], Ypred=[0.84932293 0.6967141  0.6967141  0.48353042], Yerr=[ 0.15067707  0.3032859   0.3032859  -0.48353042], dw=[-0.18024452 -0.18024452  0.27371846]
#5: w=[-1.07784586 -1.07784586  2.003019  ], Y=[1 1 1 0], Ypred=[0.88111369 0.71609499 0.71609499

In [None]:
w

array([-5.59504346, -5.59504346,  8.57206068])

In [None]:
sigmoid(np.dot(x, w))

array([0.99981071, 0.95152498, 0.95152498, 0.06798725])
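
The update direction used above, `np.dot((y_pred - y), x)`, can be checked against a finite-difference approximation. The sketch below assumes the cross-entropy loss (the notebook does not name the loss) and uses the hypothetical helper names `loss` and `numerical_grad`. The test point `w` is arbitrary.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

def loss(w, x, y):
    """Cross-entropy loss summed over all instances (assumed loss)."""
    p = sigmoid(np.dot(x, w))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def numerical_grad(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(w)
    for k in range(len(w)):
        d = np.zeros_like(w)
        d[k] = eps
        g[k] = (f(w + d) - f(w - d)) / (2 * eps)
    return g

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([1, 1, 1, 0])
w = np.array([0.3, -0.2, 0.1])      # Arbitrary test point.

analytic = np.dot(sigmoid(np.dot(x, w)) - y, x)
numeric = numerical_grad(lambda v: loss(v, x, y), w)
print(analytic, numeric)
```

The two vectors agree up to the finite-difference error, which supports reading the training loop as gradient descent on the cross-entropy loss.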

## Automatic differentiation

Consider a loss function,
$$
l_{\boldsymbol{x}}(\boldsymbol{w}) = - \log \sigma(\boldsymbol{w} \cdot \boldsymbol{x}) = - \log \frac{1}{1 + e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}
$$

This section shows how several deep learning libraries compute the loss value $l_{\boldsymbol{x}}(\boldsymbol{w})$ and its gradient $\frac{\partial l_{\boldsymbol{x}}(\boldsymbol{w})}{\partial \boldsymbol{w}}$ when $\boldsymbol{x} = (1, 1, 1)$ and $\boldsymbol{w} = (1, 1, -1.5)$.
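
For reference, this loss also has a closed-form gradient, $\frac{\partial l_{\boldsymbol{x}}(\boldsymbol{w})}{\partial \boldsymbol{w}} = -(1 - \sigma(\boldsymbol{w} \cdot \boldsymbol{x})) \boldsymbol{x}$, so the values returned by automatic differentiation can be checked by hand. A minimal numpy sketch follows. The function names `loss` and `grad` are chosen here for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

def loss(w, x):
    return -np.log(sigmoid(np.dot(w, x)))

def grad(w, x):
    # d/dw [-log sigmoid(w . x)] = -(1 - sigmoid(w . x)) * x
    return -(1.0 - sigmoid(np.dot(w, x))) * x

x = np.array([1.0, 1.0, 1.0])
w = np.array([1.0, 1.0, -1.5])
print(loss(w, x))
print(grad(w, x))
```

Here $\boldsymbol{w} \cdot \boldsymbol{x} = 0.5$, so the loss is $-\log \sigma(0.5) \approx 0.4741$ and every gradient component is $-(1 - \sigma(0.5)) \approx -0.3775$.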

### Using autograd

See: https://github.com/HIPS/autograd

In [59]:
import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))

0.47407698418010663
[-0.37754067 -0.37754067 -0.37754067]


### Using pytorch

See: https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

In [None]:
import torch

dtype = torch.float

x = torch.tensor([1, 1, 1], dtype=dtype)
w = torch.tensor([1.0, 1.0, -1.5], dtype=dtype, requires_grad=True)

loss = -torch.dot(x, w).sigmoid().log()
loss.backward()
print(loss.item())
print(w.grad)

### Using TensorFlow Eager

See: https://www.tensorflow.org/guide/autodiff

In [66]:
import tensorflow as tf

dtype = tf.float32

x = tf.constant([1, 1, 1], dtype=dtype, name='x')
w = tf.Variable([1.0, 1.0, -1.5], dtype=dtype, name='w')

with tf.GradientTape() as tape:
    loss = -tf.math.log(tf.math.sigmoid(tf.tensordot(x, w, 1)))

print(loss.numpy())
print(tape.gradient(loss, w))

0.47407696
tf.Tensor([-0.37754062 -0.37754062 -0.37754062], shape=(3,), dtype=float32)


### Using JAX

See: https://github.com/google/jax

In [4]:
import jax
import jax.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = jax.jit(jax.grad(loss))
print(loss(w, x))
print(grad_loss(w, x))

0.47407696
[-0.37754068 -0.37754068 -0.37754068]


## Implementing neural networks with pytorch

### Single-layer neural network using automatic differentiation

In [23]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
w = torch.randn(3, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = x.mm(w).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()      # The loss value.
    loss.backward()             # Compute the gradients of the loss.

    print(f'#{t}: w={w.t().data[0].numpy()}, loss={loss.item()}')

    with torch.no_grad():
        w -= eta * w.grad       # Update weights using SGD.        
        w.grad.zero_()          # Clear the gradients for the next iteration.

#0: w=[-2.0885355  -1.4806837  -0.15213448], loss=4.949068546295166
#1: w=[-1.6484289  -1.0742207   0.97523004], loss=2.3098151683807373
#2: w=[-1.3915058 -0.8860434  1.6314366], loss=1.5687907934188843
#3: w=[-1.3432871 -0.8970635  1.922385 ], loss=1.43471097946167
#4: w=[-1.3743029 -0.9756532  2.0871632], loss=1.370697021484375
#5: w=[-1.4271579 -1.069197   2.2132783], loss=1.3172904253005981
#6: w=[-1.4855065 -1.1632609  2.3249977], loss=1.2688961029052734
#7: w=[-1.5445642 -1.2539637  2.4296527], loss=1.2244752645492554
#8: w=[-1.6029084 -1.340456   2.5296922], loss=1.1834886074066162
#9: w=[-1.6601355 -1.4227742  2.626064 ], loss=1.1455150842666626
#10: w=[-1.7161564 -1.5011961  2.7192283], loss=1.1102027893066406
#11: w=[-1.7709782 -1.5760455  2.809472 ], loss=1.0772548913955688
#12: w=[-1.8246365 -1.6476331  2.8970122], loss=1.0464177131652832
#13: w=[-1.877173  -1.7162416  2.9820302], loss=1.01747465133667
#14: w=[-1.9286293 -1.7821238  3.0646856], loss=0.9902386665344238
#15: 

In [7]:
w

tensor([[-4.2430],
        [-4.2432],
        [ 6.5565]], requires_grad=True)

In [8]:
x.mm(w).sigmoid()

tensor([[0.9986],
        [0.9100],
        [0.9100],
        [0.1268]], grad_fn=<SigmoidBackward>)
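
The expression `y * y_pred + (1 - y) * (1 - y_pred)` in the loop above picks out the probability that the model assigns to the correct label: $\hat{y}$ when $y = 1$ and $1 - \hat{y}$ when $y = 0$. Its negative log is therefore the binary cross-entropy. The numpy sketch below checks this identity using the rounded predictions printed above.

```python
import numpy as np

y = np.array([1.0, 1.0, 1.0, 0.0])
# Rounded predictions copied from the output above.
y_pred = np.array([0.9986, 0.9100, 0.9100, 0.1268])

# Probability assigned to the correct label of each instance.
ll = y * y_pred + (1 - y) * (1 - y_pred)
loss_trick = -np.log(ll).sum()

# Textbook binary cross-entropy, written with two log terms.
loss_bce = -(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred)).sum()

print(loss_trick, loss_bce)
```

The identity relies on every label being exactly 0 or 1. It does not hold for soft labels.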

### Multi-layer neural network using automatic differentiation

In [25]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
w1 = torch.randn(3, 2, dtype=dtype, requires_grad=True)
w2 = torch.randn(2, 1, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(w_2 \cdot \sigma(x \cdot w_1) + b_2)
    y_pred = x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()
    loss.backward()

    #print(f'#{t}: loss={loss.item()}')

    with torch.no_grad():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad
        
        # Clear the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()
        b2.grad.zero_()

In [26]:
print(w1)
print(w2)
print(b2)

tensor([[-5.9381,  7.3522],
        [-5.9374,  7.3486],
        [ 8.8981, -3.3369]], requires_grad=True)
tensor([[9.9424],
        [9.8626]], requires_grad=True)
tensor([[-14.5647]], requires_grad=True)


In [27]:
x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()

tensor([[0.0136],
        [0.9898],
        [0.9898],
        [0.0145]], grad_fn=<SigmoidBackward>)
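
To make the learned function concrete, the sketch below re-implements the forward pass in numpy with the rounded parameters printed above. Thresholding the outputs at 0.5 to obtain labels is an assumption made for this illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)

# Rounded parameters copied from the printed tensors above.
w1 = np.array([[-5.9381, 7.3522],
               [-5.9374, 7.3486],
               [8.8981, -3.3369]])
w2 = np.array([[9.9424], [9.8626]])
b2 = -14.5647

hidden = sigmoid(x @ w1)            # Hidden layer: shape (4, 2).
y_prob = sigmoid(hidden @ w2 + b2)  # Output layer: shape (4, 1).
y_label = (y_prob > 0.5).astype(int).ravel()

print(y_prob.ravel(), y_label)
```

The thresholded labels reproduce the XOR column of the truth table.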

### Single-layer neural network with high-level NN modules

In [29]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [30]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2650, -4.2642]])),
             ('0.bias', tensor([6.5885]))])

In [31]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9109],
        [0.9108],
        [0.1256]], grad_fn=<SigmoidBackward>)

### Multi-layer neural network with high-level NN modules

In [None]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

eta = 0.5
for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.

In [None]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-0.1494, -0.0430],
                      [-0.1074,  0.0421]])),
             ('0.bias', tensor([-0.6541, -0.5074])),
             ('2.weight', tensor([[-0.0254, -0.0099]])),
             ('2.bias', tensor([0.0118]))])

In [None]:
model(x).sigmoid()

tensor([[0.4998],
        [0.4999],
        [0.5001],
        [0.5001]], grad_fn=<SigmoidBackward>)

### Single-layer neural network with an optimizer

In [None]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [None]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2643, -4.2638]])),
             ('0.bias', tensor([6.5877]))])

In [None]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9108],
        [0.9108],
        [0.1256]], grad_fn=<SigmoidBackward>)

### Multi-layer neural networks using an optimizer

In [None]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [None]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-5.5683, -5.5688],
                      [ 7.3365,  7.3404]])),
             ('0.bias', tensor([ 8.3403, -3.3190])),
             ('2.weight', tensor([[9.4095, 9.2770]])),
             ('2.bias', tensor([-13.7123]))])

In [None]:
model(x).sigmoid()

tensor([[0.0183],
        [0.9860],
        [0.9860],
        [0.0200]], grad_fn=<SigmoidBackward>)

### Single-layer neural network with a customizable NN class

In [None]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class SingleLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super(SingleLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.linear1(x)

model = SingleLayerNN(2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [None]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[-4.2721, -4.2722]])),
             ('linear1.bias', tensor([6.5998]))])

In [None]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9111],
        [0.9111],
        [0.1251]], grad_fn=<SigmoidBackward>)

### Multi-layer neural network with a customizable NN class



In [None]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class ThreeLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super(ThreeLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_hidden, bias=True)
        self.linear2 = torch.nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        return self.linear2(self.linear1(x).sigmoid())

model = ThreeLayerNN(2, 2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.

    #print(f'#{t}: loss={loss.item()}')
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.

In [None]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[ 7.5323, -5.1217],
                      [ 6.7442,  2.6708]])),
             ('linear1.bias', tensor([-5.8752, -8.3220])),
             ('linear2.weight', tensor([[ 7.0725, -6.3992]])),
             ('linear2.bias', tensor([-0.0026]))])

In [None]:
model(x).sigmoid()

tensor([[0.5039],
        [0.4938],
        [0.9922],
        [0.0101]], grad_fn=<SigmoidBackward>)