[View in Colaboratory](https://colab.research.google.com/github/chokkan/deeplearning/blob/master/notebook/binary.ipynb)

# Feedforward Neural Networks

This Jupyter notebook explains various ways of implementing single-layer and multi-layer neural networks. The implementations are ordered from concrete (explicit) to abstract, so that one can see what deep learning frameworks do behind their black-boxed interfaces.

In order to focus on understanding the internals of training, this notebook uses a simple and classic example: *threshold logic units*.
Treating $x=0$ as *false* and $x=1$ as *true*, single-layer neural networks can realize logic units such as AND ($\wedge$), OR ($\vee$), NOT ($\lnot$), and NAND ($|$). Multi-layer neural networks can realize compound logic functions such as XOR.

| $x_1$ | $x_2$ | AND | OR | NAND | XOR |
| :---: |:-----:|:---:|:--:|:----:|:---:|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |
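
As a quick sanity check of the table, the sketch below hand-wires threshold logic units for AND, OR, and NAND and composes them into XOR. The weights are hand-picked for illustration (they are not learned and are not taken from this notebook), and `step`, `unit`, and the gate names are hypothetical helpers.

```python
def step(v):
    """Heaviside step function with step(0) = 0."""
    return 1 if v > 0 else 0

def unit(w1, w2, b):
    """Return a two-input threshold logic unit with the given weights."""
    return lambda x1, x2: step(w1 * x1 + w2 * x2 + b)

# Hand-picked (not learned) weights; many other choices work equally well.
AND = unit(1.0, 1.0, -1.5)
OR = unit(1.0, 1.0, -0.5)
NAND = unit(-1.0, -1.0, 1.5)

def XOR(x1, x2):
    # XOR needs two layers: AND applied to the outputs of OR and NAND.
    return AND(OR(x1, x2), NAND(x1, x2))

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
gates = [("AND", AND), ("OR", OR), ("NAND", NAND), ("XOR", XOR)]
table = {name: [f(*x) for x in inputs] for name, f in gates}
for name, column in table.items():
    print(name, column)
```

Each single unit is one layer; XOR only appears once the outputs of two units are fed into a third.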


## Using numpy

### Single-layer perceptron

A single-layer perceptron predicts a binary label $\hat{y} \in \{0, 1\}$ for a given input vector $\boldsymbol{x} \in \mathbb{R}^d$ ($d$ is the number of input dimensions) using the following formula,
$$
\hat{y} = g(\boldsymbol{w} \cdot \boldsymbol{x} + b) = g(w_1 x_1 + w_2 x_2 + \dots + w_d x_d + b)
$$

Here, $\boldsymbol{w} \in \mathbb{R}^d$ is a weight vector; $b \in \mathbb{R}$ is a bias weight; and $g(\cdot)$ denotes the Heaviside step function (we assume $g(0)=0$).

Let's train a NAND gate with two inputs ($d = 2$). More specifically, we want to find a weight vector $\boldsymbol{w}$ and a bias weight $b$ of a single-layer perceptron that realizes the truth table of the NAND gate: $\{0,1\}^2 \to \{0,1\}$.

| $x_1$ | $x_2$ | $y$  |
| :---: |:-----:|:----:|
| 0 | 0 | 1|
| 0 | 1 | 1|
| 1 | 0 | 1|
| 1 | 1 | 0|

We convert the truth table into a training set consisting of all mappings of the NAND gate,
$$
\begin{aligned}
\boldsymbol{x}_1 &= (0, 0), & y_1 &= 1 \\
\boldsymbol{x}_2 &= (0, 1), & y_2 &= 1 \\
\boldsymbol{x}_3 &= (1, 0), & y_3 &= 1 \\
\boldsymbol{x}_4 &= (1, 1), & y_4 &= 0
\end{aligned}
$$

To train the weight vector and the bias weight with the same code, we fold the bias into the weight vector by adding a constant dimension to the inputs. More concretely, we append $1$ to each input,
$$
\begin{aligned}
\boldsymbol{x}'_1 &= (0, 0, 1), & y_1 &= 1 \\
\boldsymbol{x}'_2 &= (0, 1, 1), & y_2 &= 1 \\
\boldsymbol{x}'_3 &= (1, 0, 1), & y_3 &= 1 \\
\boldsymbol{x}'_4 &= (1, 1, 1), & y_4 &= 0
\end{aligned}
$$

Then, the formula of the single-layer perceptron becomes,
$$
\hat{y} = g((w_1, w_2, w_3) \cdot \boldsymbol{x}') = g(w_1 x_1 + w_2 x_2 + w_3)
$$
In other words, $w_1$ and $w_2$ are the weights for $x_1$ and $x_2$, respectively, and $w_3$ plays the role of the bias weight $b$.
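
The bias trick can be checked numerically. The following is a minimal sketch, assuming illustrative values for `w` and `b` (they are not taken from a training run); it shows that appending a constant $1$ to each input and moving the bias into the weight vector leaves the pre-activation values unchanged.

```python
import numpy as np

w = np.array([-1.0, -0.5])    # illustrative weights (assumed)
b = 1.5                       # illustrative bias (assumed)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Append a constant 1 to every input and move the bias into the weight vector.
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)

explicit = X @ w + b          # w . x + b
folded = X_aug @ w_aug        # (w1, w2, w3) . x'
print(explicit, folded)
```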

The code below implements Rosenblatt's perceptron algorithm with a fixed number of iterations (100 times). We use a constant learning rate 0.5 for simplicity.


In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])   # NAND labels, matching the truth table above.
w = np.array([0.0, 0.0, 0.0])

eta = 0.5
for t in range(100):
    for i in range(len(y)):
        # Predict with the step function (g(0) = 0), then apply the perceptron update.
        y_pred = np.heaviside(np.dot(x[i], w), 0)
        w += (y[i] - y_pred) * eta * x[i]

In [2]:
w

array([-1. , -0.5,  1.5])

In [3]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])
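
A fixed iteration count is simple, but the perceptron stops changing once an entire pass over the data produces no mistakes. The sketch below is one possible variant with early stopping, written in plain Python; `step` and `train_perceptron` are hypothetical helper names, not part of the notebook.

```python
def step(v):
    """Heaviside step function with step(0) = 0."""
    return 1.0 if v > 0 else 0.0

def train_perceptron(xs, ys, eta=0.5, max_epochs=100):
    """Run perceptron updates until one full pass makes no mistakes."""
    w = [0.0] * len(xs[0])
    for epoch in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            y_pred = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y_pred != y:
                mistakes += 1
                w = [wi + eta * (y - y_pred) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            return w, epoch + 1   # number of passes, including the clean one
    return w, max_epochs

# NAND data with the constant 1 appended to each input.
xs = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
ys = [1, 1, 1, 0]
w, epochs = train_perceptron(xs, ys)
preds = [step(sum(wi * xi for wi, xi in zip(w, x))) for x in xs]
print(w, epochs, preds)
```

On the NAND data the loop terminates after a handful of passes, far fewer than 100.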

### Single-layer perceptron with mini-batch

It is desirable to reduce the amount of work executed by the Python interpreter, which is relatively slow. A common technique for speeding up machine-learning code written in Python is to execute the computations inside a matrix library (e.g., numpy).

The single-layer perceptron makes predictions for four inputs,
$$
\begin{aligned}
\hat{y}_1 &= g(\boldsymbol{x}_1 \cdot \boldsymbol{w}) \\
\hat{y}_2 &= g(\boldsymbol{x}_2 \cdot \boldsymbol{w}) \\
\hat{y}_3 &= g(\boldsymbol{x}_3 \cdot \boldsymbol{w}) \\
\hat{y}_4 &= g(\boldsymbol{x}_4 \cdot \boldsymbol{w})
\end{aligned}
$$

Here, we define $\hat{Y} \in \mathbb{R}^{4 \times 1}$ and $X \in \mathbb{R}^{4 \times d}$ as,
$$
\hat{Y} = \begin{pmatrix} 
  \hat{y}_1 \\ 
  \hat{y}_2 \\ 
  \hat{y}_3 \\ 
  \hat{y}_4 \\ 
\end{pmatrix},
X = \begin{pmatrix} 
  \boldsymbol{x}_1 \\ 
  \boldsymbol{x}_2 \\ 
  \boldsymbol{x}_3 \\ 
  \boldsymbol{x}_4 \\ 
\end{pmatrix}
$$

Then, we can compute the four predictions with a single matrix-vector product, applying $g$ element-wise,
$$
\hat{Y} = g(X \cdot \boldsymbol{w})
$$

The code below implements this idea. The function `np.heaviside()` yields a vector corresponding to the four predictions, applying the step function for every element of the argument.

This technique is frequently used in mini-batch training.

In [0]:
import numpy as np

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

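eta = 0.5   # Defined for consistency with the previous cell; not applied below.
for t in range(100):
    y_pred = np.heaviside(np.dot(x, w), 0)
    # Batch update over all four examples (effective learning rate of 1).
    w += np.dot((y - y_pred), x)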

In [5]:
w

array([-1., -1.,  2.])

In [6]:
np.heaviside(np.dot(x, w), 0)

array([1., 1., 1., 0.])
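
To confirm that the batched form is only a rearrangement of the per-example computation, the following sketch compares both on the NAND inputs. The weight vector is an example chosen for illustration, and the variable names `looped` and `batched` are ours.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
w = np.array([-1.0, -1.0, 2.0])   # example weights chosen for illustration

# One prediction per Python-level iteration.
looped = np.array([np.heaviside(np.dot(x_i, w), 0) for x_i in X])

# All four predictions in a single matrix-vector product.
batched = np.heaviside(X @ w, 0)

print(looped, batched)
```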

### Stochastic gradient descent (SGD) with mini-batch

In [0]:
import numpy as np

def sigmoid(v):
    return 1.0 / (1 + np.exp(-v))

# Training data for NAND.
x = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]
    ])
y = np.array([1, 1, 1, 0])
w = np.array([0.0, 0.0, 0.0])

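eta = 0.5   # Defined for consistency with the other cells; not applied below.
for t in range(100):
    y_pred = sigmoid(np.dot(x, w))
    # Gradient step on the negative log-likelihood (effective learning rate of 1).
    w -= np.dot((y_pred - y), x)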

In [8]:
w

array([-5.59504346, -5.59504346,  8.57206068])

In [9]:
sigmoid(np.dot(x, w))

array([0.99981071, 0.95152498, 0.95152498, 0.06798725])
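
The update `w -= np.dot((y_pred - y), x)` can be read as a gradient step on the negative log-likelihood summed over the batch, whose gradient is $X^\top (\sigma(X\boldsymbol{w}) - \boldsymbol{y})$. The sketch below checks that expression against central finite differences; the function name `nll` and the test point `w` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def nll(w, X, y):
    """Negative log-likelihood summed over the batch."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1.0, 1.0, 1.0, 0.0])
w = np.array([0.3, -0.2, 0.1])    # arbitrary test point

# Analytic gradient, the same expression used in the update rule.
analytic = (sigmoid(X @ w) - y) @ X

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic, numeric)
```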

## Automatic differentiation

### autograd

Install [autograd](https://github.com/HIPS/autograd) (this is needed only once per environment).

In [10]:
!pip install autograd

Collecting autograd
  Downloading https://files.pythonhosted.org/packages/08/7a/1ccee2a929d806ba3dbe632a196ad6a3f1423d6e261ae887e5fef2011420/autograd-1.2.tar.gz
Building wheels for collected packages: autograd
  Running setup.py bdist_wheel for autograd ... done
  Stored in directory: /content/.cache/pip/wheels/72/6f/c2/40f130cca2c91f31d354bf72de282922479c09ce0b7853c4c5
Successfully built autograd
Installing collected packages: autograd
Successfully installed autograd-1.2


In [11]:
import autograd
import autograd.numpy as np

def loss(w, x):
    return -np.log(1.0 / (1 + np.exp(-np.dot(x, w))))

x = np.array([1, 1, 1])
w = np.array([1.0, 1.0, -1.5])

grad_loss = autograd.grad(loss)
print(loss(w, x))
print(grad_loss(w, x))

0.47407698418010663
[-0.37754067 -0.37754067 -0.37754067]
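
The values reported by automatic differentiation can be reproduced by hand. For $\ell = -\log \sigma(\boldsymbol{x} \cdot \boldsymbol{w})$ the chain rule gives $\partial \ell / \partial w_i = -(1 - \sigma(\boldsymbol{x} \cdot \boldsymbol{w}))\, x_i$. The sketch below evaluates this closed form with the standard library only; the variable names are ours.

```python
import math

x = [1.0, 1.0, 1.0]
w = [1.0, 1.0, -1.5]

a = sum(xi * wi for xi, wi in zip(x, w))   # pre-activation, here 0.5
s = 1.0 / (1.0 + math.exp(-a))             # sigmoid(a)
loss = -math.log(s)
grad = [-(1.0 - s) * xi for xi in x]       # d loss / d w_i = -(1 - s) * x_i
print(loss, grad)
```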


### PyTorch

In [12]:
!pip install torch torchvision

Collecting torch
  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
Collecting torchvision
  Downloading https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl (54kB)
Collecting pillow>=4.1.1 (from torchvision)
  Downloading https://files.pythonhosted.org/packages/d1/24/f53ff6b61b3d728b90934bdd

In [13]:
import torch

dtype = torch.float

x = torch.tensor([1, 1, 1], dtype=dtype)
w = torch.tensor([1.0, 1.0, -1.5], dtype=dtype, requires_grad=True)

loss = -torch.dot(x, w).sigmoid().log()
loss.backward()
print(loss.item())
print(w.grad)

0.4740769565105438
tensor([-0.3775, -0.3775, -0.3775])


### Chainer

In [14]:
!pip install chainer

Collecting chainer
  Downloading https://files.pythonhosted.org/packages/ad/c6/61ff9041ea7427fc1e39768f740ab8b880f8ef20960a5f791e978e8d81c0/chainer-4.3.1.tar.gz (400kB)
Collecting filelock (from chainer)
  Downloading https://files.pythonhosted.org/packages/2d/ba/db7e0717368958827fa97af0b8acafd983ac3a6ecd679f60f3ccd6e5b16e/filelock-3.0.4.tar.gz
Building wheels for collected packages: chainer, filelock
  Running setup.py bdist_wheel for chainer ... done
  Stored in directory: /content/.cache/pip/wheels/8a/ef/b0/e67e0555c4d520566d6565d9634ecb7fbb1594758236bb7b40
  Running setup.py bdist_wheel for filelock ... done
  Stored in directory: /content/.cache/pip/wheels/35/ba/67/4cc48738870c3b54f9e3b5d78bf9de130befb70c1d359faf8b
Successfully built chainer filelock
Installing collected packages: filelock, chainer
Successfully installed chainer-4.3.1 filelock-3.0.4


In [15]:
import numpy as np
from chainer import Variable
import chainer.functions as F

dtype = np.float32

x = np.array([1,1,1], dtype=dtype)
w = Variable(np.array([1.0,1.0,-1.5], dtype=dtype), requires_grad=True)

loss = -F.log(F.sigmoid(np.dot(x,w)))
loss.backward()
print(loss.data)
print(w.grad)

0.47407696
[-0.37754062 -0.37754062 -0.37754062]


### TensorFlow

In [16]:
import tensorflow as tf

x = tf.constant([1., 1., 1.])
w = tf.Variable([1.0, 1.0, -1.5])

loss = -tf.log(tf.sigmoid(tf.tensordot(x, w, axes=1)))
grad = tf.gradients(loss, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_value = sess.run(loss)
    grad_value = sess.run(grad)
    print(loss_value)
    print(grad_value)

0.47407696
[array([-0.37754062, -0.37754062, -0.37754062], dtype=float32)]


### MXNet

In [17]:
!pip install mxnet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/bb/53/5d33f71c5224a676112679458714eb728f6db8cae15f39fcdf27226f6e41/mxnet-1.2.1.post1-py2.py3-none-manylinux1_x86_64.whl (24.2MB)
Collecting graphviz<0.9.0,>=0.8.1 (from mxnet)
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: graphviz, mxnet
Successfully installed graphviz-0.8.4 mxnet-1.2.1.post1


In [18]:
import mxnet as mx
from mxnet import nd, autograd, gluon

x = nd.array([1., 1., 1.])
w = nd.array([1.0, 1.0, -1.5])
w.attach_grad()

with autograd.record():
    loss = -nd.dot(x, w).sigmoid().log()
loss.backward()
print(loss)
print(w.grad)


[0.47407696]
<NDArray 1 @cpu(0)>

[-0.37754065 -0.37754065 -0.37754065]
<NDArray 3 @cpu(0)>


## Single-layer neural network using automatic differentiation

### PyTorch

In [0]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
w = torch.randn(3, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = x.mm(w).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()      # The loss value.
    #print(t, loss.item())
    loss.backward()             # Compute the gradients of the loss.

    with torch.no_grad():
        w -= eta * w.grad       # Update weights using SGD.        
        w.grad.zero_()          # Clear the gradients for the next iteration.

In [20]:
w

tensor([[-4.2454],
        [-4.2453],
        [ 6.5599]], requires_grad=True)

In [21]:
x.mm(w).sigmoid()

tensor([[0.9986],
        [0.9101],
        [0.9101],
        [0.1267]], grad_fn=<SigmoidBackward>)
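
The expression `ll = y * y_pred + (1 - y) * (1 - y_pred)` selects the probability assigned to the correct label, so `-log(ll)` coincides with the usual binary cross-entropy whenever the label is exactly 0 or 1. The sketch below checks this for a few values; both function names are hypothetical.

```python
import math

def loss_via_likelihood(y, p):
    """-log of the probability assigned to the true label."""
    ll = y * p + (1 - y) * (1 - p)
    return -math.log(ll)

def binary_cross_entropy(y, p):
    """Textbook binary cross-entropy for a single example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

pairs = [(y, p) for y in (0, 1) for p in (0.1, 0.5, 0.9)]
for y, p in pairs:
    print(y, p, loss_via_likelihood(y, p), binary_cross_entropy(y, p))
```

The two forms differ for soft labels (values strictly between 0 and 1), which do not occur in this notebook.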

### Chainer

In [0]:
import numpy as np
import chainer
from chainer import Variable
import chainer.functions as F

dtype = np.float32

# Training data for NAND
x = Variable(np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype))
y = Variable(np.array([[1], [1], [1], [0]], dtype=dtype))
w = Variable(np.random.rand(3, 1).astype(dtype=dtype), requires_grad=True)

eta = 0.5
for t in range(100):
    # y_pred = \sigma(x \cdot w)
    y_pred = F.sigmoid(F.matmul(x, w))
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -F.sum(F.log(ll))    # The loss value.
    #print(t, loss)
    loss.backward()             # Compute the gradients of the loss.

    with chainer.no_backprop_mode():
        w -= eta * w.grad       # Update weights using SGD.
        w.cleargrad()           # Clear the gradients for the next iteration.

In [23]:
w

variable([[-4.245978 ],
          [-4.2458024],
          [ 6.5606837]])

In [24]:
F.sigmoid(F.matmul(x, w))

variable([[0.9985871 ],
          [0.910102  ],
          [0.9100877 ],
          [0.12662926]])

### TensorFlow

In [25]:
import tensorflow as tf

# Training data for NAND
x_data = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y_data = [[1], [1], [1], [0]]

x = tf.placeholder(tf.float32, [4, 3])
y = tf.placeholder(tf.float32, [4, 1])
w = tf.Variable(tf.random_normal([3,1]))

# y_pred = \sigma(x \cdot w)
y_pred = tf.sigmoid(tf.matmul(x, w))
ll = y * y_pred + (1 - y) * (1 - y_pred)
loss = -tf.reduce_sum(tf.log(ll))
grad = tf.gradients(loss, w)

eta = 0.5
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for t in range(100):
        grads = sess.run(grad, feed_dict={x: x_data, y: y_data})
        sess.run(w.assign_sub(eta * grads[0]))
    print(sess.run(w))
    print(sess.run(y_pred, feed_dict={x: x_data, y: y_data}))

[[-4.1939316]
 [-4.193681 ]
 [ 6.483271 ]]
[[0.99847347]
 [0.9080112 ]
 [0.9079903 ]
 [0.12961791]]


### MXNet

In [0]:
import mxnet as mx
from mxnet import nd, autograd

# Training data for NAND.
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[1], [1], [1], [0]])
w = nd.random.normal(0, 1, shape=(3, 1))
w.attach_grad()

eta = 0.5
for t in range(100):
    with autograd.record():
        # y_pred = \sigma(x \cdot w).
        y_pred = nd.dot(x, w).sigmoid()
        ll = y * y_pred + (1 - y) * (1 - y_pred)
        loss = -ll.log().sum()      # The loss value.
        #print(t, loss)
    loss.backward()                 # Compute the gradients of the loss.
    w -= eta * w.grad               # Update weights using SGD.

In [27]:
w


[[-4.2020216]
 [-4.20314  ]
 [ 6.4963117]]
<NDArray 3x1 @cpu(0)>

In [28]:
nd.dot(x, w).sigmoid()


[[0.9984933 ]
 [0.90831   ]
 [0.90840304]
 [0.12911019]]
<NDArray 4x1 @cpu(0)>

## Multi-layer neural network using automatic differentiation

### PyTorch

In [0]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
w1 = torch.randn(3, 2, dtype=dtype, requires_grad=True)
w2 = torch.randn(2, 1, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(\sigma(x \cdot w_1) \cdot w_2 + b_2)
    y_pred = x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -ll.log().sum()
    #print(t, loss.item())
    loss.backward()
    
    with torch.no_grad():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad
        
        # Clear the gradients for the next iteration.
        w1.grad.zero_()
        w2.grad.zero_()
        b2.grad.zero_()

In [30]:
print(w1)
print(w2)
print(b2)

tensor([[ 6.8666, -6.5697],
        [-7.0005,  6.2410],
        [-3.7858, -3.4025]], requires_grad=True)
tensor([[11.2838],
        [11.3932]], requires_grad=True)
tensor([[-5.5641]], requires_grad=True)


In [31]:
x.mm(w1).sigmoid().mm(w2).add(b2).sigmoid()

tensor([[0.0071],
        [0.9945],
        [0.9946],
        [0.0062]], grad_fn=<SigmoidBackward>)
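
To see how the learned network realizes XOR, the forward pass can be replayed by hand. The sketch below copies the printed weights (rounded to four decimals, so outputs match only approximately) into plain Python lists; `forward` and `results` are hypothetical names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Weights copied from the printed tensors (rounded to four decimals).
w1 = [[6.8666, -6.5697],
      [-7.0005, 6.2410],
      [-3.7858, -3.4025]]   # the last row acts on the constant 1 (bias)
w2 = [11.2838, 11.3932]
b2 = -5.5641

def forward(x):
    """Return (hidden activations, output) for one augmented input x."""
    hidden = [sigmoid(sum(x[i] * w1[i][j] for i in range(3))) for j in range(2)]
    output = sigmoid(sum(h * w for h, w in zip(hidden, w2)) + b2)
    return hidden, output

results = {}
for x in [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]:
    hidden, output = forward(x)
    results[x[:2]] = (hidden, output)
    print(x[:2], [round(h, 3) for h in hidden], round(output, 4))
```

In this particular run, one hidden unit responds mainly to the input $(1, 0)$ and the other to $(0, 1)$, and the output unit behaves like an OR over the two. A different random initialization may lead to a different but equally valid decomposition.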

### Chainer

In [0]:
import numpy as np
import chainer
from chainer import Variable
import chainer.functions as F

dtype = np.float32

# Training data for XOR.
x = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=dtype)
y = np.array([[0], [1], [1], [0]],dtype=dtype)
w1 = Variable(np.random.randn(3, 2).astype(dtype),requires_grad=True)
w2 = Variable(np.random.randn(2, 1).astype(dtype),requires_grad=True)
b2 = Variable(np.random.randn(1).astype(dtype), requires_grad=True)

eta = 0.5
for t in range(1000):
    # y_pred = \sigma(\sigma(x \cdot w_1) \cdot w_2 + b_2)
    y_pred = F.sigmoid(F.bias(F.matmul(F.sigmoid(F.matmul(x, w1)), w2), b2))
    ll = y * y_pred + (1 - y) * (1 - y_pred)
    loss = -F.sum(F.log(ll))
    #print(t, loss.data)
    loss.backward()
    with chainer.no_backprop_mode():
        # Update weights using SGD.
        w1 -= eta * w1.grad
        w2 -= eta * w2.grad
        b2 -= eta * b2.grad

        # Clear the gradients for the next iteration.
        w1.cleargrad()
        w2.cleargrad()
        b2.cleargrad()

In [33]:
print(w1)
print(w2)
print(b2)

variable([[ 6.359068  -7.071981 ]
          [-6.6552386  6.8526063]
          [-3.472819  -3.7479024]])
variable([[11.586926]
          [11.482569]])
variable([-5.663216])


In [34]:
F.sigmoid(F.bias(F.matmul(F.sigmoid(F.matmul(x,w1)) ,w2), b2))

variable([[0.00636774],
          [0.9951651 ],
          [0.9950907 ],
          [0.00554883]])

### TensorFlow

In [35]:
import tensorflow as tf

# Training data for XOR.
x_data = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y_data = [[0], [1], [1], [0]]

x = tf.placeholder(tf.float32, [4, 3])
y = tf.placeholder(tf.float32, [4, 1])
w1 = tf.Variable(tf.random_normal([3, 2]))
w2 = tf.Variable(tf.random_normal([2, 1]))
b2 = tf.Variable(tf.random_normal([1, 1]))

y_pred = tf.sigmoid(tf.add(tf.matmul(tf.sigmoid(tf.matmul(x, w1)), w2), b2))
ll = y * y_pred + (1 - y) * (1 - y_pred)
log = tf.log(ll)
loss = -tf.reduce_sum(log)
grad = tf.gradients(loss, [w1, w2, b2])

eta = 0.5
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(1000):
        w1_grad, w2_grad, b2_grad = sess.run(grad, feed_dict={x: x_data, y: y_data})
        sess.run(tf.assign_sub(w1, eta * w1_grad))
        sess.run(tf.assign_sub(w2, eta * w2_grad))
        sess.run(tf.assign_sub(b2, eta * b2_grad))
        
    print(sess.run(w1))
    print(sess.run(w2))
    print(sess.run(b2))
    print(sess.run(y_pred, feed_dict={x: x_data, y: y_data}))

[[ 7.0503726  6.132738 ]
 [-6.8502035 -6.4016724]
 [ 3.4437704 -3.2621946]]
[[-10.97688 ]
 [ 11.618423]]
[[5.131303]]
[[0.00619888]
 [0.99167174]
 [0.9942344 ]
 [0.0052963 ]]


### MXNet

In [0]:
import mxnet as mx
from mxnet import nd, autograd

# Training data for XOR.
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[0], [1], [1], [0]])

w1 = nd.random.normal(0, 1, shape=(3, 2))
w2 = nd.random.normal(0, 1, shape=(2, 1))
b2 = nd.random.normal(0, 1, shape=(1, 1))
w1.attach_grad()
w2.attach_grad()
b2.attach_grad()

eta = 0.5
for t in range(1000):
    with autograd.record():
        # y_pred = \sigma(\sigma(x \cdot w_1) \cdot w_2 + b_2)
        y_pred = (nd.dot(nd.dot(x, w1).sigmoid(), w2) + b2).sigmoid()
        ll = y * y_pred + (1 - y) * (1 - y_pred)
        loss = -ll.log().sum()
    loss.backward()
    
    # Update weights using SGD.
    w1 -= eta * w1.grad
    w2 -= eta * w2.grad
    b2 -= eta * b2.grad

In [37]:
print(w1)
print(w2)
print(b2)


[[-6.5379534  8.085663 ]
 [ 6.8499675 -7.9502664]
 [ 3.2666583  4.045889 ]]
<NDArray 3x2 @cpu(0)>

[[-9.515034]
 [-9.384542]]
<NDArray 2x1 @cpu(0)>

[[13.912197]]
<NDArray 1x1 @cpu(0)>


In [38]:
print((nd.dot(nd.dot(x, w1).sigmoid(), w2) + b2).sigmoid())


[[0.01124511]
 [0.98540175]
 [0.9849283 ]
 [0.01007298]]
<NDArray 4x1 @cpu(0)>


## Single-layer neural network with high-level NN modules



### PyTorch

In [39]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
# Note: newer PyTorch releases deprecate `size_average=False` in favor of `reduction='sum'`.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.



In [40]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2161, -4.2162]])),
             ('0.bias', tensor([6.5164]))])

In [41]:
model(x).sigmoid()

tensor([[0.9985],
        [0.9089],
        [0.9089],
        [0.1283]], grad_fn=<SigmoidBackward>)
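
Note that the model outputs raw scores (logits) and the sigmoid is applied inside the loss. A common motivation for this design is numerical stability. The sketch below compares a naive cross-entropy with an algebraically equivalent stable form; it is an illustration written with the standard library and is not claimed to mirror the exact implementation inside any framework. The function names and sample logits are assumptions.

```python
import math

def bce_naive(z, y):
    """Apply the sigmoid first, then take logs (can hit log(0))."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """max(z, 0) - z*y + log(1 + exp(-|z|)), the same loss in stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logits = [6.5, 2.3, -2.3, -1.9]   # sample scores chosen for illustration
labels = [1, 1, 0, 0]
for z, y in zip(logits, labels):
    print(z, y, bce_naive(z, y), bce_with_logits(z, y))

# For a large logit the naive form breaks down: sigmoid(40) rounds to 1.0
# in double precision, so log(1 - p) becomes log(0).
try:
    bce_naive(40.0, 0)
except ValueError:
    print("naive form failed on z = 40")
print(bce_with_logits(40.0, 0))
```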

### Chainer

In [0]:
import chainer
import numpy as np
from chainer import Variable, Function
import chainer.functions as F
import chainer.links as L

dtype = np.float32

# Training data for NAND
x = Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = Variable(np.array([[1], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 1, nobias=False)            # 2 dims (with bias) -> 1 dim
)
# Binary cross-entropy loss after sigmoid function.
loss_fn=F.sigmoid_cross_entropy

eta = 0.5
for t in range(100):
    y_pred = model(x)                       # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)
    # print(t, loss.data)
    model.cleargrads()                      # Zero-clear the gradients.
    loss.backward()                         # Compute the gradients.

    with chainer.no_backprop_mode():
        for para in model.params():
            para.data -= eta * para.grad    # Update the parameters using SGD.

In [43]:
for para in model.params():
    print(para)

variable W([[-2.1200492 -2.1241121]])
variable b([3.4434984])


In [44]:
F.sigmoid(model(x))

variable([[0.9690367 ],
          [0.78907955],
          [0.789755  ],
          [0.30988365]])
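
These weights are noticeably smaller in magnitude than those from the PyTorch cell, although both use `eta = 0.5` and 100 iterations. One plausible explanation, stated here as an assumption rather than something this notebook confirms, is that `F.sigmoid_cross_entropy(..., normalize=False)` averages the loss over the batch while `BCEWithLogitsLoss(size_average=False)` sums it, which makes the effective step four times smaller on this data. The numpy sketch below illustrates only the effect of summing versus averaging; `train` is a hypothetical helper and zero initialization is used instead of random initialization.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# NAND data with the constant 1 appended to each input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1.0, 1.0, 1.0, 0.0])

def train(reduction, eta=0.5, steps=100):
    """Gradient descent on the cross-entropy with a summed or averaged loss."""
    w = np.zeros(3)   # zero init (assumption), unlike the random init above
    scale = 1.0 if reduction == "sum" else 1.0 / len(y)
    for _ in range(steps):
        grad = scale * (sigmoid(X @ w) - y) @ X
        w -= eta * grad
    return w

w_sum = train("sum")
w_mean = train("mean")
print(w_sum, w_mean)
```

Both runs classify the four inputs correctly; the averaged loss simply progresses more slowly for the same nominal learning rate.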

## Multi-layer neural network with high-level NN modules

### PyTorch

In [45]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
# Note: newer PyTorch releases deprecate `size_average=False` in favor of `reduction='sum'`.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

eta = 0.5
for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y)           # Compute the loss.
    #print(t, loss.item())
    
    model.zero_grad()                   # Zero-clear the gradients.
    loss.backward()                     # Compute the gradients.
        
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad   # Update the parameters using SGD.



In [46]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-5.8098,  5.7082],
                      [ 3.7659, -4.1443]])),
             ('0.bias', tensor([-3.5452, -2.0044])),
             ('2.weight', tensor([[6.4493, 5.9124]])),
             ('2.bias', tensor([-2.9767]))])

In [47]:
model(x).sigmoid()

tensor([[0.1097],
        [0.9438],
        [0.8879],
        [0.0900]], grad_fn=<SigmoidBackward>)

### Chainer

In [0]:
import chainer
import numpy as np
from chainer import Variable, Function
import chainer.functions as F
import chainer.links as L

dtype = np.float32

# Training data for XOR.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[0], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
init=chainer.initializers.HeNormal()
model = chainer.Sequential(
    L.Linear(2, 2, nobias=False, initialW=init), # 2 dims (with bias) -> 2 dims
    F.sigmoid,                                   # Sigmoid function
    L.Linear(2, 1, nobias=False, initialW=init), # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn=F.sigmoid_cross_entropy

eta = 0.5
for t in range(1000):
    y_pred = model(x)                            # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)
    # print(t, loss.data)
    model.cleargrads()                           # Zero-clear the gradients.
    loss.backward()                              # Compute the gradients.

    with chainer.no_backprop_mode():
        for para in model.params():
            para.data -= eta * para.grad     # Update the parameters using SGD.

In [49]:
for para in model.params():
    print(para)

variable W([[-5.715909  -3.6899567]
            [ 6.2749352 -4.1101246]])
variable b([0.88635695 2.5805264 ])
variable W([[-5.1726246 -4.258353 ]])
variable b([4.2472806])


In [50]:
F.sigmoid(model(x))

variable([[0.03311139],
          [0.96059114],
          [0.48713586],
          [0.5061473 ]])

## Single-layer neural network with an optimizer

### PyTorch

In [51]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
# Note: newer PyTorch releases deprecate `size_average=False` in favor of `reduction='sum'`.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.



In [52]:
model.state_dict()

OrderedDict([('0.weight', tensor([[-4.2385, -4.2383]])),
             ('0.bias', tensor([6.5496]))])

In [53]:
model(x).sigmoid()

tensor([[0.9986],
        [0.9098],
        [0.9098],
        [0.1271]], grad_fn=<SigmoidBackward>)
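
An optimizer object packages the manual update loop from the earlier sections. The sketch below is a deliberately tiny stand-in written in plain Python to show the division of labor between `zero_grad()`, the gradient computation, and `step()`. `Param` and `TinySGD` are hypothetical classes, not part of PyTorch, and the quadratic objective is chosen only for illustration.

```python
class Param:
    """Hypothetical stand-in for a trainable scalar parameter."""
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class TinySGD:
    """Hypothetical minimal optimizer: value <- value - lr * grad."""
    def __init__(self, params, lr):
        self.params = list(params)
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0

    def step(self):
        for p in self.params:
            p.value -= self.lr * p.grad

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = Param(0.0)
opt = TinySGD([w], lr=0.25)
for _ in range(20):
    opt.zero_grad()
    w.grad += 2.0 * (w.value - 3.0)   # gradients accumulate, hence zero_grad()
    opt.step()
print(w.value)
```

Swapping the update rule inside `step()` is all that is needed to move from plain SGD to another method, which is why the training loop itself does not change when the optimizer does.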

### Chainer

In [0]:
import chainer
import numpy as np
from chainer import functions as F
from chainer import links as L
chainer.config.train = True

dtype=np.float32

# Training data for NAND.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[1], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 1, nobias=False),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = F.sigmoid_cross_entropy

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = chainer.optimizers.SGD(lr=0.5)
optimizer.setup(model)

for t in range(100):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)   # Compute the loss.
    #print(t, loss.data)
    
    model.cleargrads()          # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.update()          # Update the parameters using the gradients.

In [55]:
for para in model.params():
    print(para)

variable W([[-2.2661328 -2.1941178]])
variable b([3.5988352])


In [56]:
F.sigmoid(model(x))

variable([[0.9733728 ],
          [0.8029314 ],
          [0.7912873 ],
          [0.29704365]])

### TensorFlow (Keras)

In [57]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Activation
from tensorflow.keras import optimizers

# Training data for NAND.
x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_data = np.array([[1], [1], [1], [0]])

# Define a neural network using high-level modules.
model = Sequential([
    Flatten(),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=optimizers.SGD(lr=0.5),
    loss='binary_crossentropy',
    metrics=['accuracy']
    )

model.fit(x_data, y_data, epochs=100)

<tensorflow.python.keras.callbacks.History at 0x7faad2c70ef0>

In [58]:
model.get_weights()

[array([[-2.03777  ],
        [-2.1895845]], dtype=float32), array([3.432624], dtype=float32)]

In [59]:
model.predict(x_data)

array([[0.96870875],
       [0.77609265],
       [0.80136603],
       [0.31115386]], dtype=float32)

### MXNet

In [0]:
import mxnet as mx
from mxnet import nd, autograd, gluon

# Training data for NAND.
x = nd.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = nd.array([[1], [1], [1], [0]])

# Define a neural network using high-level modules.
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(1))
net.collect_params().initialize(mx.init.Normal(sigma=1.))
  
# Binary cross-entropy loss after sigmoid function.
loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# Optimizer based on SGD
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})

for t in range(100):
    with autograd.record():
        # Make predictions.
        y_pred = net(x)
        # Compute the loss.
        loss = loss_fn(y_pred, y)
    # Compute the gradients of the loss.
    loss.backward()
    # Update weights using SGD.
    # batch_size is set to one to be consistent with the slide;
    # Trainer.step() rescales the gradients by 1 / batch_size.
    trainer.step(batch_size=1)

In [61]:
for v in net.collect_params().values():
    print(v, v.data())

Parameter sequential0_dense0_weight (shape=(1, 2), dtype=float32) 
[[-4.260149 -4.260375]]
<NDArray 1x2 @cpu(0)>
Parameter sequential0_dense0_bias (shape=(1,), dtype=float32) 
[6.582048]
<NDArray 1 @cpu(0)>


In [62]:
net(x).sigmoid()


[[0.99861693]
 [0.9106561 ]
 [0.9106745 ]
 [0.12581538]]
<NDArray 4x1 @cpu(0)>

## Multi-layer neural networks using an optimizer

### PyTorch

In [63]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network using high-level modules.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 2, bias=True),   # 2 dims (with bias) -> 2 dims
    torch.nn.Sigmoid(),                 # Sigmoid function
    torch.nn.Linear(2, 1, bias=True),   # 2 dims (with bias) -> 1 dim
)

# Binary cross-entropy loss after sigmoid function.
# Note: newer PyTorch releases deprecate `size_average=False` in favor of `reduction='sum'`.
loss_fn = torch.nn.BCEWithLogitsLoss(size_average=False)

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.



In [64]:
model.state_dict()

OrderedDict([('0.weight', tensor([[ 5.9422, -6.2540],
                      [-6.6625,  6.4892]])),
             ('0.bias', tensor([-3.2577, -3.5896])),
             ('2.weight', tensor([[10.4214, 10.3275]])),
             ('2.bias', tensor([-5.0840]))])

In [65]:
model(x).sigmoid()

tensor([[0.0119],
        [0.9910],
        [0.9907],
        [0.0103]], grad_fn=<SigmoidBackward>)
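
To see how the two hidden units combine into XOR, the forward pass can be replayed in NumPy with the parameters printed above. This is a sketch under the assumption that the rounded values from `state_dict()` are precise enough; the variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Rounded parameters copied from model.state_dict() above.
# torch.nn.Linear stores weights with shape (out_features, in_features).
W1 = np.array([[5.9422, -6.2540],
               [-6.6625, 6.4892]])
b1 = np.array([-3.2577, -3.5896])
W2 = np.array([[10.4214, 10.3275]])
b2 = np.array([-5.0840])

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)

hidden = sigmoid(x @ W1.T + b1)      # activations of the two hidden units
probs = sigmoid(hidden @ W2.T + b2)  # output probabilities

print(np.round(hidden, 3))
print(np.round(probs.ravel(), 4))
```

In this particular run, the first hidden unit fires only for (1, 0) and the second only for (0, 1); the output layer then behaves like an OR over the two hidden units.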

### Chainer

In [0]:
import chainer
import numpy as np
from chainer import functions as F
from chainer import links as L

dtype=np.float32

# Training data for XOR.
x = chainer.Variable(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype))
y = chainer.Variable(np.array([[0], [1], [1], [0]], dtype=np.int32))

# Define a neural network using high-level modules.
model = chainer.Sequential(
    L.Linear(2, 2, nobias=False),
    F.sigmoid,
    L.Linear(2, 1, nobias=False),
)

# Binary cross-entropy loss after sigmoid function.
loss_fn = F.sigmoid_cross_entropy

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = chainer.optimizers.SGD(lr=0.5)
optimizer.setup(model)

for t in range(1000):
    y_pred = model(x)                   # Make predictions.
    loss = loss_fn(y_pred, y, normalize=False)  # Compute the loss.
    #print(t, loss.data)
    
    model.cleargrads()          # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.update()          # Update the parameters using the gradients.

In [67]:
for para in model.params():
    print(para)

variable W([[ 1.7916094   0.20070258]
            [-5.864411   -5.7855535 ]])
variable b([-0.447787  1.010492])
variable W([[-1.5884778 -5.9605026]])
variable b([1.7095171])


In [68]:
F.sigmoid(model(x))

variable([[0.03627935],
          [0.7237286 ],
          [0.5995398 ],
          [0.5987538 ]])
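
The Chainer run above has not separated the four XOR inputs after 1000 iterations. One difference worth checking is the loss reduction: the PyTorch cell sums the per-example losses, whereas `F.sigmoid_cross_entropy` with its default `reduce='mean'` divides by a count (the batch size when `normalize=False`, as far as the Chainer documentation describes it). The sketch below uses a hypothetical single-layer model with arbitrary weights to show that the two reductions give gradients differing by a factor of the batch size, so the same `lr` means a smaller step under the mean.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
y = np.array([0, 1, 1, 0], dtype=np.float64)

# Arbitrary illustrative parameters (not taken from any run above).
w = np.array([0.5, -0.5])
b = 0.1

z = x @ w + b
# For binary cross-entropy on logits, d(loss_i)/d(z_i) = sigmoid(z_i) - y_i.
residual = sigmoid(z) - y

grad_sum = x.T @ residual        # gradient of the summed loss w.r.t. w
grad_mean = grad_sum / len(x)    # gradient of the averaged loss w.r.t. w

print(grad_sum, grad_mean)
```

Under this assumption, multiplying the learning rate by the batch size (here 4) or training longer would make the Chainer setup comparable to the PyTorch one.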

### TensorFlow (Keras)

In [69]:
import numpy as np
from tensorflow.keras import optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

# Training data for XOR.
x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_data = np.array([[0], [1], [1], [0]])

# Define a neural network using high-level modules.
model = Sequential([
    Flatten(),
    Dense(2, activation='sigmoid'),    # 2 dims (with bias) -> 2 dims
    Dense(1, activation='sigmoid')     # 2 dims (with bias) -> 1 dim
])

model.compile(
    optimizer=optimizers.SGD(lr=0.5),
    loss='binary_crossentropy',
    metrics=['accuracy']
    )

model.fit(x_data, y_data, epochs=1000, verbose=0)


<tensorflow.python.keras.callbacks.History at 0x7faad6c94cc0>

In [70]:
model.get_weights()

[array([[5.910432 , 3.7894218],
        [5.8785963, 3.7844138]], dtype=float32),
 array([-2.432777 , -5.7740874], dtype=float32),
 array([[ 7.7418504],
        [-8.216927 ]], dtype=float32),
 array([-3.507641], dtype=float32)]

In [71]:
model.predict(x_data)


array([[0.0517463 ],
       [0.9528718 ],
       [0.95300215],
       [0.05638292]], dtype=float32)
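
Note that Keras stores a `Dense` kernel with shape (inputs, units), so the forward pass is `x @ W + b` without a transpose, unlike the (out, in) layout printed by PyTorch. The following NumPy sketch, assuming the values are copied from `model.get_weights()` above, replays the prediction.

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Values copied from the model.get_weights() output above.
W1 = np.array([[5.910432, 3.7894218],
               [5.8785963, 3.7844138]])   # shape (inputs, units)
b1 = np.array([-2.432777, -5.7740874])
W2 = np.array([[7.7418504],
               [-8.216927]])              # shape (inputs, units)
b2 = np.array([-3.507641])

x_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)

# Both Dense layers apply a sigmoid, so the output is already a probability.
hidden = sigmoid(x_data @ W1 + b1)
probs = sigmoid(hidden @ W2 + b2)

print(np.round(probs.ravel(), 4))
```

Because the last layer already applies a sigmoid, `model.predict` returns probabilities directly, which is why no extra `sigmoid()` call is needed in this cell.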

### MXNet

In [0]:
import mxnet as mx
from mxnet import nd, autograd, gluon

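# Training data for XOR. The third column is a constant 1, i.e. an explicit
# bias feature in addition to the bias that Dense already provides.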
x = nd.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = nd.array([[0], [1], [1], [0]])

# Define a neural network using high-level modules.
net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(2))
    net.add(gluon.nn.Activation('sigmoid'))
    net.add(gluon.nn.Dense(1))
net.collect_params().initialize(mx.init.Normal(sigma=1.))

# Binary cross-entropy loss after sigmoid function.
loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

# Optimizer based on SGD
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.5})

for t in range(1000):
    with autograd.record():
        # Make predictions.
        y_pred = net(x)
        # Compute the loss.
        loss = loss_fn(y_pred, y)
    # Compute the gradients of the loss.
    loss.backward()
    # Update weights using SGD.
    # batch_size=1 leaves the summed gradient unscaled, to be consistent with the slide.
    trainer.step(batch_size=1)

In [73]:
for v in net.collect_params().values():
    print(v, v.data())

Parameter sequential1_dense0_weight (shape=(2, 3), dtype=float32) 
[[-6.3653517 -6.364059   5.089471 ]
 [ 7.591934   7.5853524 -1.585847 ]]
<NDArray 2x3 @cpu(0)>
Parameter sequential1_dense0_bias (shape=(2,), dtype=float32) 
[ 4.5148544 -1.9408077]
<NDArray 2 @cpu(0)>
Parameter sequential1_dense1_weight (shape=(1, 2), dtype=float32) 
[[9.933848 9.837673]]
<NDArray 1x2 @cpu(0)>
Parameter sequential1_dense1_bias (shape=(1,), dtype=float32) 
[-14.568179]
<NDArray 1 @cpu(0)>


In [74]:
net(x).sigmoid()


[[0.01269207]
 [0.99064106]
 [0.99064684]
 [0.0132224 ]]
<NDArray 4x1 @cpu(0)>
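
The first layer above has a (2, 3) weight matrix because each input carries a constant 1 as its third feature. That column acts as a second bias: its weights can be folded into the bias vector without changing the result. A minimal NumPy sketch, assuming the parameters are copied from the output above (the names `W_folded` and `b_folded` are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Parameters copied from the printed output above.
W1 = np.array([[-6.3653517, -6.364059, 5.089471],
               [7.591934, 7.5853524, -1.585847]])
b1 = np.array([4.5148544, -1.9408077])
W2 = np.array([[9.933848, 9.837673]])
b2 = np.array([-14.568179])

x_aug = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=np.float64)

# Pre-activations with the constant-1 feature kept in the input.
pre_aug = x_aug @ W1.T + b1

# Fold the weights of the constant feature into the bias and drop the column.
W_folded = W1[:, :2]
b_folded = b1 + W1[:, 2]
pre_folded = x_aug[:, :2] @ W_folded.T + b_folded

probs = sigmoid(sigmoid(pre_folded) @ W2.T + b2)
print(np.round(probs.ravel(), 4))
```

Both formulations give the same pre-activations, so the two-column inputs used in the other cells would work equally well here.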

### Single-layer neural network with a customizable NN class.

In [75]:
import torch

dtype = torch.float

# Training data for NAND.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[1], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class SingleLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super(SingleLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.linear1(x)

model = SingleLayerNN(2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')  # Same as the deprecated size_average=False.

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(100):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.



In [76]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[-4.2283, -4.2277]])),
             ('linear1.bias', tensor([6.5341]))])

In [77]:
model(x).sigmoid()

tensor([[0.9985],
        [0.9094],
        [0.9094],
        [0.1276]], grad_fn=<SigmoidBackward>)
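
The model returns raw logits, and the loss applies the sigmoid internally. The sketch below illustrates what "binary cross-entropy after sigmoid" computes, using a commonly used numerically stable formulation; it is not PyTorch's actual implementation, and the function names are hypothetical. The logits are rebuilt from the rounded parameters printed above.

```python
import numpy as np

def bce_with_logits_sum(z, y):
    """Summed binary cross-entropy computed directly from logits z.

    Uses max(z, 0) - z * y + log(1 + exp(-|z|)), which avoids overflow
    for logits of large magnitude.
    """
    return np.sum(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive_sum(z, y):
    """Same quantity via an explicit sigmoid; can break down for extreme logits."""
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Rounded parameters copied from model.state_dict() above.
W = np.array([[-4.2283, -4.2277]])
b = np.array([6.5341])

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float64)
y = np.array([1, 1, 1, 0], dtype=np.float64)

z = (x @ W.T + b).ravel()   # logits, one per training example
stable = bce_with_logits_sum(z, y)
naive = bce_naive_sum(z, y)
print(stable, naive)
```

For moderate logits the two functions agree; the stable form matters once a logit becomes large enough for `exp` to overflow or for `log(1 - p)` to hit `log(0)`.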

### Chainer

In [0]:
from chainer import Variable, optimizers

x = Variable(np.array([[0,0],[0,1],[1,0],[1,1]],dtype=np.float32))
y = Variable(np.array([[1],[1],[1],[0]],dtype=np.int32))

class Linear(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2,1)
    def __call__(self,x):
        return self.l1(x)

model = Linear()

optimizer = optimizers.SGD(lr=0.5).setup(model)
for t in range(1000):
    y_pred = model(x)
    loss = F.sigmoid_cross_entropy(y_pred,y)
    #print(t,loss.data)
    model.cleargrads()
    loss.backward()
    optimizer.update()

In [79]:
F.sigmoid(model(x))

variable([[0.9998992 ],
          [0.96028996],
          [0.9602896 ],
          [0.0556649 ]])

### Multi-layer neural network with a customizable NN class.



In [80]:
import torch

dtype = torch.float

# Training data for XOR.
x = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=dtype)
y = torch.tensor([[0], [1], [1], [0]], dtype=dtype)
                                        
# Define a neural network model.
class ThreeLayerNN(torch.nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super(ThreeLayerNN, self).__init__()
        self.linear1 = torch.nn.Linear(d_in, d_hidden, bias=True)
        self.linear2 = torch.nn.Linear(d_hidden, d_out, bias=True)

    def forward(self, x):
        return self.linear2(self.linear1(x).sigmoid())

model = ThreeLayerNN(2, 2, 1)

# Binary cross-entropy loss after sigmoid function.
loss_fn = torch.nn.BCEWithLogitsLoss(reduction='sum')  # Same as the deprecated size_average=False.

# Optimizer based on SGD (change "SGD" to "Adam" to use Adam)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

for t in range(1000):
    y_pred = model(x)           # Make predictions.
    loss = loss_fn(y_pred, y)   # Compute the loss.
    #print(t, loss.item())
    
    optimizer.zero_grad()       # Zero-clear gradients.
    loss.backward()             # Compute the gradients.
    optimizer.step()            # Update the parameters using the gradients.



In [81]:
model.state_dict()

OrderedDict([('linear1.weight', tensor([[-6.7652,  6.8954],
                      [-6.3203,  6.0169]])),
             ('linear1.bias', tensor([ 3.4285, -3.1931])),
             ('linear2.weight', tensor([[-10.6559,  11.2825]])),
             ('linear2.bias', tensor([4.9741]))])

In [82]:
model(x).sigmoid()

tensor([[0.0074],
        [0.9931],
        [0.9901],
        [0.0063]], grad_fn=<SigmoidBackward>)

### Chainer

In [0]:
import chainer
import numpy as np
from chainer import functions as F
from chainer import links as L
from chainer import Variable, optimizers

x = Variable(np.array([[0,0],[0,1],[1,0],[1,1]],dtype=np.float32))
y = Variable(np.array([[0],[1],[1],[0]],dtype=np.int32))

class Linear(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(2,2)
            self.l2 = L.Linear(2,1)
      
    def __call__(self,x):
        h = F.sigmoid(self.l1(x))
        o = self.l2(h)
        return o
    
model = Linear()
optimizer = optimizers.SGD(lr=0.5).setup(model)
for t in range(1000):
    y_pred = model(x)
    loss = F.sigmoid_cross_entropy(y_pred,y)
    #print(t,loss.data)
    model.cleargrads()
    loss.backward()
    optimizer.update()

In [84]:
F.sigmoid(model(x))

variable([[0.5009184 ],
          [0.49899408],
          [0.5010106 ],
          [0.49909088]])
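
A quick way to judge whether a run has learned XOR is to threshold the predicted probabilities and compare them with the truth table. The sketch below applies a hypothetical `accuracy` helper, with an assumed threshold of 0.5, to the probabilities printed by the PyTorch `ThreeLayerNN` cell and by the Chainer cell above.

```python
import numpy as np

# XOR targets for the inputs (0,0), (0,1), (1,0), (1,1).
xor_targets = np.array([0, 1, 1, 0])

def accuracy(probs, targets, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the target."""
    preds = (np.asarray(probs).ravel() > threshold).astype(int)
    return float(np.mean(preds == targets))

# Probabilities copied from the outputs printed above.
pytorch_probs = [0.0074, 0.9931, 0.9901, 0.0063]
chainer_probs = [0.5009184, 0.49899408, 0.5010106, 0.49909088]

pytorch_acc = accuracy(pytorch_probs, xor_targets)
chainer_acc = accuracy(chainer_probs, xor_targets)
print(pytorch_acc, chainer_acc)
```

All four Chainer outputs sit near 0.5, so this run has not yet learned XOR; a different initialization, more iterations, or a larger learning rate may be needed.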