# Lab-08-2 Multi Layer Perceptron

- Review: XOR
- Multi Layer Perceptron
- Backpropagation
- Code: xor-nn
- Code: xor-nn-wide-deep

## Multilayer Perceptron

Input/Hidden/Output layer


- We need to use MLP, multilayer perceptrons (multilayer neural nets)

- No one on earth had found a viable way to train MLPs good enough to learn such simple functions. (This is why people came to believe that solving the XOR problem with a neural network was impossible.)

(By Marvin Minsky, founder of the MIT AI Lab, 1969)
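
To make the need for a hidden layer concrete, here is a minimal sketch of a two-layer network that computes XOR with hand-picked weights and a step activation. The weights, the `step` helper, and the OR/NAND/AND decomposition are illustrative assumptions, not taken from the lecture, and no training is involved.

```python
def step(z):
    # Threshold activation: outputs 1 when the weighted sum is positive.
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: two perceptrons with hand-picked (hypothetical) weights.
    h_or = step(1.0 * x1 + 1.0 * x2 - 0.5)     # OR(x1, x2)
    h_nand = step(-1.0 * x1 - 1.0 * x2 + 1.5)  # NAND(x1, x2)
    # Output layer: AND of the two hidden units gives XOR.
    return step(1.0 * h_or + 1.0 * h_nand - 1.5)

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [xor_mlp(x1, x2) for x1, x2 in inputs]
print(outputs)
```

A single perceptron cannot produce this output because XOR is not linearly separable; the hidden layer first maps the inputs to a space where one more linear threshold suffices.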

## Backpropagation

(X1, X2, ...) => ... => Y1

For the output $O$ and the ground-truth answer $G(t)$, $Loss = O - G(t)$

$\frac{\partial Loss}{\partial W}$

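**The backpropagation algorithm computes the derivative of the loss with respect to each weight in the neural network, then uses these gradients to update the weights, starting from the layers nearest the output, so that the loss is minimized.**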
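
As a small illustration of the chain rule behind backpropagation, the sketch below computes $\frac{\partial Loss}{\partial W}$ for a single sigmoid neuron and compares it with a finite-difference estimate. The squared-error loss, the scalar values, and all variable names are assumptions chosen for illustration; they are not the lecture's setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    # Squared error of a single sigmoid neuron (assumed loss for this sketch).
    return 0.5 * (sigmoid(w * x + b) - target) ** 2

w, b, x, target = 0.8, -0.3, 1.5, 1.0  # arbitrary illustrative values

# Chain rule: dLoss/dw = (dLoss/do) * (do/dz) * (dz/dw)
z = w * x + b
o = sigmoid(z)
d_o = o - target          # dLoss/do
d_z = d_o * o * (1 - o)   # times do/dz, the sigmoid derivative
d_w = d_z * x             # times dz/dw, which is x

# Central finite difference as an independent check.
eps = 1e-6
d_w_numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)

print(d_w, d_w_numeric)
```

The two printed values should agree to many decimal places, which is the same check one can apply to any hand-written backward pass.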

In [None]:
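import torch

X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = torch.FloatTensor([[0], [1], [1], [0]])

# To train the network we need to declare its layers. nn.Linear is the usual choice,
# but this time the weights and biases are declared by hand.
# nn layers
# Initialized from a standard normal (an assumed scheme); torch.Tensor(2, 2) alone
# would leave the values uninitialized.
w1 = torch.randn(2, 2)
b1 = torch.randn(2)
w2 = torch.randn(2, 1)
b2 = torch.randn(1)

learning_rate = 1  # assumed value, used by the weight updates in the training loop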

def sigmoid(x):
    # sigmoid function
    return 1.0 / (1.0 + torch.exp(-x))
    # return torch.div(torch.tensor(1), torch.add(torch.tensor(1.0), torch.exp(-x)))

def sigmoid_prime(x):
    # derivative of the sigmoid function
    return sigmoid(x) * (1 - sigmoid(x))
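
The identity used in `sigmoid_prime` can be checked numerically. This is a minimal sketch that re-implements both functions on plain Python floats with `math.exp` (an assumption made so the check runs without PyTorch) and compares the formula with a central finite difference at a few arbitrary points.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    # Same formula as the tensor version: s(x) * (1 - s(x)).
    return sigmoid(x) * (1 - sigmoid(x))

eps = 1e-6
max_error = 0.0
for x in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    max_error = max(max_error, abs(numeric - sigmoid_prime(x)))

print(sigmoid_prime(0.0))  # value of the derivative at x = 0
print(max_error)           # largest gap between formula and finite difference
```

The derivative is largest at `x = 0`, where it equals 0.25, and shrinks toward zero for large positive or negative inputs.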

In [None]:
for step in range(10001):
    
    # forward
    
    l1 = torch.add(torch.matmul(X, w1), b1)    # l1 = X x w1 + b1
    a1 = sigmoid(l1)
    l2 = torch.add(torch.matmul(a1, w2), b2)   # l2 = a1 x w2 + b2
    Y_pred = sigmoid(l2)                       # Y prediction
    
    cost = -torch.mean(Y * torch.log(Y_pred) + (1 - Y) * torch.log(1 - Y_pred))
        # binary cross entropy
    
    
    # Back prop (chain rule)
    
    # Loss derivative
    d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)    # dLoss/dy (y is the predicted value)
        # derivative of binary cross entropy; the 1e-7 term prevents division by zero
    
    # Layer 2
    d_l2 = d_Y_pred * sigmoid_prime(l2)                          # dLoss/dl2 = (dLoss/dy)(dy/dl2)
    d_b2 = d_l2                                                  # dLoss/db2
    d_w2 = torch.matmul(torch.transpose(a1, 0, 1), d_b2)         # dLoss/dw2, (2, 4) x (4, 1) = (2, 1)
    
    # Layer 1
    d_a1 = torch.matmul(d_b2, torch.transpose(w2, 0, 1))
    d_l1 = d_a1 * sigmoid_prime(l1)
    d_b1 = d_l1
    d_w1 = torch.matmul(torch.transpose(X, 0, 1), d_b1)
    
    
    # Weight update
    
    w1 = w1 - learning_rate * d_w1
    b1 = b1 - learning_rate * torch.mean(d_b1, 0)
    w2 = w2 - learning_rate * d_w2
    b2 = b2 - learning_rate * torch.mean(d_b2, 0)
    
    if step % 100 == 0:
        print(step, cost.item())
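
The two factors multiplied in the Layer 2 step largely cancel. As a worked derivation, assume the per-sample binary cross entropy $L = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$ with $\hat{y} = \sigma(l_2)$. The symbols $\hat{y}$, $y$, and $\sigma$ are notation introduced here for `Y_pred`, `Y`, and `sigmoid`.

$$
\begin{aligned}
\frac{\partial L}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \\
\frac{\partial \hat{y}}{\partial l_2} &= \sigma(l_2)\,(1 - \sigma(l_2)) = \hat{y}(1 - \hat{y}) \\
\frac{\partial L}{\partial l_2} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial l_2} = \hat{y} - y
\end{aligned}
$$

So `d_l2` is equal to `Y_pred - Y`, up to the small effect of the `1e-7` stabilizer in the denominator.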

## Code: xor-nn

In [2]:
import torch

X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = torch.FloatTensor([[0], [1], [1], [0]])
# nn layers
linear1 = torch.nn.Linear(2, 2, bias=True)
linear2 = torch.nn.Linear(2, 1, bias=True)
sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid)
# define cost/loss & optimizer
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)
    # cost/loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, cost.item())

0 0.7624428272247314
100 0.6916636824607849
200 0.6850571632385254
300 0.6277700662612915
400 0.5413362979888916
500 0.39542973041534424
600 0.12126626074314117
700 0.05854424089193344
800 0.03754853457212448
900 0.027400746941566467
1000 0.021485967561602592
1100 0.017632946372032166
1200 0.01493168156594038
1300 0.012936640530824661
1400 0.011404736898839474
1500 0.010192646645009518
1600 0.009210258722305298
1700 0.008398361504077911
1800 0.007716350723057985
1900 0.007135570980608463
2000 0.006635206285864115
2100 0.006199636030942202
2200 0.005817198660224676
2300 0.005478763021528721
2400 0.005177151411771774
2500 0.0049066971987485886
2600 0.004662837367504835
2700 0.004441898316144943
2800 0.0042407214641571045
2900 0.004056855104863644
3000 0.003888150444254279
3100 0.0037327734753489494
3200 0.003589252708479762
3300 0.0034562810324132442
3400 0.0033327334094792604
3500 0.0032176636159420013
3600 0.0031101868953555822
3700 0.003009598469361663
3800 0.0029152974020689726
3900 

In [4]:
# Accuracy computation
# True if hypothesis>0.5 else False
with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    accuracy = (predicted == Y).float().mean()
    print('\nHypothesis: ', hypothesis.detach().cpu().numpy(), '\nCorrect: ', predicted.detach().cpu().numpy(), '\nAccuracy: ', accuracy.item())


Hypothesis:  [[9.4024529e-04]
 [9.9870014e-01]
 [9.9912483e-01]
 [8.2351483e-04]] 
Correct:  [[0.]
 [1.]
 [1.]
 [0.]] 
Accuracy:  1.0
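
The thresholding step can be traced by hand on the printed values. A minimal sketch in plain Python, using the hypothesis values copied from the output above; the list-based variables stand in for the tensors and are an assumption of this example.

```python
# Hypothesis values copied from the printed output.
hypothesis = [9.4024529e-04, 9.9870014e-01, 9.9912483e-01, 8.2351483e-04]
labels = [0.0, 1.0, 1.0, 0.0]

# Threshold at 0.5, then compare with the labels element by element.
predicted = [1.0 if h > 0.5 else 0.0 for h in hypothesis]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

print(predicted, accuracy)
```

All four predictions match the labels, which is where the reported accuracy of 1.0 comes from.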


## Code: xor-nn-wide-deep

Let's stack the layers a little deeper. Adding two more gives four.

In [5]:
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = torch.FloatTensor([[0], [1], [1], [0]])

# nn layers
linear1 = torch.nn.Linear(2, 10, bias=True)
linear2 = torch.nn.Linear(10, 10, bias=True)
linear3 = torch.nn.Linear(10, 10, bias=True)
linear4 = torch.nn.Linear(10, 1, bias=True)
sigmoid = torch.nn.Sigmoid()

...

This gives a much lower loss value!
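
One way to see how much capacity the wider, deeper model adds is to count trainable parameters. A minimal sketch, assuming every layer is a `torch.nn.Linear` with bias and the sizes declared in the two code sections; `linear_params` and `count_params` are hypothetical helpers written for this note.

```python
def linear_params(n_in, n_out):
    # A Linear layer with bias holds n_in * n_out weights plus n_out biases.
    return n_in * n_out + n_out

def count_params(layer_sizes):
    # layer_sizes lists the (n_in, n_out) pair of each Linear layer in order.
    return sum(linear_params(n_in, n_out) for n_in, n_out in layer_sizes)

xor_nn = [(2, 2), (2, 1)]
xor_nn_wide_deep = [(2, 10), (10, 10), (10, 10), (10, 1)]

print(count_params(xor_nn), count_params(xor_nn_wide_deep))
```

The small network has 9 parameters and the wide, deep one has 261, so the second model has far more freedom to fit the four XOR points.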