## Building up a single-layer perceptron for binary classification

### Dataset Initialization
- The dataset `X` below has four samples (rows) and three features (columns).

- The third feature, which is always 1, acts as a constant bias input.

In [2]:
import numpy as np

X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])

### Defining Activation Function

In [3]:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

### Weight Initialization (Randomized)
- Normally, the weights are learned from data.
- For practice purposes, we start from random values in the range [-1, 1).

In [4]:
W = 2 * np.random.random((1,3)) -1
W

array([[-0.48076802, -0.24421082,  0.21005476]])

### Inference (Weight * x)

In [5]:
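for k in range(X.shape[0]):
    x = X[k,:].T          # k-th sample as a 1-D array
    v = np.matmul(W, x)   # weighted sum of the inputs
    y = sigmoid(v)        # activation (computed but not printed here)

    print(v)              # prints the weighted sum, not the sigmoid output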

[0.21005476]
[-0.03415606]
[-0.27071326]
[-0.51492408]
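
The same weighted sums can be computed for all samples in one matrix product instead of a loop. The following is a minimal sketch, assuming the weights printed above are copied into a fixed array; the names `V` and `Y` are introduced here for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

# Weights copied from the printed output above (fixed here for reproducibility).
W = np.array([[-0.48076802, -0.24421082, 0.21005476]])

V = X @ W.T      # weighted sums for all four samples at once, shape (4, 1)
Y = sigmoid(V)   # sigmoid outputs for all four samples, shape (4, 1)

print(V.ravel())
print(Y.ravel())
```

Each row of `V` corresponds to one iteration of the loop, so the vectorized form is mainly a convenience for inference; the per-sample loop is still useful when weights change between samples.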


### Training Weights (Supervised Learning)
- Update the weights using the error between the desired outputs (D) and the predictions (y).

In [6]:
# Desired outputs 'D'

D = np.array([
    [0], [0], [1], [1]
])

In [7]:
# Calculate output

def calc_output(W, x):
    v = np.matmul(W, x)
    y = sigmoid(v)

    return y

In [8]:
# Calculate error

def calc_error(d, y):
    e = d - y
    delta = y * (1-y) * e   # sigmoid derivative y*(1-y) times the error e

    return delta

The term δ ("delta") is the error gradient for a given neuron: the output error `e = d - y` scaled by the derivative of the sigmoid, `y * (1 - y)`.

It is the central quantity in the delta rule used here and in the backpropagation algorithm, which is used to train multi-layer neural networks.

δ measures how much a neuron's weighted input contributed to the overall error of the network.

This value is then used to adjust the neuron's weights so that the error shrinks in subsequent predictions.

- [Wiki: Delta rule](https://en.wikipedia.org/wiki/Delta_rule)
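
One way to see why δ has this form is to compare it with a numerical gradient. The sketch below assumes the loss for one sample is the squared error `E = 0.5 * (d - y)**2`, which the notebook does not state explicitly; the helper `squared_error` and the sample values are hypothetical. Under that assumption, `delta * x` should equal the negative gradient of `E` with respect to `W`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squared_error(W, x, d):
    # Assumed per-sample loss: E = 0.5 * (d - y)^2
    y = sigmoid(np.matmul(W, x))
    return 0.5 * float(((d - y) ** 2).sum())

rng = np.random.default_rng(0)
W = 2 * rng.random((1, 3)) - 1
x = np.array([1.0, 0.0, 1.0])   # illustrative sample
d = np.array([1.0])             # illustrative target

# Analytic direction used by the delta rule (before scaling by alpha)
y = sigmoid(np.matmul(W, x))
delta = y * (1 - y) * (d - y)
analytic = delta * x            # shape (3,)

# Numerical estimate of -dE/dW via central finite differences
eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[0, i] += eps
    W_minus[0, i] -= eps
    numeric[i] = -(squared_error(W_plus, x, d) - squared_error(W_minus, x, d)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-8))
```

If the two agree, adding `alpha * delta * x` to the weights is a gradient descent step on the assumed squared-error loss.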

### (Stochastic) Gradient Descent for single-layer perceptron

- The function `delta_GD` implements a form of stochastic gradient descent (SGD).

- In traditional (or batch) gradient descent, you would compute the gradient based on the average error over all data samples and then update the weights. 

- In contrast, SGD updates the weights after each individual data sample.

In [12]:
def delta_GD(W, X, D, alpha):
    for k in range(X.shape[0]):   # one weight update per sample (SGD)
        x = X[k, :].T
        d = D[k]

        y = calc_output(W, x)
        delta = calc_error(d, y)

        dW = alpha * delta * x   # delta rule: learning rate * delta * input
        W = W + dW   # adjusted weight
    
    return W
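
For contrast, a batch version would compute the deltas for all samples first and apply a single averaged update per pass. The function `delta_batch` below is a hypothetical counterpart to `delta_GD`, not part of the original notebook; it is a sketch that assumes the same data, sigmoid activation, and learning rate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_batch(W, X, D, alpha):
    # Hypothetical batch counterpart of delta_GD: one update per pass over the data.
    Y = sigmoid(X @ W.T)                     # outputs for all samples, shape (N, 1)
    delta = Y * (1 - Y) * (D - Y)            # per-sample deltas, shape (N, 1)
    dW = alpha * (delta * X).mean(axis=0)    # average of the per-sample updates, shape (3,)
    return W + dW                            # broadcasts back to shape (1, 3)

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
D = np.array([[0], [0], [1], [1]])

rng = np.random.default_rng(0)               # seeded so the run is repeatable
W = 2 * rng.random((1, 3)) - 1
for epoch in range(10000):
    W = delta_batch(W, X, D, alpha=0.9)

print(sigmoid(X @ W.T).ravel())
```

Because the update is averaged over four samples, each batch step is smaller than the sum of four SGD steps, so the batch version typically needs more passes to reach the same outputs.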

In [13]:
W = 2*np.random.random((1,3)) - 1   # initialize weights

In [14]:
alpha = 0.9
for epoch in range(10000):
    W = delta_GD(W, X, D, alpha)
    print(W)


[[ 0.0883468  -0.53985301 -0.05220225]]
[[ 0.33984914 -0.47420667  0.02370281]]
[[ 0.563017   -0.42838997  0.05764473]]
[[ 0.75563197 -0.40167143  0.05382018]]
[[ 0.92332421 -0.38854266  0.02367434]]
[[ 1.0723724  -0.38361643 -0.02242131]]
[[ 1.20746422 -0.38308685 -0.0772105 ]]
[[ 1.33174349 -0.38458957 -0.13606461]]
[[ 1.4472845  -0.38676255 -0.19615296]]
[[ 1.55548654 -0.38887556 -0.25579681]]
[[ 1.65733336 -0.39057573 -0.31403604]]
[[ 1.75355179 -0.3917255  -0.37035218]]
[[ 1.84470647 -0.39230403 -0.42449496]]
[[ 1.93125605 -0.3923492  -0.47637494]]
[[ 2.01358666 -0.3919245  -0.52599748]]
[[ 2.09203226 -0.391101   -0.57342228]]
[[ 2.16688722 -0.38994798 -0.61873875]]
[[ 2.2384144  -0.38852854 -0.66205103]]
[[ 2.30685067 -0.38689799 -0.7034689 ]]
[[ 2.37241067 -0.38510357 -0.74310241]]
[[ 2.43528975 -0.38318503 -0.78105871]]
[[ 2.49566623 -0.3811754  -0.8174403 ]]
[[ 2.55370321 -0.37910191 -0.85234411]]
[[ 2.60955017 -0.37698682 -0.88586116]]
[[ 2.6633443  -0.37484824 -0.91807646]]


In [15]:
N=4
for k in range(N):
    x = X[k,:].T
    v = np.matmul(W,x)   # Prediction using updated weights
    y = sigmoid(v)
    print(y)

[0.01020148]
[0.00829417]
[0.9932423]
[0.99168535]


### Error Backpropagation (XOR Data)

In [17]:
X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])

D = np.array([
    [0], [1], [1], [0]
])

W = 2*np.random.random((1,3)) - 1

In [18]:
alpha = 0.9
for epoch in range(10000):
    W = delta_GD(W, X, D, alpha)

In [19]:
N = 4
for k in range(N):
    x = X[k,:].T
    v = np.matmul(W, x)
    y = sigmoid(v)

    print(y)

[0.00760202]
[1.0742171e-06]
[0.98708455]
[0.01060387]


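Not so ideal: the second sample should produce an output close to 1, but the output is nearly 0.

Again, this goes back to the limitation of single-layer perceptrons: XOR is not linearly separable, and a single layer can only learn a linear decision boundary.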

To solve this, we need multi-layered perceptrons (trained with the backpropagation algorithm).
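
The linear-separability claim can be checked by brute force. The sketch below searches a coarse grid of weight vectors (the grid and the helper `count_separating_weights` are assumptions made for illustration) and counts how many classify every sample correctly when the output is thresholded at `w . x > 0`, which is equivalent to a sigmoid output above 0.5.

```python
import itertools
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
D_separable = np.array([0, 0, 1, 1])   # targets from the first training example
D_xor = np.array([0, 1, 1, 0])         # XOR targets

def count_separating_weights(X, D, grid):
    # Count grid weight vectors whose boundary w . x = 0 classifies all samples correctly.
    count = 0
    for w in itertools.product(grid, repeat=3):
        pred = (X @ np.array(w) > 0).astype(int)
        if np.array_equal(pred, D):
            count += 1
    return count

grid = np.linspace(-5, 5, 21)          # coarse, hypothetical search grid (step 0.5)

n_separable = count_separating_weights(X, D_separable, grid)
n_xor = count_separating_weights(X, D_xor, grid)
print(n_separable)   # greater than zero: many weight vectors work
print(n_xor)         # zero: no single linear boundary fits XOR
```

A grid search is only an illustration, but the XOR result holds for any weights: the two "1" samples require `w1 + b > 0` and `w2 + b > 0`, the sample `[0, 0]` requires `b <= 0`, and together these force `w1 + w2 + b > 0`, which misclassifies `[1, 1]`.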

Steps of backpropagation
- Forward Pass: Input is passed through the network to generate an output.

- Calculate Loss: Compute the difference between the predicted output and the actual target (using a loss function).

- Backward Pass: Use the chain rule to compute the gradients of the loss with respect to each weight in the network.

- Update Weights: Adjust the weights in the direction that reduces the error. This step typically uses gradient descent or one of its variants.

### Multi-Layered Perceptron with Backpropagation

In [20]:
def calc_output(W1, W2, x):
    v1 = np.matmul(W1, x)
    y1 = sigmoid(v1)
    v = np.matmul(W2, y1)
    y = sigmoid(v)

    return y, y1

In [21]:
def calc_delta(d, y):
    e = d - y
    delta = y*(1-y) * e

    return delta

In [22]:
def calc_delta1(W2, delta, y1):
    e1 = np.matmul(W2.T, delta)   # output delta propagated back through W2
    delta1 = y1 * (1-y1) * e1   # scaled by the sigmoid derivative of the hidden layer

    return delta1

In [23]:
# back propagation

def backprop_XOR(W1, W2, X, D, alpha):
    for k in range(X.shape[0]):
        x = X[k,:].T
        d = D[k]

        y, y1 = calc_output(W1, W2, x)   # output of two layers
        delta = calc_delta(d, y)   # delta for output layer
        delta1 = calc_delta1(W2, delta, y1)   # delta for hidden layer

        # updated weights
        dW1 = (alpha*delta1).reshape(4,1) * x.reshape(1,3)   # outer product, shape (4, 3)
        W1 = W1 + dW1

        dW2 = alpha * delta * y1   # shape (4,), broadcasts against W2 of shape (1, 4)
        W2 = W2 + dW2

    return W1, W2
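
The update directions used in `backprop_XOR` can be sanity-checked against numerical gradients. This is a sketch under the assumption that the per-sample loss is the squared error `E = 0.5 * (d - y)**2`; the helpers `forward`, `loss`, and `numeric_neg_grad` are hypothetical names introduced here, and the sample and seed are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(W1, W2, x):
    y1 = sigmoid(np.matmul(W1, x))    # hidden activations, shape (4,)
    y = sigmoid(np.matmul(W2, y1))    # network output, shape (1,)
    return y, y1

def loss(W1, W2, x, d):
    # Assumed per-sample loss: E = 0.5 * (d - y)^2
    y, _ = forward(W1, W2, x)
    return 0.5 * float(((d - y) ** 2).sum())

rng = np.random.default_rng(0)
W1 = 2 * rng.random((4, 3)) - 1
W2 = 2 * rng.random((1, 4)) - 1
x = np.array([0.0, 1.0, 1.0])
d = np.array([1.0])

# Analytic update directions, as in backprop_XOR but without the alpha factor
y, y1 = forward(W1, W2, x)
delta = y * (1 - y) * (d - y)
delta1 = y1 * (1 - y1) * np.matmul(W2.T, delta)
dW1 = delta1.reshape(4, 1) * x.reshape(1, 3)
dW2 = (delta * y1).reshape(1, 4)

def numeric_neg_grad(W, which, eps=1e-6):
    # Central finite differences of -dE/dW, one entry at a time
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        if which == "W1":
            g[idx] = -(loss(Wp, W2, x, d) - loss(Wm, W2, x, d)) / (2 * eps)
        else:
            g[idx] = -(loss(W1, Wp, x, d) - loss(W1, Wm, x, d)) / (2 * eps)
    return g

ok_W1 = np.allclose(dW1, numeric_neg_grad(W1, "W1"), atol=1e-8)
ok_W2 = np.allclose(dW2, numeric_neg_grad(W2, "W2"), atol=1e-8)
print(ok_W1, ok_W2)
```

Agreement between the analytic and numerical values indicates that the hidden-layer delta applies the chain rule correctly: the output delta is passed back through `W2` and then scaled by the hidden sigmoid derivative.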



In [24]:
X = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1]
])

D = np.array([
    [0], [1], [1], [0]
])

W1 = 2*np.random.random((4,3)) - 1
W2 = 2*np.random.random((1,4)) - 1

In [25]:
alpha = 0.9
for epoch in range(10000):
    W1, W2 = backprop_XOR(W1, W2, X, D, alpha)


In [27]:
N = 4
for k in range(N):
    x = X[k,:].T
    v1 = np.matmul(W1, x)
    y1 = sigmoid(v1)
    v = np.matmul(W2, y1)
    y = sigmoid(v)

    print(y)

[0.00769078]
[0.99018689]
[0.99021231]
[0.01432481]


That's better: all four outputs are now close to their XOR targets (0, 1, 1, 0).