## Exercise 2, Neural networks 'by hand'
The exercise has two parts. The first is to fill in the 'missing' code so that a neural network solves the XOR problem.

In the second part, copy your working code into a new cell and scale the initialized weights down to one tenth of their original values. What happens? Do you think the problem can be fixed in some way? How? Document your thoughts in a short text, either in the notebook (preferred) or separately.

### Part 1

In [16]:
import numpy as np


# sigmoid activation
def sigmoid(z):
    z = np.clip(z, -500, 500)  # Clip values to prevent overflow
    return 1 / (1 + np.exp(-z))


# derivative of sigmoid
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


# cross-entropy loss function
def cross_entropy_loss(out_nn, target):
    epsilon = 1e-8
    out_nn = np.clip(out_nn, epsilon, 1 - epsilon) # Clip values to avoid log(0)
    return -np.mean(target * np.log(out_nn) + (1 - target) * np.log(1 - out_nn))


# Initialize weights and biases
np.random.seed(1)
W1 = np.random.randn(2, 2)
b1 = np.random.randn(2, 1)
W2 = np.random.randn(1, 2)
b2 = np.random.randn(1, 1)

# xor data
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
target = np.array([[0, 1, 1, 0]])


# forward
def forward(X):
    Z1 = W1 @ X + b1
    Q1 = sigmoid(Z1)

    Z2 = W2 @ Q1 + b2
    out = sigmoid(Z2)

    return out, Q1


# backpropagation
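def backprop(X, target, out_nn, Q1):
    # output-layer error for a sigmoid output with cross-entropy loss
    delta2 = out_nn - target
    # NOTE: Q1 is already sigmoid(Z1), so the exact hidden-layer derivative is
    # Q1 * (1 - Q1); sigmoid_derivative(Q1) applies the sigmoid a second time.
    # The recorded outputs below were produced with this line as written.
    delta1 = (W2.T @ delta2) * sigmoid_derivative(Q1)
    # weight gradients are summed over the samples
    dW2 = delta2 @ Q1.T
    dW1 = delta1 @ X.T
    # bias gradients are averaged over the samples
    db2 = np.mean(delta2, axis=1, keepdims=True)
    db1 = np.mean(delta1, axis=1, keepdims=True)
    return dW2, dW1, db2, db1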


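# updating the weights
# The -= operators modify the NumPy arrays in place, so forward() and backprop(),
# which read the global W1, b1, W2 and b2, see the updated values.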
def update(W2, W1, b2, b1, dW2, dW1, db2, db1, alpha):
    W2 -= alpha * dW2
    W1 -= alpha * dW1
    b2 -= alpha * db2
    b1 -= alpha * db1
    return W2, W1, b2, b1


def xor_nn_train_eval(X, target, W1, W2, b1, b2, alpha=0.1, epochs=100001):
    # print the prediction of the XOR BEFORE training here
    prediction = forward(X)
    loss = cross_entropy_loss(prediction[0], target)

    print("=== Prediction before training ===")
    for i, value in enumerate(X.T):
        print(f"Input: [{value[0]}, {value[1]}], Target: {target[0][i]}, Prediction: {prediction[0][0][i]:.3f}")
    print(f"Cross Entropy Loss: {loss:.3f}\n")

    print("=== Start of Training ===")
    for epoch in range(epochs):
        out, Q1 = forward(X)
        loss = cross_entropy_loss(out, target)
        dW2, dW1, db2, db1 = backprop(X, target, out, Q1)
        W2, W1, b2, b1 = update(W2, W1, b2, b1, dW2, dW1, db2, db1, alpha)
        if epoch % 10000 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.3f}")

    print("=== End of Training ===\n")

    # print the prediction of the XOR AFTER training here
    prediction = forward(X)
    loss = cross_entropy_loss(prediction[0], target)

    print("=== Prediction after training ===")
    for i, value in enumerate(X.T):
        print(f"Input: [{value[0]}, {value[1]}], Target: {target[0][i]}, Prediction: {prediction[0][0][i]:.3f}")
    print(f"Cross Entropy Loss: {loss:.3f}\n")

alpha = 0.1
epochs = 100001
xor_nn_train_eval(X, target, W1, W2, b1, b2, alpha, epochs)

=== Prediction before training ===
Input: [0, 0], Target: 0, Prediction: 0.814
Input: [0, 1], Target: 1, Prediction: 0.782
Input: [1, 0], Target: 1, Prediction: 0.869
Input: [1, 1], Target: 0, Prediction: 0.860
Cross Entropy Loss: 1.010

=== Start of Training ===
Epoch 0, Loss: 1.010
Epoch 10000, Loss: 0.071
Epoch 20000, Loss: 0.028
Epoch 30000, Loss: 0.017
Epoch 40000, Loss: 0.012
Epoch 50000, Loss: 0.010
Epoch 60000, Loss: 0.008
Epoch 70000, Loss: 0.007
Epoch 80000, Loss: 0.006
Epoch 90000, Loss: 0.005
Epoch 100000, Loss: 0.005
=== End of Training ===

=== Prediction after training ===
Input: [0, 0], Target: 0, Prediction: 0.004
Input: [0, 1], Target: 1, Prediction: 0.995
Input: [1, 0], Target: 1, Prediction: 0.996
Input: [1, 1], Target: 0, Prediction: 0.004
Cross Entropy Loss: 0.005
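
One way to sanity-check a hand-written `backprop` is to compare it with a finite-difference estimate of the gradient. The sketch below is not part of the exercise code: `loss_sum`, `analytic_dW1` and `numeric_dW1` are hypothetical helpers, and the loss is summed over the four samples so that it matches the summed `dW1`. It compares two choices for the hidden-layer derivative: `Q1 * (1 - Q1)`, and the sigmoid derivative evaluated at the already-activated `Q1`, as in `backprop` above.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def loss_sum(W1, b1, W2, b2, X, target):
    # cross-entropy summed (not averaged) over the samples
    Q1 = sigmoid(W1 @ X + b1)
    out = sigmoid(W2 @ Q1 + b2)
    return -np.sum(target * np.log(out) + (1 - target) * np.log(1 - out))


def analytic_dW1(W1, b1, W2, b2, X, target, exact=True):
    # hypothetical helper: backprop for W1 with a selectable hidden derivative
    Q1 = sigmoid(W1 @ X + b1)
    out = sigmoid(W2 @ Q1 + b2)
    delta2 = out - target
    if exact:
        dQ = Q1 * (1 - Q1)      # derivative expressed through the activation
    else:
        s = sigmoid(Q1)         # sigmoid applied a second time
        dQ = s * (1 - s)
    delta1 = (W2.T @ delta2) * dQ
    return delta1 @ X.T


def numeric_dW1(W1, b1, W2, b2, X, target, h=1e-5):
    # central finite differences, one entry of W1 at a time
    grad = np.zeros_like(W1)
    for idx in np.ndindex(W1.shape):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (loss_sum(W_plus, b1, W2, b2, X, target)
                     - loss_sum(W_minus, b1, W2, b2, X, target)) / (2 * h)
    return grad


rng = np.random.RandomState(1)  # assumption: same seed as the cell above
W1, b1 = rng.randn(2, 2), rng.randn(2, 1)
W2, b2 = rng.randn(1, 2), rng.randn(1, 1)
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
target = np.array([[0, 1, 1, 0]])

numeric = numeric_dW1(W1, b1, W2, b2, X, target)
exact = analytic_dW1(W1, b1, W2, b2, X, target, exact=True)
double_sigmoid = analytic_dW1(W1, b1, W2, b2, X, target, exact=False)

print("max |exact - numeric|:         ", np.abs(exact - numeric).max())
print("max |double sigmoid - numeric|:", np.abs(double_sigmoid - numeric).max())
```

Only the first variant should agree with the numerical gradient to within rounding error. The second still trained successfully in the run above, but it is not the true gradient of the loss.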



### Part 2

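When the initial weights and biases are reduced to one tenth of their original values, we can observe the following changes:
1) The predictions before training have a lower cross-entropy loss (0.694 instead of 1.010).
2) The weights and biases, and as a result the cross-entropy loss, change only marginally between epochs. The loss stays near 0.693 for the first 40,000 epochs and only reaches 0.602 after 100,000 epochs, with the inputs [1, 0] and [1, 1] still predicted at about 0.5.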

First, the lower initial loss should not be read as a general benefit of small weights; it is specific to this setup. In the XOR problem the inputs are binary, so with smaller initial weights and biases the pre-activations are close to zero for all four samples. The sigmoid outputs are therefore close to 0.5, which gives a cross-entropy loss near ln 2 ≈ 0.693, lower than the 1.010 produced by the original initialization. In other binary classification problems the inputs might not be binary and the network might require different transformations, so the same weight scaling would not necessarily yield similar initial predictions or loss values.
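
The value of about 0.693 can be checked directly: an output of exactly 0.5 for every sample gives a cross-entropy of ln 2, whatever the targets are. A minimal sketch, re-declaring the loss function so that the snippet runs on its own (the name `constant_half` is only illustrative):

```python
import numpy as np


def cross_entropy_loss(out_nn, target):
    epsilon = 1e-8
    out_nn = np.clip(out_nn, epsilon, 1 - epsilon)  # avoid log(0)
    return -np.mean(target * np.log(out_nn) + (1 - target) * np.log(1 - out_nn))


target = np.array([[0, 1, 1, 0]])
# hypothetical "undecided" network that outputs 0.5 for every input
constant_half = np.full(target.shape, 0.5)

loss_half = cross_entropy_loss(constant_half, target)
print(f"{loss_half:.3f}")  # 0.693
```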

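Second, the slower convergence comes from the small initial weights themselves. The pre-activations in the hidden layer are close to zero, so both hidden units output roughly 0.5 for every input and the network output is almost constant. The error that is backpropagated to the hidden layer is multiplied by W2, which is now ten times smaller, so the hidden-layer gradients are small even though the sigmoid is at its steepest around a pre-activation of zero (output 0.5). Each weight update therefore remains small and changes the network's behavior only minimally from one epoch to the next, which is why the loss stays near 0.693 for tens of thousands of epochs.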
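
To make the difference in gradient size concrete, the following sketch compares the norm of the hidden-layer weight gradient at initialization for the two scalings. It is an illustration under assumptions rather than the notebook's code: `initial_dW1_norm` is a hypothetical helper, and it uses the exact sigmoid derivative `Q1 * (1 - Q1)`.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
target = np.array([[0, 1, 1, 0]])


def initial_dW1_norm(scale):
    """Norm of dW1 before any training step (hypothetical helper)."""
    rng = np.random.RandomState(1)  # assumption: same seed as the notebook cells
    W1 = rng.randn(2, 2) * scale
    b1 = rng.randn(2, 1) * scale
    W2 = rng.randn(1, 2) * scale
    b2 = rng.randn(1, 1) * scale

    Q1 = sigmoid(W1 @ X + b1)
    out = sigmoid(W2 @ Q1 + b2)
    delta2 = out - target
    delta1 = (W2.T @ delta2) * Q1 * (1 - Q1)  # exact sigmoid derivative
    return np.linalg.norm(delta1 @ X.T)


norm_full = initial_dW1_norm(1.0)
norm_small = initial_dW1_norm(0.1)
print(f"||dW1|| with original weights: {norm_full:.4f}")
print(f"||dW1|| with weights / 10:     {norm_small:.4f}")
```

The second norm should come out much smaller than the first. This is consistent with the long plateau in the training log below.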

In [17]:
np.random.seed(1)
W1 = np.random.randn(2, 2) / 10
b1 = np.random.randn(2, 1) / 10
W2 = np.random.randn(1, 2) / 10
b2 = np.random.randn(1, 1) / 10

alpha = 0.1
epochs = 100001
xor_nn_train_eval(X, target, W1, W2, b1, b2, alpha, epochs)

=== Prediction before training ===
Input: [0, 0], Target: 0, Prediction: 0.522
Input: [0, 1], Target: 1, Prediction: 0.522
Input: [1, 0], Target: 1, Prediction: 0.524
Input: [1, 1], Target: 0, Prediction: 0.524
Cross Entropy Loss: 0.694

=== Start of Training ===
Epoch 0, Loss: 0.694
Epoch 10000, Loss: 0.693
Epoch 20000, Loss: 0.693
Epoch 30000, Loss: 0.693
Epoch 40000, Loss: 0.692
Epoch 50000, Loss: 0.646
Epoch 60000, Loss: 0.619
Epoch 70000, Loss: 0.610
Epoch 80000, Loss: 0.606
Epoch 90000, Loss: 0.604
Epoch 100000, Loss: 0.602
=== End of Training ===

=== Prediction after training ===
Input: [0, 0], Target: 0, Prediction: 0.397
Input: [0, 1], Target: 1, Prediction: 0.603
Input: [1, 0], Target: 1, Prediction: 0.497
Input: [1, 1], Target: 0, Prediction: 0.503
Cross Entropy Loss: 0.602

