
# **3.7.1 Mathematical Formulation**

The mathematical formulation of a neural network describes how the values of the nodes in one layer are computed from the preceding layer using weights and biases. For a given layer $l$, the input to the $j'$th node is calculated as:

$$
z^{(l)}_{j'} = \sum_{j=1}^{J_{l-1}} w^{(l)}_{j,j'} a^{(l-1)}_j + b^{(l)}_{j'}
$$

where $w^{(l)}_{j,j'}$ is the weight connecting node $j$ of layer $l-1$ to node $j'$ of layer $l$, $a^{(l-1)}_j$ are the activations from the $(l-1)$th layer, $J_{l-1}$ is the number of nodes in that layer, and $b^{(l)}_{j'}$ is the bias. After applying an activation function $\sigma$, the output of the $j'$th node becomes:

$$
a^{(l)}_{j'} = \sigma(z^{(l)}_{j'})
$$

In matrix form, the relationship for all nodes in layer $l$ is given by:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}
$$

and the activations are calculated as:

$$
a^{(l)} = \sigma(z^{(l)}) = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})
$$

Here, $W^{(l)}$ is the weight matrix whose entry in row $j'$ and column $j$ is $w^{(l)}_{j,j'}$ (one row per node in layer $l$, one column per node in layer $l-1$), $b^{(l)}$ is the bias vector, and $\sigma$ is the activation function applied element-wise to $z^{(l)}$.
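
As a quick check of the index convention, the sketch below computes $z^{(l)}$ twice: once node by node, following the summation formula, and once with the matrix product. The layer sizes and numeric values are illustrative assumptions, and the names `z_loop` and `z_matrix` are hypothetical.

```python
import numpy as np

# Illustrative values (assumed): 3 nodes in layer l-1, 2 nodes in layer l
a_prev = np.array([0.5, 0.2, 0.1])
W = np.array([[0.2, 0.8, -0.5],
              [0.7, -0.1, 0.9]])  # W[j_out, j_in] holds w_{j, j'}
b = np.array([0.1, -0.3])

# Node-by-node form: z_{j'} = sum_j w_{j,j'} * a_j + b_{j'}
n_out, n_in = W.shape
z_loop = np.zeros(n_out)
for j_out in range(n_out):
    for j_in in range(n_in):
        z_loop[j_out] += W[j_out, j_in] * a_prev[j_in]
    z_loop[j_out] += b[j_out]

# Matrix form: z = W a + b
z_matrix = W @ a_prev + b

print("Node-by-node:", z_loop)
print("Matrix form: ", z_matrix)
print("Agree:", np.allclose(z_loop, z_matrix))
```

Both forms give the same vector, which is why the matrix form is normally preferred in code: it is shorter and lets NumPy do the summation.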




In [1]:
import numpy as np

# Define input values (layer l-1)
a_prev = np.array([0.5, 0.2, 0.1])  # Example inputs to the network

# Define weights and biases for layer l
W = np.array([[0.2, 0.8, -0.5],
              [0.7, -0.1, 0.9]])  # Weight matrix (2 nodes in layer l)
b = np.array([0.1, -0.3])         # Bias vector (2 nodes in layer l)

# Define the activation function (sigmoid in this example)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Compute z (linear combination of weights and inputs)
z = np.dot(W, a_prev) + b  # z^(l) = W^(l) * a^(l-1) + b^(l)

# Compute activations for layer l
a = sigmoid(z)  # a^(l) = σ(z^(l))

# Output results
print("Input values (a^(l-1)):", a_prev)
print("Weights (W^(l)):\n", W)
print("Biases (b^(l)):", b)
print("Linear combination (z^(l)):", z)
print("Activations (a^(l)):", a)


Input values (a^(l-1)): [0.5 0.2 0.1]
Weights (W^(l)):
 [[ 0.2  0.8 -0.5]
 [ 0.7 -0.1  0.9]]
Biases (b^(l)): [ 0.1 -0.3]
Linear combination (z^(l)): [0.31 0.12]
Activations (a^(l)): [0.57688526 0.52996405]


# **3.7.2 Activation Functions**

An activation function determines the output of a node from its input (the weighted sum $z$), introducing the nonlinearity that lets a network handle tasks such as classification. Denoted $\sigma$, it is applied element-wise, with the same function used for every node in a layer:

$$
a^{(l)} = \sigma(z^{(l)}) = \sigma(W^{(l)} a^{(l-1)} + b^{(l)})
$$

##### **3.7.2.1 Step function**
The step function is defined as:

$$
\sigma(x) =
\begin{cases}
0, & x < 0 \\
1, & x \geq 0
\end{cases}
$$

Also called the Heaviside step function or the unit step function, it often represents a signal that switches on at a specified time and stays switched on indefinitely. It is commonly used for classification problems.

##### **3.7.2.2 ReLU function**
The ReLU (Rectified Linear Unit) function is defined as:

$$
\sigma(x) = \max(0, x)
$$

ReLU is one of the most widely used activation functions. It passes positive inputs through unchanged and sets negative inputs to zero, which tends to make training deep architectures on large, complex datasets faster and more effective than with sigmoid or similar functions.

##### **3.7.2.3 Sigmoid**
The sigmoid (or logistic) function is defined as:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

The logistic function is used in various fields, including biomathematics, and can be applied in the output layer of neural networks for predicting probabilities.

##### **3.7.2.4 Softmax function**
The softmax function converts a vector of $K$ numbers into a vector of $K$ probabilities that sum to 1, where each probability is proportional to the exponential of its corresponding value. It is defined as:

$$
\sigma(z)_k = \frac{e^{z_k}}{\sum_{i=1}^{K} e^{z_i}}
$$

Softmax is commonly used in the output layer of neural networks, particularly for multi-class classification problems.
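
Two properties of softmax are worth seeing numerically: the outputs sum to 1, and adding the same constant to every input leaves the result unchanged. The second property is what makes it safe to subtract the maximum before exponentiating. The following is a minimal sketch; the input values are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) does not change the result but avoids overflow in exp
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 3.0])  # arbitrary example scores
p = softmax(z)

print("Probabilities:", p)
print("Sum:", p.sum())

# Shifting every score by the same constant gives the same probabilities
p_shifted = softmax(z + 100.0)
print("Shift-invariant:", np.allclose(p, p_shifted))

# Large scores would overflow a naive exp(z); the shifted version stays finite
p_large = softmax(np.array([1000.0, 1001.0, 1002.0]))
print("Finite for large inputs:", np.all(np.isfinite(p_large)))
```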


In [2]:
import numpy as np

# Step Function
def step_function(x):
    return np.where(x >= 0, 1, 0)

# ReLU Function
def relu(x):
    return np.maximum(0, x)

# Sigmoid Function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Softmax Function
def softmax(x):
    exp_x = np.exp(x - np.max(x))  # For numerical stability
    return exp_x / np.sum(exp_x)

# Example Inputs
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Input:", x)

# Step Function Output
step_output = step_function(x)
print("\nStep Function Output:", step_output)

# ReLU Function Output
relu_output = relu(x)
print("\nReLU Function Output:", relu_output)

# Sigmoid Function Output
sigmoid_output = sigmoid(x)
print("\nSigmoid Function Output:", sigmoid_output)

# Softmax Function Output
softmax_output = softmax(x)
print("\nSoftmax Function Output:", softmax_output)


Input: [-2. -1.  0.  1.  2.]

Step Function Output: [0 0 1 1 1]

ReLU Function Output: [0. 0. 0. 1. 2.]

Sigmoid Function Output: [0.11920292 0.26894142 0.5        0.73105858 0.88079708]

Softmax Function Output: [0.01165623 0.03168492 0.08612854 0.23412166 0.63640865]


# **3.7.3 Cost Function**

The cost function in neural networks measures the difference between the predicted outputs ($\hat{y}$) and the actual outputs ($y$) from the training data. For regression tasks, the least squares cost function is commonly used and is defined as:

$$
J = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \hat{y}^{(n)}_k - y^{(n)}_k \right)^2
$$

where $N$ is the number of training examples, and $K$ is the number of output nodes. For classification problems, the cost function is often based on logistic regression. For binary classification ($y^{(n)} \in \{0, 1\}$), the cost function is:

$$
J = -\sum_{n=1}^{N} \left( y^{(n)} \ln \hat{y}^{(n)} + (1 - y^{(n)}) \ln (1 - \hat{y}^{(n)}) \right)
$$

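This is also known as the cross-entropy loss. It is widely used in classification tasks because it penalizes confident but wrong probability estimates heavily: the loss grows without bound as the predicted probability of the true class approaches $0$.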
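
The two formulas above sum over the training examples rather than averaging. A minimal sketch of both costs in that summed form follows; the function names and the sample arrays are assumptions made for illustration, and the small `eps` clamp is an added safeguard against $\ln 0$ that is not part of the formulas.

```python
import numpy as np

def least_squares_cost(y_hat, y):
    # J = 1/2 * sum over all examples and outputs of (y_hat - y)^2
    return 0.5 * np.sum((y_hat - y) ** 2)

def binary_cross_entropy_cost(y_hat, y, eps=1e-15):
    # J = -sum over examples of [y ln(y_hat) + (1 - y) ln(1 - y_hat)]
    y_hat = np.clip(y_hat, eps, 1 - eps)  # assumed safeguard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Illustrative data (assumed)
y_reg = np.array([1.5, 2.0, 3.0])
y_hat_reg = np.array([1.4, 2.1, 2.9])
y_cls = np.array([1, 0, 1])
y_hat_cls = np.array([0.9, 0.2, 0.8])

J_ls = least_squares_cost(y_hat_reg, y_reg)
J_ce = binary_cross_entropy_cost(y_hat_cls, y_cls)

# The mean squared error is the same quantity up to a constant factor
mse = np.mean((y_hat_reg - y_reg) ** 2)

print("Least squares cost J:", J_ls)
print("Cross-entropy cost J:", J_ce)
print("J equals (n/2) * MSE:", np.isclose(J_ls, 0.5 * y_reg.size * mse))
```

Because the summed and averaged versions differ only by a constant factor, they share the same minimizer; the factor only rescales the gradient and therefore the effective learning rate.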


In [3]:
import numpy as np

# Example outputs and predictions
y_actual_regression = np.array([1.5, 2.0, 3.0])  # Actual values for regression
y_predicted_regression = np.array([1.4, 2.1, 2.9])  # Predicted values for regression

y_actual_classification = np.array([1, 0, 1])  # Actual values for classification (binary)
y_predicted_classification = np.array([0.9, 0.2, 0.8])  # Predicted probabilities for classification

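# Mean Squared Error (MSE) cost function.
# Note: this averages the squared errors, so it equals the least squares cost J
# defined above multiplied by the constant 2 / (number of elements).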
def mean_squared_error(y_actual, y_predicted):
    return np.mean((y_actual - y_predicted) ** 2)

# Cross-entropy loss for binary classification.
# Note: this averages over the examples, i.e. it is the cost J defined above divided by N.
def cross_entropy_loss(y_actual, y_predicted):
    epsilon = 1e-15  # To prevent log(0)
    y_predicted = np.clip(y_predicted, epsilon, 1 - epsilon)  # Clamp predicted values
    return -np.mean(y_actual * np.log(y_predicted) + (1 - y_actual) * np.log(1 - y_predicted))

# Calculate the losses
mse_loss = mean_squared_error(y_actual_regression, y_predicted_regression)
cross_entropy_loss_value = cross_entropy_loss(y_actual_classification, y_predicted_classification)

# Print the results
print("Mean Squared Error (MSE) Loss:", mse_loss)
print("Cross-Entropy Loss:", cross_entropy_loss_value)


Mean Squared Error (MSE) Loss: 0.010000000000000018
Cross-Entropy Loss: 0.18388253942874858


# **3.7.4 Backpropagation**

Backpropagation is the core mechanism for training neural networks by fine-tuning the weights ($W$) and biases ($b$) to minimize the cost function $J$. It calculates the gradient of the cost function with respect to each parameter using the chain rule. For a layer $l$, backpropagation introduces $\delta^{(l)}_{j'} = \frac{\partial J}{\partial z^{(l)}_{j'}}$, which propagates the error backward through the network. The weight gradients are computed as:

$$
\frac{\partial J}{\partial w^{(l)}_{j,j'}} = \delta^{(l)}_{j'} a^{(l-1)}_j
$$

and the bias gradients as:

$$
\frac{\partial J}{\partial b^{(l)}_{j'}} = \delta^{(l)}_{j'}.
$$

The error term $\delta$ depends on the activation function used. For ReLU, the derivative is either $0$ or $1$, and for the logistic function, it is $\sigma(z)(1 - \sigma(z))$. Backpropagation allows gradients to propagate backward from the output layer to earlier layers, enabling efficient optimization via gradient descent.
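
One way to build confidence in the gradient formulas is to compare them with finite differences. The sketch below does this for a single sigmoid layer with the least squares cost on one example. The layer sizes, random values, and helper names (`cost`, `relu_prime`, `sigmoid_prime`) are assumptions for illustration, and setting the ReLU derivative to $0$ at $x = 0$ is a convention chosen here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(x):
    # Derivative is 0 or 1; the value at x = 0 is set to 0 by convention (assumed)
    return np.where(x > 0, 1.0, 0.0)

def cost(W, b, a_prev, y):
    # Least squares cost for one example: J = 1/2 * sum((a - y)^2)
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

rng = np.random.default_rng(0)
a_prev = rng.normal(size=3)   # activations of layer l-1
W = rng.normal(size=(2, 3))   # W[j', j] holds w_{j, j'}
b = rng.normal(size=2)
y = np.array([0.0, 1.0])

# Analytic gradients: delta = dJ/dz, dJ/dW = outer(delta, a_prev), dJ/db = delta
z = W @ a_prev + b
delta = (sigmoid(z) - y) * sigmoid_prime(z)
grad_W = np.outer(delta, a_prev)
grad_b = delta

# Numerical gradients by central finite differences
eps = 1e-6
num_grad_W = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += eps
    W_minus[idx] -= eps
    num_grad_W[idx] = (cost(W_plus, b, a_prev, y) - cost(W_minus, b, a_prev, y)) / (2 * eps)

num_grad_b = np.zeros_like(b)
for k in range(b.size):
    b_plus, b_minus = b.copy(), b.copy()
    b_plus[k] += eps
    b_minus[k] -= eps
    num_grad_b[k] = (cost(W, b_plus, a_prev, y) - cost(W, b_minus, a_prev, y)) / (2 * eps)

print("Weight gradients agree:", np.allclose(grad_W, num_grad_W, atol=1e-7))
print("Bias gradients agree:  ", np.allclose(grad_b, num_grad_b, atol=1e-7))
```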


In [4]:
import numpy as np

# Define the sigmoid activation function and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Input data (X) and corresponding labels (y)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # Input: XOR problem
y = np.array([[0], [1], [1], [0]])  # Output: XOR labels

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.rand(2, 2)  # Weights for input to hidden layer
b1 = np.random.rand(1, 2)  # Biases for hidden layer
W2 = np.random.rand(2, 1)  # Weights for hidden to output layer
b2 = np.random.rand(1, 1)  # Biases for output layer

# Learning rate
learning_rate = 0.1

# Train the network
epochs = 10000
for epoch in range(epochs):
    # Feedforward
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)  # Activation of hidden layer

    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)  # Output layer predictions

    # Compute the loss (mean squared error)
    loss = np.mean((y - a2) ** 2)

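    # Backpropagation
    # d_a2 is the gradient of J = 0.5 * sum((a2 - y)**2); the MSE printed below
    # differs from J only by a constant factor.
    d_a2 = a2 - y  # Derivative of J w.r.t. output activations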
    d_z2 = d_a2 * sigmoid_derivative(z2)  # Derivative of loss w.r.t. z2
    d_W2 = np.dot(a1.T, d_z2)  # Derivative of loss w.r.t. W2
    d_b2 = np.sum(d_z2, axis=0, keepdims=True)  # Derivative of loss w.r.t. b2

    d_a1 = np.dot(d_z2, W2.T)  # Backpropagating error to hidden layer
    d_z1 = d_a1 * sigmoid_derivative(z1)  # Derivative of loss w.r.t. z1
    d_W1 = np.dot(X.T, d_z1)  # Derivative of loss w.r.t. W1
    d_b1 = np.sum(d_z1, axis=0, keepdims=True)  # Derivative of loss w.r.t. b1

    # Update weights and biases
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1

    # Print loss every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Final predictions
print("\nFinal Predictions:")
print(a2)


Epoch 0, Loss: 0.3247
Epoch 1000, Loss: 0.2406
Epoch 2000, Loss: 0.1960
Epoch 3000, Loss: 0.1207
Epoch 4000, Loss: 0.0305
Epoch 5000, Loss: 0.0125
Epoch 6000, Loss: 0.0074
Epoch 7000, Loss: 0.0051
Epoch 8000, Loss: 0.0038
Epoch 9000, Loss: 0.0031

Final Predictions:
[[0.05322146]
 [0.95171535]
 [0.95160449]
 [0.05175396]]


# **3.7.5 Backpropagation Algorithm**

The backpropagation algorithm is a systematic method for training neural networks by iteratively updating weights and biases to minimize the cost function. It begins by initializing weights and biases randomly, followed by feeding input data through the network to compute activations ($a$) and outputs ($\hat{y}$). The gradient of the cost function is calculated using the chain rule, starting with the output layer error:

$$
\delta^{(L)}_j = \frac{d\sigma}{dz} \bigg|_{z^{(L)}_j} (\hat{y}_j - y_j).
$$

Errors are propagated backward through the layers:

$$
\delta^{(l-1)}_j = \frac{d\sigma}{dz} \bigg|_{z^{(l-1)}_j} \sum_{j'=1}^{J_l} \delta^{(l)}_{j'} w^{(l)}_{j,j'}.
$$

Weights and biases are then updated using gradient descent:

$$
w^{(l)}_{j,j'} \gets w^{(l)}_{j,j'} - \beta \delta^{(l)}_{j'} a^{(l-1)}_j,
$$

$$
b^{(l)}_{j'} \gets b^{(l)}_{j'} - \beta \delta^{(l)}_{j'}.
$$

Here $\beta$ is the learning rate, which controls the size of each update step. This process repeats until the desired accuracy is achieved.
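
The steps above can be sketched for a network with any number of layers. The version below is a minimal illustration, assuming sigmoid activations in every layer, the least squares cost, a single training example stored as a vector, and weight matrices indexed as `W[j', j]`. The function names (`forward`, `backward`, `gradient_step`) and the layer sizes are hypothetical choices, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Return the activations of every layer, starting with the input x."""
    activations = [x]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(W @ activations[-1] + b))
    return activations

def backward(weights, activations, y):
    """Return delta = dJ/dz for every layer, for J = 1/2 * sum((y_hat - y)^2)."""
    # For the sigmoid, sigma'(z) = a * (1 - a) with a = sigma(z)
    sigma_prime = [a * (1.0 - a) for a in activations[1:]]
    deltas = [None] * len(weights)
    # Output layer error
    deltas[-1] = sigma_prime[-1] * (activations[-1] - y)
    # Propagate the error backward: sum over the nodes j' of the next layer
    for l in range(len(weights) - 2, -1, -1):
        deltas[l] = sigma_prime[l] * (weights[l + 1].T @ deltas[l + 1])
    return deltas

def gradient_step(weights, biases, x, y, beta):
    """Apply one gradient descent update in place, with learning rate beta."""
    activations = forward(weights, biases, x)
    deltas = backward(weights, activations, y)
    for l in range(len(weights)):
        weights[l] -= beta * np.outer(deltas[l], activations[l])
        biases[l] -= beta * deltas[l]

# Usage on one assumed example with layer sizes 2 -> 3 -> 1
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n_out) for n_out in sizes[1:]]
x, y = np.array([0.0, 1.0]), np.array([1.0])

cost_before = 0.5 * np.sum((forward(weights, biases, x)[-1] - y) ** 2)
for _ in range(200):
    gradient_step(weights, biases, x, y, beta=0.5)
cost_after = 0.5 * np.sum((forward(weights, biases, x)[-1] - y) ** 2)
print("Cost before:", cost_before)
print("Cost after: ", cost_after)
```

Keeping the list of activations from the forward pass is the key design choice: each $\delta^{(l)}$ needs both $\sigma'$ at layer $l$ and the activations of layer $l-1$, so storing them avoids recomputing the forward pass during the backward pass.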


In [5]:
import numpy as np

# Define activation function (sigmoid) and its derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # Expects a = sigmoid(z), not z itself: sigma'(z) = a * (1 - a)
    return a * (1 - a)

# Input data (X) and target output (y)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # XOR input
y = np.array([[0], [1], [1], [0]])  # XOR output

# Initialize weights and biases
np.random.seed(42)
W1 = np.random.rand(2, 2)  # Weights for input to hidden layer
b1 = np.random.rand(1, 2)  # Biases for hidden layer
W2 = np.random.rand(2, 1)  # Weights for hidden to output layer
b2 = np.random.rand(1, 1)  # Biases for output layer

# Learning rate
learning_rate = 0.1

# Training loop
epochs = 10000
for epoch in range(epochs):
    # Forward pass
    z1 = np.dot(X, W1) + b1
    a1 = sigmoid(z1)  # Hidden layer activations

    z2 = np.dot(a1, W2) + b2
    a2 = sigmoid(z2)  # Output layer activations (predictions)

    # Compute the loss (Mean Squared Error)
    loss = np.mean((y - a2) ** 2)

    # Backward pass
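    # Output layer error and delta (gradient of J = 0.5 * sum((a2 - y)**2);
    # the MSE printed below differs from J only by a constant factor)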
    error_output = a2 - y
    delta_output = error_output * sigmoid_derivative(a2)

    # Hidden layer error and delta
    error_hidden = np.dot(delta_output, W2.T)
    delta_hidden = error_hidden * sigmoid_derivative(a1)

    # Gradient calculation
    d_W2 = np.dot(a1.T, delta_output)
    d_b2 = np.sum(delta_output, axis=0, keepdims=True)
    d_W1 = np.dot(X.T, delta_hidden)
    d_b1 = np.sum(delta_hidden, axis=0, keepdims=True)

    # Update weights and biases
    W2 -= learning_rate * d_W2
    b2 -= learning_rate * d_b2
    W1 -= learning_rate * d_W1
    b1 -= learning_rate * d_b1

    # Print loss every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Final predictions
print("\nFinal Predictions:")
print(a2)


Epoch 0, Loss: 0.3247
Epoch 1000, Loss: 0.2406
Epoch 2000, Loss: 0.1960
Epoch 3000, Loss: 0.1207
Epoch 4000, Loss: 0.0305
Epoch 5000, Loss: 0.0125
Epoch 6000, Loss: 0.0074
Epoch 7000, Loss: 0.0051
Epoch 8000, Loss: 0.0038
Epoch 9000, Loss: 0.0031

Final Predictions:
[[0.05322146]
 [0.95171535]
 [0.95160449]
 [0.05175396]]
