# Initialization of Weights and Biases in Neural Networks
Setting the weights and biases of neural networks is a crucial aspect of building and training the network. Here's how weights and biases are typically initialized and updated in a neural network:

1. **Initialization:**
   - Weights: Weights are initialized randomly at the beginning of training so that units in the same layer do not start out identical. The scale of the initial values can significantly affect how quickly the network learns and how well it finally performs. Xavier (Glorot) initialization and He initialization are common techniques that scale the random weights by the layer size so that they are neither too large nor too small.
   - Biases: Biases are often initialized to zeros or to small random values. Biases are usually less sensitive to initialization than weights, so simple initialization methods are often sufficient.

2. **Forward Pass:**
   - During the forward pass, each layer multiplies its inputs by the weights and adds the biases.
   - For a given layer \( l \), the pre-activation output \( Z^{[l]} \) is calculated as:
     \[ Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]} \]
   - Here \( W^{[l]} \) is the weight matrix, \( A^{[l-1]} \) is the activation of the previous layer, and \( b^{[l]} \) is the bias vector of layer \( l \). The code later in this notebook stores each example as a row vector, so it computes the equivalent `np.dot(A, W) + b`.

3. **Activation Function:**
   - After computing the linear combination of inputs, weights, and biases, the result is passed through an activation function to introduce non-linearity into the network. Common activation functions include ReLU, sigmoid, and tanh.

4. **Backpropagation:**
   - During backpropagation, the gradients of the loss function with respect to the weights and biases are computed.
   - These gradients are then used to update the weights and biases in the opposite direction of the gradient to minimize the loss function.

5. **Weight Update:**
   - The weights are updated using an optimization algorithm such as Stochastic Gradient Descent (SGD), Adam, or RMSprop.
   - The update rule for the weights at each iteration is typically of the form:
     \[ W^{[l]} = W^{[l]} - \alpha \cdot \text{d}W^{[l]} \]
     Where \( \alpha \) is the learning rate, and \( \text{d}W^{[l]} \) is the gradient of the loss function with respect to the weights.

6. **Bias Update:**
   - Biases are updated in a similar manner to weights, using the same optimization algorithm and update rule, but with gradients calculated for the biases instead.
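
The steps above can be sketched for a single layer as follows. This is a minimal illustration, not the notebook's network: the layer sizes, the random data, the squared-error loss, and the learning rate are all assumptions chosen for the example. It uses the column-vector convention of the formulas, with one example per column.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 4 inputs, 3 units, a batch of 5 examples stored as columns.
n_in, n_out, m = 4, 3, 5
A_prev = rng.normal(size=(n_in, m))        # A^{[l-1]}
Y = rng.uniform(size=(n_out, m))           # made-up targets in (0, 1)

# Step 1: initialization (scaled random weights, zero biases).
W = rng.normal(size=(n_out, n_in)) * np.sqrt(1.0 / n_in)
b = np.zeros((n_out, 1))
alpha = 0.1                                # learning rate (assumed value)

def forward(W, b):
    Z = W @ A_prev + b                     # Step 2: Z = W . A_prev + b
    return sigmoid(Z)                      # Step 3: activation function

def loss(W, b):
    A = forward(W, b)
    return 0.5 * np.sum((A - Y) ** 2) / m  # squared error, assumed for this sketch

loss_before = loss(W, b)

# Step 4: backpropagation for squared error with a sigmoid activation.
A = forward(W, b)
dZ = (A - Y) * A * (1.0 - A)
dW = dZ @ A_prev.T / m
db = dZ.sum(axis=1, keepdims=True) / m

# Steps 5 and 6: the same update rule for weights and biases.
W = W - alpha * dW
b = b - alpha * db

loss_after = loss(W, b)
print("loss before:", loss_before, "loss after:", loss_after)
```

A single small step in the direction opposite to the gradient should lower the loss slightly; repeating the step is what training does.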

In summary, weights are initialized randomly (and biases with zeros or small values) at the beginning of training, and both are then updated iteratively using backpropagation and an optimization algorithm to minimize the loss function. These updates gradually improve the performance of the network on the training data.
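
As a rough sketch of the two initialization schemes named above, the helpers below draw weights from a normal distribution with the He and Xavier (Glorot) standard deviations. The function names `he_normal` and `xavier_normal` are hypothetical, and the layer sizes 784 and 100 are only example values. Note that `randn(...) / np.sqrt(n_in / 2)` is the same as multiplying by \( \sqrt{2 / n_{\text{in}}} \), which is the He scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_normal(n_in, n_out):
    # He initialization: standard deviation sqrt(2 / n_in).
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)

def xavier_normal(n_in, n_out):
    # Xavier (Glorot) initialization, normal variant: std sqrt(2 / (n_in + n_out)).
    return rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / (n_in + n_out))

W_he = he_normal(784, 100)
W_xavier = xavier_normal(784, 100)
b = np.zeros(100)  # zero biases, one of the simple choices mentioned above

print("He     std:", W_he.std(), "target:", np.sqrt(2.0 / 784))
print("Xavier std:", W_xavier.std(), "target:", np.sqrt(2.0 / (784 + 100)))
```

The empirical standard deviations should land close to the targets because each matrix holds tens of thousands of samples.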

In [3]:
import numpy as np

#sigmoid activation function
def sigmoid(x):
    return 1 / (1+np.exp(-x))

In [4]:
# Load the training data, skipping the first row
training_data = np.loadtxt('mnist_dataset/mnist_train.csv', delimiter=',', skiprows=1, dtype=np.float32)

# Load the test data, skipping the first row
test_data = np.loadtxt('mnist_dataset/mnist_test.csv', delimiter=',', skiprows=1, dtype=np.float32)


In [5]:
print("training_data.shape = ", training_data.shape, " ,  test_data.shape = ", test_data.shape)

training_data.shape =  (60000, 785)  ,  test_data.shape =  (10000, 785)


In [6]:
class NeuralNetwork:
    
    def __init__(self, input_nodes, hidden_nodes, output_nodes, learning_rate):
        
        self.input_nodes = input_nodes
        self.hidden_nodes = hidden_nodes
        self.output_nodes = output_nodes
        
        # He-style weight initialization for W2: standard normal scaled by sqrt(2 / input_nodes)
        self.W2 = np.random.randn(self.input_nodes, self.hidden_nodes) / np.sqrt(self.input_nodes/2)
        # Bias b2: drawn uniformly from [0, 1)
        self.b2 = np.random.rand(self.hidden_nodes)      
        
        # He-style weight initialization for W3: standard normal scaled by sqrt(2 / hidden_nodes)
        self.W3 = np.random.randn(self.hidden_nodes, self.output_nodes) / np.sqrt(self.hidden_nodes/2)
        # Bias b3: drawn uniformly from [0, 1)
        self.b3 = np.random.rand(self.output_nodes)      
                        
        # Initialization A3,Z3 : A3 is the result of the sigmoid function applied to Z3
        self.Z3 = np.zeros([1,output_nodes])
        self.A3 = np.zeros([1,output_nodes])
        
        # Initialization A2,Z2
        self.Z2 = np.zeros([1,hidden_nodes])
        self.A2 = np.zeros([1,hidden_nodes])
        
        # Initialization A1,Z1
        self.Z1 = np.zeros([1,input_nodes])    
        self.A1 = np.zeros([1,input_nodes])       
        
        # Learning rate Initialization
        self.learning_rate = learning_rate
        
    def feed_forward(self):  
        
        delta = 1e-7    # small offset that keeps np.log from receiving 0
        
        # Calculate Z1,A1 in the input layer
        self.Z1 = self.input_data
        self.A1 = self.input_data
        
        # Calculate Z2,A2 in the hidden layer   
        self.Z2 = np.dot(self.A1, self.W2) + self.b2
        self.A2 = sigmoid(self.Z2)
        
        # Calculate Z3,A3 in the output layer
        self.Z3 = np.dot(self.A2, self.W3) + self.b3
        self.A3 = sigmoid(self.Z3)
        
        # Calculate the loss function value (error) : cross entropy
        return  -np.sum( self.target_data*np.log(self.A3 + delta) + (1-self.target_data)*np.log((1 - self.A3)+delta ) )    
    
    # For external printing
    def loss_val(self):
        
        # Same computation as feed_forward: recompute the activations
        # for the current example and return the cross-entropy loss
        return self.feed_forward()
    
    def train(self, input_data, target_data):   # input_data : 784 , target_data : 10
        
        self.target_data = target_data    
        self.input_data = input_data
        
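        # Run the forward pass (the returned loss value is not used below)
        loss_val = self.feed_forward()
        
        # Calculate loss_3, the output-layer error term.
        # Note: (A3 - target) * A3 * (1 - A3) is the delta for a squared-error loss with a
        # sigmoid output; for the cross-entropy loss in feed_forward it would be (A3 - target)
        loss_3 = (self.A3-self.target_data) * self.A3 * (1-self.A3)
                        
        # Update W3, b3 
        self.W3 = self.W3 - self.learning_rate * np.dot(self.A2.T, loss_3)   
        
        self.b3 = self.b3 - self.learning_rate * loss_3
        
        # Calculate loss_2, the hidden-layer error term.
        # Note: this uses W3 after the update above; standard backpropagation uses the pre-update W3
        loss_2 = np.dot(loss_3, self.W3.T) * self.A2 * (1-self.A2)
        
        # Update W2, b2
        self.W2 = self.W2 - self.learning_rate * np.dot(self.A1.T, loss_2)   
        
        self.b2 = self.b2 - self.learning_rate * loss_2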
  
    def predict(self, input_data):        # Shape of input_data is (1, 784) matrix    
        
        Z2 = np.dot(input_data, self.W2) + self.b2
        A2 = sigmoid(Z2)
        
        Z3 = np.dot(A2, self.W3) + self.b3
        A3 = sigmoid(Z3)
        
        predicted_num = np.argmax(A3)
    
        return predicted_num

    # Accuracy measurement
    def accuracy(self, test_data):
        
        matched_list = []
        not_matched_list = []
        
        for index in range(len(test_data)):
                        
            label = int(test_data[index, 0])
                        
            # Scale pixel values from [0, 255] to [0.01, 1.0]
            data = (test_data[index, 1:] / 255.0 * 0.99) + 0.01
            
                  
            # Vector -> Matrix (for the prediction)
            predicted_num = self.predict(np.array(data, ndmin=2)) 
        
            if label == predicted_num:
                matched_list.append(index)
            else:
                not_matched_list.append(index)
                
        print("Current Accuracy = ", 100*(len(matched_list)/(len(test_data))), " %")
        
        return matched_list, not_matched_list    

In [7]:
# Define variables
input_nodes = 784
hidden_nodes = 100
output_nodes = 10
learning_rate = 0.3
epochs = 1

nn = NeuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)

for i in range(epochs):
    
    for step in range(len(training_data)):  # train
    
        # Build target_data (0.99 for the true digit, 0.01 elsewhere) and scale input_data to [0.01, 1.0]
        target_data = np.zeros(output_nodes) + 0.01    
        target_data[int(training_data[step, 0])] = 0.99
        input_data = ((training_data[step, 1:] / 255.0) * 0.99) + 0.01
    
        nn.train( np.array(input_data, ndmin=2), np.array(target_data, ndmin=2) )


        # Print the loss once every 400 steps
        if step % 400 == 0:
            print("step = ", step,  ",  loss_val = ", nn.loss_val())

step =  0 ,  loss_val =  3.859061828041105
step =  400 ,  loss_val =  1.723945632577408
step =  800 ,  loss_val =  1.1928549870450125
step =  1200 ,  loss_val =  0.7344099337903831
step =  1600 ,  loss_val =  1.1983662569064677
step =  2000 ,  loss_val =  1.7236779494244436
step =  2400 ,  loss_val =  0.7200644094708929
step =  2800 ,  loss_val =  0.878201592652564
step =  3200 ,  loss_val =  0.7319273532757054
step =  3600 ,  loss_val =  0.6821124862118942
step =  4000 ,  loss_val =  0.9132001443741815
step =  4400 ,  loss_val =  0.7681665597132775
step =  4800 ,  loss_val =  0.7991229218871426
step =  5200 ,  loss_val =  0.7818208283113939
step =  5600 ,  loss_val =  2.383541489331239
step =  6000 ,  loss_val =  0.8553421744938714
step =  6400 ,  loss_val =  0.9048363276186253
step =  6800 ,  loss_val =  0.922894875822513
step =  7200 ,  loss_val =  0.7983763436268783
step =  7600 ,  loss_val =  0.8765856048991398
step =  8000 ,  loss_val =  0.9591600869283143
step =  8400 ,  loss_va

In [8]:
nn.accuracy(test_data)

Current Accuracy =  94.22  %


([0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
  20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 34, 35, 36, 37, 39,
  40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59,
  60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
  80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99,
  100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119,
  120, 121, 122, 123, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
  140, 141, 142, 143, 144, 145, 146, 147, 148, 150, 152, 153, 154, 155, 156, 157, 158, 159,
  160, 161, 162, 163, 164,


This code implements a simple neural network for classification tasks using the MNIST dataset. Let's break down the code and explain its components along with some theory:

### NeuralNetwork Class:
- **Initialization (`__init__`):** 
  - Initializes the neural network with parameters like the number of input, hidden, and output nodes, as well as the learning rate.
  - Initializes the weights (`W2` and `W3`) with a He-style scaled normal distribution, draws the biases (`b2` and `b3`) uniformly from [0, 1), and allocates the pre-activation and activation arrays (`Z1` to `Z3`, `A1` to `A3`) as zeros.
  
- **Feed Forward (`feed_forward`):**
  - Computes forward propagation through the network, calculating the activation values (`A1`, `A2`, `A3`) in each layer.
  - Uses the sigmoid activation function for both the hidden layer and the output layer.
  - Computes and returns the loss function value using cross-entropy loss.
  
- **Training (`train`):**
  - Accepts input data and target data.
  - Calls `feed_forward` to compute forward pass and loss.
  - Computes the error terms for the output and hidden layers (`loss_3` and `loss_2`), from which the weight and bias gradients follow.
  - Updates weights and biases (`W2`, `b2`, `W3`, `b3`) using gradient descent.
  
- **Prediction (`predict`):**
  - Accepts input data.
  - Computes forward pass to get the predicted output.
  - Returns the predicted label (digit).
  
- **Accuracy (`accuracy`):**
  - Evaluates the accuracy of the model on test data.
  - Compares predicted labels with actual labels and calculates accuracy.
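
One detail worth checking is which loss the error term in `train` belongs to. `feed_forward` reports a cross-entropy loss, while `train` computes `loss_3 = (A3 - target) * A3 * (1 - A3)`. The sketch below compares both candidate expressions against numerical derivatives with respect to the pre-activation `Z3`. The helper names and the random test vector are assumptions made for this check, not part of the notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(z, t):
    a = sigmoid(z)
    return -np.sum(t * np.log(a) + (1 - t) * np.log(1 - a))

def squared_error(z, t):
    a = sigmoid(z)
    return 0.5 * np.sum((a - t) ** 2)

def numerical_grad(f, z, t, eps=1e-6):
    # Central-difference estimate of df/dz, one component at a time.
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step.flat[i] = eps
        grad.flat[i] = (f(z + step, t) - f(z - step, t)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
z = rng.normal(size=(1, 10))          # made-up output-layer pre-activations
t = np.full((1, 10), 0.01)            # soft one-hot target, as in the training loop
t[0, 3] = 0.99
a = sigmoid(z)

delta_cross_entropy = a - t                    # analytic delta for cross-entropy
delta_squared_error = (a - t) * a * (1 - a)    # the expression used in train()

ce_matches = np.allclose(numerical_grad(cross_entropy, z, t), delta_cross_entropy, atol=1e-6)
se_matches = np.allclose(numerical_grad(squared_error, z, t), delta_squared_error, atol=1e-6)
print("cross-entropy delta is (A3 - t):", ce_matches)
print("train() delta matches squared error:", se_matches)
```

Under these assumptions, the expression in `train` is the gradient of a squared-error loss, so the network is effectively trained on squared error while the printed value is cross-entropy. Both decrease as predictions approach the targets, which is why training still works.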

### Training Loop:
- Initializes a neural network object (`nn`) with specified parameters.
- Iterates over the training dataset for a certain number of epochs.
- For each training step:
  - Normalizes input data and sets the target data for the current example.
  - Calls the `train` method to update the weights and biases based on the current example.
  - Prints the loss value every 400 steps.
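
The per-example preprocessing in the loop can be pulled into a small helper, sketched below. The name `encode_example` and the synthetic row are hypothetical; the scaling constants (0.01 and 0.99) are the ones used in the training loop.

```python
import numpy as np

def encode_example(row, n_classes=10):
    """Split one CSV row (label followed by pixels) into network input and target."""
    label = int(row[0])
    # Scale pixels from [0, 255] to [0.01, 1.0] so that no input is exactly zero.
    inputs = (row[1:] / 255.0) * 0.99 + 0.01
    # Soft one-hot target: 0.99 for the true digit, 0.01 elsewhere,
    # which stays inside the open range (0, 1) that a sigmoid can produce.
    targets = np.full(n_classes, 0.01)
    targets[label] = 0.99
    # Return (1, n) row matrices, the shape that train() expects.
    return inputs.reshape(1, -1), targets.reshape(1, -1)

# A made-up row standing in for one line of mnist_train.csv: label 7 plus 784 pixel values.
row = np.concatenate(([7.0], np.linspace(0, 255, 784))).astype(np.float32)
x, t = encode_example(row)
print(x.shape, t.shape, int(t.argmax()))
```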

### Explanation:
- **Logging Interval (Printing Loss Every 400 Steps):**
  - The interval of 400 is arbitrary and can be adjusted based on the user's preference or the specific needs of the training process.
  - Printing the loss value every few hundred steps helps monitor the training process and check whether the loss is decreasing as expected. The printed value is the loss on the single most recent training example, so it fluctuates from print to print even when training is going well.
  - Printing too often produces excessive output, while printing too rarely gives little visibility into training progress, so the interval is a trade-off between the two.
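
Because a single-example loss is noisy, one option is to report the mean loss over each logging window instead. The sketch below shows the idea on synthetic loss values; the function name `log_running_loss` and the fake loss curve are assumptions for illustration only.

```python
import numpy as np

def log_running_loss(losses, log_every=400):
    """Return (step, mean loss over the last window) pairs, one per logging interval."""
    reports = []
    window = []
    for step, loss in enumerate(losses, start=1):
        window.append(loss)
        if step % log_every == 0:
            reports.append((step, float(np.mean(window))))
            window = []  # start a fresh window after each report
    return reports

# Synthetic per-step losses: a decaying trend plus noise (not real training output).
rng = np.random.default_rng(0)
fake_losses = 1.0 / np.sqrt(np.arange(1, 2001)) + rng.uniform(0.0, 0.1, size=2000)

for step, mean_loss in log_running_loss(fake_losses, log_every=400):
    print("step =", step, ", mean loss over window =", round(mean_loss, 4))
```

In the notebook's loop, the same effect could be had by collecting the value returned by `nn.loss_val()` at every step and averaging it at each print.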

In summary, this code implements a basic neural network model for classifying handwritten digits from the MNIST dataset. It demonstrates key concepts such as feedforward, backpropagation, gradient descent, and scaled random weight initialization (the code uses the He-style factor \( \sqrt{2 / n_{\text{in}}} \)). The training loop iterates through the dataset, updating weights and biases based on the calculated gradients, and periodically prints the loss value to monitor training progress.

# Xavier Initialization
Xavier initialization, also known as Glorot initialization, is a method for initializing the weights of neural networks. It is named after its proposer, Xavier Glorot. It aims to choose the initial weight scale so that activations and gradients neither explode (become too large) nor vanish (become too small) as they pass through the layers. It is particularly useful for activation functions like tanh and sigmoid.

The basic idea behind Xavier initialization is to scale the initial weights according to the number of input and output units of the layer. The scale factor is inversely proportional to the square root of the sum of the number of input and output units. This helps keep the variance of the activations and gradients approximately constant across layers, which aids the convergence of the training process.
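
The effect on activation variance can be illustrated with a small experiment: push random inputs through a stack of tanh layers and record the standard deviation of the activations after each layer. This is only a sketch under assumed settings (10 layers, width 256, normally distributed weights); the function name `activation_std_by_layer` is hypothetical.

```python
import numpy as np

def activation_std_by_layer(weight_std_fn, n_layers=10, width=256, n_samples=1000, seed=0):
    """Push unit-variance inputs through tanh layers; return the activation std after each layer."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width, width)
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    return stds

# Xavier scale versus two fixed scales chosen to be clearly too small and too large.
xavier = activation_std_by_layer(lambda n_in, n_out: np.sqrt(2.0 / (n_in + n_out)))
too_small = activation_std_by_layer(lambda n_in, n_out: 0.01)
too_large = activation_std_by_layer(lambda n_in, n_out: 1.0)

for name, stds in [("xavier", xavier), ("too small", too_small), ("too large", too_large)]:
    print(f"{name:>9}: first layer std = {stds[0]:.3f}, last layer std = {stds[-1]:.2e}")
```

With the small scale the activations shrink toward zero layer by layer, and with the large scale tanh saturates near -1 and 1. The Xavier scale should keep the activations in a moderate range throughout.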

Mathematically, for a weight matrix \( W \) with dimensions \( n_{\text{in}} \times n_{\text{out}} \), Xavier initialization initializes the weights as:

\[ W \sim U\left(-\frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}, \frac{\sqrt{6}}{\sqrt{n_{\text{in}} + n_{\text{out}}}}\right) \]

Where \( U(a, b) \) denotes a uniform distribution in the interval \( [a, b] \).
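
A minimal NumPy sketch of sampling from this distribution is shown below. The function name `xavier_uniform` and the example layer sizes are assumptions. Since a uniform distribution on \( [-a, a] \) has variance \( a^2 / 3 \), the weights drawn this way have variance \( 2 / (n_{\text{in}} + n_{\text{out}}) \), which the last line checks empirically.

```python
import numpy as np

def xavier_uniform(n_in, n_out, rng):
    # Sample from U(-limit, limit) with limit = sqrt(6) / sqrt(n_in + n_out).
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 784, 100                      # example layer sizes
W = xavier_uniform(n_in, n_out, rng)

limit = np.sqrt(6.0 / (n_in + n_out))
expected_var = 2.0 / (n_in + n_out)         # limit**2 / 3
print("max |W|:", np.abs(W).max(), "limit:", limit)
print("empirical variance:", W.var(), "expected:", expected_var)
```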

Now, let's discuss how weights are adjusted during the training process:

1. **Forward Pass (Feedforward):** During the forward pass, input data is propagated through the network, and predictions are made. At each layer, the input is multiplied by the weight matrix, the bias is added, and the result is passed through an activation function to produce the output.

2. **Loss Calculation:** After the forward pass, the loss between the predicted output and the actual target is calculated. This loss function quantifies how far the predictions are from the actual targets.

3. **Backpropagation:** Backpropagation is the process of computing gradients of the loss function with respect to the weights of the network. It works by recursively applying the chain rule of calculus from the output layer to the input layer.

4. **Weight Update:** Once the gradients are computed, the weights are updated to minimize the loss function. This is typically done using an optimization algorithm such as Stochastic Gradient Descent (SGD) or its variants. The weights are adjusted in the opposite direction of the gradient, scaled by a learning rate, which controls the size of the update step.

The value of 400 in the training loop is the interval at which the loss value is printed. Printing every 400 steps monitors the progress of training without flooding the output with print statements. The interval can be adjusted based on factors such as the size of the dataset and how closely training needs to be watched.