***Q1) Theory and Concepts***

# 1) Explain the concept of batch normalization in the context of ANNs.

Batch normalization is a technique used in ANNs to improve training speed and stability. It normalizes the inputs of each layer across a mini-batch during training, which stabilizes and accelerates learning.
## Key Concepts
1) Normalization: For each mini-batch, the inputs are normalized by subtracting the batch mean and dividing by the batch standard deviation.
2) Scaling and Shifting: After normalization, the normalized value is scaled and shifted using learnable parameters.
## Benefits
1) Improved Training Speed
2) Higher Learning Rates
3) Regularization effect
4) Reduced Sensitivity to Initialization


# 2) Describe the benefits of using batch normalization during training.

Benefits:
1) Improved Training Speed: Normalizing the inputs helps in stabilizing and accelerating the training process by reducing internal covariate shift.


2) Higher Learning Rates: Batch normalization allows the use of higher learning rates, making the training faster.


3) Regularization Effect: It introduces a slight regularization effect by adding noise due to the mini-batch statistics, which can reduce the need for other regularization methods like dropout.
4) Reduced Sensitivity to Initialization: Helps reduce sensitivity to weight initialization, making the network more robust.
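
The regularization effect in point 3 comes from the fact that a sample's normalized value depends on which other samples share its mini-batch. A minimal pure-Python sketch of that dependence follows; the helper `normalize` and the toy values are illustrative assumptions, not part of any library.

```python
import math

def normalize(batch, eps=1e-5):
    """Normalize a list of scalars with the batch's own mean and variance."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

# The same sample (2.0) placed in two different mini-batches.
in_batch_a = normalize([2.0, 1.0, 3.0])[0]  # 2.0 is the mean of this batch
in_batch_b = normalize([2.0, 4.0, 6.0])[0]  # 2.0 is below the mean of this batch

print(in_batch_a, in_batch_b)
```

The first value is zero and the second is clearly negative, even though the raw input is identical. During training this batch-to-batch variation acts like a small amount of noise on every activation.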

# 3) Discuss the working principle of batch normalization, including the normalization step and the learnable parameters.

Batch normalization (BatchNorm) is a technique used in artificial neural networks to improve training speed and stability. It normalizes the input to each layer in the network, making the training process more efficient and reliable. Here’s a detailed discussion of its working principle:

### Working Principle of Batch Normalization:

1. **Normalization Step**:
   - For a given layer in the neural network, consider the input to this layer as a mini-batch of size \( m \), represented as \( \{ x_1, x_2, \ldots, x_m \} \).
   - **Compute Batch Mean**:
     \[
     \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i
     \]
   - **Compute Batch Variance**:
     \[
     \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2
     \]
   - **Normalize the Inputs**: Subtract the batch mean and divide by the square root of the batch variance plus a small constant \( \epsilon \) to avoid division by zero.
     \[
     \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
     \]

2. **Scaling and Shifting**:
   - After normalization, each normalized input \( \hat{x}_i \) is scaled and shifted using two learnable parameters: \(\gamma\) (scale) and \(\beta\) (shift).
     \[
     y_i = \gamma \hat{x}_i + \beta
     \]
   - \(\gamma\) and \(\beta\) are learned during the training process, allowing the model to undo the normalization if needed and maintain the representational capacity of the network.
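
The two steps above can be sketched for a single feature in plain Python. This is a minimal illustration of the formulas, not the Keras implementation; the function name `batch_norm_forward` and the sample values are assumptions made for the example.

```python
import math

def batch_norm_forward(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply batch normalization to a mini-batch of scalars."""
    m = len(xs)
    mu = sum(xs) / m                                  # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m          # batch variance (divides by m)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # normalization step
    y = [gamma * xh + beta for xh in x_hat]           # scale and shift
    return y, mu, var

y, mu, var = batch_norm_forward([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(mu, var)  # batch statistics
print(y)        # outputs centered on beta
```

Note that the variance divides by `m`, matching the formula for \( \sigma_B^2 \) above, rather than by `m - 1`.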

### Summary of Steps:
1. **Input**: A mini-batch of inputs \( \{ x_1, x_2, \ldots, x_m \} \).
2. **Calculate Batch Statistics**:
   - Batch mean \( \mu_B \)
   - Batch variance \( \sigma_B^2 \)
3. **Normalize**: Normalize the batch inputs \( \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \)
4. **Scale and Shift**: Apply the learned scale and shift parameters \( y_i = \gamma \hat{x}_i + \beta \)

### Learnable Parameters:
- **\(\gamma\) (Scale Parameter)**: Adjusts the normalized output’s variance. If \(\gamma\) is greater than 1, it increases the variance, and if it is less than 1, it decreases the variance.
- **\(\beta\) (Shift Parameter)**: Adjusts the normalized output’s mean. It shifts the mean of the normalized outputs.
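
A short check of what the two parameters do, again as a single-feature sketch with hypothetical helper names (`batch_norm`, `mean_and_var`). After the transform the output mean is approximately \(\beta\) and the output variance approximately \(\gamma^2\). Setting \(\gamma = \sqrt{\sigma_B^2 + \epsilon}\) and \(\beta = \mu_B\) recovers the original inputs, which is the sense in which the layer can undo the normalization.

```python
import math

def batch_norm(xs, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of scalars, then scale by gamma and shift by beta."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

def mean_and_var(vals):
    """Return the mean and population variance of a list of scalars."""
    mu = sum(vals) / len(vals)
    return mu, sum((v - mu) ** 2 for v in vals) / len(vals)

xs = [3.0, 7.0, 8.0, 10.0]

# Output statistics are set by beta (mean) and gamma (standard deviation).
out_mean, out_var = mean_and_var(batch_norm(xs, gamma=3.0, beta=-1.0))

# gamma = sqrt(var + eps) and beta = mean turn the layer into the identity.
eps = 1e-5
in_mean, in_var = mean_and_var(xs)
recovered = batch_norm(xs, gamma=math.sqrt(in_var + eps), beta=in_mean, eps=eps)

print(out_mean, out_var)
print(recovered)
```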

### During Training and Inference:
- **Training**: Batch statistics (mean and variance) are computed for each mini-batch and used for normalization.
- **Inference**: Moving averages of the batch mean and variance (calculated during training) are used to normalize the inputs, ensuring consistency and stability.
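
The difference between the two modes can be sketched as follows. The class `ToyBatchNorm` is a hypothetical single-feature illustration; it uses the exponential moving average update that Keras documents for its `BatchNormalization` layer and starts the moving statistics at 0 and 1 as that layer does by default. The momentum of 0.9 is chosen only so the example converges quickly, and \(\gamma\) and \(\beta\) are omitted for brevity.

```python
import math

class ToyBatchNorm:
    """Single-feature batch norm that tracks moving statistics (no gamma/beta)."""

    def __init__(self, momentum=0.9, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, xs, training):
        if training:
            # Training: normalize with the statistics of the current mini-batch.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            # Update the exponential moving averages for later use at inference.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference: use the stored moving averages, not the batch itself.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]

bn = ToyBatchNorm()
for _ in range(200):
    bn([2.0, 6.0], training=True)  # batch mean 4.0, batch variance 4.0

# A single sample can be normalized at inference because no batch statistics are needed.
single = bn([6.0], training=False)[0]
print(bn.moving_mean, bn.moving_var, single)
```

In training mode a batch of one sample would have zero variance and always map to 0, which is one reason inference relies on the moving averages instead.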

### Benefits:
- **Accelerates Training**: Stabilizes the learning process by reducing internal covariate shift.
- **Allows Higher Learning Rates**: Keeps the scale of activations and gradients under control, so larger learning rates are less likely to make training diverge.
- **Regularization Effect**: Adds a slight regularization by introducing noise due to mini-batch statistics, which can reduce overfitting.

In summary, batch normalization normalizes the inputs of each layer to have a mean of zero and a variance of one for each mini-batch, then scales and shifts them using learnable parameters. This improves the stability and speed of the training process, making the network more robust and efficient.

## Q2) Implementation

In [1]:
## Importing required libraries
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
plt.style.use("fivethirtyeight")
%load_ext tensorboard

In [3]:
## loading the Fashion-MNIST data
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
# scale pixel values to [0, 1]
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
# hold out the first 5000 training samples for validation
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [11]:
tf.random.set_seed(42)
np.random.seed(42)
LAYERS = [tf.keras.layers.Flatten(input_shape = [28,28]),
          tf.keras.layers.Dense(300,kernel_initializer="he_normal"),
          tf.keras.layers.LeakyReLU(),
          tf.keras.layers.Dense(100,kernel_initializer="he_normal"),
          tf.keras.layers.LeakyReLU(),
          tf.keras.layers.Dense(10,activation="softmax")]

model = tf.keras.models.Sequential(LAYERS)

In [12]:
##now compile the model
model.compile(loss="sparse_categorical_crossentropy",
              optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics = ["accuracy"])



In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_4 (Flatten)         (None, 784)               0         
                                                                 
 dense_8 (Dense)             (None, 300)               235500    
                                                                 
 leaky_re_lu_5 (LeakyReLU)   (None, 300)               0         
                                                                 
 dense_9 (Dense)             (None, 100)               30100     
                                                                 
 leaky_re_lu_6 (LeakyReLU)   (None, 100)               0         
                                                                 
 dense_10 (Dense)            (None, 10)                1010      
                                                                 
Total params: 266610 (1.02 MB)
Trainable params: 266610 

In [14]:
## calculating the training time
start = time.time()
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid), verbose=2)

#ending time
end = time.time()

# total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/10
1719/1719 - 8s - loss: 0.6819 - accuracy: 0.7693 - val_loss: 0.5044 - val_accuracy: 0.8286 - 8s/epoch - 5ms/step
Epoch 2/10
1719/1719 - 9s - loss: 0.4858 - accuracy: 0.8304 - val_loss: 0.4386 - val_accuracy: 0.8488 - 9s/epoch - 5ms/step
Epoch 3/10
1719/1719 - 8s - loss: 0.4443 - accuracy: 0.8436 - val_loss: 0.5537 - val_accuracy: 0.7934 - 8s/epoch - 5ms/step
Epoch 4/10
1719/1719 - 6s - loss: 0.4203 - accuracy: 0.8532 - val_loss: 0.4009 - val_accuracy: 0.8622 - 6s/epoch - 4ms/step
Epoch 5/10
1719/1719 - 8s - loss: 0.4038 - accuracy: 0.8587 - val_loss: 0.3859 - val_accuracy: 0.8678 - 8s/epoch - 5ms/step
Epoch 6/10
1719/1719 - 6s - loss: 0.3866 - accuracy: 0.8645 - val_loss: 0.3792 - val_accuracy: 0.8696 - 6s/epoch - 4ms/step
Epoch 7/10
1719/1719 - 7s - loss: 0.3754 - accuracy: 0.8684 - val_loss: 0.3765 - val_accuracy: 0.8718 - 7s/epoch - 4ms/step
Epoch 8/10
1719/1719 - 6s - loss: 0.3653 - accuracy: 0.8704 - val_loss: 0.3972 - val_accuracy: 0.8584 - 6s/epoch - 4ms/step
Epoch 9/

In [15]:
##Accuracy = 0.8764

## After Applying Batch Normalization

In [16]:
del model



In [17]:
# Defining a new model with batch normalization
LAYERS_BN = [
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax")
]

model = tf.keras.models.Sequential(LAYERS_BN)

In [19]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_5 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization (Batch  (None, 784)               3136      
 Normalization)                                                  
                                                                 
 dense_11 (Dense)            (None, 300)               235500    
                                                                 
 batch_normalization_1 (Bat  (None, 300)               1200      
 chNormalization)                                                
                                                                 
 dense_12 (Dense)            (None, 100)               30100     
                                                                 
 batch_normalization_2 (Bat  (None, 100)              

In [20]:
bn1 = model.layers[1]

In [21]:
for variable in bn1.variables:
  print(variable.name, variable.trainable)

batch_normalization/gamma:0 True
batch_normalization/beta:0 True
batch_normalization/moving_mean:0 False
batch_normalization/moving_variance:0 False
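
The variable listing above also explains the parameter counts in the summary: each batch normalization layer holds four values per input feature, of which only `gamma` and `beta` are trainable. A small arithmetic sketch, with hypothetical helper names, reproduces the counts that are visible in the two summaries:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """Return (trainable, non_trainable) parameter counts of a batch norm layer."""
    trainable = 2 * n_features      # gamma and beta
    non_trainable = 2 * n_features  # moving_mean and moving_variance
    return trainable, non_trainable

# Batch norm layers on 784 and 300 features, as shown in the summary.
bn_totals = {n: sum(batch_norm_params(n)) for n in (784, 300)}
print(bn_totals)

# Dense layers: 784 -> 300, 300 -> 100, 100 -> 10.
dense_totals = [dense_params(784, 300), dense_params(300, 100), dense_params(100, 10)]
print(dense_totals, sum(dense_totals))
```

The dense counts sum to the 266610 parameters reported for the baseline model, and the batch norm counts match the 3136 and 1200 shown for the first two normalization layers.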


In [22]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])



In [23]:
# now training & calculating the training time.

# starting time
start = time.time()


history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid), verbose=2)

#ending time
end = time.time()

# total time taken
print(f"Runtime of the program is {end - start}")

Epoch 1/10
1719/1719 - 12s - loss: 0.5292 - accuracy: 0.8148 - val_loss: 0.3837 - val_accuracy: 0.8668 - 12s/epoch - 7ms/step
Epoch 2/10
1719/1719 - 10s - loss: 0.3915 - accuracy: 0.8602 - val_loss: 0.3505 - val_accuracy: 0.8764 - 10s/epoch - 6ms/step
Epoch 3/10
1719/1719 - 10s - loss: 0.3567 - accuracy: 0.8716 - val_loss: 0.3493 - val_accuracy: 0.8736 - 10s/epoch - 6ms/step
Epoch 4/10
1719/1719 - 10s - loss: 0.3256 - accuracy: 0.8829 - val_loss: 0.3236 - val_accuracy: 0.8822 - 10s/epoch - 6ms/step
Epoch 5/10
1719/1719 - 11s - loss: 0.3040 - accuracy: 0.8900 - val_loss: 0.3130 - val_accuracy: 0.8866 - 11s/epoch - 6ms/step
Epoch 6/10
1719/1719 - 10s - loss: 0.2889 - accuracy: 0.8952 - val_loss: 0.3138 - val_accuracy: 0.8858 - 10s/epoch - 6ms/step
Epoch 7/10
1719/1719 - 10s - loss: 0.2757 - accuracy: 0.9004 - val_loss: 0.3129 - val_accuracy: 0.8852 - 10s/epoch - 6ms/step
Epoch 8/10
1719/1719 - 10s - loss: 0.2629 - accuracy: 0.9036 - val_loss: 0.3133 - val_accuracy: 0.8846 - 10s/epoch - 6

In [24]:
##Accuracy 0.9123 :))

## Q3) Discuss the advantages and potential limitations of batch normalization in improving the training of neural networks.

### Advantages of Batch Normalization:

1. **Accelerates Training**:
   - **Stabilizes Learning**: By normalizing the inputs of each layer, batch normalization reduces internal covariate shift, leading to more stable and faster convergence.
   - **Higher Learning Rates**: Enables the use of higher learning rates, which can speed up training and lead to better model performance.

2. **Improves Generalization**:
   - **Regularization Effect**: The noise introduced by mini-batch statistics acts as a form of regularization, reducing overfitting and improving generalization to new data.

3. **Reduces Sensitivity to Initialization**:
   - **Robustness to Weights**: Batch normalization makes the network less sensitive to the initial values of the weights, making the training process more reliable.

4. **Allows Deeper Networks**:
   - **Training Deep Networks**: Facilitates the training of very deep networks by mitigating issues related to vanishing and exploding gradients.

5. **Smooths the Loss Landscape**:
   - **Easier Optimization**: By normalizing activations, batch normalization can create a smoother loss landscape, making optimization easier and reducing the likelihood of getting stuck in local minima.

### Potential Limitations of Batch Normalization:

1. **Additional Computation**:
   - **Overhead**: Batch normalization introduces additional computation due to the calculation of mean and variance for each mini-batch, as well as the scaling and shifting operations.

2. **Dependence on Batch Size**:
   - **Small Batches**: The effectiveness of batch normalization can diminish with very small batch sizes, as the batch statistics may become noisy and less representative of the data distribution.
   - **Memory Consumption**: Each layer stores a running mean and variance per feature in addition to \(\gamma\) and \(\beta\), and the normalized activations must be kept for backpropagation, which adds to memory use.

3. **Complexity**:
   - **Implementation**: Adds complexity to the model and training process, especially when combining with other normalization techniques or specialized layers.

4. **Inference Discrepancy**:
   - **Training vs. Inference**: During inference, running averages of batch statistics are used, which might not be as accurate as the batch statistics used during training, potentially leading to a small discrepancy in performance.

5. **Potential for Over-reliance**:
   - **Regularization Dependency**: Relying too heavily on the regularization effect of batch normalization may lead to underutilization of other regularization techniques like dropout, which might be more suitable for certain architectures or tasks.
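
The batch-size limitation in point 2 can be illustrated with a small simulation. This sketch draws mini-batches from a synthetic standard normal population (an assumption for illustration, not the Fashion-MNIST data) and measures how much the batch mean fluctuates from batch to batch.

```python
import random
import statistics

random.seed(0)
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def spread_of_batch_means(batch_size, n_batches=500):
    """Standard deviation of the batch mean across many random mini-batches."""
    means = [
        statistics.fmean(random.sample(population, batch_size))
        for _ in range(n_batches)
    ]
    return statistics.pstdev(means)

small = spread_of_batch_means(2)
large = spread_of_batch_means(64)
print(small, large)
```

The spread for a batch size of 2 is several times larger than for a batch size of 64, so the statistics used for normalization are much noisier with small batches.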

### Summary:
Batch normalization offers several advantages in improving the training of neural networks, including accelerated training, improved generalization, and enabling deeper networks. However, it also comes with potential limitations, such as additional computation, dependence on batch size, and implementation complexity. Despite these limitations, batch normalization remains a widely used and powerful technique in modern deep learning.