
# Day-60: Regularization In Deep Learning

In deep learning, overfitting is the biggest enemy. It happens when the network's capacity is too high for the amount of data available, allowing it to memorize the noise and anomalies in the training data rather than learning the underlying, general patterns. A heavily overfitted model shows near-100% accuracy on the training set but much worse performance on a validation set.

## Regularization

Regularization refers to techniques designed to explicitly reduce the generalization error (i.e., prevent overfitting) without significantly increasing the training error.


- `Analogy`:
    - Think of your neural network like a student who memorizes answers instead of understanding concepts. They ace the practice tests (training data) but fail real exams (test data).
    - Regularization techniques help the student learn the pattern, not the exact answers; that's generalization.

## Dropout (Preventing Co-Adaptation)

During training only, a certain percentage (e.g., 50%) of the neurons in a hidden layer are randomly shut off (their output is set to zero) for that single forward and backward pass.

- `Analogy`:
    - Imagine you’re training a football team. If you always train with the same players, others never learn.
    - So occasionally, you ask some players to sit out. This forces others to adapt.

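Similarly, Dropout randomly turns off some neurons during training. It prevents co-dependency (co-adaptation) among neurons and improves generalization.

During inference, all neurons are active. To keep the expected activations consistent between training and inference, the outputs are rescaled: the original formulation multiplies outputs by the keep probability at inference time, while most modern frameworks, Keras included, instead scale the surviving activations by 1 / (1 - rate) during training, so nothing needs rescaling at inference.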
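The train/inference behaviour described above can be sketched in plain Python. This is a minimal illustration, assuming the "inverted dropout" variant (survivors are scaled up during training, so inference needs no rescaling); the function name `dropout` and its parameters are hypothetical and chosen only for this example, not taken from any library.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout on a list of activations.

    rate is the probability of dropping each unit (hypothetical name,
    chosen for this sketch). Assumes 0 <= rate < 1.
    """
    if not training or rate == 0.0:
        # Inference: every neuron is active and nothing is rescaled.
        return list(activations)
    keep_prob = 1.0 - rate
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            # Survivors are scaled up so the expected output is unchanged.
            out.append(a / keep_prob)
        else:
            # Dropped neurons contribute exactly zero for this pass.
            out.append(0.0)
    return out

rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [1.0] * 10_000         # hypothetical hidden-layer activations

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(f"dropped: {dropped_fraction:.2f}, train mean: {train_mean:.2f}")
```

Roughly half of the activations are zeroed in training mode, yet the mean stays close to the inference-mode mean because the survivors are doubled. That matching of expected values is the whole purpose of the rescaling.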

## L2 Regularization (Weight Decay)

This is a classic machine learning technique: a penalty on the squared weights is added directly to the loss function, so the gradients computed during backpropagation also pull every weight toward zero.

$$ \text{Total Loss} = \text{Original Loss} + \lambda \sum w^2 $$

where $\lambda$ controls how strong the penalty is.

This discourages the model from relying too heavily on a few large weights, which makes it more stable.
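As a small worked example of the formula above, the sketch below computes the penalty term and its gradient for a handful of weights. The weight values, the data loss of 0.30, and the helper names are all hypothetical and chosen only for illustration.

```python
def l2_penalty(weights, lam):
    """Penalty term of the formula: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]

weights = [0.5, -2.0, 0.1]   # hypothetical weights
original_loss = 0.30         # hypothetical data loss
lam = 0.001                  # penalty strength (lambda)

total_loss = original_loss + l2_penalty(weights, lam)
grads = l2_penalty_grad(weights, lam)

print(f"total loss: {total_loss:.5f}")
print("penalty gradients:", grads)
```

The gradient of the penalty is proportional to the weight itself, so the largest weight (-2.0 here) receives the strongest pull back toward zero, while small weights are barely affected.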

- `Analogy`:
    - Imagine you’re packing for a trip. If your bag (weights) is too heavy, it slows you down.
    - L2 regularization adds a small penalty for carrying large weights — it encourages the model to keep them small and balanced.

## Batch Normalization (BN)

Batch Normalization is a critical technique not just for regularization, but also for dramatically speeding up training and making it more stable.

The Problem: As weights are constantly updated during training, the inputs to subsequent layers are constantly changing. This is called Internal Covariate Shift, and it forces the later layers to constantly adapt, slowing down learning.

Batch Normalization normalizes the inputs of each layer, ensuring faster and more stable training.
It also acts as a mild regularizer, reducing the need for dropout in some architectures.

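Mechanism: In the original formulation, the BN layer is placed before the activation function (the code below places it after the activation, which is also common in practice). For every mini-batch of data, it normalizes the weighted sum ($z$) to a mean of 0 and a standard deviation of 1, then applies a learnable scale $\gamma$ and shift $\beta$ so the layer can still represent other distributions:

$$ \hat{z} = \frac{z - \mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2 + \epsilon}}, \qquad y = \gamma \hat{z} + \beta $$

Here $\epsilon$ is a small constant for numerical stability. At inference, running averages of the mean and variance collected during training are used in place of the batch statistics.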
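The normalization step can be sketched for a single feature across one mini-batch. This is a minimal training-mode illustration only: the small constant `eps` and the learnable scale and shift `gamma` and `beta` are assumptions that go beyond the bare formula (they are standard in batch normalization layers), the variance is the population variance of the batch, and the running statistics used at inference are left out.

```python
import math

def batch_norm(z_batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch using batch statistics."""
    n = len(z_batch)
    mu = sum(z_batch) / n
    # Population (biased) variance of the mini-batch.
    var = sum((z - mu) ** 2 for z in z_batch) / n
    # eps keeps the division safe when the batch variance is near zero.
    z_hat = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    # Learnable scale and shift applied after normalization.
    return [gamma * zh + beta for zh in z_hat]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

z = [2.0, 4.0, 6.0, 8.0]                      # hypothetical pre-activations
normalized = batch_norm(z)
rescaled = batch_norm(z, gamma=2.0, beta=1.0)

print(f"normalized: mean={mean(normalized):.3f}, std={std(normalized):.3f}")
print(f"rescaled:   mean={mean(rescaled):.3f}, std={std(rescaled):.3f}")
```

With the default `gamma=1` and `beta=0` the output has mean 0 and standard deviation very close to 1. With other values the layer can shift and stretch the distribution again, which is why the scale and shift are usually learned.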

- `Analogy`: Consistent Ingredients
    - If you're making a complex recipe, you want your ingredients (the inputs to the next layer) to be consistent. 
    - Batch Normalization ensures that every batch of data entering a layer is standardized, so the layer always sees the data in the same, easy-to-learn distribution.

Benefits:

- Allows for much higher learning rates.

- Acts as a subtle form of regularization, slightly reducing the need for Dropout.

In [2]:
! pip install tensorflow



In [3]:
# Day 60: Regularization in Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.regularizers import l2
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load Data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train.reshape(-1, 784) / 255.0, x_test.reshape(-1, 784) / 255.0
y_train, y_test = to_categorical(y_train), to_categorical(y_test)


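# Build Model
# Each hidden block: Dense layer with an L2 penalty on its weights,
# then BatchNormalization, then Dropout
model = Sequential([
    tf.keras.Input(shape=(784,)),  # explicit input layer; avoids the Keras warning about passing input_shape to a layer
    Dense(512, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(256, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(10, activation='softmax')  # one output per digit class
])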

# Compile Model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train Model
# Note: the test set doubles as the validation set here for simplicity
history = model.fit(x_train, y_train, epochs=10, batch_size=128,
                    validation_data=(x_test, y_test))


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 1/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 5ms/step - accuracy: 0.9221 - loss: 0.8528 - val_accuracy: 0.9591 - val_loss: 0.5595
Epoch 2/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9564 - loss: 0.4758 - val_accuracy: 0.9701 - val_loss: 0.3692
Epoch 3/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9621 - loss: 0.3504 - val_accuracy: 0.9705 - val_loss: 0.2997
Epoch 4/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9642 - loss: 0.2955 - val_accuracy: 0.9708 - val_loss: 0.2680
Epoch 5/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9650 - loss: 0.2748 - val_accuracy: 0.9683 - val_loss: 0.2500
Epoch 6/10
[1m469/469[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 5ms/step - accuracy: 0.9657 - loss: 0.2621 - val_accuracy: 0.9640 - val_loss: 0.2630
Epoch 7/10
[1m469/469[0m 

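Explanation:

- `l2(0.001)` → adds an L2 penalty to the loss: 0.001 times the sum of that layer's squared kernel weights

- `Dropout(0.3)` → randomly zeroes 30% of the preceding layer's outputs on each training step

- `BatchNormalization()` → stabilizes learning and speeds up convergence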
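To relate the training log to the overfitting symptom described at the start (training accuracy far above validation accuracy), the sketch below computes the per-epoch gap. It assumes the `history.history` dictionary returned by `model.fit` exposes the `accuracy` and `val_accuracy` keys shown in the log; the helper name `accuracy_gap` is hypothetical, and the numbers are the six complete epochs copied from the output above.

```python
def accuracy_gap(history):
    """Per-epoch training accuracy minus validation accuracy."""
    pairs = zip(history["accuracy"], history["val_accuracy"])
    return [round(train - val, 4) for train, val in pairs]

# First six epochs copied from the training log above
log = {
    "accuracy":     [0.9221, 0.9564, 0.9621, 0.9642, 0.9650, 0.9657],
    "val_accuracy": [0.9591, 0.9701, 0.9705, 0.9708, 0.9683, 0.9640],
}

gaps = accuracy_gap(log)
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: train - val = {gap:+.4f}")
```

A large positive gap would point to overfitting. In this run the gap is negative for the first five epochs and only slightly positive in the sixth. One likely reason validation accuracy starts out higher is that the reported training accuracy is measured with Dropout active, while validation runs with all neurons on.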