# Vanishing Gradient Problem in Multilayered Neural Networks

## 🧠 What is it?
The vanishing gradient problem occurs during backpropagation in deep neural networks: the gradients of the loss shrink toward zero as they are propagated backward through the layers.
As a result:

   1. Earlier layers learn very slowly or not at all.

   2. The network fails to converge or becomes very hard to train.



## 🔁 Why Does It Happen?
By the chain rule, the gradient that reaches an early layer is a product of per-layer derivatives. When those factors have magnitude below 1, the product shrinks exponentially with depth.

Example:

If each layer contributes a factor of 0.5 to the gradient and there are 10 layers:

Total gradient = 0.5^10 ≈ 0.00098 → Almost zero

This leads to almost no weight update in early layers.
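The arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper name `chained_gradient` is made up for illustration, and it assumes every layer contributes the same factor, which real networks do not.

```python
def chained_gradient(per_layer_gradient: float, num_layers: int) -> float:
    """Multiply the same per-layer factor num_layers times (chain rule)."""
    total = 1.0
    for _ in range(num_layers):
        total *= per_layer_gradient
    return total


# Show how quickly the product shrinks as the network gets deeper.
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={chained_gradient(0.5, depth):.10f}")
```

At depth 10 the product is 0.5^10 = 0.0009765625, matching the estimate above, and doubling the depth to 20 squares that already tiny number.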



## 🔍 Mostly Happens With:
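1. Deep networks (many hidden layers).

2. Sigmoid or tanh activation functions:

    1. Their derivatives are at most 0.25 (sigmoid) or 1 (tanh), so a product of many of them shrinks fast.

    2. When a unit saturates (sigmoid output near 0 or 1, tanh output near -1 or 1), its derivative is close to 0.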
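To make the saturation behaviour concrete, the sketch below evaluates the derivatives of sigmoid and tanh at a few inputs, using only the standard formulas sigmoid'(x) = s(1 - s) and tanh'(x) = 1 - tanh(x)^2. The function names are illustrative, not from any library.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0


def tanh_derivative(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2  # peaks at 1.0 when x = 0


# Derivatives are largest at x = 0 and collapse as the unit saturates.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_derivative(x):.6f}  "
          f"tanh'={tanh_derivative(x):.6f}")
```

Even in the best case (x = 0) each sigmoid layer scales the gradient by at most 0.25, and for large |x| both derivatives are effectively zero.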



## ❌ Effect:
1. Weights in early layers stop learning.

2. Training becomes slow or gets stuck.

3. The final model performs poorly.

## ✅ Solutions to Vanishing Gradient:

Common remedies include:

1. ReLU activation (its derivative is 1 for positive inputs, so it does not squash gradients the way sigmoid/tanh do).

2. Batch Normalization (keeps activations in a stable range).

3. He or Xavier initialization (scales the initial weights so that gradients do not shrink as they pass through the layers).

4. Skip connections, as in ResNet (allow the gradient to flow directly to earlier layers).

5. LSTM/GRU in RNNs (designed to retain gradients better than vanilla RNNs).
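As a rough illustration of why ReLU and skip connections help, the sketch below backpropagates through a toy chain of scalar "layers". Everything here is an assumption made for the example: the helper `backprop_gradient`, the single shared weight of 1.0, and the one-unit-per-layer design. It is not how a real framework computes gradients, only the chain rule written out by hand.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


def backprop_gradient(x, depth, act, act_grad, weight=1.0, skip=False):
    """Return d(output)/d(input) for a chain of scalar layers.

    Each layer computes h = act(weight * h_prev); with skip=True it
    computes h = h_prev + act(weight * h_prev) instead.
    """
    h = x
    grad = 1.0
    for _ in range(depth):
        z = weight * h
        local = weight * act_grad(z)  # derivative of this layer
        if skip:
            local += 1.0              # identity path adds 1 to the derivative
            h = h + act(z)
        else:
            h = act(z)
        grad *= local                 # chain rule: multiply local derivatives
    return grad


print("sigmoid       :", backprop_gradient(1.0, 10, sigmoid, sigmoid_grad))
print("relu          :", backprop_gradient(1.0, 10, relu, relu_grad))
print("sigmoid + skip:", backprop_gradient(1.0, 10, sigmoid, sigmoid_grad, skip=True))
```

In this toy setup the plain sigmoid chain yields a gradient below 1e-6 after 10 layers, the ReLU chain keeps it at exactly 1.0 for a positive input, and the skip connection keeps every per-layer factor at 1 or above.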

## 📌 Summary:
Vanishing gradient = gradient becomes too small → early layers stop learning → deep networks fail to train properly.