# Comparison: Mini-Batches vs. Full Batch, With and Without Clipping

### 📊 Comparison Plots

![Batch vs Full - Without Clipping](batch_vs_full_sem_clipping.png)
![Batch vs Full - With Clipping](batch_vs_full_com_clipping.png)

### 🔹 Confusion Matrices

**Full Batch - Without Clipping**

![Confusion Matrix - Without Clipping](confusion_matrix_batch_completo_sem_clipping.png)

**Full Batch - With Clipping**

![Confusion Matrix - With Clipping](confusion_matrix_batch_completo_com_clipping.png)

### 🧠 Why did we use gradient clipping when testing the full batch?

While testing full-batch training (60,000 samples), we observed exploding gradients: the *loss* jumped to extremely high values (roughly 24 to 31 after the first epoch) and accuracy settled at about 10%, equivalent to random guessing.

A likely cause is that a full batch yields a single weight update per epoch, so the entire gradient is applied in one step. If that step is too large, weights and activations can reach extreme values, which saturates functions such as softmax and causes numerical instability.

To address this, we implemented **gradient clipping**, which caps the L2 norm of the gradients at a maximum value (5.0 in this case). If the norm exceeds that limit, all gradients are scaled down by the same factor. This stabilizes training, prevents the blow-ups, and improves convergence.

With clipping applied, we were able to train the network even with a full batch: the *loss* stayed within a reasonable range (falling from 2.53 to 1.03 over 30 epochs) and test accuracy rose from 0.1004 to 0.6882.


### 🔢 How does gradient clipping work mathematically?

Let $\mathbf{g} \in \mathbb{R}^n$ be the concatenated vector of all the model's gradients (for example, weights and biases). Clipping by L2 norm works as follows:

1. Compute the norm:

   $$
   \|\mathbf{g}\|_2 = \sqrt{g_1^2 + g_2^2 + \ldots + g_n^2}
   $$

2. If $\|\mathbf{g}\|_2 \leq \tau$, the gradient is left unchanged ($\tau$ is the chosen threshold, such as 5.0).

3. Otherwise, apply a scaling factor:

   $$
   \mathbf{g}_{\text{clipped}} = \mathbf{g} \cdot \frac{\tau}{\|\mathbf{g}\|_2 + \varepsilon}
   $$

Here $\varepsilon$ is a small positive constant that guards against division by zero. This rescaling guarantees that the final gradient vector has a norm of at most $\tau$, without changing its direction.
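As a quick worked example with illustrative numbers (not taken from our experiments), take a two-component gradient with $\tau = 5$ and ignore $\varepsilon$:

$$
\mathbf{g} = (30,\ 40), \qquad \|\mathbf{g}\|_2 = \sqrt{30^2 + 40^2} = 50 > \tau
$$

$$
\mathbf{g}_{\text{clipped}} = \mathbf{g} \cdot \frac{5}{50} = (3,\ 4), \qquad \|\mathbf{g}_{\text{clipped}}\|_2 = 5
$$

The components keep their 3:4 ratio, so the direction is preserved while the step length shrinks by a factor of 10.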

This prevents extremely large gradients from causing abrupt jumps in the weight update, which stabilizes learning, especially in deep networks or with large batches.
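The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not necessarily the code used in these experiments: the function name `clip_by_global_norm`, the default `eps=1e-6`, and the example gradient arrays are all assumptions made for the sketch.

```python
import numpy as np

def clip_by_global_norm(grads, tau=5.0, eps=1e-6):
    """Rescale a list of gradient arrays so their joint L2 norm is at most tau.

    Hypothetical helper: name and default eps are assumptions for illustration.
    """
    # Step 1: L2 norm of all gradients, treated as one concatenated vector.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Step 2: at or below the threshold, return the gradients unchanged.
    if global_norm <= tau:
        return grads, global_norm
    # Step 3: shrink every array by the same factor, preserving the direction.
    scale = tau / (global_norm + eps)
    return [g * scale for g in grads], global_norm

# Made-up gradients for one weight matrix and one bias vector.
grad_w = np.full((4, 3), 10.0)
grad_b = np.full(3, -10.0)

clipped, norm_before = clip_by_global_norm([grad_w, grad_b])
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.4f}, norm after: {norm_after:.4f}")
```

The norm is computed jointly over all parameter arrays rather than per array, so the relative sizes of the weight and bias gradients stay the same after clipping. In a training loop, the clipped gradients would then be passed to the usual weight update.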


### ❌ Without Clipping:
```
    Epoch 1/30, Loss: 2.4086, Acc: 0.1053, Val Loss: 27.9799, Val Acc: 0.1899
    Epoch 2/30, Loss: 27.9862, Acc: 0.1897, Val Loss: 31.0849, Val Acc: 0.1000
    Epoch 3/30, Loss: 31.0849, Acc: 0.1000, Val Loss: 28.6563, Val Acc: 0.1701
    Epoch 4/30, Loss: 28.5377, Acc: 0.1735, Val Loss: 28.3900, Val Acc: 0.1004
    Epoch 5/30, Loss: 28.3917, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 6/30, Loss: 31.0734, Acc: 0.1003, Val Loss: 24.2246, Val Acc: 0.1004
    Epoch 7/30, Loss: 24.2253, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 8/30, Loss: 31.0740, Acc: 0.1003, Val Loss: 31.0849, Val Acc: 0.1000
    Epoch 9/30, Loss: 31.0826, Acc: 0.1001, Val Loss: 27.8344, Val Acc: 0.1004
    Epoch 10/30, Loss: 27.8383, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 11/30, Loss: 31.0751, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 12/30, Loss: 31.0740, Acc: 0.1003, Val Loss: 30.2935, Val Acc: 0.1004
    Epoch 13/30, Loss: 30.2958, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 14/30, Loss: 31.0728, Acc: 0.1003, Val Loss: 25.2368, Val Acc: 0.1004
    Epoch 15/30, Loss: 25.2385, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 16/30, Loss: 31.0728, Acc: 0.1003, Val Loss: 26.7380, Val Acc: 0.1000
    Epoch 17/30, Loss: 26.7355, Acc: 0.1001, Val Loss: 27.0227, Val Acc: 0.1004
    Epoch 18/30, Loss: 27.0250, Acc: 0.1003, Val Loss: 28.9454, Val Acc: 0.1004
    Epoch 19/30, Loss: 28.9477, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 20/30, Loss: 31.0734, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 21/30, Loss: 31.0745, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 22/30, Loss: 31.0734, Acc: 0.1003, Val Loss: 27.4549, Val Acc: 0.1004
    Epoch 23/30, Loss: 27.4566, Acc: 0.1003, Val Loss: 29.5507, Val Acc: 0.1004
    Epoch 24/30, Loss: 29.5524, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 25/30, Loss: 31.0728, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 26/30, Loss: 31.0728, Acc: 0.1003, Val Loss: 26.2962, Val Acc: 0.1000
    Epoch 27/30, Loss: 26.2939, Acc: 0.1001, Val Loss: 26.9832, Val Acc: 0.1004
    Epoch 28/30, Loss: 26.9852, Acc: 0.1003, Val Loss: 27.8167, Val Acc: 0.1004
    Epoch 29/30, Loss: 27.8189, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Epoch 30/30, Loss: 31.0734, Acc: 0.1003, Val Loss: 31.0711, Val Acc: 0.1004
    Batch Completo - Test Accuracy: 0.1004 | Test Loss: 31.0711
```

### ✅ With Clipping:
```
    Epoch 1/30, Loss: 2.5320, Acc: 0.1170, Val Loss: 2.3967, Val Acc: 0.1661
    Epoch 2/30, Loss: 2.3971, Acc: 0.1635, Val Loss: 2.2847, Val Acc: 0.2195
    Epoch 3/30, Loss: 2.2860, Acc: 0.2144, Val Loss: 2.1911, Val Acc: 0.2662
    Epoch 4/30, Loss: 2.1933, Acc: 0.2648, Val Loss: 2.1113, Val Acc: 0.3188
    Epoch 5/30, Loss: 2.1142, Acc: 0.3202, Val Loss: 2.0400, Val Acc: 0.3723
    Epoch 6/30, Loss: 2.0432, Acc: 0.3690, Val Loss: 1.9731, Val Acc: 0.4137
    Epoch 7/30, Loss: 1.9765, Acc: 0.4097, Val Loss: 1.9090, Val Acc: 0.4507
    Epoch 8/30, Loss: 1.9124, Acc: 0.4494, Val Loss: 1.8474, Val Acc: 0.4855
    Epoch 9/30, Loss: 1.8508, Acc: 0.4842, Val Loss: 1.7882, Val Acc: 0.5178
    Epoch 10/30, Loss: 1.7916, Acc: 0.5136, Val Loss: 1.7316, Val Acc: 0.5396
    Epoch 11/30, Loss: 1.7349, Acc: 0.5374, Val Loss: 1.6774, Val Acc: 0.5585
    Epoch 12/30, Loss: 1.6807, Acc: 0.5587, Val Loss: 1.6256, Val Acc: 0.5766
    Epoch 13/30, Loss: 1.6289, Acc: 0.5769, Val Loss: 1.5762, Val Acc: 0.5899
    Epoch 14/30, Loss: 1.5792, Acc: 0.5916, Val Loss: 1.5289, Val Acc: 0.6014
    Epoch 15/30, Loss: 1.5317, Acc: 0.6036, Val Loss: 1.4836, Val Acc: 0.6120
    Epoch 16/30, Loss: 1.4863, Acc: 0.6133, Val Loss: 1.4403, Val Acc: 0.6191
    Epoch 17/30, Loss: 1.4428, Acc: 0.6213, Val Loss: 1.3990, Val Acc: 0.6263
    Epoch 18/30, Loss: 1.4012, Acc: 0.6287, Val Loss: 1.3594, Val Acc: 0.6340
    Epoch 19/30, Loss: 1.3615, Acc: 0.6344, Val Loss: 1.3217, Val Acc: 0.6394
    Epoch 20/30, Loss: 1.3236, Acc: 0.6397, Val Loss: 1.2856, Val Acc: 0.6450
    Epoch 21/30, Loss: 1.2873, Acc: 0.6445, Val Loss: 1.2512, Val Acc: 0.6495
    Epoch 22/30, Loss: 1.2527, Acc: 0.6487, Val Loss: 1.2185, Val Acc: 0.6534
    Epoch 23/30, Loss: 1.2197, Acc: 0.6530, Val Loss: 1.1873, Val Acc: 0.6584
    Epoch 24/30, Loss: 1.1883, Acc: 0.6573, Val Loss: 1.1575, Val Acc: 0.6625
    Epoch 25/30, Loss: 1.1583, Acc: 0.6619, Val Loss: 1.1293, Val Acc: 0.6676
    Epoch 26/30, Loss: 1.1298, Acc: 0.6655, Val Loss: 1.1024, Val Acc: 0.6706
    Epoch 27/30, Loss: 1.1027, Acc: 0.6695, Val Loss: 1.0768, Val Acc: 0.6753
    Epoch 28/30, Loss: 1.0769, Acc: 0.6742, Val Loss: 1.0525, Val Acc: 0.6792
    Epoch 29/30, Loss: 1.0524, Acc: 0.6790, Val Loss: 1.0294, Val Acc: 0.6831
    Epoch 30/30, Loss: 1.0291, Acc: 0.6828, Val Loss: 1.0075, Val Acc: 0.6882
    Batch Completo - Test Accuracy: 0.6882 | Test Loss: 1.0075
```


This comparison shows the direct impact of gradient clipping on training stability and performance when using a full batch.