---

# Chapter 11: Training Deep Neural Networks

**Goal:** Address *vanishing/exploding gradients*, weight initialization, activation functions, *Batch Normalization*, *gradient clipping*, optimizers, and *transfer learning*.

---

## 1. Vanishing & Exploding Gradients

* During *backpropagation*, gradients can:

  * **Vanish**: shrink toward zero → training slows down or stalls
  * **Explode**: grow too large → training becomes unstable
* Both problems are most common in very deep networks.
* **Remedies:**

  * Proper weight initialization (Glorot, He)
  * Non-saturating activations (ReLU, LeakyReLU)
  * *Batch Normalization*
  * *Gradient Clipping*
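
To make the two failure modes concrete, here is a toy sketch. It assumes every layer multiplies the backpropagated gradient by the same factor (local derivative times weight), which real networks do not do exactly; the helper name `backprop_scale` is hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_scale(local_grad: float, weight: float, depth: int) -> float:
    """Gradient magnitude after passing back through `depth` identical layers."""
    return abs(local_grad * weight) ** depth


# The sigmoid derivative peaks at 0.25 (at z = 0), so with weight 1.0
# each layer shrinks the gradient by at least a factor of 4.
vanishing = backprop_scale(sigmoid_grad(0.0), weight=1.0, depth=10)

# A per-layer factor above 1 grows geometrically instead.
exploding = backprop_scale(1.0, weight=2.0, depth=10)

print(f"sigmoid, 10 layers:  {vanishing:.2e}")
print(f"factor 2, 10 layers: {exploding:.0f}")
```

Ten layers are enough to move the gradient by several orders of magnitude in either direction, which is why the remedies listed above target the per-layer factor.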

---

## 2. Weight Initialization

* **Glorot/Xavier Initialization** ($n_{in}$ and $n_{out}$ are the layer's number of inputs and outputs):

  $$
  W \sim \mathcal{U} \left[ -\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}} \right]
  $$

* **He Initialization** (well suited to ReLU), with standard deviation $\sigma$:

  $$
  W \sim \mathcal{N} \left( 0, \sigma^2 \right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}
  $$
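
The two formulas above can be sketched in plain Python. This is an illustration under assumptions, not a library implementation: the function names are hypothetical, and the He sampler draws from an ordinary Gaussian, whereas Keras's `HeNormal` uses a truncated normal.

```python
import math
import random


def glorot_uniform_limit(n_in: int, n_out: int) -> float:
    """Half-width of the Glorot uniform range: sqrt(6 / (n_in + n_out))."""
    return math.sqrt(6.0 / (n_in + n_out))


def he_normal_std(n_in: int) -> float:
    """Standard deviation for He initialization: sqrt(2 / n_in)."""
    return math.sqrt(2.0 / n_in)


def sample_glorot_uniform(n_in: int, n_out: int, rng: random.Random):
    limit = glorot_uniform_limit(n_in, n_out)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def sample_he_normal(n_in: int, n_out: int, rng: random.Random):
    std = he_normal_std(n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]


rng = random.Random(0)                       # fixed seed for repeatability
w_glorot = sample_glorot_uniform(784, 128, rng)
w_he = sample_he_normal(784, 128, rng)

print("Glorot limit (784 -> 128):", glorot_uniform_limit(784, 128))
print("He std (784 inputs):      ", he_normal_std(784))
```

Both scales shrink as the layer gets wider, which keeps the variance of each layer's output roughly independent of its fan-in.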

---

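## 3. Non‑Saturating Activations

* **ReLU**:

  $$
  \text{ReLU}(z) = \max(0, z)
  $$

* Other variants:

  * **LeakyReLU**: keeps a small slope for negative inputs, which prevents "dead" neurons
  * **ELU**, **SELU**: smooth negative outputs; SELU is self-normalizing, but only with LeCun normal initialization and AlphaDropout instead of regular dropout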
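
A minimal scalar sketch of three of these activations follows. The `alpha` defaults are illustrative assumptions; libraries pick their own defaults.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # A small negative slope keeps the gradient non-zero for z < 0.
    return z if z > 0 else alpha * z


def elu(z: float, alpha: float = 1.0) -> float:
    # Smoothly approaches -alpha for large negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


for z in (-2.0, 0.0, 3.0):
    print(f"z={z:+.1f}  relu={relu(z):.3f}  "
          f"leaky={leaky_relu(z):.3f}  elu={elu(z):.3f}")
```

For negative inputs ReLU outputs exactly zero, while LeakyReLU and ELU still pass a signal (and therefore a gradient) through.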

---

## 4. Batch Normalization

* Stabilizes training by *normalizing* a layer's activations over each mini-batch:

  * Normalize: $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\mu$ and $\sigma^2$ are the mini-batch mean and variance
  * Scale and shift: $y = \gamma \hat{x} + \beta$ ($\gamma$ and $\beta$ are learned)
* Can both speed up and stabilize training.
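
The two steps can be sketched for a single feature across one mini-batch. This covers only the training-time forward pass with fixed `gamma` and `beta`; at inference time, layers such as Keras's `BatchNormalization` use running averages of the mean and variance instead. The function name and sample values are hypothetical.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n      # biased batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)                 # mean ~0, variance ~1
shifted = batch_norm(activations, gamma=2.0, beta=1.0)

print("normalized:", [round(v, 3) for v in normalized])
print("shifted:   ", [round(v, 3) for v in shifted])
```

With `gamma=1` and `beta=0` the output has zero mean and close to unit variance; the learned parameters let the network undo or rescale that normalization when it helps.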

---

## 5. Gradient Clipping

* Guards against *exploding gradients* by bounding gradient magnitude:

  * **Clip by value**: clamp each gradient component to a fixed range
  * **Clip by norm**: rescale the whole gradient when its norm exceeds a threshold
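
The difference between the two strategies shows up on a small gradient vector. The sketch below uses hypothetical helper names and an L2 norm; it illustrates the idea rather than any optimizer's internals.

```python
import math


def clip_by_value(grads, limit):
    """Clamp every component to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [0.9, 100.0]
by_value = clip_by_value(grads, 1.0)   # direction changes
by_norm = clip_by_norm(grads, 1.0)     # direction is preserved

print("clip by value:", by_value)
print("clip by norm: ", by_norm)
```

Clipping by value can change the direction of the gradient (the two components end up nearly equal here), while clipping by norm keeps the direction and only shortens the step.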

---

## 6. Optimizers

| Optimizer    | Characteristics                                   |
| ------------ | ------------------------------------------------- |
| **SGD**      | Baseline; needs a carefully tuned learning rate   |
| **Momentum** | Adds inertia by accumulating past gradients       |
| **Nesterov** | Look-ahead momentum (more responsive)             |
| **AdaGrad**  | Adaptive per-parameter rates; suits sparse data   |
| **RMSprop**  | Like AdaGrad, but more stable over long runs      |
| **Adam**     | Combines Momentum + RMSprop (most widely used)    |
| **Nadam**    | Adam + Nesterov momentum                          |
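
To show how two of these update rules differ, here is a sketch that minimizes the toy function $f(w) = w^2$ with momentum and with Adam. The objective, the hyperparameter values, and the function names are illustrative assumptions; the update equations follow the standard textbook forms.

```python
import math


def grad(w: float) -> float:
    return 2.0 * w                      # gradient of f(w) = w**2


def sgd_momentum(w, lr=0.1, beta=0.9, steps=200):
    m = 0.0
    for _ in range(steps):
        m = beta * m - lr * grad(w)     # velocity accumulates past gradients
        w = w + m
    return w


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSprop-like)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_momentum = sgd_momentum(5.0)
w_adam = adam(5.0)
first_adam_step = adam(5.0, steps=1)    # moves by about lr, whatever the gradient scale

print("momentum:", w_momentum)
print("adam:    ", w_adam)
print("adam after one step:", first_adam_step)
```

Adam's first step has a size close to the learning rate regardless of how large the gradient is, because the update divides by the running gradient magnitude. That scale invariance is part of why it is less sensitive to the choice of learning rate than plain SGD.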

---

## 7. Transfer Learning

- Take a pretrained model (e.g., MobileNet) → _fine‑tune_ its top layers on the new dataset

---

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, initializers, optimizers, callbacks

# Setup: load Fashion-MNIST and scale pixel values to [0, 1]
(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1).astype("float32") / 255
X_test = X_test.reshape(-1, 28, 28, 1).astype("float32") / 255

# Build a simple deep MLP to illustrate vanishing/exploding gradients
def build_deep_model(init, activation, use_bn=False, clip_norm=None):
    inp = keras.Input(shape=(28, 28, 1))
    x = layers.Flatten()(inp)
    # 10 hidden layers, optionally each followed by Batch Normalization
    for _ in range(10):
        x = layers.Dense(128,
                         activation=activation,
                         kernel_initializer=init)(x)
        if use_bn:
            x = layers.BatchNormalization()(x)
    out = layers.Dense(10, activation="softmax")(x)
    # clipnorm=None disables gradient clipping
    opt = optimizers.Adam(clipnorm=clip_norm)
    model = keras.Model(inp, out)
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# 1) No BN, Glorot uniform init (the Keras default for Dense) → may suffer from vanishing gradients
model1 = build_deep_model("glorot_uniform", "relu", use_bn=False, clip_norm=None)
# 2) BatchNorm + He init + gradient clipping
model2 = build_deep_model(initializers.HeNormal(), "relu", use_bn=True, clip_norm=1.0)

# Train each model briefly (1 epoch, for demo purposes)
print("Training model1 (no BN)...")
model1.fit(X_train, y_train, epochs=1, batch_size=256, validation_split=0.1)
print("\nTraining model2 (BN + He + clip_norm=1)...")
model2.fit(X_train, y_train, epochs=1, batch_size=256, validation_split=0.1)

# Evaluate on the test set: prints [loss, accuracy]
print("\nEvaluate model1:")
print(model1.evaluate(X_test, y_test, verbose=0))
print("Evaluate model2:")
print(model2.evaluate(X_test, y_test, verbose=0))
```

```text
Training model1 (no BN)...
211/211 ━━━━━━━━━━━━━━━━━━━━ 10s 16ms/step - accuracy: 0.5803 - loss: 1.0949 - val_accuracy: 0.8350 - val_loss: 0.4688

Training model2 (BN + He + clip_norm=1)...
211/211 ━━━━━━━━━━━━━━━━━━━━ 18s 30ms/step - accuracy: 0.6350 - loss: 1.1070 - val_accuracy: 0.8178 - val_loss: 0.4966

Evaluate model1:
[0.4947534203529358, 0.8256000280380249]
Evaluate model2:
[0.5355045199394226, 0.8062999844551086]
```


## 8. Transfer Learning with MobileNetV2

```python
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

# Demo: take a 2,500-image subset of CIFAR-10 and resize it to 96x96
(x_c, y_c), _ = keras.datasets.cifar10.load_data()
# Slice before resizing so only the subset is upscaled; preprocess_input
# maps [0, 255] pixels to the [-1, 1] range that MobileNetV2 expects
x_c = keras.applications.mobilenet_v2.preprocess_input(
    tf.image.resize(x_c[:2500], (96, 96)))
y_c = y_c[:2500]

# Load pretrained MobileNetV2 (without the classification top)
base = keras.applications.MobileNetV2(
    input_shape=(96, 96, 3),
    include_top=False,
    weights="imagenet",
    alpha=0.35
)
base.trainable = False  # freeze the base

# Add a new classifier head
inputs = keras.Input(shape=(96, 96, 3))
x = base(inputs, training=False)  # keep BatchNorm layers in inference mode
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.2)(x)
# Keep the softmax output in float32 for numerical stability under mixed precision
outputs = layers.Dense(10, activation="softmax", dtype="float32")(x)
tl_model = keras.Model(inputs, outputs)

tl_model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])

# Train only the new head
tl_model.fit(x_c, y_c, epochs=3, batch_size=16, validation_split=0.1)
```

## Chapter 11 Summary
- Vanishing/exploding gradients are mitigated with careful initialization, ReLU-family activations, BatchNorm, and gradient clipping

- Adam is often the default optimizer choice

- Transfer learning can speed up training and improve performance