# **CHAPTER 11**
# **Training Deep Neural Networks**

This subchapter discusses one of the main difficulties in training deep neural networks: the vanishing and exploding gradients problem. During backpropagation, gradients are propagated backward from the output layer to the earlier layers. If gradients become too small, learning slows down or stops (vanishing gradients). If gradients become too large, weights grow uncontrollably (exploding gradients).
These problems were especially severe in deep networks using sigmoid or tanh activation functions, which saturate for large positive or negative inputs: in those regions the derivative is close to zero, so very little gradient is passed back. As a result, early layers learn extremely slowly compared to later layers.
This issue was one of the main reasons deep neural networks were difficult to train before 2010.
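
To make the vanishing effect concrete, the sketch below multiplies the sigmoid derivative across a stack of layers. It is a deliberately simplified illustration: it ignores the weights and assumes every layer sees the same pre-activation value `z` (both are assumptions made here, and the helper names are hypothetical).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(z, n_layers):
    """Factor contributed by the activation derivatives alone when a gradient
    is passed back through n_layers sigmoid layers (weights ignored)."""
    return sigmoid_grad(z) ** n_layers

for n in (1, 5, 10):
    # z = 0 is the best case for the sigmoid; z = 4 is in the saturated region.
    print(n, gradient_scale(0.0, n), gradient_scale(4.0, n))
```

Even in the best case the factor shrinks geometrically with depth, and it shrinks far faster once the units saturate.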


**Xavier and He Initialization**

To reduce vanishing and exploding gradients, better weight initialization strategies were introduced.
Xavier (Glorot) initialization sets initial weights so that the variance of activations remains stable across layers. It works well with sigmoid and tanh activations.
He initialization is a variant optimized for ReLU activation functions. It uses a higher variance to account for the fact that ReLU outputs zero for half of its inputs.
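By default, Keras uses Glorot initialization with a uniform distribution, regardless of the activation function. To use He initialization, set `kernel_initializer="he_normal"` (or `"he_uniform"`) when creating the layer.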
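
As a rough illustration of the difference between the two schemes, the sketch below computes the standard deviation each one would use for a normal distribution. The layer sizes are hypothetical, and the functions are written here for illustration rather than taken from Keras.

```python
import math

def glorot_std(fan_in, fan_out):
    """Std dev for Glorot (Xavier) normal init: variance = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Std dev for He normal init: variance = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

fan_in, fan_out = 300, 100  # hypothetical layer sizes
print("Glorot std:", glorot_std(fan_in, fan_out))
print("He std:    ", he_std(fan_in))
```

For a layer with equal fan-in and fan-out, the He standard deviation is larger by a factor of sqrt(2), which compensates for ReLU zeroing out roughly half of the inputs.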


**Nonsaturating Activation Functions**

This subchapter explains how choosing the right activation function helps deep networks train faster and more reliably.
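•	ReLU is the most widely used activation function: it does not saturate for positive inputs, which reduces vanishing gradients, and it is cheap to compute.
•	Leaky ReLU keeps a small slope for negative inputs so that neurons do not get stuck outputting zero (“dead neurons”).
•	ELU (Exponential Linear Unit) is smooth and takes negative values for negative inputs, which can improve convergence speed and robustness to noise.
•	SELU enables self-normalizing neural networks when used with LeCun normal initialization, standardized inputs, and a plain stack of dense layers.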
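
The four functions in the list can be written out directly. This is a plain-Python sketch for scalar inputs, not the Keras implementation; the `alpha` default of 0.2 for Leaky ReLU is chosen only for illustration, and the SELU constants are the published values from the SELU paper.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.2):
    """alpha is the slope used for z < 0."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Smoothly approaches -alpha for large negative z."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Constants from the SELU paper (Klambauer et al., 2017).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: a fixed scale and alpha chosen to preserve mean 0, variance 1."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z))
```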


In [2]:
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),  # example input with 20 features
    keras.layers.Dense(32),
    keras.layers.LeakyReLU(alpha=0.2),  # slope of 0.2 for negative inputs
    keras.layers.Dense(10, activation="softmax")  # e.g. 10 output classes
])




In [3]:
layer = keras.layers.Dense(10, activation="selu",
                           kernel_initializer="lecun_normal")

**Batch Normalization**

Batch Normalization (BN) addresses the problem of internal covariate shift by normalizing inputs of each layer during training. This stabilizes learning, allows higher learning rates, and reduces sensitivity to initialization.
BN also acts as a regularizer, often reducing the need for dropout.
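A Batch Normalization layer can be placed either before or after the activation function. The authors of the BN paper placed it before the activation, but which position works better depends on the task.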
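
The computation a BN layer performs during training can be sketched for a single feature as follows. This is a simplified, framework-free illustration: the function names are hypothetical, and the defaults `eps=1e-3` and `momentum=0.99` mirror the defaults of the Keras `BatchNormalization` layer.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale (gamma) and shift (beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * z + beta for z in normalized], mean, var

def update_moving_average(moving, batch_stat, momentum=0.99):
    """Exponential moving average kept for use at inference time."""
    return momentum * moving + (1.0 - momentum) * batch_stat

batch = [2.0, 4.0, 6.0, 8.0]
out, mean, var = batch_norm_train(batch)
moving_mean = update_moving_average(0.0, mean)  # moving mean starts at 0
print(out, mean, var, moving_mean)
```

`gamma` and `beta` are learned by backpropagation, whereas the moving mean and moving variance are updated only by this averaging rule, which is why they are reported as non-trainable variables.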


In [4]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])



In [5]:
model.summary()

In [6]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [8]:
bn_layer = model.layers[1]

# Moving mean and moving variance
moving_mean = bn_layer.moving_mean
moving_var = bn_layer.moving_variance

print("Moving mean:", moving_mean.numpy())
print("Moving variance:", moving_var.numpy())


Moving mean: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ...

In [9]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

In [10]:
class BatchNormalization(keras.layers.Layer):
    [...]
    def call(self, inputs, training=None):
        [...]

**Gradient Clipping**

Gradient clipping limits the magnitude of gradients during backpropagation to prevent exploding gradients. This is especially useful for recurrent neural networks but can also help deep feedforward networks.
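Keras supports gradient clipping through two optimizer arguments: `clipvalue` clips each gradient component to a fixed range, and `clipnorm` rescales the whole gradient when its L2 norm exceeds a threshold.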


In [11]:
optimizer = keras.optimizers.SGD(clipvalue=1.0)
model.compile(loss="mse", optimizer=optimizer)
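
The difference between clipping by value and clipping by norm is easiest to see on a small vector. The sketch below reimplements both in plain Python for illustration; the function names are hypothetical and this is not the Keras code path.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # direction changes: both components end up similar
by_norm = clip_by_norm(grad, 1.0)    # direction preserved: first component becomes tiny
print(by_value)
print(by_norm)
```

Clipping by value can change the direction of the gradient, while clipping by norm keeps the direction and only shortens the vector.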

**Reusing Pretrained Layers**

Transfer learning allows reuse of pretrained models or layers, which significantly reduces training time and improves performance when data is limited.
Typical steps:
1.	Load a pretrained model
2.	Freeze some layers
3.	Train remaining layers
4.	Optionally fine-tune frozen layers with a lower learning rate
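
The four steps can be mimicked with a toy, framework-free model in which each layer is a dictionary with a `trainable` flag. Everything here (the layer names, weights, gradients, and the `sgd_step` helper) is made up for illustration; in Keras, freezing corresponds to setting `layer.trainable = False` and compiling the model again.

```python
def sgd_step(layers, grads, lr):
    """Apply one SGD update, skipping layers that are frozen."""
    for layer, grad in zip(layers, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]

# Step 1: "load" pretrained layers (hypothetical weights) and add a new output layer.
model = [
    {"name": "hidden1", "weights": [0.5, -0.3], "trainable": True},
    {"name": "hidden2", "weights": [0.8, 0.1], "trainable": True},
    {"name": "new_output", "weights": [0.0, 0.0], "trainable": True},
]

# Step 2: freeze the reused layers.
for layer in model[:-1]:
    layer["trainable"] = False

# Step 3: train; only the new output layer changes.
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # made-up gradients
sgd_step(model, grads, lr=0.1)
after_step3 = [list(layer["weights"]) for layer in model]

# Step 4: unfreeze and fine-tune everything with a lower learning rate.
for layer in model:
    layer["trainable"] = True
sgd_step(model, grads, lr=0.001)

print(after_step3)
print([layer["weights"] for layer in model])
```

The lower learning rate in the last step keeps the fine-tuning updates small, so the reused weights are adjusted rather than overwritten.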


**Faster Optimizers**

This section introduces advanced optimization algorithms that converge faster than plain Gradient Descent.
Common optimizers:
•	Momentum
•	Nesterov Accelerated Gradient
•	RMSProp
•	Adam
•	Adamax
•	Nadam


In [19]:
from keras.optimizers import SGD

optimizer = SGD(learning_rate=0.001, momentum=0.9)


In [21]:
from keras.optimizers import SGD

optimizer = SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
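
What the `momentum` and `nesterov` arguments change can be sketched with the textbook update rules for a single parameter. This is an illustrative plain-Python version (function names and the toy objective are assumptions); the exact formulation inside Keras may differ in detail.

```python
def momentum_step(theta, m, grad, lr=0.001, beta=0.9):
    """Momentum update: m <- beta * m - lr * grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

def nesterov_step(theta, m, grad_fn, lr=0.001, beta=0.9):
    """Nesterov variant: measure the gradient slightly ahead, at theta + beta * m."""
    m = beta * m - lr * grad_fn(theta + beta * m)
    return theta + m, m

# With a constant gradient, the velocity m approaches -lr * grad / (1 - beta),
# i.e. 10 times the plain gradient descent step when beta = 0.9.
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=1.0)
terminal_velocity = -0.001 * 1.0 / (1 - 0.9)
print("velocity:", m, "limit:", terminal_velocity)

# Nesterov on the toy objective J(theta) = theta**2, whose gradient is 2 * theta.
theta_n, m_n = 1.0, 0.0
for _ in range(200):
    theta_n, m_n = nesterov_step(theta_n, m_n, lambda t: 2 * t, lr=0.1)
print("theta after Nesterov steps:", theta_n)
```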


In [23]:
from keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.001, rho=0.9)


In [24]:
from keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
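
Adam combines the ideas of momentum and RMSProp: it tracks a running average of the gradients and of the squared gradients. The sketch below is a scalar, plain-Python version of the standard update rule, written for illustration; the `eps` default of 1e-7 matches the Keras default, and the function name is hypothetical.

```python
import math

def adam_step(theta, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single parameter; t is the step count, starting at 1."""
    m = beta_1 * m + (1 - beta_1) * grad         # running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# The first step has size close to lr whatever the gradient's magnitude.
theta_big, _, _ = adam_step(0.0, 0.0, 0.0, grad=50.0, t=1)
theta_small, _, _ = adam_step(0.0, 0.0, 0.0, grad=0.5, t=1)
print(theta_big, theta_small)
```

Because the step is divided by the root of the squared-gradient average, parameters with large and small gradients move by comparable amounts, which is what makes the method adaptive.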


**Learning Rate Scheduling**

Choosing a fixed learning rate is rarely optimal. Learning rate scheduling dynamically adjusts the learning rate during training to improve convergence.
Techniques include:
•	Step decay
•	Exponential decay
•	Power scheduling
•	Performance-based scheduling
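
Power scheduling sets the learning rate to lr0 / (1 + t / s)^c, where t is the step number. A minimal sketch, with hypothetical default values for `lr0`, `s`, and `c`:

```python
def power_schedule(step, lr0=0.01, s=20, c=1):
    """Learning rate after `step` steps: lr0 / (1 + step / s) ** c."""
    return lr0 / (1 + step / s) ** c

for step in (0, 20, 40, 60):
    print(step, power_schedule(step))
```

With c = 1, the rate falls to lr0 / 2 after s steps, lr0 / 3 after 2s steps, and so on, so it drops quickly at first and then more and more slowly.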


In [25]:
from keras.optimizers import SGD

optimizer = SGD(learning_rate=0.01, momentum=0.0, nesterov=False)


In [27]:
def exponential_decay_fn(epoch):
    return 0.01 * 0.1**(epoch / 20)

In [28]:
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1**(epoch / s)
    return exponential_decay_fn


exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [36]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [37]:
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

In [38]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled  = scaler.transform(X_test)

In [39]:
model = keras.models.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),   # input layer
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(3, activation='softmax')     # 3 output classes
])

In [40]:
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy"]
)

In [41]:
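def exponential_decay_fn(epoch, lr):
    # `lr` is the optimizer's current learning rate, so multiplying it by
    # exp(-k * epoch) at every epoch compounds the decay across epochs.
    k = 0.1
    return float(lr * np.exp(-k * epoch))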

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
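
`LearningRateScheduler` passes the current learning rate as the second argument of the schedule function, so a function that multiplies `lr` by `exp(-k * epoch)` feeds each result back into the next epoch. The sketch below replays that arithmetic without Keras and compares it with a schedule computed from the initial rate; the helper names are hypothetical.

```python
import math

def compounding_schedule(lr0, k, n_epochs):
    """Replay lr <- lr * exp(-k * epoch), feeding each result back in."""
    lrs, lr = [], lr0
    for epoch in range(n_epochs):
        lr = lr * math.exp(-k * epoch)
        lrs.append(lr)
    return lrs

def closed_form_schedule(lr0, k, n_epochs):
    """Plain exponential decay computed from the initial rate."""
    return [lr0 * math.exp(-k * epoch) for epoch in range(n_epochs)]

compounded = compounding_schedule(0.01, 0.1, 5)
plain = closed_form_schedule(0.01, 0.1, 5)
for epoch, (a, b) in enumerate(zip(compounded, plain)):
    print(epoch, round(a, 4), round(b, 4))
```

The compounded values decay faster than plain exponential decay, since the exponents accumulate as 0 + 1 + 2 + ... from one epoch to the next.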


In [42]:
history = model.fit(
    X_train_scaled, y_train,
    epochs=20,
    validation_data=(X_valid_scaled, y_valid),
    callbacks=[lr_scheduler]
)

Epoch 1/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 1s 108ms/step - accuracy: 0.2934 - loss: 1.1930 - val_accuracy: 0.3636 - val_loss: 1.0663 - learning_rate: 0.0100
Epoch 2/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 58ms/step - accuracy: 0.4696 - loss: 0.9867 - val_accuracy: 0.6364 - val_loss: 0.8560 - learning_rate: 0.0090
Epoch 3/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step - accuracy: 0.6472 - loss: 0.8150 - val_accuracy: 0.6818 - val_loss: 0.7387 - learning_rate: 0.0074
Epoch 4/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 61ms/step - accuracy: 0.6774 - loss: 0.7034 - val_accuracy: 0.7727 - val_loss: 0.6649 - learning_rate: 0.0055
Epoch 5/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step - accuracy: 0.7368 - loss: 0.6085 - val_accuracy: 0.7727 - val_loss: 0.6160 - learning_rate: 0.0037
Epoch 6/20
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 37ms/step -

In [43]:
loss_test, acc_test = model.evaluate(X_test_scaled, y_test)
print("Test accuracy:", acc_test)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 52ms/step - accuracy: 0.6957 - loss: 0.6062
Test accuracy: 0.695652186870575


In [44]:
X_new = X_test_scaled[:3]
y_pred = model.predict(X_new)
y_pred_classes = np.argmax(y_pred, axis=1)
print("Predicted classes:", y_pred_classes)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step
Predicted classes: [0 0 2]


In [45]:
def exponential_decay_fn(epoch, lr):
    return lr * 0.1**(1 / 20)

In [46]:
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

In [47]:
lr_scheduler = keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

In [48]:
s = 20 * len(X_train) // 32 # number of steps in 20 epochs (batch size = 32)
learning_rate = keras.optimizers.schedules.ExponentialDecay(0.01, s, 0.1)
optimizer = keras.optimizers.SGD(learning_rate)

**Avoiding Overfitting Through Regularization**

Deep networks are prone to overfitting due to their large number of parameters.
Regularization techniques discussed:
•	L1 and L2 regularization
•	Dropout
•	Max-norm constraints
•	Early stopping
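
The last two items in the list can be sketched without a framework. Both functions below are illustrative reimplementations with hypothetical names and made-up data; in Keras the corresponding tools are the `keras.constraints.max_norm` constraint and the `keras.callbacks.EarlyStopping` callback.

```python
import math

def max_norm(weights, r=1.0):
    """Rescale a neuron's incoming weight vector so its L2 norm is at most r."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)
    return [w * r / norm for w in weights]

def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # training would stop here
    return best_epoch

print(max_norm([3.0, 4.0], r=1.0))
# Made-up validation losses: training stops before the final 0.5 is ever seen.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5]))
```

The second call shows the trade-off in choosing `patience`: a small value stops training early and cheaply, but it can miss a later improvement.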


In [49]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
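
A regularizer such as `keras.regularizers.l2(0.01)` adds a penalty term to the loss: 0.01 times the sum of the squared weights of the layer. The sketch below computes the L1 and L2 penalties for a small, made-up weight list; the function names are hypothetical.

```python
def l1_penalty(weights, factor=0.01):
    """L1 penalty: factor times the sum of absolute weights."""
    return factor * sum(abs(w) for w in weights)

def l2_penalty(weights, factor=0.01):
    """L2 penalty: factor times the sum of squared weights."""
    return factor * sum(w * w for w in weights)

weights = [1.0, -2.0, 0.5]  # made-up weights
print("L1 penalty:", l1_penalty(weights))
print("L2 penalty:", l2_penalty(weights))
```

The L2 penalty grows quadratically, so it mostly discourages large weights, whereas the L1 penalty pushes small weights toward exactly zero.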

In [50]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])



In [51]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
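
What a dropout layer does to its inputs during training can be sketched in plain Python. This illustrates the inverted-dropout idea, in which the surviving activations are scaled up; the function name and the seeded generator are assumptions for the example, not the Keras implementation.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and scale the survivors
    by 1 / (1 - rate) so that the expected value is unchanged."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(42)  # seeded so the example is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)
mean_after = sum(dropped) / len(dropped)
print("mean after dropout:", mean_after)  # stays close to the original mean of 1.0
```

Because of the rescaling, nothing needs to change at inference time: dropout is simply switched off and the activations are used as they are.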

In [55]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [56]:
iris = load_iris()
X = iris.data          # 4 features
y = iris.target.reshape(-1, 1)  # class labels 0, 1, 2 as a column vector

In [58]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
y_encoded = encoder.fit_transform(y)  # shape (150, 3)


In [59]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42
)

In [60]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

In [61]:
model = keras.models.Sequential([
    keras.Input(shape=(4,)),             # 4 input features
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3, activation='softmax')  # 3 classes
])

In [62]:
model.compile(
    loss='categorical_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

In [63]:
history = model.fit(
    X_train_scaled, y_train,
    epochs=50,
    batch_size=16,
    validation_split=0.2,
    verbose=1
)

Epoch 1/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 2s 74ms/step - accuracy: 0.2818 - loss: 1.1599 - val_accuracy: 0.5000 - val_loss: 0.9817
Epoch 2/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.2439 - loss: 1.1192 - val_accuracy: 0.5000 - val_loss: 0.9522
Epoch 3/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.2092 - loss: 1.0697 - val_accuracy: 0.5000 - val_loss: 0.9261
Epoch 4/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.3077 - loss: 1.0620 - val_accuracy: 0.5000 - val_loss: 0.8990
Epoch 5/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.2533 - loss: 1.0909 - val_accuracy: 0.5417 - val_loss: 0.8752
Epoch 6/50
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - accuracy: 0.3290 - loss: 1.0235 - val_accuracy: 0.6667 - val_loss: 0.8512
Epoch 7/50
6/6 ━━━━━━━━━━━━━━━━━━

In [64]:
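def predict_mc_dropout(model, X, n_iter=100):
    # Monte Carlo Dropout: run n_iter forward passes with dropout active
    # (training=True) and average the predicted probabilities.
    preds = np.stack([model(X, training=True).numpy() for _ in range(n_iter)])
    return preds.mean(axis=0)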

In [65]:
y_proba = predict_mc_dropout(model, X_test_scaled, n_iter=100)

In [66]:
print("Probability array shape:", y_proba.shape)
print("Probabilities for the first 5 samples:\n", y_proba[:5])

Probability array shape: (30, 3)
Probabilities for the first 5 samples:
 [[4.0260687e-02 6.7122465e-01 2.8851438e-01]
 [9.2543596e-01 4.1811489e-02 3.2752682e-02]
 [6.0832663e-04 1.3655116e-01 8.6284041e-01]
 [3.1433925e-02 5.2417159e-01 4.4439453e-01]
 [1.9243708e-02 5.3971273e-01 4.4104356e-01]]


In [67]:
np.round(model.predict(X_test_scaled[:1]), 2)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 210ms/step


array([[0.02, 0.7 , 0.28]], dtype=float32)

In [72]:
np.round(y_proba[:1], 2)

array([[0.04, 0.67, 0.29]], dtype=float32)

**Summary and Practical Guidelines**

The chapter concludes with practical advice for training deep neural networks effectively:
•	Use ReLU or its variants
•	Apply He initialization
•	Prefer Batch Normalization
•	Use adaptive optimizers like Adam
•	Regularize using dropout or early stopping
•	Tune learning rates carefully
These techniques collectively make training deep neural networks faster, more stable, and more reliable.
