# 4.5. Weight Decay

## 4.5.4. Concise Implementation

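Here the network has two `Dense` layers (a 10-unit hidden layer and the output layer), each with its own L2 kernel regularizer. In this case, `net.losses` is a list of two scalar tensors, one per layer. Thus, use `sum(net.losses)` to add the total penalty when computing a batch loss.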
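
To make concrete what `sum(net.losses)` adds to the objective, here is a minimal plain-Python sketch. It relies on the fact that `tf.keras.regularizers.l2(wd)` penalizes a kernel by `wd` times the sum of its squared entries (with no factor of 1/2), and that biases are not penalized here because only `kernel_regularizer` is set. The helper `l2_penalty` and the toy all-ones kernels are hypothetical stand-ins for illustration, not part of Keras or d2l.

```python
def l2_penalty(kernel, wd):
    """Return wd * (sum of squared entries) for a kernel given as a list of rows."""
    return wd * sum(w * w for row in kernel for w in row)

wd = 0.5
# Hypothetical toy kernels: a 4x3 hidden layer and a 3x1 output layer, all ones.
hidden_kernel = [[1.0] * 3 for _ in range(4)]
output_kernel = [[1.0] for _ in range(3)]

# Stand-in for `net.losses`: one scalar penalty per regularized layer.
layer_penalties = [l2_penalty(hidden_kernel, wd), l2_penalty(output_kernel, wd)]
# Stand-in for `sum(net.losses)`: the total penalty added to the data loss.
total_penalty = sum(layer_penalties)

print(layer_penalties)  # [6.0, 1.5]
print(total_penalty)    # 7.5
```

The hidden kernel has 12 entries and the output kernel has 3, so the penalties are 0.5 · 12 and 0.5 · 3. In the real model the same two terms are recomputed from the current weights on every forward pass.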

from d2l import tensorflow as d2l
import tensorflow as tf

n_train, n_test, num_inputs, batch_size = 20, 100, 200, 5
true_w, true_b = tf.ones((num_inputs, 1)) * 0.01, 0.05
train_data = d2l.synthetic_data(true_w, true_b, n_train)
train_iter = d2l.load_array(train_data, batch_size)
test_data = d2l.synthetic_data(true_w, true_b, n_test)
test_iter = d2l.load_array(test_data, batch_size, is_train=False)

wd = 0.5            # weight decay strength (lambda)
# num_inputs = 200 (set above) is d, the number of features
num_epochs, lr = 100, 0.003

net = tf.keras.models.Sequential()
net.add(tf.keras.layers.Dense(10, kernel_regularizer=tf.keras.regularizers.l2(wd)))
net.add(tf.keras.layers.Dense(1, kernel_regularizer=tf.keras.regularizers.l2(wd)))
net.build(input_shape=(1, num_inputs))
loss = tf.keras.losses.MeanSquaredError()
trainer = tf.keras.optimizers.SGD(learning_rate=lr)

for epoch in range(num_epochs):
    for X, y in train_iter:
        with tf.GradientTape() as tape:
            # In a custom training loop, `tf.keras` does not add the layers'
            # regularization losses automatically, so add them to the data loss.
            # Keras losses take their arguments in the order (y_true, y_pred).
            l = loss(y, net(X)) + sum(net.losses)
        grads = tape.gradient(l, net.trainable_variables)
        trainer.apply_gradients(zip(grads, net.trainable_variables))

# 4.6. Dropout

Let $X$ be an input (minibatch) to a dropout layer $D$. 

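If $h$ is a scalar entry of $X$, then during training the dropout layer replaces $h$ with $h'$, where $h' = 0$ with probability $p$ and $h' = h/(1-p)$ with probability $1-p$. The rescaling keeps the expectation unchanged: $E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h$.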
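
The per-entry rule above can be checked empirically. Below is a minimal plain-Python sketch, not the Keras `Dropout` layer: the function name `dropout`, the seed, and the sample size are hypothetical choices for illustration. It zeroes each entry independently with probability `p`, scales the survivors by `1 / (1 - p)`, and then compares the sample mean with the original value `h`.

```python
import random

def dropout(values, p, rng):
    """Apply the per-entry dropout rule to a list of floats."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep_prob = 1.0 - p
    # Each entry is dropped independently with probability p;
    # surviving entries are scaled by 1 / (1 - p).
    return [0.0 if rng.random() < p else h / keep_prob for h in values]

rng = random.Random(0)  # fixed seed only so that reruns give the same draw
h, p = 2.0, 0.5
samples = dropout([h] * 10000, p, rng)
sample_mean = sum(samples) / len(samples)

# Every sample is either 0.0 or h / (1 - p) = 4.0.
print(sorted(set(samples)))
# The sample mean should be close to h = 2.0.
print(sample_mean)
```

With `p = 0` the function returns its input unchanged, which mirrors how dropout is disabled at test time.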