:label:`sec_mlp_scratch`
Now that we have characterized
multilayer perceptrons (MLPs) mathematically,
let us try to implement one ourselves. To compare against our previous results
achieved with softmax regression
(:numref:`sec_softmax_scratch`),
we will continue to work with
the Fashion-MNIST image classification dataset
(:numref:`sec_fashion_mnist`).
from d2l import mxnet as d2l
from mxnet import gluon, np, npx
npx.set_np()
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
#@tab all
batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
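Before moving on, it can help to peek at what the data iterator actually yields. The following check is a hypothetical snippet written against the PyTorch tab; the exact tensor types differ by framework, but the loader typically returns the images as 28 × 28 single-channel tensors together with integer class labels.

```python
# Peek at one minibatch (sketch for the PyTorch tab).
# Shapes are an assumption: typically (256, 1, 28, 28) images and (256,) labels.
X, y = next(iter(train_iter))
print(X.shape, y.shape)
```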
Recall that Fashion-MNIST contains 10 classes, and that each image consists of a 28 × 28 = 784 grid of grayscale pixel values. Disregarding the spatial structure among the pixels for now, we can treat each image as a vector of 784 input features. To begin, we will implement an MLP with one hidden layer and 256 hidden units; both of these quantities can be regarded as hyperparameters.
Again, we will represent our parameters with several tensors. Note that for every layer, we must keep track of one weight matrix and one bias vector. As always, we allocate memory for the gradients of the loss with respect to these parameters.
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = np.random.normal(scale=0.01, size=(num_inputs, num_hiddens))
b1 = np.zeros(num_hiddens)
W2 = np.random.normal(scale=0.01, size=(num_hiddens, num_outputs))
b2 = np.zeros(num_outputs)
params = [W1, b1, W2, b2]
for param in params:
    param.attach_grad()
#@tab pytorch
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = nn.Parameter(torch.randn(
    num_inputs, num_hiddens, requires_grad=True) * 0.01)
b1 = nn.Parameter(torch.zeros(num_hiddens, requires_grad=True))
W2 = nn.Parameter(torch.randn(
    num_hiddens, num_outputs, requires_grad=True) * 0.01)
b2 = nn.Parameter(torch.zeros(num_outputs, requires_grad=True))
params = [W1, b1, W2, b2]
#@tab tensorflow
num_inputs, num_outputs, num_hiddens = 784, 10, 256
W1 = tf.Variable(tf.random.normal(
    shape=(num_inputs, num_hiddens), mean=0, stddev=0.01))
b1 = tf.Variable(tf.zeros(num_hiddens))
W2 = tf.Variable(tf.random.normal(
    shape=(num_hiddens, num_outputs), mean=0, stddev=0.01))
b2 = tf.Variable(tf.random.normal([num_outputs], stddev=.01))
params = [W1, b1, W2, b2]
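With one hidden layer of 256 units, the first layer holds 784 × 256 + 256 = 200,960 parameters and the second 256 × 10 + 10 = 2,570, for 203,530 in total. If you want to make this bookkeeping explicit, a quick shape check along the following lines (sketched for the PyTorch tab; the other frameworks are analogous) can be run right after the initialization above.

```python
# Confirm the shapes of the weight matrices and bias vectors, and count
# the total number of parameters (sketch for the PyTorch tab).
for name, p in zip(['W1', 'b1', 'W2', 'b2'], params):
    print(name, tuple(p.shape))
print('total parameters:', sum(p.numel() for p in params))
```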
To make sure we know how everything works, we will implement the ReLU activation ourselves, using the maximum function rather than invoking the built-in `relu` function directly.
def relu(X):
    return np.maximum(X, 0)
#@tab pytorch
def relu(X):
    a = torch.zeros_like(X)
    return torch.max(X, a)
#@tab tensorflow
def relu(X):
    return tf.math.maximum(X, 0)
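As a quick sanity check, applying our relu to a small tensor should zero out the negative entries and leave the nonnegative ones untouched. The snippet below is a hypothetical check written for the PyTorch version; the other tabs behave the same way.

```python
# Negative entries are clamped to zero; nonnegative entries pass through.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # expected values: [0., 0., 0., 1.5, 3.]
```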
Because we are disregarding spatial structure, we reshape each two-dimensional image into a flat vector of length `num_inputs`. Finally, we implement our model with just a few lines of code.
def net(X):
    X = d2l.reshape(X, (-1, num_inputs))
    H = relu(np.dot(X, W1) + b1)
    return np.dot(H, W2) + b2
#@tab pytorch
def net(X):
    X = d2l.reshape(X, (-1, num_inputs))
    H = relu(X@W1 + b1)  # Here '@' stands for matrix multiplication
    return (H@W2 + b2)
#@tab tensorflow
def net(X):
    X = d2l.reshape(X, (-1, num_inputs))
    H = relu(tf.matmul(X, W1) + b1)
    return tf.matmul(H, W2) + b2
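To make the shape bookkeeping concrete: a minibatch of 256 images enters as a 256 × 784 matrix after the reshape, the hidden representation H is 256 × 256, and the output is 256 × 10, one logit per class. A forward pass on a dummy batch (a hypothetical check for the PyTorch tab, using random data rather than Fashion-MNIST) confirms this.

```python
# Forward a batch of random "images" and check that we get 10 logits per example.
X_dummy = torch.randn(256, 1, 28, 28)
print(net(X_dummy).shape)  # expected: torch.Size([256, 10])
```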
To ensure numerical stability,
and because we already implemented
the softmax function from scratch
(:numref:`sec_softmax_scratch`),
we leverage the integrated function from high-level APIs
for calculating the softmax and cross-entropy loss.
Recall our earlier discussion of these intricacies
in :numref:`subsec_softmax-implementation-revisited`.
We encourage the interested reader
to examine the source code for the loss function
to deepen their knowledge of implementation details.
loss = gluon.loss.SoftmaxCrossEntropyLoss()
#@tab pytorch
loss = nn.CrossEntropyLoss()
#@tab tensorflow
def loss(y_hat, y):
    return tf.losses.sparse_categorical_crossentropy(
        y, y_hat, from_logits=True)
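Note that these loss functions consume the raw logits produced by net and apply the softmax internally, computing the log-probabilities via the log-sum-exp trick rather than exponentiating first and taking a logarithm afterwards. The following hypothetical PyTorch snippet, not part of the book's code, shows that on well-behaved logits the fused loss matches the naive softmax-then-log computation; on very large logits the naive route can overflow while the fused one stays finite.

```python
# Compare the naive two-step computation with the fused loss on one example.
logits = torch.tensor([[2.0, -1.0, 0.5]])
label = torch.tensor([0])
naive = -torch.log(torch.softmax(logits, dim=1))[0, label]
fused = nn.CrossEntropyLoss()(logits, label)
print(naive.item(), fused.item())  # the two values agree
```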
Fortunately, the training loop for MLPs is exactly the same as for softmax regression. Leveraging the `d2l` package again, we call the `train_ch3` function (see :numref:`sec_softmax_scratch`), setting the number of epochs to 10 and the learning rate to 0.1.
num_epochs, lr = 10, 0.1
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs,
              lambda batch_size: d2l.sgd(params, lr, batch_size))
#@tab pytorch
num_epochs, lr = 10, 0.1
updater = torch.optim.SGD(params, lr=lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
#@tab tensorflow
num_epochs, lr = 10, 0.1
updater = d2l.Updater([W1, W2, b1, b2], lr)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, updater)
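If you are curious what train_ch3 does on each minibatch, its inner loop amounts to the familiar recipe: forward pass, loss, backward pass, parameter update. Below is a minimal sketch of one epoch, written against the PyTorch tab and using the net, loss, and updater defined above; the real train_ch3 additionally accumulates metrics and animates the training curves.

```python
# One epoch of training, stripped to its essentials (sketch for the PyTorch tab).
for X, y in train_iter:
    y_hat = net(X)       # forward pass: compute logits
    l = loss(y_hat, y)   # mean cross-entropy loss on the minibatch
    updater.zero_grad()  # clear gradients from the previous step
    l.backward()         # backpropagate through the two layers
    updater.step()       # SGD update of W1, b1, W2, b2
```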
To evaluate the learned model, we apply it to some test data.
#@tab all
d2l.predict_ch3(net, test_iter)
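Beyond inspecting a handful of predictions, we can also compute the accuracy over the entire test set by hand. The following is a sketch for the PyTorch tab; it simply counts how often the highest-scoring logit matches the label.

```python
# Fraction of test images classified correctly (sketch for the PyTorch tab).
correct, total = 0, 0
with torch.no_grad():
    for X, y in test_iter:
        preds = net(X).argmax(dim=1)
        correct += (preds == y).sum().item()
        total += y.numel()
print('test accuracy:', correct / total)
```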
- We saw that implementing a simple MLP is easy, even when done manually.
- However, with a large number of layers, implementing MLPs from scratch can still get messy (e.g., naming and keeping track of our model's parameters).
- Change the value of the hyperparameter `num_hiddens` and see how this hyperparameter influences your results. Determine the best value of this hyperparameter, keeping all others constant.
- Try adding an additional hidden layer to see how it affects the results.
- How does changing the learning rate alter your results? Fixing the model architecture and other hyperparameters (including number of epochs), what learning rate gives you the best results?
- What is the best result you can get by optimizing over all the hyperparameters (learning rate, number of epochs, number of hidden layers, number of hidden units per layer) jointly?
- Describe why it is much more challenging to deal with multiple hyperparameters.
- What is the smartest strategy you can think of for structuring a search over multiple hyperparameters?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: