# Multilayer Perceptron (MLP)
In this notebook, we build a multilayer perceptron for digit recognition and train it on the MNIST dataset. We used [Deep-Learning-Experiments](https://github.com/roatienza/Deep-Learning-Experiments/blob/master/versions/2022/mlp/python/mlp_pytorch_demo.ipynb) as a reference.

In [1]:
# Import necessary libraries
import torch
from torch import nn
import torchvision
import pytorch_lightning as pl
from torchmetrics import Accuracy
from torch.optim import SGD, Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from einops import rearrange
from argparse import ArgumentParser



We create a simple 4-layer multilayer perceptron for digit recognition. We subclass `torch.nn.Module`, which handles parameter registration and other boilerplate for us. The number of input features corresponds to the MNIST image size, $28\times28 = 784$ (the images are grayscale, so there is a single channel and no need to multiply by 3). Each hidden layer has $256$ nodes. Finally, the number of classes is set to $10$, one for each possible digit in the MNIST dataset.

In [2]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, num_features=28*28, num_hidden=256, num_classes=10):
        # Initiate the nn.Module superclass
        super().__init__()

        # Build the layers of the MLP
        self.fc_in = nn.Linear(num_features,num_hidden)
        self.fc_hid = nn.Linear(num_hidden,num_hidden)
        self.fc_out = nn.Linear(num_hidden,num_classes)

        # Set up the activation function (we choose ReLU) and the softmax function (for the output).
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1) # Comment out if using CrossEntropyLoss()

    def forward(self, x):
        # Flatten the input x from (b, 1, 28, 28) to (b, 1*28*28) = (b, 784) to match num_features.
        y = rearrange(x, 'b c h w -> b (c h w)')

        # Feed the rearranged input data to the input layer, then feed to the activation function.
        y = self.fc_in(y)
        y = self.relu(y)

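        # Do the same for the 2 hidden layers. Note that both passes reuse the same
        # fc_hid module, so the two hidden layers share one set of weights.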
        y = self.fc_hid(y)
        y = self.relu(y)
        y = self.fc_hid(y)
        y = self.relu(y)

        # Feed the resulting tensor into the output layer. No ReLU is applied here because softmax serves as the output activation.
        y = self.fc_out(y)
        y = self.softmax(y) # Comment out if using CrossEntropyLoss()

        return y
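
As a quick sanity check on the architecture, the parameter count can be worked out by hand. The sketch below is plain Python (no PyTorch required) and assumes each `nn.Linear` layer holds `in * out` weights plus `out` biases, stored as 32-bit floats. The helper name `linear_params` is ours, not part of any library. Because `forward` applies the single `fc_hid` module twice, its parameters are counted only once.

```python
def linear_params(n_in, n_out):
    """Parameters in a fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

num_features, num_hidden, num_classes = 28 * 28, 256, 10

counts = {
    "fc_in": linear_params(num_features, num_hidden),
    "fc_hid": linear_params(num_hidden, num_hidden),  # reused twice in forward, counted once
    "fc_out": linear_params(num_hidden, num_classes),
}
total = sum(counts.values())
size_mb = total * 4 / 1e6  # assumption: 4 bytes per parameter (float32)

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,} parameters, about {size_mb:.3f} MB")
```

The total of 269,322 parameters agrees with the 269 K trainable parameters and the 1.077 MB estimate that Lightning reports in the training log.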

We now perform the necessary preparations for the dataset and training, using the PyTorch Lightning module.

In [3]:
class MNISTMLPModel(pl.LightningModule):
    def __init__(self, lr=0.001, batch_size=64, num_workers=1, max_epochs=30, model=MultilayerPerceptron, optim="adam"):
        # Initiate LightningModule superclass
        super().__init__()
        self.train_step_outputs = []
        self.test_step_outputs = []

        # Set up other parameters
        self.save_hyperparameters()
        self.model = model()

        # Set up loss function (Mean Squared Error) and accuracy
        self.loss = nn.MSELoss()
        #self.loss = nn.CrossEntropyLoss()
        self.accuracy = Accuracy(task="multiclass", num_classes=10)
        self.optim = optim

    def forward(self,x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        # One-hot encode y first so that it matches the shape of the softmax output
        y = self.mnist_one_hot(y) # Comment out if using CrossEntropyLoss()
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y)
        self.train_step_outputs.append({"loss": loss})
        return loss
    
    def on_train_epoch_end(self):
        avg_loss = torch.stack([x["loss"] for x in self.train_step_outputs]).mean()
        print(f'Train loss: {avg_loss}')
        self.train_step_outputs.clear()
        self.log("train_loss", avg_loss, on_epoch=True)
    
    def test_step(self, batch, batch_idx):
        x, y = batch
        # One-hot encode y for the MSE loss. We keep the original integer labels y for the accuracy
        y_oh = self.mnist_one_hot(y) # Comment out if using CrossEntropyLoss()
        y_hat = self.forward(x)
        loss = self.loss(y_hat, y_oh) # Comment out if using CrossEntropyLoss()
        #loss = self.loss(y_hat, y) # Uncomment if using CrossEntropyLoss()
        # We get the predicted classes through argmax
        y_preds = torch.argmax(y_hat, dim=1)
        # torchmetrics expects the arguments in (preds, target) order
        acc = self.accuracy(y_preds, y) * 100. # Comment out if using CrossEntropyLoss()
        #acc = self.accuracy(y_hat, y) * 100. # Uncomment if using CrossEntropyLoss()
        self.test_step_outputs.append({"y_hat": y_hat, "test_loss": loss, "test_acc": acc})
        return y_hat, loss, acc
    
    def on_test_epoch_end(self):
        avg_loss = torch.stack([x["test_loss"] for x in self.test_step_outputs]).mean()
        avg_acc = torch.stack([x["test_acc"] for x in self.test_step_outputs]).mean()
        print(f'Test loss: {avg_loss}')
        print(f'Test accuracy: {avg_acc}')
        self.test_step_outputs.clear()
        self.log("test_loss", avg_loss, on_epoch=True, prog_bar=True)
        self.log("test_acc", avg_acc, on_epoch=True, prog_bar=True)

    def validation_step(self, batch, batch_idx):
        return self.test_step(batch, batch_idx)

    def on_validation_epoch_end(self):
        return self.on_test_epoch_end()
    
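    def configure_optimizers(self):
        if self.optim == "adam":
            optimizer = Adam(self.parameters(), lr=self.hparams.lr)
        elif self.optim == "sgd":
            optimizer = SGD(self.parameters(), lr=self.hparams.lr)
        else:
            # Fail early instead of hitting an undefined `optimizer` below
            raise ValueError(f"Unsupported optimizer: {self.optim!r} (expected 'adam' or 'sgd')")
        # Anneal the learning rate along a cosine curve over max_epochs
        scheduler = CosineAnnealingLR(optimizer, T_max=self.hparams.max_epochs)
        return [optimizer], [scheduler]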
    
    # Settings from https://nextjournal.com/gkoehler/pytorch-mnist
    def train_dataloader(self):
        # Convert images to tensors and normalize with the MNIST mean and standard deviation
        transform = torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize((0.1307,), (0.3081,))])
        dataset = torchvision.datasets.MNIST('/files/', train=True, download=True, transform=transform)
        return torch.utils.data.DataLoader(
            dataset, batch_size=self.hparams.batch_size, shuffle=True,
            num_workers=self.hparams.num_workers, pin_memory=True)

    def test_dataloader(self):
        transform = torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize((0.1307,), (0.3081,))])
        dataset = torchvision.datasets.MNIST('/files/', train=False, download=True, transform=transform)
        return torch.utils.data.DataLoader(
            dataset, batch_size=self.hparams.batch_size, shuffle=False,
            num_workers=self.hparams.num_workers, pin_memory=True)
    
    def val_dataloader(self):
        return self.test_dataloader()
    
    def setup(self, stage=None):
        self.train_dataloader()
        self.test_dataloader()

    # Perform MNIST-specific one-hot encoding
    def mnist_one_hot(self, x):
        # one_hot returns an integer tensor on the same device as x;
        # cast to float so that it can be compared with the softmax output in MSELoss
        return nn.functional.one_hot(x, num_classes=10).float()
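
To make the loss in `training_step` concrete, here is a small plain-Python sketch of one-hot encoding followed by mean squared error for a single sample. The helper names (`one_hot`, `mse`) are illustrative only, and the example assumes the default `reduction='mean'` of `nn.MSELoss`, which averages over every element. It also assumes that an untrained network outputs roughly uniform probabilities, which is an idealization.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def mse(pred, target):
    """Mean squared error averaged over all elements."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

target = one_hot(3)
uniform = [0.1] * 10    # idealized softmax output of an untrained network
perfect = one_hot(3)    # prediction that matches the label exactly

print(f"uniform guess: {mse(uniform, target):.4f}")
print(f"perfect guess: {mse(perfect, target):.4f}")
```

Under these assumptions the uniform guess gives a loss of about 0.09, which is close to the test loss printed during the sanity check before training starts.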

After setting up the model, the datasets, and the train/test/validation configurations, we set up the arguments.

In [4]:
def get_args():
    parser = ArgumentParser(description="PyTorch Lightning MNIST Example")
    parser.add_argument("--epochs", type=int, default=30, help="num epochs")
    parser.add_argument("--batch-size", type=int, default=64, help="batch size")
    parser.add_argument("--lr", type=float, default=0.001, help="learning rate")

    parser.add_argument("--num-classes", type=int, default=10, help="num classes")

    parser.add_argument("--optim", default="adam", help="optimizer")
    # Verify device count with torch.cuda.device_count()
    parser.add_argument("--devices", default=1)
    # Verify CUDA availability with torch.cuda.is_available()
    parser.add_argument("--accelerator", default='gpu')
    # Recommended: num_workers = (os.cpu_count() // 2) // torch.cuda.device_count()
    parser.add_argument("--num-workers", type=int, default=4, help="num workers")

    parser.add_argument("--model", default=MultilayerPerceptron)
    args = parser.parse_args("")
    return args
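
The default learning rate and epoch count also shape the learning-rate schedule, since `configure_optimizers` builds a `CosineAnnealingLR` scheduler from `lr` and `max_epochs`. The sketch below evaluates the closed-form cosine annealing curve in plain Python, assuming `eta_min=0` (the PyTorch default) and one scheduler step per epoch. `cosine_lr` is a hypothetical helper, not a PyTorch function.

```python
import math

def cosine_lr(epoch, base_lr=0.001, t_max=30, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 15, 20, 30):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.6f}")
```

The rate starts at the base value, falls to half of it at the midpoint, and approaches zero at the final epoch.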

Now, we train the MLP model on the MNIST dataset.

In [5]:
if __name__ == "__main__":
    args = get_args()
    model = MNISTMLPModel(lr=args.lr, batch_size=args.batch_size,
                           num_workers=args.num_workers, max_epochs=args.epochs,
                           model=args.model, optim=args.optim)
    model.setup()
    print(model)

    trainer = pl.Trainer(accelerator=args.accelerator,
                      devices=args.devices,
                      max_epochs=args.epochs)

    trainer.fit(model)
    trainer.test(model)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


MNISTMLPModel(
  (model): MultilayerPerceptron(
    (fc_in): Linear(in_features=784, out_features=256, bias=True)
    (fc_hid): Linear(in_features=256, out_features=256, bias=True)
    (fc_out): Linear(in_features=256, out_features=10, bias=True)
    (relu): ReLU()
    (softmax): Softmax(dim=1)
  )
  (loss): MSELoss()
  (accuracy): MulticlassAccuracy()
)


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type                 | Params
--------------------------------------------------
0 | model    | MultilayerPerceptron | 269 K 
1 | loss     | MSELoss              | 0     
2 | accuracy | MulticlassAccuracy   | 0     
--------------------------------------------------
269 K     Trainable params
0         Non-trainable params
269 K     Total params
1.077     Total estimated model params size (MB)


Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]



Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:01<00:00,  1.66it/s]Test loss: 0.09025193750858307
Test accuracy: 7.8125
                                                                           



Epoch 0: 100%|██████████| 938/938 [00:21<00:00, 44.30it/s, v_num=41]Test loss: 0.006266745738685131
Test accuracy: 95.85987091064453
Epoch 0: 100%|██████████| 938/938 [00:25<00:00, 37.04it/s, v_num=41, test_loss=0.00627, test_acc=95.90]Train loss: 0.011967782862484455
Epoch 1: 100%|██████████| 938/938 [00:24<00:00, 38.74it/s, v_num=41, test_loss=0.00627, test_acc=95.90]Test loss: 0.005569866858422756
Test accuracy: 96.26791381835938
Epoch 1: 100%|██████████| 938/938 [00:28<00:00, 33.00it/s, v_num=41, test_loss=0.00557, test_acc=96.30]Train loss: 0.005723630078136921
Epoch 2: 100%|██████████| 938/938 [00:23<00:00, 39.77it/s, v_num=41, test_loss=0.00557, test_acc=96.30]Test loss: 0.004178702365607023
Test accuracy: 97.31289672851562
Epoch 2: 100%|██████████| 938/938 [00:26<00:00, 35.15it/s, v_num=41, test_loss=0.00418, test_acc=97.30]Train loss: 0.004358722362667322
Epoch 3: 100%|██████████| 938/938 [00:22<00:00, 41.85it/s, v_num=41, test_loss=0.00418, test_acc=97.30]Test loss: 0.0047874

`Trainer.fit` stopped: `max_epochs=30` reached.


Train loss: 0.000287483970168978
Epoch 29: 100%|██████████| 938/938 [00:28<00:00, 33.46it/s, v_num=41, test_loss=0.0026, test_acc=98.50]


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Testing DataLoader 0: 100%|██████████| 157/157 [00:03<00:00, 42.92it/s]Test loss: 0.002599412575364113
Test accuracy: 98.4972152709961
Testing DataLoader 0: 100%|██████████| 157/157 [00:03<00:00, 42.80it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc             98.4972152709961
        test_loss          0.002599412575364113
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


#### Notes (Output text files are available in the same folder as this notebook)
* The Adam optimizer achieves significantly higher accuracy than Stochastic Gradient Descent (SGD) in these runs. A likely reason is that Adam tends to converge faster than SGD, as various studies report (source: [Adam vs SGD](https://medium.com/geekculture/a-2021-guide-to-improving-cnns-optimizers-adam-vs-sgd-495848ac6008)).
* Both Mean-Squared Error (MSE) and Cross Entropy (CE) loss functions perform well when used with the Adam optimizer.
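
For the cross-entropy variant, note that `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, which is why the code comments say to disable the explicit softmax when switching losses. The plain-Python sketch below illustrates the effect for a single sample with made-up logits. `softmax` and `cross_entropy` are illustrative helpers, not the PyTorch implementations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-class toy problem
target = 0

loss_from_logits = cross_entropy(logits, target)
# Passing probabilities instead of logits applies softmax twice,
# which flattens the distribution before the loss is computed.
loss_from_probs = cross_entropy(softmax(logits), target)

print(f"loss from logits:         {loss_from_logits:.4f}")
print(f"loss from softmax output: {loss_from_probs:.4f}")
```

In this toy case the doubly softmaxed input yields a larger loss even though the prediction is correct, so the loss no longer reflects the model's actual confidence.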