##** Train a Deep Learning Model for digit recognition (on MNIST)**

---


Let us start with the imports. We need `torch` and  `torchvision`. The first contains all the tools we need for training a network, while the second contains practical shortcuts for datasets and other stuffs (e.g. pretrained models).



In [0]:
import torch, torchvision

##Dataset and dataloaders
Now that we have the tools, let us define a function which allows us to load the MNIST data. For doing so, we need a dataset (`torch.utils.data.Dataset`) and a loader (`torch.utils.data.Dataloader`), allowing us to loop over the dataset. For MNIST PyTorch already contains a dataset definition, which you can find [here](https://pytorch.org/docs/stable/torchvision/datasets.html#mnist). For what concerns the dataloader, default ones can be found [here](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader). We must create train, validation and test set and a loader for each of them.

In [0]:
def get_data(batch_size, test_batch_size=256): 
  # This function is needed to convert the PIL images to Tensors
  transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor()])

  # Load data
  full_training_data = torchvision.datasets.MNIST(root="./data", download=True, train=False, transform=transform)

  # Create train, test and validation splits
  num_samples = len(full_training_data)
  training_samples = int(num_samples*0.7+1)
  test_validation_samples = num_samples - training_samples 
  validation_samples = int(test_validation_samples*0.5+1)
  test_samples = test_validation_samples - validation_samples

  training_data, test_data, validation_data = torch.utils.data.random_split(full_training_data, [training_samples, test_samples, validation_samples])

  # Initialize dataloaders
  train_loader = torch.utils.data.DataLoader(training_data, batch_size=batch_size)
  val_loader = torch.utils.data.DataLoader(validation_data, batch_size=batch_size)
  test_loader = torch.utils.data.DataLoader(test_data, batch_size=test_batch_size)
  
  return train_loader, val_loader, test_loader


##Network definition
Now that we have the data, what we need is a network. For now let us instantiate an MLP with 2 fully-connected layers (input-to-hidden and hidden-to-output).  The fully-connected layers are defined as `torch.nn.Linear`.  Between the layers we must put a non-linear activation. For now let us use a sigmoid (`torch.nn.Sigmoid`). For other layers and activation functions please have a look at the [doc](https://pytorch.org/docs/stable/nn.html). Do not forget that a network must extend a `torch.nn.Module`.

In [0]:
# Our network
class MyFirstNetwork(torch.nn.Module):
  def __init__(self, input_dim, hidden_dim, output_dim):
    super(MyFirstNetwork, self).__init__()
    self.layer1 = torch.nn.Linear(input_dim, hidden_dim)
    self.layer2 = torch.nn.Linear(hidden_dim, hidden_dim)
    self.layer3 = torch.nn.Linear(hidden_dim, hidden_dim)
    self.activation = torch.nn.Sigmoid()
    self.dropout = torch.nn.Dropout(0.1)

  def forward(self, x):
    x = x.view(x.shape[0],-1)
    x = self.layer1(x)
    x = self.activation(x)
    x = self.layer2(x)
    
    return x




## Loss/cost function
For training the network, we obviously need a loss function. The task is classification with multiple classes, thus a proper loss could be a cross-entropy with softmax. We can again use `torch.nn` which contains several losses, among which `torch.nn.CrossEntropyLoss`. Notice that this loss already contains the softmax activation, thus we do not need to apply the softmax to the output of our network.

In [0]:
def get_cost_function():
  cost_function = torch.nn.CrossEntropyLoss()
  return cost_function

## Optimizer
Now we must devise a way to update the parameters of our network. This can be easily held out by having a look at [`torch.optim`](https://pytorch.org/docs/stable/optim.html) which contains a large variety of optimizers.

In [0]:
def get_optimizer(net, lr, wd, momentum):
  optimizer = torch.optim.RMSprop(net.parameters(), lr=lr, weight_decay=wd, momentum=momentum)
  return optimizer

## Train and test functions
We are ready to merge everything by creating a training and test functions. Both of them must:

1.   Loop over the data (exploiting the dataloader, which is just an iterator)
2.   Forward the data through the network
3.  Comparing the output with the target labels for computing either the loss (train), the accuracy (test) or both.

Additionally, during training we must:


1.   Compute the gradient with the backward pass (`loss.backward()`)
2.   Using the optimizer to update the weights (`optimizer.step()`)
3.   Cleaning the gradient of the weights in order to not accumulating it (`optimizer.zero_grad()`)

With these steps in mind, we are ready to define everything.





In [0]:
def train(net,data_loader,optimizer,cost_function, device='cuda'):
  samples = 0.
  cumulative_loss = 0.
  cumulative_accuracy = 0.
  
  # Set the network in train mode
  
  # Loop over the dataset
  for batch_idx, (inputs, targets) in enumerate(data_loader):
    # Load data into GPU
    inputs = inputs.to(device)
    targets = targets.to(device)
    
    # Forward pass
    outputs = net(inputs)

    # Apply the loss
    loss = cost_function(outputs, targets)
      
    # Backward pass
    loss.backward()
    
    # Update parameters
    optimizer.step()
    
    # Reset the optimizer
    optimizer.zero_grad()

    # Better print something, no?
    samples+=inputs.shape[0]
    cumulative_loss += loss.item()
    _, predicted = outputs.max(1)
    cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, cumulative_accuracy/samples*100


def test(net, data_loader, cost_function, device='cuda'):
  samples = 0.
  cumulative_loss = 0.
  cumulative_accuracy = 0.

  #Set the network in eval mode
  
  with torch.no_grad(): # torch.no_grad() disables the autograd machinery, thus not saving the intermediate activations
    # Loop over the dataset
    for batch_idx, (inputs, targets) in enumerate(data_loader):
      # Load data into GPU
      inputs = inputs.to(device)
      targets = targets.to(device)
      
      # Forward pass
      outputs = net(inputs)

      # Apply the loss
      loss = cost_function(outputs, targets)

      # Better print something
      samples+=inputs.shape[0]
      cumulative_loss += loss.item() # Note: the .item() is needed to extract scalars from tensors
      _, predicted = outputs.max(1)
      cumulative_accuracy += predicted.eq(targets).sum().item()

  return cumulative_loss/samples, cumulative_accuracy/samples*100

## Wrapping everything up
Finally, we need a main function which initializes everything + the needed hyperparameters and loops over multiple epochs (printing the results).

In [0]:
def main(batch_size=128, input_dim=28*28, hidden_dim=100, output_dim=10, device='cuda:0', learning_rate=0.01, weight_decay=0.000001, momentum=0.9, epochs=10):
  
  train_loader, val_loader, test_loader = get_data(batch_size)
  net = MyFirstNetwork(input_dim, hidden_dim, output_dim).to(device)
  optimizer = get_optimizer(net, learning_rate, weight_decay, momentum)
  cost_function = get_cost_function()

  print('Before training:')
  train_loss, train_accuracy = test(net, train_loader, cost_function)
  val_loss, val_accuracy = test(net, val_loader, cost_function)
  test_loss, test_accuracy = test(net, test_loader, cost_function)

  print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))
  print('\t Validation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
  print('\t Test loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_accuracy))
  print('-----------------------------------------------------')

  for e in range(epochs):
    train_loss, train_accuracy = train(net, train_loader, optimizer, cost_function)
    val_loss, val_accuracy = test(net, val_loader, cost_function)
    print('Epoch: {:d}'.format(e+1))
    print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))
    print('\t Validation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
    print('-----------------------------------------------------')

  print('After training:')
  train_loss, train_accuracy = test(net, train_loader, cost_function)
  val_loss, val_accuracy = test(net, val_loader, cost_function)
  test_loss, test_accuracy = test(net, test_loader, cost_function)

  print('\t Training loss {:.5f}, Training accuracy {:.2f}'.format(train_loss, train_accuracy))
  print('\t Validation loss {:.5f}, Validation accuracy {:.2f}'.format(val_loss, val_accuracy))
  print('\t Test loss {:.5f}, Test accuracy {:.2f}'.format(test_loss, test_accuracy))
  print('-----------------------------------------------------')


And now let the magic happen :)

In [0]:
main(epochs=100)

Before training:
	 Training loss 0.03764, Training accuracy 0.00
	 Validation loss 0.03843, Validation accuracy 0.00
	 Test loss 0.01922, Test accuracy 0.00
-----------------------------------------------------
Epoch: 1
	 Training loss 0.00765, Training accuracy 70.63
	 Validation loss 0.00567, Validation accuracy 79.40
-----------------------------------------------------
Epoch: 2
	 Training loss 0.00378, Training accuracy 85.17
	 Validation loss 0.00485, Validation accuracy 82.67
-----------------------------------------------------
Epoch: 3
	 Training loss 0.00308, Training accuracy 87.94
	 Validation loss 0.00470, Validation accuracy 84.13
-----------------------------------------------------
Epoch: 4
	 Training loss 0.00286, Training accuracy 88.67
	 Validation loss 0.00492, Validation accuracy 84.07
-----------------------------------------------------
Epoch: 5
	 Training loss 0.00268, Training accuracy 89.59
	 Validation loss 0.00462, Validation accuracy 85.47
------------------