# Exercise 1: Classifying penguin species with PyTorch

<img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png" width="750" />


Artwork by @allison_horst

In this exercise, we will use the Python package [``palmerpenguins``](https://github.com/mcnakhaee/palmerpenguins) to supply a toy dataset containing various features and measurements of penguins.

We have already created a PyTorch dataset which yields data for each of the penguins, but first we should examine the dataset and see what it contains.

### Task 1: look at the data
In the following code block, we import the ``load_penguins`` function from the ``palmerpenguins`` package.

- Call this function, which returns a single object, and assign it to the variable ``data``.
  - Print ``data`` and recognise that ``load_penguins`` has returned a ``pandas.DataFrame``.
- Consider which features it might make sense to use in order to classify the species of the penguins.
  - You can print the column titles using ``pd.DataFrame.keys()``
  - You can also obtain useful summary statistics using ``pd.DataFrame.describe()``
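
As a rough illustration of the inspection calls mentioned above, here is a minimal sketch on a hand-made ``DataFrame``. The values in ``toy`` are hypothetical placeholders (not real penguin measurements); only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy values, only to illustrate the API (not real penguin data).
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo", "Chinstrap"],
        "bill_length_mm": [39.0, 47.5, 49.0],
        "body_mass_g": [3700.0, 5000.0, 3800.0],
    }
)

column_titles = list(toy.keys())  # same as list(toy.columns)
summary = toy.describe()  # by default, only numeric columns are summarised
species_counts = toy["species"].value_counts()  # how many rows per class

print(column_titles)
print(summary)
print(species_counts)
```

The same calls should work unchanged on the ``data`` object returned by ``load_penguins``.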

In [1]:
from palmerpenguins import load_penguins

In [2]:
data = load_penguins()
print(data.keys())

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')


Let's now discuss the features we will use to classify the penguins' species, and populate the following list together:
- ...
- ...
- ...

### Task 2: creating a ``torch.utils.data.Dataset``

All PyTorch dataset objects are subclasses of the ``torch.utils.data.Dataset`` class. To make a custom dataset, create a class which inherits from the ``Dataset`` class, implement some methods (the Python magic (or dunder) methods ``__len__`` and ``__getitem__``) and supply some data.
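
To make the pattern concrete, here is a minimal sketch of a custom dataset. ``ToyDataset`` and its contents are hypothetical and are not the workshop's ``PenguinDataset``; the sketch only shows the two methods every map-style dataset needs.

```python
from typing import List, Tuple

from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical minimal dataset holding (features, label) pairs."""

    def __init__(self, features: List[Tuple[float, float]], labels: List[str]):
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels

    def __len__(self) -> int:
        # Number of items in the dataset; this is what ``len(dataset)`` returns.
        return len(self.features)

    def __getitem__(self, idx: int):
        # The item at position ``idx``; this is what ``dataset[idx]`` returns.
        return self.features[idx], self.labels[idx]


toy_set = ToyDataset([(39.0, 3700.0), (47.5, 5000.0)], ["Adelie", "Gentoo"])
print(len(toy_set))
print(toy_set[1])
```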

Spoiler alert: we've done this for you already in ``src/ml_workshop/_penguins.py``.

- Open the file ``src/ml_workshop/_penguins.py``.
- Let's examine, and discuss, each of the methods together.
  - ``__len__``
    - What does the ``__len__`` method do?
    - ...
  - ``__getitem__``
    - What does the ``__getitem__`` method do?
    - ...
- Review and discuss the class arguments.
  - ``input_keys``— ...
  - ``target_keys``— ...
  - ``train``— ...
  - ``x_tfms``— ...
  - ``y_tfms``— ...

### Task 3: Obtaining training and validation datasets

- Instantiate the penguin dataset.
  - Make sure you supply the correct column titles for the features and the targets.
- Iterate over the dataset
    - Hint:
        ```python
        for features, targets in dataset:
            # print the features and targets here
        ```

In [3]:
from ml_workshop import PenguinDataset

features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']

data_set = PenguinDataset(
    input_keys=["bill_length_mm", "body_mass_g"],
    target_keys=["species"],
    train=True,
)

- Can we give these items to a neural network, or do they need to be transformed first?
  - Short answer: no, we can't just pass tuples of numbers or strings to a neural network.
    - We must represent these data as ``torch.Tensor``s.
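
As a small sketch of what that conversion could look like by hand, assuming an item arrives as a tuple of numbers plus a tuple holding the species name (the values and the class order below are illustrative assumptions):

```python
from torch import eye, tensor

# Hypothetical raw item: numeric inputs and a string target.
raw_inputs = (39.0, 3700.0)
raw_targets = ("Adelie",)

species = ["Gentoo", "Chinstrap", "Adelie"]  # illustrative class order

# Numbers become a float tensor of shape (2,).
x = tensor(raw_inputs).float()

# The string becomes a one-hot vector: a row of the 3x3 identity matrix.
y = eye(len(species))[species.index(raw_targets[0])]

print(x, x.dtype)
print(y)
```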

### Task 4: Applying transforms to the data

A common way of transforming inputs to neural networks is to apply a series of transforms using ``torchvision.transforms.Compose``. The ``Compose`` object takes a list of callable objects and applies them to the incoming data.

These transforms can be very useful for mapping between file paths and tensors of images, etc.

In [4]:
from torchvision.transforms import Compose
from torch import tensor, eye

# Apply the transforms we need to the PenguinDataset to get our inputs
# and targets as Tensors.

features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']
species = ['Gentoo', 'Chinstrap', 'Adelie']

train_set = PenguinDataset(
    input_keys=features,
    target_keys=['species'],
    train=True,
    x_tfms=Compose([tensor, lambda x: x.float()]),
    y_tfms=Compose([lambda x: eye(3)[species.index(x[0])]])
)

valid_set = PenguinDataset(
    input_keys=features,
    target_keys=['species'],
    train=False,
    x_tfms=Compose([tensor, lambda x: x.float()]),
    y_tfms=Compose([lambda x: eye(3)[species.index(x[0])]])
)

### Task 5: Creating ``DataLoaders``—and why

- Once we have created a ``Dataset`` object, we then wrap it in a ``DataLoader``.
  - The ``DataLoader`` object allows us to put our inputs and targets in mini-batches, which makes for more efficient training.
    - Note: rather than supplying one input-target pair to the model at a time, we supply "mini-batches" of these data at once.
    - The number of items we supply at once is called the batch size.
  - The ``DataLoader`` also randomly shuffles the data each epoch (when training).
  - It allows us to load different mini-batches in parallel, which can be very useful for larger datasets and images that can't all fit in memory at once.
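
A minimal sketch of mini-batching, using a hypothetical stand-in dataset built with ``TensorDataset`` rather than the penguin data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 items, 2 features each, 3 one-hot classes.
inputs = torch.arange(20, dtype=torch.float32).reshape(10, 2)
targets = torch.eye(3)[torch.arange(10) % 3]
toy_set = TensorDataset(inputs, targets)

# With 10 items and a batch size of 4, drop_last=True discards the final,
# incomplete batch of 2 items.
loader = DataLoader(toy_set, batch_size=4, shuffle=True, drop_last=True)

batch_shapes = [(x.shape, y.shape) for x, y in loader]

print(len(loader))  # number of mini-batches per epoch
print(batch_shapes)
```

Each iteration yields a stacked tensor of inputs and a stacked tensor of targets, with the batch dimension first.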

In [25]:
from torch.utils.data import DataLoader

batchsize = 16

# Create training and validation DataLoaders.
train_loader = DataLoader(
    train_set,
    batch_size=batchsize,
    shuffle=True,
    drop_last=True,  # ensures batch size is always consistent
)
valid_loader = DataLoader(valid_set, batch_size=batchsize, shuffle=True)

### Task 6: Creating a neural network in PyTorch

Here we will create our neural network in PyTorch, and have a general discussion on clean and messy ways of going about it.

- First, we will create quite an ugly network to highlight how to make a neural network in PyTorch on a very basic level.
- We will then discuss a trick for making the print-out nicer.
- Finally, we will discuss how the best approach would be to write a class where various parameters (e.g. number of layers, dropout probabilities, etc.) are passed as arguments.

In [26]:
from torch.nn import Module
from torch.nn import BatchNorm1d, Linear, ReLU, Dropout, Sequential
from torch import Tensor


class FCNet(Sequential):
    """Fully-connected neural network."""
    def __init__(self):
        
        # easy way to construct layers
        super().__init__(BatchNorm1d(5), # normalization, we can pad each layer with this
                        Linear(5, 16),
                        ReLU(),
                        Linear(16, 8),
                        ReLU(),
                        Linear(8, 3))
        
model = FCNet()
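
The last point in the list above, passing the network's parameters as arguments, might look something like the sketch below. ``ConfigurableFCNet`` and its argument names are hypothetical; this is one possible layout rather than the workshop's reference solution.

```python
from typing import Sequence

import torch
from torch.nn import BatchNorm1d, Dropout, Linear, Module, ReLU, Sequential


class ConfigurableFCNet(Module):
    """Hypothetical fully-connected net with its sizes passed as arguments."""

    def __init__(
        self,
        in_feats: int,
        out_feats: int,
        hidden_sizes: Sequence[int] = (16, 8),
        dropout_prob: float = 0.0,
    ):
        super().__init__()
        layers = [BatchNorm1d(in_feats)]
        sizes = [in_feats, *hidden_sizes]
        # One Linear -> ReLU -> Dropout group per hidden layer.
        for n_in, n_out in zip(sizes[:-1], sizes[1:]):
            layers += [Linear(n_in, n_out), ReLU(), Dropout(dropout_prob)]
        # Final layer returns raw logits; no activation is applied here.
        layers.append(Linear(sizes[-1], out_feats))
        self.layers = Sequential(*layers)

    def forward(self, batch: torch.Tensor) -> torch.Tensor:
        return self.layers(batch)


net = ConfigurableFCNet(in_feats=5, out_feats=3, hidden_sizes=(16, 8))
logits = net(torch.rand(4, 5))  # a random mini-batch of 4 items

print(net)
print(logits.shape)
```

Storing the layers in a ``Sequential`` attribute keeps the printed summary readable while still allowing the depth and widths to change per instance.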

### Task 7: Selecting a loss function

- Binary cross-entropy is one of the most common loss functions for classification.
  - Details on this loss function are available in the [PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
- Let's instantiate it together.
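
As a sketch of what this loss computes, the example below evaluates ``BCELoss`` on hypothetical logits and one-hot targets, and compares it with the formula written out by hand. Note that ``BCELoss`` expects probabilities in the range 0 to 1, so the logits are passed through softmax first.

```python
import torch
from torch.nn import BCELoss

# Hypothetical logits for a mini-batch of two items and three classes.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
probs = logits.softmax(dim=1)  # BCELoss expects values in [0, 1]
targets = torch.tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one-hot

loss = BCELoss()(probs, targets)

# The same quantity written out by hand: the mean over every element of
# -[t * log(p) + (1 - t) * log(1 - p)].
manual = -(targets * probs.log() + (1 - targets) * (1 - probs).log()).mean()

print(loss.item(), manual.item())
```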

In [7]:
from torch.nn import BCELoss

loss_func = BCELoss()

### Task 8: Selecting an optimiser

While we talked about stochastic gradient descent in the slides, most people use the so-called [Adam optimiser](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).

You can think of it as a more elaborate variant of SGD that adapts the step size for each parameter.
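
To give a feel for the difference, the sketch below takes a single optimiser step with plain SGD and with Adam on a hypothetical two-parameter problem. It illustrates how the step sizes differ; it is not evidence that one optimiser is better than the other.

```python
import torch
from torch.optim import SGD, Adam


def one_step(optimiser_cls) -> torch.Tensor:
    """Take one optimiser step on loss = sum(w ** 2) and return the new w."""
    w = torch.tensor([1.0, -2.0], requires_grad=True)
    optimiser = optimiser_cls([w], lr=0.1)
    optimiser.zero_grad()
    (w ** 2).sum().backward()  # gradient is 2 * w = [2, -4]
    optimiser.step()
    return w.detach()


w_sgd = one_step(SGD)
w_adam = one_step(Adam)

print("SGD: ", w_sgd)  # step is proportional to the gradient
print("Adam:", w_adam)  # first step is close to lr in size for each parameter
```

With SGD the parameter with the larger gradient moves further; Adam's first step rescales each gradient by its own magnitude, so both parameters move by roughly the learning rate.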

In [8]:
# Create an optimiser and give it the model's parameters.
from torch.optim import Adam

optim = Adam(model.parameters(), lr=1e-4)

### Task 9: Writing basic training and validation loops

- Before we jump in and write these loops, we must first choose an activation function to apply to the model's outputs.
  - Here we are going to use the softmax activation function: see [the PyTorch docs](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html).
  - For those of you who've studied physics, you may be reminded of the partition function in thermodynamics.
  - This activation function is good for classification when the result is exactly one of ``A or B or C``.
    - It's a poor fit if you ever want to assign two classes to one image: say, a photo of a dog _and_ a cat.
  - It turns the raw outputs, or logits, into "pseudo-probabilities", and we take our prediction to be the most probable class.

- We will write the training loop together, then you can go ahead and write the (simpler) validation loop.
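
To make the softmax step concrete, here is a short sketch on hypothetical logits. It checks the built-in ``softmax`` against the formula written out by hand, then takes the most probable class as the prediction.

```python
import torch

# Hypothetical logits: two items, three classes.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])

# softmax(z)_i = exp(z_i) / sum_j exp(z_j), applied along the class dimension.
manual = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
probs = logits.softmax(dim=1)

# The prediction is the index of the most probable class for each item.
predictions = probs.argmax(dim=1)

print(probs)
print(probs.sum(dim=1))  # each row sums to one
print(predictions)
```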

In [12]:
from typing import Dict
from numpy import mean

from torch import no_grad

@no_grad()
def get_accuracy(pred: Tensor, target: Tensor):
    
    decision = pred.argmax(dim=1)
    
    return (decision == target.argmax(dim=1)).float().mean()


def train_one_epoch(
    model: Module,
    train_loader: DataLoader,
    optimiser: Adam,
    loss_func: BCELoss,
) -> Dict[str, float]:
    """Train ``model`` for one epoch.

    Parameters
    ----------
    model : Module
        The neural network.
    train_loader : DataLoader
        Training dataloader.
    optimiser : Adam
        The optimiser.
    loss_func : BCELoss
        Binary cross-entropy loss function.

    Returns
    -------
    Dict[str, float]
        A dictionary of metrics.

    """
    metrics = {'loss':[], 'accuracy':[]}
    model.train()
    
    for batch, targets in train_loader:
        optimiser.zero_grad() # reset the gradients left over from the previous batch
        
        # activation function for output
        # (N, num_classes)
        preds = model(batch).softmax(dim=1) # sum over dim 1 = 1
        
        loss = loss_func(preds, targets)
        
        loss.backward()
        
        optimiser.step()
        
        metrics['loss'].append(loss.item()) # we could keep a running sum instead of a list for long epochs
        metrics['accuracy'].append(get_accuracy(preds, targets).item()) # store a float, not a Tensor
        
    return {met : mean(values) for met, values in metrics.items()}

@no_grad()
def validate_one_epoch(
    model: Module,
    valid_loader: DataLoader,
    loss_func: BCELoss,
) -> Dict[str, float]:
    """Validate ``model`` for a single epoch.

    Parameters
    ----------
    model : Module
        The neural network.
    valid_loader : DataLoader
        Validation dataloader.
    loss_func : BCELoss
        Binary cross-entropy loss function.

    Returns
    -------
    Dict[str, float]
        Metrics of interest.

    """
    metrics = {'loss':[], 'accuracy':[]}
    model.eval()
    
    for batch, targets in valid_loader:
        preds = model(batch).softmax(dim=1) # sum over dim 1 = 1
        
        loss = loss_func(preds, targets)
        
        metrics['loss'].append(loss.item()) # we could keep a running sum instead of a list for long epochs
        metrics['accuracy'].append(get_accuracy(preds, targets).item()) # store a float, not a Tensor
        
    return {met : mean(values) for met, values in metrics.items()}

### Task 10: Training, extracting and plotting metrics

- Now we can train our model for a specified number of epochs.
  - During each epoch the model "sees" each training item once.
- Append the training and validation metrics to a list.
- Turn them into a ``pandas.DataFrame``
  - Note: you can turn a ``List[Dict[str, float]]``, say ``my_list``, into a ``DataFrame`` with ``DataFrame(my_list)``.
- Use Matplotlib to plot the training and validation metrics as a function of the number of epochs.

We will begin the code block together before you complete it independently.  
After some time we will go through the solution together.
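
As a sketch of the list-to-``DataFrame`` step described above, the example below uses hypothetical metric values (they are not results from this exercise) and joins the training and validation frames side by side.

```python
import pandas as pd

# Hypothetical per-epoch metrics, in the List[Dict[str, float]] form.
train_list = [{"loss": 0.7, "accuracy": 0.4}, {"loss": 0.5, "accuracy": 0.6}]
valid_list = [{"loss": 0.8, "accuracy": 0.3}, {"loss": 0.6, "accuracy": 0.5}]

train_df = pd.DataFrame(train_list)  # one row per epoch, one column per metric
valid_df = pd.DataFrame(valid_list)

# Both frames share column names, so suffixes keep them apart when joined.
combined = train_df.join(valid_df, lsuffix="_train", rsuffix="_valid")

print(combined)
```

The row index of each frame is the epoch number, which makes it a natural x-axis when plotting.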

In [33]:
import pandas as pd
epochs = 5

train_metrics, valid_metrics = [], []

for _ in range(epochs):
    train_metrics.append(train_one_epoch(model, train_loader, optim, loss_func))
    valid_metrics.append(validate_one_epoch(model, valid_loader, loss_func))
    
train_metrics = pd.DataFrame(train_metrics)
valid_metrics = pd.DataFrame(valid_metrics)

print(train_metrics.join(valid_metrics, lsuffix="_train", rsuffix="_valid"))

   loss_train  accuracy_train  loss_valid  accuracy_valid
0    0.622891        0.338235    0.639671        0.348958
1    0.622996        0.341912    0.641569        0.333333
2    0.622988        0.349265    0.643010        0.328125
3    0.622543        0.349265    0.641861        0.333333
4    0.622014        0.345588    0.641227        0.333333


### Task 11: Visualise some results

Let's do this part together—though feel free to make a start on your own if you have completed the previous exercises.

In [None]:

#axes[0].plot(train_metrics.loss, label="Train")
#axes[0].plot(valid_metrics.loss, label="Valid")

#for axis in axes.ravel():