# Seedling Classification

Today, we'll be tackling the Kaggle **Plant Seedlings Classification** problem.

The training dataset is a corpus of 4750 images of newly sprouted plants. Each image is labeled as belonging to one of twelve plant species, three of which are crop plants (maize, common wheat, and sugar beet) and nine of which are weeds. The goal is to train against the dataset, and correctly classify a set of 794 unlabeled test images.

For this exercise, you will be responsible for:

* Grooming your data for use by your model
* Selecting or designing a model
* Training the model against the training dataset
* Testing the model against the test dataset
* Checking your results on the Kaggle site

Some code (e.g., simple code to wrap the images in a dataset) is provided below, so that you can focus on the problem rather than writing boilerplate.


Run the code below to download the Kaggle dataset directly to your Colab instance.

In [0]:
# This cell works great in Google Colab.
# If you run it in other environments, your mileage may vary.

!wget http://bradheintz.com/kaggle/plant-seedlings-classification.zip
!unzip plant-seedlings-classification.zip
!mkdir data
!unzip train.zip
!mv train data
!unzip test.zip
!mv test data
!mkdir models

Below are some of the imports we'll need. Note the import of `transforms` from `torchvision` - we'll need this to bring the images to a uniform size and convert them to tensors.

In [0]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import transforms

from imageio import imread
from PIL import Image
from io import BytesIO

import time, argparse, os
from os import listdir
from os.path import isfile, isdir, join

## The Dataset

The `torchvision` package provides the `ImageFolder` dataset object. This object takes a root folder, and expects to find images in subfolders, where each subfolder is the correct label for the image; it exposes this to your code via the `torch.utils.data.Dataset` interface. (Conveniently, you set up just such a root folder in `data/train` when you ran the first cell of this notebook.)
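
As a rough illustration of the folder-to-label convention, the sketch below mimics the mapping step in plain Python: subfolder names are sorted, then numbered in order. The helper `find_classes_sketch` and the three folder names are made up for this example; `ImageFolder` exposes its real mapping through the `classes` and `class_to_idx` attributes.

```python
import os
import tempfile

def find_classes_sketch(root):
    """Return (classes, class_to_idx) for the subfolders of root, sorted by name."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    class_to_idx = {name: index for index, name in enumerate(classes)}
    return classes, class_to_idx

# Build a throwaway directory tree with three hypothetical species folders.
with tempfile.TemporaryDirectory() as root:
    for species in ['Maize', 'Charlock', 'Sugar beet']:
        os.mkdir(os.path.join(root, species))
    classes, class_to_idx = find_classes_sketch(root)

print(classes)
print(class_to_idx)
```

This ordering is what lets a predicted class index be turned back into a species name by indexing into `dataset.classes`.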

The test dataset is constructed a little differently. Here, we *don't* have labels for the instances, but we *do* need the filename of each instance for reporting. (See the test loop later in this notebook for details.) For this reason, I wrote a lightweight `Dataset` subclass that provides the filename with each instance returned by `__getitem__`.

In [0]:
def _get_transforms():
    return transforms.Compose([
        transforms.Resize(100),
        transforms.RandomCrop(100),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor()
        ])

def get_training_loader(args, kwargs):
    dataset = torchvision.datasets.ImageFolder('data/train', transform=_get_transforms())
    return torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=2, shuffle=True, **kwargs)
  

class SeedlingTestDataset(torch.utils.data.Dataset):

    def __init__(self, path_to_test_data='data/test', transform=None):
        self.transform = transform
        self.data, self.datasize = self.build_dataset_from_path(path_to_test_data)
        self.filenames = sorted(self.data.keys())

    def build_dataset_from_path(self, test_data_path):
        data = {}
        for item in listdir(test_data_path):
            file_path = join(test_data_path, item)
            if isfile(file_path) and item.lower().endswith('.png'):
                data[item] = file_path
        return data, len(data)

    def __len__(self):
        return self.datasize

    def __getitem__(self, index):
        key = self.filenames[index]
        full_path = self.data[key]

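        # Convert to RGB so every tensor has 3 channels, matching the
        # default loader used by ImageFolder for the training set
        with open(full_path, 'rb') as f:
            img = Image.open(f).convert('RGB')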
        if self.transform is not None:
            img = self.transform(img)

        return img, key

def get_test_loader(args, kwargs):
    dataset = SeedlingTestDataset(path_to_test_data='data/test', transform=_get_transforms())
    return torch.utils.data.DataLoader(dataset, batch_size=1, num_workers=1, shuffle=False, **kwargs)

## The Model

Here's my hand-rolled model. (Later, I'll show an example using transfer learning from a pre-trained model.) It's based very loosely on LeNet 5, and it produces accuracy as high as 82% on the seedling problem. It has about 2.3M parameters, and trains in about 40 minutes on my laptop CPU, or 20 minutes in a Colab instance, using 20 training epochs over the whole training corpus.

In [0]:
class SeedlingModel(nn.Module):

    def __init__(self): # loosely modeled on LeNet-5
        super(SeedlingModel, self).__init__()

        self.conv1 = nn.Conv2d(3, 30, 7)
        self.conv2 = nn.Conv2d(30, 50, 3)
        self.fc1 = nn.Linear(50 * 15 * 15, 200)
        self.fc2 = nn.Linear(200, 12)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 3, 3)
        x = x.view(-1, 50 * 15 * 15)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# to check the number of parameters on your model, try:
# print(sum(p.numel() for p in model.parameters() if p.requires_grad))
# 2270602
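
The `50 * 15 * 15` input size of `fc1` and the parameter count above can both be checked by hand. The sketch below does the arithmetic in plain Python for a 100x100 input; the helper name `out_size` is just for illustration.

```python
def out_size(size, kernel, stride=1):
    """Side length after an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1

side = 100
side = out_size(side, 7)        # conv1, 7x7 kernel
side = out_size(side, 2, 2)     # max pool, 2x2 window, stride 2
side = out_size(side, 3)        # conv2, 3x3 kernel
side = out_size(side, 3, 3)     # max pool, 3x3 window, stride 3
flat = 50 * side * side         # conv2 produces 50 channels

# weights + biases for conv1, conv2, fc1, fc2
params = (3 * 30 * 7 * 7 + 30) + (30 * 50 * 3 * 3 + 50) + (flat * 200 + 200) + (200 * 12 + 12)
print(side, flat, params)
```

Nearly all of the parameters sit in `fc1`, which is worth keeping in mind if you change the input size or the pooling.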

Also, a utility function for saving the model:

In [0]:
def _save_model(model, model_dir):
    # logger.info("Saving the model.")
    print('Saving model')
    # savefile = "seedling-%d.pt"%(int(time.time()))
    savefile = 'seedling-model.pt'
    path = os.path.join(model_dir, savefile)
    # recommended way from http://pytorch.org/docs/master/notes/serialization.html
    torch.save(model.cpu().state_dict(), path)

## Training the Model

Some of the code in this notebook was originally written to work with the `argparse` Python module. Rather than gut the code, here I've made an object to take the place of the parsed arguments, and provide a central point for controlling test inputs and hyperparameters.

In [0]:
class Args:
    def __init__(self):
        self.epochs = 20
        self.use_cuda = True
        self.lr = 0.01
        self.momentum = 0.5
        self.model_dir = 'models'
        self.savefile = 'seedling-model.pt'
        
args = Args()

The training loop is pretty straightforward. At the end, it saves the model for later use.

Also, note the liberal use of user-readable output with timestamps and vital stats (e.g., accumulated loss). When you're not sure how long an operation like a training run will take, this kind of progress reporting can tell you early whether you want to kill a job or let it run.

In [0]:
def _train(model, args):
    print('Training the model...')
    # skipping distributed stuff at this time
    use_cuda = torch.cuda.is_available() and args.use_cuda
    device = torch.device('cuda' if use_cuda else 'cpu')
    kwargs = {'pin_memory': True} if use_cuda else {}
    print('Device: {}'.format(device.type))

    train_loader = get_training_loader(args, kwargs)

    model = model.to(device)
    loss_fn = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)

    print('Training begins: {}'.format(time.asctime()))
    
    for epoch in range(0, args.epochs):
        print('Epoch {} of {}   {}'.format(epoch + 1, args.epochs, time.asctime()))
        running_loss = 0.0
        for i, data in enumerate(train_loader):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()

            outputs = model(inputs)
            loss = loss_fn(outputs, labels) # cross-entropy on the raw model outputs
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if (i + 1) % 300 == 0:
                # average the loss over the 300 batches since the last report
                print('{}, {} loss: {:.6f}   {}'.format(epoch + 1, i + 1, running_loss / 300, time.asctime()))
                running_loss = 0.0

    print('Training complete: {}'.format(time.asctime()))

    _save_model(model, args.model_dir)
    return model

Now let's create an instance of the model and train it! If you haven't already, make sure to set your Colab instance to run on GPU.

In [0]:
model = SeedlingModel()
model = _train(model, args)

## Testing the Model

Below is the test loop for our model. Note that this is *not* the same as a traditional validation loop - we don't have labels for the test set. This loop takes the prediction for each image in the test set and writes it to a CSV file that we can upload to Kaggle to see how well we did. (More on that shortly.)

(Also: Yes, I could have used the `csv` module here, but this is not a tricky output file to write, and haven't we imported enough modules today?)

In [0]:
def test_model(model, device, kwargs):
    model = model.to(device)
    model.eval()
    classes = get_training_loader(args, kwargs).dataset.classes # TODO this is horrible
    loader = get_test_loader(args, kwargs)

    print('Gathering predictions...   {}'.format(time.asctime()))
    # the with block closes (and flushes) the file even if prediction fails
    with open('submission.csv', 'w') as outfile, torch.no_grad():
        outfile.write('file,species\n')
        for data, filenames in loader: # second item is the filename, not a label
            output = model(data.to(device))
            score, pred = torch.max(output, 1)
            outfile.write('{},{}\n'.format(filenames[0], classes[pred.item()])) # batches of 1
    print('Finished predictions, output written to submission.csv   {}'.format(time.asctime()))
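
For comparison, here is roughly what the same output would look like with the standard-library `csv` module mentioned above. This is a sketch only: `write_submission_sketch` is a hypothetical helper, and the filenames and predictions are made up.

```python
import csv
import io

def write_submission_sketch(rows, outfile):
    """Write (filename, species) pairs under the header Kaggle expects."""
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow(['file', 'species'])
    writer.writerows(rows)

# Hypothetical predictions, written to an in-memory buffer instead of a file
buffer = io.StringIO()
write_submission_sketch([('0001.png', 'Maize'), ('0002.png', 'Sugar beet')], buffer)
print(buffer.getvalue())
```

The main thing `csv` buys you is correct quoting if a field ever contains a comma, which isn't a concern for these filenames and species names.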

Uncomment and run the following code if you want to work from a saved model. If you've been running the cells of this notebook sequentially, you should be able to call the test loop function without this step.

In [0]:
# model = SeedlingModel()
# with open(os.path.join(args.model_dir, args.savefile), 'rb') as f:
#     model.load_state_dict(torch.load(f))

Now we'll test the model and get our output .csv of predictions:

In [0]:
test_model(model, 'cpu', {})

If the test ran successfully, you should now be able to take the file `submission.csv`, go to https://www.kaggle.com/c/plant-seedlings-classification/submit (you will have to create a free account), and submit your file for checking.

## Exercises, Part 1

**Training loop validation:** The training loop, in its current state, does no validation, so you have no way to tell during training whether the model is overfitting or picking up on biases in the training data. How would you go about splitting the training dataset for training and validation?
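
One possible starting point is `torch.utils.data.random_split`. The sketch below uses a fake `TensorDataset` of the same length as the training corpus so it runs without the image files; the 80/20 ratio and the seed are arbitrary choices, not recommendations from this notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for the ImageFolder dataset: 4750 tiny fake "images" with labels 0..11
fake_images = torch.zeros(4750, 3, 8, 8)
fake_labels = torch.arange(4750) % 12
dataset = TensorDataset(fake_images, fake_labels)

n_val = len(dataset) // 5                      # hold out 20% for validation
n_train = len(dataset) - n_val
generator = torch.Generator().manual_seed(0)   # fixed seed so the split is repeatable
train_set, val_set = random_split(dataset, [n_train, n_val], generator=generator)

print(len(train_set), len(val_set))
```

Both subsets share the parent dataset's transform, so with `ImageFolder` the random crop and flip would apply to the validation images too unless you handle that separately.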

**Data grooming:** The model above assumes a 100x100, 3-color image as input, and the transforms fed to the dataset (see the `_get_transforms()` function) enforce this. About 10% of the training instances are *less than* 100px wide/high, meaning they'll get upsampled. Is this good, bad, or indifferent? What happens if you discard those samples? Are there other ways to handle them?
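
To explore this, it helps to measure how many images fall below a given size. The sketch below walks a folder and counts PNG files whose shorter side is under a threshold; `count_small_images` is a hypothetical helper, and the demo runs on two generated images rather than the real dataset.

```python
import os
import tempfile
from PIL import Image

def count_small_images(root, min_side=100):
    """Count PNG files under root whose shorter side is below min_side."""
    small = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith('.png'):
                continue
            with Image.open(os.path.join(dirpath, name)) as img:
                width, height = img.size
            total += 1
            if min(width, height) < min_side:
                small += 1
    return small, total

# Demo on a throwaway folder with two made-up images
with tempfile.TemporaryDirectory() as root:
    Image.new('RGB', (80, 90)).save(os.path.join(root, 'small.png'))
    Image.new('RGB', (300, 300)).save(os.path.join(root, 'large.png'))
    small, total = count_small_images(root)

print(small, total)
```

Pointing the same function at `data/train` should give you the actual proportion for the seedling corpus.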

**Tweaking the model:** Can you identify changes to the model (either by inspection or empirically) that might enhance performance? Run experiments with some of those changes to test your intuition.

**Data grooming, part 2:** The pre-trained models (such as the one we use in the next part) that come with the `torchvision` package are trained against ImageNet, and expect input images that are *at least* 224px square. About 50% of the training instances are smaller than this. If you make appropriate adjustments to the model to accommodate a 224x224 image, and make appropriate changes to the transforms to make the incoming images 224px square, does this improve or degrade accuracy? What about speed? Given that half of the training set will have to be upsampled to do this, is there any benefit from discarding some or all of the smaller images?

**Data grooming, part 3:** As explained in the [docs](https://pytorch.org/docs/stable/torchvision/models.html) for `torchvision.models`, input images for the pretrained models should be at least 224px wide and high, and normalized with:

```
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
```

These numbers are the per-channel mean and standard deviation of the color values across the 1.2 million images of the ImageNet training corpus. The transform subtracts the mean from each channel and divides by the standard deviation, so that pixel values across the corpus end up centered on 0 with a standard deviation of about 1. (It does not change the shape of the distribution or clamp it: with these numbers, an input range of 0 to 1 maps to roughly -2.1 to 2.6.) This is beneficial because many activation functions have their steepest gradients near 0, and (with some obvious exceptions like ReLU) saturate as inputs grow in magnitude.

Try taking the mean and standard deviation values for the colorspace of the seedling training images, and normalizing the images using a transform like the one above (but with your values substituted). How does this affect accuracy? Speed?
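
As a minimal sketch of the computation, assuming the images have already been converted to tensors with values in 0 to 1, per-channel statistics can be taken by reducing over every dimension except the channel one. The random batch below is a stand-in for real seedling images.

```python
import torch

torch.manual_seed(0)
# Stand-in batch shaped like ToTensor() output: (N, channels, height, width), values in [0, 1]
images = torch.rand(16, 3, 100, 100)

mean = images.mean(dim=(0, 2, 3))   # one value per color channel
std = images.std(dim=(0, 2, 3))

# Same arithmetic as transforms.Normalize(mean, std), applied to the whole batch
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]

print(mean, std)
print(normalized.mean(dim=(0, 2, 3)), normalized.std(dim=(0, 2, 3)))
```

For the real corpus you would gather these statistics over all of the training images (after resizing, so they can be batched), then pass the resulting values to `transforms.Normalize`.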