# Welcome to the PyTorch Seedlings Exercise

This exercise will cover some key concepts in computer vision, including:

* The preparation of image datasets
* The construction of neural net models for image tasks
* The use of *transfer learning,* a method for re-using previously trained models for new tasks

## Notes on Using This Notebook

* Code will be provided for boilerplate tasks; in other places, you will need to fill in code to complete the exercise. Cells you need to fill in will be flagged with the **Exercise** heading.
* The code cells are, in general, meant to be run in order. If you think a code cell should be working, but it isn't, verify that all previous cells were run - the cell you're having trouble with may depend on a variable or file that is created in a previous cell.
* Class names and other text normally meant for consumption by a computer will be rendered in a `monospace font`. This will hopefully reduce confusion between, e.g., the word "dataset" referring to the concept of a cohesive body of data, and the class name `Dataset` referring to the related PyTorch class.

### Do This Now:

The cell below downloads and unzips the dataset we'll be using for this exercise. The dataset is 1.8GB, so **please uncomment and execute the following code cell now** to get the process started. (The lines are commented out to prevent the download from triggering accidentally, so you may wish to comment them out again afterward.)

In [None]:
# !curl -0 https://s3-us-west-1.amazonaws.com/pytorch-course-datasets/plant-seedlings-classification.zip > seedlings.zip
# !unzip seedlings.zip
# !unzip train.zip
# !unzip test.zip

## Introduction

This exercise is based on the Kaggle competition, [Plant Seedlings Classification](https://www.kaggle.com/c/plant-seedlings-classification/overview). The goal is to create a neural net that can accurately classify newly sprouted plants as belonging to a particular species. Twelve species are represented in the training data, six crop plants and six undesirable weed plants.

### The Training Dataset

The training dataset is a set of almost 5000 image files, each depicting a seedling, sorted into folders labeled with the correct species name of each plant of interest:

```
train
  \--Black-grass
  |    \--0050f38b3.png
  |    \--0183fdf68.png
  |    ...
  \--Charlock
       \--022179d65.png
       \--02c95e601.png
       ...
```

We will train and validate our model with this data.

### Multiple Iterations

We'll show two approaches - one simpler, one more advanced. The simpler one will employ a simple model that we will train from scratch. The second approach will involve *transfer learning,* and will involve doing some domain-specific learning on an existing, pre-trained model.

Don't forget that even if you want to jump ahead to the advanced exercise, it may depend on code executed in earlier cells.

### The Final Step

The *test* dataset is a separate, unlabeled set of images. The final step in today's exercise will be to use your model to classify the unlabeled images. You will export your predictions as a CSV file and upload them to the Kaggle site to receive a final accuracy score.

## The First Iteration: Building from Scratch

Let's get started! The code cell below contains imports we'll need; please execute it.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler

import torchvision
from torchvision import transforms

import os
import time
import random

torch.manual_seed(23)
random.seed(23)

### Setting Up Your Training Dataset

In order for our images to be consumed by a model, it helps if they are regularized in some way. The function below resizes and crops the images to squares of a specified size, randomly flips them horizontally, and converts them to tensors.

In [None]:
def get_transforms(target_size=100, normalize=False):
    t = transforms.Compose([
        transforms.Resize(target_size),
        transforms.CenterCrop(target_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor()
        ])
    if normalize: # for imagenet-trained models specifically
        t = transforms.Compose([
            t,
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    return t
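To make the geometry of the `Resize` and `CenterCrop` pair concrete, here is a minimal sketch in plain Python. It assumes that `Resize` with a single integer scales the shorter edge to that size while keeping the aspect ratio, and that `CenterCrop` then keeps a square from the middle. The helper name `resize_then_center_crop` is hypothetical, and the exact rounding inside torchvision may differ between versions.

```python
def resize_then_center_crop(width, height, target_size=100):
    """Return (resized_size, crop_box) for Resize(target_size) + CenterCrop(target_size)."""
    # Resize with a single int scales the shorter edge to target_size
    # and keeps the aspect ratio.
    if width <= height:
        new_w, new_h = target_size, int(target_size * height / width)
    else:
        new_w, new_h = int(target_size * width / height), target_size

    # CenterCrop keeps a target_size square from the middle of the resized image.
    left = int(round((new_w - target_size) / 2.0))
    top = int(round((new_h - target_size) / 2.0))
    crop_box = (left, top, left + target_size, top + target_size)
    return (new_w, new_h), crop_box


# A 300x200 image is scaled to 150x100, then the middle 100x100 is kept.
resized, box = resize_then_center_crop(300, 200)
print(resized, box)
```

Note that for non-square images, the crop discards whatever lies outside the central square, so a seedling near the edge of a wide photo may be partly cut off.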

As mentioned in the introduction above, the training data is a set of images, divided into folders, with each folder named for the image's class. There are twelve classes.

This is a common enough arrangement that PyTorch (through the torchvision library) has an `ImageFolder` class that will build a PyTorch `Dataset` object for you from this structure.

In [None]:
full_dataset = torchvision.datasets.ImageFolder('train', transform=get_transforms())
print('This dataset has:')
print('  {} elements'.format(len(full_dataset)))
print('  {} classes'.format(len(full_dataset.classes)))
print(full_dataset)

It's a best practice to set aside part of your labeled data for validation. This guards against *overfitting.* The main symptom of overfitting is that a model seems to perform well in training, but does poorly when presented with new data. This happens when the model learns the dataset a little too well, and doesn't develop general rules for dealing with similar inputs. (Qualitatively, this can be compared with a child who has learned multiplication tables by rote up to 10x10, but hasn't learned a rule to multiply 13 x 16.) It often means that the model is overspecified with respect to the data - that is, that the parameter space of the model is large enough to form a map of the individual inputs to specific outputs.

On the other hand, if your model performs just as accurately on the validation dataset as on the training dataset, that's a positive sign that it's learning as intended.

Here, we use `torch.utils.data.random_split()` to extract training and validation sets with an 80/20 split:

In [None]:
train_len = int(0.8 * len(full_dataset))
validate_len = len(full_dataset) - train_len
train_dataset, validate_dataset = torch.utils.data.random_split(full_dataset, (train_len, validate_len))
print('Training dataset contains {} elements'.format(len(train_dataset)))
print('Validation dataset contains {} elements'.format(len(validate_dataset)))

It is usually convenient to package a `Dataset` in a `DataLoader`. When you're writing your own `Dataset` object, all you have to do is report the number of elements in the set, and return elements (with their labels, if needed) by index. The `DataLoader` handles everything else: batching, shuffling, parallel loading with worker processes, sampling, and more. The `DataLoader` is the most common interface for offering data to a training loop.

In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True)
validate_loader = torch.utils.data.DataLoader(validate_dataset)
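As an illustration of the `Dataset` contract described above (report a length, return items by index), here is a minimal sketch of a custom dataset wrapped in a `DataLoader`. The class `ToyImageDataset` and its random contents are hypothetical stand-ins, not part of the seedlings data.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyImageDataset(Dataset):
    """Hypothetical dataset of random 3x100x100 'images' with integer labels."""

    def __init__(self, n_items=10, n_classes=12):
        self.images = torch.rand(n_items, 3, 100, 100)
        self.labels = torch.arange(n_items) % n_classes

    def __len__(self):
        # the DataLoader uses this to know how many items exist
        return len(self.images)

    def __getitem__(self, index):
        # return one (instance, label) pair by index
        return self.images[index], self.labels[index]


toy_loader = DataLoader(ToyImageDataset(), batch_size=4, shuffle=True)
for images, labels in toy_loader:
    # the loader stacks individual items into batches for us
    print(images.shape, labels.shape)
```

With 10 items and a batch size of 4, the last batch is smaller than the others, which is worth remembering when you average per-batch losses.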

### A Simple Model That Might Work

An earlier tutorial in this series, which made a classifier for CIFAR-10 images, used a variant of the LeNet-5 architecture, adapted for 3-channel color and larger images:

```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.fc1.in_features)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```

The `__init__()` method defines two convolutional layers and three linear layers. Here's a quick review of what the parameters mean:

* `conv1` is meant to take input with `3` channels (corresponding to the three color channels), and produce an output of `6` feature activation maps, with a detection window of `5` pixels square (its kernel size). You can think of this layer as scanning the input image and looking for features it recognizes.
* `conv2` takes input with `6` channels (corresponding to the 6 features detected by `conv1`), produces output for `16` features, and also employs a `5`-pixel window. You can think of this layer as composing the features detected by `conv1` into larger features.
* `fc1` and `fc2` perform further processing on the output of the convnet layers.
* `fc3` gives our final output, a vector of `10` elements. These are floating point numbers that relate to the model's confidence that the input belongs to a particular class.

The `forward()` method composes these layers and some important functions into a computation graph that takes in a 3x32x32 tensor representing a 3-color image, and returns a vector of 10 class scores. Here's how the data flows through the graph:

| Stage | Tensor Shape | Notes |
| --- | --- | --- |
| input | 3 x 32 x 32 | 32x32 image with 3 color channels |
| conv1 | 6 x 28 x 28 | 6 features; spatial map reduced from 32 to 28 due to kernel size |
| pooling | 6 x 14 x 14 | every 2 x 2 group of the map elements is reduced to a single element, which takes on the max value of its parent elements |
| conv2 | 16 x 10 x 10 | 16 features; spatial map reduced from 14 to 10 due to kernel size |
| pooling | 16 x 5 x 5 | as above, reducing resolution of the spatial map |
| reshape | 1 x 400 | same data as the 3D tensor in the previous step, but flattened to a vector (400 = 16 x 5 x 5) |
| fc1 | 1 x 120 | |
| fc2 | 1 x 84 | |
| output | 1 x 10 | 10 classes of data |
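The spatial sizes in the table can be reproduced with a little arithmetic. The sketch below assumes the standard output-size formula for convolution and pooling layers, `(size + 2 * padding - kernel) // stride + 1`; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 32
size = conv_out(size, kernel=5)            # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)  # max_pool2d(..., 2): 28 -> 14
size = conv_out(size, kernel=5)            # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)  # max_pool2d(..., 2): 10 -> 5
flattened = 16 * size * size               # 16 feature maps of 5 x 5
print(size, flattened)                     # 5 400
```

The same helper can be applied to other input sizes, kernel sizes, and strides to work out how wide the first linear layer needs to be.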

### Exercise

Below is a skeleton version of the image classifier above, with most of the parameters removed. (The 3-color input stays the same, and the `12` for the number of output classes has also been filled in.) **How would you fill in the values to make this work for our 100x100 seedling images?** Don't forget that some values are related, such as the output features of `conv1` and the input features of `conv2`. Some values in deeper layers are directly related to your input size as well, such as the input width of `fc1`.

Things to think about and experiment with:

**For the convolutional layers:** Does this model work using the same number of features (6 and 16) as before? Is there any advantage to altering the kernel size?

Convolutional layers can also specify a *stride length:* A stride length of 1 means the kernel scans every possible position, a stride of 2 means it scans every other position, 3 means it scans every 3rd, and so on. If you enlarge the kernel, is there an advantage in setting a stride length?

**For the linear layers:** Do the original input widths of the linear layers still work? (Hint: How does `fc1` respond to the new 3x100x100 input size?) Can the intermediate values be left as-is, or is there benefit to changing them?
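To see what the stride length described above does to the output, here is a small sketch that passes one 3x100x100 tensor through two convolutional layers that differ only in stride. The layer sizes (6 output features, a 9-pixel kernel) are arbitrary values chosen for illustration, not a recommended answer to the exercise.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 3, 100, 100)  # a batch containing one 3x100x100 image

# Hypothetical layer sizes chosen only for illustration.
conv_stride1 = nn.Conv2d(3, 6, kernel_size=9, stride=1)
conv_stride2 = nn.Conv2d(3, 6, kernel_size=9, stride=2)

shape_stride1 = conv_stride1(x).shape  # kernel visits every position
shape_stride2 = conv_stride2(x).shape  # kernel visits every other position
print(shape_stride1)
print(shape_stride2)
```

Doubling the stride roughly halves each spatial dimension of the output, which in turn shrinks the input width needed by the first linear layer.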

In [None]:
class SeedlingModelV1(nn.Module):
    def __init__(self):
        super(SeedlingModelV1, self).__init__()
        self.conv1 = nn.Conv2d(3, ?, ?)
        self.conv2 = nn.Conv2d(?, ?, ?)
        self.fc1 = nn.Linear(?, ?)
        self.fc2 = nn.Linear(?, ?)
        self.fc3 = nn.Linear(?, 12)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), ?)
        x = F.max_pool2d(F.relu(self.conv2(x)), ?)
        x = x.view(-1, ?)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

It might be valuable to check your updated model architecture with the code in the next cell. If any of your layers are mismatched, you should get an error. This code:

* instantiates the model
* extracts an instance from the dataset
* feeds the instance to the model for processing

*(NB: The `torch.unsqueeze()` call is there because `forward()` actually expects a batch of tensors. Here, we have added a dimension at the beginning of our lone tensor to create a batch of 1.)*

In [None]:
model = SeedlingModelV1()
image, label = train_dataset[0]
output = model(torch.unsqueeze(image, 0))
print(output)

### Training the Model

First, we'll define a few constants, including the learning hyperparameters. It can be convenient to have these parameters defined in one place, or specified on the command line, to make it easy to tune them as you're shaking out your training loop.

In [None]:
N_EPOCHS = 20 # number of passes over the training dataset
LR = 0.01 # learning rate
MOMENTUM = 0.5 # for SGD

BATCH_SIZE = 4 # number of instances per batch served by dataloader
NUM_WORKERS = 2 # number of worker processes used by dataloader

MODEL_DIR = 'models' # save models here
MODEL_SAVEFILE = 'seedling'

And we'll need to create that folder for our models:

In [None]:
!mkdir -p models

As we train and validate the model, we'll want informative logging so that we know what's going on, and roughly how long it should take. Also, it's a good practice to save the model when it reaches a new accuracy peak, so we'll create a helper for that.

In [None]:
def tlog(msg):
    print('{}   {}'.format(time.asctime(), msg))

    
def save_model(model, epoch):
    tlog('Saving model')
    savefile = "{}-e{}-{}.pt".format(MODEL_SAVEFILE, epoch, int(time.time()))
    path = os.path.join(MODEL_DIR, savefile)
    # recommended way from https://pytorch.org/docs/stable/notes/serialization.html
    torch.save(model.state_dict(), path)
    return savefile

If we can, we'd like to run this on GPU. Below, we'll check for the presence of a CUDA-compatible device and get a handle to it:

In [None]:
if not torch.cuda.is_available():
    device = torch.device('cpu')
    print('*** GPU not available - running on CPU. ***')
else:
    device = torch.device('cuda')
    print('GPU ready to go!')

Finally, just to make sure we're starting from *tabula rasa* (and for review), let's recreate the key components of our process:

In [None]:
full_dataset = torchvision.datasets.ImageFolder('train', transform=get_transforms())
train_len = int(0.8 * len(full_dataset))
validate_len = len(full_dataset) - train_len
train_dataset, validate_dataset = torch.utils.data.random_split(full_dataset, (train_len, validate_len))

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS, shuffle=True)
validate_loader = torch.utils.data.DataLoader(validate_dataset, batch_size=1)

model = SeedlingModelV1()

Now, we have an untrained model in `model`, our data ready to consume from `train_loader` and `validate_loader`, and a `device` selected. It's time to train!

The structure of this training loop should be familiar from previous exercises.

In [None]:
def train(model, epochs=N_EPOCHS):
    tlog('Training the model...')
    tlog('working on {}'.format(device))
    
    best_accuracy = 0. # determines whether we save a copy of the model
    saved_model_filename = None
    
    model = model.to(device) # move to GPU if available
    loss_fn = nn.CrossEntropyLoss() # combines nn.LogSoftmax() and nn.NLLLoss() for classification tasks
    optimizer = optim.SGD(model.parameters(), lr=LR, momentum=MOMENTUM)
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
    
    for epoch in range(epochs):
        tlog('BEGIN EPOCH {} of {}'.format(epoch + 1, epochs))
        running_loss = 0. # bookkeeping
        
        tlog('Train:')
        for i, data in enumerate(train_loader):
            instances, labels = data[0], data[1]
            instances, labels = instances.to(device), labels.to(device) # move to GPU if available
            
            optimizer.zero_grad()
            guesses = model(instances)
            loss = loss_fn(guesses, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            if (i + 1) % 200 == 0: # log every 200 batches
                tlog('  batch {}   avg loss: {}'.format(i + 1, running_loss / (200)))
                running_loss = 0.
        
        tlog('Validate:')
        with torch.no_grad(): # no need to do expensive gradient computation for validation
            total_loss = 0.
            correct = 0
            
            for i, data in enumerate(validate_loader):
                instance, label = data[0], data[1]
                instance, label = instance.to(device), label.to(device) # move to GPU if available
                
                guess = model(instance)
                loss = loss_fn(guess, label)
                total_loss += loss.item()
                
                prediction = torch.argmax(guess, 1)
                if prediction.item() == label.item(): # assuming batch size of 1
                    correct += 1

            avg_loss = total_loss / len(validate_loader)
            accuracy = correct / len(validate_loader)
            tlog('  Avg loss for epoch: {}   accuracy: {}'.format(avg_loss, accuracy))
            
            if accuracy >= best_accuracy:
                tlog( '  New accuracy peak, saving model')
                best_accuracy = accuracy
                saved_model_filename = save_model(model, epoch + 1)
                
    return (saved_model_filename, best_accuracy)
                


When you run the training loop, you should see the loss decreasing and accuracy increasing more-or-less monotonically, both for training and for validation. You should also see the average per-instance loss values roughly similar for validation and training.

In [None]:
best_model_filename, accuracy  = train(model)
print('The best model is saved at {} with accuracy {}'.format(best_model_filename, accuracy))
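The validation loop in `train()` counts correct predictions one instance at a time, which relies on the validation loader using a batch size of 1. If you want to validate with larger batches, the count can be done per batch instead. The sketch below uses made-up scores and labels for a batch of 4 instances and 3 classes.

```python
import torch

# Hypothetical model outputs for a batch of 4 instances and 3 classes.
guesses = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.9, 0.8, 1.1],
                        [1.2, 0.4, 0.3]])
labels = torch.tensor([0, 1, 1, 0])

predictions = torch.argmax(guesses, 1)           # predicted class per instance
correct = (predictions == labels).sum().item()  # number of matches in the batch
accuracy = correct / labels.size(0)
print(predictions.tolist(), correct, accuracy)   # [0, 1, 2, 0] 3 0.75
```

If you adapt the loop this way, divide the running count by the number of instances in the validation dataset rather than by the number of batches.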

### Exercise

**What accuracy did you achieve?** Did the model converge (i.e., did the per-instance loss flatten out) in the number of epochs you ran? Was the loss during validation similar to the loss during training?

**Was the learning stable?** Did loss continue to decrease and accuracy increase monotonically?

**How could it improve?** Consider the many choices we've made up to this point, and their effect on the model architecture, the state of the data, and the execution of the training loop:

* **Data:** We regularized the *shape* of the training data, but performed no other normalization. (For more information, see the discussion below on normalization of the color space.) Could the data be altered in some way that improves accuracy?
* **Convnet Layers:** Convolutional layers make use of multiple important parameters. Would performance be improved with a change to the number of input or output features, or the kernel size, or the stride length?
* **Training Hyperparameters:** Did the model converge? What might happen if you changed the learning rate or momentum? Do you need more training epochs?

If you feel your run could have been better, hypothesize about which of the above factors might affect it, and pick one or two to experiment with.
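One hyperparameter effect you can work out by hand is the learning-rate schedule. The `train()` function above creates a `StepLR` scheduler with `step_size=5` and `gamma=0.1`. The sketch below computes, in plain Python, the learning rate such a step-decay schedule would use in each epoch, assuming the scheduler's `step()` method is called once per epoch (a scheduler changes nothing until it is stepped). The helper name `step_lr` is hypothetical.

```python
LR = 0.01       # initial learning rate, as in the constants above
STEP_SIZE = 5   # epochs between decays
GAMMA = 0.1     # multiplicative decay factor


def step_lr(epoch, lr=LR, step_size=STEP_SIZE, gamma=GAMMA):
    """Learning rate in effect during a zero-based epoch under step decay."""
    return lr * gamma ** (epoch // step_size)


schedule = [step_lr(epoch) for epoch in range(20)]
for epoch in (0, 5, 10, 15):
    print('epoch {:2d}: lr = {:.0e}'.format(epoch + 1, schedule[epoch]))
```

Under these assumptions the rate drops by a factor of 10 every 5 epochs, so the last 5 of 20 epochs run at one thousandth of the starting rate. That is worth keeping in mind when deciding whether more epochs would help.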

In the cell below is a revision of the model above with some plausible values filled in; it typically gets to around 70% accuracy with 20 training epochs. Use it as a starting point for experiments if your model didn't converge.

In [None]:
class SeedlingModelV1_1(nn.Module):
    def __init__(self):
        super(SeedlingModelV1_1, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 22 * 22, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 12)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.fc1.in_features)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

model_1_1 = SeedlingModelV1_1()
best_model_1_1, acc_1_1 = train(model_1_1)
print('The best updated model is saved at {} with accuracy {}'.format(best_model_1_1, acc_1_1))

## Inference and Scoring

Now that we've trained our model, it's time to see how it performs.

When we downloaded the Kaggle dataset for this exercise, we got two folders: `train` and `test`. The training dataset was labeled, but the test dataset is not. To see how well we did, we need to load our trained model, feed it test instances, and put the predictions into the format that the Kaggle site is expecting. Fortunately, the package from Kaggle also included the file `sample_submission.csv`:

```
file,species
0021e90e4.png,Sugar beet
003d61042.png,Sugar beet
007b3da8b.png,Sugar beet
...
```

Below, we will:

* Create PyTorch Dataset and DataLoader objects for the test data
* Load our model
* Ask the model for its best guess about the species shown in each test image
* Put this information in a formatted file

Finally, we'll upload the file to Kaggle and see how well we scored.

First, let's get our trained model. At the end of your training run, you should see something like:

```
...
Fri May 10 19:59:38 2019   Validate:
Fri May 10 19:59:51 2019     Avg loss for epoch: 0.9324716810176247   accuracy: 0.7442105263157894
The best model is saved at seedling-e14-1557518094.pt with accuracy 0.7568421052631579
```

Copy and paste that filename into the following cell and run it. This will be the version of your model that we use for inference.

In [None]:
# load the model
path = os.path.join(MODEL_DIR, '***')
model_data = torch.load(path, map_location=torch.device('cpu'))
trained_model = SeedlingModelV1()
trained_model.load_state_dict(model_data)
print(trained_model)

# sanity check
image, label = train_dataset[0]
output = trained_model(torch.unsqueeze(image, 0))
print(output)

Now, we need to load the test images. We can't use `ImageFolder` like before, as the test images are not organized in labeled folders, but it's straightforward to build our own `Dataset` object from scratch:

In [None]:
from PIL import Image
from io import BytesIO
from os import listdir
from os.path import isfile, join


def get_test_transforms(target_size=100, normalize=False):
    t = transforms.Compose([
        transforms.Resize(target_size),
        transforms.CenterCrop(target_size),
        transforms.ToTensor()
        ])
    if normalize: # for imagenet-trained models specifically
        t = transforms.Compose([
            t,
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])
    return t


class SeedlingTestDataset(torch.utils.data.Dataset):

    def __init__(self, path_to_test_data='test', transform=None):
        self.transform = transform
        self.data, self.datasize = self.build_dataset_from_path(path_to_test_data)
        self.filenames = sorted(self.data.keys())

    def build_dataset_from_path(self, test_data_path):
        data = {}
        for item in listdir(test_data_path):
            file_path = join(test_data_path, item)
            if isfile(file_path) and 'png' in file_path:
                data[item] = file_path
        return data, len(data)

    def __len__(self):
        return self.datasize

    def __getitem__(self, index):
        key = self.filenames[index]
        full_path = self.data[key]

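        with open(full_path, 'rb') as f:
            # convert to RGB so every image has 3 channels, as ImageFolder's default loader does
            img = Image.open(BytesIO(f.read())).convert('RGB')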
        if self.transform is not None:
            img = self.transform(img)

        return img, key

Now we can instantiate the dataset and wrap it in a loader:

In [None]:
test_dataset = SeedlingTestDataset(transform=get_test_transforms())
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1, num_workers=1, shuffle=False)

# sanity check
image, filename = test_dataset.__getitem__(0)
output = trained_model(torch.unsqueeze(image, 0))
print(output)

# translate to a guess about class
classes = full_dataset.classes
score, pred = torch.max(output, 1)
print('{} (score: {})'.format(classes[pred.item()], score.item()))
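The score printed above is a raw model output (a logit), not a probability. If you would like something that reads as a confidence between 0 and 1, one common option is to apply softmax to the output first. The sketch below uses made-up logits for a single image and three classes.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw model output (logits) for one image and three classes.
logits = torch.tensor([[1.0, 3.0, 0.5]])

probs = F.softmax(logits, dim=1)        # each row now sums to 1
confidence, pred = torch.max(probs, 1)  # highest probability and its class index
print(pred.item(), round(confidence.item(), 3))
```

Softmax does not change which class has the highest value, so the predicted class is the same either way; only the scale of the score changes.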

Now, let's run inference on all the test images, and put the results in the file format that Kaggle expects:

In [None]:
with open('submission.csv', 'w') as outfile:
    outfile.write('file,species\n') # required header row
    
    # Some models have layers that are only active during training,
    # so always call eval() on the model before inference
    trained_model = trained_model.to(device) # keep the model and its inputs on the same device
    trained_model.eval()
    with torch.no_grad():
        for data, filename in test_loader:
            data = data.to(device) # .to() returns a new tensor, so keep the result
            output = trained_model(data)
            score, pred = torch.max(output, 1)
            # no space after the comma, matching sample_submission.csv
            outfile.write('{},{}\n'.format(filename[0], classes[pred.item()]))

Check your output with the cell below. It should look something like:

```
file,species
0021e90e4.png,Small-flowered Cranesbill
003d61042.png,Fat Hen
007b3da8b.png,Sugar beet
0086a6340.png,Common Chickweed
00c47e980.png,Sugar beet
00d090cde.png,Loose Silky-bent
00ef713a8.png,Common Chickweed
01291174f.png,Fat Hen
026716f9b.png,Loose Silky-bent
```

*Note: The values above are not guaranteed to be correct.*

In [None]:
!head submission.csv

If you have a nicely-formatted CSV file, download it now, then go to https://www.kaggle.com/c/plant-seedlings-classification/submit to get it scored.

## A Second Approach: Tuning an Existing Model (Self-Directed Exercise)

Now, we'll look at a second technique: Making adjustments to a pre-trained model.

The very best computer vision models can be large indeed. Here's a sampling of the parameter counts of some of the pre-trained models available with the `torchvision` library:

| Model | Number of Parameters |
| --- | --- |
| SeedlingModelV1 (above) | 943,456 |
| SqueezeNet 1.1 | 1,235,496 |
| Resnet 50 | 25,557,032 |
| Densenet-161 | 28,681,000 |
| Alexnet | 61,100,840 |
| VGG-16 | 138,357,544 |

Training a model with tens of millions of parameters - or more! - can take a huge amount of time, even if you have access to hardware acceleration. The good news, as we covered in the earlier unit on transfer learning, is that you can leverage pre-trained models for your problem domain, and greatly reduce your training time.

The pre-trained models available in `torchvision` are all trained against [ImageNet](http://www.image-net.org/about-overview) - a general-purpose set of over a million images drawn from the World Wide Web, categorized by their content into 1000 different categories. We'll be adapting one of these models to see if we can achieve better results while still only incurring a short cost for training time.

### Tweaking the Data

The pre-trained models available in `torchvision` all assume a 224x224 input image with 3 color channels. A few cells down, we re-create the datasets and loaders, tweaking the transform to give us a 224-pixel square instead of a 100-pixel square.

The pre-trained models are also all normalized to the colorspace distribution of the ImageNet set. The `if normalize:` stanza in `get_transforms()` adjusts our data to match this. Normalization can be important for model convergence. The usual method is to adjust the input values to a range between -1 and 1, with the distribution centered on zero. Most activation functions have their greatest derivative around zero, so this allows for strong gradient signal during learning. Also, some activation functions saturate around -1 and 1, so this keeps input values in a range where they'll generate a useful gradient.
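To see what the `Normalize` step does to pixel values, here is a small sketch in plain Python. It assumes `ToTensor` has already scaled each channel to the range 0 to 1, and applies the per-channel formula `(value - mean) / std` with the ImageNet constants from `get_transforms()`. The helper name `normalize` is hypothetical.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel ImageNet means used in get_transforms()
STD = [0.229, 0.224, 0.225]   # per-channel ImageNet standard deviations


def normalize(value, channel):
    """Apply Normalize's per-channel formula: (value - mean) / std."""
    return (value - MEAN[channel]) / STD[channel]


for channel, name in enumerate(['red', 'green', 'blue']):
    low, high = normalize(0.0, channel), normalize(1.0, channel)
    print('{:5s} [0, 1] -> [{:.2f}, {:.2f}]'.format(name, low, high))
```

With these constants the normalized values are centered near zero but span roughly -2 to 2.6 rather than exactly -1 to 1, since the scaling uses the dataset's standard deviation and not its full range.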

If you use one of the `torchvision` pre-trained models, make sure you call `get_transforms()` with the correct arguments:

```
get_transforms(target_size=224, normalize=True)
```

### The Process

If you need to, review the [transfer learning tutorial](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) at pytorch.org. Think about how you'd approach the seedling problem using a pre-trained model.

* **Which model should you use?** Most of the models have multiple variants of different depths. Larger models will take more training time.
* **How will you need to alter the model?** Some of the models have single, linear/fully-connected layers for final classification, and it's simple to swap in a new linear layer with the same number of inputs and the number of outputs set to the number of classes in the target problem (in our case, 12). Other models end with multi-layer classifiers, which will require slightly more work to adjust.
* **What about your training hyperparameters?** When fine-tuning a pre-trained model, you may wish to use a smaller learning rate. In fact, even when training a newly initialized model, it's a common practice to use *learning rate scheduling,* which reduces the learning rate over time, under the assumption that with each training epoch the model is getting closer to the global optimum. The transfer learning tutorial linked above demonstrates using learning rate scheduling. Are there other hyperparameters you might tune?
* **What about the data?** If you take a close look at the training dataset, the images come in a variety of sizes. About half of them are *less than* 224px wide, meaning that our default transform is upsampling them. Could this introduce artifacts that affect training? Does it make sense to exclude images under a certain size?

# Training and Inference on AWS SageMaker

If you're using this notebook locally on an ml.t2.medium instance, 20 training epochs using the 1.8GB dataset typically takes about 45 minutes - and that's for a *very* simple model architecture. Using a more complex architecture - e.g., `torchvision.models.resnet18()` - will take *much* longer, especially if you try to start training from scratch.

Below, we have provided some code that should help you get started with using SageMaker Estimators to initiate a large training job on a GPU-enabled instance. There are a few things to note about the code cell below:

* **The first argument to the estimator constructor is a Python file that contains your model architecture and training loop.** This file needs certain enhancements in order to function in a SageMaker Estimator - these enhancements are described in [the documentation for PyTorch Estimators](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html). We have provided a file, `seedlings.py`, that has some of those enhancements, and includes the simple model architecture that we used above.
* **The second argument is the instance type you want the training job to run on.** Here, we have used the smallest of the GPU-enabled instances, ml.p2.xlarge. *Don't forget to check that your code moves your model and all input data to the GPU device!*
* **You will need to provide your own AWS IAM role for the `role` argument.** See [the documentation for PyTorch Estimators](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html) for details.

After we create the Estimator, we need to start the training job. We do this with the `pytorch_estimator.fit(...)` call, and we provide the name of an S3 bucket that contains our training and test data. *Be sure that your S3 bucket and your training instance are in the same region.*

It will take a few minutes to start the instance and download the data from the S3 bucket, but you should find that training runs significantly faster.

In [None]:
from sagemaker.pytorch import PyTorch
pytorch_estimator = PyTorch('seedlings.py',
                        train_instance_type='ml.p2.xlarge', # 'ml.p3.2xlarge',
                        train_instance_count=1,
                        framework_version='1.0.0',
                        role='arn:aws:iam::896498678582:role/service-role/AmazonSageMaker-ExecutionRole-20190226T104608',
                        hyperparameters = {'epochs': 1, 'batch-size': 4, 'learning-rate': 0.01})

pytorch_estimator.fit('s3://pytorch-course-datasets-ireland/seedlings-train/')

## Inference in Production with Predictors

Now that you have trained the model, and it has been saved to the model output directory, you can use it to make predictions on new data. The `pytorch_estimator.deploy(...)` call below creates an endpoint that you may call with image data to make predictions.

In [None]:
pytorch_predictor = pytorch_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

To test our prediction, we'll use a random instance from the labeled validation dataset. We'll convert the image content to a NumPy array, and pass it to the predictor.

In [None]:
import numpy as np

image, label = validate_dataset[random.randrange(0, len(validate_dataset))]
print('Checking prediction for {}'.format(label))

image = image.unsqueeze(0).numpy()

response = pytorch_predictor.predict(image)
prediction = response.argmax(axis=1)[0]
print(prediction)

Don't forget: Always clean up after yourself! Deleting your endpoint will deallocate the instance it runs on.

In [None]:
pytorch_estimator.delete_endpoint()