# Download, Structure, and Preprocess Image Data for SageMaker Built-In Algorithms

**Notes**: 
* This notebook should be used with the conda_amazonei_mxnet_p36 kernel
* You can also explore image preprocessing with TensorFlow and PyTorch by running [Download, Structure, and Preprocess Image Data for TensorFlow Models](tensorflow_preprocess_and_train.ipynb) and [Download, Structure, and Preprocess Image Data for PyTorch Models](pytorch_preprocess_and_train.ipynb), respectively.

The main purpose of this notebook is to demonstrate how you can preprocess image data to train SageMaker Built-In Algorithms.

## Contents
1. [Part 1: Download the Dataset](#Part-1:-Download-the-Dataset)
1. [Part 2: Structure the Dataset](#Part-2:-Structure-the-Dataset)
1. [Part 3: Preprocess Images for Built-in Algorithms](#Part-3:-Preprocess-Images-for-Built-in-Algorithms)
1. [Part 4: Train the Built-in Image Classification Algorithm](#Part-4:-Train-the-Built-in-Image-Classification-Algorithm)

## Part 1: Download the Dataset
----
----
In this section, you will use a dataset manifest to download animal images from the COCO dataset for all ten animal classes. You will then download frog images from the CIFAR dataset and add them to your COCO animal images. To simulate coming to SageMaker with your own dataset, we will keep the data in an unstructured form until Part 2, where you will learn the best practices for structuring an image dataset.

In [None]:
! pip install imageio joblib opencv-python

In [None]:
import json
import pickle
import shutil
import urllib
import pathlib
import tarfile
from tqdm import tqdm
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from imageio import imread, imwrite
from joblib import Parallel, delayed, parallel_backend

<pre>
</pre>

### The COCO and CIFAR Datasets
___
For this series of notebooks we will be sampling images from the [COCO dataset](https://cocodataset.org) and [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) (before beginning the notebooks in this series, it's a good idea to browse each dataset website to familiarize yourself with the data). Both are datasets of images, but they come formatted very differently. The COCO dataset contains images from Flickr that represent a real-world dataset which isn't formatted or resized specifically for deep learning. This makes it a good dataset for this guide because we want it to be as comprehensive as possible. The CIFAR-10 images, on the other hand, are preprocessed specifically for deep learning as they come cropped, resized and vectorized (i.e. not in a readable image format). This notebook will show you how to work with both types of datasets.

<pre>
</pre>

### Download the annotations
____
The dataset annotation file contains info on each image in the dataset such as the class, superclass, file name and url to download the file. Just the annotations for the COCO dataset are about 242MB.

In [None]:
anno_url = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
urllib.request.urlretrieve(anno_url, "coco-annotations.zip");

In [None]:
shutil.unpack_archive("coco-annotations.zip")

#### Load the annotations into Python
The training and validation annotations come in separate files.

In [None]:
with open("annotations/instances_train2017.json", "r") as f:
    train_metadata = json.load(f)

with open("annotations/instances_val2017.json", "r") as f:
    val_metadata = json.load(f)

<pre>
</pre>


### Extract only the animal annotations
___
To limit the scope of the dataset for this guide, we're only using the images of animals in the COCO dataset.

In [None]:
category_labels = {
    c["id"]: c["name"] for c in train_metadata["categories"] if c["supercategory"] == "animal"
}

#### Extract metadata and image filepaths
For the train and validation sets, the data we need for the image labels and the filepaths are under different headings in the annotations. We have to extract each out and combine them into a single annotation in subsequent steps.

In [None]:
train_annos = {}
for a in train_metadata["annotations"]:
    if a["category_id"] in category_labels:
        train_annos[a["image_id"]] = {"category_id": a["category_id"]}

train_images = {}
for i in train_metadata["images"]:
    train_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

val_annos = {}
for a in val_metadata["annotations"]:
    if a["category_id"] in category_labels:
        val_annos[a["image_id"]] = {"category_id": a["category_id"]}

val_images = {}
for i in val_metadata["images"]:
    val_images[i["id"]] = {"coco_url": i["coco_url"], "file_name": i["file_name"]}

#### Combine label and filepath info
Later in this series of guides we'll make our own train, validation and test splits. For this reason we'll combine the training and validation datasets together.

In [None]:
# Use image_id rather than id to avoid shadowing Python's built-in id()
for image_id, anno in train_annos.items():
    anno.update(train_images[image_id])

for image_id, anno in val_annos.items():
    anno.update(val_images[image_id])

In [None]:
all_annos = {}
for k, v in train_annos.items():
    all_annos.update({k: v})
for k, v in val_annos.items():
    all_annos.update({k: v})
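
The extract-and-combine steps above can be sketched on a tiny stand-in for the annotation JSON. The `toy_metadata` values and the `join_annotations` helper are hypothetical and exist only for illustration. The sketch also makes one behavior visible: because the result is keyed by `image_id`, an image with several animal annotations keeps only the last one.

```python
# Tiny stand-in for the COCO annotation JSON (values are made up for illustration).
toy_metadata = {
    "categories": [
        {"id": 17, "name": "cat", "supercategory": "animal"},
        {"id": 18, "name": "dog", "supercategory": "animal"},
        {"id": 3, "name": "car", "supercategory": "vehicle"},
    ],
    "images": [
        {"id": 1, "file_name": "000001.jpg", "coco_url": "http://example.com/000001.jpg"},
        {"id": 2, "file_name": "000002.jpg", "coco_url": "http://example.com/000002.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "category_id": 17},
        {"image_id": 1, "category_id": 18},  # second animal in the same image
        {"image_id": 2, "category_id": 3},  # not an animal, so it is skipped
    ],
}


def join_annotations(metadata):
    """Hypothetical helper: join animal labels with image file info by image id."""
    labels = {
        c["id"]: c["name"] for c in metadata["categories"] if c["supercategory"] == "animal"
    }
    images = {i["id"]: i for i in metadata["images"]}
    joined = {}
    for a in metadata["annotations"]:
        if a["category_id"] in labels:
            image = images[a["image_id"]]
            # Keyed by image_id, so a later annotation overwrites an earlier one.
            joined[a["image_id"]] = {
                "category_id": a["category_id"],
                "coco_url": image["coco_url"],
                "file_name": image["file_name"],
            }
    return joined


toy_annos = join_annotations(toy_metadata)
print(toy_annos)
```

Image 1 ends up labeled with category 18 only, and image 2 is dropped because it has no animal annotation.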

<pre>
</pre>


### Sample the dataset
___
In order to make working with the data easier, we'll select 250 images from each class at random. To make sure you get the same set of images each time you run this notebook, we'll also set NumPy's random seed to 0. This is a small fraction of the dataset, but it demonstrates how using transfer learning can give you good results without needing very large datasets.

In [None]:
np.random.seed(0)

In [None]:
sample_annos = {}

for category_id in category_labels:
    subset = [k for k, v in all_annos.items() if v["category_id"] == category_id]
    sample = np.random.choice(subset, size=250, replace=False)
    for k in sample:
        sample_annos[k] = all_annos[k]
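
As a minimal sketch of the per-class sampling above, the hypothetical `sample_per_class` helper below runs on made-up annotations. It uses a local `np.random.RandomState` instead of the global seed (a design choice for this sketch, not what the notebook does) so that two calls return the same sample.

```python
import numpy as np


def sample_per_class(annos, n_per_class, seed=0):
    """Hypothetical helper: pick n_per_class image ids from every class."""
    rng = np.random.RandomState(seed)
    sampled = {}
    for category_id in sorted({v["category_id"] for v in annos.values()}):
        subset = [k for k, v in annos.items() if v["category_id"] == category_id]
        # replace=False raises ValueError if a class has fewer than n_per_class images.
        for k in rng.choice(subset, size=n_per_class, replace=False):
            sampled[int(k)] = annos[int(k)]
    return sampled


# Made-up annotations: 20 images split evenly across two classes.
toy_annos = {i: {"category_id": 16 + i % 2} for i in range(20)}
first = sample_per_class(toy_annos, n_per_class=3)
second = sample_per_class(toy_annos, n_per_class=3)
print(sorted(first) == sorted(second))
```

Note that sampling without replacement fails when a class holds fewer images than requested, so it is worth checking class counts before choosing the sample size.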

#### Create a download function
In order to parallelize downloading the images we must wrap the download and save process with a function for multi-threading with joblib.

In [None]:
def download_image(url, path):
    data = imread(url)
    imwrite(path / url.split("/")[-1], data)

#### Download the sample of the dataset (2,500 images, ~5min)

In [None]:
sample_dir = pathlib.Path("data_sample_2500")
sample_dir.mkdir(exist_ok=True)

In [None]:
with parallel_backend("threading", n_jobs=5):
    Parallel(verbose=3)(
        delayed(download_image)(a["coco_url"], sample_dir) for a in sample_annos.values()
    )

<pre>
</pre>

### Combine with CIFAR-10 frog data
___
The COCO dataset doesn't include any images of frogs, but let's say our model must also be able to label images of frogs. To fix this we can download another dataset of images which includes frogs, sample 250 frog images and add them to our existing image data. We'll use the CIFAR-10 dataset to achieve this. These images are much smaller (32x32), so they will appear pixelated and blurry when we increase their size to 224x224. As you'll see, the CIFAR-10 dataset comes formatted in a very different manner from the COCO dataset. We must process the CIFAR-10 data into individual image files so that it's congruent with our COCO images.

#### Download and extract the CIFAR-10 dataset

In [None]:
!wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

In [None]:
# Use a context manager so the archive is closed after extraction
with tarfile.open("cifar-10-python.tar.gz") as cifar_tar:
    cifar_tar.extractall()

#### Open first batch of CIFAR-10 dataset
The CIFAR-10 dataset comes in five training batches and one test batch. Each training batch has 10,000 randomly ordered images. Since we only need 250 frog images for our dataset, just pulling from the first batch will suffice.

In [None]:
with open("./cifar-10-batches-py/data_batch_1", "rb") as f:
    batch_1 = pickle.load(f, encoding="bytes")

In [None]:
image_data = batch_1[b"data"]

#### Pull 250 sample frog images

In [None]:
frog_indices = np.array(batch_1[b"labels"]) == 6
sample_frog_indices = np.random.choice(frog_indices.nonzero()[0], size=250, replace=False)
sample_data = image_data[sample_frog_indices, :]
frog_images = sample_data.reshape(len(sample_data), 3, 32, 32).transpose(0, 2, 3, 1)
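
The reshape and transpose above can be checked on a synthetic row. This sketch assumes the layout described on the CIFAR-10 site, where each 3,072-value row stores the red plane, then the green plane, then the blue plane. The `row` and `batch` arrays are made up for illustration.

```python
import numpy as np

# One synthetic CIFAR-style row: 1024 red values, then 1024 green, then 1024 blue.
row = np.concatenate(
    [
        np.full(1024, 255, dtype=np.uint8),  # red plane
        np.full(1024, 128, dtype=np.uint8),  # green plane
        np.zeros(1024, dtype=np.uint8),  # blue plane
    ]
)
batch = row[np.newaxis, :]  # shape (1, 3072), like batch_1[b"data"]

# (N, 3072) -> (N, channel, row, col) -> (N, row, col, channel)
images = batch.reshape(len(batch), 3, 32, 32).transpose(0, 2, 3, 1)

print(images.shape)  # (1, 32, 32, 3)
print(images[0, 0, 0].tolist())  # [255, 128, 0]
```

Every pixel comes out as (255, 128, 0), which confirms that the channel axis ends up last, the order image libraries such as `imageio` and `matplotlib` expect.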

#### View frog images

In [None]:
fig, axs = plt.subplots(3, 4, figsize=(10, 7))
indices = np.random.randint(low=0, high=len(frog_images), size=12)  # high is exclusive

for i, ax in enumerate(axs.flatten()):
    ax.imshow(frog_images[indices[i]])
    ax.axis("off")

#### Write sample frog images to `data_sample_2500` directory

In [None]:
frog_filenames = np.array(batch_1[b"filenames"])[sample_frog_indices]

In [None]:
for idx, filename in enumerate(frog_filenames):
    filename = filename.decode()
    data = frog_images[idx]
    if filename.endswith(".png"):
        filename = filename.replace(".png", ".jpg")
    imwrite(sample_dir / filename, data)

In [None]:
sample_dir.rename("data_sample_2750")

#### Add frog annotations to `sample_annos`

In [None]:
category_labels[26] = "frog"

In [None]:
next_anno_idx = np.array(list(sample_annos.keys())).max() + 1

frog_anno_ids = range(next_anno_idx, next_anno_idx + len(frog_images))

In [None]:
for idx, frog_id in enumerate(frog_anno_ids):
    sample_annos[frog_id] = {
        "category_id": 26,
        "file_name": frog_filenames[idx].decode().replace(".png", ".jpg"),
    }

## Part 2: Structure the Dataset
----
----

In this section, you will properly structure your image files for ingestion by the model. Then, we will use Python to create the new folder structure and copy the files into the correct set and label folder.

### Proper folder structure
___

Although most tools can accommodate data in any file structure with enough tinkering, it makes most sense to use the sensible defaults that frameworks like MXNet, TensorFlow and PyTorch all share to make data ingestion as smooth as possible. By default, most tools will look for image data in the file structure depicted below:
```
+-- train
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- val
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|
+-- test
|   +-- class_A
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
|   +-- class_B
|       +-- filename.jpg
|       +-- filename.jpg
|       +-- filename.jpg
```
You will notice that the COCO dataset does not come structured like above so we must use the annotation data to help restructure the folders of the COCO dataset so they match the pattern above. Once the new directory structures are created you can use your desired framework's data loading tool to gracefully load and define transformation for your image data. Many datasets may already be in this structure in which case you can skip this guide.

<a id='ipg2.3'></a>
### Make train, validation and test splits
___
We should divide our data into train, validation and test splits. A typical split ratio is 80/10/10. Our image classification algorithm will train on the first 80% (training) and evaluate its performance at each epoch with the next 10% (validation) and we'll give our model's final accuracy results using the last 10% (test). It's important that before we split the data we make sure to shuffle it randomly so that class distribution among splits is roughly proportional.

In [None]:
np.random.seed(0)
image_ids = sorted(list(sample_annos.keys()))
np.random.shuffle(image_ids)
first_80 = int(len(image_ids) * 0.8)
next_10 = int(len(image_ids) * 0.9)
train_ids, val_ids, test_ids = np.split(image_ids, [first_80, next_10])
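
To see what the split above produces, here is a small sketch on stand-in ids. It assumes 2,750 ids, matching the sample size used in this notebook, and uses a local `RandomState` rather than the global seed.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_ids = np.arange(2750)  # stand-in for the 2,750 sampled image ids
rng.shuffle(toy_ids)

first_80 = int(len(toy_ids) * 0.8)
next_10 = int(len(toy_ids) * 0.9)
# np.split cuts at the two indices, giving three consecutive slices.
train, val, test = np.split(toy_ids, [first_80, next_10])

print(len(train), len(val), len(test))  # 2200 275 275

# No id should land in more than one split.
overlap = (set(train) & set(val)) | (set(train) & set(test)) | (set(val) & set(test))
print(len(overlap))  # 0
```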

<a id='ipg2.4'></a>
### Make new folder structure and copy image files
___
This new folder structure can then be read by data loaders for SageMaker's built-in algorithms, TensorFlow or PyTorch for easy loading of the image data into your framework of choice.

In [None]:
unstruct_dir = Path("data_sample_2750")
struct_dir = Path("data_structured")
struct_dir.mkdir(exist_ok=True, parents=True)

for name, split in zip(["train", "val", "test"], [train_ids, val_ids, test_ids]):
    split_dir = struct_dir / name
    split_dir.mkdir(exist_ok=True)
    for image_id in tqdm(split):
        category_dir = split_dir / f'{category_labels[sample_annos[image_id]["category_id"]]}'
        category_dir.mkdir(exist_ok=True)
        source_path = (unstruct_dir / sample_annos[image_id]["file_name"]).as_posix()
        target_path = (category_dir / sample_annos[image_id]["file_name"]).as_posix()
        shutil.copy(source_path, target_path)
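
After copying, it can help to confirm that every split contains every class. The sketch below is one way to do that. The `count_images` helper is hypothetical, and the demo runs on a throwaway directory rather than on `data_structured`.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images(root):
    """Hypothetical helper: count .jpg files per (split, class) under root."""
    counts = Counter()
    for p in Path(root).rglob("*.jpg"):
        split, label = p.parts[-3], p.parts[-2]
        counts[(split, label)] += 1
    return counts


# Demo on a temporary split/class/file layout.
with tempfile.TemporaryDirectory() as tmp:
    for split, label, n in [("train", "cat", 3), ("val", "cat", 1), ("train", "frog", 2)]:
        class_dir = Path(tmp) / split / label
        class_dir.mkdir(parents=True)
        for i in range(n):
            (class_dir / f"{i}.jpg").touch()
    demo_counts = count_images(tmp)

print(sorted(demo_counts.items()))
```

Calling `count_images("data_structured")` on the real folder would give the per-class totals for the train, val and test splits.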

## Part 3: Preprocess Images for Built-in Algorithms
----
----

In this section, we will explore the different ways to format your image dataset for SageMaker's built-in algorithms. The first involves creating a manifest file for the train and validation sets and the other has you creating .REC files (RecordIO format), which are single binary files made up of all the images for the train and validation sets. Since the RecordIO format is preferred, we will upload the .REC files to S3 for training in Part 4.

### Dependencies
___

In [None]:
import uuid
import boto3
import shutil
import urllib
import pickle
import pathlib
import sagemaker
import subprocess

### Application/x-image format
___

This format is also referred to as "Image Format" or "LST" format. The benefit of using this format is that it doesn't require any modification or restructuring of your dataset. Instead, you create a manifest of the images for your training set and validation set. These two manifests are separate `.lst` files which list all the images giving each of them a unique index, the class they belong to and the relative path to the image file from the main training folder. The data in the `.lst` file is in tab separated values.

While it's the easiest format to use, it requires SageMaker to do more work behind the scenes. For datasets with many images, this will cause training to take longer. For datasets with fewer images, the performance difference isn't as pronounced.

Below are two examples of how to create your .LST manifest files. One uses your own code and the other uses a script from MXNet. If you want to create .REC files of your images, you should skip to Option 2.

#### Option 1: Manually generate the .LST files

In [None]:
category_ids = {name: idx for idx, name in enumerate(sorted(category_labels.values()))}
print(category_ids)

In [None]:
image_paths = pathlib.Path("./data_structured").rglob("*.jpg")

for idx, p in enumerate(image_paths):
    image_id = f"{idx:010}"
    category = category_ids[p.parts[-2]]
    path = p.as_posix()
    split = p.parts[-3]
    with open(f"{split}.lst", "a") as f:
        line = f"{image_id}\t{category}\t{path}\n"
        f.write(line)

View the contents of the `train.lst` file

In [None]:
!head train.lst
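
Because the cell above opens each `.lst` file in append mode, running it twice adds every row twice. A minimal sketch of an alternative, assuming the same `split/class/file.jpg` layout, collects the rows per split first and then writes each file once in `"w"` mode. The `build_lst_lines` and `write_lst_files` helpers and the demo paths are hypothetical.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def build_lst_lines(image_paths, category_ids):
    """Hypothetical helper: group tab-separated .lst rows by split name."""
    lines = defaultdict(list)
    for idx, p in enumerate(sorted(Path(x) for x in image_paths)):
        split, label = p.parts[-3], p.parts[-2]
        lines[split].append(f"{idx:010}\t{category_ids[label]}\t{p.as_posix()}\n")
    return lines


def write_lst_files(lines, out_dir):
    for split, split_lines in lines.items():
        # "w" replaces any earlier file, so re-running does not duplicate rows.
        with open(Path(out_dir) / f"{split}.lst", "w") as f:
            f.writelines(split_lines)


demo_paths = ["data/train/cat/a.jpg", "data/train/frog/b.jpg", "data/val/cat/c.jpg"]
demo_lines = build_lst_lines(demo_paths, {"cat": 0, "frog": 1})

with tempfile.TemporaryDirectory() as tmp:
    write_lst_files(demo_lines, tmp)
    write_lst_files(demo_lines, tmp)  # second run overwrites instead of appending
    train_rows = (Path(tmp) / "train.lst").read_text().splitlines()

print(train_rows)
```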

<pre>
</pre>

#### Option 2: Use im2rec.py script to generate the .LST files

In [None]:
script_url = "https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py"
urllib.request.urlretrieve(script_url, "im2rec.py");

`python im2rec.py --list --recursive LST_FILE_PREFIX DATA_DIR`
* --list - generate an LST file
* --recursive - looks inside subfolders for image data
* LST_FILE_PREFIX - choose the name you want for the `.lst` file
* DATA_DIR - relative path to directory with the data

In [None]:
!python im2rec.py --list --recursive train data_structured/train

In [None]:
!python im2rec.py --list --recursive val data_structured/val

View the contents of the `train.lst` file

In [None]:
!head train.lst

<pre>
</pre>

### Application/x-recordio (preferred format)
___
This format is commonly referred to as RecordIO. It creates a new file with the `.rec` suffix for each of your training and validation datasets. The `.rec` file is a single file that contains all of the images in the dataset, so it can be streamed directly to the SageMaker training algorithm without the overhead involved with transferring thousands of individual files. For datasets with many images this provides a huge reduction in training time because SageMaker doesn't need to download all the image files before it can run the training algorithm. If you use the `im2rec.py` script, it can also resize the images for you. Resizing the files before saving them in the RecordIO format reduces the amount of data you need to transfer to S3 and also speeds up training, because the resizing is done ahead of time instead of during training.

#### 1. Run Option 2 from application/x-image above and copy LST files
Once you've run Option 2 from above then proceed below.

In [None]:
recordio_dir = pathlib.Path("./data_recordio")
recordio_dir.mkdir(exist_ok=True)
shutil.copy("train.lst", "data_recordio/")
shutil.copy("val.lst", "data_recordio/");

#### 2. Generate .rec files in the RecordIO Format
Once the `.lst` file is generated, the same `im2rec.py` script will also generate the `.rec` file.

`python im2rec.py --resize 224 --quality 90 --num-thread 16 LST_FILE_PREFIX DATA_DIR/`
* **--resize**: Have the script resize the images before saving them all to a `.rec` file. For the image classification algorithm the default dimensions are 224x224. Resizing now will also reduce the size of your `.rec` file.
* **--quality**: Sets the quality used when encoding the images. A lower value compresses more and keeps the file size of your `.rec` down, especially if you're not resizing the images.
* **--num-thread**: Set how many threads to use to parallelize the work.
* **LST_FILE_PREFIX**: Path prefix of the `.lst` file you're referencing for creating the `.rec` file (positional argument).
* **DATA_DIR**: Relative path to the directory which holds the data listed in the `.lst` file (positional argument).



##### Training dataset

In [None]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/train data_structured/train

##### Validation dataset

In [None]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/val data_structured/val
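
If you would rather call the script from Python than from a `!` shell line, the two commands above can be assembled as an argument list. This is a sketch only: `im2rec_command` and `run_im2rec` are hypothetical helpers, and they use just the flags shown above.

```python
import subprocess
import sys


def im2rec_command(lst_prefix, data_dir, resize=224, quality=90, num_thread=16):
    """Hypothetical helper: build the im2rec.py argument list used in this notebook."""
    return [
        sys.executable,
        "im2rec.py",
        "--resize",
        str(resize),
        "--quality",
        str(quality),
        "--num-thread",
        str(num_thread),
        lst_prefix,  # positional: prefix of the .lst file
        data_dir,  # positional: directory holding the images
    ]


def run_im2rec(lst_prefix, data_dir, **options):
    # Requires im2rec.py and the .lst file to exist; raises if the script fails.
    subprocess.run(im2rec_command(lst_prefix, data_dir, **options), check=True)


cmd = im2rec_command("data_recordio/val", "data_structured/val")
print(" ".join(cmd[1:]))
```

Passing a list to `subprocess.run` avoids shell quoting problems, and `check=True` surfaces a failed conversion as an exception instead of a silent non-zero exit code.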

<pre>
</pre>

### Upload the data to S3
___
In order for SageMaker's built-in algorithms to train on the data, it must be stored in an S3 bucket. Here, we use the SageMaker session's default bucket, but you can use an existing bucket if you like by replacing the value of the `bucket_name` variable below.

#### Get S3 Bucket

In [None]:
bucket_name = sagemaker.Session().default_bucket()
prefix = "DEMO-sm-preprocess-train-image-data-builtin-algo"
s3 = boto3.resource("s3")
region = sagemaker.Session().boto_region_name

#### Upload .rec files to S3

In [None]:
s3_uploader = sagemaker.s3.S3Uploader()

data_path = recordio_dir / "train.rec"

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/train"
)

In [None]:
data_path = recordio_dir / "val.rec"

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), desired_s3_uri=f"s3://{bucket_name}/{prefix}/data/val"
)

## Part 4: Train the Built-in Image Classification Algorithm
----
----
In this section, you will use the SageMaker SDK to create an Estimator for SageMaker's Built-in Image Classification algorithm and train it on a remote EC2 instance.

### Built-in Image Classification algorithm
___

#### Create SageMaker training and validation channels

In [None]:
train_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/{prefix}/data/train",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

val_data = sagemaker.inputs.TrainingInput(
    s3_data=f"s3://{bucket_name}/{prefix}/data/val",
    content_type="application/x-recordio",
    s3_data_type="S3Prefix",
    input_mode="Pipe",
)

data_channels = {"train": train_data, "validation": val_data}

### Configure the algorithm's hyperparameters
See the [Image Classification hyperparameters documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) for the full list.
* **num_layers** - The built-in image classification algorithm is based on the ResNet architecture. There are many different versions of this architecture, differing by how many layers they use. We'll use the smallest one for this guide to speed up training. If the algorithm's accuracy is hitting a plateau and you need better accuracy, increasing the number of layers may help.
* **use_pretrained_model** - This will initialize the weights from a pre-trained model for transfer learning. Otherwise weights are initialized randomly.
* **augmentation_type** - Allows you to add augmentations to your training set to help your model generalize better. For small datasets, augmentation can greatly improve training.
* **image_shape** - The number of channels, height and width of all the images.
* **num_classes** - Number of classes in your dataset.
* **num_training_samples** - Total number of images in your training set (used to help calculate progress).
* **mini_batch_size** - The batch size you would like to use during training.
* **epochs** - An epoch refers to one cycle through the training set, and having more epochs to train means having more opportunities to improve accuracy. Suitable values range from 5 to 25 epochs depending on your time and budget constraints. Ideally, the right number of epochs is right before your validation accuracy plateaus.
* **learning_rate** - After each batch of training we update the model's weights to give us the best possible results for that batch. The learning rate controls by how much we should update the weights. Typical values fall between 0.001 and 0.2, and rarely go higher than 1. The higher the learning rate, the faster your training will converge to the optimal weights, but going too fast can lead you to overshoot the target. In this example, we're using the weights from a pre-trained model, so we'd want to start with a lower learning rate because the weights have already been optimized and we don't want to move too far away from them.
* **precision_dtype** - Whether you want to use a 32-bit float data type for the model's weights or 16-bit. 16-bit can be used if you're running into memory management issues. However, weights can grow or shrink rapidly, so having 32-bit weights makes your training more robust to these issues and is typically the default in most frameworks.

In [None]:
num_classes = len(category_labels)
num_training_samples = len(set(pathlib.Path("data_structured/train").rglob("*.jpg")))

In [None]:
hyperparameters = {
    "num_layers": 18,
    "use_pretrained_model": 1,
    "augmentation_type": "crop_color_transform",
    "image_shape": "3,224,224",
    "num_classes": num_classes,
    "num_training_samples": num_training_samples,
    "mini_batch_size": 64,
    "epochs": 5,
    "learning_rate": 0.001,
    "precision_dtype": "float32",
}
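
As a rough guide to how `num_training_samples`, `mini_batch_size` and `epochs` interact, the sketch below estimates the number of batches processed. It assumes 2,200 training images (80% of the 2,750 sampled) and treats a final partial batch as a full step. Whether the built-in algorithm drops or keeps that last partial batch is not covered here, so read the result as an upper bound. The `batches_per_epoch` helper is hypothetical.

```python
import math


def batches_per_epoch(num_training_samples, mini_batch_size):
    """Hypothetical helper: upper bound on batches in one pass over the data."""
    return math.ceil(num_training_samples / mini_batch_size)


steps = batches_per_epoch(2200, 64)
print(steps)  # 35
print(steps * 5)  # 175 batches over 5 epochs
```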

### Configure the type of algorithm and resources to use

In [None]:
training_image = sagemaker.image_uris.retrieve(
    "image-classification", sagemaker.Session().boto_region_name
)

In [None]:
algo_config = {
    "hyperparameters": hyperparameters,
    "image_uri": training_image,
    "role": sagemaker.get_execution_role(),
    "instance_count": 1,
    "instance_type": "ml.p3.2xlarge",
    "volume_size": 100,
    "max_run": 360000,
    "output_path": f"s3://{bucket_name}/data/output",
}

### Create and train the algorithm

In [None]:
algorithm = sagemaker.estimator.Estimator(**algo_config)

In [None]:
algorithm.fit(inputs=data_channels, logs=True)

<a id='ipg4a.3'></a>
## Understanding the training output
___

```
[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]#011Speed: 111.811 samples/sec#011accuracy=0.452381
[09/14/2020 05:37:54 INFO 139869866030912] Epoch[0] Batch [40]#011Speed: 131.393 samples/sec#011accuracy=0.570503
[09/14/2020 05:38:10 INFO 139869866030912] Epoch[0] Batch [60]#011Speed: 139.540 samples/sec#011accuracy=0.617700
[09/14/2020 05:38:27 INFO 139869866030912] Epoch[0] Batch [80]#011Speed: 144.003 samples/sec#011accuracy=0.644483
[09/14/2020 05:38:43 INFO 139869866030912] Epoch[0] Batch [100]#011Speed: 146.600 samples/sec#011accuracy=0.664991
```

Training has begun:
* Epoch[0]: One epoch corresponds to one training cycle through all the data. Stochastic optimizers like SGD and Adam improve accuracy by running multiple epochs. Random data augmentations are also applied with each new epoch, allowing the training algorithm to learn on modified data.
* Batch: The number of batches processed by the training algorithm. We specified one batch to be 64 images in the `mini_batch_size` hyperparameter. For algorithms like SGD, the model gets a chance to update itself every batch.
* Speed: the number of images sent to the training algorithm per second. This information is important in determining how changes in your dataset affect the speed of training.
* Accuracy: the training accuracy achieved at each interval (in this case, 20 batches).
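
The fields described above can be pulled out of a log line with a regular expression, which is handy if you want to plot accuracy or speed over time. This is a minimal sketch written against the log format shown in this notebook. The `parse_progress` helper is hypothetical, and the `#011` sequences are matched literally (they appear to be escaped tab characters).

```python
import re

# Pattern written against the sample log lines in this notebook.
LOG_PATTERN = re.compile(
    r"Epoch\[(?P<epoch>\d+)\] Batch \[(?P<batch>\d+)\]"
    r"#011Speed: (?P<speed>[\d.]+) samples/sec"
    r"#011accuracy=(?P<accuracy>[\d.]+)"
)


def parse_progress(line):
    """Hypothetical helper: return the progress fields, or None if no match."""
    m = LOG_PATTERN.search(line)
    if m is None:
        return None
    return {
        "epoch": int(m["epoch"]),
        "batch": int(m["batch"]),
        "speed": float(m["speed"]),
        "accuracy": float(m["accuracy"]),
    }


sample = (
    "[09/14/2020 05:37:38 INFO 139869866030912] Epoch[0] Batch [20]"
    "#011Speed: 111.811 samples/sec#011accuracy=0.452381"
)
print(parse_progress(sample))
```

Lines that do not report batch progress, such as the end-of-epoch summary lines, return `None` and can be filtered out.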

```

[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Train-accuracy=0.677083
[09/14/2020 05:38:58 INFO 139869866030912] Epoch[0] Time cost=102.745
[09/14/2020 05:39:02 INFO 139869866030912] Epoch[0] Validation-accuracy=0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Storing the best model with validation accuracy: 0.729492
[09/14/2020 05:39:02 INFO 139869866030912] Saved checkpoint to "/opt/ml/model/image-classification-0001.params"
```

The first epoch of training has ended (this example output shows only one epoch). The final training accuracy is reported as well as the accuracy on the validation set. Comparing these two numbers is important in determining if your model is overfit or underfit, as well as in understanding the bias/variance trade-off. The saved model uses the learned weights from the epoch with the best validation accuracy.

```

2020-09-14 05:39:03 Uploading - Uploading generated training model
2020-09-14 05:39:15 Completed - Training job completed
Training seconds: 235
Billable seconds: 235
```

The final model parameters are saved as a `.tar.gz` in S3 to the directory specified in the `output_path` of `algo_config`. Total billable seconds is also reported to help compute the cost of training since you are only charged for the time the EC2 instance is training on the data. Other costs such as S3 storage also apply, but are not included here.
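
To turn billable seconds into an estimated cost, multiply the billable hours by the hourly price of the training instance. The sketch below assumes a placeholder rate that is NOT a real price. Look up the current price for your instance type and region before relying on the result. The `training_cost_usd` helper is hypothetical.

```python
def training_cost_usd(billable_seconds, hourly_rate_usd):
    """Hypothetical helper: instance cost for a training job, excluding S3 storage."""
    return billable_seconds / 3600 * hourly_rate_usd


HYPOTHETICAL_RATE = 4.00  # placeholder only, not an actual instance price
print(round(training_cost_usd(235, HYPOTHETICAL_RATE), 4))  # 0.2611
```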