# Preprocessing Images for Built-in Algorithms (Part 3/4)
Download | Structure | **Preprocessing (Built-in)** | Train Model (Built-in)


**Notes**: 
* This notebook should be used with the conda_amazonei_mxnet_p36 kernel
* This notebook is part of a series of notebooks beginning with `01_download_data` and `02_structuring_data`. From here on it will focus on SageMaker's built-in algorithms. The next notebook in this series is `04a_builtin_training`.
* You can also explore preprocessing with TensorFlow and PyTorch by running `03b_tensorflow_preprocessing` and `03c_pytorch_preprocessing`, respectively.

<pre>
</pre>

In this notebook we will explore two ways to format your image dataset for SageMaker's built-in algorithms. The first creates a manifest file for each of the train and validation sets. The second creates .REC files (RecordIO format), which are single binary files containing all the images for the train and validation sets. Since the RecordIO format is preferred, we will upload the .REC files to S3 for training in the next notebook.

<pre>
</pre>

## Overview
* #### [Dependencies](#ipg3a.1)
* #### [Application/x-image format](#ipg3a.2)
* #### [Application/x-recordio format](#ipg3a.3) (preferred format)
* #### [Upload the data to S3](#ipg3a.4)

<pre>
</pre>

<a id='ipg3a.1'></a>
## Dependencies
___

In [None]:
import uuid
import boto3
import shutil
import urllib.request
import pickle
import pathlib
import sagemaker
import subprocess

### Load Category Labels
The `category_labels` file was generated from the first notebook in this series `01_download_data.ipynb`. You will need to run that notebook before running the code here.

In [None]:
with open('pickled_data/category_labels.pickle', 'rb') as f:
    category_labels = pickle.load(f)

<pre>
</pre>

<a id='ipg3a.2'></a>
## Application/x-image format
___

This format is also referred to as "Image Format" or "LST" format. The benefit of using this format is that it doesn't require any modification or restructuring of your dataset. Instead, you create a manifest of the images for your training set and another for your validation set. These two manifests are separate `.lst` files. Each one lists all the images in its set, giving every image a unique index, the class it belongs to, and the relative path to the image file from the main training folder. Each row in a `.lst` file is tab-separated.

While it's the easiest format to use, it requires SageMaker to do more work behind the scenes. For datasets with many images, this will cause training to take longer. For datasets with fewer images, the performance difference isn't as pronounced.
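As a minimal sketch of the row layout described above (index, class label, relative path, separated by tabs), the helpers below build and parse a single `.lst` row. The function names, the file name, and the class id are hypothetical and used only for illustration.

```python
def format_lst_line(index, label, rel_path):
    """Build one tab-separated .lst row: index, class label, relative path."""
    return f"{index}\t{label}\t{rel_path}\n"


def parse_lst_line(line):
    """Split a .lst row back into (index, label, relative path)."""
    index, label, rel_path = line.rstrip("\n").split("\t")
    # Labels are parsed as floats so rows such as "1.000000" are also accepted
    return int(index), float(label), rel_path


# Hypothetical image path and class id, for illustration only
row = format_lst_line(0, 3, "cat/cat_001.jpg")
print(repr(row))
print(parse_lst_line(row))
```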

Below are two examples of how to create your .LST manifest files. One uses your own code and the other uses a script from MXNet. If you want to create .REC files of your images, you should skip to Option 2.

### Option 1: Manually generate the .LST files

In [None]:
category_ids = {name: idx for idx, name in enumerate(sorted(category_labels.values()))}
print(category_ids)

In [None]:
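image_paths = sorted(pathlib.Path('./data_structured').rglob('*.jpg'))

# Start from empty manifests so re-running this cell does not append duplicate rows
for split in {p.parts[-3] for p in image_paths}:
    open(f'{split}.lst', 'w').close()

for idx, p in enumerate(image_paths):
    image_id = f'{idx:010}'
    category = category_ids[p.parts[-2]]  # parent folder name is the class
    path = p.as_posix()                   # path as seen from the notebook's working directory
    split = p.parts[-3]                   # folder above the class folder, e.g. 'train' or 'val'
    with open(f'{split}.lst', 'a') as f:
        line = f"{image_id}\t{category}\t{path}\n"
        f.write(line)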

View the contents of the `train.lst` file

In [None]:
!head train.lst

<pre>
</pre>

### Option 2: Use im2rec.py script to generate the .LST files

In [None]:
script_url = 'https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py'
urllib.request.urlretrieve(script_url, "im2rec.py");

`python im2rec.py --list --recursive LST_FILE_PREFIX DATA_DIR`
* --list - generate an LST file
* --recursive - looks inside subfolders for image data
* LST_FILE_PREFIX - choose the name you want for the `.lst` file
* DATA_DIR - relative path to directory with the data

In [None]:
!python im2rec.py --list --recursive train data_structured/train

In [None]:
!python im2rec.py --list --recursive val data_structured/val

View the contents of the `train.lst` file

In [None]:
!head train.lst
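Beyond looking at the first few rows, it can help to check how many images each class has in the manifest. The sketch below assumes a single-label `.lst` file with the three tab-separated columns described earlier. The helper name `count_labels` is hypothetical.

```python
import collections
import pathlib


def count_labels(lst_path):
    """Count how many rows each class label has in a .lst manifest."""
    counts = collections.Counter()
    with open(lst_path) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            _index, label, _rel_path = line.rstrip("\n").split("\t")
            # Labels may be written as "0" or "0.000000"; normalize to int
            counts[int(float(label))] += 1
    return counts


# Assumes train.lst is in the working directory, as generated above
if pathlib.Path("train.lst").exists():
    for label, n in sorted(count_labels("train.lst").items()):
        print(f"class {label}: {n} images")
```

A heavily skewed count for one class is usually worth investigating before packing the data into `.rec` files.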

<pre>
</pre>

<a id='ipg3a.3'></a>
## Application/x-recordio (preferred format)
___
This format is commonly referred to as RecordIO. It creates a new file with the `.rec` suffix for each of your training and validation datasets. Each `.rec` file is a single file that contains all of the images in the dataset, so it can be streamed directly to the SageMaker training algorithm without the overhead of transferring thousands of individual files. For datasets with many images this can substantially reduce training time because SageMaker doesn't need to download all the image files before it can run the training algorithm. If you use the `im2rec.py` script, it can also resize the images for you. Resizing the images before saving them in the RecordIO format reduces the amount of data you need to transfer to S3 and speeds up training, because the resizing is done ahead of time instead of during training.

### 1. Run Option 2 from application/x-image above and copy LST files
Once you've run Option 2 from above then proceed below.

In [None]:
recordio_dir = pathlib.Path('./data_recordio')
recordio_dir.mkdir(exist_ok=True)
shutil.copy('train.lst', 'data_recordio/');
shutil.copy('val.lst', 'data_recordio/');

### 2. Generate .rec files in the RecordIO Format
Once the `.lst` file is generated, the same `im2rec.py` script will also generate the `.rec` file.

`python im2rec.py --resize 224 --quality 90 --num-thread 16 LST_FILE_PREFIX DATA_DIR/`
* **--resize**: Have the script resize the images (it scales the shorter edge to this size) before saving them all to a `.rec` file. For the image classification algorithm the default dimensions are 224x224. Resizing now will also reduce the size of your `.rec` file.
* **--quality**: JPEG quality (1-100) used when encoding the images. Lower values add more compression, which keeps the file size of your `.rec` down, especially if you're not resizing the images.
* **--num-thread**: Set how many threads to use to parallelize the work
* **LST_FILE_PREFIX**: Positional argument. Name of the `.lst` file (without the extension) you're referencing for creating the `.rec` file
* **DATA_DIR**: Positional argument. Relative path to the directory which holds the data listed in the `.lst` file
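The command above can also be assembled in Python and run with `subprocess`, which makes it easier to reuse for both splits. This is a sketch only: `build_im2rec_cmd` is a hypothetical helper, and it uses just the flags and positional arguments listed above.

```python
import subprocess
import sys


def build_im2rec_cmd(lst_prefix, data_dir, resize=224, quality=90, num_thread=16,
                     script="im2rec.py"):
    """Return the im2rec.py argument list for packing a .lst file into a .rec file."""
    return [
        sys.executable, script,
        "--resize", str(resize),
        "--quality", str(quality),
        "--num-thread", str(num_thread),
        lst_prefix,  # positional: prefix of the .lst file (no extension)
        data_dir,    # positional: directory that holds the listed images
    ]


cmd = build_im2rec_cmd("data_recordio/train", "data_structured/train")
print(" ".join(cmd))

# To actually run it (requires im2rec.py and its dependencies to be present):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues if a path contains spaces.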



#### Training dataset

In [None]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/train data_structured/train

#### Validation dataset

In [None]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/val data_structured/val
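Before uploading, a quick check that both `.rec` files were written can save a failed training job later. The sketch below assumes the files are named `train.rec` and `val.rec` inside `data_recordio`, matching the commands above. `summarize_recordio` is a hypothetical helper name.

```python
import pathlib


def summarize_recordio(directory, splits=("train", "val")):
    """Return {split: size of <split>.rec in bytes, or None if the file is missing}."""
    directory = pathlib.Path(directory)
    sizes = {}
    for split in splits:
        rec_file = directory / f"{split}.rec"
        sizes[split] = rec_file.stat().st_size if rec_file.exists() else None
    return sizes


for split, size in summarize_recordio("data_recordio").items():
    if size is None:
        print(f"{split}.rec is missing; re-run the im2rec.py cell for this split")
    else:
        print(f"{split}.rec: {size / 1e6:.1f} MB")
```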

<pre>
</pre>

<a id='ipg3a.4'></a>
## Upload the data to S3
___
In order for SageMaker's built-in algorithms to train on the data, it must be stored in an S3 bucket. Here, we will create a bucket, but you can use an existing bucket if you like by replacing the value of the `bucket_name` variable in the first line of the `else` statement below.

### Create a bucket for your project

In [None]:
if pathlib.Path('pickled_data/builtin_bucket_name.pickle').exists():
    with open('pickled_data/builtin_bucket_name.pickle', 'rb') as f:
        bucket_name = pickle.load(f)
        print('Bucket Name:', bucket_name)
else:
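    bucket_name = f'sagemaker-builtin-ic-{str(uuid.uuid4())}'
    s3 = boto3.resource('s3')
    region = sagemaker.Session().boto_region_name
    if region == 'us-east-1':
        # us-east-1 is the default region and rejects an explicit LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        bucket_config = {'LocationConstraint': region}
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=bucket_config)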
    
    with open('pickled_data/builtin_bucket_name.pickle', 'wb') as f:
        pickle.dump(bucket_name, f)
    print('Bucket Name:', bucket_name)

### Upload .rec files to S3

In [None]:
s3_uploader = sagemaker.s3.S3Uploader()

data_path = recordio_dir / 'train.rec'

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), 
    desired_s3_uri=f's3://{bucket_name}/data/train')

In [None]:
data_path = recordio_dir / 'val.rec'

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), 
    desired_s3_uri=f's3://{bucket_name}/data/val')

<pre>
</pre>

### Rollback to default version of SDK and TensorFlow
Only do this if you're done with this guide and want to use the same kernel for other notebooks with an incompatible version of the SageMaker SDK or TensorFlow.

In [None]:
# print(f'Original version: {original_sagemaker_version[0]}')
# print(f'Current version:  {sagemaker.__version__}')
# print('')
# print(f'Rolling back to {original_sagemaker_version[0]}')
# print('Restart notebook kernel to use changes.')
# print('')
# s = f'sagemaker=={original_sagemaker_version[0]}'
# !{sys.executable} -m pip install -q {s}

<pre>
</pre>

## Next Steps
Now that the training and validation data have been uploaded to S3, the next notebook will use SageMaker's built-in Image Classification algorithm to train a deep learning model to classify the animal images.