# Preprocessing Images for Built-in Algorithms (Part 3/4)
Download | Structure | **Preprocess** | Train  


**Note**: this notebook should be used with the conda_amazonei_mxnet_p36 kernel

## Overview
* #### Dependencies
* #### Application/x-image
* #### Application/x-recordio
* #### Upload to S3

<pre>
</pre>

## Dependencies
___
For this guide we'll use version 2.9.2 of the SageMaker Python SDK. By default, SageMaker Notebooks come with version 1.72.0. Other guides provided by Amazon may be set up to work with other versions of the Python SDK, so you may wish to roll back to 1.72.0 when you're done.

#### Update the SageMaker Python SDK

In [36]:
import sys
original_sagemaker_version = !conda list | grep -E "sagemaker\s" | awk '{print $2}'
!{sys.executable} -m pip install -q "sagemaker==2.9.2" "opencv-python"

You should consider upgrading via the '/home/ec2-user/anaconda3/envs/amazonei_mxnet_p36/bin/python -m pip install --upgrade pip' command.


In [38]:
import uuid
import boto3
import shutil
import urllib.request  # the submodule must be imported explicitly to use urlretrieve
import pickle
import pathlib
import sagemaker
import subprocess

In [58]:
print(f'sagemaker updated  {original_sagemaker_version[0]} -> {sagemaker.__version__}')

sagemaker updated  1.72.1 -> 2.9.2


#### Load Category Labels
The `category_labels` file was generated by the first notebook in this series, `01_download_data.ipynb`. You will need to run that notebook before running the code here.

In [4]:
with open('pickled_data/category_labels.pickle', 'rb') as f:
    category_labels = pickle.load(f)

<pre>
</pre>

## application/x-image
___

This format is also referred to as "Image Format" or "LST" format. The benefit of using this format is that it doesn't require any modification or restructuring of your dataset. Instead, you create one manifest of the images for your training set and another for your validation set. These two manifests are separate `.lst` files. Each one lists all the images in its set, giving every image a unique index, the class it belongs to, and the relative path to the image file from the main training folder. The values on each line of a `.lst` file are tab-separated.

While it's the easiest format to use, it requires SageMaker to do more work behind the scenes. For datasets with many images, this will cause training to take longer. For datasets with fewer images, the performance difference isn't as pronounced.
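
To make the line layout concrete, here is a minimal sketch that writes and re-reads a few `.lst` lines. The helper names `format_lst` and `parse_lst` are hypothetical and the three rows are sample data only; this is not part of the SageMaker SDK or of `im2rec.py`.

```python
import csv
import io

# Sample rows only: (unique index, class id, relative image path)
rows = [
    (0, 4, 'dog/000000110395.jpg'),
    (1, 2, 'cat/000000186635.jpg'),
    (2, 0, 'bear/000000359337.jpg'),
]

def format_lst(rows):
    """Return the rows as tab-separated .lst text, one image per line."""
    return ''.join(f'{idx}\t{label}\t{path}\n' for idx, label, path in rows)

def parse_lst(text):
    """Parse .lst text back into (index, class id, path) tuples."""
    reader = csv.reader(io.StringIO(text), delimiter='\t')
    # Labels may be written as floats (e.g. 9.000000), so go through float()
    return [(int(idx), int(float(label)), path) for idx, label, path in reader]

lst_text = format_lst(rows)
print(lst_text, end='')
assert parse_lst(lst_text) == rows
```

Going through `float()` when parsing lets the same helper read labels written either as `4` or as `4.000000`.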

#### Option 1: Manually generate the .LST files

In [5]:
category_ids = {name: idx for idx, name in enumerate(sorted(category_labels.values()))}
print(category_ids)

{'bear': 0, 'bird': 1, 'cat': 2, 'cow': 3, 'dog': 4, 'elephant': 5, 'frog': 6, 'giraffe': 7, 'horse': 8, 'sheep': 9, 'zebra': 10}


In [10]:
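# Collect every JPEG under data_structured/<split>/<category>/
image_paths = pathlib.Path('./data_structured').rglob('*.jpg')

# The .lst files are opened in append mode, so delete any existing ones
# before re-running this cell to avoid duplicate entries
for idx, p in enumerate(image_paths):
    image_id = f'{idx:010}'               # unique, zero-padded index
    category = category_ids[p.parts[-2]]  # parent folder is the class name
    path = p.as_posix()
    split = p.parts[-3]                   # e.g. 'train' or 'val'
    with open(f'{split}.lst', 'a') as f:
        line = f"{image_id}\t{category}\t{path}\n"
        f.write(line)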

In [11]:
!head train.lst

0000000000	4	data_structured/train/dog/000000110395.jpg
0000000001	4	data_structured/train/dog/000000239881.jpg
0000000002	4	data_structured/train/dog/000000545720.jpg
0000000003	4	data_structured/train/dog/000000174305.jpg
0000000004	4	data_structured/train/dog/000000564053.jpg
0000000005	4	data_structured/train/dog/000000135748.jpg
0000000006	4	data_structured/train/dog/000000530624.jpg
0000000007	4	data_structured/train/dog/000000377575.jpg
0000000008	4	data_structured/train/dog/000000194540.jpg
0000000009	4	data_structured/train/dog/000000380893.jpg


<pre>
</pre>

#### Option 2: Use im2rec.py script to generate the .LST files

In [12]:
script_url = 'https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/im2rec.py'
urllib.request.urlretrieve(script_url, "im2rec.py");

`python im2rec.py --list --recursive LST_FILE_PREFIX DATA_DIR`
* --list - generate an LST file
* --recursive - looks inside subfolders for image data
* LST_FILE_PREFIX - choose the name you want for the `.lst` file
* DATA_DIR - relative path to directory with the data
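
If you'd rather call the script from Python than with `!python`, the same command can be assembled as an argument list and passed to `subprocess`. This is a sketch only: `build_list_command` and `generate_lst` are hypothetical helper names, and the script is assumed to be saved as `im2rec.py` in the working directory.

```python
import subprocess
import sys

def build_list_command(lst_prefix, data_dir, script='im2rec.py'):
    """Build the argument list that generates a .lst file with im2rec.py."""
    return [sys.executable, script, '--list', '--recursive', lst_prefix, data_dir]

def generate_lst(lst_prefix, data_dir, script='im2rec.py'):
    """Run im2rec.py; raises CalledProcessError if the script exits with an error."""
    command = build_list_command(lst_prefix, data_dir, script)
    # stdout/stderr pipes are used instead of capture_output for Python 3.6
    return subprocess.run(command, check=True, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)

print(build_list_command('train', 'data_structured/train'))
```

Calling `generate_lst('train', 'data_structured/train')` would then be equivalent to the shell command, with the class-to-index mapping available in the result's `stdout`.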

In [13]:
!python im2rec.py --list --recursive train data_structured/train

bear 0
bird 1
cat 2
cow 3
dog 4
elephant 5
frog 6
giraffe 7
horse 8
sheep 9
zebra 10


In [14]:
!python im2rec.py --list --recursive val data_structured/val

bear 0
bird 1
cat 2
cow 3
dog 4
elephant 5
frog 6
giraffe 7
horse 8
sheep 9
zebra 10


In [15]:
!head train.lst

1999	9.000000	sheep/000000574928.jpg
121	0.000000	bear/000000359337.jpg
474	2.000000	cat/000000186635.jpg
877	4.000000	dog/000000187167.jpg
1840	9.000000	sheep/000000090683.jpg
536	2.000000	cat/000000333929.jpg
1769	8.000000	horse/000000422878.jpg
1104	5.000000	elephant/000000262979.jpg
1174	5.000000	elephant/000000460403.jpg
1139	5.000000	elephant/000000362284.jpg


<pre>

</pre>
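
Before moving on, it can be worth checking how many images each class has in a generated `.lst` file. The sketch below assumes the three-column, tab-separated layout shown above; `count_labels` is a hypothetical helper, and the demonstration uses a small temporary file rather than the real `train.lst`.

```python
import collections
import pathlib
import tempfile

def count_labels(lst_path):
    """Count how many images each class id has in a .lst file."""
    counts = collections.Counter()
    with open(lst_path) as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            _index, label, _path = line.rstrip('\n').split('\t')
            # im2rec.py writes labels as floats, e.g. 9.000000
            counts[int(float(label))] += 1
    return counts

# Demonstration on a temporary file with the same layout as train.lst
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / 'sample.lst'
    sample.write_text('1999\t9.000000\tsheep/a.jpg\n'
                      '121\t0.000000\tbear/b.jpg\n'
                      '1840\t9.000000\tsheep/c.jpg\n')
    print(sorted(count_labels(sample).items()))  # [(0, 1), (9, 2)]
```

In the notebook you would call `count_labels('train.lst')` and compare the counts across classes to spot an unbalanced or incomplete split.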

## application/x-recordio (preferred)
___
This format is commonly referred to as RecordIO. It creates a new file with the `.rec` suffix for each of your training and validation datasets. The `.rec` file is a single file that contains all of the images in the dataset, so it can be streamed directly to the SageMaker training algorithm without the overhead involved in transferring thousands of individual files. For datasets with many images this provides a huge reduction in training time because SageMaker doesn't need to download all the image files before it can run the training algorithm. If you use the `im2rec.py` script, it can also resize the images for you. Resizing the files before saving them in the RecordIO format reduces the amount of data you need to transfer to S3, and it speeds up training because the resizing is done ahead of time instead of during training.

<pre>
</pre>

#### 1. Run Option 2 from application/x-image above and copy LST files

In [16]:
recordio_dir = pathlib.Path('./data_recordio')
recordio_dir.mkdir(exist_ok=True)
shutil.copy('train.lst', 'data_recordio/')
shutil.copy('val.lst', 'data_recordio/')

'data_recordio/val.lst'

#### 2. Generate .rec files in the RecordIO Format
Once the `.lst` file is generated, the same `im2rec.py` script will also generate the `.rec` file.

`python im2rec.py --resize 224 --quality 90 --num-thread 16 LST_FILE_PREFIX DATA_DIR/`
* **--resize**: Have the script resize the images before saving them all to a `.rec` file. For the image classification algorithm the default dimensions are 224x224. Resizing now will also reduce the size of your `.rec` file.
* **--quality**: JPEG quality (1-100) used when encoding the images. A lower value compresses the images more, which keeps the size of your `.rec` file down, especially if you're not resizing them.
* **--num-thread**: Set how many threads to use to parallelize the work
* **LST_FILE_PREFIX**: Name of the `.lst` file (without the extension) you're referencing for creating the `.rec` file
* **DATA_DIR**: Relative path to the directory which holds the data listed in the `.lst` file
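
As far as we can tell from the `im2rec.py` source, `--resize` scales the shorter edge of each image to the given size and keeps the aspect ratio, so the stored images aren't necessarily square. Treat that as an assumption and check it against the copy of the script you downloaded. The arithmetic can be sketched as follows, where `resized_shape` is a hypothetical helper and not part of the script:

```python
def resized_shape(height, width, resize):
    """Return (height, width) after scaling the shorter edge to `resize`."""
    if height > width:
        # Portrait image: width is the shorter edge
        return height * resize // width, resize
    # Landscape or square image: height is the shorter edge
    return resize, width * resize // height

print(resized_shape(480, 640, 224))  # (224, 298)
print(resized_shape(640, 480, 224))  # (298, 224)
print(resized_shape(500, 500, 224))  # (224, 224)
```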



#### Training dataset

In [17]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/train data_structured/train

Creating .rec file from /home/ec2-user/SageMaker/sagemaker-notebooks-WIP/sagemaker_image_data_guide/data_recordio/train.lst in /home/ec2-user/SageMaker/sagemaker-notebooks-WIP/sagemaker_image_data_guide/data_recordio
time: 0.2049543857574463  count: 0
time: 5.9312732219696045  count: 1000
time: 5.552445888519287  count: 2000


#### Validation dataset

In [18]:
!python im2rec.py --resize 224 --quality 90 --num-thread 16 data_recordio/val data_structured/val

Creating .rec file from /home/ec2-user/SageMaker/sagemaker-notebooks-WIP/sagemaker_image_data_guide/data_recordio/val.lst in /home/ec2-user/SageMaker/sagemaker-notebooks-WIP/sagemaker_image_data_guide/data_recordio
time: 0.0018703937530517578  count: 0
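
One quick sanity check after packing is to compare the number of images listed in the `.lst` file with the number of records written. The sketch below assumes that `im2rec.py` writes a `.idx` file next to each `.rec` file with one line per packed record; `count_lines` and `records_match` are hypothetical helpers, and the demonstration uses temporary files.

```python
import pathlib
import tempfile

def count_lines(path):
    """Count the non-empty lines in a text file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())

def records_match(prefix):
    """Check that PREFIX.idx has one entry per image listed in PREFIX.lst."""
    return count_lines(f'{prefix}.lst') == count_lines(f'{prefix}.idx')

# Demonstration with temporary files standing in for data_recordio/train.*
with tempfile.TemporaryDirectory() as tmp:
    prefix = pathlib.Path(tmp) / 'train'
    pathlib.Path(f'{prefix}.lst').write_text('0\t4.000000\tdog/a.jpg\n'
                                             '1\t2.000000\tcat/b.jpg\n')
    pathlib.Path(f'{prefix}.idx').write_text('0\t0\n1\t5120\n')
    print(records_match(prefix))  # True
```

In the notebook you would call `records_match('data_recordio/train')` and `records_match('data_recordio/val')`. A mismatch would suggest that some images listed in the `.lst` file could not be read and were left out of the `.rec` file.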


<pre>
</pre>

### Upload the data to S3

#### Create a bucket for your project

In [19]:
bucket_name_path = pathlib.Path('pickled_data/prebuilt_bucket_name.pickle')

if bucket_name_path.exists():
    # Reuse the bucket created on a previous run
    with open(bucket_name_path, 'rb') as f:
        bucket_name = pickle.load(f)
else:
    bucket_name = f'sagemaker-prebuilt-ic-{uuid.uuid4()}'
    s3 = boto3.resource('s3')
    region = sagemaker.Session().boto_region_name
    if region == 'us-east-1':
        # us-east-1 is the default region and rejects an explicit LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        bucket_config = {'LocationConstraint': region}
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=bucket_config)

    # Save the name so later runs (and later notebooks) reuse the same bucket
    with open(bucket_name_path, 'wb') as f:
        pickle.dump(bucket_name, f)

print('Bucket Name:', bucket_name)

Bucket Name: sagemaker-prebuilt-ic-2a789e07-853b-4b89-b0eb-8ea33550520e


#### Upload .rec files to S3


In [20]:
s3_uploader = sagemaker.s3.S3Uploader()

data_path = recordio_dir / 'train.rec'

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), 
    desired_s3_uri=f's3://{bucket_name}/data/train')

In [21]:
data_path = recordio_dir / 'val.rec'

data_s3_uri = s3_uploader.upload(
    local_path=data_path.as_posix(), 
    desired_s3_uri=f's3://{bucket_name}/data/val')

<pre>
</pre>

### Roll back to the default version of the SDK
Only do this if you're done with this guide and want to use the same kernel for other notebooks that are incompatible with this version of the SageMaker SDK.

In [None]:
# print(f'Original version: {original_sagemaker_version[0]}')
# print(f'Current version:  {sagemaker.__version__}')
# print('')
# print(f'Rolling back to {original_sagemaker_version[0]}')
# print('Restart notebook kernel to use changes.')
# print('')
# s = f'sagemaker=={original_sagemaker_version[0]}'
# !{sys.executable} -m pip install -q {s}