# Module 1. Data Augmentation
---

This notebook demonstrates image augmentation, a technique that increases the diversity of the training set by applying transforms such as affine transforms (rotation, shift, etc.) and blur. It uses the `albumentations` library.

- The API is very similar to the transforms in PyTorch's torchvision (you can learn it in 5-10 minutes)
- Documentation: https://albumentations.readthedocs.io/en/latest/
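
To make the idea concrete, here is a minimal sketch of two of the simplest augmentations, a horizontal flip and a 90-degree rotation, applied to a tiny grayscale "image" stored as a nested list. It uses plain Python rather than `albumentations`, and the helper names `hflip` and `rot90_clockwise` are illustrative only.

```python
def hflip(image):
    """Mirror each row left-to-right."""
    return [row[::-1] for row in image]


def rot90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    # Reverse the row order, then transpose.
    return [list(col) for col in zip(*image[::-1])]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = hflip(image)
rotated = rot90_clockwise(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
print(rotated)  # [[4, 1], [5, 2], [6, 3]]
```

Each variant shows the same content in a different arrangement, which is what lets a model see more variety than the raw data alone provides.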

This hands-on lab can be completed in about **10 minutes**.

<br>

# 1. Preparation
---

## Install and upgrade packages

If you are running this on a new Jupyter notebook instance, set `install_needed = True` in the code cell below and run it. After the kernel restarts, set `install_needed = False` again. You only need to do this once.

In [1]:
%store -z
%load_ext autoreload
%autoreload 2
%matplotlib inline
import sys
import logging
import IPython

install_needed = False
#install_needed = True

if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U smdebug albumentations torch==1.6.0 torchvision==0.7.0
    IPython.Application.instance().kernel.do_shutdown(True)
    
logging.basicConfig(
    format='%(asctime)s [%(levelname)s] %(name)s - %(message)s',
    level=logging.INFO,
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)

logger = logging.getLogger()    
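
If you are unsure whether the install step has already been run on this instance, one option is to check which packages are present before flipping `install_needed`. The following is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper `installed_version` is illustrative, and the check only tests for presence, not for the pinned versions used in the cell above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


required = ["albumentations", "torch", "torchvision", "smdebug"]
missing = [name for name in required if installed_version(name) is None]
print("missing packages:", missing)
```

An empty list suggests the packages are already installed and `install_needed` can stay `False`.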

In [2]:
import os
import glob2
import cv2
import numpy as np
import albumentations as A
from matplotlib import pyplot as plt

raw_dir = 'raw'
dataset_dir = 'bioplus'
classes = os.listdir(raw_dir)
num_classes = len(classes)
train_size = 0.8
num_augmentations = 2
!rm -rf {dataset_dir}
print(classes)

['brown_normal_korean', 'brown_abnormal_chinese', 'brown_abnormal_korean', 'red_normal', 'brown_normal_chinese', 'red_abnormal', 'no_box']


<br>

# 2. Data Augmentation
---


In [4]:
def _get_transforms_augmentation(cropsize_dim, resize_dim=500):
    """
    Declare an augmentation pipeline
    """
    transforms = A.Compose([
        A.CenterCrop(cropsize_dim, cropsize_dim),
        A.Resize(resize_dim, resize_dim),
        A.GaussNoise(p=0.4),
        A.RandomBrightnessContrast(p=0.2),
        A.OneOf([
            A.HorizontalFlip(p=0.5),
            A.RandomRotate90(p=0.5),
            A.VerticalFlip(p=0.5)           
        ], p=0.2),   
        A.OneOf([
            A.MotionBlur(p=.2),
            A.MedianBlur(blur_limit=3, p=0.1),
            A.Blur(blur_limit=3, p=0.1),
        ], p=0.3),  
        A.OneOf([
            A.CLAHE(clip_limit=2),
            A.Sharpen(),
            A.HueSaturationValue(p=0.3),           
        ], p=0.3),
        A.OneOf([
            A.Rotate(10, p=0.6),
            A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.2, rotate_limit=10, p=0.4),
        ], p=0.3),
    ], p=1.0)
    return transforms


def _make_augmented_images(f, write_path, phase, num_augmentations=10):
    """
    Write `num_augmentations` randomly augmented copies of the raw image `f` to `write_path`.
    Useful when you do not have enough raw data.
    """
    image = cv2.imread(f)
    if image is None:
        # cv2.imread returns None instead of raising when a file cannot be read
        logger.warning(f'[{phase}] Skipping unreadable image: {f}')
        return

    h, w = image.shape[:2]
    cropsize_dim = min(h, w)  # side length of the largest centered square crop

    filename = os.path.basename(f)
    filename_noext = os.path.splitext(filename)[0]
    logger.info(f'[{phase}] Augmenting image: {filename}')

    # Build the pipeline once per image; every call draws fresh random parameters
    transforms = _get_transforms_augmentation(cropsize_dim=cropsize_dim)
    for k in range(num_augmentations):
        transformed_image = transforms(image=image)["image"]
        cv2.imwrite(os.path.join(write_path, f'{filename_noext}_aug_{k:05d}.jpg'), transformed_image)
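
The `p` values inside `A.OneOf` are easy to misread. Under the usual `OneOf` behavior (worth verifying against your installed `albumentations` version), the block as a whole fires with its own `p`, and then exactly one child is chosen, with the children's `p` values normalized into selection weights. The sketch below works through that arithmetic for the blur group in plain Python; `oneof_probabilities` is a hypothetical helper, not part of `albumentations`.

```python
def oneof_probabilities(block_p, child_ps):
    """Effective per-image probability of each child in a OneOf-style block.

    Assumes the block fires with probability `block_p` and then picks exactly
    one child, with the children's p values normalized into selection weights.
    """
    total = sum(child_ps.values())
    return {name: block_p * p / total for name, p in child_ps.items()}


# p values copied from the blur group in the pipeline above
blur_group = {"MotionBlur": 0.2, "MedianBlur": 0.1, "Blur": 0.1}
effective = oneof_probabilities(block_p=0.3, child_ps=blur_group)
for name, prob in effective.items():
    print(f"{name}: {prob:.3f}")
```

Under this assumption, roughly 15% of images receive a motion blur and about 7.5% receive each of the other two blurs, so most images pass through this group unchanged.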

In [5]:
for c in classes:

    img_raw_path = os.path.join(raw_dir, c)
    img_train_path = os.path.join(dataset_dir, 'train', c)
    img_valid_path = os.path.join(dataset_dir, 'valid', c)

    os.makedirs(img_train_path, exist_ok=True)
    os.makedirs(img_valid_path, exist_ok=True)

    files = (glob2.glob(f"{img_raw_path}/*.jpg"))
    num_files = len(files)
    num_train_files = int(num_files * train_size)

    logger.info('-' * 70)   
    logger.info(f'Augmenting class: {c}')
    logger.info(f'img_train_path: {img_train_path}')
    logger.info(f'img_valid_path: {img_valid_path}')
    logger.info(f'num_raw_files={num_files}, num_raw_train_files={num_train_files}')
    logger.info('-' * 70)

    # training images
    for f in files[:num_train_files]:
        _make_augmented_images(f, img_train_path, 'train', num_augmentations)

    # validation images
    for f in files[num_train_files:]:
        _make_augmented_images(f, img_valid_path, 'valid', num_augmentations)
    
    logger.info('')

2021-08-23 11:27:04 [INFO] root - ----------------------------------------------------------------------
2021-08-23 11:27:04 [INFO] root - Augmenting class: brown_normal_korean
2021-08-23 11:27:04 [INFO] root - img_train_path: bioplus/train/brown_normal_korean
2021-08-23 11:27:04 [INFO] root - img_valid_path: bioplus/valid/brown_normal_korean
2021-08-23 11:27:04 [INFO] root - num_raw_files=105, num_raw_train_files=84
2021-08-23 11:27:04 [INFO] root - ----------------------------------------------------------------------
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk1_img_000218.jpg
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk2_img_000099.jpg
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk2_img_000021.jpg
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk1_img_000098.jpg
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk1_img_000195.jpg
2021-08-23 11:27:04 [INFO] root - [train] Augmenting image: bk1_img_000120.jpg


2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk2_img_000023.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk2_img_000188.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk1_img_000080.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk1_img_000436.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk1_img_000253.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk1_img_000013.jpg
2021-08-23 11:27:10 [INFO] root - [valid] Augmenting image: bk1_img_000036.jpg
2021-08-23 11:27:10 [INFO] root - 
2021-08-23 11:27:10 [INFO] root - ----------------------------------------------------------------------
2021-08-23 11:27:10 [INFO] root - Augmenting class: brown_abnormal_chinese
2021-08-23 11:27:10 [INFO] root - img_train_path: bioplus/train/brown_abnormal_chinese
2021-08-23 11:27:10 [INFO] root - img_valid_path: bioplus/valid/brown_abnormal_chinese
2021-08-23 11:27:10 [INFO] root - num_raw_files=61, num_ra

2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000116.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000142.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000044.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000164.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000163.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000429.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000181.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000113.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak1_img_000000.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000200.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000036.jpg
2021-08-23 11:27:16 [INFO] root - [train] Augmenting image: bak2_img_000006.jpg
2021-08-23 11:27:16 [INFO] root - [train

2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000029.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000145.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000237.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r2_img_000308.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r2_img_000036.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r2_img_000285.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000167.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000404.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r2_img_000305.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000238.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_img_000097.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r2_img_000133.jpg
2021-08-23 11:27:22 [INFO] root - [train] Augmenting image: r1_i

2021-08-23 11:27:28 [INFO] root - [valid] Augmenting image: r1_img_000463.jpg
2021-08-23 11:27:28 [INFO] root - [valid] Augmenting image: r1_img_000155.jpg
2021-08-23 11:27:28 [INFO] root - [valid] Augmenting image: r1_img_000064.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r2_img_000282.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r1_img_000402.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r2_img_000055.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r1_img_000168.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r2_img_000014.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r1_img_000345.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r2_img_000079.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r2_img_000005.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r1_img_000014.jpg
2021-08-23 11:27:29 [INFO] root - [valid] Augmenting image: r1_i

2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc2_img_000137.jpg
2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc2_img_000066.jpg
2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc2_img_000144.jpg
2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc2_img_000132.jpg
2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc1_img_000141.jpg
2021-08-23 11:27:34 [INFO] root - [valid] Augmenting image: bc2_img_000060.jpg
2021-08-23 11:27:34 [INFO] root - 
2021-08-23 11:27:34 [INFO] root - ----------------------------------------------------------------------
2021-08-23 11:27:34 [INFO] root - Augmenting class: red_abnormal
2021-08-23 11:27:34 [INFO] root - img_train_path: bioplus/train/red_abnormal
2021-08-23 11:27:34 [INFO] root - img_valid_path: bioplus/valid/red_abnormal
2021-08-23 11:27:34 [INFO] root - num_raw_files=173, num_raw_train_files=138
2021-08-23 11:27:34 [INFO] root - --------------------------------------------------------

2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra1_img_000490.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000382.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000415.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000099.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra1_img_000465.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra1_img_000379.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000132.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra1_img_000163.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000476.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000305.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra1_img_000316.jpg
2021-08-23 11:27:40 [INFO] root - [train] Augmenting image: ra2_img_000394.jpg
2021-08-23 11:27:41 [INFO] root - [train] Augmenting

2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00276.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00273.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00062.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00274.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00064.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00169.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00074.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00038.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00079.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00209.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00188.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00268.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00052.jpg
2021-08-23 11:27:46 [INFO] root - [train] Augmenting image: 00162.jpg
2021-08-23 11:27:46 

2021-08-23 11:27:52 [INFO] root - [train] Augmenting image: 00283.jpg
2021-08-23 11:27:52 [INFO] root - [train] Augmenting image: 00140.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00160.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00176.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00015.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00060.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00267.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00076.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00154.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00243.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00050.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00123.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00061.jpg
2021-08-23 11:27:53 [INFO] root - [train] Augmenting image: 00210.jpg
2021-08-23 11:27:53 

2021-08-23 11:27:59 [INFO] root - [valid] Augmenting image: 00236.jpg
2021-08-23 11:27:59 [INFO] root - [valid] Augmenting image: 00259.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00025.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00282.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00124.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00065.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00011.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00205.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00251.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00130.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00004.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00008.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00184.jpg
2021-08-23 11:28:00 [INFO] root - [valid] Augmenting image: 00020.jpg
2021-08-23 11:28:00 
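
The file counts in the log follow directly from the split logic in the loop: the first `int(num_files * train_size)` files go to training and the rest to validation, and each raw image yields `num_augmentations` output files. Here is a minimal sketch of that arithmetic, using counts reported in the log (`split_counts` is an illustrative helper, not part of the notebook):

```python
def split_counts(num_files, train_size=0.8, num_augmentations=2):
    """Return raw and augmented file counts for the train/valid split."""
    num_train = int(num_files * train_size)  # int() truncates toward zero
    num_valid = num_files - num_train
    return {
        "raw_train": num_train,
        "raw_valid": num_valid,
        "aug_train": num_train * num_augmentations,
        "aug_valid": num_valid * num_augmentations,
    }


for n in (105, 173):
    print(n, split_counts(n))
```

Because the loop takes files in whatever order `glob2` returns them, you may want to shuffle the list with a fixed seed before splitting if the file order could correlate with image content.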

## Copy data to S3

Copy the dataset to S3. Here the image files are copied as they are; for more efficient training, consider converting them to a record format such as TFRecord or RecordIO in the future.

In [6]:
import sagemaker
bucket = sagemaker.Session().default_bucket()
s3_path = f's3://{bucket}/{dataset_dir}'

In [7]:
%%time
!aws s3 cp {dataset_dir} {s3_path} --recursive --quiet

CPU times: user 129 ms, sys: 17.3 ms, total: 146 ms
Wall time: 11.8 s
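
With `--recursive`, `aws s3 cp` preserves the directory layout under the destination prefix, so each class folder ends up under `train/` or `valid/` in the bucket. The sketch below reproduces that path mapping in plain Python without contacting AWS; the bucket name `my-bucket` and the helper `local_to_s3_uris` are placeholders for illustration.

```python
import tempfile
from pathlib import Path


def local_to_s3_uris(local_dir, s3_prefix):
    """Map every file under `local_dir` to the S3 URI a recursive copy would produce."""
    local_dir = Path(local_dir)
    return {
        str(path): f"{s3_prefix}/{path.relative_to(local_dir).as_posix()}"
        for path in sorted(local_dir.rglob("*"))
        if path.is_file()
    }


# Build a throwaway directory that mimics the dataset layout.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "bioplus" / "train" / "red_normal" / "r1_img_000029_aug_00000.jpg"
    sample.parent.mkdir(parents=True)
    sample.touch()
    uris = local_to_s3_uris(Path(tmp) / "bioplus", "s3://my-bucket/bioplus")

print(list(uris.values()))
```

Knowing the resulting layout is useful later, because training jobs typically point at the `train` and `valid` prefixes separately.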


## Store class map as JSON

Store the class dictionary as JSON. This file will be useful for model inference later.

In [9]:
import src.train_utils as train_utils
classes, classes_dict = train_utils.get_classes(f'./{dataset_dir}/train') 
train_utils.save_classes_dict(classes_dict, 'classes_dict.json')
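
The helpers in `src.train_utils` are not shown in this notebook. As a rough sketch of what such a class map might look like, the code below assumes (hypothetically) that each subdirectory of the training folder is one class and that indices are assigned in sorted order; the real `get_classes` and `save_classes_dict` may behave differently.

```python
import json
import os
import tempfile


def get_classes_sketch(train_dir):
    """Hypothetical stand-in: one class per subdirectory, indexed in sorted order."""
    classes = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d))
    )
    return classes, {name: idx for idx, name in enumerate(classes)}


def save_classes_dict_sketch(classes_dict, path):
    """Hypothetical stand-in: write the class map as JSON."""
    with open(path, "w") as fp:
        json.dump(classes_dict, fp, indent=2)


# Demonstrate on a throwaway directory with three of the class names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("red_normal", "red_abnormal", "no_box"):
        os.makedirs(os.path.join(tmp, "train", name))
    classes_demo, classes_dict_demo = get_classes_sketch(os.path.join(tmp, "train"))
    out_path = os.path.join(tmp, "classes_dict.json")
    save_classes_dict_sketch(classes_dict_demo, out_path)
    with open(out_path) as fp:
        reloaded = json.load(fp)

print(reloaded)
```

At inference time, a map like this can be inverted to translate a predicted index back into a class name.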

In [10]:
%store bucket dataset_dir raw_dir classes num_classes

Stored 'bucket' (str)
Stored 'dataset_dir' (str)
Stored 'raw_dir' (str)
Stored 'classes' (list)
Stored 'num_classes' (int)


<br>

# Next Step

With the training data ready, it is time to develop and train the model. If you are unfamiliar with PyTorch, proceed to `02_local_trining.ipynb` first. If you are already familiar with PyTorch, skip `02_local_trining.ipynb` and go directly to `03_sm_training.ipynb`.