# Preprocessing Using SKLearnProcessor

1. [Introduction](#Introduction)
2. [Prerequisites](#Prerequisites)
3. [Setup](#Setup)
4. [Dataset](#Dataset)
5. [Build a SageMaker Processing Job](#Build-a-SageMaker-Processing-Job)
    1. [Review Scikit-learn Script](#Scikit-learn-Script)
    2. [Configure Processing Job](#Configure-Processing-Job)
6. [Review Outputs](#Review-Outputs)

## Introduction
Preprocessing a dataset before model training is an important step in the overall MLOps process. In this lab you will learn how to use [SKLearnProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html), a type of SageMaker processor that runs scikit-learn scripts in a container image provided and maintained by SageMaker to preprocess data or evaluate models.

The example script first loads the bird dataset, then splits the data into train, validation, and test channels, and finally exports the data and annotation files to S3.


**Note: This notebook was tested on the Data Science kernel in SageMaker Studio.**

## Prerequisites

Download the notebook into your environment; you can run it by simply executing each cell in order. To understand what's happening, you'll need:

- Access to the SageMaker default S3 bucket. All the files related to this lab will be stored under the "preprocess" prefix of the bucket (the `base_job_prefix` defined in the Setup section).
- Familiarity with Python and NumPy.
- Basic familiarity with Amazon S3.
- Basic understanding of Amazon SageMaker.
- Basic familiarity with the AWS Command Line Interface (CLI). Ideally, you should have it set up with credentials to access the AWS account you're running this notebook from.
- SageMaker Studio, which is preferred for the full UI integration.

## Setup

Set up the environment, load the libraries, and define the parameters used throughout the notebook.

Run the cell below to ensure the latest version of the SageMaker Python SDK is installed in your kernel.

In [None]:
!pip install -U sagemaker --quiet # Ensure latest version of SageMaker is installed

In [5]:
import sagemaker
from sagemaker import get_execution_role
import boto3

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
account = sagemaker_session.account_id()
role = sagemaker.get_execution_role()

default_bucket = sagemaker_session.default_bucket() # or use your own custom bucket name
base_job_prefix = "preprocess" # or define your own prefix

## Dataset
This lab uses the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/visipedia/CUB-200-2011.html) dataset, which contains 11,788 images across 200 bird species (details are in the original technical report). Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. Bounding boxes are provided, as are annotations of bird parts. A recommended train/test split is given, but image size data is not.

![Bird Dataset](statics/birds.png)

Run the cell below to download the full dataset, or download it manually [here](https://course.fast.ai/datasets). Note that the file is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, keep the data to avoid downloading and processing it again.

In [None]:
!wget 'https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz' --no-check-certificate
!tar xopf CUB_200_2011.tgz
!rm CUB_200_2011.tgz

Upload the extracted dataset to S3.

In [None]:
s3_raw_data = f's3://{default_bucket}/{base_job_prefix}/full/data'
!aws s3 cp --recursive ./CUB_200_2011 $s3_raw_data

## Build a SageMaker Processing Job
There are three types of processing jobs, depending on which framework you want to use; see the [AWS documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html).

For this example, we are going to practice using scikit-learn processing. This uses the SageMaker built-in scikit-learn container, so all you need to provide is a Python script.

### Scikit-learn Script
The script takes in the raw image files and splits them into training, validation, and test sets by class. It also splits the annotation file so you have a manifest file for each set.
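
To make the per-class split concrete, here is a minimal standard-library sketch of the same idea. It is not the lab's script: the helper name `split_by_class`, the fixed seed, and the toy file names are assumptions made for illustration. Each class is shuffled and divided on its own, so every class shows up in every channel.

```python
import random
from collections import defaultdict


def split_by_class(items, splits=(0.6, 0.2, 0.2), seed=0):
    """Split (file_name, class_id) pairs into train/valid/test, class by class.

    Hypothetical helper for illustration; the lab's script does the same
    kind of split with pandas.
    """
    rng = random.Random(seed)

    # Group file names by class so each class is split independently
    by_class = defaultdict(list)
    for file_name, class_id in items:
        by_class[class_id].append(file_name)

    channels = {"train": [], "valid": [], "test": []}
    for class_id, files in by_class.items():
        files = list(files)
        rng.shuffle(files)

        # Take the training share first, then divide the remainder
        # between test and validation in proportion to their ratios
        n_train = round(len(files) * splits[0])
        rest = files[n_train:]
        n_test = round(len(rest) * splits[2] / (splits[1] + splits[2]))

        channels["train"] += [(f, class_id) for f in files[:n_train]]
        channels["test"] += [(f, class_id) for f in rest[:n_test]]
        channels["valid"] += [(f, class_id) for f in rest[n_test:]]
    return channels


# Toy data: 2 classes with 10 made-up image names each
items = [(f"class_{c}/img_{i}.jpg", c) for c in (1, 2) for i in range(10)]
channels = split_by_class(items)
sizes = {name: len(rows) for name, rows in channels.items()}
print(sizes)  # {'train': 12, 'valid': 4, 'test': 4}
```

With ratios of 0.6/0.2/0.2, each class of 10 images contributes 6 training, 2 validation, and 2 test images.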

In [13]:
%%writefile preprocessing.py

import logging

import pandas as pd
import argparse
import boto3
import json
import os
import shutil

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

input_path = "/opt/ml/processing/input"    # for a local test run, point this at "CUB_200_2011"
output_path = "/opt/ml/processing/output"  # for a local test run, point this at "output"
IMAGES_DIR   = os.path.join(input_path, 'images')
SPLIT_RATIOS = (0.6, 0.2, 0.2)


# Split a dataframe into three separate dataframes (train, validation, test).
# Each label is sampled independently so every class appears in every split.

def split_to_train_val_test(df, label_column, splits=(0.7, 0.2, 0.1), verbose=False):
    train_parts, val_parts, test_parts = [], [], []

    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]

        lbl_train_df        = lbl_df.sample(frac=splits[0])
        lbl_val_and_test_df = lbl_df.drop(lbl_train_df.index)
        lbl_test_df         = lbl_val_and_test_df.sample(frac=splits[2]/(splits[1] + splits[2]))
        lbl_val_df          = lbl_val_and_test_df.drop(lbl_test_df.index)

        if verbose:
            print('\n{}:\n---------\ntotal:{}\ntrain_df:{}\nval_df:{}\ntest_df:{}'.format(
                lbl, len(lbl_df), len(lbl_train_df), len(lbl_val_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        val_parts.append(lbl_val_df)
        test_parts.append(lbl_test_df)

    # DataFrame.append was removed in pandas 2.0; pd.concat works on old and new versions
    train_df = pd.concat(train_parts) if train_parts else pd.DataFrame()
    val_df   = pd.concat(val_parts) if val_parts else pd.DataFrame()
    test_df  = pd.concat(test_parts) if test_parts else pd.DataFrame()

    # shuffle them on the way out using .sample(frac=1)
    return train_df.sample(frac=1), val_df.sample(frac=1), test_df.sample(frac=1)

# Load the image and label manifest files into a dataframe, call split_to_train_val_test
# above, and return the three resulting dataframes
def get_train_val_dataframes(BASE_DIR, classes, split_ratios):
    CLASSES_FILE = os.path.join(BASE_DIR, 'classes.txt')
    IMAGE_FILE   = os.path.join(BASE_DIR, 'images.txt')
    LABEL_FILE   = os.path.join(BASE_DIR, 'image_class_labels.txt')

    images_df = pd.read_csv(IMAGE_FILE, sep=' ',
                            names=['image_pretty_name', 'image_file_name'],
                            header=None)
    image_class_labels_df = pd.read_csv(LABEL_FILE, sep=' ',
                                names=['image_pretty_name', 'orig_class_id'], header=None)

    # Merge the metadata into a single flat dataframe for easier processing
    full_df = pd.DataFrame(images_df)

    full_df.reset_index(inplace=True, drop=True)
    full_df = pd.merge(full_df, image_class_labels_df, on='image_pretty_name')

    # grab a small subset of species for testing
    criteria = full_df['orig_class_id'].isin(classes)
    full_df = full_df[criteria]
    print('Using {} images from {} classes'.format(full_df.shape[0], len(classes)))

    unique_classes = full_df['orig_class_id'].drop_duplicates()
    sorted_unique_classes = sorted(unique_classes)
    id_to_one_based = {}
    i = 1
    for c in sorted_unique_classes:
        id_to_one_based[c] = str(i)
        i += 1

    full_df['class_id'] = full_df['orig_class_id'].map(id_to_one_based)
    full_df.reset_index(inplace=True, drop=True)

    def get_class_name(fn):
        return fn.split('/')[0]
    full_df['class_name'] = full_df['image_file_name'].apply(get_class_name)
    full_df = full_df.drop(['image_pretty_name'], axis=1)

    # split into training, validation, and test sets
    train_df, val_df, test_df = split_to_train_val_test(full_df, 'class_id', split_ratios)

    print('num images total: ' + str(images_df.shape[0]))
    print('\nnum train: ' + str(train_df.shape[0]))
    print('num val: ' + str(val_df.shape[0]))
    print('num test: ' + str(test_df.shape[0]))
    return train_df, val_df, test_df

# Copy the images for one channel into that channel's output folder
def copy_files_for_channel(df, channel_name, verbose=False):
    print('\nCopying files for {} images in channel: {}...'.format(df.shape[0], channel_name))
    for i in range(df.shape[0]):
        target_fname = df.iloc[i]['image_file_name']
        if verbose:
            print(target_fname)
        src = "{}/{}".format(IMAGES_DIR, target_fname)
        dst = "{}/{}/{}".format(output_path, channel_name, target_fname)
        shutil.copyfile(src, dst)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--classes", type=str, default="")
    parser.add_argument("--input-data", type=str, default="classes.txt")
    args, _ = parser.parse_known_args()


    # "".split(',') returns [''], so drop empty entries before testing for an empty selection
    c_list = [c.strip() for c in args.classes.split(',') if c.strip()]
    input_data = args.input_data

    CLASSES_FILE = os.path.join(input_path, input_data)

    CLASS_COLS      = ['class_number','class_id']

    if len(c_list)==0:
        # No selection given: use the full set of 200 species
        CLASSES = list(range(1, 201))
    else:
        CLASSES = list(map(int, c_list))

            
    classes_df = pd.read_csv(CLASSES_FILE, sep=' ', names=CLASS_COLS, header=None)

    criteria = classes_df['class_number'].isin(CLASSES)
    classes_df = classes_df[criteria]

    class_name_list = sorted(classes_df['class_id'].unique().tolist())
    print(class_name_list)
    
    
    train_df, val_df, test_df = get_train_val_dataframes(input_path, CLASSES, SPLIT_RATIOS)
        
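    for c in class_name_list:
        # makedirs also creates the channel folder if it does not exist yet
        os.makedirs('{}/{}/{}'.format(output_path, 'valid', c), exist_ok=True)
        os.makedirs('{}/{}/{}'.format(output_path, 'test', c), exist_ok=True)
        os.makedirs('{}/{}/{}'.format(output_path, 'train', c), exist_ok=True)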

    copy_files_for_channel(val_df,   'valid')
    copy_files_for_channel(test_df,  'test')
    copy_files_for_channel(train_df, 'train')
    
    # export a manifest file for each channel
    os.makedirs("{}/manifest".format(output_path), exist_ok=True)
    train_m_file = "{}/manifest/train.csv".format(output_path)
    train_df.to_csv(train_m_file, index=False)
    test_m_file = "{}/manifest/test.csv".format(output_path)
    test_df.to_csv(test_m_file, index=False)
    val_m_file = "{}/manifest/valid.csv".format(output_path)
    val_df.to_csv(val_m_file, index=False)
    
    print("Finished running processing job")

Writing preprocessing.py


### Configure Processing Job

In [6]:
from sagemaker.sklearn.processing import SKLearnProcessor

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
)
import time 

timestamp = str(int(time.time()))
# Timestamped S3 prefix so each run writes its outputs to a separate location
output_prefix = f'{base_job_prefix}/outputs/{timestamp}'
output_s3_uri = f's3://{default_bucket}/{output_prefix}'

class_selection = '13, 17, 35, 36, 47, 68, 73, 87'
input_annotation = 'classes.txt'
processing_instance_type = "ml.m5.xlarge"
processing_instance_count = 1

sklearn_processor = SKLearnProcessor(base_job_name = f"{base_job_prefix}-preprocess",  # choose any name
                                    framework_version='0.20.0',
                                    role=role,
                                    instance_type=processing_instance_type,
                                    instance_count=processing_instance_count)

In [None]:
sklearn_processor.run(
    code='preprocessing.py',
    arguments=["--classes", class_selection, 
               "--input-data", input_annotation],
    inputs=[ProcessingInput(source=s3_raw_data, 
            destination="/opt/ml/processing/input")],
    outputs=[
            ProcessingOutput(source="/opt/ml/processing/output/train", destination = output_s3_uri +'/train'),
            ProcessingOutput(source="/opt/ml/processing/output/valid", destination = output_s3_uri +'/valid'),
            ProcessingOutput(source="/opt/ml/processing/output/test", destination = output_s3_uri +'/test'),
            ProcessingOutput(source="/opt/ml/processing/output/manifest", destination = output_s3_uri +'/manifest'),
        ],
    )
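
The `arguments` list passed to `run` reaches the script as ordinary command-line arguments. The sketch below shows one way the `--classes` string could be turned into a list of integers, including the case where no classes are given. The helper name `parse_classes` is an assumption for illustration and is not part of the SageMaker SDK.

```python
import argparse


def parse_classes(argv):
    """Parse the same two arguments the processing script accepts.

    Hypothetical helper: returns (list of class numbers, annotation file name).
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--classes", type=str, default="")
    parser.add_argument("--input-data", type=str, default="classes.txt")
    args, _ = parser.parse_known_args(argv)

    # Skip empty entries so that an empty string yields an empty list;
    # int() tolerates the spaces around each number
    selected = [int(c) for c in args.classes.split(",") if c.strip()]
    return selected, args.input_data


selected, annotation_file = parse_classes(
    ["--classes", "13, 17, 35", "--input-data", "classes.txt"]
)
print(selected)           # [13, 17, 35]
print(parse_classes([]))  # ([], 'classes.txt')
```

An empty result can then be treated as "use all classes" by the caller.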

## Review Outputs

At the end of the lab, your dataset will be randomly split into train, valid, and test folders. You will also have a CSV manifest file for each channel. Validate your results with the script below. **If you plan to complete other modules in this workshop, please keep this data. Otherwise, you can clean up after this lab.**

In [None]:
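s3_client = boto3.client("s3")

# list_objects_v2 returns at most 1,000 keys per call, so page through the results
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=default_bucket, Prefix=output_prefix):
    # "Contents" is missing when a page has no objects under the prefix
    for file in page.get("Contents", []):
        print(f"file_name: {file['Key']}, size: {file['Size']}")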
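
Beyond listing the objects, a manifest can be checked for class balance once it has been downloaded. The following sketch counts rows per class using only the standard library. The column names match what `preprocessing.py` writes, but the helper name `count_by_class` and the sample rows (including the file names) are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_class(manifest_text):
    """Count manifest rows per class_name (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    return Counter(row["class_name"] for row in reader)


# Illustrative manifest contents; real manifests come from the job's
# "manifest" output and contain one row per image
sample_manifest = """image_file_name,orig_class_id,class_id,class_name
013.Bobolink/example_0001.jpg,13,1,013.Bobolink
013.Bobolink/example_0002.jpg,13,1,013.Bobolink
017.Cardinal/example_0001.jpg,17,2,017.Cardinal
"""

counts = count_by_class(sample_manifest)
for class_name, n in sorted(counts.items()):
    print(f"{class_name}: {n}")
```

Running the same count on the train, valid, and test manifests gives a quick view of whether each class is represented in every channel.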