# Pre-processing using a SKLearn Processor

> **Note**
> 
> This notebook has been tested using the `Python 3 (Data Science)` kernel in SageMaker Studio.

1. [Introduction](#Introduction)
2. [Prerequisites](#Prerequisites)
3. [Setup](#Setup)
4. [The Raw Dataset](#The-Raw-Dataset)
5. [The Data Labels](#The-Data-Labels)
6. [Defining a SageMaker Processing Job](#Defining-a-SageMaker-Processing-Job)
7. [Review Outputs](#Review-Outputs)

## Introduction

Data processing tasks such as feature engineering, data validation, model evaluation, and model interpretation are essential steps that engineers and data scientists perform in a machine learning workflow.

With Amazon SageMaker Processing jobs, you can run custom scripts for all of the above tasks in several popular frameworks, such as scikit-learn and Spark.

In this lab you will learn how to use [SKLearnProcessor](https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html), a helper class in the SageMaker Python SDK that lets you use a specific type of SageMaker Processing container. The SKLearnProcessor runs scikit-learn scripts in a container image provided and maintained by AWS to preprocess data or evaluate models.

![Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/images/Processing-1.png)

The example script will:
1. Load the bird dataset
2. Split data into train, validation, and test channels
3. Export the data and annotation files to S3

## Prerequisites

Download the notebook into your environment and run it by executing each cell in order. To understand what is happening, you will need:

- Access to the SageMaker default S3 bucket. All files related to this lab are stored under the `cv-sagemaker-immersionday` prefix of the bucket.
- Familiarity with Python and NumPy.
- Basic familiarity with Amazon S3.
- Basic understanding of Amazon SageMaker.
- Basic familiarity with the AWS Command Line Interface (CLI). Ideally, you should have it set up with credentials to access the AWS account you are running this notebook from.
- SageMaker Studio (preferred, for the full UI integration).

## Setup

Set up the environment, load the libraries, and define the parameters used throughout the notebook.

Run the cell below to ensure that the latest version of the SageMaker Python SDK is installed in your kernel.

In [None]:
!pip install -U sagemaker --quiet # Ensure latest version of SageMaker is installed

In [None]:
%%time
import sagemaker
from sagemaker import get_execution_role
import boto3

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
account = sagemaker_session.account_id()
role = get_execution_role()  # uses the function imported above

default_bucket = sagemaker_session.default_bucket() # or use your own custom bucket name
base_job_prefix = "cv-sagemaker-immersionday" # or define your own prefix

## The Raw Dataset
The dataset we are using is the [Caltech Birds (CUB 200 2011)](https://www.vision.caltech.edu/datasets/cub_200_2011/) dataset.

It contains 11,788 images across 200 bird species (the original technical report can be found [here](https://authors.library.caltech.edu/27452/)). 

Each species comes with around 60 images, with a typical size of about 350 pixels by 500 pixels. 

Bounding boxes are provided, as are annotations of bird parts. 

A recommended train/test split is given, but image size data is not.

![Bird Dataset](statics/birds.png)

Run the cell below to download the full dataset from a public S3 location and extract the folder structure. The archive is around 1.2 GB and can take a while to download. If you plan to complete the entire workshop, keep the downloaded data so that you do not have to download and process it again.

In [None]:
!wget 'https://s3.amazonaws.com/fast-ai-imageclas/CUB_200_2011.tgz'
!tar xopf CUB_200_2011.tgz
!rm CUB_200_2011.tgz

Run the cell below to upload the unzipped dataset to your SageMaker default bucket.

In [None]:
s3_raw_data = f's3://{default_bucket}/{base_job_prefix}/full/data'
!aws s3 cp --recursive ./CUB_200_2011 $s3_raw_data --quiet

### The Data Labels 

The dataset comes with bird class labels. They are encoded in two files:

- `classes.txt`, which gives the human-readable name of each class
- `image_class_labels.txt`, which gives the class of each image
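
As a minimal sketch of how the two files relate, the snippet below joins them into an image-to-class-name lookup. It assumes each line has the form `<id> <value>` separated by whitespace, and it uses a few illustrative sample rows written in that layout instead of reading the real files. The names `parse_pairs`, `CLASSES_TXT`, and `IMAGE_CLASS_LABELS_TXT` are hypothetical and not part of the workshop code.

```python
# Illustrative sample rows in the assumed "<id> <value>" layout.
# The real files live in CUB_200_2011/ and are much longer.
CLASSES_TXT = """1 001.Black_footed_Albatross
2 002.Laysan_Albatross
3 003.Sooty_Albatross"""

IMAGE_CLASS_LABELS_TXT = """1 1
2 1
61 2
121 3"""


def parse_pairs(text):
    """Parse lines of '<int> <value>' into a list of (int, str) tuples."""
    pairs = []
    for line in text.strip().splitlines():
        key, value = line.split(maxsplit=1)
        pairs.append((int(key), value))
    return pairs


# class_id -> human-readable class name
class_names = dict(parse_pairs(CLASSES_TXT))
# image_id -> class_id
image_to_class = {img: int(cls) for img, cls in parse_pairs(IMAGE_CLASS_LABELS_TXT)}
# image_id -> human-readable class name
image_to_name = {img: class_names[cls] for img, cls in image_to_class.items()}

print(image_to_name[61])
```

To apply the same join to the downloaded dataset, the sample strings could be replaced with the contents of the two files.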

In [None]:
!head CUB_200_2011/classes.txt

In [None]:
!head CUB_200_2011/image_class_labels.txt

If we did not have class labels for the images, we could have used SageMaker Ground Truth to label the data.

Ground Truth is a fully managed data labeling service in which you can launch a labeling job with just a few clicks in the console or with a single AWS SDK API call.

It provides 30+ labeling workflows for computer vision and NLP use cases, and also allows you to tap into different workforce options.

![SMGT](https://docs.aws.amazon.com/sagemaker/latest/dg/images/image-classification-example.png)

## Defining a SageMaker Processing Job

As mentioned before, we are going to practice using scikit-learn processing jobs. 

Because we are using a built-in SageMaker Scikit-learn container, the only thing you need to provide in addition is a Python script.

Please inspect the [preprocessing.py](preprocessing.py) script that has been provided for you.

The script:
- takes in the raw image files and splits them into training, validation, and test sets by class
- merges the class annotation files so that you have a manifest file for each separate data set
- exposes two parameters: `classes` (filters the classes you want to train the model on; the default is all classes) and `input-data` (the annotation file with the human-readable class names, here `classes.txt`)
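
The steps above can be sketched roughly as follows. This is not the code in `preprocessing.py`: it is a simplified illustration that assumes the `--classes` argument arrives as a comma-separated string and that the split is a random shuffle within each class. The 70/20/10 percentages, the seed, and the function names `parse_args` and `split_by_class` are assumptions made for the example.

```python
import argparse
import random
from collections import defaultdict


def parse_args(argv):
    """Parse the two script parameters described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--classes", type=str, default="")
    parser.add_argument("--input-data", type=str, default="classes.txt")
    args = parser.parse_args(argv)
    # "13, 17, 35" -> [13, 17, 35]; an empty string means "keep every class".
    args.class_ids = [int(c) for c in args.classes.split(",") if c.strip()]
    return args


def split_by_class(image_to_class, class_ids, percents=(70, 20, 10), seed=42):
    """Shuffle the images of each class and cut them into three channels.

    `percents` holds the assumed train/valid/test shares; the test channel
    receives whatever remains after the train and valid cuts.
    """
    by_class = defaultdict(list)
    for image_id, class_id in image_to_class.items():
        if not class_ids or class_id in class_ids:
            by_class[class_id].append(image_id)

    rng = random.Random(seed)
    splits = {"train": [], "valid": [], "test": []}
    for class_id in sorted(by_class):
        images = sorted(by_class[class_id])
        rng.shuffle(images)
        n_train = len(images) * percents[0] // 100
        n_valid = len(images) * percents[1] // 100
        splits["train"] += images[:n_train]
        splits["valid"] += images[n_train:n_train + n_valid]
        splits["test"] += images[n_train + n_valid:]
    return splits


args = parse_args(["--classes", "13, 17", "--input-data", "classes.txt"])
# Hypothetical labels: 10 images for each of classes 13, 17, and 35.
labels = dict(enumerate([13] * 10 + [17] * 10 + [35] * 10))
splits = split_by_class(labels, args.class_ids)
print({name: len(ids) for name, ids in splits.items()})
```

With the hypothetical labels above, classes 13 and 17 are kept and class 35 is filtered out, giving 14 training, 4 validation, and 2 test images. Splitting within each class keeps every selected class represented in all three channels.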

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor

from sagemaker.processing import (
    ProcessingInput,
    ProcessingOutput,
)

# Settings for the SKLearnProcessor preprocessing job
output_prefix = f'{base_job_prefix}/outputs'
output_s3_uri = f's3://{default_bucket}/{output_prefix}'

class_selection = '13, 17, 35, 36, 47, 68, 73, 87'
input_annotation = 'classes.txt'
processing_instance_type = "ml.m5.xlarge"
processing_instance_count = 1

sklearn_processor = SKLearnProcessor(base_job_name = f"{base_job_prefix}-preprocess",  # choose any name
                                    framework_version='0.20.0',
                                    role=role,
                                    instance_type=processing_instance_type,
                                    instance_count=processing_instance_count)

In [None]:
sklearn_processor.run(
    code='preprocessing.py',
    arguments=["--classes", class_selection, 
               "--input-data", input_annotation],
    inputs=[ProcessingInput(source=s3_raw_data, 
            destination="/opt/ml/processing/input")],
    outputs=[
            ProcessingOutput(source="/opt/ml/processing/output/train", destination = output_s3_uri +'/train'),
            ProcessingOutput(source="/opt/ml/processing/output/valid", destination = output_s3_uri +'/valid'),
            ProcessingOutput(source="/opt/ml/processing/output/test", destination = output_s3_uri +'/test'),
            ProcessingOutput(source="/opt/ml/processing/output/manifest", destination = output_s3_uri +'/manifest'),
        ],
    )

## Review Outputs

At the end of the lab, your dataset will be randomly split into train, valid, and test folders. You will also have a CSV manifest file for each channel.
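
One way to check the split at a glance is to count the objects under each channel folder. The sketch below groups a list of S3 keys by the first path component after the output prefix. The keys, the file names, and the helper `count_by_channel` are hypothetical and used only for illustration; in the notebook the keys would come from the S3 listing.

```python
from collections import Counter


def count_by_channel(keys, prefix):
    """Count keys per top-level folder (channel) below `prefix`."""
    counts = Counter()
    for key in keys:
        if not key.startswith(prefix):
            continue  # ignore keys outside the output prefix
        relative = key[len(prefix):].lstrip("/")
        channel = relative.split("/", 1)[0]
        counts[channel] += 1
    return counts


# Hypothetical keys that mimic the expected output layout.
sample_prefix = "cv-sagemaker-immersionday/outputs"
sample_keys = [
    f"{sample_prefix}/train/class_a/img_001.jpg",
    f"{sample_prefix}/train/class_a/img_002.jpg",
    f"{sample_prefix}/valid/class_a/img_003.jpg",
    f"{sample_prefix}/test/class_a/img_004.jpg",
    f"{sample_prefix}/manifest/train.csv",
]

for channel, count in sorted(count_by_channel(sample_keys, sample_prefix).items()):
    print(f"{channel}: {count}")
```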

Validate your results with the script below. 

**If you plan to complete other modules in this workshop, please keep this data. Otherwise, you can clean up after this lab.**

In [None]:
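s3_client = boto3.client("s3")

# list_objects_v2 returns at most 1,000 keys per call, so page through the results
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=default_bucket, Prefix=output_prefix):
    # "Contents" is absent when a page has no matching objects
    for file in page.get("Contents", []):
        print(f"file_name: {file['Key']}, size: {file['Size']}")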