# Lab: SKLearn Migration Challenge

> *Run this notebook in the same environment you used for the 'Local Notebook': `Python 3 (Data Science 3.0)` on SageMaker Studio, or `conda_python3` on classic SageMaker Notebook Instances*

## Introduction
Your new colleague in the data science team (who isn't very familiar with SageMaker) has written a nice notebook to tackle a classification problem with scikit-learn: [Local Notebook.ipynb](Local%20Notebook.ipynb)

It works OK with the simple Iris data set they were working on before, but now they'd like to take advantage of some of the features of SageMaker to tackle bigger and harder challenges.

Can you help refactor the Local Notebook code, to show them how to use SageMaker effectively?

## Getting Started

First, check that you can run the [Local Notebook.ipynb](Local%20Notebook.ipynb) notebook all the way through, reviewing the steps it takes.

This notebook sets out a structure you can use to migrate code into, and lists out some of the changes you'll need to make at a high level. You can either work directly in here, or duplicate this notebook so you still have an unchanged copy of the original.

Try to work through the sections first with an MVP goal in mind (fitting the model to data in S3 via a SageMaker Training Job, and deploying/using the model through a SageMaker Endpoint). The goal is to understand the big picture of how you can bring your own code to SageMaker and scale your training and deployment. You can always build more advanced models or more complex training code later.

## SKLearn "script mode" training and serving

SageMaker provides [pre-built container images](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html) for a range of ML frameworks, including Scikit-Learn, which allow you to bring custom models without worrying about building and maintaining your own container images or serving stacks. You can even install extra libraries by [providing a requirements.txt file](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-third-party-libraries) if you want.

This pattern is sometimes called "framework mode" or ["script mode"](https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/) - separate from building fully-custom containers or using the pre-built algorithms.

The [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) handles transferring your script to a SageMaker training instance. On the training instance, SageMaker's native Scikit-Learn support sets up training-related environment variables and executes your training script. Script mode supports training with a Python script, a Python module, or a shell script.
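
As a rough illustration of how a script-mode entry point can pick up those environment variables, here is a minimal sketch. It assumes two channels named `train` and `test` (so SageMaker would set `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST`), reuses the hyperparameter names from the local test command later in this notebook, and falls back to placeholder local paths that are made up for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and the data/model locations provided by SageMaker."""
    parser = argparse.ArgumentParser()

    # Hyperparameters are passed to the script as command-line arguments.
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--min_samples_leaf", type=int, default=1)

    # Inside a training job these environment variables are set by SageMaker.
    # The local fallback paths are placeholders chosen for this sketch.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "./model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "./data/train"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST", "./data/test"))

    # argv=None makes argparse read sys.argv, as it would in a real job.
    return parser.parse_args(argv)
```

Because every location has a command-line override, the same script can be run locally with explicit `--train`, `--test` and `--model-dir` arguments before it is submitted as a training job.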

## Dependencies
Listing all our imports at the start keeps the requirements for running a script or file transparent up front, and is recommended by nearly every style guide, including Python's official [PEP 8](https://peps.python.org/pep-0008/#imports).

In [None]:
%load_ext autoreload
%autoreload 2

# Python Built-Ins:
import os

# External Dependencies:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# TODO: What else will you need?
# Have a look at the documentation: https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html
# to see which libraries need to be imported to use SageMaker and the SKLearn estimator


## Prepare the Data

Initial data preparation will be similar to what we did in the [Local Notebook.ipynb](Local%20Notebook.ipynb).

In [None]:
# TODO: Download the data from the internet


In [None]:
# TODO: Read in the data with the headers


In [None]:
# TODO: Split the data into train and test


## Set up the environment: Execution Role, Session and S3 Bucket
Now that we have downloaded and prepared the data in the local directory, we will need to upload it to Amazon S3 to make it available for Amazon SageMaker training.

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.

- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the `get_execution_role()` function from the SageMaker Python SDK.
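
One possible shape for this setup is sketched below. The helper name `setup_environment` and its optional `sm` parameter are inventions for this example (the parameter only exists so the function can be exercised without AWS access); the calls it makes, `Session()`, `get_execution_role()` and `default_bucket()`, are the SageMaker Python SDK functions described above.

```python
def setup_environment(sm=None):
    """Return (session, role, bucket) for the current SageMaker environment."""
    if sm is None:
        # Imported lazily so the sketch can be read without the SDK installed.
        import sagemaker as sm

    session = sm.Session()             # wraps the AWS region and credentials in use
    role = sm.get_execution_role()     # IAM role ARN attached to this notebook
    bucket = session.default_bucket()  # created on first use if it does not exist
    return session, role, bucket
```

In a notebook you would typically just call `session, role, bucket = setup_environment()` and print the bucket name to confirm where your data will be stored.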

In [None]:
# TODO: This is where you can set up the execution role, session and S3 bucket.

# Define SageMaker session

# Fetch SageMaker execution role

# Fetch the default bucket


## Upload Data to Amazon S3

Next is the part where you need to upload the data to Amazon S3 for SageMaker training.

Modern versions of Pandas support saving to S3 directly with `dataframe.to_csv("s3://{bucket_name}/{file_path}")`, provided the `s3fs` package is installed.

Alternatively, you can refer to the previous exercises for examples of copying files between S3 and local storage using the `aws s3 sync` CLI command or the boto3 SDK.

The high-level [`aws s3 sync` command](https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html) synchronizes the contents of a local folder to or from an S3 bucket/folder. You can use options like `--delete` to remove objects from the target that are not present in the source, and `--include` or `--exclude` to filter what files get copied.
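
If you prefer boto3 over the CLI, the upload could look something like the sketch below. The function name `upload_splits`, the `{prefix}/{channel}/{filename}` key layout, and the optional `s3_client` parameter are assumptions made for this example; `upload_file(Filename, Bucket, Key)` is the standard boto3 S3 client method.

```python
import os


def upload_splits(bucket, prefix, local_files, s3_client=None):
    """Upload each local file to s3://{bucket}/{prefix}/{channel}/ and return the object URIs."""
    if s3_client is None:
        # Imported lazily so the sketch can be tested with a stand-in client.
        import boto3
        s3_client = boto3.client("s3")

    uploaded = []
    for channel, local_path in local_files.items():
        # One S3 "folder" per channel keeps train and test data separate.
        key = f"{prefix}/{channel}/{os.path.basename(local_path)}"
        s3_client.upload_file(local_path, bucket, key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded
```

For example, `upload_splits(bucket, "iris", {"train": "data/train.csv", "test": "data/test.csv"})` would place each CSV under its own channel folder; the prefix and file names here are placeholders.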

In [None]:
# TODO: Import the AWS Python SDK, boto3

# TODO: Upload your `test` and `train` data splits to your SageMaker default S3 bucket


## Data Input ("Channels") Configuration
The draft code has 2 data sets: One for training, and one for test/validation. 

In SageMaker terminology, each input data set is a "channel" and we can name them however we like... Just make sure you're consistent about what you call each one!

For a simple input configuration, a channel spec might just be the S3 URI of the folder. For configuring more advanced options, there's the `TrainingInput` class (`sagemaker.inputs.TrainingInput`, called `s3_input` in version 1 of the SageMaker SDK).
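
A plain-string channel configuration could be sketched as follows. The helper name `build_channels` and the `{prefix}/train` and `{prefix}/test` folder layout are assumptions for this example; adjust them to match wherever you actually uploaded your data.

```python
def build_channels(bucket, prefix):
    """Map channel names to the S3 folders holding each data split."""
    # The dictionary keys are the channel names. Inside the training job each
    # channel is downloaded to /opt/ml/input/data/<channel> and its path is
    # exposed through the SM_CHANNEL_<CHANNEL> environment variable.
    return {
        "train": f"s3://{bucket}/{prefix}/train",
        "test": f"s3://{bucket}/{prefix}/test",
    }
```

To set extra options such as the content type, each URI could instead be wrapped, for instance as `sagemaker.inputs.TrainingInput(uri, content_type="text/csv")`.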

In [None]:
# TODO: Define your 2 data channels (train and test)
# The data can be found wherever you uploaded it in the previous step, e.g. "s3://{bucket_name}/{prefix}/train" and "s3://{bucket_name}/{prefix}/test"
# We can use either TrainingInput (which gives us additional configuration options), or a plain string:


## Algorithm ("Estimator") Configuration and Run
Instead of loading and fitting this data here in the notebook, we'll be creating an SKLearn Estimator through the SageMaker SDK, to run the code on a separate container that can be scaled as required.

The ["Using Scikit-learn with the SageMaker Python SDK"](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#using-scikit-learn-with-the-sagemaker-python-sdk) docs give a good overview of this process. You should run your estimator in script mode (which is easier to follow than the old default legacy mode) and as Python 3.

▶️ Use the **[src/main.py file](src/main.py) already prepared for you** in your local directory as your entry point to port code into. This includes a basic template, but with more TODOs you'll need to fill in.
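
An estimator definition along these lines might look like the sketch below. The instance type, the `framework_version` value `"1.2-1"`, and the helper name `build_estimator` with its optional `sklearn_cls` parameter are all assumptions made for this example; check the SDK documentation for the framework versions available in your region.

```python
def build_estimator(role, hyperparameters, sklearn_cls=None):
    """Configure an SKLearn script-mode estimator pointing at src/main.py."""
    if sklearn_cls is None:
        # Imported lazily so the sketch can be tested with a stand-in class.
        from sagemaker.sklearn.estimator import SKLearn as sklearn_cls

    return sklearn_cls(
        entry_point="main.py",            # script executed inside the container
        source_dir="src",                 # folder uploaded alongside the script
        role=role,                        # IAM role the training job runs as
        instance_count=1,
        instance_type="ml.m5.large",      # assumed instance type
        framework_version="1.2-1",        # assumed scikit-learn container version
        py_version="py3",
        hyperparameters=hyperparameters,  # passed to the script as --name value
    )
```

A call such as `build_estimator(role, {"n_estimators": 100, "min_samples_leaf": 3})` would make the script receive `--n_estimators 100 --min_samples_leaf 3`, matching the arguments used when testing it locally.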

In [None]:
# TODO: Define your estimator using the SKLearn framework


> ⚠️ **Before running the actual training job** on SageMaker, we suggest running your script locally using the example command below.
>
> This can help you find and fix errors faster, because you won't need to wait for the job to start up each time.

In [None]:
!python3 ./src/main.py --train ./data --test ./data --model-dir ./data/model --n_estimators=100 --min_samples_leaf=3

## Run the SageMaker Training Job

When you're ready to try your script in a SageMaker training job, you can call `estimator.fit(...)` as we did in previous exercises, specifying your input data location(s).

When training is complete, the training job will automatically upload the saved model to S3 for deployment.
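
As a small sketch of this step, the helper below starts the job and returns the location of the uploaded model artifact. The function name `run_training` is made up for this example; `fit()` and the `model_data` attribute are part of the SDK's estimator interface.

```python
def run_training(estimator, channels):
    """Run a training job on the given channels and return the model artifact URI."""
    # By default fit() blocks until the job finishes and streams its logs.
    estimator.fit(channels)
    # After a successful job, model_data points at the model archive in S3.
    return estimator.model_data
```

The `channels` argument is the same dictionary of channel names to S3 locations that you defined earlier.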

In [None]:
# TODO: Call the fit function, passing in the data you uploaded to S3 earlier


## Deploy and Use Your Model (Real-Time Inference)
We are now ready to deploy our model to SageMaker hosting services and make real-time predictions.
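
Deployment from a fitted estimator could be sketched as below. The helper names and the `ml.m5.large` instance type are assumptions for this example; `deploy()` and `delete_endpoint()` are the SDK calls doing the work.

```python
def deploy_model(estimator, instance_type="ml.m5.large"):
    """Create a real-time endpoint from a fitted estimator and return its predictor."""
    return estimator.deploy(initial_instance_count=1, instance_type=instance_type)


def clean_up(predictor):
    """Delete the endpoint so it stops incurring charges."""
    predictor.delete_endpoint()
```

Endpoints are billed for as long as they are running, so it is worth calling something like `clean_up(predictor)` once you have finished experimenting.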

In [None]:
# TODO: Deploy your trained model to a real time endpoint


Let's now send some data to our model to predict.

Note you'll need to send the correct input fields the model expects (X_test only, excluding label column), and will need to send it in a [format supported](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html#get-predictions) by the deployed endpoint.
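
To illustrate removing the label before calling the endpoint, here is a minimal sketch that works on plain Python dictionaries. The helper name `predict_rows` and the column names in the usage note are hypothetical; it assumes the endpoint accepts an array-like of feature rows, which the linked documentation describes as the default for Scikit-Learn predictors.

```python
def predict_rows(predictor, rows, label_column):
    """Send feature values (with the label column removed) to the endpoint."""
    features = [
        # Keep the remaining values in their original column order.
        [value for name, value in row.items() if name != label_column]
        for row in rows
    ]
    return predictor.predict(features)
```

With a Pandas DataFrame, the equivalent payload would be something like `test_df.drop(columns=[label_column]).to_numpy()`. A call might look like `predict_rows(predictor, [{"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}], "species")`, where the column names are placeholders.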

In [None]:
# TODO: Load some test data to test your model with, in the same format as it was trained on


In [None]:
# TODO: Invoke your endpoint and return the predictions
