# SageMaker Scikit-learn


First, let's create our SageMaker session and role, and create an S3 prefix to use for the notebook example.

In [1]:
import sagemaker
from sagemaker import get_execution_role
from datetime import datetime

sagemaker_session = sagemaker.Session()

smclient = sagemaker_session.sagemaker_client

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Upload the data for training <a class="anchor" id="upload_data"></a>

For the purposes of this example, we're using the classic Iris dataset, which is included with Scikit-learn. We will load the dataset, write it to a local CSV file, and then upload that file to S3 to use for training.

In [2]:
import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs('./data', exist_ok=True)
np.savetxt('./data/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')

Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to the session's default bucket.

In [3]:
# The data goes to the session's default bucket; this value is only the S3 key prefix.
prefix = 'Scikit-iris'
folder = 'data'

train_input = sagemaker_session.upload_data(folder, key_prefix=f"{prefix}/{folder}")

## Create a Scikit-learn script to train <a class="anchor" id="create_sklearn_script"></a>
SageMaker can now run a scikit-learn script using the `SKLearn` estimator. When executed on SageMaker, a number of helpful environment variables are available to access properties of the training environment, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.
* `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the `SKLearn` estimator's `fit()` method, the following environment variables will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel
* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.
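
As a minimal sketch of that pattern, the function below reads one hyperparameter and two SageMaker paths. The hyperparameter name `--max_depth` and the local fallback directories are assumptions made for illustration; they are not SageMaker defaults. Using `os.environ.get` rather than `os.environ[...]` lets the same script also run outside SageMaker, where the `SM_*` variables are not set.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker paths, with fallbacks for local runs."""
    parser = argparse.ArgumentParser()

    # Hypothetical hyperparameter, used only for illustration.
    parser.add_argument('--max_depth', type=int, default=3)

    # os.environ.get avoids a KeyError when the SM_* variables are missing.
    # The fallback directories are assumptions for local testing.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', './data'))

    return parser.parse_args(argv)


# Simulate the arguments SageMaker would pass for hyperparameters={'max_depth': 5}.
args = parse_args(['--max_depth', '5'])
print(args.max_depth, args.model_dir, args.train)
```

Passing an explicit argument list makes the parser easy to exercise in a notebook cell; in a real training script the call would be `parse_args()` inside the main guard.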

Because the Scikit-learn container imports your training script, you should always put your training code in a main guard (`if __name__ == '__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.

For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers.

For example, the script that we will run in this notebook is shown below:

In [None]:
%%writefile preprocess_train.py

import argparse
import pandas as pd
import os

from sklearn import tree
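try:
    import joblib  # standalone package, required by scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # bundled with older scikit-learn releases
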

def load_preprocess():
    """
        Take the set of files and read them all into a single pandas dataframe
    """
    input_files = [ os.path.join(args.train, file) for file in os.listdir(args.train) ]
    if len(input_files) == 0:
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel ({}) was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(args.train, "train"))
    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    train_data = pd.concat(raw_data)

    # labels are in the first column
    # .ix was removed from pandas; .iloc selects by position
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]
    
    return train_X, train_y


def train(train_X, train_y):
    """
        Now use scikit-learn's decision tree classifier to train the model.
    """
    
    classifier = tree.DecisionTreeClassifier(max_leaf_nodes = args.max_leaf_nodes,
                                             min_samples_leaf = args.min_samples_leaf)
    classifier = classifier.fit(train_X, train_y)
    
    return classifier
    
    
def model_fn(model_dir):
    """
        Load the saved model from model_dir and return it. This function
        must be provided in your script.
    """
    classifier = joblib.load(os.path.join(model_dir, "model.joblib"))
    return classifier


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are including two hyperparameters.
    parser.add_argument('--max_leaf_nodes', type=int, default=30)
    parser.add_argument('--min_samples_leaf', type=int, default=1)

    # SageMaker-specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # Preprocess the data
    train_X, train_y = load_preprocess()
    
    # Train our classifier
    classifier = train(train_X, train_y)
    
    # Save the model
    joblib.dump(classifier, os.path.join(args.model_dir, "model.joblib"))


In the script above we have defined the function `model_fn`, which loads our model. After the model has been loaded, we have to serve it. Model serving is the process in which the model responds to inference requests. This process is divided into 3 steps (functions):

* `input_fn`
* `predict_fn`
* `output_fn`

Each step involves invoking a Python function, with information about the request and the return value from the previous function in the chain. Inside the SageMaker Scikit-learn model server, the process looks like:

```python
# Deserialize the Invoke request body into an object we can perform prediction on
input_object = input_fn(request_body, request_content_type)

# Perform prediction on the deserialized object, with the loaded model
prediction = predict_fn(input_object, model)

# Serialize the prediction result into the desired response content type
output = output_fn(prediction, response_content_type)
```

The SageMaker Scikit-learn model server provides default implementations of the 3 functions, but we can override them if we need to.

For the default functions, see https://github.com/aws/sagemaker-scikit-learn-container/blob/master/src/sagemaker_sklearn_container/serving.py

More info at https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#process-input
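
As an illustration of what such overrides could look like, here is a minimal sketch of the three functions with the signatures shown above. It assumes that requests and responses are exchanged as JSON, which is a choice made for this example rather than something the notebook prescribes. `ConstantModel` is a hypothetical stand-in for the trained classifier so that the chain can be exercised locally without an endpoint.

```python
import json


def input_fn(request_body, request_content_type):
    """Deserialize a JSON request body into a list of feature rows."""
    if request_content_type == 'application/json':
        return json.loads(request_body)
    raise ValueError(f'Unsupported content type: {request_content_type}')


def predict_fn(input_object, model):
    """Run the loaded model on the deserialized rows."""
    return model.predict(input_object)


def output_fn(prediction, response_content_type):
    """Serialize the prediction into a JSON string."""
    if response_content_type == 'application/json':
        return json.dumps([float(p) for p in prediction])
    raise ValueError(f'Unsupported content type: {response_content_type}')


class ConstantModel:
    """Hypothetical stand-in for the classifier: always predicts class 0."""

    def predict(self, rows):
        return [0 for _ in rows]


# Exercise the chain locally in the same order the model server uses.
body = json.dumps([[5.1, 3.5, 1.4, 0.2]])
rows = input_fn(body, 'application/json')
prediction = predict_fn(rows, ConstantModel())
print(output_fn(prediction, 'application/json'))  # prints [0.0]
```

In a real deployment these functions would live in the entry point script next to `model_fn`, and `model` would be the object that `model_fn` returns.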


## Create SageMaker Scikit Estimator  <a class="anchor" id="create_sklearn_estimator"></a>

To run our Scikit-learn training script on SageMaker we need a `sagemaker.sklearn.estimator.SKLearn` estimator. We can construct a new estimator (model) or reuse one that we created previously. The `SKLearn` estimator accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: The ARN of the IAM role that SageMaker uses for the training job.
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary of hyperparameters that are passed to the training script as arguments.

The code for the SKLearn estimator is available here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn

To create a new estimator (model), we must specify the entry point (our Python script), the instance type, the role and the session.

In [None]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'preprocess_train.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={'max_leaf_nodes': 30,
                     'min_samples_leaf': 1})
type(sklearn)

If we already have a trained model, we can reuse it instead of training again: we build the estimator (passing the S3 location of the model artifacts as the "model-dir" hyperparameter) and then attach it to the existing training job by name.

In [4]:
from sagemaker.sklearn.estimator import SKLearn

script_path = 'preprocess_train.py'

sklearn = SKLearn(
    entry_point=script_path,
    train_instance_type="ml.c4.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={'max_leaf_nodes': 30,
                     'min_samples_leaf': 1,
                     "model-dir": "s3://sagemaker-eu-west-1-788236952534/sagemaker-scikit-learn-2020-01-27-15-15-44-131/output/model.tar.gz"})

sklearn = sklearn.attach(training_job_name = "sagemaker-scikit-learn-2020-02-12-14-00-09-401")
type(sklearn)

2020-02-12 14:02:42 Starting - Preparing the instances for training
2020-02-12 14:02:42 Downloading - Downloading input data
2020-02-12 14:02:42 Training - Training image download completed. Training in progress.
2020-02-12 14:02:42 Uploading - Uploading generated training model
2020-02-12 14:02:42 Completed - Training job completed
2020-02-12 14:02:31,255 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-02-12 14:02:31,257 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-02-12 14:02:31,268 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-02-12 14:02:31,555 sagemaker-containers INFO     Module preprocess_train does not provide a setup.py.
Generating setup.py
2020-02-12 14:02:31,555 sagemaker-containers INFO     Generating setup.cfg
2020-02-12 14:02:31,555 sagemaker-containers INFO     Generating MANIFEST.in
2020-0

sagemaker.sklearn.estimator.SKLearn

In [None]:
# print(sklearn.image_name)
# print(sklearn.hyperparameters())
# print(sklearn.latest_job_debugger_artifacts_path)
# print(sklearn.output_path)
# print(sklearn.model_data)
# print(sklearn.train_image())
# print(sklearn.base_job_name)

For example, once our model is stored in our S3 bucket, we can download and rebuild it at any time.

In [None]:
# import tarfile
# # use joblib from sklearn.externals, as in training script
# from sklearn.externals import joblib

# key_prefix = "sagemaker-scikit-learn-2020-01-27-15-15-44-131/output/model.tar.gz"
# bucket = "sagemaker-eu-west-1-788236952534"
# sagemaker_session.download_data(bucket = bucket, key_prefix = key_prefix, path="model")

# # decompress
# tar = tarfile.open("model/model.tar.gz", "r:gz")
# tar.extractall("model/")
# tar.close()

# # load model with joblib
# model = joblib.load("model/model.joblib")
# type(model)

## Train SKLearn Estimator on Iris data <a class="anchor" id="train_sklearn"></a>
Training is very simple: just call `fit` on the estimator. This will start a SageMaker training job that will download the data for us, invoke our scikit-learn code (in the provided script file), and save any model artifacts that the script creates.

In [None]:
sklearn.fit({'train': train_input})

## Hyperparameter Tuning <a class="anchor" id="train_sklearn"></a>

Now we configure the tuning job by defining a JSON object that is passed as the value of the `HyperParameterTuningJobConfig` parameter to the `create_hyper_parameter_tuning_job` call. In this JSON object, you specify:

* The ranges of hyperparameters you want to tune
* The limits of the resource the tuning job can consume
* The objective metric for the tuning job

See https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html

In [5]:
now = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
job_name = f"scikitlearn-{now}"

tuning_job_config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [],
        "IntegerParameterRanges": [
            {
                "MaxValue": "30",
                "MinValue": "10",
                "Name": "max_leaf_nodes",
                "ScalingType": "Auto"
            },
            {
                "MaxValue": "5",
                "MinValue": "1",
                "Name": "min_samples_leaf",
                "ScalingType": "Auto"
            }
        ]
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 1,
        "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "MetricName": "auc",
        "Type": "Maximize"
    }
}

Next, we configure the training jobs that the tuning job launches by defining a JSON object that is passed as the value of the `TrainingJobDefinition` parameter to the `create_hyper_parameter_tuning_job` call. In this JSON object, you specify:

* Metrics that the training jobs emit
* The container image for the algorithm to train
* The input configuration for your training and test data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job
* The type of instance to use for the training jobs
* The stopping condition for the training jobs

In [6]:
train_input_path = dict(sklearn.latest_training_job.describe()["InputDataConfig"][0])
train_input_path = train_input_path["DataSource"]["S3DataSource"]["S3Uri"]

In [7]:
training_job_definition = {
    "AlgorithmSpecification": {
        "MetricDefinitions": [{
            "Name": "auc",
            "Regex": "Auc = (.*?);"
        }],
        "TrainingImage": f"{sklearn.train_image()}",
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [{
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
            "S3DataSource": {
                "S3DataDistributionType": "FullyReplicated",
                "S3DataType": "S3Prefix",
                "S3Uri": train_input_path
            }
        }
      },
#       {
#         "ChannelName": "validation",
#         "CompressionType": "None",
#         "ContentType": "csv",
#         "DataSource": {
#           "S3DataSource": {
#             "S3DataDistributionType": "FullyReplicated",
#             "S3DataType": "S3Prefix",
#             "S3Uri": s3_input_validation
#           }
#         }
#       }
    ],
    "OutputDataConfig": { #Model artifacts
        "S3OutputPath": f"{sklearn.output_path}tunning"
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 5
    },
    "RoleArn": role,
    "StoppingCondition":{
        "MaxRuntimeInSeconds": 1800
    }
}
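
The regex in `MetricDefinitions` is applied to the training logs, so the `auc` objective only receives a value if the training script prints a line that the regex matches. The `preprocess_train.py` script shown earlier does not print such a line. The sketch below shows a line format that would match and checks it against the same regex. `format_metric` is a hypothetical helper, the value `0.9731` is made up, and how the AUC itself would be computed is left out.

```python
import re

# Same pattern as the "Regex" entry in MetricDefinitions.
METRIC_REGEX = "Auc = (.*?);"


def format_metric(value):
    """Build a log line whose value the tuning job's regex can capture."""
    return f"Auc = {value:.4f};"


line = format_metric(0.9731)  # made-up value, for illustration only
match = re.search(METRIC_REGEX, line)

print(line)                    # the line the training script would print
print(float(match.group(1)))   # the value the regex captures
```

Inside the training script, a `print(format_metric(...))` call after model evaluation would be enough, since anything written to standard output ends up in the training logs.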

In [9]:
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name,
                                           HyperParameterTuningJobConfig = tuning_job_config,
                                           TrainingJobDefinition = training_job_definition)

{'HyperParameterTuningJobArn': 'arn:aws:sagemaker:eu-west-1:788236952534:hyper-parameter-tuning-job/scikitlearn-2020-02-13-08-25-45',
 'ResponseMetadata': {'RequestId': '4ab8bdb7-1716-4fd2-88a6-f55f61864f14',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '4ab8bdb7-1716-4fd2-88a6-f55f61864f14',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Thu, 13 Feb 2020 08:26:09 GMT'},
  'RetryAttempts': 0}}

In [None]:
# smclient.stop_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)

In [10]:
smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name)['HyperParameterTuningJobStatus']

'InProgress'
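
Rather than re-running the status cell by hand, the status can be polled in a loop. The following is a minimal sketch that uses only the `describe_hyper_parameter_tuning_job` call shown above. The function name, the polling interval, and the poll limit are assumptions chosen for illustration.

```python
import time

# Statuses after which the tuning job will not change any more.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_tuning_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll the tuning job until it reaches a terminal status or polls run out.

    Returns the last status seen, which may still be non-terminal if
    max_polls was exhausted.
    """
    status = None
    for _ in range(max_polls):
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name)
        status = response['HyperParameterTuningJobStatus']
        if status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
    return status
```

In this notebook it would be called as `wait_for_tuning_job(smclient, job_name)`. Capping the number of polls keeps the cell from blocking indefinitely if the job never finishes.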

In [11]:
smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = job_name)

{'HyperParameterTuningJobName': 'scikitlearn-2020-02-13-08-25-45',
 'HyperParameterTuningJobArn': 'arn:aws:sagemaker:eu-west-1:788236952534:hyper-parameter-tuning-job/scikitlearn-2020-02-13-08-25-45',
 'HyperParameterTuningJobConfig': {'Strategy': 'Bayesian',
  'HyperParameterTuningJobObjective': {'Type': 'Maximize',
   'MetricName': 'auc'},
  'ResourceLimits': {'MaxNumberOfTrainingJobs': 1,
   'MaxParallelTrainingJobs': 3},
  'ParameterRanges': {'IntegerParameterRanges': [{'Name': 'max_leaf_nodes',
     'MinValue': '10',
     'MaxValue': '30',
     'ScalingType': 'Auto'},
    {'Name': 'min_samples_leaf',
     'MinValue': '1',
     'MaxValue': '5',
     'ScalingType': 'Auto'}],
   'ContinuousParameterRanges': [],
   'CategoricalParameterRanges': []}},
 'TrainingJobDefinition': {'StaticHyperParameters': {'_tuning_objective_metric': 'auc'},
  'AlgorithmSpecification': {'TrainingImage': '141502667606.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.20.0-cpu-py3',
   'TrainingInput

## Using the trained model to make inference requests <a class="anchor" id="inference"></a>

### Deploy the model. Create our endpoint <a class="anchor" id="deploy"></a>

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count and instance type.

In [None]:
predictor = sklearn.deploy(initial_instance_count=1, 
                           instance_type="ml.t2.medium", 
                           endpoint_name="scikitlearn-endpoint")

print(f"\n{type(predictor)}")

If we already have an endpoint deployed, we can create our predictor (`sagemaker.sklearn.model.SKLearnPredictor`) from it simply by passing its `endpoint_name`.

In [None]:
# predictor_loaded = sagemaker.sklearn.model.SKLearnPredictor(endpoint_name="scikitlearn-endpoint", 
#                                                             sagemaker_session=sagemaker_session)

# print(type(predictor_loaded))

Let's check the status of our recently deployed endpoint.

In [None]:
# Reuse the SageMaker client created at the top of the notebook and describe the endpoint once.
endpoint_description = smclient.describe_endpoint(EndpointName=predictor.endpoint)
endpoint_name = endpoint_description['EndpointName']
status = endpoint_description['EndpointStatus']

print(f"Endpoint: {endpoint_name} \n"
      f"Status: {status}")

### Choose some data and use it for a prediction <a class="anchor" id="prediction_request"></a>

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [None]:
import itertools
import pandas as pd

shape = pd.read_csv("data/iris.csv", header=None, engine="python")

a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:,1:]
test_y = test_data.iloc[:,0]
type(test_X)
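
To make the row selection above easier to follow, the sketch below reproduces only the index arithmetic, without reading the CSV. It relies on the Iris rows being ordered by class with 50 rows per class; the variable names are chosen for this illustration.

```python
import itertools

class_offsets = [50 * i for i in range(3)]    # [0, 50, 100]: first row of each class
within_class = [40 + i for i in range(10)]    # [40, ..., 49]: last 10 rows of a class
indices = [i + j for i, j in itertools.product(class_offsets, within_class)]

# The notebook drops the final index (149), leaving 29 rows.
selected = indices[:-1]
print(selected[:3], selected[-3:], len(selected))  # prints [40, 41, 42] [146, 147, 148] 29
```

So the test rows are the last ten rows of each of the three classes, minus the very last row of the dataset.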

In [None]:
#test_X.values

Prediction is as easy as calling `predict` on the predictor we got back from `deploy`, passing the data we want predictions for. The endpoint returns a numerical representation of the predicted class; in the original dataset the classes are flower names, but in this example the labels are numerical. We can compare the predictions against the original labels that we parsed.

In [None]:
print(predictor.predict(test_X.values))
print(test_y.values)
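
Instead of comparing the two printed arrays by eye, the comparison can be reduced to a single number. The helper below is a minimal sketch; `accuracy` is a name introduced here, and the sample values are made up rather than taken from the endpoint.

```python
def accuracy(predicted, expected):
    """Return the fraction of predictions that equal the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError('predicted and expected must have the same length')
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)


# Made-up values for illustration; they are not output from the endpoint.
print(accuracy([0.0, 1.0, 2.0, 2.0], [0.0, 1.0, 1.0, 2.0]))  # prints 0.75
```

With the variables from the cell above, the call would be `accuracy(predictor.predict(test_X.values), test_y.values)`, assuming the endpoint returns one label per input row.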

### Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up.

__Beware of leaving the endpoint deployed if you are not going to use it: it is charged for every hour it stays deployed!__

In [None]:
#sklearn.delete_endpoint()