# Introduction

see WORD

# Prerequisites 

Another advantage of using managed cloud services is that you need only a little configuration to get up and running. Compared to managing compute resources manually, this allows you to focus more on the ML task itself.

To implement such a solution yourself, you will need access to an AWS account and a user with a policy that grants permissions for all services used in this example. The code can be run in any environment in which authentication is provided; however, the easiest way is to use an AWS SageMaker notebook instance. You can find more information about setting that up [here](https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html). 

# Data

In this example, we will use the Palmer Penguins dataset, which provides a suitable alternative to the frequently used Iris dataset. It contains measurements of various penguins. You can read more about it [here](https://allisonhorst.github.io/palmerpenguins/articles/intro.html). Our objective is to predict the sex of a penguin, using a selection of the other columns as features. 

In [None]:
from palmerpenguins import load_penguins

In [None]:
penguins = load_penguins()
penguins.head(3)

Minimal data processing is required: we drop entries with null values as well as duplicates. Then, we perform a train-test split and save the data in our working directory as well as in S3 storage.

In [None]:
import numpy as np
import os

In [None]:
s3_data_storage_path = "s3://sandbox-carsten-123/data/"
s3_output_storage_path = "s3://sandbox-carsten-123/model/"

In [None]:
penguins.dropna(inplace=True)
penguins.drop_duplicates(inplace=True)

features = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "species",
    "island",
]

target = "sex"

test_amount = 0.3
train = [np.random.uniform() >= test_amount for _ in range(len(penguins))]
test = [not train_flag for train_flag in train]

X_train = penguins[train][features]
y_train = penguins[train][target]
X_test = penguins[test][features]
y_test = penguins[test][target]

# Create the local data directory; do nothing if it already exists.
os.makedirs("data/", exist_ok=True)

for raw_data_bucket in ["data/", s3_data_storage_path]:

    X_train.to_csv(os.path.join(raw_data_bucket, "X_train.csv"), index=False)
    y_train.to_csv(os.path.join(raw_data_bucket, "y_train.csv"), index=False)
    X_test.to_csv(os.path.join(raw_data_bucket, "X_test.csv"), index=False)
    y_test.to_csv(os.path.join(raw_data_bucket, "y_test.csv"), index=False)
    print(f"Stored data in '{raw_data_bucket}' .")
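The mask above draws a fresh random number for every row, so the split changes on each run and the test share is only approximately 30%. A minimal sketch of a reproducible alternative is shown below, assuming a fixed seed is acceptable; the helper name `split_indices` and the seed value are illustrative and not part of the original notebook.

```python
import random


def split_indices(n_rows, test_amount=0.3, seed=42):
    """Return (train_idx, test_idx) as sorted lists of row positions."""
    rng = random.Random(seed)  # local generator, leaves global random state alone
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_amount)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_indices(n_rows=10, test_amount=0.3, seed=42)
print(len(train_idx), len(test_idx))
```

With pandas, the positions could then be applied via `penguins.iloc[train_idx]` and `penguins.iloc[test_idx]`.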

# Model Training


To train and deploy the model, we need a script that contains the training routine. 
The crucial part for training lies in the *__main__* clause. 
It reads the data, instantiates a pipeline, and trains the model. Here, minimal preprocessing consisting of one-hot encoding and standard scaling is chosen, and LogisticRegression acts as a baseline model. The trained pipeline is then serialized and saved to the model directory. 

The script takes four arguments. The first defines the input path for the training data, which is assumed to contain two files: X_train.csv and y_train.csv. The next two explicitly list the numerical and the categorical features used in this example. Finally, the output path for the serialized model is defined. 

In [None]:
%%writefile train_and_deploy.py

import os
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import joblib
import argparse
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline


def model_fn(model_dir):
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    return model

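def float_if_number(entry):
    # Convert numeric fields to float; leave categorical fields as strings.
    try:
        return float(entry)
    except ValueError:
        return entry

def input_fn(request_body, content_type):
    if content_type == 'text/csv':
        # Samples are separated by '|', fields within a sample by ','.
        samples = []
        for r in request_body.split('|'):
            samples.append(list(map(float_if_number, r.split(','))))
        # dtype=object keeps floats and strings side by side instead of
        # coercing every field to a string.
        return np.array(samples, dtype=object)
    else:
        raise ValueError("This model only supports text/csv input")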

def predict_fn(input_data, model):
    return model.predict(pd.DataFrame(input_data, columns=model.steps[0][1]._feature_names_in))

def output_fn(prediction, content_type):
    return str(prediction)


if __name__ == "__main__":
    
    parser = argparse.ArgumentParser()

    parser.add_argument('--train', type=str, default="/opt/ml/input/data/train")
    parser.add_argument('--num_features', type=str) 
    parser.add_argument('--cat_features', type=str)
    parser.add_argument('--model-dir', type=str, default="/opt/ml/model")
    args, _ = parser.parse_known_args()
    
    train_path = args.train
    num_features = args.num_features.split()
    cat_features = args.cat_features.split()
    model_dir = args.model_dir

    X_train = pd.read_csv(os.path.join(train_path, "X_train.csv"))
    y_train = pd.read_csv(os.path.join(train_path, "y_train.csv"))
    
    preprocessor = make_column_transformer(
        (StandardScaler(), num_features),
        (OneHotEncoder(sparse=False), cat_features),
    )
    
    model = LogisticRegression(class_weight="balanced", solver="lbfgs")
    
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', model)])
    
    pipeline.fit(X_train, np.ravel(y_train))
    
    model_output_directory = os.path.join(model_dir, "model.joblib")
    print("Model saving path {}".format(model_output_directory))
    joblib.dump(pipeline, model_output_directory)

The script also contains several serving functions that SageMaker requires for serving the model via a SageMaker model endpoint. These functions are model_fn(), which loads the model from file; input_fn(), which converts the request into a form that can be passed to predict(); predict_fn(), which calls predict() on the model; and output_fn(), which converts the model output into a format that can be sent back to the caller. 
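The order in which these four functions are chained can be sketched as follows. This is a simplified illustration, not the actual SageMaker serving code: `StubModel`, `handle_request`, the toy prediction rule, and the sample values are hypothetical stand-ins so that the example runs without a trained model.

```python
def float_if_number(entry):
    """Convert numeric strings to float, leave everything else untouched."""
    try:
        return float(entry)
    except ValueError:
        return entry


class StubModel:
    """Hypothetical stand-in for the trained pipeline."""

    def predict(self, rows):
        # Toy rule for illustration only: long flippers -> "male".
        return ["male" if row[2] > 200 else "female" for row in rows]


def model_fn(model_dir):
    # The real script loads model.joblib from model_dir here.
    return StubModel()


def input_fn(request_body, content_type):
    if content_type != "text/csv":
        raise ValueError("This model only supports text/csv input")
    # Samples are separated by "|", fields within a sample by ",".
    return [[float_if_number(field) for field in sample.split(",")]
            for sample in request_body.split("|")]


def predict_fn(input_data, model):
    return model.predict(input_data)


def output_fn(prediction, content_type):
    return str(prediction)


def handle_request(request_body, content_type, model_dir="/opt/ml/model"):
    """Chain the four hooks: load, parse, predict, serialize."""
    model = model_fn(model_dir)  # in practice the model is loaded once, not per request
    data = input_fn(request_body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, content_type)


body = "39.1,18.7,181.0,Adelie,Torgersen|50.0,16.3,230.0,Gentoo,Biscoe"
print(handle_request(body, "text/csv"))
```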

Before we train and deploy on a dedicated container, we can run the script on our local machine or on the instance on which the notebook is running.

In [None]:
!python3 train_and_deploy.py --train ./data/  \
                             --num_features "bill_length_mm bill_depth_mm flipper_length_mm"  \
                             --cat_features "species island"  \
                             --model-dir ./  
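How the script interprets its arguments can be sketched in isolation. The parser definition mirrors the one in the script; the `parse_args` wrapper and the extra `--unused-flag` are added here for illustration only. When `--train` and `--model-dir` are omitted, the container defaults from the script apply.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", type=str, default="/opt/ml/input/data/train")
    parser.add_argument("--num_features", type=str)
    parser.add_argument("--cat_features", type=str)
    parser.add_argument("--model-dir", type=str, default="/opt/ml/model")
    # parse_known_args returns unrecognized arguments instead of failing on them.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


args, unknown = parse_args([
    "--num_features", "bill_length_mm bill_depth_mm flipper_length_mm",
    "--cat_features", "species island",
    "--unused-flag", "1",
])

# The space-separated strings are split into lists of column names.
print(args.num_features.split())
print(args.cat_features.split())
# argparse exposes "--model-dir" as the attribute model_dir.
print(args.train, args.model_dir)
print(unknown)
```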

The SKLearn estimator is the standard interface for defining and scheduling the training and deployment of scikit-learn models. We specify the required resources, the framework version, the entry point, the role, and the output_path, which is the S3 location where the model artifacts will be stored. Further arguments, such as the lists of numerical and categorical features, can be passed via the hyperparameters dictionary. 
Then, we can call fit() to execute the training job. 
  
We pass a dictionary with a single key, "train", that specifies the path to the processed data in S3. The training data is then copied from there into the training container. After training, the model artifacts are moved to the desired output path in S3, defined via the "output_path" argument of the SKLearn object. 

In [None]:
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

In [None]:
sagemaker_role = get_execution_role()

In [None]:
sklearn = SKLearn(
    entry_point="train_and_deploy.py",
    framework_version="0.23-1", 
    instance_type="ml.m5.xlarge", 
    role=sagemaker_role,
    hyperparameters={
        "num_features": "bill_length_mm bill_depth_mm flipper_length_mm",
        "cat_features": "species island"
    },
    output_path=s3_output_storage_path
)

In [None]:
sklearn.fit({"train": s3_data_storage_path})
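Before deployment, the model can be checked against the held-out test set. The following is a minimal sketch of such a check, assuming the predictions were obtained beforehand, for example by calling predict() on the locally trained pipeline with X_test. The helper `evaluate` is hypothetical, and the labels below are placeholder values, not real model output.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Return accuracy and a Counter of (true, predicted) label pairs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot evaluate on an empty test set")
    pairs = Counter(zip(y_true, y_pred))
    correct = sum(count for (true, pred), count in pairs.items() if true == pred)
    return correct / len(y_true), pairs


# Placeholder labels for illustration only.
y_true = ["male", "female", "female", "male"]
y_pred = ["male", "female", "male", "male"]

accuracy, confusion = evaluate(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for (true, pred), count in sorted(confusion.items()):
    print(f"true={true} predicted={pred}: {count}")
```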

# Model Deployment

After evaluating our model, we can deploy it. To do so, we only need to call deploy() on the SKLearn object that we used for model training. This starts a model endpoint in the background. 

In [None]:
predictor = sklearn.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge")

We can now run a first test against our model endpoint directly from our Jupyter notebook. To do so, we take some of the test features, add them to a request, and call the model through the SageMaker runtime client with the invoke_endpoint() method. 

In [None]:
import boto3

In [None]:
to_be_predicted = X_test.head(10).values.tolist()

# Separate fields with "," and samples with "|", as expected by input_fn().
request_body = "|".join(
    ",".join(str(n) for n in sample) for sample in to_be_predicted
)
print("*"*20)
print(f"Calling SageMaker endpoint with the following request_body: {request_body}")

client = boto3.client('sagemaker-runtime')

endpoint_name = predictor.endpoint_name
content_type = 'text/csv'

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=request_body,
    ContentType=content_type
    )
response_from_endpoint = response['Body'].read().decode("utf-8")
print("*"*20)
print(f"Response from Endpoint: {response_from_endpoint}")
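Because output_fn() returns str(prediction), the response is the printed form of a NumPy array, for example `['male' 'female']`, rather than JSON or CSV. A minimal sketch for turning such a string back into a list is shown below. It assumes that the labels contain no single quotes and that the array is short enough for NumPy not to abbreviate it with `...`; the helper `parse_prediction_string` is hypothetical, and the sample string is illustrative.

```python
import re


def parse_prediction_string(text):
    """Extract the quoted labels from a string such as "['male' 'female']"."""
    return re.findall(r"'([^']*)'", text)


sample_response = "['male' 'female' 'male']"
print(parse_prediction_string(sample_response))
```

Returning a structured format such as JSON from output_fn() would avoid this parsing step, at the cost of changing the serving script.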

Finally, the endpoint should be shut down so that it does not incur further costs.

In [None]:
predictor.delete_endpoint()

# Outlook

see WORD