# Train a Scikit-Learn model in SageMaker and track with MLFlow

## Intro

The main objective of these three notebooks is to show how you can integrate Amazon SageMaker and MLFlow in a secure environment.

## Pre-Requisites

To run these notebooks successfully, you must first prepare the infrastructure using CDK. It sets up the MLFlow server in an isolated VPC, a separate VPC for the SageMaker environment, and a SageMaker Notebook instance inside the SageMaker VPC, which is where you should execute these notebooks. When you run this example in the SageMaker Notebook instance provisioned via CDK, the following environment variable is also set for you automatically via a notebook lifecycle configuration:

* `MLFLOWSERVER` - the URI of the MLFlow server we will use for tracking purposes. In our case, this corresponds to the `HTTP API Gateway` endpoint that exposes our MLFlow server, reachable via a `PrivateLink`.
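As a minimal sketch (not part of the original notebooks), the lookup below fails early with a readable message when `MLFLOWSERVER` is missing, for example when a notebook is run outside the CDK-provisioned instance. The helper name `get_tracking_uri` is hypothetical.

```python
import os


def get_tracking_uri(env=os.environ):
    """Return the MLFlow tracking URI stored in MLFLOWSERVER (hypothetical helper)."""
    uri = env.get("MLFLOWSERVER", "").strip()
    if not uri:
        # Fail early instead of passing an empty URI to the MLFlow SDK.
        raise RuntimeError(
            "MLFLOWSERVER is not set; run this notebook on the instance "
            "provisioned via CDK or export the variable manually."
        )
    return uri
```

Passing the environment as an argument keeps the helper easy to exercise with a plain dictionary.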

## The Problem

In this example, we will solve a regression problem which aims to answer the question: "what is the expected price of a house in the California area?". The target variable is the house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

## Install and/or update required libraries

At the time of writing, the tested version of the `sagemaker` SDK is `2.63.1`, while the MLFlow SDK version is the one matching our MLFlow server, i.e., `1.18.0`.

In [None]:
!pip install -q --upgrade pip
!pip install -q --upgrade sagemaker==2.63.1
!pip install -q --upgrade mlflow==1.18.0

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/using-identity-based-policies.html) for more details on creating these. Note: if you need a role that is not associated with the current notebook instance, or more than one role is required for training and/or hosting, replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s).
- The tracking URI where the MLFlow server runs
- The experiment name as the logical entity to keep our tests grouped and organized.
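For readers unfamiliar with the shape of an IAM role ARN, here is a small sketch of how the account ID and role name can be read from it. The helper `parse_role_arn` and the ARN used below are made-up placeholders, not values from this setup.

```python
def parse_role_arn(role_arn):
    """Split an IAM role ARN into (account_id, role_name)."""
    # ARN layout: arn:partition:iam::account-id:role/optional-path/role-name
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "iam" or not parts[5].startswith("role/"):
        raise ValueError("not an IAM role ARN: {}".format(role_arn))
    account_id = parts[4]
    role_name = parts[5].split("/")[-1]
    return account_id, role_name


# Placeholder ARN, for illustration only.
account_id, role_name = parse_role_arn(
    "arn:aws:iam::123456789012:role/service-role/ExampleRole"
)
print(account_id, role_name)
```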

In [None]:
import os
import pandas as pd
import random

## SageMaker and SKlearn libraries
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

## MLFlow libraries
import mlflow
from mlflow.tracking.client import MlflowClient

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name
account = role.split("::")[1].split(":")[0]
tracking_uri = os.environ['MLFLOWSERVER']
experiment_name = 'california-housing'
model_name = 'california-housing-model'

print('SageMaker role: {}'.format(role.split("/")[-1]))
print('bucket: {}'.format(bucket))
print('Account: {}'.format(account))
print("Using AWS Region: {}".format(region))
print("MLflow server URI: {}".format(tracking_uri))

## Data Preparation
We load the dataset from sklearn, then split the data into training and testing datasets, allocating 75% of the data to the training dataset and the remaining 25% to the testing dataset.

The variable `target` is what we intend to estimate. It represents the value of a house, expressed in hundreds of thousands of dollars ($100,000).
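To make the numbers concrete, the sketch below works out the expected split sizes and converts a target value into dollars. It assumes that scikit-learn rounds the test share up and assigns the remaining rows to training; the helper `to_dollars` is hypothetical.

```python
import math

N_SAMPLES = 20640  # rows in the California housing dataset
TEST_SIZE = 0.25

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(TEST_SIZE * N_SAMPLES)
n_train = N_SAMPLES - n_test
print(n_train, n_test)  # 15480 5160


def to_dollars(target):
    """Convert a target expressed in units of $100,000 into dollars."""
    return target * 100_000


print(to_dollars(2.5))  # 250000.0
```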

In [None]:
# we use the California housing dataset 
data = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

Finally, we save a copy of the data locally, as well as in S3. The data stored in S3 will be used by SageMaker to train and test the model.

In [None]:
# save the data locally
trainX.to_csv('california_train.csv', index=False)
testX.to_csv('california_test.csv', index=False)

# save the data to S3.
train_path = sess.upload_data(path='california_train.csv', bucket=bucket, key_prefix='sagemaker/sklearncontainer')
test_path = sess.upload_data(path='california_test.csv', bucket=bucket, key_prefix='sagemaker/sklearncontainer')

## Training

For this example, we use the `SKlearn` framework in script mode with SageMaker. Let us explore in more detail the different components we need to define.

### Training script

The `./source_dir/train.py` script provides all the code we need for training a SageMaker model. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel.
* `SM_CHANNEL_TEST`: A string representing the path to the directory containing data in the 'test' channel.

For more information about training environment variables, please visit 
[SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit).
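As an illustration only (the actual `train.py` may read these values differently), a training script could collect the directories like this. The helper name `training_paths` and the fallback paths are assumptions that keep the script runnable outside a SageMaker container.

```python
import os


def training_paths(env=os.environ):
    """Collect the SageMaker-provided directories, with local fallbacks."""
    # The fallbacks are illustrative assumptions, not SageMaker defaults.
    return {
        "model_dir": env.get("SM_MODEL_DIR", "./model"),
        "train_dir": env.get("SM_CHANNEL_TRAIN", "./data/train"),
        "test_dir": env.get("SM_CHANNEL_TEST", "./data/test"),
    }


print(training_paths({"SM_MODEL_DIR": "/opt/ml/model"}))
```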

#### Hyperparameters

We are using the `RandomForestRegressor` algorithm from the SKlearn framework. For the purpose of this exercise, we only use a subset of the hyperparameters supported by this algorithm, i.e. `n-estimators` and `min-samples-leaf`.

If you would like to know more about the different hyperparameters for this algorithm, please refer to the [`RandomForestRegressor` official documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
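In script mode, SageMaker passes hyperparameters to the entry point as command-line arguments. The sketch below shows one way a script might parse the two hyperparameters above; it is not the content of `train.py`, the function name `parse_hyperparameters` is hypothetical, and the defaults are chosen to mirror the scikit-learn defaults.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the two hyperparameters used in this example."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    # parse_known_args ignores any other arguments the script receives.
    args, _ = parser.parse_known_args(argv)
    # argparse exposes "--n-estimators" as args.n_estimators, which matches
    # the keyword names that RandomForestRegressor expects.
    return {
        "n_estimators": args.n_estimators,
        "min_samples_leaf": args.min_samples_leaf,
    }


params = parse_hyperparameters(["--n-estimators", "100", "--min-samples-leaf", "3"])
print(params)  # {'n_estimators': 100, 'min_samples_leaf': 3}
```

The returned dictionary could then be unpacked into the estimator, e.g. `RandomForestRegressor(**params)`.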

Furthermore, note that for the purpose of this exercise we skip the feature engineering step entirely, even though it is an essential step in any machine learning problem.

#### MLFlow interaction

To interact with the MLFlow server, we use the mlflow SDK, which allows us to set the tracking URI and the experiment name. Once this initial setup is completed, we can store the parameters used (`mlflow.log_params(params)`), the model that is generated (`mlflow.sklearn.log_model(model, "model")`), and its associated metrics (`mlflow.log_metric(f'AE-at-{str(q)}th-percentile', np.percentile(a=abs_err, q=q))`).
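To illustrate how the metric names and values fit together, here is a standard-library sketch that builds the absolute-error percentiles as name/value pairs. It stands in for the `np.percentile` call used in the training script; the percentile list `(10, 50, 90)` and both helper names are assumptions.

```python
import math


def percentile(values, q):
    """q-th percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def absolute_error_metrics(y_true, y_pred, percentiles=(10, 50, 90)):
    """Build {metric name: value} pairs for the absolute-error percentiles."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {f"AE-at-{q}th-percentile": percentile(abs_err, q) for q in percentiles}


metrics = absolute_error_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
for name, value in metrics.items():
    # In a training script, each pair would go to mlflow.log_metric(name, value).
    print(f"{name}: {value}")
```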

In [None]:
!pygmentize ./source_dir/train.py

### SKlearn container

For this example, we use the `SKlearn` framework in script mode with SageMaker. For more information, please refer to [the official documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html).

Our training script makes use of third-party libraries, i.e. `mlflow`, which are not installed by default in the `Sklearn` container SageMaker provides. However, this is easily overcome by supplying a `requirements.txt` file in the `source_dir` folder, which SageMaker will then `pip`-install before executing the training script.

### Metric definition

SageMaker emits every log to CloudWatch. Since we are using script mode, we need to specify a metric definition object that describes, via a regex, the format of the metric we are interested in, so that SageMaker knows how to extract this metric from the CloudWatch logs of the training job.

In our case, our custom metric is as follows:

```python
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]
```
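Before launching a job, it can help to check the regex locally against a sample log line. The sketch below does this with Python's `re` module; the log lines are made up, the helper `extract_metric` is hypothetical, and the matching SageMaker performs on the service side may differ in detail.

```python
import re

METRIC_REGEX = "AE-at-50th-percentile: ([0-9.]+).*$"


def extract_metric(log_line, regex=METRIC_REGEX):
    """Return the captured metric as a float, or None when the line does not match."""
    match = re.search(regex, log_line)
    return float(match.group(1)) if match else None


# Made-up log lines, for illustration only.
print(extract_metric("AE-at-50th-percentile: 0.3124"))  # 0.3124
print(extract_metric("AE-at-10th-percentile: 0.0412"))  # None
```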

### Local mode

During early experimentation, and for testing that your script is bug-free, it is good practice to run SageMaker in `local` mode with a small amount of data. When you specify `instance_type='local'`, you instruct the SageMaker SDK to run the training on your local machine rather than on the remote SageMaker managed infrastructure. This approach makes it quicker to address problems while developing your script.

Once you are confident that your code is correct, you can then offload the training to the SageMaker managed infrastructure. We will see this in the second notebook.

In [None]:
metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

hyperparameters = {
    'tracking_uri': tracking_uri,
    'experiment_name': experiment_name,
    'n-estimators': 100,
    'min-samples-leaf': 3,
    'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup',
    'target': 'target'
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    instance_count=1,
    instance_type='local',        # to run SageMaker in local mode
    framework_version='0.23-1',
    base_job_name='mlflow',
)

Now we are ready to execute the training locally, which in turn will save its execution data to the MLFlow server. After initializing an `SKlearn` estimator object, all we need to do is to call the `.fit` method specifying where the training and testing data are located.

In [None]:
estimator.fit({'train':train_path, 'test': test_path})

## Register the model to MLFlow

At the end of the training, our model has been saved to the MLflow server and we are ready to register the model, i.e. assign it to a model package and create a version. Please refer to the [official MLFlow documentation](https://www.mlflow.org/docs/latest/model-registry.html) for further information.

In [None]:
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)
client = MlflowClient()

# Find the experiment ID
experiment = mlflow.get_experiment_by_name(experiment_name)
experiment_id = experiment.experiment_id

# Get the latest run
run = client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="",
    max_results=1,
    order_by=["attribute.start_time DESC"]
)[0]

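try:
    client.create_registered_model(model_name)
except mlflow.exceptions.MlflowException:
    # Raised by the MLFlow client, e.g. when the name is already registered
    print("Registered model already exists")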

model_version = client.create_model_version(
    name=model_name,
    source="{}/model".format(run.info.artifact_uri),
    run_id=run.info.run_id  # run_uuid is the deprecated alias of run_id
)

print("model_version: {}".format(model_version))

## Local Predictions

We are now ready to make predictions with our model locally for testing purposes.

In [None]:
# get the model URI from the MLFlow registry
model_uri = model_version.source
print("Model URI: {}".format(model_uri))

# Load model as a Sklearn model.
loaded_model = mlflow.sklearn.load_model(model_uri)

# get a random index to test the prediction from the test data
index = random.randrange(0, len(testX))
print("Random index value: {}".format(index))

# Prepare data on a Pandas DataFrame to make a prediction.
data = testX.drop(['Latitude','Longitude','target'], axis=1).iloc[[index]]

print("#######\nData for prediction \n{}".format(data))

y_hat = loaded_model.predict(data)[0]
y = y_test[index]

print("Predicted value: {}".format(y_hat))
print("Actual value: {}".format(y))