# Tune a scikit-learn model in SageMaker and track it with MLflow

## Intro

In this second notebook, we offload training to the remote infrastructure managed by SageMaker. We use SageMaker's hyperparameter tuning to launch multiple training jobs with different hyperparameter combinations and find the set that gives the best model performance. This is an important step in the machine learning process, as hyperparameter settings can have a large impact on model accuracy. In this example, we use the SageMaker Python SDK to create a hyperparameter tuning job for an `SKLearn` estimator.

## Setup environment

In [None]:
!pip install -q --upgrade pip
!pip install -q --upgrade sagemaker==2.63.1
!pip install -q --upgrade mlflow==1.18.0

In [None]:
import os
import pandas as pd

import sagemaker
from sagemaker.tuner import IntegerParameter, HyperparameterTuner
from sagemaker.sklearn.estimator import SKLearn

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

import mlflow
from mlflow.tracking.client import MlflowClient

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = sess.boto_region_name
account = role.split("::")[1].split(":")[0]
tracking_uri = os.environ['MLFLOWSERVER']
experiment_name = 'california-housing'

print('SageMaker role name: {}'.format(role.split("/")[-1]))
print('Account: {}'.format(account))
print('bucket: {}'.format(bucket))
print("Using AWS Region: {}".format(region))
print("MLflow server: {}".format(tracking_uri))

## Prepare data
We load a dataset from scikit-learn, split it into training and test sets, and upload both to S3.

In [None]:
# we use the California housing dataset 
data = fetch_california_housing()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)

trainX = pd.DataFrame(X_train, columns=data.feature_names)
trainX['target'] = y_train

testX = pd.DataFrame(X_test, columns=data.feature_names)
testX['target'] = y_test

trainX.to_csv('california_train.csv', index=False)
testX.to_csv('california_test.csv', index=False)

In [None]:
# Upload the data to S3; SageMaker reads the training data from there
train_path = sess.upload_data(path='california_train.csv', bucket=bucket, key_prefix='sagemaker/sklearncontainer')
test_path = sess.upload_data(path='california_test.csv', bucket=bucket, key_prefix='sagemaker/sklearncontainer')

## Training



We again use `SKLearn` in script mode, with the same training script as in the previous notebook, i.e. `./source_dir/train.py`. The only difference is that we specify an `instance_type` $\neq$ `local`: the training job now runs on a SageMaker-managed instance, in our case an `ml.m5.xlarge`.

In [None]:
hyperparameters = {
    'tracking_uri': tracking_uri,
    'experiment_name': experiment_name,
    'features': 'MedInc HouseAge AveRooms AveBedrms Population AveOccup',
    'target': 'target'
}

metric_definitions = [{'Name': 'median-AE', 'Regex': "AE-at-50th-percentile: ([0-9.]+).*$"}]

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
    framework_version='0.23-1',
    py_version='py3'
)
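
The training script itself is not shown in this notebook. As a rough sketch of how a script-mode entry point usually receives these values: SageMaker passes each hyperparameter as a `--name value` command-line argument and exposes each input channel through an `SM_CHANNEL_<NAME>` environment variable. The `parse_args` function below and its defaults are illustrative assumptions, not the actual contents of `train.py`.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical argument parsing for a script-mode entry point."""
    parser = argparse.ArgumentParser()
    # Static hyperparameters set on the estimator
    parser.add_argument("--tracking_uri", type=str, default="")
    parser.add_argument("--experiment_name", type=str, default="")
    parser.add_argument("--features", type=str, default="")
    parser.add_argument("--target", type=str, default="target")
    # Tunable hyperparameters (default values are assumptions)
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    # Input channels are exposed as SM_CHANNEL_<NAME> environment variables
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN", ""))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST", ""))
    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments one tuning job might pass to the script
args = parse_args(["--n-estimators", "150", "--min-samples-leaf", "4",
                   "--features", "MedInc HouseAge"])
print(args.n_estimators, args.min_samples_leaf, args.features.split())
```

Note that `argparse` turns the hyphenated names into attributes with underscores (`args.n_estimators`), while the space-separated `features` string has to be split inside the script.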

## Hyperparameter tuning

Once we've defined our estimator, we can specify the hyperparameters we'd like to tune and their possible values. There are three types of hyperparameter ranges:
- Categorical parameters need to take one value from a discrete set.  We define this by passing the list of possible values to `CategoricalParameter(list)`
- Continuous parameters can take any real number value between the minimum and maximum value, defined by `ContinuousParameter(min, max)`
- Integer parameters can take any integer value between the minimum and maximum value, defined by `IntegerParameter(min, max)`

*Note: where possible, it's almost always best to specify a value as the least restrictive type. For example, tuning `thresh` as a continuous value between 0.01 and 0.2 is likely to yield a better result than tuning it as a categorical parameter with possible values of 0.01, 0.1, 0.15, or 0.2.*
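
To illustrate the note with a toy example: if the best value lies between the categorical options, a categorical range can never reach it, while a continuous range can. The objective `toy_loss` and its optimum at 0.07 are made up for illustration, and the evenly spaced sweep only stands in for the tuner's actual search strategy.

```python
def toy_loss(thresh):
    """Hypothetical objective whose optimum (0.07) is not in the categorical set."""
    return (thresh - 0.07) ** 2


# Categorical: only the listed values can ever be tried
categorical_values = [0.01, 0.1, 0.15, 0.2]
best_categorical = min(categorical_values, key=toy_loss)

# Continuous: any value in [0.01, 0.2] is reachable; an evenly spaced sweep
# stands in here for the tuner's search over that range
continuous_candidates = [0.01 + i * 0.01 for i in range(20)]
best_continuous = min(continuous_candidates, key=toy_loss)

print("categorical:", best_categorical, toy_loss(best_categorical))
print("continuous: ", round(best_continuous, 2), toy_loss(best_continuous))
```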

In [None]:
hyperparameter_ranges = {
    'n-estimators': IntegerParameter(50, 200),
    'min-samples-leaf': IntegerParameter(1, 10)
}

Next, we specify the objective metric we'd like to tune on and whether we want to `Maximize` or `Minimize` it. The metric name refers to the entry in `metric_definitions` we defined earlier, whose regular expression (Regex) extracts the metric value from the CloudWatch logs of each training job.

In [None]:
objective_metric_name = 'median-AE'
objective_type = 'Minimize'
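
As a minimal sketch of what this amounts to, the snippet below applies the same regular expression to a few log lines and picks the best job according to the objective type. The job names, log lines, and metric values are made up, and `extract_metric` and `best_job` are hypothetical helpers; SageMaker performs the equivalent steps on the real CloudWatch logs.

```python
import re

# Same pattern as in metric_definitions
METRIC_REGEX = r"AE-at-50th-percentile: ([0-9.]+).*$"

# Hypothetical log lines, one per training job (values are made up)
job_logs = {
    "job-a": "AE-at-50th-percentile: 0.231",
    "job-b": "AE-at-50th-percentile: 0.198",
    "job-c": "AE-at-50th-percentile: 0.254",
}


def extract_metric(log_line):
    """Return the captured metric value as a float, or None if nothing matches."""
    match = re.search(METRIC_REGEX, log_line)
    return float(match.group(1)) if match else None


def best_job(logs, objective_type):
    """Pick the job with the lowest or highest metric, depending on the objective."""
    scores = {name: extract_metric(line) for name, line in logs.items()}
    scores = {name: value for name, value in scores.items() if value is not None}
    pick = min if objective_type == "Minimize" else max
    return pick(scores, key=scores.get)


print(best_job(job_logs, "Minimize"))
print(best_job(job_logs, "Maximize"))
```

Since the metric here is an absolute error, lower is better, which is why the notebook uses `Minimize`.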

Now, we'll create a `HyperparameterTuner` object, which we pass:
- The SKLearn estimator we created earlier
- Our hyperparameter ranges
- Objective metric name and type
- Number of training jobs to run in total and how many training jobs should be run simultaneously.  More parallel jobs will finish tuning sooner, but may sacrifice accuracy.  We recommend you set the parallel jobs value to less than 10% of the total number of training jobs (we'll set it higher just for this example to keep it short).

In [None]:
max_jobs = 10
max_parallel_jobs = 3

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=max_jobs,
                            max_parallel_jobs=max_parallel_jobs,
                            objective_type=objective_type,
                            base_tuning_job_name='mlflow')
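
To make the trade-off between total and parallel jobs concrete, here is a back-of-the-envelope calculation. It assumes, as a simplification, that jobs complete in batches of `max_parallel_jobs`; in practice a new job can start as soon as a slot frees up. Jobs that run at the same time cannot learn from each other's results, so fewer sequential rounds means fewer chances for the search to use earlier outcomes.

```python
import math

max_jobs = 10
max_parallel_jobs = 3

# Idealized number of sequential rounds if jobs finished in full batches
min_rounds = math.ceil(max_jobs / max_parallel_jobs)

# Share of the total budget that runs at the same time
parallel_fraction = max_parallel_jobs / max_jobs

# Parallelism implied by the 10% guideline, with at least one job at a time
guideline_parallel_jobs = max(1, math.floor(0.1 * max_jobs))

print(min_rounds, parallel_fraction, guideline_parallel_jobs)
```

With the values used in this notebook, 30% of the jobs run in parallel, well above the guideline, which is acceptable here only because the example is kept short.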

And finally, we can start our tuning job by calling `.fit()` and passing in the S3 paths to our train and test datasets.

In [None]:
tuner.fit({'train':train_path, 'test': test_path})

We can now query the MLflow server to see the runs logged by the training jobs, together with their hyperparameters and metrics.

In [None]:
mlflow.set_tracking_uri(tracking_uri)
mlflow.set_experiment(experiment_name)
client = MlflowClient()

experiment = mlflow.get_experiment_by_name(experiment_name)
experiment_id = experiment.experiment_id

# Get the runs sorted by the metric, best (lowest error) first
runs = client.search_runs(
    experiment_ids=[experiment_id],
    filter_string="",
    max_results=max_jobs,
    order_by=["metrics.`AE-at-50th-percentile` ASC"]
)

print('##### Runs sorted by metric')
for run in runs:
    metric = run.data.metrics['AE-at-50th-percentile']
    min_samples_leaf = run.data.params['min-samples-leaf']
    n_estimators = run.data.params['n-estimators']
    run_id = run.info.run_id
    print("###############")
    print("run ID: {}".format(run_id))
    print("AE-at-50th-percentile: {}".format(metric))
    print("min-samples-leaf (hyperparameter 1): {}".format(min_samples_leaf))
    print("n-estimators (hyperparameter 2): {}".format(n_estimators))