# Demonstration: Optimizing Amazon SageMaker Hyperparameters - Companion Notebook

This Jupyter notebook is the companion notebook for the Module 3 demonstration Optimizing Amazon SageMaker Hyperparameters.

## About this dataset

This demonstration uses the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality), which was obtained from the [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml).

It contains physicochemical measurements and quality scores for wine samples.

## Attribute information

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
    
Output variable (based on sensory data):

12. quality (score between 0 and 10)

For more information about this dataset, see the [Wine Quality Data Set webpage](https://archive.ics.uci.edu/ml/datasets/wine+quality).

## Dataset attributions
This dataset was obtained from: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

## Importing the data

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# The file is semicolon-delimited, so pass the separator by keyword.
df_wine = pd.read_csv(url, sep=';')


## Preprocessing

The quality scores in the dataset range from 3 to 8. These scores are mapped to the target classes 0 to 5, because XGBoost expects class labels to start at 0.

XGBoost requires the training data to be in a single file. In the file, the target value must be the first column. 

Get the target column and move it to the first position.

In [None]:
df_wine['quality']=df_wine['quality'].map({3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5})

cols = df_wine.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_wine = df_wine[cols]
df_wine.head() 
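
The remapping and column reordering above can also be sketched without pandas. This is a minimal illustration on a few made-up rows (the values and the `move_target_first` helper are hypothetical, not taken from the dataset or any library). It builds the 3-8 to 0-5 mapping from the known score range and moves the target to the front of each row.

```python
# Hypothetical sample rows: (alcohol, pH, quality). The values are made up.
columns = ["alcohol", "pH", "quality"]
rows = [
    [9.4, 3.51, 5],
    [9.8, 3.20, 6],
    [10.5, 3.30, 8],
    [11.2, 3.16, 3],
]

# Build the mapping from the known score range (3 to 8) so that it does not
# depend on which scores happen to appear in the sample.
quality_to_class = {score: index for index, score in enumerate(range(3, 9))}


def move_target_first(row, target_index):
    """Return a copy of row with the target value moved to position 0."""
    return [row[target_index]] + row[:target_index] + row[target_index + 1:]


target_index = columns.index("quality")
reordered_columns = move_target_first(columns, target_index)

reordered_rows = []
for row in rows:
    mapped = list(row)
    mapped[target_index] = quality_to_class[row[target_index]]
    reordered_rows.append(move_target_first(mapped, target_index))

print(reordered_columns)
print(reordered_rows[0])
```

Deriving the mapping from a fixed range, rather than from the scores present in the data, keeps the class numbering stable even if a score is missing from a particular sample.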

## Splitting the data

You will start by splitting the dataset into two datasets. You will use one dataset for training, and you will split the other dataset again for use with validation and testing.

You will use the *train_test_split function* from the *scikit-learn library*, which is a free machine learning library for Python. It has many algorithms and useful functions, such as the one you will use in this demonstration. 

- For more information about the function, see the [Train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- For more information about scikit-learn, see the [scikit-learn guide](https://scikit-learn.org/stable/).

Because you don't have a lot of data, you want each split to contain a representative share of every class. Thus, you will use the *stratify* parameter. Finally, you will set a fixed *random_state* seed so that you can repeat the splits.

In [None]:
from sklearn.model_selection import train_test_split

train, test_and_validate = train_test_split(df_wine, 
                                            test_size=0.2, 
                                            random_state=42, 
                                            stratify=df_wine['quality'])

test, validate = train_test_split(test_and_validate, 
                                  test_size=0.5, 
                                  random_state=42, 
                                  stratify=test_and_validate['quality'])
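
To see what stratification does, the following sketch implements a simplified stratified split in plain Python. It is not how `train_test_split` is implemented; the `stratified_split` helper and the toy labels are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed):
    """Split indices so each class keeps roughly the same share in both parts.

    Simplified illustration only; scikit-learn's own algorithm differs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced labels (hypothetical): 80 samples of class 2, 20 of class 5.
labels = [2] * 80 + [5] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both parts keep the 4:1 class ratio of the full label list, which is the property that matters when some quality classes have only a few samples.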

## Uploading to Amazon S3

XGBoost will load the data for training from Amazon Simple Storage Service (Amazon S3). Thus, you must write the data to a comma-separated values (CSV) file, and then upload the file to Amazon S3.

Start by setting up some variables for the S3 bucket, then create a function to upload the CSV file to Amazon S3. You can reuse this function.

First, explore the function.

Note the following line:

`dataframe.to_csv(csv_buffer, header=False, index=False)`

This line writes the pandas DataFrame (which was passed into the function) into the I/O buffer that's named *csv_buffer*. You use a buffer because you don't need to write the file locally.

To stop the column headers from being written out, use `header=False`. To stop the pandas index from being output, use `index=False`.

To write the contents of `csv_buffer` to Amazon S3 as an object, call the `put` method on an `Object` resource, which you get from the `Bucket` resource.




In [None]:
import boto3
import io
import os

bucket='c45317a617679l1523854t1w00381652629-sandboxbucket-3apxi73oxsw6'
prefix='mod03-demo-hyperparam'
train_file='wine_train.csv'
test_file='wine_test.csv'
validate_file='wine_validate.csv'
whole_file='wine.csv'
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)
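
The in-memory buffer technique used in `upload_s3_csv` can be tried without AWS access. The sketch below uses the standard library `csv` module in place of pandas and stops short of the upload; the sample rows are made up.

```python
import csv
import io

# Hypothetical rows with the target value in the first column.
rows = [
    [2, 7.4, 0.70],
    [3, 7.8, 0.88],
]

# Write CSV text into memory instead of into a file on disk.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer, lineterminator="\n")
writer.writerows(rows)  # no header row and no index column are written

# getvalue() returns the whole buffer as one string. In the notebook, this
# string is what gets passed as Body to the put call.
body = csv_buffer.getvalue()
print(body)
```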

## Creating the estimator

Now that the data is in Amazon S3, you can train a model.

The first step is to get the XGBoost container URI.

In [None]:
import sagemaker
from sagemaker.image_uris import retrieve
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

The only value to point out is the *num_class*, which is set to *6* to match the number of target classes in the dataset.

In [None]:
hyperparams={
    "num_round":"40",
    "num_class":"6",
    "objective":"multi:softmax"}

Use the `Estimator` class to set up the model. Here are a few parameters of interest:

- **instance_count** - Defines how many instances will be used for training. You will use *one* instance.
- **instance_type** - Defines the instance type for training. In this case, it's *ml.m5.xlarge*.

In [None]:
xgb_model=sagemaker.estimator.Estimator(container,
                                        sagemaker.get_execution_role(),
                                        instance_count=1,
                                        instance_type='ml.m5.xlarge',
                                        output_path=s3_output_location,
                                        hyperparameters=hyperparams,
                                        sagemaker_session=sagemaker.Session())

## Creating the input channels

The estimator needs *channels* to feed data into the model. For training, you will use the *train_channel* and the *validate_channel*.

In [None]:
# Each channel points at an S3 prefix (folder), not at a single file.
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}
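
The channel URIs and the object keys used for the upload follow the same `prefix/folder/` layout. The sketch below shows one way to build both with forward slashes regardless of operating system. The `s3_key` and `s3_uri` helpers and the bucket name are hypothetical and are not part of boto3 or the SageMaker SDK.

```python
import posixpath


def s3_key(*parts):
    """Join key parts with '/', the conventional delimiter in S3 keys."""
    return posixpath.join(*parts)


def s3_uri(bucket_name, *parts, trailing_slash=False):
    """Build an s3:// URI from a bucket name and key parts."""
    uri = "s3://{}/{}".format(bucket_name, s3_key(*parts))
    return uri + "/" if trailing_slash else uri


example_bucket = "example-bucket"  # hypothetical bucket name
example_prefix = "mod03-demo-hyperparam"

# Key of one uploaded object, and the URI of the folder that contains it.
print(s3_key(example_prefix, "train", "wine_train.csv"))
print(s3_uri(example_bucket, example_prefix, "train", trailing_slash=True))
```

Using `posixpath.join` rather than `os.path.join` avoids backslashes in keys if the code ever runs on Windows.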

## Creating a hyperparameter tuning job

               
Next, you must specify the hyperparameters that you want to tune and the range of values to search for each one.

The hyperparameters that have the largest effect on XGBoost objective metrics are: 

- alpha
- min_child_weight
- subsample
- eta
- num_round 

For more information about the recommended tuning ranges, see [Tune an XGBoost Model](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-tuning.html) in the AWS Documentation.

For this demonstration, you will use a *subset* of the recommended ranges. These values were obtained by running the tuning job with the full ranges and then narrowing them, so that fewer iterations are needed to reach good performance. Though this practice isn't strictly realistic, it prevents you from waiting several hours for the tuning job to complete.

```
hyperparameter_ranges = {'alpha': ContinuousParameter(0, 1000),
                         'eta': ContinuousParameter(0.1, 0.5),
                         'min_child_weight': ContinuousParameter(1, 120),
                         'subsample': ContinuousParameter(0.5, 1),
                         'num_round': IntegerParameter(1,4000)}
```
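
Each range defines the space that the tuning job may draw values from. As a rough illustration, the following sketch draws random candidates from the same bounds in plain Python. This is simple random search, which is not necessarily the strategy that the tuning job uses, and `search_space` and `sample_candidate` are hypothetical names rather than SageMaker SDK objects.

```python
import random

# Same bounds as the ranges above: (kind, low, high).
search_space = {
    "alpha": ("continuous", 0, 1000),
    "eta": ("continuous", 0.1, 0.5),
    "min_child_weight": ("continuous", 1, 120),
    "subsample": ("continuous", 0.5, 1),
    "num_round": ("integer", 1, 4000),
}


def sample_candidate(space, rng):
    """Draw one hyperparameter combination uniformly from each range."""
    candidate = {}
    for name, (kind, low, high) in space.items():
        if kind == "integer":
            candidate[name] = rng.randint(low, high)  # bounds are inclusive
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(42)  # fixed seed so that the draws are repeatable
candidates = [sample_candidate(search_space, rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```

Each printed dictionary corresponds to the set of hyperparameters that one training job in a tuning run would receive.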


You must specify how you are rating the model. You could use several different objective metrics, only some of which apply to a multiclass classification problem. Because the evaluation metric is **merror** (the multiclass classification error rate), you set the objective metric to *validation:merror* and the objective type to *Minimize*.

```
objective_metric_name = 'validation:merror'
objective_type = 'Minimize'
```
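
The `merror` metric is the multiclass classification error rate: the number of wrong predictions divided by the total number of predictions. Lower is better, which is why the objective type is `Minimize`. A minimal sketch of that calculation, using made-up labels and a hypothetical `multiclass_error` helper:

```python
def multiclass_error(y_true, y_pred):
    """Return the fraction of predictions that differ from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one prediction is required")
    wrong = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    return wrong / len(y_true)


# Made-up class labels in the 0-5 range used by this notebook.
y_true = [2, 3, 2, 4, 0, 5, 3, 2]
y_pred = [2, 3, 3, 4, 1, 5, 3, 2]

error = multiclass_error(y_true, y_pred)
print(error)  # 2 wrong out of 8 predictions -> 0.25
```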

Finally, run the tuning job.

```
tuner = HyperparameterTuner(xgb_model,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=40,
                            max_parallel_jobs=1,
                            objective_type=objective_type)

tuner.fit(inputs=data_channels, include_cls_metadata=False)
tuner.wait()
```



In [None]:
from sagemaker.parameter import (
    CategoricalParameter,
    ContinuousParameter,
    IntegerParameter,
    ParameterRange,
)
from sagemaker.amazon.hyperparameter import Hyperparameter
from sagemaker.tuner import HyperparameterTuner

container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

hyperparameter_ranges = {'alpha': ContinuousParameter(0, 1000),
                         'eta': ContinuousParameter(0.1, 0.5),
                         'min_child_weight': ContinuousParameter(1, 120),
                         'subsample': ContinuousParameter(0.5, 1),
                         'num_round': IntegerParameter(1,4000)}

objective_metric_name = 'validation:merror'
objective_type = 'Minimize'

tuner = HyperparameterTuner(xgb_model,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=40,
                            max_parallel_jobs=1,
                            objective_type=objective_type)

tuner.fit(inputs=data_channels, include_cls_metadata=False)
tuner.wait()

After the tuning job is finished, check the job and make sure that it completed successfully.

In [None]:
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']
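
If you ever need to check the status repeatedly instead of blocking on `tuner.wait()`, a polling loop is one option. The sketch below is an assumption-laden illustration: `wait_for_tuning_job` is a hypothetical helper, and a stand-in function replaces the boto3 call so that the code runs without AWS access. In a real notebook, `get_status` would wrap the `describe_hyper_parameter_tuning_job` call shown above.

```python
import time

# Statuses after which the tuning job will not change any further.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(get_status, poll_seconds=30, max_polls=100):
    """Call get_status() until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("tuning job did not reach a terminal status")


# Stand-in for the boto3 call: reports two in-progress polls, then success.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_get_status():
    return next(fake_statuses)


final_status = wait_for_tuning_job(fake_get_status, poll_seconds=0)
print(final_status)
```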

When the job is complete, there should be 40 completed training jobs, which matches the `max_jobs` setting. One of the jobs should be marked as the best.

You can examine the metrics by getting *HyperparameterTuningJobAnalytics* and loading that data into a pandas DataFrame.

In [None]:
from pprint import pprint
from sagemaker.analytics import HyperparameterTuningJobAnalytics

tuner_analytics = HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.name, sagemaker_session=sagemaker.Session())

df_tuning_job_analytics = tuner_analytics.dataframe()

# Sort the tuning job analytics by the final metrics value
df_tuning_job_analytics.sort_values(
    by=['FinalObjectiveValue'],
    inplace=True,
    ascending=(tuner.objective_type != "Maximize"))

# Show detailed analytics for the top 5 models
df_tuning_job_analytics.head()
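
The sort direction depends on the objective type: for a `Minimize` objective such as `merror`, the smallest value is the best. The same ranking logic can be sketched on plain dictionaries. The results below are made up, and `rank_jobs` is a hypothetical helper; only the key names mirror columns of the analytics DataFrame.

```python
# Made-up tuning results for three training jobs.
results = [
    {"TrainingJobName": "job-a", "FinalObjectiveValue": 0.41},
    {"TrainingJobName": "job-b", "FinalObjectiveValue": 0.36},
    {"TrainingJobName": "job-c", "FinalObjectiveValue": 0.39},
]


def rank_jobs(jobs, objective_type):
    """Order jobs best-first for a 'Minimize' or 'Maximize' objective."""
    if objective_type not in ("Minimize", "Maximize"):
        raise ValueError("objective_type must be 'Minimize' or 'Maximize'")
    return sorted(
        jobs,
        key=lambda job: job["FinalObjectiveValue"],
        reverse=(objective_type == "Maximize"),
    )


ranked = rank_jobs(results, "Minimize")
print([job["TrainingJobName"] for job in ranked])
```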


You should be able to see the hyperparameters that were used for each job, along with the score. You could use those parameters and create a model, or you can get the best model from the hyperparameter tuning job.

In [None]:
attached_tuner = HyperparameterTuner.attach(tuner.latest_tuning_job.name, sagemaker_session=sagemaker.Session())
best_training_job = attached_tuner.best_training_job()

Now, you must attach the estimator to the best training job and create the model.

In [None]:
from sagemaker.estimator import Estimator
algo_estimator = Estimator.attach(best_training_job)

best_algo_model = algo_estimator.create_model(env={'SAGEMAKER_DEFAULT_INVOCATIONS_ACCEPT':"text/csv"})

This demonstration is now complete!