# Demonstration: Training a Model Using Amazon SageMaker - Companion Notebook

This Jupyter notebook is the companion notebook for the Module 3 demonstration, *Training a Model Using Amazon SageMaker*.

## About this dataset

This demonstration uses the [Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/wine+quality) from the [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml).

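It contains physicochemical measurements and a sensory quality score for each wine sample. This notebook uses the red wine file (`winequality-red.csv`).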

## Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1. fixed acidity
2. volatile acidity
3. citric acid
4. residual sugar
5. chlorides
6. free sulfur dioxide
7. total sulfur dioxide
8. density
9. pH
10. sulphates
11. alcohol
    
Output variable (based on sensory data):

12. quality (score between 0 and 10)


## Dataset attributions
This dataset is from: 
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

## Loading the data

In [None]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# The file is semicolon-delimited; pass the separator by keyword.
df_wine = pd.read_csv(url, sep=';')

## Preprocessing

The quality scores in this dataset range from 3 to 8. XGBoost expects multiclass labels to start at 0, so these scores are mapped to the target classes 0-5.

XGBoost requires the training data to be in a single file. In the file, the target value must be the first column. 

Get the target column and move it to the first position.

In [None]:
df_wine['quality']=df_wine['quality'].map({3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5})

In [None]:
cols = df_wine.columns.tolist()
cols = cols[-1:] + cols[:-1]
df_wine = df_wine[cols]
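
The two preprocessing steps can be checked in isolation. The following is a minimal sketch that uses plain Python lists rather than the DataFrame. The column names are copied from the attribute list earlier in this notebook, and `label_map` and `reordered` are illustrative names that do not appear in the notebook cells.

```python
# Column names in the order they appear in the CSV file (target is last).
columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol", "quality",
]

# Quality scores 3-8 become class labels 0-5 (subtract the smallest score).
label_map = {score: score - 3 for score in range(3, 9)}
print(label_map)

# Same slicing as the notebook cell: last column first, then all the others.
reordered = columns[-1:] + columns[:-1]
print(reordered[:3])
```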

In [None]:
pd.set_option('display.precision', 6)

## Training a model

Start by checking the size of the dataset and viewing a sample of the rows.

In [None]:
df_wine.shape

In [None]:
df_wine.head(20)

## Splitting the data

You will start by splitting the dataset into two datasets. You will use one dataset for training, and you will split the other dataset again for use with validation and testing.

You will use the `train_test_split` function from scikit-learn, a free machine learning library for Python that provides many algorithms and utility functions.

- For more information about the function, see the [`train_test_split` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- For more information about scikit-learn, see the [scikit-learn guide](https://scikit-learn.org/stable/).

Because the dataset is small, you want each split to contain a representative share of every class. Thus, you will use the `stratify` parameter. Finally, you will set `random_state` to a fixed value so that you can repeat the splits.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train, test_and_validate = train_test_split(df_wine, 
                                            test_size=0.2, 
                                            random_state=42, 
                                            stratify=df_wine['quality'])

In [None]:
test, validate = train_test_split(test_and_validate, 
                                  test_size=0.5, 
                                  random_state=42, 
                                  stratify=test_and_validate['quality'])
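
Stratification can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's implementation: it groups row indices by class, shuffles each group with a seeded generator, and takes the same fraction from every group. The `stratified_split` helper and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx) with roughly the same class mix in both."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with an uneven class mix: 10%, 50%, and 40%.
labels = [0] * 10 + [1] * 50 + [2] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both index sets keep the 10/50/40 class mix, and calling the function again with the same seed returns the same split.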

You can see the size of each dataset based on the split.

In [None]:
print(train.shape)
print(test.shape)
print(validate.shape)
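
As a rough check on the printed shapes, the split sizes can be worked out by hand. This sketch assumes that the red wine file has 1,599 rows and that the held-out portion is rounded up when `test_size` is a fraction. Both are assumptions to verify against your own output.

```python
import math

n_rows = 1599  # assumed number of rows in winequality-red.csv

# First split: 20% is held out, and the remainder is used for training.
n_holdout = math.ceil(0.2 * n_rows)
n_train = n_rows - n_holdout

# Second split: the held-out rows are divided evenly.
n_validate = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_validate

print(n_train, n_test, n_validate)
```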

You can see the distribution of the target from each dataset.

In [None]:
t1 = train['quality'].value_counts()
t2 = test['quality'].value_counts()
t3 = validate['quality'].value_counts()
# Label each column with the split it came from.
result = pd.concat([t1, t2, t3], axis=1, sort=False, keys=['train', 'test', 'validate'])
result

## Uploading to Amazon S3

XGBoost will load the data for training from Amazon Simple Storage Service (Amazon S3). Thus, you must write the data to a comma-separated values (CSV) file, and then upload the file to Amazon S3.

Start by setting up some variables for the S3 bucket, then create a function to upload the CSV file to Amazon S3. You can reuse this function.

First, explore the function.

Note the following line:

`dataframe.to_csv(csv_buffer, header=False, index=False)`

This line writes the pandas DataFrame (which was passed into the function) into the I/O buffer that's named *csv_buffer*. You use a buffer because you don't need to write the file locally.

To stop the column headers from being written out, use `header=False`. To stop the pandas index from being output, use `index=False`.

To write the contents of `csv_buffer` to Amazon S3 as an object, call `put` on an `Object` resource, which you obtain from the `Bucket` resource.



In [None]:
import boto3
import io
import os

In [None]:
bucket='c45317a617679l1523854t1w00381652629-sandboxbucket-3apxi73oxsw6'
prefix='mod03-demo-training-a-model'
train_file='wine_train.csv'
test_file='wine_test.csv'
validate_file='wine_validate.csv'
whole_file='wine.csv'
s3_resource = boto3.Session().resource('s3')

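def upload_s3_csv(filename, folder, dataframe):
    """Write dataframe as header-less CSV to s3://bucket/prefix/folder/filename."""
    csv_buffer = io.StringIO()  # in-memory text buffer, so no local file is needed
    dataframe.to_csv(csv_buffer, header=False, index=False)
    key = os.path.join(prefix, folder, filename)  # object key within the bucket
    s3_resource.Bucket(bucket).Object(key).put(Body=csv_buffer.getvalue())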

upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)
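
The in-memory buffer pattern in `upload_s3_csv` can be tried without AWS access. This sketch swaps `DataFrame.to_csv` for the standard-library `csv.writer` so that it has no dependencies. The rows, the prefix, and the `rows_to_csv_text` helper are hypothetical. It shows that the text returned by `getvalue()` has no header row and starts each line with the target value, which is the kind of string the notebook passes as `Body`.

```python
import csv
import io

def rows_to_csv_text(rows):
    """Write rows to an in-memory text buffer and return the CSV text."""
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer, lineterminator="\n")
    writer.writerows(rows)  # no header row is written
    return csv_buffer.getvalue()

# Hypothetical rows: target class first, then two feature values.
rows = [(2, 7.4, 0.70), (3, 7.8, 0.88)]
body = rows_to_csv_text(rows)
print(body)

# S3 object keys use forward slashes to separate the "folders".
key = "/".join(["example-prefix", "train", "wine_train.csv"])
print(key)
```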

## Create the estimator

Now that the data is in Amazon S3, you can train a model.

The first step is to get the XGBoost container URI.

In [None]:
from sagemaker.image_uris import retrieve
import sagemaker
role=sagemaker.get_execution_role()
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

Next, set the hyperparameters. The only value to point out is `num_class`, which is set to `6` to match the number of target classes in the dataset.

In [None]:
hyperparams={
    "num_round":"40",
    "num_class":"6",
    "objective":"multi:softmax"}

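To make the `multi:softmax` objective concrete: the model produces one raw score per class, softmax turns the scores into probabilities, and the predicted label is the class with the highest probability. The sketch below illustrates that idea in plain Python. It is not XGBoost's internal code, and the six scores are made up.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# One made-up score per class; num_class=6 means there are six of them.
scores = [0.1, 0.4, 2.0, 1.5, 0.3, -1.0]
probabilities = softmax(scores)

# The predicted label is the index of the largest probability.
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)
```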
Use the `Estimator` class to set up the model. Here are a few parameters of interest:

- **instance_count** - Defines how many instances will be used for training. You will use *one* instance.
- **instance_type** - Defines the instance type for training. In this case, it's *ml.m5.xlarge*.

In [None]:
xgb_model=sagemaker.estimator.Estimator(container,
                                        role,
                                        instance_count=1,
                                        instance_type='ml.m5.xlarge',
                                        output_path=s3_output_location,
                                        hyperparameters=hyperparams,
                                        sagemaker_session=sagemaker.Session())

## Creating the input channels

The estimator needs *channels* to feed data into the model. For training, the *train_channel* and the *validate_channel* will be used.

In [None]:
# Each channel points to an S3 prefix (folder), not to a single file.
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}
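
Each channel points at an S3 prefix rather than a single file, so the training job reads the objects stored under that prefix. As a sketch, the URIs can be built and inspected before any SageMaker object is created. The bucket name `example-bucket` and the `channel_uri` helper are hypothetical.

```python
def channel_uri(bucket_name, key_prefix, folder):
    """Return the S3 prefix that holds the CSV file(s) for one channel."""
    return "s3://{}/{}/{}/".format(bucket_name, key_prefix, folder)

demo_bucket = "example-bucket"  # hypothetical bucket name
demo_prefix = "mod03-demo-training-a-model"

# The channel names are 'train' and 'validation'; the folders match the uploads.
uris = {
    "train": channel_uri(demo_bucket, demo_prefix, "train"),
    "validation": channel_uri(demo_bucket, demo_prefix, "validate"),
}
for name, uri in uris.items():
    print(name, uri)
```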

## Training the model

Running `fit` will train the model.

**Note:** This process can take up to 5 minutes.

In [None]:
xgb_model.fit(inputs=data_channels, logs=False)

## Viewing the metrics from the training job

After the job is complete, you can view the metrics from the training job.

In [None]:
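# Retrieve the error metrics that the training job emitted.
s = sagemaker.analytics.TrainingJobAnalytics(
    xgb_model.latest_training_job.name,
    metric_names=['train:merror', 'validation:merror'])

s_df = s.dataframe()
s_df = s_df.iloc[:, 1:3]  # keep only the metric name and value columns
s_df
# merror is the multiclass error rate: (wrong cases) / (all cases)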
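
The `merror` metric is the multiclass classification error rate: the number of wrong predictions divided by the total number of predictions. The following is a minimal sketch with made-up labels. The `merror` helper is illustrative and is not a SageMaker or XGBoost function.

```python
def merror(actual, predicted):
    """Fraction of predictions that do not match the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Made-up class labels for eight samples; three predictions are wrong.
actual = [2, 3, 3, 4, 2, 1, 3, 2]
predicted = [2, 3, 2, 4, 2, 2, 3, 3]

error = merror(actual, predicted)
print(error)      # error rate
print(1 - error)  # accuracy is the complement of the error rate
```

A lower `validation:merror` means that the model classifies more of the held-out validation rows correctly.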

This demonstration is now complete!