# DSE230: Programming Assignment 5 - XGBoost using SageMaker 

## Classification on Amazon SageMaker

Perform a classification task on the given dataset.<br>
Using the given features, you will train an XGBoost decision tree model to predict a person's salary (the `WAGP` column), which will be categorized into multiple bins.<br>

--- 

#### Tasks: 

- Perform Exploratory Data Analysis on the given dataset
- Save preprocessed datasets to Amazon S3
- Use the Amazon SageMaker platform to train an XGBoost model
- Evaluate the model on the test set
- Perform hyperparameter tuning on the XGBoost model
- Submit
  - Submit this Jupyter Notebook (`.ipynb`) to "PA5"
  - Screenshot of SageMaker dashboard showing no running jobs (nothing should be in green).
  - Make sure all the cell outputs are present in the notebook
  - You can put both the `.ipynb` and the screenshot in a `.zip` file for submission.
  
#### Due date: Thursday 6/10/2021 at 11:59 PM PST

---

Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame. 

Pandas API documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

Amazon SageMaker API documentation: https://sagemaker.readthedocs.io/en/stable/

Amazon SageMaker Tutorials: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html

---

### 1. Import packages and Get Amazon IAM execution role & instance region

In [None]:
import os, sagemaker
from sagemaker import get_execution_role
from sklearn.model_selection import train_test_split

Make sure to create an S3 bucket, or reuse one from a prior exercise.

In [None]:
# Define IAM role- this will be necessary when defining your model
iam_role = get_execution_role()

# Set SageMaker session handle
sess = sagemaker.Session()

# set the region of the instance and get a reference to the client
region = sess.boto_session.region_name

bucket = << BUCKET NAME >>

print('Using bucket ' + bucket)
print("Success - the SageMaker instance is in the " + region + " region")

### 2. Read data.

NOTE - Upload the data to your S3 bucket before this step. Make sure it is in `.csv` format

In [None]:
import pandas as pd
import pickle

# Read data from the S3 bucket
file_path = << PATH TO S3 OBJECT >>

df = pd.read_csv(file_path)
df.head()

### Description of Columns

There are lots of columns in the original dataset. However, we'll only use the following columns whose descriptions are given below.


AGEP -  Age

COW - Class of worker

WAGP - Wages or salary income past 12 months

JWMNP - Travel time to work

JWTR - Means of transportation to work

MAR - Marital status

PERNP - Total person's earnings

NWAV - Available for work

NWLA - On layoff from work

NWLK - Looking for work

NWAB - Temporary absence from work

SCHL - Educational attainment

WKW - Weeks worked during past 12 months

Task:
* Select the given column names below.

In [None]:
colNames = ['AGEP', 'COW', 'WAGP', 'JWMNP', 'JWTR', 'MAR', 'PERNP', 'NWAV', 
            'NWLA', 'NWLK', 'NWAB', 'SCHL', 'WKW']

<< YOUR CODE HERE >>

### 3. Filtering data

Find the correlation of the WAGP value with all other features.
You can use the following technique for finding correlation between two columns:

`df['col_1'].corr(df['col_2'])` gives you the correlation between col_1 and col_2.

Your task is to find the correlation between WAGP and all other columns.
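
As a rough illustration of the technique (not the assignment solution), the sketch below applies `Series.corr` to every non-target column of a small made-up DataFrame. The name `toy` and all of its values are hypothetical; only the column names are borrowed from the list above.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has far more rows and columns.
toy = pd.DataFrame({
    "WAGP":  [10000, 20000, 30000, 40000, 50000],
    "PERNP": [11000, 21000, 30000, 42000, 50000],
    "AGEP":  [60, 25, 40, 33, 51],
})

target = "WAGP"

# Pearson correlation of each remaining column with the target column.
correlations = {
    col: toy[col].corr(toy[target])
    for col in toy.columns
    if col != target
}

# Print the columns ordered by absolute correlation, strongest first.
for col, value in sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{col}: {value:.3f}")
```

Sorting by absolute value makes strongly negative correlations as visible as strongly positive ones.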

In [None]:
<< YOUR CODE HERE >>

From the results of the above cell, you should see that `PERNP` is highly correlated with `WAGP`.
Because of this, remove the `PERNP` column from the dataset.

In [None]:
colNames = ['AGEP', 'COW', 'WAGP', 'JWMNP', 'JWTR', 'MAR', 'NWAV', 'NWLA', 'NWLK', 'NWAB', 'SCHL', 'WKW']

<< YOUR CODE HERE >>

See the statistics of the target variable. Use the `.describe()` method to see the statistics of the WAGP column.

In [None]:
<< YOUR CODE HERE >>

### 4. Outlier Removal

Remove outlier rows based on values in the `WAGP` column. This will be an important step that impacts our model's predictive performance in the classification step below.

Based on the statistics above, we need an **upper limit** to filter out significant outliers.
We'll filter out all the data points for which WAGP is more than the mean + 3 standard deviations.

Your tasks:
1. Filter the dataframe using a calculated upper limit for WAGP

Expected Output:
1. Number of outlier rows removed from DataFrame

Instructions:
* Find the mean ($\mu$) and standard deviation($\sigma$) of the column `WAGP`
* Set `upperLimit` to 3 standard deviations from the mean i.e. $upperLimit = \mu + 3 \sigma$
* Filter the dataframe so that values in `WAGP` column are less than the upper limit i.e. `df['WAGP'] < upperLimit`
* Print the difference in length of original dataframe and the filtered dataframe
* For the following tasks after this step, you will use the filtered dataframe
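
The filtering rule above can be sketched on made-up numbers as follows. The DataFrame `toy` and its wage values are hypothetical and chosen only so that exactly one value lies far outside the rest; note that pandas' `.std()` computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical wages: 30 ordinary values plus one extreme value.
toy = pd.DataFrame({"WAGP": [30000 + 1000 * i for i in range(30)] + [5_000_000]})

mu = toy["WAGP"].mean()
sigma = toy["WAGP"].std()          # sample standard deviation (ddof=1)
upper_limit = mu + 3 * sigma       # upperLimit = mu + 3 * sigma

# Keep only the rows strictly below the upper limit.
filtered = toy[toy["WAGP"] < upper_limit]

print("upper limit:", round(upper_limit, 2))
print("rows removed:", len(toy) - len(filtered))
```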

In [None]:
<< YOUR CODE HERE >>

### 5. Dropping NAs

Drop rows with any nulls in any of the columns.<br>
Print the resulting DataFrame's row count.

**Note**: The more features you choose, the more rows with nulls you will drop. This may be desirable if you are running into memory problems<br>

Your tasks:
1. Drop rows with any nulls

Expected Output: 
1. Number of rows in cleaned DataFrame
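
To make the note about feature count concrete, here is a small sketch on hypothetical data (`toy` and its values are invented). It compares dropping nulls across all columns with dropping nulls across a subset of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with nulls scattered across two columns.
toy = pd.DataFrame({
    "AGEP":  [25, 40, 33, 51],
    "JWMNP": [15.0, np.nan, 30.0, 45.0],
    "WKW":   [1.0, 1.0, np.nan, 3.0],
})

# Drop every row that has a null in any column.
cleaned_all = toy.dropna()

# With fewer columns selected, fewer rows contain a null.
cleaned_subset = toy[["AGEP", "JWMNP"]].dropna()

print(len(toy), len(cleaned_all), len(cleaned_subset))
```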

In [None]:
df_cleaned = << YOUR CODE HERE >>

### 6. Discretize salary

We want to convert the WAGP column, which contains continuous values, into a column with discrete labels so that we can use it as the label column for our classification problem. 
We're essentially turning a regression problem into a classification problem. Instead of predicting a person's exact salary, we're predicting the range in which that person's salary lies.

Note that labels are integers and should start from 0. 

XGBoost expects that the Label column (WAGP_CAT) is the first column in the dataset.

Your tasks:
1. Make a new column for discretized labels with 5 bins. Recommended column name is `WAGP_CAT`
    - XGBoost expects that the Label column (WAGP_CAT) is the first column in the dataset.
    - Remember to put your label column as the first column in the dataframe, otherwise training won't run!
2. Examine the label column - plot a histogram of the `WAGP_CAT` column values

Expected Output: 
1. The first 5 rows of the dataframe with the discretized label column. The label column must be the first column in the dataframe. 
2. A histogram from the discretized label column

* Categorize the labels into multiple bins - 5 bins in this case
* Look up the pd.cut() function to see how the WAGP column is converted to different bins
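
A minimal sketch of how `pd.cut` behaves with an integer `bins` argument, using a hypothetical Series named `wages`. With `bins=5`, pandas splits the observed range into five equal-width intervals, so the resulting classes need not contain equal numbers of rows.

```python
import pandas as pd

# Hypothetical wages spanning 0 to 100,000.
wages = pd.Series([0, 10_000, 25_000, 49_000, 50_000, 75_000, 100_000])

# Without labels, pd.cut returns the interval each value falls into.
intervals = pd.cut(wages, bins=5)

# With labels, each interval is replaced by the matching integer label.
labels = pd.cut(wages, bins=5, labels=[0, 1, 2, 3, 4])

print(pd.DataFrame({"wage": wages, "interval": intervals, "label": labels}))
print(labels.value_counts().sort_index())
```

Because the bins are equal in width rather than equal in frequency, a skewed wage distribution will usually produce imbalanced classes, which is worth keeping in mind when reading the histogram.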

In [None]:
import matplotlib.pyplot as plt

df_cleaned['WAGP_CAT'] = pd.cut(df_cleaned['WAGP'], bins=5, labels=[0,1,2,3,4])

# Plot a histogram of the WAGP_CAT column
<< YOUR CODE HERE >>

df_cleaned.head(5)

Rearrange the columns so that the `WAGP_CAT` column comes first, and drop `WAGP` (keeping it would make the problem trivial). XGBoost expects labels to be in the first column. The code has been given for you.

In [None]:
cols = df_cleaned.columns.tolist()
df_cleaned = df_cleaned[cols[-1:] + cols[:-1]].drop('WAGP', axis=1)
df_cleaned.head()

### 7. Splitting data and converting to CSV

Split the dataset into train, validation, and test sets using sklearn's `train_test_split`.
Look up the API definition of `train_test_split` to see what values you need to pass.
First, we'll split the `df_cleaned` dataframe into two parts, `train_data` and `val_data`, in an 80:20 ratio. Then
we'll split `train_data` into `train_data` and `test_data` in a 90:10 ratio.

Use the following parameters for train_test_split:
* `random_state = 42`
* `shuffle = True`
* `train_size = 0.8`, `test_size = 0.2` for the first split
* `train_size = 0.9`, `test_size = 0.1` for the second split
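
The two-stage split can be sketched on a hypothetical 100-row DataFrame as shown below. The names `toy`, `train_part`, `val_part`, and `test_part` are illustrative and not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row dataset with a label column and one feature.
toy = pd.DataFrame({"label": [i % 5 for i in range(100)], "feature": range(100)})

# First split: 80% train, 20% validation.
train_part, val_part = train_test_split(
    toy, train_size=0.8, test_size=0.2, random_state=42, shuffle=True)

# Second split: 90% of the remaining train rows stay in train, 10% become test.
train_part, test_part = train_test_split(
    train_part, train_size=0.9, test_size=0.1, random_state=42, shuffle=True)

print(len(train_part), len(val_part), len(test_part))
```

On 100 rows this yields 72 training, 20 validation, and 8 test rows, so the final proportions are 72:20:8 rather than 80:20 plus a separate 10%.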

In [None]:
train_data, val_data = << YOUR CODE HERE >>

train_data, test_data = << YOUR CODE HERE >>

len(train_data), len(val_data), len(test_data)

### Write prepared data to files.
Refer to the demo to write `train_data`, `val_data`, and `test_data` to CSV files using the `.to_csv()` method.
Use `index = False` and `header = False` as the parameters.
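
The sketch below shows what `index=False` and `header=False` produce. It writes a hypothetical DataFrame to an in-memory buffer instead of a file; the buffer is used here only so that the output can be printed.

```python
import io
import pandas as pd

# Hypothetical prepared data with the label in the first column.
toy = pd.DataFrame({"WAGP_CAT": [0, 3, 1], "AGEP": [25, 40, 33], "WKW": [1.0, 1.0, 3.0]})

# An in-memory buffer stands in for a file path such as 'train_data.csv'.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, header=False)

csv_text = buffer.getvalue()
print(csv_text)
```

The result contains only the data rows, with no column names and no row index, and the label is the first field of every line.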

In [None]:
<< YOUR CODE HERE >>

### 8. Save processed data to S3

This step is needed for using XGBoost with Amazon SageMaker. Send the data to S3; SageMaker will read the training data from there.

In [None]:
prefix = "data"
key_prefix = prefix + "/model_data"

trainpath = sess.upload_data(
    path='train_data.csv', bucket=bucket,
    key_prefix=key_prefix)

valpath = sess.upload_data(
    path='val_data.csv', bucket=bucket,
    key_prefix=key_prefix)

testpath = sess.upload_data(
    path='test_data.csv', bucket=bucket,
    key_prefix=key_prefix)

In [None]:
trainpath, valpath, testpath

### 9. Create channels for train and validation data to feed to model
Set up data channels for the training, validation, and test data as shown in the demo.
You'll have to use the `TrainingInput` class (from `sagemaker.inputs`) and pass the `s3_data` and `content_type` parameters.

In [None]:
s3_input_train = << YOUR CODE HERE >>
s3_input_val = << YOUR CODE HERE >>
s3_input_test = << YOUR CODE HERE >>

Set model output location as shown in the demo.

In [None]:
output_location = "s3://{}/{}/model".format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

### 10. Create the XGBoost model
We'll create the XGBoost model, and set its hyperparameters.

In [None]:
from sagemaker import image_uris
xgb_image = image_uris.retrieve(framework="xgboost", region=region, version='latest')

### Create an Estimator using sagemaker.estimator.Estimator.
You'll need to pass the xgb_image and the iam_role parameters.

Use the following values for other parameters:
* `instance_count = 1`
* `instance_type = 'ml.m5.xlarge'`
* `output_path = output_location`
* `sagemaker_session = sess`

In [None]:
xgb = << YOUR CODE HERE >>

### 11. Set model hyperparameters
Set the hyperparameters for the model. You'll have to use the `set_hyperparameters()` method.
Refer to the demo for how it's done.

Read the below references for more information:
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters

Use the following values for the parameters:
* `num_class = 5`
* `max_depth = 2`
* `min_child_weight = 2`
* `early_stopping_rounds=5`
* `objective='multi:softmax'`
* `num_round=100`

In [None]:
<< YOUR CODE HERE >>

### 12. Train model using train and validation data channels
Use the `.fit()` method to fit the model using the training and validation data channels. 
Execute the XGBoost training job.

NOTE:  This step may take several minutes

In [None]:
%%time

<< YOUR CODE HERE >>

### 13. Deploying model
Deploy the model so that it can be used for inference.

Use the .deploy() method to deploy your model.

Use the following values for the parameters:

* `initial_instance_count = 1`
* `instance_type = 'ml.t2.medium'`
* `serializer = sagemaker.serializers.CSVSerializer()`

NOTE:  This step may take several minutes

In [None]:
%%time

xgb_predictor = << YOUR CODE HERE >>

### 14. Testing the model on test data

* Store the values in `WAGP_CAT` column of test_data in `y_true` variable
* Drop `WAGP_CAT` column from the test_data. Convert the resulting dataframe to an array using `.values`
* Use the deployed model (from the previous step) to get the predictions on the test data
* Store the value of predictions in `y_pred`
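
The endpoint response usually needs to be parsed before it can be compared with `y_true`. The sketch below assumes the response is a bytes object holding comma-separated or newline-separated floats. That format is an assumption, so inspect the raw response from your own endpoint first. `parse_predictions` and `fake_response` are hypothetical names, and no endpoint is called here.

```python
import numpy as np

def parse_predictions(raw):
    """Convert a CSV-style response (bytes or str) into an integer label array."""
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Treat newlines as separators too, since the delimiter is an assumption.
    tokens = text.replace("\n", ",").split(",")
    return np.array([float(tok) for tok in tokens if tok.strip()]).astype(int)

# Hypothetical response standing in for the output of a predictor call.
fake_response = b"0.0,2.0,1.0,4.0"
parsed_labels = parse_predictions(fake_response)
print(parsed_labels)
```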

In [None]:
<< YOUR CODE HERE >>

### 15. Confusion matrix and classification report

Use the `confusion_matrix` and the `classification_report` methods to see how your model performs on the test set.
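
As a brief illustration of the two functions, the sketch below evaluates hypothetical label lists (`y_true_demo` and `y_pred_demo` are invented, not model output). In the matrix, rows correspond to true labels and columns to predicted labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted labels for five classes.
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 4]
y_pred_demo = [0, 1, 1, 1, 2, 0, 3, 4]

class_labels = [0, 1, 2, 3, 4]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_labels)
print(cm)

# Per-class precision, recall, and f1-score.
print(classification_report(y_true_demo, y_pred_demo, labels=class_labels, zero_division=0))
```

Passing `labels` explicitly keeps the matrix at 5x5 even if some class never appears in a small test set.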

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

<< YOUR CODE HERE >>

### IMPORTANT: DELETE THE ENDPOINT

Delete the endpoint once it has served its purpose.

In [None]:
xgb_predictor.delete_endpoint()

### 16. Hyperparameter tuning

Read through the following links for more information:
https://sagemaker.readthedocs.io/en/stable/api/training/tuner.html
https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-supports-random-search-and-hyperparameter-scaling/

We'll tune two hyperparameters:

1. min_child_weight
2. max_depth

We'll use a `Random` search strategy since that's more effective than searching all possible combinations of hyperparameters. The code has been given for you.

In [None]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Define exploration boundaries
hyperparameter_ranges = {
    'min_child_weight': IntegerParameter(1, 10),
    'max_depth': IntegerParameter(1, 10)
}

# create Optimizer
Optimizer = HyperparameterTuner(
    estimator=xgb,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name='XGBoost-Tuner',
    objective_type='Minimize',
    objective_metric_name='validation:merror',
    max_jobs=5,
    max_parallel_jobs=5,
    strategy='Random')

Now that we have created the Optimizer, we need to call `.fit()` on it to start the tuning job.

Refer to the demo and see how to call `fit()` and pass the appropriate data channels.

In [None]:
%%time

<< YOUR CODE HERE >>

### 17. Results of tuning job

Get the tuner results in a dataframe. The code is given to you for getting the results of the tuning job in a dataframe.

In [None]:
results = Optimizer.analytics().dataframe()
results

See the best hyperparameters found by the optimizer.
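
One way to read the best configuration out of a results table is to pick the row with the lowest objective value, since the tuner above minimizes `validation:merror`. The sketch below uses a made-up DataFrame, `results_demo`, in place of the real results. It assumes the objective is stored in a column named `FinalObjectiveValue`, so confirm that against the columns of your own `results` dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the tuner results; all values are invented.
results_demo = pd.DataFrame({
    "max_depth": [2.0, 7.0, 4.0],
    "min_child_weight": [9.0, 3.0, 5.0],
    "FinalObjectiveValue": [0.31, 0.24, 0.27],
})

# The objective is an error rate, so the best job has the smallest value.
best_row = results_demo.loc[results_demo["FinalObjectiveValue"].idxmin()]
print(best_row[["max_depth", "min_child_weight"]])
```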

In [None]:
<< YOUR CODE HERE >>

### 18. Deploy the tuned model.

Use the `.deploy()` method to deploy the best model found by the Optimizer.
Calling `Optimizer.deploy()` deploys the best model it found.

Use these parameters when calling deploy:
* `initial_instance_count=1`
* `instance_type= 'ml.t2.medium'`
* `serializer = sagemaker.serializers.CSVSerializer()`

Refer to the demo if you are unsure of what to do.

In [None]:
tuned_model_predictor = << YOUR CODE HERE >>

### 19. Test the tuned model on test data

* Use the deployed model (from the previous step) to get the predictions on the test data
* Store the value of predictions in `y_pred`

In [None]:
<< YOUR CODE HERE >>

### 20. Confusion matrix and classification report
Use the `confusion_matrix` and the `classification_report` methods to see how your model performs on the test set.

You should see that the tuned model gives a better f1-score for each (or most) of the classes. If not, something has probably gone wrong in an earlier step.

HINT - Follow instructions similar to section **14. Testing the model on test data**

In [None]:
<< YOUR CODE HERE >>

### IMPORTANT: DELETE THE ENDPOINT


In [None]:
tuned_model_predictor.delete_endpoint()

### 21. Screenshot of everything terminated.

You need to submit a screenshot of terminated endpoints and notebook instances once you are done with the assignment. Nothing should be in green in this screenshot since all running instances are shown in green.

You can take the screenshot of the Dashboard once you go to Amazon SageMaker.