# Student Evasion Demonstration

In this notebook we use the **Student Performance Data Set** (https://archive.ics.uci.edu/ml/datasets/student+performance) from **UCI** as the basis for our study. We will create a classification model to predict student evasion (dropout) based on historical data.

In [None]:
import boto3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.predictor import csv_serializer
from sklearn.metrics import accuracy_score

role = get_execution_role()
bucket = 'martinig-models'
prefix = 'student-evasion'
bucket_path = 's3://{}/{}'.format(bucket,prefix)

## Preparing Data

First, we are going to explore the dataset to set up the demonstration data.

In [None]:
df = pd.read_csv('{}/dataset/{}'.format(bucket_path, 'student-por.csv'), sep=';')
print(df.shape)
df.head()

In [None]:
df.describe()

In [None]:
plt.hist(x=df['G3'], bins='auto', color='#0504aa',
                            alpha=0.7, rwidth=0.85)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Final Grade Histogram')

For this demonstration, we are going to reduce the dataset to only 5 features (columns):
* sex
* age
* absences
* final_grade (G3)
* evasor

Also, we are going to fake the label, classifying every student with a **final_grade** lower than **12** (min=0, max=20) as an evasor (not evasor=0, evasor=1).

*__note: This is not a real scenario; we are faking this data for demonstration purposes only. In a real scenario we would need to evaluate which features are relevant for making decisions.__*

In [None]:
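# selecting features (copy to avoid modifying a view of the original frame)
df = df[['sex', 'age', 'absences', 'G3']].copy()

# renaming the G3 column to final_grade
df = df.rename(columns={"G3": "final_grade"})

# creating new column for classification:
# evasor=1 when final_grade is lower than 12, otherwise evasor=0
df['evasor'] = 0
df.loc[df['final_grade'] < 12, 'evasor'] = 1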

print('Shape:', df.shape)
df.head()

In [None]:
df['evasor'].value_counts()

In [None]:
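# 'sex' is still a text column at this point, so correlate only the numeric columns
sns.heatmap(df.select_dtypes('number').corr(), linewidths=1, cmap='Purples', annot=True)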

Because we derived the label directly from the grade, we have a **high** correlation between **evasion** and **final_grade**.
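The strength of this correlation follows from how the label was built. A minimal sketch with made-up grades (not taken from the dataset) and a hand-written Pearson helper, `pearson`, which is an illustrative name rather than part of the notebook:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical final grades on the 0-20 scale (illustrative values only)
grades = [4, 8, 10, 11, 12, 13, 15, 18]
# Rule described in the text: a grade lower than 12 is labelled evasor (1)
evasor = [1 if g < 12 else 0 for g in grades]

r = pearson(grades, evasor)
print(round(r, 2))  # strongly negative: lower grades go with evasor = 1
```

The sign is negative because low grades map to the label 1; the heatmap shows the same relationship as a large absolute value.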

## Preparing for training

For this classification, we are going to use the **XGBoost** algorithm (https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d).

Amazon SageMaker XGBoost can train on data in either a CSV or LibSVM format. For this example, with CSV, the data should:
* Have the target variable (the label to predict) in the first column
* Not have a header row
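These two layout requirements can be checked locally before uploading anything. The following is a minimal sketch using only the standard library; the helper name `looks_like_xgboost_csv` and the sample rows are hypothetical, and the check is a sanity test of our own, not something SageMaker provides:

```python
import csv
import io

def looks_like_xgboost_csv(text, labels=("0", "1")):
    """Return True if every row starts with an allowed label and all fields are numeric."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return False
    for row in rows:
        if not row or row[0] not in labels:
            return False          # the first column must hold the target
        try:
            [float(value) for value in row]
        except ValueError:
            return False          # a header row or a text feature fails here
    return True

# Columns: evasor, sex, age, absences, final_grade (illustrative values)
good = "1,0,18,4,10\n0,1,17,2,14\n"
with_header = "evasor,sex,age,absences,final_grade\n1,0,18,4,10\n"

print(looks_like_xgboost_csv(good))         # True
print(looks_like_xgboost_csv(with_header))  # False
```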

In [None]:
# transforming categorical to numeric
le = preprocessing.LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])

print(df.info())
df.head()

In [None]:
# move the target variable (evasor) to the first column
cols = list(df.columns)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
df.head()

In [None]:
# let's shuffle the data and split it into training (70%), validation (20%) and test (10%) sets
train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=1729), [int(0.7 * len(df)), int(0.9 * len(df))])
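
The split above shuffles the rows and then cuts them at the 70% and 90% positions. The same cut points can be sketched in plain Python to see the resulting sizes; `split_70_20_10` is an illustrative helper and the 100 integer "rows" are a stand-in for the real dataset:

```python
import random

def split_70_20_10(rows, seed=1729):
    """Shuffle rows, then cut at 70% and 90% to get train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    first_cut = int(0.7 * len(shuffled))
    second_cut = int(0.9 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]

train, validation, test = split_70_20_10(range(100))
print(len(train), len(validation), len(test))  # 70 20 10
```

Passing two indices to `np.split` produces the same three slices: everything before the first index, everything between the two, and everything from the second index onwards.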

In [None]:
# saving train and validation csv without header
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)

# uploading train and validation data to S3
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

In [None]:
# creating SageMaker session
sess = sagemaker.Session()

# set up the algorithm container
container = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')

In [None]:
# input paths
s3_input_train = sagemaker.s3_input(s3_data='{}/train'.format(bucket_path), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='{}/validation/'.format(bucket_path), content_type='csv')



Now, we can specify a few parameters like what type of training instances we'd like to use and how many, as well as our XGBoost hyperparameters. A few key hyperparameters are:

* max_depth controls how deep each tree within the algorithm can be built. Deeper trees can lead to better fit, but are more computationally expensive and can lead to overfitting. There is typically some trade-off in model performance that needs to be explored between a large number of shallow trees and a smaller number of deeper trees.
* subsample controls sampling of the training data. This technique can help reduce overfitting, but setting it too low can also starve the model of data.
* num_round controls the number of boosting rounds. Each round trains an additional model on the residuals of the previous iterations. Again, more rounds should produce a better fit on the training data, but can be computationally expensive or lead to overfitting.
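* eta controls how aggressive each round of boosting is by shrinking the contribution of each new tree (the learning rate). Smaller values lead to more conservative boosting and usually require more rounds.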
* gamma controls how aggressively trees are grown. Larger values lead to more conservative models.

More detail on XGBoost's hyperparameters can be found in the AWS documentation ( https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html ).
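The interaction between eta and num_round can be illustrated with a deliberately simplified toy. This is not XGBoost's algorithm: the "weak learner" here is just the mean of the residuals instead of a tree, and the function name `boost_constant` and the target values are made up for the illustration. It only shows how shrinking each update by eta slows down how quickly the prediction approaches the targets:

```python
def boost_constant(targets, eta, num_round):
    """Toy boosting for squared error: each round fits a constant to the residuals."""
    prediction = 0.0
    for _ in range(num_round):
        residuals = [t - prediction for t in targets]
        weak_learner = sum(residuals) / len(residuals)  # best constant fit to residuals
        prediction += eta * weak_learner                # shrink the update by eta
    return prediction

targets = [8.0, 10.0, 12.0]  # hypothetical values with mean 10
for eta in (0.05, 0.2, 1.0):
    print(eta, round(boost_constant(targets, eta, num_round=10), 3))
```

With a small eta the prediction is still far from the mean of the targets after 10 rounds, while eta=1.0 reaches it in a single round. In a real model the slower path is often preferred because it is less prone to overfitting, at the cost of more rounds.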


## 01) Training the Model without hyperparameter tuning

### Training

In [None]:
xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.m5.xlarge',
                                    output_path='{}/output'.format(bucket_path),
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

### Deploying

In [None]:
# create endpoint - use this if you have not deployed the model yet
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m5.xlarge')

In [None]:
# get predictor endpoint - use this instead if you have already deployed your model
#endpoint_name = 'sagemaker-xgboost-2019-10-16-04-21-13-089'
#xgb_predictor = sagemaker.predictor.RealTimePredictor(endpoint=endpoint_name, sagemaker_session=sess)

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

### Evaluating

In [None]:
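def predictPredictor(data, rows=500):
    # split the data into batches of at most `rows` rows to keep each request small
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # the endpoint response is decoded and appended as comma-separated scores
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])

    # drop the leading comma and parse the scores into a float array
    return np.fromstring(predictions[1:], sep=',')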
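
The batching logic can be tried offline by replacing the endpoint with a stand-in. In this sketch `fake_predict` is hypothetical and only mimics the comma-separated byte response that the function above decodes; `predict_in_batches` uses fixed-size chunks rather than `np.array_split`, which is a simplification:

```python
def fake_predict(batch):
    """Hypothetical stand-in for xgb_predictor.predict: one score per row, as CSV bytes."""
    return ','.join('0.9' if row[-1] < 12 else '0.1' for row in batch).encode('utf-8')

def predict_in_batches(rows, predict, batch_size=500):
    """Call predict on chunks of at most batch_size rows and return all scores as floats."""
    scores = []
    for start in range(0, len(rows), batch_size):
        response = predict(rows[start:start + batch_size]).decode('utf-8')
        scores.extend(float(value) for value in response.split(','))
    return scores

# Columns: sex, age, absences, final_grade (illustrative values)
rows = [[0, 18, 4, 10], [1, 17, 2, 14], [0, 16, 0, 11]]
print(predict_in_batches(rows, fake_predict, batch_size=2))  # [0.9, 0.1, 0.9]
```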

In [None]:
test_data['evasor'].value_counts()

In [None]:
# use predictor
predictions = predictPredictor(test_data.values[:, 1:])

pd.crosstab(index=test_data.iloc[:, 0], columns=np.round(predictions), rownames=['actual'], colnames=['predictions'])

In [None]:
predictions.round()

In [None]:
print('Accuracy:', accuracy_score(test_data['evasor'], predictions.round())*100,'%')
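
With the `binary:logistic` objective the endpoint returns a score between 0 and 1 for each row, so rounding it amounts to applying a 0.5 threshold. A minimal sketch of the crosstab and accuracy computed by hand; the scores and labels are made up and `confusion_counts` is an illustrative helper, not part of the notebook:

```python
def confusion_counts(actual, scores, threshold=0.5):
    """Count (actual, predicted) label pairs after thresholding the scores."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for label, score in zip(actual, scores):
        predicted = 1 if score > threshold else 0
        counts[(label, predicted)] += 1
    return counts

actual = [1, 0, 1, 0, 1]
scores = [0.91, 0.20, 0.35, 0.60, 0.77]  # hypothetical endpoint outputs
counts = confusion_counts(actual, scores)
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / len(actual)
print(counts)
print(accuracy)  # 0.6
```

Moving the threshold away from 0.5 trades false positives for false negatives, which can matter more than raw accuracy when one kind of error is more costly.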

## Compiling

Amazon SageMaker Neo (https://aws.amazon.com/pt/sagemaker/neo/) optimizes models to run up to twice as fast, with no loss in accuracy. When calling the **compile_model()** function, we specify the target instance family as well as the S3 location where the compiled model will be stored.

In [None]:
compiled_target = 'rasp3b' # ml_c5, ml_m5, ml_c4, ml_m4, jetsontx1, jetsontx2, ml_p2, ml_p3, deeplens, rasp3b
try:
    xgb.create_model()._neo_image_account(boto3.Session().region_name)
except Exception:
    print('Neo is not currently supported in', boto3.Session().region_name)
else:
    # the model was trained on 4 features: sex, age, absences, final_grade
    compiled_model = xgb.compile_model(target_instance_family=compiled_target,
                                       input_shape={'data': [1, 4]},
                                       role=role,
                                       framework='xgboost',
                                       framework_version='0.90-1',
                                       output_path='{}/output-compiled'.format(bucket_path))
    compiled_model.name = 'deployed-xgboost-student-evasion'
    compiled_model.image = get_image_uri(sess.boto_region_name, 'xgboost-neo', repo_version='latest')

## Cleanup

In [None]:
# uncomment to delete the endpoint when you are done, so it stops incurring charges
# sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)