# Production model

Here we retrain a gradient-boosting classifier on AWS so that it can be integrated easily with a web-based API. We upload our training and testing data to an S3 bucket, train an XGBoost classifier using the optimal hyperparameters found during our exploratory data analysis, and test the XGBoost model's performance for consistency with the GBM trained previously.

In [1]:
import pandas as pd

DATA_DIR = '../data/'

## Read data

Below we read in the training and testing data.

In [2]:
train_data = pd.read_csv(DATA_DIR + 'train.csv')
test_data = pd.read_csv(DATA_DIR + 'test.csv')
print(train_data.head())
print(test_data.head())

     Length    Width    Size    Conc   Conc1     Asym   M3Long  M3Trans  \
0   24.6014  12.8664  2.3793  0.5637  0.2985  17.3748  19.5657   8.2758   
1   49.9223  22.3316  3.2930  0.2363  0.1210   6.7223  51.4013 -12.9949   
2   28.1635  15.8070  2.4200  0.4259  0.2490 -17.9803 -13.7842  -8.3405   
3  138.3880  44.5241  3.2860  0.0949  0.0674 -71.2351  66.0265 -48.3917   
4   80.7882  55.0349  3.4713  0.1449  0.0725  31.8274  70.9064  58.1480   

     Alpha      Dist class  
0  22.8210  216.9520     g  
1   6.8500  240.1190     g  
2  17.1252  248.0300     g  
3  45.0480  250.2907     h  
4  22.5290  277.3980     g  
    Length    Width    Size    Conc   Conc1     Asym   M3Long  M3Trans  \
0  21.0483  17.6112  2.4409  0.4529  0.2373  29.5910  -9.6241  14.2623   
1  20.9103  15.2311  2.4487  0.4484  0.2331  -0.7028  -7.7378  -7.4624   
2  24.0954  10.0126  2.2989  0.6030  0.3141  25.4443  17.9618 -10.6985   
3  38.6943  23.8422  2.7029  0.3211  0.1834  14.4893  -9.3039  22.6965   
4  69

## Recode target

Below we'll transform the `class` column so that g = 1 and h = 0.

In [3]:
train_data['class'] = (train_data['class'] == 'g').astype(int)
test_data['class'] = (test_data['class'] == 'g').astype(int)

## Format data for S3

Data must be formatted properly before it is uploaded to S3: the SageMaker XGBoost container expects CSV files with no header row, and the training data must have the target in the first column.

In [4]:
test_data[test_data.columns.drop('class')].to_csv(DATA_DIR + 'test_s3.csv', header=False, index=False)

s3_column_order = ['class'] + train_data.columns.drop('class').to_list()
train_data[s3_column_order].to_csv(DATA_DIR + 'train_s3.csv', header=False, index=False)
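
Before uploading, it can be worth checking that the exported text really has the shape described above. The following is a minimal sketch using only the standard library; `check_training_csv` and the toy strings are hypothetical and not part of the original notebook, and the check assumes a binary 0/1 label like the recoded `class` column.

```python
import csv
import io

def check_training_csv(text, n_features):
    """Return a list of problems found in headerless training CSV text.

    Each row is expected to hold the 0/1 label first, then n_features numbers.
    """
    problems = []
    for line_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != n_features + 1:
            problems.append(
                f"line {line_number}: expected {n_features + 1} fields, got {len(row)}")
            continue
        try:
            values = [float(field) for field in row]
        except ValueError:
            # A header row is the most likely cause of a non-numeric field.
            problems.append(f"line {line_number}: non-numeric field (possibly a header row)")
            continue
        if values[0] not in (0.0, 1.0):
            problems.append(f"line {line_number}: first column is not a 0/1 label")
    return problems

# Toy data (made up for illustration): label first, no header.
good = "1,24.6014,12.8664\n0,138.388,44.5241\n"
# Same kind of data with a header row and the label in the last column.
bad = "Length,Width,class\n24.6014,12.8664,1\n"

good_problems = check_training_csv(good, n_features=2)
bad_problems = check_training_csv(bad, n_features=2)
print(good_problems)
print(bad_problems)
```

For the real file, the text would come from reading `train_s3.csv` and `n_features` would be 10.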

## Upload data to S3

In [5]:
import sagemaker

session = sagemaker.Session() # Store the current SageMaker session

# S3 prefix (which folder will we use)
prefix = 'magic-gamma'

test_location = session.upload_data(DATA_DIR + 'test_s3.csv', key_prefix=prefix)
train_location = session.upload_data(DATA_DIR + 'train_s3.csv', key_prefix=prefix)

## Initialize XGBoost model

In [6]:
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

role = get_execution_role()
container = get_image_uri(session.boto_region_name, 'xgboost', repo_version='0.90-1')

xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

xgb.set_hyperparameters(max_depth=5,                   # Note: hyperparameters determined from
                        eta=0.1,                       # model trained using sklearn's
                        silent=0,                      # GradientBoostingClassifier
                        objective='binary:logistic',
                        num_round=300)
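
The comments above say the hyperparameters were carried over from a scikit-learn `GradientBoostingClassifier`. As a rough sketch of that translation, the mapping below pairs scikit-learn argument names with their XGBoost counterparts. The helper `translate_hyperparameters` and the `best_sklearn_params` dictionary are hypothetical; the values are simply the ones used in the cell above, and the original search results are not shown in this notebook. The two libraries implement boosting differently, so equal settings should give similar, not identical, models.

```python
# Assumed correspondence between scikit-learn GradientBoostingClassifier
# arguments and XGBoost hyperparameter names.
SKLEARN_TO_XGBOOST = {
    'n_estimators': 'num_round',   # number of boosting rounds (trees)
    'learning_rate': 'eta',        # shrinkage applied to each tree
    'max_depth': 'max_depth',      # maximum depth of each tree
}

def translate_hyperparameters(sklearn_params):
    """Rename scikit-learn GBM arguments to XGBoost hyperparameter names."""
    unknown = sorted(set(sklearn_params) - set(SKLEARN_TO_XGBOOST))
    if unknown:
        raise KeyError(f"no known XGBoost equivalent for: {unknown}")
    return {SKLEARN_TO_XGBOOST[name]: value
            for name, value in sklearn_params.items()}

# Hypothetical search result, chosen to match the values used above.
best_sklearn_params = {'n_estimators': 300, 'learning_rate': 0.1, 'max_depth': 5}
xgb_params = translate_hyperparameters(best_sklearn_params)
print(xgb_params)
```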

## Train model

In [7]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
xgb.fit({'train': s3_input_train})

2019-09-25 19:17:40 Starting - Starting the training job...
2019-09-25 19:17:42 Starting - Launching requested ML instances...
2019-09-25 19:18:39 Starting - Preparing the instances for training......
2019-09-25 19:19:32 Downloading - Downloading input data...
2019-09-25 19:20:05 Training - Training image download completed. Training in progress..
2019-09-25 19:20:07,798 sagemaker-containers INFO     Imported framework sagemaker_xgboost_container.training
2019-09-25 19:20:07,799 sagemaker-containers INFO     Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
2019-09-25 19:20:07,803 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-09-25 19:20:07,817 sagemaker_xgboost_container.training INFO     Running XGBoost Sagemaker in algorithm mode
2019-09-25 19:20:07,820 root         INFO     Determined delimiter of CSV input is ','
2019-09-25 19:20:07,820 roo


2019-09-25 19:20:22 Uploading - Uploading generated training model
2019-09-25 19:20:22 Completed - Training job completed
Training seconds: 50
Billable seconds: 50


## Evaluate model

As demonstrated below, the model trained here with AWS SageMaker achieves an AUC of 0.94, which is consistent with the model developed on a laptop using scikit-learn's GradientBoostingClassifier.

In [8]:
xgb_transformer = xgb.transformer(instance_count = 1, instance_type = 'ml.m4.xlarge')

In [9]:
xgb_transformer.transform(test_location, content_type='text/csv', split_type='Line')

In [10]:
xgb_transformer.wait()

......................
[2019-09-25 19:24:10 +0000] [17] [INFO] Starting gunicorn 19.9.0
[2019-09-25 19:24:10 +0000] [17] [INFO] Listening at: unix:/tmp/gunicorn.sock (17)
[2019-09-25 19:24:10 +0000] [17] [INFO] Using worker: gevent
[2019-09-25 19:24:10 +0000] [24] [INFO] Booting worker with pid: 24
[2019-09-25 19:24:10 +0000] [25] [INFO] Booting worker with pid: 25
[2019-09-25 19:24:10 +0000] [26] [INFO] Booting worker with pid: 26
[2019-09-25 19:24:10 +0000] [27] [INFO] Booting worker with pid: 27
[2019-09-25:19:24:31:INFO] No GPUs detected (normal if no gpus installed)
169.254.255.130 - - [25/Sep/2019:19:24:31 +0000] "GET /ping HTTP/1.1" 200 0 "-" "Go-http-client/1.1"
169.254.255.130 - - [25/Sep/2019:19:24:31 +0000] "GET /execution-parameters HTTP/1.1" 200 84 "-" "Go-http-client/1.1"
[2019-09-25:19:24:31:INFO] Determined delimiter of CSV input is ','
169.254.255.130 - - [25/Sep/2019

In [11]:
!aws s3 cp --recursive $xgb_transformer.output_path $DATA_DIR

Completed 72.0 KiB/72.0 KiB (1.1 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-east-2-053461515887/sagemaker-xgboost-2019-09-25-19-20-53-249/test_s3.csv.out to ../data/test_s3.csv.out


In [12]:
from sklearn.metrics import roc_auc_score
predictions = pd.read_csv(DATA_DIR + 'test_s3.csv.out', header=None)
roc_auc_score(test_data['class'], predictions)

0.9401347149632211
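
To make the AUC figure concrete: it can be read as the probability that a randomly chosen gamma event receives a higher score than a randomly chosen hadron event. The sketch below computes that quantity directly by comparing every positive/negative pair, with ties counted as half. The function `roc_auc` and the toy labels and scores are made up for illustration; `roc_auc_score` is the better choice for real data, since this pairwise loop is quadratic in the number of events.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy example: 3 gamma events (1) and 2 hadron events (0).
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_auc = roc_auc(labels, scores)
print(toy_auc)
```

In the toy example, five of the six positive/negative pairs are ordered correctly, so the result is 5/6.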

## Processing a single event

Below, we create a JSON string in the format that we intend to POST to our model API. We'll need to set up a Lambda function to transform that string into the comma-separated payload that the model endpoint consumes.

In [32]:
test_event = test_data.loc[0, test_data.columns != 'class'].to_json()
print(test_event)
print()
print(test_data.loc[9, test_data.columns != 'class'].to_json())

{"Length":21.0483,"Width":17.6112,"Size":2.4409,"Conc":0.4529,"Conc1":0.2373,"Asym":29.591,"M3Long":-9.6241,"M3Trans":14.2623,"Alpha":32.4415,"Dist":173.549}

{"Length":154.583,"Width":46.0362,"Size":3.0382,"Conc":0.2253,"Conc1":0.1204,"Asym":78.1719,"M3Long":169.773,"M3Trans":11.9243,"Alpha":56.935,"Dist":200.131}


In [33]:
import json

def json_to_numpy(event):
    """
    param: event, a json string that includes key value pairs of the features associated with the MAGIC data set
    returns: comma separated string of values in appropriate order to be consumed by the model endpoint
    """
    
    e = json.loads(event)
    column_order = ['Length', 'Width', 'Size', 'Conc', 'Conc1', 'Asym', 'M3Long', 'M3Trans', 'Alpha', 'Dist']
    return ','.join([str(e[column]) for column in column_order]).encode('utf-8')

json_to_numpy(test_event)

b'21.0483,17.6112,2.4409,0.4529,0.2373,29.591,-9.6241,14.2623,32.4415,173.549'
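
Because the endpoint receives bare numbers with no column names, a missing or misspelled key in the JSON would otherwise surface only as a `KeyError` or a bad prediction. The variant below is a sketch of a stricter conversion; `event_to_csv` and its error messages are hypothetical additions, not part of the original notebook. It also shows that the order of keys in the incoming JSON does not matter, since the output order is fixed by `COLUMN_ORDER`.

```python
import json
import math

COLUMN_ORDER = ['Length', 'Width', 'Size', 'Conc', 'Conc1', 'Asym',
                'M3Long', 'M3Trans', 'Alpha', 'Dist']

def event_to_csv(event):
    """Convert a JSON event string to a CSV payload, validating it first."""
    e = json.loads(event)
    missing = [column for column in COLUMN_ORDER if column not in e]
    if missing:
        raise ValueError(f"missing features: {missing}")

    values = [float(e[column]) for column in COLUMN_ORDER]
    if not all(math.isfinite(value) for value in values):
        raise ValueError("all features must be finite numbers")
    return ','.join(str(value) for value in values).encode('utf-8')

event = ('{"Length":21.0483,"Width":17.6112,"Size":2.4409,"Conc":0.4529,'
         '"Conc1":0.2373,"Asym":29.591,"M3Long":-9.6241,"M3Trans":14.2623,'
         '"Alpha":32.4415,"Dist":173.549}')
payload = event_to_csv(event)

# Same event with the keys in reverse order gives the same payload.
reversed_event = json.dumps(dict(reversed(list(json.loads(event).items()))))
reversed_payload = event_to_csv(reversed_event)
print(payload)
```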

In [34]:
xgb_predictor = xgb.deploy(initial_instance_count = 1, instance_type = 'ml.m4.xlarge')

Using already existing model: sagemaker-xgboost-2019-09-25-19-17-40-267


ClientError: An error occurred (ValidationException) when calling the CreateEndpoint operation: Cannot create already existing endpoint "arn:aws:sagemaker:us-east-2:053461515887:endpoint/sagemaker-xgboost-2019-09-25-19-17-40-267".

In [35]:
import boto3

runtime = boto3.Session().client('sagemaker-runtime')

In [36]:
xgb_predictor.endpoint

'sagemaker-xgboost-2019-09-25-19-17-40-267'

In [37]:
response = runtime.invoke_endpoint(EndpointName = xgb_predictor.endpoint, # The name of the endpoint we created
                                       ContentType = 'text/csv',                     # The data format that is expected
                                       Body = json_to_numpy(test_event))

In [38]:
print(response)

{'ResponseMetadata': {'RequestId': 'b57bb8a5-faed-4d54-9ffa-cda35a7eb161', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'b57bb8a5-faed-4d54-9ffa-cda35a7eb161', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Wed, 25 Sep 2019 23:48:18 GMT', 'content-type': 'text/csv; charset=utf-8', 'content-length': '18'}, 'RetryAttempts': 0}, 'ContentType': 'text/csv; charset=utf-8', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fdd6e1f6fd0>}


In [39]:
response = response['Body'].read().decode('utf-8')
print(response)

0.9058170914649963


## Define Lambda function

Below we define the Lambda function, which is responsible for interfacing with our model endpoint on behalf of our API. The API expects data to be passed as a JSON string formatted as follows:  
`'{"Length":21.0483,"Width":17.6112,"Size":2.4409,"Conc":0.4529,"Conc1":0.2373,"Asym":29.591,"M3Long":-9.6241,
"M3Trans":14.2623,"Alpha":32.4415,"Dist":173.549}'`  
The Lambda function parses this string into the payload passed to our model endpoint. It then receives the response from the model and returns a value of either `g` (gamma ray) or `h` (hadron) depending on whether the model prediction is above or below the 0.5 prediction threshold.

```python
import boto3
import json

def json_to_numpy(event):
    """
    param: event, a json string that includes key value pairs of the features
                  associated with the MAGIC data set
    returns: comma separated string of values in appropriate order to be
             consumed by the model endpoint
    """
    
    e = json.loads(event)
    column_order = ['Length', 'Width', 'Size', 'Conc', 'Conc1', 'Asym',
                    'M3Long', 'M3Trans', 'Alpha', 'Dist']
    return ','.join([str(e[column]) for column in column_order]).encode('utf-8')

def lambda_handler(event, context):
    runtime = boto3.Session().client('sagemaker-runtime')
    
    response = runtime.invoke_endpoint(
        EndpointName = 'sagemaker-xgboost-2019-09-25-19-17-40-267',
        ContentType = 'text/csv',
        Body = json_to_numpy(event['body']))
    
    # The endpoint returns the score as text; convert it before comparing.
    if float(response['Body'].read().decode('utf-8')) > 0.5:
        result = 'g' # gamma ray
    else:
        result = 'h' # hadron
    
    return {
        'statusCode': 200,
        'headers' : {'Content-Type' : 'text/plain',
                     'Access-Control-Allow-Origin' : '*'},
        'body': result
    }
```
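
The handler can be exercised locally, without AWS credentials, by swapping the SageMaker runtime client for a stub. The sketch below is one way to do that under stated assumptions: `make_handler`, `FakeRuntime`, and the endpoint name `'example-endpoint'` are hypothetical, and the stub only mimics the one part of the `invoke_endpoint` response that the handler reads (a `Body` object with a `read()` method).

```python
import io
import json

COLUMN_ORDER = ['Length', 'Width', 'Size', 'Conc', 'Conc1', 'Asym',
                'M3Long', 'M3Trans', 'Alpha', 'Dist']

def json_to_csv(body):
    """Turn the JSON request body into the CSV bytes the endpoint expects."""
    e = json.loads(body)
    return ','.join(str(e[column]) for column in COLUMN_ORDER).encode('utf-8')

def make_handler(runtime, endpoint_name, threshold=0.5):
    """Build a Lambda-style handler around the given runtime client."""
    def handler(event, context):
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType='text/csv',
            Body=json_to_csv(event['body']))
        score = float(response['Body'].read().decode('utf-8'))
        result = 'g' if score > threshold else 'h'
        return {'statusCode': 200,
                'headers': {'Content-Type': 'text/plain',
                            'Access-Control-Allow-Origin': '*'},
                'body': result}
    return handler

class FakeRuntime:
    """Stand-in for the sagemaker-runtime client that returns a fixed score."""
    def __init__(self, score):
        self.score = score
        self.last_body = None

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        return {'Body': io.BytesIO(str(self.score).encode('utf-8'))}

event = {'body': json.dumps({column: 1.0 for column in COLUMN_ORDER})}
gamma_response = make_handler(FakeRuntime(0.9058), 'example-endpoint')(event, None)
hadron_response = make_handler(FakeRuntime(0.2), 'example-endpoint')(event, None)
print(gamma_response['body'], hadron_response['body'])
```

Passing the client in as an argument, rather than creating it inside the handler, is what makes the substitution possible; in the deployed function the real `boto3` client would be supplied instead.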