# Lab 3.5 - Student Notebook

## Overview

This lab is a continuation of the guided labs in Module 3. 

In this lab, you will deploy a trained model and perform a prediction against the model. You will then delete the endpoint and perform a batch transform on the test dataset.


## Introduction to the business scenario

You work for a healthcare provider and want to improve the detection of abnormalities in orthopedic patients.

You are tasked with solving this problem by using machine learning (ML). You have access to a dataset that contains six biomechanical features and a target of *normal* or *abnormal*. You can use this dataset to train an ML model to predict if a patient will have an abnormality.


## About this dataset

This biomedical dataset was built by Dr. Henrique da Mota during a medical residence period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The data has been organized in two different, but related, classification tasks. 

The first task consists of classifying patients as belonging to one of three categories:

- *Normal* (100 patients)
- *Disk Hernia* (60 patients)
- *Spondylolisthesis* (150 patients)

For the second task, the categories *Disk Hernia* and *Spondylolisthesis* were merged into a single category that is labeled *Abnormal*. Thus, the second task consists of classifying patients as belonging to one of two categories: *Normal* (100 patients) or *Abnormal* (210 patients).
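The merge described above can be checked with a few lines of Python. This is a minimal sketch that uses only the patient counts stated in this section. The names `three_class_counts` and `merge_map` are illustrative and do not come from the dataset files.

```python
from collections import Counter

# Patient counts for the three-class task, as stated in this section.
three_class_counts = {"Normal": 100, "Disk Hernia": 60, "Spondylolisthesis": 150}

# Illustrative mapping that folds both pathological classes into "Abnormal".
merge_map = {
    "Normal": "Normal",
    "Disk Hernia": "Abnormal",
    "Spondylolisthesis": "Abnormal",
}

# Accumulate the counts under the merged labels.
two_class_counts = Counter()
for label, count in three_class_counts.items():
    two_class_counts[merge_map[label]] += count

print(dict(two_class_counts))
```

The result shows roughly two abnormal patients for every normal one, which is worth keeping in mind when you later choose a decision threshold.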


## Attribute information

Each patient is represented in the dataset by six biomechanical attributes that are derived from the shape and orientation of the pelvis and lumbar spine (in this order): 

- Pelvic incidence
- Pelvic tilt
- Lumbar lordosis angle
- Sacral slope
- Pelvic radius
- Grade of spondylolisthesis

The following convention is used for the class labels:

- Disk Hernia (DH)
- Spondylolisthesis (SL)
- Normal (NO)
- Abnormal (AB)

For more information about this dataset, see the [Vertebral Column dataset webpage](http://archive.ics.uci.edu/ml/datasets/Vertebral+Column).


## Dataset attributions

This dataset was obtained from:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.


# Lab setup

Because this solution is split across several labs in the module, you must run the following cells to load the data and train the model that you will deploy.

**Note:** The setup can take up to 5 minutes to complete.

## Importing the data

Run the following cells to import the data and prepare it for use.

**Note:** The following cells represent the key steps in the previous labs.


In [1]:
bucket='c94466a2114432l5130264t1w295847703765-labbucket-1s9m53vwziobx'

In [2]:
import warnings, requests, zipfile, io
warnings.simplefilter('ignore')
import pandas as pd
from scipy.io import arff

import os
import boto3
import sagemaker
from sagemaker.image_uris import retrieve
from sklearn.model_selection import train_test_split

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [4]:
import gdown
# Load the dataset from Google Drive
#https://drive.google.com/file/d/1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE/view?usp=sharing
#https://drive.google.com/file/d/1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x/view?usp=drive_link

# Define a dictionary of file names and their corresponding file IDs
file_ids = {
    'combined_csv_v1.csv': '1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE',
    'combined_csv_v2.csv': '1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x',
}

# Define the destination folder where you want to save the files
destination_folder = './'

# Download the files
for file_name, file_id in file_ids.items():
    url = f'https://drive.google.com/uc?id={file_id}'
    output = os.path.join(destination_folder, file_name)  # avoids the doubled slash in './/<file>'
    gdown.download(url, output, quiet=False)

print('Files downloaded successfully.')


Downloading...
From (uriginal): https://drive.google.com/uc?id=1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE
From (redirected): https://drive.google.com/uc?id=1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE&confirm=t&uuid=f50f4517-7c70-408e-ab23-03a389f01936
To: /home/ec2-user/SageMaker/en_us/combined_csv_v1.csv
100%|██████████| 318M/318M [00:03<00:00, 96.3MB/s] 
Downloading...
From (uriginal): https://drive.google.com/uc?id=1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x
From (redirected): https://drive.google.com/uc?id=1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x&confirm=t&uuid=bc28147c-7f06-4574-aeb4-9535fe9c0b68
To: /home/ec2-user/SageMaker/en_us/combined_csv_v2.csv
100%|██████████| 384M/384M [00:03<00:00, 103MB/s]  

Files downloaded successfully.





In [14]:
def read_optimized_csv(filename, target_column='target'):
    # Read a small sample to infer data types
    small_sample = pd.read_csv(filename, nrows=1000)

    # Identify columns to be converted to bool (binary columns)
    bool_columns = [col for col in small_sample.columns if col not in [target_column, 'Distance'] and small_sample[col].nunique() == 2]

    # Create a dictionary with specified data types
    column_types = {col: 'bool' for col in bool_columns}
    column_types['Distance'] = 'float32'
    column_types[target_column] = 'float32'

    # Read the full CSV with optimized data types
    df = pd.read_csv(filename, dtype=column_types)

    return df


# Load each file and keep a random 50% sample of its rows
data_v1 = pd.read_csv("combined_csv_v1.csv").sample(frac=0.5)
data_v2 = pd.read_csv("combined_csv_v2.csv").sample(frac=0.5)

In [15]:
train, test_and_validate = train_test_split(data_v1, test_size=0.2, random_state=42, stratify=data_v1['target'])
test, validate = train_test_split(test_and_validate, test_size=0.5, random_state=42, stratify=test_and_validate['target'])

prefix='lab3'

train_file='vertebral_train.csv'
test_file='vertebral_test.csv'
validate_file='vertebral_validate.csv'

s3_resource = boto3.Session().resource('s3')
def upload_s3_csv(filename, folder, dataframe):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False )
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())

upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

hyperparams={"num_round":"42",
             "eval_metric": "auc",
             "objective": "binary:logistic"}

s3_output_location="s3://{}/{}/output/".format(bucket,prefix)
xgb_model=sagemaker.estimator.Estimator(container,
                                        sagemaker.get_execution_role(),
                                        instance_count=1,
                                        instance_type='ml.m4.xlarge',
                                        output_path=s3_output_location,
                                        hyperparameters=hyperparams,
                                        sagemaker_session=sagemaker.Session())

# Each channel points at an S3 prefix (folder), so the file name is not part of the URI
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket,prefix),
    content_type='text/csv')

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket,prefix),
    content_type='text/csv')

data_channels = {'train': train_channel, 'validation': validate_channel}

xgb_model.fit(inputs=data_channels, logs=False)

print('ready for hosting!')

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker.image_uris:Defaulting to only available Python version: py3
INFO:sagemaker.image_uris:Defaulting to only supported image scope: cpu.


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2023-11-01-21-42-39-081



2023-11-01 21:42:39 Starting - Starting the training job......
2023-11-01 21:43:15 Starting - Preparing the instances for training............
2023-11-01 21:44:24 Downloading - Downloading input data......
2023-11-01 21:44:59 Training - Downloading the training image.......
2023-11-01 21:45:40 Training - Training image download completed. Training in progress................
2023-11-01 21:47:01 Uploading - Uploading generated training model...
2023-11-01 21:47:17 Completed - Training job completed
ready for hosting!


# Step 1: Hosting the model

Now that you have a trained model, you can host it by using Amazon SageMaker hosting services.

The first step is to deploy the model. Because you have a model object, *xgb_model*, you can use the **deploy** method. For this lab, you will use a single ml.m4.xlarge instance.



In [None]:
xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                serializer = sagemaker.serializers.CSVSerializer(),
                instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-xgboost-2023-11-01-21-47-21-993
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2023-11-01-21-47-21-993
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2023-11-01-21-47-21-993


-

# Step 2: Performing predictions

Now that you have a deployed model, you will run some predictions.

First, review the test data and re-familiarize yourself with it.

In [None]:
test.shape

You have 31 instances, with seven attributes. The first five instances are:

In [None]:
test.head(5)

You don't need to include the target value (class). This predictor can take data in the comma-separated values (CSV) format. You can thus get the first row *without the class column* by using the following code:

`test.iloc[:1,1:]` 

The **iloc** indexer selects by integer position and takes selections in the form [*rows*, *cols*].

To get only the first row, use `0:1`. To get the second row, use `1:2`.

To get all columns *except* the first column (*col 0*), use `1:`.
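The slicing rules above can be tried on a small, made-up DataFrame before you apply them to the test set. This is a sketch only: the `demo` frame and its column names are hypothetical and are not the lab data.

```python
import pandas as pd

# Hypothetical three-row frame: column 0 is the target, the rest are features.
demo = pd.DataFrame(
    {
        "class": [1, 0, 1],
        "feature_a": [63.0, 39.1, 68.8],
        "feature_b": [22.6, 10.1, 22.2],
    }
)

first_row_features = demo.iloc[0:1, 1:]   # row 0 only, every column except column 0
second_row_features = demo.iloc[1:2, 1:]  # row 1 only, every column except column 0
all_features = demo.iloc[:, 1:]           # every row, without the target column

print(first_row_features.shape)
print(list(all_features.columns))
```

Using a slice such as `0:1` rather than the single integer `0` keeps the result a DataFrame, which is what `to_csv` needs to write the values as one CSV row.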



In [None]:
row = test.iloc[0:1,1:]
row.head()

You can convert this row to comma-separated values (CSV) format and store it in a string buffer.

In [None]:
batch_X_csv_buffer = io.StringIO()
row.to_csv(batch_X_csv_buffer, header=False, index=False)
test_row = batch_X_csv_buffer.getvalue()
print(test_row)

Now, you can use the data to perform a prediction.

In [None]:
xgb_predictor.predict(test_row)

The result you get isn't a *0* or a *1*. Instead, you get a *probability score*. You can apply some conditional logic to the probability score to determine if the answer should be presented as a 0 or a 1. You will work with this process when you do batch predictions.
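As a rough illustration of that conditional logic, the following sketch decodes a response and applies a threshold. It assumes the endpoint returns the score as bytes that contain a single number. The helper name `to_label`, the example response, and the default threshold of 0.65 (the value used in the batch step of this lab) are illustrative choices, not output from the deployed model.

```python
def to_label(raw_response, threshold=0.65):
    """Decode a single-score response and map it to 0 or 1."""
    score = float(raw_response.decode("utf-8").strip())
    return score, int(score > threshold)


# Hypothetical response; a real endpoint will return a different value.
example_response = b"0.9132"
score, label = to_label(example_response)
print(score, label)
```

If your predictor is configured with a different deserializer, the response might already be a string or a number, and the decoding step would need to change accordingly.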

For now, compare the result with the test data.

In [None]:
test.head(5)

**Question:** Is the prediction accurate?

**Challenge task:** Update the previous code to send the second row of the dataset. Are those predictions correct? Try this task with a few other rows.

It can be tedious to send these rows one at a time. You could write a function to submit these values in a batch, but SageMaker already has a batch capability. You will examine that feature next. However, before you do, you will delete the endpoint that hosts the model.

# Step 3: Terminating the deployed model

To delete the endpoint, call the **delete_endpoint** method on the predictor.

In [None]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

# Step 4: Performing a batch transform

When you are in the training-testing-feature engineering cycle, you want to test your holdout or test sets against the model. You can then use those results to calculate metrics. You could deploy an endpoint as you did earlier, but then you must remember to delete the endpoint. However, there is a more efficient way.

You can use the **transformer** method of the model to get a transformer object. You can then use the **transform** method of this object to perform a prediction on the entire test dataset. SageMaker will: 

- Spin up an instance with the model
- Perform a prediction on all the input values
- Write those values to Amazon Simple Storage Service (Amazon S3) 
- Finally, terminate the instance

You will start by turning your data into a CSV file that the transformer object can take as input. This time, you will use **iloc** to get all the rows, and all columns *except* the first column.


In [None]:
batch_X = test.iloc[:,1:]
batch_X.head()

Next, write your data to a CSV file in Amazon S3.

In [None]:
batch_X_file='batch-in.csv'
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

Last, before you perform a transform, configure your transformer with the input file, output location, and instance type.

In [None]:
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

xgb_transformer = xgb_model.transformer(instance_count=1,
                                       instance_type='ml.m4.xlarge',
                                       strategy='MultiRecord',
                                       assemble_with='Line',
                                       output_path=batch_output)

xgb_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')
xgb_transformer.wait()

After the transform completes, you can download the results from Amazon S3 and compare them with the input.

First, download the output from Amazon S3 and load it into a pandas DataFrame.
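Batch transform names each result object after its input file and appends `.out`, which is why the key in the next cell ends in `batch-in.csv.out`. The following sketch builds that key. The helper name `transform_output_key` is hypothetical, and the folder names are the ones used in this lab.

```python
import posixpath


def transform_output_key(prefix, input_file, output_folder="batch-out"):
    """Build the S3 key of the batch transform result for one input file."""
    # S3 keys always use forward slashes, so posixpath is used instead of os.path.
    return posixpath.join(prefix, output_folder, input_file + ".out")


print(transform_output_key("lab3", "batch-in.csv"))
```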


In [None]:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),sep=',',names=['class'])
target_predicted.head(5)

You can use a function to convert the probability into either a *0* or a *1*.

The first table in the output shows the *predicted values*, and the second table shows the *original test data*.

In [None]:
def binary_convert(x):
    threshold = 0.65
    if x > threshold:
        return 1
    else:
        return 0

target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test.head(10)

**Note:** The *threshold* in the **binary_convert** function is set to *0.65*.

**Challenge task:** Experiment with changing the value of the threshold. Does it impact the results?
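One way to approach this experiment is to sweep several thresholds and compare the predictions with the known labels. The sketch below uses made-up scores and labels purely for illustration. In the lab you would substitute the `class` column of `target_predicted` and the target column of `test`.

```python
# Hypothetical probability scores and true labels, for illustration only.
scores = [0.12, 0.48, 0.55, 0.66, 0.71, 0.93]
labels = [0, 0, 1, 0, 1, 1]


def accuracy_at(threshold):
    """Return the fraction of correct predictions at the given threshold."""
    predicted = [1 if s > threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predicted, labels))
    return correct / len(labels)


# Compare a few candidate thresholds, including the 0.65 used in this lab.
for threshold in (0.3, 0.5, 0.65, 0.8):
    print(threshold, round(accuracy_at(threshold), 3))
```

Accuracy is only one way to compare thresholds. With imbalanced classes, a threshold that looks good on accuracy can still miss many cases of the class you care about.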

**Note:** The initial model might not be good. You will generate some metrics in the next lab, before you tune the model in the final lab.

# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.