# Breast Cancer Prediction (SageMaker V3)
_**Predict Breast Cancer using SageMaker's Linear-Learner with features derived from images of Breast Mass**_

---

This notebook has been migrated to use SageMaker Python SDK V3 interfaces.

---

## Contents

1. [Background](#Background)
1. [Setup](#Setup)
1. [Data](#Data)
1. [Train](#Train)
1. [Host](#Host)
1. [Predict](#Predict)
1. [Extensions](#Extensions)

---

## Background
This notebook illustrates how to use SageMaker's algorithms for applications that require `linear models` for prediction. As an example, we predict breast cancer using UCI's breast cancer diagnostic data set, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The goal is to build a predictive model of whether a breast mass image indicates a benign or malignant tumor. The data set will be used to illustrate:

* Basic setup for using SageMaker V3.
* Converting datasets to the CSV format accepted by the Amazon SageMaker Linear Learner algorithm and uploading them to S3.
* Training SageMaker's linear learner on the data set using ModelTrainer.
* Hosting the trained model using V3 resources.
* Scoring using the trained model.



---

## Setup

Let's start by specifying:

* The SageMaker role ARN used to give training and hosting access to your data.
* The S3 bucket that you want to use for training and storing model objects.

In [None]:
import os
import boto3
import re

# V3 imports
from sagemaker.core.helper.session_helper import Session, get_execution_role
from sagemaker.core import image_uris
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import InputData, Compute
from sagemaker.core.resources import Model, EndpointConfig, Endpoint

# Initialize V3 session
sagemaker_session = Session()
role = get_execution_role()
region = sagemaker_session.boto_region_name

# S3 bucket for saving code and model artifacts.
bucket = sagemaker_session.default_bucket()

prefix = 'sagemaker/DEMO-breast-cancer-prediction-v3' # place to upload training files within the bucket

Now we'll import the Python libraries we'll need.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import json

---
## Data

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
        https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's download the data, save it locally as data.csv, and take a look at it.

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)

# specify columns extracted from wdbc.names
data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst"] 

# save the data
data.to_csv("data.csv", sep=',', index=False)

# print the shape of the data file
print(data.shape)

# show the top few rows
display(data.head())

# describe the data object
display(data.describe())

# we will also summarize the categorical field diagnosis
display(data.diagnosis.value_counts())


#### Key observations:
* Data has 569 observations and 32 columns.
* First field is 'id'.
* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).
* There are 30 other numeric features available for prediction.

## Create Features and Labels
#### Split the data into 80% training, 10% validation and 10% testing.

In [None]:
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
test_list = rand_split >= 0.9

data_train = data[train_list]
data_val = data[val_list]
data_test = data[test_list]

# Encode the diagnosis as 1 (malignant) / 0 (benign); columns 2 onward are the 30 features
train_y = (data_train.iloc[:, 1] == 'M').astype(int).to_numpy()
train_X = data_train.iloc[:, 2:].to_numpy()

val_y = (data_val.iloc[:, 1] == 'M').astype(int).to_numpy()
val_X = data_val.iloc[:, 2:].to_numpy()

test_y = (data_test.iloc[:, 1] == 'M').astype(int).to_numpy()
test_X = data_test.iloc[:, 2:].to_numpy()
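The split above draws fresh random numbers on every run, so the three subsets change between executions. A minimal sketch of the same masking approach with a seeded generator follows. The helper name `split_masks` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def split_masks(n_rows, seed=42, train_frac=0.8, val_frac=0.1):
    """Return boolean masks (train, val, test) that partition n_rows rows."""
    rng = np.random.default_rng(seed)  # seeded so the split is repeatable
    draws = rng.random(n_rows)         # one uniform draw in [0, 1) per row
    train = draws < train_frac
    val = (draws >= train_frac) & (draws < train_frac + val_frac)
    test = draws >= train_frac + val_frac
    return train, val, test

# 569 is the number of observations in the data set
train_mask, val_mask, test_mask = split_masks(569)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

The masks can be used exactly like `train_list`, `val_list` and `test_list` above, for example `data[train_mask]`.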

Now, we'll convert the datasets to CSV format and upload to S3. Linear Learner expects the label in the first column.

In [None]:
# Training data - label in first column
train_df = pd.DataFrame(train_X)
train_df.insert(0, 'label', train_y)
train_file = 'linear_train.csv'
train_df.to_csv(train_file, header=False, index=False)

# Upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'train', train_file)
).upload_file(train_file)

train_s3_uri = f"s3://{bucket}/{prefix}/train/{train_file}"
print(f"Training data uploaded to: {train_s3_uri}")

Next we'll convert and upload the validation dataset.

In [None]:
# Validation data - label in first column
val_df = pd.DataFrame(val_X)
val_df.insert(0, 'label', val_y)
validation_file = 'linear_validation.csv'
val_df.to_csv(validation_file, header=False, index=False)

# Upload to S3
boto3.Session().resource('s3').Bucket(bucket).Object(
    os.path.join(prefix, 'validation', validation_file)
).upload_file(validation_file)

validation_s3_uri = f"s3://{bucket}/{prefix}/validation/{validation_file}"
print(f"Validation data uploaded to: {validation_s3_uri}")
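The label-first CSV layout used for both channels can be checked locally before uploading. The sketch below writes a tiny made-up frame to an in-memory buffer instead of a file; the feature values are placeholders, not rows from the data set.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-ins for train_X / train_y: 3 rows, 2 features (values are made up)
X = np.array([[17.99, 10.38], [20.57, 17.77], [13.54, 14.36]])
y = np.array([1, 1, 0])

df = pd.DataFrame(X)
df.insert(0, "label", y)  # the label becomes the first column

buffer = io.StringIO()
df.to_csv(buffer, header=False, index=False)  # no header row, no index column
csv_text = buffer.getvalue()
print(csv_text)

# Each line should start with the label, followed by the features
first_row = csv_text.splitlines()[0].split(",")
```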

---
## Train

Now we can begin to specify our linear model using SageMaker V3's ModelTrainer.  Amazon SageMaker's Linear Learner actually fits many models in parallel, each with slightly different hyperparameters, and then returns the one with the best fit.  This functionality is automatically enabled.  We can influence this using parameters like:

- `num_models` to increase the total number of models run. The specified parameters will always be one of those models, but the algorithm also chooses models with nearby parameter values in order to find a nearby solution that may be more optimal. In this case, we're going to use the max of 32.
- `loss` which controls how we penalize mistakes in our model estimates. For this case, let's use absolute loss as we haven't spent much time cleaning the data, and absolute loss will be less sensitive to outliers.
- `wd` or `l1` which control regularization. Regularization can prevent model overfitting by preventing our estimates from becoming too finely tuned to the training data, which can actually hurt generalizability. In this case, we'll leave these parameters at their default "auto" though.
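The claim that absolute loss is less sensitive to outliers can be illustrated with a small numeric sketch. It is not how Linear Learner works internally; it only compares which constant prediction minimizes each loss on a made-up sample containing one outlier.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is a made-up outlier

# Grid of candidate constant predictions, spaced 0.01 apart
candidates = np.linspace(0, 100, 10001)

# Mean loss of every candidate against every value
abs_loss = np.abs(values[:, None] - candidates[None, :]).mean(axis=0)
sq_loss = ((values[:, None] - candidates[None, :]) ** 2).mean(axis=0)

best_abs = candidates[abs_loss.argmin()]  # lands near the median of the sample
best_sq = candidates[sq_loss.argmin()]    # lands near the mean of the sample
print("absolute-loss minimizer:", best_abs)
print("squared-loss minimizer:", best_sq)
print("median:", np.median(values), "mean:", values.mean())
```

The outlier drags the squared-loss minimizer toward it, while the absolute-loss minimizer stays with the bulk of the data.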

### Get the container image for linear-learner using `image_uris.retrieve`

In [None]:
# V3: Use image_uris.retrieve instead of get_image_uri
container = image_uris.retrieve(
    framework='linear-learner',
    region=region
)
print(f"Using container: {container}")

### Create and train model using ModelTrainer

In [None]:
%%time

# V3: Use ModelTrainer instead of boto3 create_training_job
trainer = ModelTrainer(
    training_image=container,
    role=role,
    compute=Compute(
        instance_count=1,
        instance_type="ml.c4.2xlarge",
        volume_size_in_gb=10
    ),
    hyperparameters={
        "feature_dim": "30",
        "mini_batch_size": "100",
        "predictor_type": "regressor",
        "epochs": "10",
        "num_models": "32",
        "loss": "absolute_loss"
    },
    sagemaker_session=sagemaker_session
)

# Train the model
training_job = trainer.train(
    input_data_config=[
        InputData(
            channel_name="train",
            data_source=train_s3_uri,
            content_type="text/csv"
        ),
        InputData(
            channel_name="validation",
            data_source=validation_s3_uri,
            content_type="text/csv"
        )
    ],
    wait=True,
    logs=True
)

# Get the training job from the trainer
training_job = trainer._latest_training_job
print(f"Training job completed: {training_job.training_job_name}")
print(f"Model artifacts: {training_job.model_artifacts.s3_model_artifacts}")

---
## Host

Now that we've trained the linear algorithm on our data, let's set up a model which can later be hosted using V3 resources.

In [None]:
# V3: Create Model using resources
model_name = f"breast-cancer-model-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"

model = Model.create(
    model_name=model_name,
    execution_role_arn=role,
    primary_container={
        'image': container,
        'model_data_url': training_job.model_artifacts.s3_model_artifacts
    },
    session=sagemaker_session.boto_session
)

print(f"Model created: {model.model_name}")

### Create Endpoint Configuration

In [None]:
# V3: Create EndpointConfig
endpoint_config_name = f"breast-cancer-config-{time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())}"

endpoint_config = EndpointConfig.create(
    endpoint_config_name=endpoint_config_name,
    production_variants=[{
        'variant_name': 'AllTraffic',
        'model_name': model.model_name,
        'instance_type': 'ml.m4.xlarge',
        'initial_instance_count': 1
    }],
    session=sagemaker_session.boto_session
)

print(f"Endpoint config created: {endpoint_config.endpoint_config_name}")

### Create and Deploy Endpoint

In [None]:
%%time

# V3: Create Endpoint
endpoint_name = f"breast-cancer-endpoint-{time.strftime('%Y%m%d%H%M', time.gmtime())}"

endpoint = Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=endpoint_config.endpoint_config_name,
    session=sagemaker_session.boto_session
)

print(f"Endpoint created: {endpoint.endpoint_name}")
print("Waiting for endpoint to be in service...")

# Wait for endpoint to be ready
endpoint.wait_for_status('InService')

print(f"Endpoint is ready: {endpoint.endpoint_status}")

## Predict
### Predict on Test Data

Now that we have our hosted endpoint, we can generate statistical predictions from it.  Let's predict on our test dataset to understand how accurate our model is.

There are many metrics for measuring classification performance. Common examples include:
- Precision
- Recall
- F1 measure
- Area under the ROC curve - AUC
- Total Classification Accuracy 
- Mean Absolute Error

For our example, we'll keep things simple and use total classification accuracy as our metric of choice. We will also evaluate Mean Absolute Error (MAE), because the linear-learner has been optimized using this metric, not necessarily because it is a relevant metric from an application point of view. We'll compare the performance of the linear-learner against a naive benchmark that predicts the majority class observed in the training data set for every test instance.
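For readers who want the other metrics listed above, here is a minimal sketch that computes precision, recall, F1 and accuracy from 0/1 label arrays using NumPy only. The function name `classification_metrics` and the toy labels are illustrative assumptions; the notebook itself only reports accuracy and MAE.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Compute basic binary classification metrics for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Made-up labels: 1 = malignant, 0 = benign
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```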




### Function to convert an array to a csv

In [None]:
import io
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()
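To see what the helper produces, the sketch below applies it to a tiny made-up array. The function is repeated so the snippet runs on its own.

```python
import io
import numpy as np

def np2csv(arr):
    # Same helper as above, repeated so this snippet is self-contained
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=',', fmt='%g')
    return csv.getvalue().decode().rstrip()

# Two made-up rows with three features each
sample = np.array([[17.99, 10.38, 122.8], [20.57, 17.77, 132.9]])
payload = np2csv(sample)
print(payload)
```

Each row of the array becomes one comma-separated line with no header and no label column. Note that the `%g` format keeps at most six significant digits, so values with more precision are rounded in the payload.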

Next, we'll invoke the endpoint to get predictions using V3.

In [None]:
# V3: Use endpoint.invoke instead of runtime.sagemaker
payload = np2csv(test_X)

response = endpoint.invoke(
    body=payload,
    content_type='text/csv'
)

result = json.loads(response.body.read().decode())
test_pred = np.array([r['score'] for r in result['predictions']])
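The parsing step can be tried without a live endpoint. The sketch below uses a hypothetical response body with the same shape the cell above expects (a `predictions` list whose entries carry a `score`); the score values are made up.

```python
import json
import numpy as np

# Hypothetical response body, shaped like the one parsed above
raw_body = b'{"predictions": [{"score": 0.91}, {"score": 0.08}, {"score": 0.47}]}'

result = json.loads(raw_body.decode())
scores = np.array([r["score"] for r in result["predictions"]])

# Turn continuous scores into 0/1 labels with a 0.5 threshold
labels = (scores > 0.5).astype(int)
print(scores, labels)
```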

Let's compare the mean absolute prediction error of the linear learner with that of a baseline that predicts the majority class for every instance.

In [None]:
test_mae_linear = np.mean(np.abs(test_y - test_pred))
test_mae_baseline = np.mean(np.abs(test_y - np.median(train_y))) ## training median as baseline predictor

print("Test MAE Baseline :", round(test_mae_baseline, 3))
print("Test MAE Linear:", round(test_mae_linear,3))
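The baseline relies on the fact that the median of a 0/1 label array equals the majority class. A small sketch with made-up labels and scores shows the same two MAE computations end to end.

```python
import numpy as np

train_labels = np.array([0, 0, 0, 1, 1])      # made-up: benign is the majority
test_labels = np.array([0, 1, 0, 0])
test_scores = np.array([0.1, 0.8, 0.3, 0.2])  # made-up model outputs

baseline = np.median(train_labels)            # majority class of the training labels
mae_baseline = np.mean(np.abs(test_labels - baseline))
mae_model = np.mean(np.abs(test_labels - test_scores))
print("baseline prediction:", baseline)
print("MAE baseline:", mae_baseline, "MAE model:", mae_model)
```

If the two classes were exactly tied in the training data, the median would be 0.5 rather than a class label, so the shortcut only works when one class is strictly more frequent.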


Let's compare predictive accuracy, using a classification threshold of 0.5 on the predicted scores, against the majority-class prediction from the training data set.

In [None]:
test_pred_class = (test_pred > 0.5).astype(int)
test_pred_baseline = np.repeat(np.median(train_y), len(test_y))

prediction_accuracy = np.mean((test_y == test_pred_class))*100
baseline_accuracy = np.mean((test_y == test_pred_baseline))*100

print("Prediction Accuracy:", round(prediction_accuracy,1), "%")
print("Baseline Accuracy:", round(baseline_accuracy,1), "%")

### Cleanup
Run the cell below to delete the endpoint once you are done.

In [None]:
# V3: Use endpoint.delete()
endpoint.delete()
print(f"Endpoint {endpoint_name} deleted")

---
## Extensions

- Our linear model does a good job of predicting breast cancer and has an overall accuracy of close to 92%. Re-running the model with different hyperparameter values, loss functions, etc. may provide more accurate out-of-sample predictions.
- We also did not do much feature engineering. We can create additional features by considering cross-products/interactions of multiple features, squaring the features or raising them to higher powers to induce non-linear effects, etc. If we expand the features using non-linear terms and interactions, we can then tweak the regularization parameter to optimize the expanded model and hence generate improved predictions.
- As a further extension, we can use many of the non-linear models available through SageMaker such as XGBoost, MXNet, etc.
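
The feature-engineering idea above can be sketched with NumPy. The helper `expand_features` is a hypothetical illustration that appends squared terms and pairwise products to a feature matrix; it is not part of the notebook, and whether it helps on this data set is untested.

```python
from itertools import combinations
import numpy as np

def expand_features(X):
    """Append squared terms and pairwise products to the columns of X."""
    X = np.asarray(X, dtype=float)
    squares = X ** 2
    # One product column per unordered pair of distinct features
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    interactions = (np.column_stack(pairs) if pairs
                    else np.empty((X.shape[0], 0)))
    return np.hstack([X, squares, interactions])

# Toy matrix: 2 rows, 3 features
X_small = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
X_expanded = expand_features(X_small)
print(X_expanded.shape)  # 3 original + 3 squared + 3 pairwise columns
```

Applied to the 30 features used here, this would produce 30 + 30 + 435 = 495 columns, so the `feature_dim` hyperparameter would have to be changed to match.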
