## Mammography Severity - Binary Classification Model using XGBoost Framework

### Introduction

This notebook uses the mammography severity dataset from UCI to build a binary classification model with the XGBoost framework. SageMaker Pipelines orchestrates model training and generates inferences on a scheduled basis.

### 1. Mammography Severity Dataset

This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes together with the ground truth (the severity field) for 516 benign and 445 malignant masses that have been identified on full field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006.

Reference: https://archive.ics.uci.edu/ml/datasets/mammographic+mass

%%html
<style>
table {float:left}
</style>

| Column| Description |
| --- | --- |
| BI-RADS assessment | 1 to 5 (ordinal) |
| Age | patient's age in years (integer) |
| Shape | mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal) |
| Margin | mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal) |
| Density | mass density: high=1 iso=2 low=3 fat-containing=4 (ordinal) |
| Severity | benign=0 or malignant=1 (binominal) |
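
The integer codes in the table can be kept in plain dictionaries so that raw rows are easier to read while exploring the data. The sketch below is illustrative only: `CODEBOOK` and `decode_row` are hypothetical names that do not appear elsewhere in this notebook, and the sample row is made up rather than taken from the dataset.

```python
# Hypothetical helper: map the integer codes from the table above to labels.
CODEBOOK = {
    "Shape": {1: "round", 2: "oval", 3: "lobular", 4: "irregular"},
    "Margin": {
        1: "circumscribed",
        2: "microlobulated",
        3: "obscured",
        4: "ill-defined",
        5: "spiculated",
    },
    "Density": {1: "high", 2: "iso", 3: "low", 4: "fat-containing"},
    "Severity": {0: "benign", 1: "malignant"},
}


def decode_row(row):
    """Return a copy of `row` with coded fields replaced by their labels.

    Fields without a codebook entry (BIRADS, Age) and missing values (None)
    are passed through unchanged.
    """
    decoded = {}
    for column, value in row.items():
        mapping = CODEBOOK.get(column)
        if mapping is None or value is None:
            decoded[column] = value
        else:
            decoded[column] = mapping[value]
    return decoded


# Made-up example row, not taken from the dataset.
example = {"BIRADS": 4, "Age": 57, "Shape": 4, "Margin": 5, "Density": None, "Severity": 1}
print(decode_row(example))
```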

### 2. Import Packages and Constants

In [2]:
import boto3
import sagemaker
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [3]:
# Replace these values to match the resources you created
default_bucket = "<s3-bucket-in-dev-account>"
model_artifacts_bucket = "<s3-bucket-in-central-model-registry-account>"
region = "us-east-1"
model_name = "mammography-severity-model"
role = sagemaker.get_execution_role()
lambda_role = "arn:aws:iam::<dev-account-id>:role/lambda-sagemaker-role"
kms_key = "arn:aws:kms:us-east-1:<dev-account-id>:key/<kms-key-id-in-dev-account>"
model_package_group_name="arn:aws:sagemaker:us-east-1:<central-model-registry-account-id>:model-package-group/mammo-severity-model-package"

In [4]:
feature_columns_names = [
    'BIRADS',
    'Age',
    'Shape',
    'Margin',
    'Density',
]
feature_columns_dtype = {
    'BIRADS': np.float64,
    'Age': np.float64,
    'Shape': np.float64,
    'Margin': np.float64,
    'Density': np.float64,
}

### 3. Generate raw, batch and batch with outliers datasets

#### 3.1 Raw Dataset

In [5]:
mammographic_data = pd.read_csv("data/mammographic_masses.data",header=None)

In [6]:
# split data into batch and raw datasets
batch_df =mammographic_data.sample(frac=0.05,random_state=200)
raw_df =mammographic_data.drop(batch_df.index)

In [7]:
# Split the raw dataset into two parts: part 1 is used to train the model
# initially, and part 2 is held back for retraining the model later
train_dataset_part2 =raw_df.sample(frac=0.1,random_state=200)
train_dataset_part1 =raw_df.drop(train_dataset_part2.index)
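
The `sample` then `drop` pattern used in the two cells above produces disjoint partitions, because `drop` removes exactly the index labels that `sample` selected. This relies on the index labels being unique, which holds for the default integer index. A minimal sketch on synthetic data (the `toy` frame is an assumption for illustration and stands in for the real file):

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 100 rows with a unique default index.
toy = pd.DataFrame({"x": range(100)})

# Hold out 5% of the rows, then keep the remainder by dropping those index labels.
holdout = toy.sample(frac=0.05, random_state=200)
remainder = toy.drop(holdout.index)

# The two parts share no index labels and together cover every row.
overlap = holdout.index.intersection(remainder.index)
print(len(holdout), len(remainder), len(overlap))
```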

In [8]:
# save the train datasets 
train_dataset_part1.to_csv("data/mammo-train-dataset-part1.csv",index=False)
train_dataset_part2.to_csv("data/mammo-train-dataset-part2.csv",index=False)

#### 3.2 Batch Dataset

In [9]:
# remove label column from the batch dataset which will be used to generate inferences
batch_df.drop(5,axis=1,inplace=True)

In [10]:
# create a copy of the batch dataset 
batch_modified_df = batch_df.copy()

In [11]:
def preprocess_batch_data(feature_columns_names,feature_columns_dtype,batch_df):
    batch_df.replace("?", "NaN", inplace = True)
    batch_df.columns = feature_columns_names
    batch_df = batch_df.astype(feature_columns_dtype)
    numeric_transformer = Pipeline( 
        steps=[("imputer", SimpleImputer(strategy="median"))]
        )
    numeric_features = list(feature_columns_names)
    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features)
        ]
    ) 
    batch_df = preprocess.fit_transform(batch_df)
    return batch_df
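
To make the effect of the function above concrete, the following sketch runs the same two steps (mark `?` as missing, then impute with the column median) on a tiny made-up frame. The column values are invented for illustration, and `np.nan` is used here in place of the `"NaN"` string; both end up as NaN after the cast to float.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows in the raw file's style: "?" marks a missing value.
raw = pd.DataFrame({"Age": ["40", "?", "60", "70"], "Density": ["3", "3", "?", "1"]})

# Convert "?" to NaN and cast to float so the imputer sees numeric data.
numeric = raw.replace("?", np.nan).astype(np.float64)

# Replace each NaN with the median of the observed values in its column.
imputed = SimpleImputer(strategy="median").fit_transform(numeric)
print(imputed)
```

In this sketch, as in `preprocess_batch_data`, the medians are computed from the data being transformed, so a batch is imputed with its own medians rather than with values learned from the training set.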

In [12]:
# save the batch dataset file
batch_df = preprocess_batch_data(feature_columns_names,feature_columns_dtype,batch_df)
pd.DataFrame(batch_df).to_csv("data/mammo-batch-dataset.csv", header=False, index=False)

#### 3.3 Batch Dataset with Outliers

In [13]:
# keep the missing values in this copy of the batch dataset (no imputation)
batch_modified_df.replace("?", "NaN", inplace = True)
batch_modified_df.columns = feature_columns_names
batch_modified_df = batch_modified_df.astype(feature_columns_dtype)
# save the batch dataset with outliers file
batch_modified_df.to_csv("data/mammo-batch-dataset-outliers.csv",index=False)

### 4. Copy Train Dataset Part 1, Batch Dataset with and without Outliers into S3 Bucket

In [14]:
s3_client = boto3.resource('s3')
s3_client.Bucket(default_bucket).upload_file("data/mammo-train-dataset-part1.csv","mammography-severity-model/data/train-dataset/mammo-train-dataset-part1.csv")
s3_client.Bucket(default_bucket).upload_file("data/mammo-batch-dataset.csv","mammography-severity-model/data/batch-dataset/mammo-batch-dataset.csv")
s3_client.Bucket(default_bucket).upload_file("data/mammo-batch-dataset-outliers.csv","mammography-severity-model/data/batch-dataset/mammo-batch-dataset-outliers.csv")



### 5. Generate Script to PreProcess Raw Dataset

In [15]:
%%writefile pipelines/train/scripts/raw_preprocess.py

import argparse
import os
import glob
import requests
import tempfile

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Since we get a headerless CSV file, we specify the column names here.
feature_columns_names = [
    'BIRADS',
    'Age',
    'Shape',
    'Margin',
    'Density',
]

label_column = 'Severity'

feature_columns_dtype = {
    'BIRADS': np.float64,
    'Age': np.float64,
    'Shape': np.float64,
    'Margin': np.float64,
    'Density': np.float64,
}

#benign=0 or malignant=1 
label_column_dtype = {'Severity': bool}


def merge_two_dicts(x, y):
    z = x.copy()
    z.update(y)
    return z

if __name__ == "__main__":
    
    base_dir = "/opt/ml/processing" 
    data_files = glob.glob(f"{base_dir}/input/*.csv")
    df = pd.concat((pd.read_csv(f) for f in data_files))
    df.replace("?", "NaN", inplace = True)
    df.columns = feature_columns_names + [label_column]
    feature_dtypes = merge_two_dicts(feature_columns_dtype, label_column_dtype)
    df = df.astype(feature_dtypes)
    
    numeric_features = list(feature_columns_names)
    numeric_transformer = Pipeline( 
        steps=[("imputer", SimpleImputer(strategy="median"))]
    )

    # This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each 
    # transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several 
    # feature extraction mechanisms or transformations into a single transformer.
    preprocess = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features)
        ]
    )

    y = df.pop("Severity") 

    X_pre = preprocess.fit_transform(df)
    y_pre = y.to_numpy().reshape(len(y), 1)

    X = np.concatenate((y_pre, X_pre), axis=1)
    np.random.shuffle(X)
    train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])

    pd.DataFrame(train).to_csv(f"{base_dir}/train/train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(
        f"{base_dir}/validation/validation.csv", header=False, index=False
    )
    pd.DataFrame(test).to_csv(f"{base_dir}/test/test.csv", header=False, index=False)
    pd.DataFrame(X_pre).to_csv(f"{base_dir}/baseline/baseline_dataset.csv", header=False, index=False)

Overwriting pipelines/train/scripts/raw_preprocess.py
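
The 70/15/15 split at the end of the script comes from passing two cut points to `np.split`. The sketch below shows the same call on a synthetic array; the array contents and the seeded generator are assumptions for illustration (the script itself uses the unseeded `np.random.shuffle`).

```python
import numpy as np

# Synthetic array with 100 rows standing in for the preprocessed dataset.
X = np.arange(200).reshape(100, 2)

# Seeded generator so this sketch is reproducible; rows are shuffled in place.
rng = np.random.default_rng(0)
rng.shuffle(X)

# Cut points at 70% and 85% of the rows give train/validation/test partitions.
train, validation, test = np.split(X, [int(0.7 * len(X)), int(0.85 * len(X))])
print(train.shape, validation.shape, test.shape)
```

Seeding the shuffle is a design choice worth considering if the split needs to be repeatable across pipeline runs.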


### 6. Generate Script to Evaluate Model

In [25]:
%%writefile pipelines/train/scripts/evaluate_model.py

import json
import pathlib
import tarfile

import joblib
import numpy as np
import pandas as pd
import xgboost

from sklearn.metrics import roc_auc_score

if __name__ == "__main__":
    model_path = "/opt/ml/processing/model/model.tar.gz"
    with tarfile.open(model_path) as tar:
        tar.extraction_filter = getattr(tarfile, 'data_filter',(lambda member, path: member))
        tar.extractall()

    model = joblib.load("xgboost-model")

    test_path = "/opt/ml/processing/test/test.csv"
    df = pd.read_csv(test_path, header=None)
    
    y_test = df.iloc[:, 0].to_numpy()
    df.drop(df.columns[0], axis=1, inplace=True)
    print ("y_test is", y_test) 
    print("y_test length is", len(y_test)) 

    X_test = xgboost.DMatrix(df.values)

    predictions = model.predict(X_test)
    
    print("predictions before changing to 1 and 0", predictions)
    # Threshold the predicted probabilities at 0.5 to get hard 0/1 labels
    predictions = np.where(predictions >= 0.5, 1.0, 0.0)

    print("predictions after changing to 1 and 0", predictions)
    print("predictions length is", len(predictions))

    # Number of test rows where the thresholded prediction matches the label
    count = int(np.sum(y_test == predictions))
    print("count is", count)

    # Note: AUC is computed on the thresholded labels, not on the raw scores
    auc = roc_auc_score(y_test, predictions)
    
    report_dict= {
        "classification_metrics": {
            "auc": {"value": auc},
        },
    }
    
    print ("auc is", auc) 
    
    output_dir = "/opt/ml/processing/evaluation"
    pathlib.Path(output_dir).mkdir(parents=True, exist_ok=True)

    evaluation_path = f"{output_dir}/evaluation.json"
    with open(evaluation_path, "w") as f:
        f.write(json.dumps(report_dict))

Overwriting pipelines/train/scripts/evaluate_model.py
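
The evaluation script thresholds the predictions at 0.5 before calling `roc_auc_score`. The value obtained that way can differ from the AUC computed on the raw predicted probabilities, which is worth keeping in mind when reading the `auc` figure in `evaluation.json`. A small sketch with made-up labels and scores (not taken from the dataset) shows the difference:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for illustration.
y_true = np.array([0, 0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.45, 0.4, 0.9])

# Hard labels at a 0.5 threshold, as in the evaluation script.
labels = (scores >= 0.5).astype(int)

# AUC on raw scores ranks every positive/negative pair; AUC on hard labels
# only sees a single operating point.
auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)
print(auc_from_scores, auc_from_labels)
```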


### 7. Copy Scripts to S3 Bucket

In [27]:
s3_client.Bucket(default_bucket).upload_file("pipelines/train/scripts/raw_preprocess.py","mammography-severity-model/scripts/raw_preprocess.py")
s3_client.Bucket(default_bucket).upload_file("pipelines/train/scripts/evaluate_model.py","mammography-severity-model/scripts/evaluate_model.py")

### 8. Get Train Model Pipeline Instance

In [35]:
from pipelines.train.train_pipeline import get_pipeline

train_pipeline = get_pipeline(
                        region=region,
                        role=role,
                        default_bucket=default_bucket,
                        model_artifacts_bucket=model_artifacts_bucket,
                        model_name = model_name,
                        kms_key = kms_key,
                        model_package_group_name=model_package_group_name,
                        pipeline_name="mammo-severity-train-pipeline",
                        base_job_prefix="mammo-severity",
                    )

In [36]:
train_pipeline.definition()

### 9. Submit the train pipeline and start execution

In [37]:
train_pipeline.upsert(role_arn=role)

In [34]:
train_execution = train_pipeline.start()

### 10. Generate script to pull the latest approved model

In [22]:
%%writefile pipelines/inference/scripts/lambda_helper.py

"""
This Lambda function gets information of the latest approved model from the central model registry
"""

import boto3
import json

def lambda_handler(event, context):
    
    sm_client = boto3.client('sagemaker', region_name=event['region'])

    # get a list of approved model packages from the model package group specified 
    approved_model_packages = sm_client.list_model_packages(
          ModelApprovalStatus='Approved',
          ModelPackageGroupName=event["model_package_group_name"],
          SortBy='CreationTime',
          SortOrder='Descending'
      )

    # find the latest approved model package
    try:
        latest_approved_model_package_arn = approved_model_packages['ModelPackageSummaryList'][0]['ModelPackageArn']
    except Exception as e:
        print("Failed to retrieve an approved model package:", e)
        # Re-raise: the steps below cannot run without a model package ARN
        raise

    # retrieve required information about the model
    latest_approved_model_package_descr =  sm_client.describe_model_package(ModelPackageName = latest_approved_model_package_arn)

    # model artifact uri (tar.gz file)
    model_artifact_uri = latest_approved_model_package_descr['InferenceSpecification']['Containers'][0]['ModelDataUrl']
    # sagemaker image in ecr
    image_uri = latest_approved_model_package_descr['InferenceSpecification']['Containers'][0]['Image']

    # get baseline metrics
    s3_baseline_uri_statistics = latest_approved_model_package_descr["ModelMetrics"]["ModelDataQuality"]["Statistics"]["S3Uri"]
    s3_baseline_uri_constraints = latest_approved_model_package_descr["ModelMetrics"]["ModelDataQuality"]["Constraints"]["S3Uri"]

    return {
        "model_artifact_uri": model_artifact_uri,
        "image_uri": image_uri,
        "s3_baseline_uri_statistics": s3_baseline_uri_statistics,
        "s3_baseline_uri_constraints": s3_baseline_uri_constraints
    }

Overwriting pipelines/inference/scripts/lambda_helper.py
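
The handler reads `event['region']` and `event['model_package_group_name']`, then takes the first entry of a list sorted by creation time in descending order. The selection step can be exercised locally without AWS access. In the sketch below, `latest_approved_arn` is a hypothetical helper, and `fake_response` is a hand-written stand-in shaped like the part of the `list_model_packages` response the handler uses; the ARNs are placeholders.

```python
def latest_approved_arn(response):
    """Return the ARN of the first package in a list_model_packages-style response.

    Assumes the list is already sorted newest first (SortOrder='Descending').
    Raises LookupError if no approved package is present.
    """
    summaries = response.get("ModelPackageSummaryList", [])
    if not summaries:
        raise LookupError("No approved model package found in the group")
    return summaries[0]["ModelPackageArn"]


# Hand-written stand-in for a boto3 response; the ARNs are placeholders.
fake_response = {
    "ModelPackageSummaryList": [
        {"ModelPackageArn": "arn:aws:sagemaker:us-east-1:111111111111:model-package/example/2"},
        {"ModelPackageArn": "arn:aws:sagemaker:us-east-1:111111111111:model-package/example/1"},
    ]
}
print(latest_approved_arn(fake_response))
```

Raising a specific exception when the list is empty makes the failure visible in the pipeline step, rather than surfacing later as an unrelated error.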


#### Upload script to S3 Bucket

In [23]:
s3_client = boto3.resource('s3')
s3_client.Bucket(default_bucket).upload_file("pipelines/inference/scripts/lambda_helper.py","mammography-severity-model/scripts/lambda_helper.py")

### 11. Get Inference Pipeline Instance Using Batch Dataset

In [38]:
from pipelines.inference.inference_pipeline import get_pipeline

inference_pipeline = get_pipeline(
                        region=region,
                        role=role,
                        lambda_role = lambda_role,
                        default_bucket=default_bucket,
                        kms_key=kms_key,
                        model_name = model_name,
                        model_package_group_name=model_package_group_name,
                        pipeline_name="mammo-severity-inference-pipeline",
                        batch_dataset_filename = "mammo-batch-dataset"
                    )

### 12. Submit the inference pipeline and start execution

In [39]:
inference_pipeline.upsert(role_arn=role)

In [26]:
inference_execution = inference_pipeline.start()

### 13. Get Inference Pipeline Instance Using Batch Dataset with Outliers

In [40]:
from pipelines.inference.inference_pipeline import get_pipeline

inference_pipeline = get_pipeline(
                        region=region,
                        role=role,
                        lambda_role = lambda_role,
                        default_bucket=default_bucket,
                        kms_key=kms_key,
                        model_name = model_name,
                        model_package_group_name=model_package_group_name,
                        pipeline_name="mammo-severity-inference-pipeline",
                        batch_dataset_filename = "mammo-batch-dataset-outliers"
                    )

### 14. Submit the inference pipeline and start execution

In [41]:
inference_pipeline.upsert(role_arn=role)

In [41]:
inference_execution = inference_pipeline.start()