## AWS Public Sector Machine Learning Workshop

**Part 1: Predicting fraudulent Medicare providers using XGBoost**

---

Since 2003, the US federal government has made approximately $\$$1.7 trillion in improper payments, with an estimated $\$$206 billion made in FY 2020 alone. Improper payments are now anticipated to increase in proportion to new levels of federal spending, from the $\$$1 trillion infrastructure bill to the anticipated $\$$3.5 trillion budget reconciliation plan.

How can we go beyond basic heuristic rulesets to help agencies fight improper payments at scale? 

Using preprocessed data from the Centers for Medicare & Medicaid Services (CMS), we'll demonstrate how to train a classification model to predict fraudulent Medicare providers using the XGBoost algorithm.

**Let's get started!**

### 1. Setup
<a id=section_1_0></a>

#### 1.1 Prerequisites
<a id=section_1_1></a>

In [None]:
!pip install imbalanced-learn

#### 1.2 Import packages and modules
<a id=section_1_2></a>

In [None]:
import numpy as np 
import pandas as pd
import boto3
import os
import sagemaker
import seaborn as sns
import matplotlib.pyplot as plt
import io
import sklearn
from math import sqrt
from sagemaker import get_execution_role
from sagemaker import RandomCutForest
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.datasets import dump_svmlight_file  
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from collections import Counter
from sagemaker.s3 import S3Downloader

%matplotlib inline

#### 1.3 Global config settings
<a id=section_1_3></a>

In [None]:
# Allow viewing of all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#### 1.4 Global config variables
<a id=section_1_4></a>

In [None]:
# Get IAM role and SageMaker session
role = get_execution_role()
session = sagemaker.Session()

# Set service clients
s3_client = boto3.client('s3')
#sm_client = boto3.client('sagemaker')
#smr_client = boto3.client('sagemaker-runtime')

# S3 settings
bucket = session.default_bucket() # Update as needed
prefix = 'fraud-detect-demo' # Update as needed

### 2. Exploratory Data Analysis
<a id=section_2_0></a>

#### 2.1 Unzipping the preprocessed data file
<a id=section_2_1></a>

In [None]:
#!gzip -d processed_data_classification.csv.gz

#### 2.2 Read the preprocessed Medicare data
<a id=section_2_2></a>

In [None]:
data = pd.read_csv('processed_data_classification_v2.csv', delimiter=',')

#### 2.3 View the dimensions of the dataset (#rows, #cols)
<a id=section_2_3></a>

In [None]:
data.shape

#### 2.4 Visually inspect the first few rows in the dataset
<a id=section_2_4></a>

In [None]:
data.head()

#### 2.5 Check data for any nulls
<a id=section_2_5></a>

In [None]:
data.isnull().values.any()

#### 2.6 Check for imbalance
<a id=section_2_6></a>

Review the value counts of the target (`fraudulent_provider`) to check for class imbalance.

In [None]:
data['fraudulent_provider'].value_counts()

We see that the majority of the data is non-fraudulent. To improve the performance of the model, we will rebalance the data using sampling techniques designed specifically for imbalanced problems. We use the random undersampling and oversampling techniques from imblearn to do this (http://glemaitre.github.io/imbalanced-learn/api.html).
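
As a rough illustration of how the imbalance can be quantified, the sketch below computes the minority share and the majority-to-minority ratio. The labels are hypothetical toy values, not the CMS data.

```python
from collections import Counter

# Hypothetical toy labels: 1 = fraudulent provider, 0 = non-fraudulent
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
minority_share = counts[1] / len(labels)
imbalance_ratio = counts[0] / counts[1]

print(f"Minority share: {minority_share:.1%}")
print(f"Majority-to-minority ratio: {imbalance_ratio:.0f}:1")
```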

#### 2.7 Display correlation matrix
<a id=section_2_7></a>

In [None]:
fig = plt.figure(figsize=(36, 36))
sns.heatmap(data.corr(), annot = True, fmt = '.2f', cmap='RdYlGn_r')
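
A large heatmap can be hard to read, so one possible complement is to list the most strongly correlated feature pairs. The sketch below does this on a hypothetical toy frame; the column names `a`, `b`, and `c` are illustrative and do not come from the CMS data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                        # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)
print(top_pairs.head())
```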

### 3. Preprocessing
<a id=section_3_0></a>

#### 3.1 Data Preparation
<a id=section_3_1></a>

Separate the label (the first column) from the features. Working with the underlying arrays also drops the column headers, which SageMaker does not need when processing CSV files.

In [None]:
# Split off the label (first column) from the feature columns
feature_columns = data.columns[1:]
label_column = data.columns[0]

# Setting the datatype to float32
features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

Split the dataset into training and test sets so we can evaluate the performance of our models. Since the data is highly imbalanced, it is important to stratify the split so that both sets keep the same class distribution.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, stratify=labels)
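
To see what stratification does, the following minimal sketch splits a hypothetical toy dataset with a 5% positive class and compares the positive rate in each part. The array names and `random_state` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with a 5% positive class
X = np.arange(1000, dtype="float32").reshape(-1, 1)
y = np.array([1] * 50 + [0] * 950)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# With stratify=y, both parts keep the 5% positive rate
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```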

#### 3.2 Applying SMOTE
<a id=section_3_2></a>

The oversampling ratio and the undersampling strategy are very important in improving the performance of the models. We have selected ratios for this dataset based on research from https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0225-0. However, try experimenting with different ratios to see their impact.

In [None]:
# Oversample the minority class with SMOTE to a 1:4 ratio
over = SMOTE(sampling_strategy=0.25)

# Undersample the majority class to achieve about a 1:1 ratio.
# The minority class will be the same amount (1 to 1) as the majority class
under = RandomUnderSampler(sampling_strategy=1)

# Add steps to parameter list
steps = [('o', over), ('u', under)]

# Create imblearn.pipeline and pass steps
pipeline = Pipeline(steps=steps)

# Fit and apply to the CMS dataset in a single transform
X_smote, y_smote = pipeline.fit_resample(X_train, y_train)
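
The effect of the two ratios on the class counts can be worked out by hand. The helper below is a hypothetical sketch of that arithmetic, assuming both ratios mean minority/majority as a float `sampling_strategy` does; the clamping with `max`/`min` is a simplification of this sketch, and the real samplers may reject such settings instead.

```python
def resampled_counts(n_major, n_minor, over_ratio=0.25, under_ratio=1.0):
    """Approximate class counts after oversampling, then undersampling.

    Ratios are minority/majority, matching a float sampling_strategy.
    """
    # Oversampling grows the minority class up to over_ratio * majority
    n_minor_over = max(n_minor, int(n_major * over_ratio))
    # Undersampling shrinks the majority class down to minority / under_ratio
    n_major_under = min(n_major, int(n_minor_over / under_ratio))
    return n_major_under, n_minor_over

# Hypothetical counts, not taken from the CMS dataset
print(resampled_counts(n_major=100_000, n_minor=1_000))
```

With these hypothetical counts, the minority class grows from 1,000 to 25,000 synthetic-plus-real samples, and the majority class is then cut from 100,000 to 25,000.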

#### 3.3 Check for imbalance
<a id=section_3_3></a>

Review the value counts of the target (`fraudulent_provider`) again to confirm that the classes are now balanced.

In [None]:
print(sorted(Counter(y_smote).items()))

#### 3.4 Split SMOTE augmented dataset
<a id=section_3_4></a>

In [None]:
X_smote_train, X_smote_validation, y_smote_train, y_smote_validation = train_test_split(
    X_smote, y_smote, test_size=0.1, stratify=y_smote)

#### 3.5 Prepare training data for uploading
<a id=section_3_5></a>

We first write the data to an in-memory buffer in libsvm format and then upload it to S3 (XGBoost can take either libsvm or CSV files as input).

In [None]:
# Use an in-memory buffer instead of a file
buf = io.BytesIO()

# Transform the dataset in libsvm/svmlight file format
sklearn.datasets.dump_svmlight_file(X_smote_train, y_smote_train, buf)

# Set pointer to the beginning of the file
buf.seek(0);
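
To make the libsvm format concrete, the sketch below writes a hypothetical two-row array to a buffer, prints the text, and reads it back. The toy arrays and the `n_features=3` argument are assumptions for this example only.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Hypothetical toy data: two rows, three features
X_toy = np.array([[0.5, 0.0, 2.0],
                  [0.0, 1.5, 3.0]], dtype="float32")
y_toy = np.array([1.0, 0.0], dtype="float32")

buf_toy = io.BytesIO()
dump_svmlight_file(X_toy, y_toy, buf_toy)

# Each line has the form "<label> <index>:<value> ..."
print(buf_toy.getvalue().decode("utf-8"))

# Rewind before reading, just as the buffer is rewound before uploading
buf_toy.seek(0)
X_back, y_back = load_svmlight_file(buf_toy, n_features=3)
```

Without the `seek(0)` call, a reader (or an upload) would start at the end of the buffer and see no data.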

#### 3.6 Upload training data to S3
<a id=section_3_6></a>

In [None]:
# Set the directory path in S3
key = 'fraud-dataset'
subdir = 'base'

In [None]:
# Upload data to S3
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', subdir, key)).upload_fileobj(buf)

# Display the S3 training data location
s3_train_data = 's3://{}/{}/train/{}/{}'.format(bucket, prefix, subdir, key)
print('Uploaded training data location: {}'.format(s3_train_data))

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

#### 3.7 Prepare validation data for uploading
<a id=section_3_7></a>

In [None]:
# Use an in-memory buffer instead of a file
buf = io.BytesIO()

# Transform the dataset in libsvm/svmlight file format
sklearn.datasets.dump_svmlight_file(X_smote_validation, y_smote_validation, buf)

# Set pointer to the beginning of the file
buf.seek(0);

#### 3.8 Upload validation data to S3
<a id=section_3_8></a>

In [None]:
# Upload data to S3
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation', subdir, key)).upload_fileobj(buf)

# Display the S3 validation data location
s3_validation_data = 's3://{}/{}/validation/{}/{}'.format(bucket, prefix, subdir, key)
print('Uploaded validation data location: {}'.format(s3_validation_data))

### 4. Model Training
<a id=section_4_0></a>

#### 4.1 Get the container URI for running XGBoost
<a id=section_4_1></a>

We will use the Amazon SageMaker XGBoost supervised learning algorithm for classification.

In [None]:
# Retrieves the ECR URI for the pre-built SageMaker XGBoost Docker image
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")

SageMaker abstracts training via Estimators. We can pass the classifier and parameters along with hyperparameters to the estimator, and fit the estimator to the data in S3. An important parameter here is `scale_pos_weight` which scales the weights of the positive vs. negative class examples. This is crucial to do in an imbalanced dataset like the one we are using here, otherwise the majority class would dominate the learning.

The other hyperparameters seen here were based on the results of hyperparameter optimization performed using SageMaker. We describe that technique in the next section of this notebook.

#### 4.2 Rebalance positive and negative weights
<a id=section_4_2></a>

In [None]:
# Because the dataset is so highly skewed, we set scale_pos_weight conservatively,
# as sqrt(num_nonfraud/num_fraud).
# Another common recommendation is to set scale_pos_weight to num_nonfraud/num_fraud.
scale_pos_weight = sqrt(np.count_nonzero(y_train==0)/np.count_nonzero(y_train))

In [None]:
scale_pos_weight
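
As a worked example of the two options, the sketch below compares the full ratio with its square root on hypothetical labels (9,900 non-fraud and 100 fraud); these counts are illustrative and not taken from the CMS data.

```python
from math import sqrt
import numpy as np

# Hypothetical labels: 9,900 non-fraud (0) and 100 fraud (1)
y_toy = np.array([0] * 9900 + [1] * 100)

n_neg = np.count_nonzero(y_toy == 0)
n_pos = np.count_nonzero(y_toy)

full_ratio = n_neg / n_pos        # num_nonfraud / num_fraud
conservative = sqrt(full_ratio)   # the square-root variant used above
print(full_ratio, round(conservative, 2))
```

Here the full ratio is 99, while the square root is roughly 9.95, so the positive class is up-weighted far less aggressively.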

#### 4.3 Train the model
<a id=section_4_3></a>

Estimators are a high-level interface for handling end-to-end Amazon SageMaker training and deployment tasks.

In [None]:
%%time

hyperparams = {
    "max_depth":7,
    "subsample":0.8,
    "num_round":145,
    "eta":0.82,
    "gamma":4,
    "min_child_weight":41.08,
    "silent":0,
    "objective":'binary:logistic',
    "eval_metric":'auc',
    "scale_pos_weight": scale_pos_weight
}

clf = sagemaker.estimator.Estimator(container,
                                    get_execution_role(),
                                    hyperparameters=hyperparams,
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=session)


clf.fit({'train': s3_train_data, 'validation': s3_validation_data })

### 5. Model Hosting
<a id=section_5_0></a>

#### 5.1 Create a real-time inference endpoint
<a id=section_5_1></a>

Now we deploy the estimator to an endpoint.

In [None]:
%%time
# Serialize data to a CSV-formatted string
csv_serializer = CSVSerializer()

# Create a real-time inference endpoint that hosts our trained model 
xgb_predictor = clf.deploy(initial_instance_count=1,
                       instance_type='ml.m4.xlarge', 
                       serializer=csv_serializer)

### 6. Model Evaluation
<a id=section_6_0></a>

Once we have trained the model we can use it to make predictions for the test set.

#### 6.1 Create a wrapper function for model testing
<a id=section_6_1></a>

In [None]:
# Because we have a large test set, we call predict on smaller batches
def predict(current_predictor, df, rows=500):
    """
    A wrapper function that invokes the endpoint's predict function in
    batches using a for loop.
    
    Parameters:
        current_predictor: The predictor object returned by the Estimator's deploy() call
        df: an array (or DataFrame) of observations without the target feature
        rows: approximate number of observations passed to the predict function per batch
      
    Returns:
        predictions: An array of predictions (of dtype float64)
    """
    
    # Split an array into multiple sub-arrays by dividing num of observations by rows parameter
    split_array = np.array_split(df, int(df.shape[0] / float(rows) + 1))
    
    # Initialize variable to store prediction results
    predictions = ''
    
    # Call the predictor's predict function on each batch
    for array in split_array:
        predictions = ','.join([predictions, current_predictor.predict(array).decode('utf-8')])

    # Return
    return np.fromstring(predictions[1:], sep=',')
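
The batching logic can be tried without a live endpoint. The sketch below uses a hypothetical `StubPredictor` that returns a fixed score per row as comma-separated bytes (an assumption about the response format) and applies the same split rule as the wrapper above.

```python
import numpy as np

class StubPredictor:
    """Hypothetical stand-in for a deployed endpoint; returns one score per row."""
    def __init__(self):
        self.calls = 0

    def predict(self, batch):
        self.calls += 1
        # Mimic a comma-separated byte response, one value per observation
        return ",".join("0.5" for _ in batch).encode("utf-8")

def predict_in_batches(predictor, data, rows=500):
    # Same split rule as the wrapper above: int(n / rows + 1) batches
    batches = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    parts = [predictor.predict(b).decode("utf-8") for b in batches if len(b)]
    return np.array([float(v) for v in ",".join(parts).split(",")])

stub = StubPredictor()
scores = predict_in_batches(stub, np.zeros((1200, 3), dtype="float32"))
print(stub.calls, scores.shape)
```

For 1,200 rows and `rows=500`, the split rule yields three batches of 400 rows each, so every batch stays below the requested size.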

#### 6.2 Test the model
<a id=section_6_2></a>

In [None]:
%%time
# Test the model by invoking the real-time inference endpoint with observations from the test dataset
raw_preds = predict(xgb_predictor, X_test)

#### 6.3 Calculate balanced accuracy scores
<a id=section_6_3></a>

We will use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.

We will use the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score).

We can also trade off performance between the two classes by adjusting the classification threshold (the score above which a point is labeled as fraud). We try different thresholds to see how they affect the result of the classification.

In [None]:
# Calculate balanced accuracy scores for different threshold values
proposed_threshold = 0.0
proposed_score = 0.0

for thres in np.linspace(0.1, 0.99, num=10):
    smote_thres_preds = np.where(raw_preds >= thres, 1, 0)
    score = balanced_accuracy_score(y_test, smote_thres_preds)
    print("Threshold: {:.3f}".format(thres))
    print("Balanced accuracy = {:.3f}".format(score))
    
    # Keep the best score so far (ties go to the higher threshold)
    if proposed_score <= score:
        proposed_score = score
        proposed_threshold = thres

In [None]:
# Use the best threshold found above
y_preds = np.where(raw_preds >= proposed_threshold, 1, 0)
print("Balanced accuracy = {}".format(balanced_accuracy_score(y_test, y_preds)))
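
Balanced accuracy is the mean of the recall on each class, which is why it is less sensitive to imbalance than plain accuracy. The sketch below computes it by hand for two thresholds and compares the result with scikit-learn; the labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

# Hypothetical labels and scores for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.25, 0.6, 0.2, 0.1, 0.4, 0.7, 0.8, 0.35])

def balanced_accuracy(y, y_hat):
    # Mean of the recall on the positive class and the recall on the negative class
    tpr = np.mean(y_hat[y == 1] == 1)
    tnr = np.mean(y_hat[y == 0] == 0)
    return (tpr + tnr) / 2

for thres in (0.3, 0.5):
    y_hat = np.where(scores >= thres, 1, 0)
    manual = balanced_accuracy(y_true, y_hat)
    print(thres, manual, balanced_accuracy_score(y_true, y_hat))
```

In this toy case the lower threshold catches both positives at the cost of more false alarms, and it yields the higher balanced accuracy.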

#### 6.4 Plot results in a confusion matrix
<a id=section_6_4></a>

Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix and the per-class precision, recall, and F1-score provide more information about the model's performance.

In [None]:
def plot_confusion_matrix(y_true, y_predicted):

    cm  = confusion_matrix(y_true, y_predicted)
    # Get the per-class normalized value for each cell
    cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    # We color each cell according to its normalized value, annotate with exact counts.
    ax = sns.heatmap(cm_norm, annot=cm, fmt="d")
    ax.set(xticklabels=["non-fraud", "fraud"], yticklabels=["non-fraud", "fraud"])
    ax.set_ylim([0,2])
    plt.title('Confusion Matrix')
    plt.ylabel('Real Classes')
    plt.xlabel('Predicted Classes')
    plt.show()

In [None]:
plot_confusion_matrix(y_test, y_preds)
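
The row normalization used for the cell colors can be checked on a small example. The sketch below applies the same expression to a hypothetical 2x2 confusion matrix; the counts are made up for illustration.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm = np.array([[90, 10],    # true non-fraud: 90 correct, 10 flagged as fraud
               [ 2,  8]])   # true fraud: 2 missed, 8 caught

# Divide each row by its total so that each row sums to 1
cm_norm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
print(cm_norm)

# The diagonal of the row-normalized matrix is the recall of each class
recall_per_class = np.diag(cm_norm)
```

Normalizing by row means the colors reflect how well each class is recovered, regardless of how many examples that class has.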

#### 6.5 Display Classification Report
<a id=section_6_5></a>

In [None]:
print(classification_report(
    y_test, y_preds, target_names=['non-fraud', 'fraud']))

In [None]:
# Classification Report transposed 
df_classification_report_smote = pd.DataFrame(
    classification_report(y_test, 
                          y_preds, 
                          target_names=['non-fraud', 'fraud'], 
                          output_dict=True)).T

df_classification_report_smote['support'] = df_classification_report_smote.support.apply(int)

df_classification_report_smote

#### 6.6 Training without SMOTE (Optional)
<a id=section_6_6></a>

In this section, we'll perform the same training, hosting, and model evaluation steps on the original dataset, then compare the classification reports for the original dataset and the SMOTE-augmented dataset.

In [None]:
%%time

# Converts data to SVM format and returns an in-memory buffer
def convertToSVM(X, y):
    buf = io.BytesIO()
    sklearn.datasets.dump_svmlight_file(X, y, buf)
    buf.seek(0);
    
    # Return    
    return buf


# Uploads an in-memory buffer as a file to a specified S3 location
def uploadToS3(path, buf):
    
    # Set the directory path in S3
    key = 'fraud-dataset-original'
    subdir = 'base'
    
    # Upload to S3
    boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, path, subdir, key)).upload_fileobj(buf)
    # Build the input location from `path` so it matches where the object was uploaded
    s3_input_location = 's3://{}/{}/{}/{}/{}'.format(bucket, prefix, path, subdir, key)
    s3_output_location = 's3://{}/{}/{}/output'.format(bucket, prefix, key)
    
    # Return
    return (s3_input_location, s3_output_location)


# Convert datasets to SVM and load into in-memory buffer
train_svm = convertToSVM(X_train, y_train)
test_svm = convertToSVM(X_test, y_test)

# Upload training and test datasets
s3_orig_train_data, s3_output_location = uploadToS3('train', train_svm)
s3_orig_test_data, s3_output_location = uploadToS3('test', test_svm)

# We'll reuse the Estimator from above to train on the original data
clf.output_path = s3_output_location
clf.fit({'train': s3_orig_train_data, 'validation': s3_orig_test_data })

In [None]:
%%time
# Serialize data to a CSV-formatted string
csv_serializer = CSVSerializer()

# Create a real-time inference endpoint that hosts our trained model 
xgb_predictor_orig = clf.deploy(initial_instance_count=1,
                       instance_type='ml.m4.xlarge', 
                       serializer=csv_serializer)

In [None]:
%%time
# Test the model by invoking the real-time inference endpoint with observations from the test dataset
y_preds_orig = predict(xgb_predictor_orig, X_test)

In [None]:
plot_confusion_matrix(y_test, y_preds_orig.round())

In [None]:
print(classification_report(
    y_test, y_preds_orig.round(), target_names=['non-fraud', 'fraud']))

In [None]:
# Classification Report transposed 
df_classification_report_orig = pd.DataFrame(
    classification_report(y_test, 
                          y_preds_orig.round(), 
                          target_names=['non-fraud', 'fraud'], 
                          output_dict=True)).T

df_classification_report_orig['support'] = df_classification_report_orig.support.apply(int)

df_classification_report_orig

In [None]:
# Compare the original (no SMOTE) and SMOTE classification reports
df_classification_report_orig.compare(df_classification_report_smote, align_axis=0).rename(index={'self': 'No SMOTE', 'other': 'SMOTE'})

Given that the CMS dataset is **highly imbalanced** and the objective of the model is to correctly identify **fraudulent** providers (the minority class), we focus on the **recall** and **F1** metrics for evaluating model performance. By applying SMOTE to the minority class and RandomUnderSampler to the majority class, the recall score increased from .26 to .98. Additionally, the f1-score improved from .35 to .57. Interestingly, the precision score decreased from .54 to .40 for the minority class, but increased from .93 to .99 for the majority class.
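
The f1-scores quoted above follow directly from the precision and recall values, since F1 is their harmonic mean. The short sketch below reproduces them; the helper name `f1_score_from` is hypothetical.

```python
def f1_score_from(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Minority-class (fraud) precision and recall as reported above
f1_no_smote = f1_score_from(precision=0.54, recall=0.26)
f1_smote = f1_score_from(precision=0.40, recall=0.98)

print(round(f1_no_smote, 2), round(f1_smote, 2))
```

The large gain in recall outweighs the drop in precision, which is why the f1-score rises even though more non-fraudulent providers are flagged.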

## Clean up

In [None]:
# Uncomment to clean up endpoints
# xgb_predictor.delete_endpoint()
# xgb_predictor_orig.delete_endpoint()

## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution was collected from CMS and analysed:

https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service

