# In this notebook, we use unsupervised machine learning (anomaly detection) to identify fraudulent Medicare providers, using CMS data that has been preprocessed with Data Wrangler.

## Setup

Import the required libraries (install `imbalanced-learn`, which provides the `imblearn` module, using pip if it is not present).

In [None]:
!pip install imbalanced-learn

In [None]:
import numpy as np 
import pandas as pd
import boto3
import os
import sagemaker
import seaborn as sns
import matplotlib.pyplot as plt
import io
import sklearn
from math import sqrt
from sagemaker import get_execution_role
from sagemaker import RandomCutForest
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import CSVSerializer
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.datasets import dump_svmlight_file  
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from collections import Counter

Show all columns and rows when displaying large DataFrames.

In [None]:
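# Use the fully qualified option names; the short forms can match several options in newer pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)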

In [None]:
session = sagemaker.Session()
bucket = session.default_bucket()
prefix = 'fraud-detect-demo'
role = get_execution_role()
s3_client = boto3.client("s3")

Let's start by reading in the entire preprocessed Medicare dataset prepared for anomaly detection. This dataset has many more data elements than the dataset prepared for classification.

In [None]:
!gzip -d processed_data_anomaly_detection1.csv.gz
!gzip -d processed_data_anomaly_detection2.csv.gz

In [None]:
data1 = pd.read_csv('processed_data_anomaly_detection1.csv', delimiter=',')
data2 = pd.read_csv('processed_data_anomaly_detection2.csv', delimiter=',')

In [None]:
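# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
data = pd.concat([data1, data2], ignore_index=True)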

In [None]:
data.head()

## Investigate and process the data

Check the data for any nulls.

In [None]:
data.isnull().values.any()
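
`isnull().values.any()` only reports whether at least one null exists. If it returns `True`, a per-column count helps locate the problem. A minimal sketch on a toy frame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value.
toy = pd.DataFrame({
    "total_services": [10.0, np.nan, 12.0],
    "avg_charge": [80.5, 900.0, 75.0],
})

# Same check as in the cell above: is there any null at all?
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)

# Count nulls per column and keep only the columns that have some.
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])
```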

Separate the label (the first column) from the features and convert both to `float32` arrays. This also drops the column headers, which SageMaker does not need for processing the data.

In [None]:
feature_columns = data.columns[1:]
label_column = data.columns[0]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')
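
As a small illustration of the slicing above, the sketch below applies the same steps to a toy DataFrame. The column names and values are made up; only the "label in the first column" layout is taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: label first, then two numeric features.
toy = pd.DataFrame({
    "fraud": [0, 1, 0],
    "total_services": [10.0, 250.0, 12.0],
    "avg_charge": [80.5, 900.0, 75.0],
})

toy_label_column = toy.columns[0]
toy_feature_columns = toy.columns[1:]

# .values returns plain NumPy arrays, so the header row is no longer attached.
toy_features = toy[toy_feature_columns].values.astype("float32")
toy_labels = toy[toy_label_column].values.astype("float32")

print(toy_features.shape, toy_labels.shape)
```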

We will split our dataset into train and test sets to evaluate the performance of our models. It's important to do so _before_ applying any technique meant to alleviate class imbalance. This ensures that we don't leak information from the test set into the train set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)
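
Because fraud labels are typically rare, a purely random split can leave the test set with very few positives. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A minimal sketch on synthetic data (the 5% fraud rate and all `demo_`/`Xd_`/`yd_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic stand-in data: 1000 rows, 4 features, 5% positives (assumed rate).
demo_features = rng.normal(size=(1000, 4)).astype("float32")
demo_labels = np.zeros(1000, dtype="float32")
demo_labels[:50] = 1.0

# stratify=demo_labels preserves the positive rate in both splits.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_features, demo_labels,
    test_size=0.1, random_state=42, stratify=demo_labels)

print("train positives:", int(yd_train.sum()), "of", len(yd_train))
print("test positives: ", int(yd_test.sum()), "of", len(yd_test))
```

Any resampling such as SMOTE or undersampling would then be applied to the training split only.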

## Training and Prediction - Unsupervised Learning (Anomaly Detection)

We will use anomaly detection, an unsupervised learning technique, to identify potential fraud.

In a fraud detection scenario, commonly we will have very few labeled examples, and it's possible that labeling fraud takes a very long time. We would like then to extract information from the unlabeled data we have at hand as well. _Anomaly detection_ is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. Random Cut Forest is a state-of-the-art anomaly detection algorithm that is both accurate and scalable. We will train such a model on our training data and evaluate its performance on our test set.

In [None]:
# specify general training job information
rcf = RandomCutForest(role=get_execution_role(),
                      instance_count=1,
                      instance_type='ml.c4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=1024,
                      num_trees=50)

In [None]:
rcf.fit(rcf.record_set(X_train, channel='train'))
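
To build intuition for tree-based anomaly scoring without launching a training job, the sketch below uses scikit-learn's `IsolationForest` on synthetic data. This is a related but different algorithm from SageMaker's Random Cut Forest, swapped in here only because it runs locally. The data, cluster positions, and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic data: a dense cluster of "normal" points plus a few far-away outliers.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outlier_points = rng.uniform(low=6.0, high=8.0, size=(10, 2))
demo_X = np.vstack([normal_points, outlier_points])

# Fit without labels, mirroring the unsupervised setup above.
iso = IsolationForest(n_estimators=50, max_samples=256, random_state=0)
iso.fit(demo_X)

# score_samples returns higher values for normal points, so negate it
# to get a score where higher means more anomalous.
demo_scores = -iso.score_samples(demo_X)

print("mean score, normal points :", demo_scores[:500].mean())
print("mean score, outlier points:", demo_scores[500:].mean())
```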

### Host Random Cut Forest

Once we have a trained model we can deploy it and get some predictions for our test set. 

In [None]:
rcf_predictor = rcf.deploy(
    endpoint_name='random-cut-forest-endpoint',
    initial_instance_count=1,
    instance_type='ml.c4.xlarge',
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer()
)

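If the predictor was already deployed, attach to the existing endpoint with the code below instead of deploying again. The serializer and deserializer must be set here as well, because `predict_rcf` expects CSV input and parsed JSON output.

In [None]:
# endpoint_name = 'random-cut-forest-endpoint'
# rcf_predictor = sagemaker.predictor.Predictor(
#     endpoint_name,
#     sagemaker_session=session,
#     serializer=CSVSerializer(),
#     deserializer=JSONDeserializer()
# )

To update the endpoint after a configuration change, point it at a different model. The available model names can be listed with the SageMaker client.

In [None]:
# rcf_predictor.update_endpoint(model_name="model name")

# smclient = boto3.client(service_name='sagemaker')
# smclient.list_models()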


### Full Test Random Cut Forest

With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate providers.

In [None]:
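def predict_rcf(current_predictor, d, rows=500):
    """Score the rows of `d` in batches of at most `rows` rows to keep each request small."""
    # The +1 guarantees at least one batch and that no batch exceeds `rows`.
    split_array = np.array_split(d, int(d.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        # Each response has the form {'scores': [{'score': ...}, ...]}, one entry per row.
        array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]
        predictions.append(array_preds)

    return np.concatenate([np.array(batch) for batch in predictions])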
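
The batching behaviour of this helper can be checked offline, without an endpoint. The sketch below repeats the helper so it is self-contained and calls it with `StubPredictor`, a hypothetical stand-in class (not part of the SageMaker SDK) that returns a response in the same `{'scores': [{'score': ...}]}` layout the helper expects.

```python
import numpy as np

def predict_rcf(current_predictor, d, rows=500):
    split_array = np.array_split(d, int(d.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        array_preds = [s['score'] for s in current_predictor.predict(array)['scores']]
        predictions.append(array_preds)
    return np.concatenate([np.array(batch) for batch in predictions])

class StubPredictor:
    """Hypothetical stand-in for the deployed endpoint; records batch sizes."""
    def __init__(self):
        self.batch_sizes = []

    def predict(self, array):
        self.batch_sizes.append(len(array))
        # Fake "score": the row sum, one entry per input row.
        return {'scores': [{'score': float(row.sum())} for row in array]}

stub = StubPredictor()
demo_rows = np.ones((1200, 3), dtype='float32')
demo_scores = predict_rcf(stub, demo_rows, rows=500)

print(stub.batch_sizes)    # 1200 rows are sent as three batches of 400
print(demo_scores.shape)   # one score per input row
```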

In [None]:
# Score the held-out test set, which the model did not see during training
positives = X_test[y_test == 1]
positives_scores = predict_rcf(rcf_predictor, positives)

In [None]:
negatives = X_test[y_test == 0]
negatives_scores = predict_rcf(rcf_predictor, negatives)

In [None]:
sns.set(color_codes=True)

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})

In [None]:
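# distplot is deprecated and does not accept `discrete`; histplot with a KDE overlay is the replacement
sns.histplot(positives_scores, label='fraud', bins=20, stat='density', kde=True, color='r')
sns.histplot(negatives_scores, label='not-fraud', bins=20, stat='density', kde=True, color='b')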
plt.legend()

From the plot above, we can see that the unsupervised model already achieves some separation between the classes, with higher anomaly scores (>1) tending to correspond to fraud. However, the technique alone is not sufficient to identify all fraud cases accurately. It is meant as a first step for identifying outliers; for more accurate results, we need additional techniques such as classification.
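
One common way to turn anomaly scores into fraud flags is to pick a cutoff, for example three standard deviations above the mean score, and compare the resulting flags with the known labels. The sketch below does this on synthetic scores. The score distributions and the three-sigma cutoff are assumptions for illustration, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

rng = np.random.RandomState(1)

# Synthetic anomaly scores (assumed shapes, not output of the endpoint).
demo_neg_scores = rng.normal(loc=0.8, scale=0.1, size=950)
demo_pos_scores = rng.normal(loc=1.4, scale=0.3, size=50)

all_scores = np.concatenate([demo_neg_scores, demo_pos_scores])
true_labels = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# Flag everything more than three standard deviations above the mean score.
cutoff = all_scores.mean() + 3 * all_scores.std()
flags = (all_scores > cutoff).astype(int)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, flags, labels=[0, 1])
print("cutoff:", cutoff)
print(cm)
print("balanced accuracy:", balanced_accuracy_score(true_labels, flags))
```

With the real endpoint, `all_scores` would be the concatenation of `negatives_scores` and `positives_scores`, and the cutoff would need to be tuned to the cost of false positives versus missed fraud.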

## Clean up

In [None]:
# Uncomment to clean up endpoints
# rcf_predictor.delete_endpoint()


## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution was collected from CMS and then analysed:

https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service

