# Credit Card Fraud Detector

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook. 

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-2/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

---

In this solution we build the core of a credit card fraud detection system using SageMaker. We start by training an unsupervised anomaly detection algorithm, [Random Cut Forest](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html), then proceed to train two [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) models for supervised training. To deal with the highly imbalanced data common in fraud detection, our first model uses XGBoost's weighting schema, and the second uses a re-sampling technique, SMOTE, to oversample the rare fraudulent examples. Lastly, we train an optimal XGBoost model with hyper-parameter optimization (HPO) to further improve the model performance.

You can select Run->Run All from the menu to run all cells in Studio (or Cell->Run All in a SageMaker Notebook Instance).

**Note**: When running this notebook on a SageMaker Classic notebook, make sure the 'conda_python3' image/kernel is used.

The credit card fraud detector solution contains six stages.

* Stage I: Investigate and process the data (time duration: ~0.5 min)
* Stage II: Train an unsupervised RandomCutForest model (time duration: ~6 min)
* Stage III: Train an XGBoost model with the built-in weighting schema (time duration: ~7.5 min)
* Stage IV: Train an XGBoost model with the over-sampling technique SMOTE (time duration: ~9 min)
* Stage V: Train an optimal XGBoost model with hyper-parameter optimization (HPO) (time duration: ~15 min)
* Stage VI: Evaluate and compare all model performance on the same test data (time duration: ~0.5 min)

## Stage I: Investigate and process the data

### Set up environment

In [None]:
import sys

sys.path.insert(0, "./src/")

### Install dependency

In [None]:
!pip install -U seaborn aws_requests_auth imblearn

Let's start by reading in the credit card fraud data set.

In [None]:
import boto3
from package import config
import uuid

instance_type_train = "ml.m5.4xlarge"
instance_type_inference = "ml.g4dn.2xlarge"

s3 = boto3.resource("s3", region_name=config.AWS_REGION)
s3.Object(
    f"{config.SOLUTIONS_S3_BUCKET}-{config.AWS_REGION}",
    f"{config.SOLUTION_NAME}/data/creditcardfraud.zip",
).download_file("creditcardfraud.zip")
unique_hash = str(uuid.uuid4())[:6]

In [None]:
from zipfile import ZipFile

with ZipFile("creditcardfraud.zip", "r") as zf:
    zf.extractall()

In [None]:
import numpy as np
import pandas as pd

data = pd.read_csv("creditcard.csv", delimiter=",")

In [None]:
data

Let's take a peek at our data (we only show a subset of the columns in the table):

In [None]:
print(data.columns)
data[["Time", "V1", "V2", "V27", "V28", "Amount", "Class"]].describe()

The dataset contains
only numerical features, because the original features have been transformed using PCA, to protect user privacy. As a result,
the dataset contains 28 PCA components, V1-V28, and two features that haven't been transformed, _Amount_ and _Time_.
_Amount_ refers to the transaction amount, and _Time_ is the seconds elapsed between any transaction in the data
and the first transaction.

The class column corresponds to whether or not a transaction is fraudulent. We see that the majority of data is non-fraudulent with only $492$ ($0.173\%$) of the data corresponding to fraudulent examples, out of the total of 284,807 examples in the data. This corresponds to the extreme class imbalance scenario.

In [None]:
nonfrauds, frauds = data.groupby("Class").size()
print("Number of frauds: ", frauds)
print("Number of non-frauds: ", nonfrauds)
print("Percentage of fraudulent data:", 100.0 * frauds / (frauds + nonfrauds))

We already know that the columns $V_i$ are the result of a PCA and are therefore centered at $0$ mean (PCA components are not rescaled to unit standard deviation unless whitening is applied).

In [None]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype("float32")
labels = (data[label_column].values).astype("float32")

Next, we prepare our data for loading and training.

We split our dataset into a train and a test set to evaluate the performance of our models. It's important to do so _before_ any techniques meant to alleviate the class imbalance are used. This ensures that we don't leak information from the test set into the train set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42
)

In [None]:
np.mean(y_test), np.mean(y_train), X_train.shape, X_test.shape

> Note: If you are bringing your own data to this solution and it includes categorical features with string values, you need to one-hot encode these values first, for example with sklearn's [OneHotEncoder](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features), because XGBoost only supports numerical data.

## Stage II: Train an unsupervised RandomCutForest model

In a fraud detection scenario, commonly we have very few labeled examples, and it's possible that labeling fraudulent examples takes a very long time. We would like then to extract information from the unlabeled data we have at hand as well. _Anomaly detection_ is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. [Random Cut Forest](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html) is a state-of-the-art anomaly detection algorithm that is both accurate and scalable. We train such a model on our training data and evaluate its performance on our test set.

In [None]:
import os
import sagemaker
from package import config

session = sagemaker.Session()
bucket = config.MODEL_DATA_S3_BUCKET
prefix = "fraud-classifier"

In [None]:
from sagemaker import RandomCutForest

training_job_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-rcf"
print(
    f"You can go to SageMaker -> Training -> Training jobs -> a job name started with {training_job_name} to monitor training status and details."
)

# specify general training job information
rcf = RandomCutForest(
    role=config.SAGEMAKER_IAM_ROLE,
    instance_count=1,
    instance_type=instance_type_train,
    data_location="s3://{}/{}/".format(bucket, prefix),
    output_path="s3://{}/{}/output".format(bucket, prefix),
    base_job_name=training_job_name,
    num_samples_per_tree=512,
    num_trees=50,
)

Now we are ready to fit the model. The below cell should take around 5 minutes to complete.

In [None]:
rcf.fit(rcf.record_set(X_train))

### Host Random Cut Forest

Once we have a trained model we can deploy it and get some predictions for our test set. SageMaker spins up an instance for us and deploys the model. The whole process should take around 10 minutes; progress is shown by a `-` for each step, and an exclamation point marks the end of the process.

In [None]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer


endpoint_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-rcf"
print(
    f"You can go to SageMaker -> Inference -> Endpoints --> an endpoint with name {endpoint_name} to monitor the deployment status."
)

rcf_predictor = rcf.deploy(
    model_name=f"{config.SOLUTION_PREFIX}-{unique_hash}-rcf",
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type_inference,  # use a smaller instance for endpoint deployment
    serializer=CSVSerializer(),
    deserializer=JSONDeserializer(),
)

### Test Random Cut Forest

With the model deployed, let's see how it performs in terms of separating fraudulent from legitimate transactions.

In [None]:
def predict_rcf(current_predictor, data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = []
    for array in split_array:
        array_preds = [s["score"] for s in current_predictor.predict(array)["scores"]]
        predictions.append(array_preds)

    return np.concatenate([np.array(batch) for batch in predictions])

With each data example, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal" (non-fraudulent in our case). High values indicate the presence of an anomaly in the data (fraudulent). The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
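
The three-standard-deviation rule of thumb mentioned above can be sketched in a few lines. This is only an illustration: `three_sigma_cutoff` is a hypothetical helper and the scores are made-up values, not output from the deployed endpoint.

```python
import statistics


def three_sigma_cutoff(scores):
    """Return mean + 3 * population standard deviation of the scores."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return mean + 3 * std


# Hypothetical anomaly scores: 19 "normal" values and one clear outlier.
example_scores = [
    0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.78, 0.82, 0.81, 0.80,
    0.79, 0.81, 0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78, 1.60,
]

cutoff = three_sigma_cutoff(example_scores)
# Scores above the cutoff are treated as anomalous.
flagged = [s for s in example_scores if s > cutoff]
print(f"cutoff = {cutoff:.3f}, flagged = {flagged}")
```

In practice the cutoff would be derived from the score distribution of your own data; later in this notebook a fixed cutoff is read off the histogram instead.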

We get the predictions of anomaly scores for the test examples by querying the deployed endpoint.

In [None]:
scores = predict_rcf(rcf_predictor, X_test)

First, we examine the anomaly scores for positive (fraudulent) and negative (non-fraudulent) examples separately. The expectation is that positive (fraudulent) examples have relatively high scores, and negative (non-fraudulent) examples have relatively low scores.

In [None]:
positive_samples_scores = scores[y_test == 1]
negative_samples_scores = scores[y_test == 0]

We plot histograms for scores of positive (left histogram) and negative examples (right histogram). Because the number of positive and negative examples differs significantly (which you can tell from the y axis of each histogram), we plot them separately. 
From the histograms, we can see the following patterns:

1. Almost half of the positive examples (left histogram) have scores above 0.9, while most of the negative examples (right histogram) have anomaly scores below 0.85.
2. Because it does not use label information, the unsupervised learning method RCF is limited in how accurately it can separate fraudulent from non-fraudulent examples. This issue can be resolved by collecting label information and using a supervised learning method, as demonstrated in the following stages.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)

plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True
f, axes = plt.subplots(1, 2)
plot1 = sns.histplot(positive_samples_scores, label="fraud", bins=20, color="red", ax=axes[0])
plot2 = sns.histplot(negative_samples_scores, label="not-fraud", bins=20, color="blue", ax=axes[1])
axes[0].set_xlim(0.5, 1.5)
axes[0].set_xlabel("Anomaly Score")
axes[1].set_xlim(0.5, 1.5)
axes[1].set_xlabel("Anomaly Score")
axes[0].legend()
axes[1].legend()
plt.show()

Next, let's assume a real-world scenario where we want to classify each test example as either positive (fraudulent) or negative (non-fraudulent) based on its anomaly score. We plot the histogram of scores for all the test examples below, then choose a cutoff score of `1.0` (based on the pattern shown in the histogram) for classification. Specifically, if the anomaly score is less than or equal to `1.0`, the example is classified as negative (non-fraudulent); otherwise, it is classified as positive (fraudulent). We then compare the classification result with the ground truth labels and compute the evaluation metrics. The evaluation metrics are [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), [Cohen's Kappa score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html), [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), and [ROC_AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). For all of the metrics, a larger value indicates better predictive performance.

In [None]:
n, bins, patches = plt.hist(scores, 50, density=False, facecolor="g", alpha=0.75)

plt.xlabel("Anomaly Score")
plt.ylabel("Count")
plt.title("Histogram of Scores for Test Examples")
plt.xlim(0.8, 1.4)
plt.grid(True)
plt.axvline(x=1.0, color="r")
plt.show()

In [None]:
y_preds_rcf = np.where(scores > 1.0, 1, 0)

Note: Because the RCF model does not provide an estimated probability for the positive and negative classes on each example, we do not compute the [ROC_AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) score here. This metric is computed in the following stages using supervised learning methods.

In [None]:
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score, roc_auc_score

result_rcf = [
    balanced_accuracy_score(y_test, y_preds_rcf),
    cohen_kappa_score(y_test, y_preds_rcf),
    f1_score(y_test, y_preds_rcf),
]
result_rcf.append("-")
result_rcf = pd.DataFrame(
    result_rcf, index=["Balanced accuracy", "Cohen's Kappa", "F1", "ROC_AUC"], columns=["RCF"]
)

print(result_rcf)

The unsupervised model can already achieve some separation between the classes, with higher anomaly scores being correlated with fraud.

## Stage III: Train an XGBoost model with the built-in weighting schema

Once we have gathered an adequate amount of labeled training data, we can use a supervised learning algorithm that discovers relationships between the features and the dependent class.

We use Gradient Boosted Trees as our model, as they have a proven track record, are highly scalable and can deal with missing data, reducing the need to pre-process datasets.

### Prepare Data and Upload to S3

First we copy the data to an in-memory buffer.

In [None]:
import io
import sklearn
from sklearn.datasets import dump_svmlight_file

buf = io.BytesIO()

sklearn.datasets.dump_svmlight_file(X_train, y_train, buf)
buf.seek(0);

Now we upload the data to S3 using boto3.

In [None]:
key = "fraud-dataset"
subdir = "base"
boto3.resource("s3", region_name=config.AWS_REGION).Bucket(bucket).Object(
    os.path.join(prefix, "train", subdir, key)
).upload_fileobj(buf)

s3_train_data = "s3://{}/{}/train/{}/{}".format(bucket, prefix, subdir, key)
print("Uploaded training data location: {}".format(s3_train_data))

output_location = "s3://{}/{}/output".format(bucket, prefix)
print("Training artifacts are uploaded to: {}".format(output_location))

We can now train using SageMaker's built-in XGBoost algorithm. A complete list of built-in algorithms is found here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

Moving on to training, we first need to specify the location of the XGBoost algorithm container. We use a utility function to obtain its image URI.

In [None]:
container = sagemaker.image_uris.retrieve("xgboost", boto3.Session().region_name, "latest")
display(container)

SageMaker abstracts training via Estimators. We can pass the classifier and parameters along with hyperparameters to the estimator, and fit the estimator to the data in S3. An important parameter here is `scale_pos_weight` which scales the weights of the positive vs. negative class examples. This is crucial to do in an imbalanced dataset like the one we are using here, otherwise the majority class would dominate the learning.
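
To make the effect of `scale_pos_weight` more concrete, the sketch below shows how a per-class weight enters the gradient of the logistic loss. It is an illustrative reimplementation under the assumption that positive examples are simply multiplied by the weight; `weighted_logistic_grad` is a hypothetical helper, not part of XGBoost or SageMaker.

```python
import math


def weighted_logistic_grad(margin, label, scale_pos_weight=1.0):
    """Gradient and hessian of the logistic loss w.r.t. the raw margin.

    Positive examples (label == 1) are up-weighted by scale_pos_weight;
    negative examples keep a weight of 1.
    """
    p = 1.0 / (1.0 + math.exp(-margin))  # predicted probability
    w = scale_pos_weight if label == 1 else 1.0
    grad = w * (p - label)
    hess = w * p * (1.0 - p)
    return grad, hess


# At margin 0 the model predicts p = 0.5 for every example.
g_neg, h_neg = weighted_logistic_grad(0.0, 0)
g_pos, h_pos = weighted_logistic_grad(0.0, 1, scale_pos_weight=24.0)
print(g_neg, g_pos)
```

With a weight of 24, one misclassified fraud example pulls on the model as hard as 24 misclassified legitimate ones, which is what keeps the majority class from dominating.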

In [None]:
from math import sqrt

# Because the data set is so highly skewed, we set scale_pos_weight conservatively,
# as sqrt(num_nonfraud/num_fraud).
# Other recommendations for the scale_pos_weight are setting it to (num_nonfraud/num_fraud).
scale_pos_weight = sqrt(np.count_nonzero(y_train == 0) / np.count_nonzero(y_train))
hyperparams = {
    "max_depth": 5,
    "subsample": 0.8,
    "num_round": 100,
    "eta": 0.9,
    "gamma": 10,
    "min_child_weight": 16,
    "silent": 0,
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "scale_pos_weight": scale_pos_weight,
}

Let us explain the hyper-parameters used above. The one that's very relevant for learning from skewed data is `scale_pos_weight`. This is a ratio that weighs the examples of the positive class (fraud) against the negative class (legitimate). Commonly this is set to `(num_nonfraud/num_fraud)`, but our data is extremely skewed so we set it to `sqrt(num_nonfraud/num_fraud)`. For the full dataset in this example, this would be `sqrt(284,315/492)`, which gives our fraud examples a weight of ~24. (The cell above computes the value on the training split, so the exact number differs slightly.)
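
As a quick check of that arithmetic, the following sketch recomputes the weight from the full-dataset counts quoted earlier (492 frauds out of 284,807 examples). The variable names are illustrative.

```python
from math import sqrt

total_examples = 284_807
num_fraud = 492
num_nonfraud = total_examples - num_fraud  # 284,315 legitimate transactions

full_ratio = num_nonfraud / num_fraud   # the commonly recommended weight
conservative_weight = sqrt(full_ratio)  # the weight used in this notebook

print(f"num_nonfraud / num_fraud       = {full_ratio:.1f}")
print(f"sqrt(num_nonfraud / num_fraud) = {conservative_weight:.2f}")
```

The full ratio is in the high hundreds, while its square root is about 24, which is why the square root is described as the conservative choice.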

The rest of the hyper-parameters are as follows:

* `max_depth`: This is the maximum depth of the trees that are built for our ensemble. A max depth of 5 gives us trees with up to 32 leaves. Note that tree size grows exponentially when increasing this parameter (`num_leaves=2^max_depth`), so a max depth of 10 would give us trees with 1024 leaves, which are likely to overfit.
* `subsample`: The subsample ratio that we use to select a subset of the complete data to train each tree in the ensemble. With a value of 0.8, each tree is trained on a random sample containing 80% of the complete data. This is used to prevent overfitting.
* `num_round`: This is the size of the ensemble. We use 100 "rounds", each training round adding a new tree to the ensemble.
* `eta`: This is the step size shrinkage applied at each update. This value shrinks the weights of new features to prevent overfitting.
* `gamma`: This is the minimum loss reduction to reach before splitting a leaf. Splitting a leaf can sometimes have a small benefit, and splitting such leaves can lead to overfitting. By setting `gamma` to values larger than zero, we ensure that there should be at least some non-negligible amount of accuracy gain before splitting a leaf.
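* `min_child_weight`: This is the minimum sum of instance weights (hessian) required in a child node. Its effect is similar to `gamma`: with higher values, a leaf is only split when each resulting child still covers enough examples, which makes the trees more conservative.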
* `objective`: We are doing binary classification, so we use a logistic loss objective.
* `eval_metric`: Having a good evaluation metric is crucial when dealing with imbalanced data (see discussion below). We use AUC here.
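
The roles of `gamma` and `min_child_weight` can be illustrated with the split-gain formula from the XGBoost paper. The sketch below is a simplified reading of that formula, not the library's implementation; `split_gain` and `split_allowed` are hypothetical helpers, and the gradient and hessian sums are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, minus the gamma penalty.

    g_* and h_* are the sums of gradients and hessians in each child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Hypothetical candidate split with a raw gain of about 4.76.
raw_gain = split_gain(-10.0, 20.0, 10.0, 20.0)
print(raw_gain)
print(split_allowed(-10.0, 20.0, 10.0, 20.0, gamma=0.0, min_child_weight=16))   # True
print(split_allowed(-10.0, 20.0, 10.0, 20.0, gamma=10.0, min_child_weight=16))  # False: gain below gamma
print(split_allowed(-10.0, 12.0, 10.0, 20.0, gamma=0.0, min_child_weight=16))   # False: left child too light
```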

In [None]:
training_job_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb"
print(
    f"You can go to SageMaker -> Training -> Training jobs -> a job name started with {training_job_name} to monitor training status and details."
)

clf = sagemaker.estimator.Estimator(
    container,
    role=config.SAGEMAKER_IAM_ROLE,
    hyperparameters=hyperparams,
    instance_count=1,
    instance_type=instance_type_train,
    output_path=output_location,
    sagemaker_session=session,
    base_job_name=training_job_name,
)

We can now fit our supervised training model. The call to `fit` below should take around 5 minutes to complete.

In [None]:
clf.fit({"train": s3_train_data})

### Host Classifier

Now we deploy the estimator to an endpoint. As before, progress is indicated by `-`, and the deployment should be done after about 10 minutes.

In [None]:
from sagemaker.serializers import CSVSerializer

endpoint_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb"
print(
    f"You can go to SageMaker -> Inference -> Endpoints --> an endpoint with name {endpoint_name} to monitor the deployment status."
)

predictor = clf.deploy(
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    instance_type=instance_type_inference,  # use a smaller instance for endpoint deployment
    serializer=CSVSerializer(),
)

### Evaluation

Once we have trained the model we can use it to make predictions for the test set.

In [None]:
# Because we have a large test set, we call predict on smaller batches
def predict(current_predictor, data, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ""
    for array in split_array:
        predictions = ",".join([predictions, current_predictor.predict(array).decode("utf-8")])

    return np.fromstring(predictions[1:], sep=",")

In [None]:
raw_preds = predict(predictor, X_test)

We use a few measures from the scikit-learn package to evaluate the performance of our model. When dealing with an imbalanced dataset, we need to choose metrics that take into account the frequency of each class in the data.

Four such metrics are the [balanced accuracy score](https://scikit-learn.org/stable/modules/model_evaluation.html#balanced-accuracy-score), [Cohen's Kappa](https://scikit-learn.org/stable/modules/model_evaluation.html#cohen-s-kappa), [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), and [ROC_AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html).
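
To see why plain accuracy is misleading on imbalanced data, consider a trivial classifier that labels every transaction as non-fraud. The sketch below computes the metrics directly from confusion-matrix counts; `imbalance_metrics` is a hypothetical helper and the counts are illustrative, not results from this notebook.

```python
def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, balanced accuracy, F1 and Cohen's Kappa from confusion counts.

    Assumes both classes occur in the ground truth and that the expected
    agreement is below 1, so no denominator is zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    balanced_accuracy = 0.5 * (recall_pos + recall_neg)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1.0 - expected)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "kappa": kappa,
    }


# 10,000 transactions with 17 frauds; the classifier predicts "non-fraud" for all.
always_negative = imbalance_metrics(tp=0, fp=0, fn=17, tn=9983)
print(always_negative)
```

Accuracy is above 99% even though no fraud is caught, while balanced accuracy drops to 0.5 and both F1 and Cohen's Kappa drop to 0.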

In [None]:
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score, roc_auc_score

# scikit-learn expects 0/1 predictions, so we threshold our raw predictions
y_preds = np.where(raw_preds > 0.5, 1, 0)
result_xgboost = [
    balanced_accuracy_score(y_test, y_preds),
    cohen_kappa_score(y_test, y_preds),
    f1_score(y_test, y_preds),
]
result_xgboost.append(roc_auc_score(y_test, raw_preds))
result_xgboost = pd.DataFrame(
    result_xgboost, index=["Balanced accuracy", "Cohen's Kappa", "F1", "ROC_AUC"], columns=["XGB"]
)

In [None]:
result_rcf_xgboost = result_rcf.join(result_xgboost)
print(result_rcf_xgboost)

We can see that the supervised learning method XGBoost with the weighting schema (using `scale_pos_weight` to scale the weights of the positive vs. negative class examples) achieves significantly better performance than the unsupervised learning method RCF. We can also see that there is still room to improve the model performance. In particular, Cohen's Kappa scores above 0.8 are generally very favorable.

Apart from single-value metrics, it's also useful to look at metrics that indicate performance per class. A confusion matrix, and per-class precision, recall and f1-score can also provide more information about the model's performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(y_true, y_predicted):
    cm = confusion_matrix(y_true, y_predicted)
    # Get the per-class normalized value for each cell
    cm_norm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]

    # We color each cell according to its normalized value, annotate with exact counts.
    ax = sns.heatmap(cm_norm, annot=cm, fmt="d")
    ax.set(xticklabels=["non-fraud", "fraud"], yticklabels=["non-fraud", "fraud"])
    ax.set_ylim([0, 2])
    plt.title("Confusion Matrix")
    plt.ylabel("Real Classes")
    plt.xlabel("Predicted Classes")
    plt.show()

In [None]:
plot_confusion_matrix(y_test, y_preds)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_preds, target_names=["non-fraud", "fraud"]))

## Stage IV: Train an XGBoost model with the over-sampling technique SMOTE

Now that we have a baseline model using XGBoost, we can try to see if sampling techniques that are designed specifically for imbalanced problems can improve the performance of the model.

For that purpose we use the [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) package that works well with scikit-learn. We have pre-installed the package for this kernel, but if you need it for a different Jupyter kernel you can install it by running `pip install --upgrade imbalanced-learn` within the conda environment you need.

We use the [Synthetic Minority Over-sampling Technique](https://arxiv.org/abs/1106.1813) (SMOTE), which oversamples the minority class by interpolating new data points between existing ones.
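
The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration: real SMOTE first picks the neighbor among the k nearest minority-class neighbors, whereas here the neighbor is given. `smote_like_sample` is a hypothetical helper, not the imbalanced-learn API.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Create a synthetic point on the line segment between x and its neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
fraud_a = [1.0, 2.0]  # a minority-class example
fraud_b = [3.0, 6.0]  # one of its minority-class neighbors
synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Every synthetic point lies between two existing fraud examples, so SMOTE fills in the region spanned by the minority class instead of duplicating points.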

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

We can see that SMOTE has now balanced the two classes:

In [None]:
from collections import Counter

print(sorted(Counter(y_smote).items()))

We note that this is a case of extreme oversampling of the minority class: we went from ~0.17% to 50%. An alternative would be to use a smaller resampling ratio, such as having one minority class sample for every `sqrt(non_fraud/fraud)` majority samples, or to use more advanced resampling techniques. See the [comparison](https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py) provided by imbalanced-learn for more over-sampling options.
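
As a rough sketch of the smaller resampling ratio mentioned above, the target ratio can be computed and passed to SMOTE through its `sampling_strategy` parameter, which for a float is the desired minority-to-majority ratio after resampling. The counts below are the full-dataset figures and are used only for illustration; the training split has slightly different counts.

```python
from math import sqrt

num_nonfraud = 284_315  # illustrative full-dataset counts
num_fraud = 492

# One minority sample for every sqrt(non_fraud / fraud) majority samples.
sampling_strategy = 1.0 / sqrt(num_nonfraud / num_fraud)
target_fraud = int(sampling_strategy * num_nonfraud)

print(f"sampling_strategy = {sampling_strategy:.4f}")
print(f"fraud examples after resampling = {target_fraud}")

# With imbalanced-learn, this ratio could then be used as:
#   SMOTE(sampling_strategy=sampling_strategy, random_state=42)
```

This keeps the classes imbalanced, but far less so than before, and it generates many fewer synthetic fraud examples than balancing to 50%.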

In our case we use the SMOTE dataset we just created and upload it to S3 for training.

In [None]:
smote_buf = io.BytesIO()

# Dump the SMOTE data into a buffer
sklearn.datasets.dump_svmlight_file(X_smote, y_smote, smote_buf)
smote_buf.seek(0)

# Upload from the buffer to S3
key = "fraud-dataset-smote"
subdir = "smote"
boto3.resource("s3", region_name=config.AWS_REGION).Bucket(bucket).Object(
    os.path.join(prefix, "train", subdir, key)
).upload_fileobj(smote_buf)

s3_smote_train_data = "s3://{}/{}/train/{}/{}".format(bucket, prefix, subdir, key)
print("Uploaded training data location: {}".format(s3_smote_train_data))

smote_output_location = "s3://{}/{}/smote-output".format(bucket, prefix)
print("Training artifacts are uploaded to: {}".format(smote_output_location))

In [None]:
# No need to scale weights after SMOTE resampling, so we remove that parameter
hyperparams.pop("scale_pos_weight", None)

training_job_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb-smote"
print(
    f"You can go to SageMaker -> Training -> Training jobs -> a job name started with {training_job_name} to monitor training status and details."
)

smote_xgb = sagemaker.estimator.Estimator(
    container,
    role=config.SAGEMAKER_IAM_ROLE,
    hyperparameters=hyperparams,
    instance_count=1,
    instance_type=instance_type_train,
    output_path=smote_output_location,
    sagemaker_session=session,
    base_job_name=training_job_name,
)

We are now ready to fit the model, which should take around 5 minutes to complete.

In [None]:
smote_xgb.fit({"train": s3_smote_train_data})

After fitting the model we can check its performance to compare it against the base XGBoost model. The deployment takes around 10 minutes.

In [None]:
endpoint_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb-smote"
print(
    f"You can go to SageMaker -> Inference -> Endpoints --> an endpoint with name {endpoint_name} to monitor the deployment status."
)

smote_predictor = smote_xgb.deploy(
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    instance_type=instance_type_inference,  # use a smaller instance for endpoint deployment
    serializer=CSVSerializer(),
)

In [None]:
smote_raw_preds = predict(smote_predictor, X_test)
smote_preds = np.where(
    smote_raw_preds > 0.5, 1, 0
)  # generate predicted labels using a cutoff threshold 0.5

In [None]:
result_xgboost_smote = [
    balanced_accuracy_score(y_test, smote_preds),
    cohen_kappa_score(y_test, smote_preds),
    f1_score(y_test, smote_preds),
]
result_xgboost_smote.append(roc_auc_score(y_test, smote_raw_preds))
result_xgboost_smote = pd.DataFrame(
    result_xgboost_smote,
    index=["Balanced accuracy", "Cohen's Kappa", "F1", "ROC_AUC"],
    columns=["XGB_SMOTE"],
)

result_rcf_xgboost_all = result_rcf_xgboost.join(result_xgboost_smote)
print(result_rcf_xgboost_all)

We can see that with the over-sampling technique SMOTE, XGBoost achieves better performance on [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), but not on [Cohen's Kappa](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) and [F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) scores. We present the confusion matrix to examine the results in detail and explain the reason below.

In [None]:
plot_confusion_matrix(y_test, smote_preds)

In [None]:
print(classification_report(y_test, smote_preds, target_names=["non-fraud", "fraud"]))

Due to the randomness of XGBoost your results may vary, but overall, you should see a large increase in non-fraud cases being classified as fraud (false positives). This happens because SMOTE has oversampled the fraud class so much that it has increased its overlap in feature space with the non-fraud cases.
Since Cohen's Kappa gives more weight to false positives than balanced accuracy does, the metric drops significantly, as do the precision and F1 score for fraud cases. However, we can restore the balance between the metrics by adjusting our classification threshold.

So far we've been using 0.5 as the threshold between labeling a point as fraudulent or not. We can try different thresholds to see if they affect the result of the classification. To evaluate, we use the balanced accuracy and Cohen's Kappa metrics.

In [None]:
for thres in np.linspace(0.1, 0.9, num=9):
    smote_thres_preds = np.where(smote_raw_preds > thres, 1, 0)
    print("Threshold: {:.1f}".format(thres))
    print("Balanced accuracy = {:.3f}".format(balanced_accuracy_score(y_test, smote_thres_preds)))
    print("Cohen's Kappa = {:.3f}\n".format(cohen_kappa_score(y_test, smote_thres_preds)))

We see that Cohen's Kappa keeps increasing along with the threshold, without a significant loss in balanced accuracy. This adds a useful knob to our model: We can keep a low threshold if we care more about not missing any fraudulent cases, or we can increase the threshold to try to minimize the number of false positives.
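
The trade-off behind this knob can be seen by counting errors at a few thresholds. The probabilities and labels below are made-up values chosen for illustration, and `confusion_at` is a hypothetical helper.

```python
def confusion_at(threshold, probs, labels):
    """Count false positives and false negatives when predicting 1 for p > threshold."""
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    return fp, fn


# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
probs = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.70, 0.30, 0.80, 0.95]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

errors = {thr: confusion_at(thr, probs, labels) for thr in (0.2, 0.5, 0.8)}
for thr, (fp, fn) in errors.items():
    print(f"threshold {thr:.1f}: false positives = {fp}, false negatives = {fn}")
```

Raising the threshold trades false positives for false negatives, so the right value depends on the relative cost of a missed fraud versus a false alarm.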

## Stage V: Train an optimal XGBoost model with hyper-parameter optimization (HPO)

In this stage, we demonstrate that model performance can be further improved by training an optimal XGBoost model with hyper-parameter optimization (HPO). To overcome the highly imbalanced data problem, we use XGBoost's weighting schema with the hyper-parameter `scale_pos_weight` to scale the weights of the positive vs. negative class examples.

First, we further split the training data into training and validation data using stratified sampling. The HPO process selects an optimal model that has the best performance on the validation data.

In [None]:
X_train_hpo, X_valid_hpo, y_train_hpo, y_valid_hpo = train_test_split(
    X_train, y_train, test_size=0.4, random_state=42, stratify=y_train
)

In [None]:
X_train_hpo.shape, X_train.shape

In the previous stages, we prepared the input data in the `libsvm` format to train an XGBoost model. In this stage, we demonstrate how to prepare the training and validation data in the `csv` format. The input formats supported by XGBoost are documented in the [SageMaker XGBoost documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html).

We combine the target variable (label) with the feature variables for both training and validation data, save them as `csv` files, and upload them to S3.
> Note: The algorithm assumes that the target variable is in the first column.
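
A minimal sketch of the expected layout, with the label in the first column and no header row, is shown below. The feature values are made up, and `rows_label_first` is a hypothetical helper; the notebook itself builds the same layout with `np.concatenate` and `np.savetxt`.

```python
import csv
import io


def rows_label_first(features, labels):
    """Prepend each label to its feature row, as the CSV input format expects."""
    return [[label, *row] for row, label in zip(features, labels)]


features = [[0.5, -1.2, 149.62], [1.1, 0.3, 2.69]]  # hypothetical feature rows
labels = [0, 1]

buffer = io.StringIO()
# No header row is written; each line is "label,feature_1,feature_2,...".
csv.writer(buffer, lineterminator="\n").writerows(rows_label_first(features, labels))
csv_text = buffer.getvalue()
print(csv_text)
```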

A standard approach for creating training and validation data is to make sure that the training and validation data are disjoint (i.e., they don't have any example in common). As a result, the model can be evaluated on the hold-out validation data, and thus the optimal model, which performs the best on the validation data, is selected by the HPO process. 

In our experiment, however, we found that using the entire training data (`X_train` and `y_train` instead of `X_train_hpo` and `y_train_hpo`) as the training set gave significantly better performance (i.e., the validation data are a subset of the training data). The reason is that, because the data are highly imbalanced, there are very few positive (fraudulent) examples from which the model can learn the underlying pattern. If we use the standard approach and split the training data into two disjoint training and validation subsets, there are even fewer training examples for the model to learn from. On the other hand, letting the training and validation data overlap allows the model to train for more iterations, which lets it learn the underlying pattern in more depth: the evaluation score on the validation data is then similar to that on the training data and tends to improve as the number of iterations increases, so early stopping is unlikely to be triggered.
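
The interaction with early stopping can be illustrated with a generic patience-based rule. This is a sketch of the general idea, not XGBoost's implementation; `stopping_round` is a hypothetical helper and both score sequences are made-up validation AUC values.

```python
def stopping_round(validation_scores, patience):
    """Return the round at which training stops, or None if it never stops early.

    Training stops once the score (higher is better) has not improved
    for `patience` consecutive rounds.
    """
    best = float("-inf")
    rounds_since_best = 0
    for round_index, score in enumerate(validation_scores):
        if score > best:
            best = score
            rounds_since_best = 0
        else:
            rounds_since_best += 1
            if rounds_since_best >= patience:
                return round_index
    return None


# Disjoint validation set: the score plateaus, then degrades slightly.
disjoint_scores = [0.90, 0.94, 0.95, 0.949, 0.948, 0.947, 0.946]
# Validation set contained in the training set: the score keeps improving.
overlapping_scores = [0.90, 0.94, 0.96, 0.97, 0.98, 0.985, 0.99]

stop_disjoint = stopping_round(disjoint_scores, patience=3)
stop_overlapping = stopping_round(overlapping_scores, patience=3)
print(stop_disjoint, stop_overlapping)
```

With the disjoint set, training stops after three rounds without improvement; with the overlapping set, the rule never fires and training runs for all rounds.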

In [None]:
X_train_combine = np.concatenate((np.reshape(y_train, (-1, 1)), X_train), axis=1)
# X_train_combine = np.concatenate((np.reshape(y_train_hpo, (-1, 1)), X_train_hpo), axis=1) # this is the standard approach to create disjoint training and validation data
X_valid_combine = np.concatenate((np.reshape(y_valid_hpo, (-1, 1)), X_valid_hpo), axis=1)

In [None]:
!mkdir -p xgboost_hpo_input
np.savetxt("xgboost_hpo_input/X_train_hpo.csv", X_train_combine, delimiter=",")
np.savetxt("xgboost_hpo_input/X_valid_hpo.csv", X_valid_combine, delimiter=",")
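
As a quick sanity check of the label-first layout, the sketch below builds a tiny combined array in memory and inspects the first line that `np.savetxt` writes. The toy arrays are hypothetical, and `fmt="%g"` is our own choice to keep the output short; the cell above relies on the default format.

```python
import io

import numpy as np

# Hypothetical toy data: 3 examples with 2 features each.
features = np.array([[0.5, 1.5], [2.0, 3.0], [4.0, 5.0]])
labels = np.array([0, 1, 0])

# Put the label in the first column, followed by the features.
combined = np.concatenate((labels.reshape(-1, 1), features), axis=1)

# Write to an in-memory buffer instead of a file; no header row is written.
buffer = io.StringIO()
np.savetxt(buffer, combined, delimiter=",", fmt="%g")
csv_text = buffer.getvalue()

first_line = csv_text.splitlines()[0]
print(first_line)
```

The first value on each line is the label, and the remaining values are the features in their original order.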

In [None]:
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

prefix = "fraud-classifier-hpo"

s3_train_data = S3Uploader.upload(
    "xgboost_hpo_input/X_train_hpo.csv", "s3://{}/{}/{}".format(bucket, prefix, "train")
)
print("Uploaded training data location: {}".format(s3_train_data))

s3_validation_data = S3Uploader.upload(
    "xgboost_hpo_input/X_valid_hpo.csv", "s3://{}/{}/{}".format(bucket, prefix, "validation")
)
print("Uploaded validation data location: {}".format(s3_validation_data))

output_location = "s3://{}/{}/output".format(bucket, prefix)
print("Training artifacts are uploaded to: {}".format(output_location))

Because we're training with the `csv` file format, we create TrainingInputs that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = TrainingInput(
    s3_data="s3://{}/{}/train/".format(bucket, prefix), content_type="csv"
)
s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv"
)

In [None]:
from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

# Because the data set is so highly skewed, we set scale_pos_weight conservatively,
# to sqrt(num_nonfraud / num_fraud).
# Another common recommendation is to set scale_pos_weight to num_nonfraud / num_fraud.
scale_pos_weight = sqrt(np.count_nonzero(y_train == 0) / np.count_nonzero(y_train))

xgb = sagemaker.estimator.Estimator(
    container,
    config.SAGEMAKER_IAM_ROLE,
    instance_count=1,
    instance_type=instance_type_train,
    output_path=output_location,
    sagemaker_session=session,
)

xgb.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=1000,
    early_stopping_rounds=10,
    silent=0,
    scale_pos_weight=scale_pos_weight,
)
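
To see how much more conservative the square-root weighting is, here is the arithmetic on hypothetical class counts. The counts below are round numbers chosen for illustration and are not the actual counts in `y_train`.

```python
from math import sqrt

# Hypothetical class counts, chosen only to make the arithmetic easy to follow.
num_nonfraud = 250_000
num_fraud = 400

# Full class ratio: each fraudulent example counts as much as 625 legitimate ones.
ratio_weight = num_nonfraud / num_fraud

# Square root of the ratio: a much milder up-weighting of the fraudulent class.
sqrt_weight = sqrt(ratio_weight)

print(ratio_weight, sqrt_weight)
```

With these counts the full ratio is 625 while its square root is 25, so the positive class is still emphasized but far less aggressively.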

In [None]:
# Define the hyper-parameters search ranges.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 0.5),
    "min_child_weight": ContinuousParameter(1, 10),
    "gamma": ContinuousParameter(2, 7),
    # "alpha": ContinuousParameter(0, 2),
    "max_depth": IntegerParameter(1, 10),
    "subsample": ContinuousParameter(0.6, 1),
}
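
To give a feel for what this search space contains, the sketch below draws a few random candidates from a plain-Python mirror of the ranges. This is only an illustration: `search_space` and `sample_candidate` are hypothetical names, and uniform random sampling is not the strategy the tuner uses by default (its default strategy is Bayesian).

```python
import random

# Hypothetical plain-Python mirror of the ranges above: name -> (low, high, kind).
search_space = {
    "eta": (0.0, 0.5, "continuous"),
    "min_child_weight": (1.0, 10.0, "continuous"),
    "gamma": (2.0, 7.0, "continuous"),
    "max_depth": (1, 10, "integer"),
    "subsample": (0.6, 1.0, "continuous"),
}


def sample_candidate(space, rng):
    """Draw one hyperparameter combination uniformly at random from `space`."""
    candidate = {}
    for name, (low, high, kind) in space.items():
        if kind == "integer":
            candidate[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            candidate[name] = rng.uniform(low, high)
    return candidate


rng = random.Random(0)  # fixed seed so the draws are repeatable
candidates = [sample_candidate(search_space, rng) for _ in range(3)]
for candidate in candidates:
    print(candidate)
```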

Define objective metric name and objective type.

In [None]:
objective_metric_name = "validation:auc"
objective_type = "Maximize"

In [None]:
tuning_job_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-tuning"
print(
    f"You can go to SageMaker -> Training -> Hyperparameter tuning jobs -> a job whose name starts with {tuning_job_name} to monitor the HPO tuning status and details.\n"
    "Note. The following cells cannot run successfully until the tuning job completes. This step may take around 15 min."
)

tuner = HyperparameterTuner(
    xgb,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=30,
    max_parallel_jobs=3,
    objective_type=objective_type,
    base_tuning_job_name=tuning_job_name,
)

tuner.fit({"train": s3_input_train, "validation": s3_input_validation})

Check the Status of HPO tuning jobs

In [None]:
boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuner.latest_tuning_job.job_name
)["HyperParameterTuningJobStatus"]
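
If you prefer to poll the status in a loop rather than re-running the cell, a minimal sketch could look like the following. The helper `wait_for_terminal_status` is hypothetical, and we assume that `Completed`, `Failed`, and `Stopped` are the terminal states of interest. The status lookup is passed in as a function so that the sketch can be exercised here with a fake sequence of statuses instead of a real AWS call.

```python
import time


def wait_for_terminal_status(get_status, poll_seconds=30, max_polls=120):
    """Call `get_status` until it returns a terminal status or polls run out.

    `get_status` is any zero-argument function that returns the current status
    string. Returns the last status that was observed.
    """
    terminal_statuses = {"Completed", "Failed", "Stopped"}  # assumed terminal states
    status = get_status()
    polls = 1
    while status not in terminal_statuses and polls < max_polls:
        time.sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


# Fake status source for demonstration; no AWS call is made here.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, `get_status` could wrap the `describe_hyper_parameter_tuning_job` call shown above and return its `HyperParameterTuningJobStatus` field.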

Retrieve the tuning job name

In [None]:
sm_client = boto3.Session().client("sagemaker")

tuning_job_name = tuner.latest_tuning_job.name
tuning_job_name

In [None]:
tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name
)

status = tuning_job_result["HyperParameterTuningJobStatus"]
if status != "Completed":
    print("Reminder: the tuning job has not been completed.")

job_count = tuning_job_result["TrainingJobStatusCounters"]["Completed"]
print("%d training jobs have completed" % job_count)

is_maximize = (
    tuning_job_result["HyperParameterTuningJobConfig"]["HyperParameterTuningJobObjective"]["Type"]
    == "Maximize"
)
objective_name = tuning_job_result["HyperParameterTuningJobConfig"][
    "HyperParameterTuningJobObjective"
]["MetricName"]

In [None]:
import pandas as pd

tuner_analytics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner_analytics.dataframe()

if len(full_df) > 0:
    df = full_df[full_df["FinalObjectiveValue"] > -float("inf")]
    if len(df) > 0:
        df = df.sort_values("FinalObjectiveValue", ascending=False)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest": min(df["FinalObjectiveValue"]), "highest": max(df["FinalObjectiveValue"])})
        pd.set_option("display.max_colwidth", None)  # Don't truncate TrainingJobName
    else:
        print("No training jobs have reported valid results yet.")

df

### Deploy endpoint of the best tuning job

In [None]:
from sagemaker.serializers import CSVSerializer

endpoint_name = f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb-tuning"
print(
    f"You can go to SageMaker -> Inference -> Endpoints --> an endpoint with name {endpoint_name} to monitor the deployment status."
)

predictor_hpo = tuner.deploy(
    initial_instance_count=1,
    instance_type=instance_type_train,
    serializer=CSVSerializer(),
    endpoint_name=endpoint_name,
)  # Note: because csv-format input takes more memory than libsvm-format input,
# we use the larger instance type `instance_type_train` for the endpoint to avoid out-of-memory errors.

In [None]:
raw_preds_hpo = predict(predictor_hpo, X_test)
preds_hpo = np.where(raw_preds_hpo > 0.5, 1, 0)

In [None]:
result_xgboost_hpo = [
    balanced_accuracy_score(y_test, preds_hpo),
    cohen_kappa_score(y_test, preds_hpo),
    f1_score(y_test, preds_hpo),
]
result_xgboost_hpo.append(roc_auc_score(y_test, raw_preds_hpo))
result_xgboost_hpo = pd.DataFrame(
    result_xgboost_hpo,
    index=["Balanced accuracy", "Cohen's Kappa", "F1", "ROC_AUC"],
    columns=["XGB_HPO"],
)

## Stage VI: Evaluate and compare all model performance on the same test data

We present the evaluation results from four models: Random Cut Forest (RCF), XGBoost (XGB), XGBoost with the over-sampling method SMOTE, and XGBoost with HPO. The evaluations are conducted on the same test data, which were created at the beginning of the notebook. The evaluation metrics are [Balanced Accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.balanced_accuracy_score.html), [Cohen's Kappa score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html), [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html), and [ROC_AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). For all of the metrics, a larger value indicates better predictive performance.
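
To make the three threshold-based metrics concrete, the following is a minimal plain-Python sketch that computes them from confusion-matrix counts for binary labels. The helper names and the toy labels are hypothetical; the notebook itself uses the scikit-learn functions linked above. ROC_AUC is left out because it is computed from the raw scores rather than from thresholded predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(tp, tn, fp, fn):
    # Mean of the recall on the positive class and the recall on the negative class.
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1(tp, tn, fp, fn):
    # Harmonic mean of precision and recall, written in terms of counts.
    return 2 * tp / (2 * tp + fp + fn)


def cohen_kappa(tp, tn, fp, fn):
    # Agreement between predictions and labels, corrected for chance agreement.
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical toy example: 2 fraudulent and 8 legitimate transactions.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts[0] + counts[1]) / len(y_true)
print(accuracy, balanced_accuracy(*counts), f1(*counts), cohen_kappa(*counts))
```

On this toy example plain accuracy is 0.8, while the other three metrics are noticeably lower, which is why they are preferred for imbalanced data.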

In [None]:
result_xgboost_all_hpo = result_rcf_xgboost_all.join(result_xgboost_hpo)
print(result_xgboost_all_hpo)

We can see that XGBoost with HPO achieves even better performance than XGBoost with the SMOTE method. In particular, the Cohen's Kappa score and the F1 score are both over 0.8, indicating strong model performance.

## Clean up

We leave the unsupervised and base XGBoost endpoints running at the end of this notebook so we can handle incoming event streams using the Lambda function. The solution automatically cleans up the endpoints when it is deleted; however, make sure the prediction endpoints are deleted when you're done. You can do that on the Endpoints page of the Amazon SageMaker console, or you can run `predictor_name.delete_endpoint()` here.

In [None]:
# Uncomment to clean up endpoints
# rcf_predictor.delete_model()
# rcf_predictor.delete_endpoint()
# predictor.delete_model()
# predictor.delete_endpoint()
# smote_predictor.delete_model()
# smote_predictor.delete_endpoint()
# predictor_hpo.delete_model()
# predictor_hpo.delete_endpoint()
# sm_client = boto3.client("sagemaker", region_name=config.AWS_REGION)
# waiter = sm_client.get_waiter("endpoint_deleted")
# waiter.wait(EndpointName=f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb-smote")
# waiter.wait(EndpointName=f"{config.SOLUTION_PREFIX}-{unique_hash}-xgb")
# waiter.wait(EndpointName=f"{config.SOLUTION_PREFIX}-{unique_hash}-rcf")


## Data Acknowledgements

The dataset used to demonstrate the fraud detection solution was collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.
We cite the following works:
* Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015
* Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon
* Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE
* Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi)
* Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier
* Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.

![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-east-2/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/us-west-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ca-central-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/sa-east-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-2/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-west-3/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-central-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/eu-north-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-southeast-2/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-northeast-2/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://prod.us-west-2.tcx-beacon.docs.aws.dev/sagemaker-nb/ap-south-1/introduction_to_applying_machine_learning|credit_card_fraud_detector|credit_card_fraud_detector.ipynb)
