# Training and Deploying the Fraud Detection Model

In this notebook, we take the outputs of the Processing Job from the previous step and use them to train and deploy an XGBoost model. Our historical transaction dataset initially consists of fields such as timestamp, card number, and transaction amount. We enriched each transaction with features describing that card number's recent history, including:

- `num_trans_last_10m`
- `num_trans_last_1w`
- `avg_amt_last_10m`
- `avg_amt_last_1w`

Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 
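
The ratio columns themselves are produced by the earlier Processing Job, and their exact definitions are not given in this notebook. The sample rows displayed further down are consistent with the formulas below, so this sketch should be read as an inference from that output rather than the job's actual code; the function name `ratio_features` is hypothetical.

```python
def ratio_features(amount, num_trans_last_10m, avg_amt_last_10m,
                   num_trans_last_1w, avg_amt_last_1w):
    """Normalise a card's short-term activity by its own weekly baseline.

    Assumed definitions (inferred from sample rows, not from the Processing Job source):
      amt_ratio1  = avg_amt_last_10m / avg_amt_last_1w
      amt_ratio2  = amount / avg_amt_last_1w
      count_ratio = num_trans_last_10m / num_trans_last_1w
    """
    return {
        "amt_ratio1": avg_amt_last_10m / avg_amt_last_1w,
        "amt_ratio2": amount / avg_amt_last_1w,
        "count_ratio": num_trans_last_10m / num_trans_last_1w,
    }


# Raw values of one fraudulent transaction displayed later in this notebook
features = ratio_features(amount=4962.6, num_trans_last_10m=8, avg_amt_last_10m=1510.84375,
                          num_trans_last_1w=31, avg_amt_last_1w=1118.065161)
print({name: round(value, 6) for name, value in features.items()})
```

Dividing by the card's own weekly average puts a big spender and a small spender on the same scale, which is the point of using ratios instead of raw amounts.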

### Imports 

In [21]:
from sklearn.model_selection import train_test_split
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
from sagemaker import image_uris
import pandas as pd
import numpy as np
import sagemaker
import boto3
import io

### Essentials 

In [22]:
LOCAL_DIR = './data'
BUCKET = sagemaker.Session().default_bucket()
PREFIX = 'training'

sagemaker_role = sagemaker.get_execution_role()
s3_client = boto3.Session().client('s3')

First, let's load the results of the SageMaker Processing Job run in the previous step into a Pandas dataframe.

In [23]:
df = pd.read_csv(f'{LOCAL_DIR}/aggregated/processing_output.csv')
#df.dropna(inplace=True)
df['cc_num'] = df['cc_num'].astype(np.int64)
df['fraud_label'] = df['fraud_label'].astype(np.int64)
df.head()
len(df)

5400000

### Split DataFrame into Train & Test Sets

The artificially generated dataset contains transactions from `2020-01-01` to `2020-06-01`. We create training and validation sets from transactions between `2020-01-15` and `2020-05-15`, discarding the first two weeks so that our aggregated features have built up sufficient history for each card, and leaving the last two weeks as a holdout test set.

In [24]:
training_start = '2020-01-15'
training_end = '2020-05-15'

training_df = df[(df.datetime > training_start) & (df.datetime < training_end)]
test_df = df[df.datetime >= training_end]

test_df.to_csv(f'{LOCAL_DIR}/test.csv', index=False)
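
The filters above compare the `datetime` column against plain date strings. Assuming the column holds ISO 8601 strings such as `2020-05-15T02:15:48.000Z` (the format visible in the test set output later on), lexicographic order matches chronological order, so the comparison works without parsing. A minimal illustration with plain Python strings; the first three timestamps are made up for the example:

```python
training_start = '2020-01-15'
training_end = '2020-05-15'

# Illustrative timestamps in the same ISO 8601 layout as the dataset
timestamps = [
    '2020-01-14T23:59:59.000Z',  # before the training window
    '2020-01-15T00:00:00.000Z',  # first instant of the training window
    '2020-05-14T23:59:59.000Z',  # last second of the training window
    '2020-05-15T02:15:48.000Z',  # falls into the hold-out test set
]

# Same comparisons as the DataFrame filters, applied to plain strings
training = [t for t in timestamps if training_start < t < training_end]
test = [t for t in timestamps if t >= training_end]

print(training)
print(test)
```

Because a longer string sorts after its own prefix, midnight on `2020-01-15` already satisfies the strict `>` comparison, so the training window effectively includes its start date.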

Although we now have lots of information about each transaction in our training dataset, we don't want to pass everything as features to the XGBoost algorithm for training because some elements are not useful for detecting fraud or creating a performant model:
- A transaction ID and timestamp are unique to the transaction and never seen again.
- A card number, if included in the feature set at all, should be a categorical variable. But we don't want our model to learn that specific card numbers are associated with fraud as this might lead to our system blocking genuine behaviour. Instead we should only have the model learn to detect shifting patterns in a card's spending history. 
- Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 

Given all of the above, we drop all columns except the label, the transaction amount, and the normalised ratio features from our training dataset.

In [25]:
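# Reassign rather than drop in place: training_df is a slice of df,
# and in-place edits on a slice raise pandas' SettingWithCopyWarning.
training_df = training_df.drop(columns=['tid', 'datetime', 'cc_num', 'num_trans_last_10m', 'avg_amt_last_10m',
                                        'num_trans_last_1w', 'avg_amt_last_1w'])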

The [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) requires the label to be the first column in the training data:

In [26]:
training_df = training_df[['fraud_label', 'amount', 'amt_ratio1','amt_ratio2','count_ratio']]
training_df.head()

Unnamed: 0,fraud_label,amount,amt_ratio1,amt_ratio2,count_ratio
46,0,74.26,0.273769,0.273769,0.038462
47,0,55.88,0.205358,0.205358,0.038462
48,0,1711.35,5.257197,5.257197,0.04
49,0,86.5,0.255466,0.255466,0.041667
50,0,27.94,0.175429,0.085661,0.08


In [27]:
train, val = train_test_split(training_df, test_size=0.3)
train.to_csv(f'{LOCAL_DIR}/train.csv', header=False, index=False)
val.to_csv(f'{LOCAL_DIR}/val.csv', header=False, index=False)

In [28]:
!aws s3 cp {LOCAL_DIR}/train.csv s3://{BUCKET}/{PREFIX}/
!aws s3 cp {LOCAL_DIR}/val.csv s3://{BUCKET}/{PREFIX}/

upload: data/train.csv to s3://sagemaker-us-east-2-105242341581/training/train.csv
upload: data/val.csv to s3://sagemaker-us-east-2-105242341581/training/val.csv


In [29]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"100"}

output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)

# look up the URI of the built-in XGBoost container image for the current region.
# the last argument is the container version; change it to suit your needs.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", sagemaker.Session().boto_region_name, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "csv"
train_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'train.csv'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'val.csv'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

2020-12-07 22:19:43 Starting - Starting the training job...
2020-12-07 22:19:46 Starting - Launching requested ML instances......
2020-12-07 22:20:52 Starting - Preparing the instances for training......
2020-12-07 22:21:56 Downloading - Downloading input data...
2020-12-07 22:22:34 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','

Ideally we would perform hyperparameter tuning before deployment, but for the purposes of this example we will deploy the model produced by the Training Job directly to a SageMaker hosted endpoint.

In [31]:
predictor = estimator.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    serializer=sagemaker.serializers.CSVSerializer(), wait=True)

In [34]:
endpoint_name=predictor.endpoint_name
#Store the endpoint name for later cleanup 
%store endpoint_name
endpoint_name

Stored 'endpoint_name' (str)


'sagemaker-xgboost-2020-12-07-23-25-29-104'

Now to check that our endpoint is working, let's call it directly with a record from our test hold-out set. 

In [35]:
payload_df = test_df.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)
payload = payload_df.head(1).to_csv(index=False, header=False).strip()
payload

'15.61,0.014041421952583129,0.014041421952583129,0.038461538461538464'
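
The payload carries no header, so the endpoint relies purely on column position: the values need to appear in the same order as the feature columns of the training CSV (`amount`, `amt_ratio1`, `amt_ratio2`, `count_ratio`). A small helper can make that ordering explicit. It is hypothetical and not part of the SageMaker SDK, and the record uses rounded values for readability:

```python
# Feature order used for training, i.e. the training columns without the label
FEATURE_ORDER = ['amount', 'amt_ratio1', 'amt_ratio2', 'count_ratio']


def to_csv_payload(record, feature_order=FEATURE_ORDER):
    """Serialise one transaction (a dict-like record) as a headerless CSV line."""
    missing = [name for name in feature_order if name not in record]
    if missing:
        raise KeyError(f'missing features: {missing}')
    # Emit values strictly in training order, ignoring any extra keys
    return ','.join(str(record[name]) for name in feature_order)


# Keys deliberately out of order, plus one column the model never saw
record = {'count_ratio': 0.038462, 'amount': 15.61, 'amt_ratio2': 0.014041,
          'amt_ratio1': 0.014041, 'cc_num': 4001487383951324}
payload = to_csv_payload(record)
print(payload)
```

Building the payload from a fixed list guards against a silent mismatch if the columns of the source DataFrame are ever reordered.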

In [36]:
predictor.predict(payload)

b'0.0002712476998567581'

In [37]:
len(test_df)

603208

In [38]:
test_df.head()

Unnamed: 0,tid,datetime,cc_num,amount,fraud_label,num_trans_last_10m,avg_amt_last_10m,num_trans_last_1w,avg_amt_last_1w,amt_ratio1,amt_ratio2,count_ratio
467,b66d85a66671654ee1ce894f472ad0ed,2020-05-15T02:15:48.000Z,4001487383951324,15.61,0,1,15.61,26,1111.710769,0.014041,0.014041,0.038462
468,ea24ea401ee3e8359f8b391ca57bfaf8,2020-05-16T13:11:43.000Z,4001487383951324,940.21,0,1,940.21,23,1225.213043,0.767385,0.767385,0.043478
469,a8830365a7b881167c8e5c9daf717f89,2020-05-16T20:57:35.000Z,4001487383951324,865.45,0,1,865.45,23,1259.573913,0.687097,0.687097,0.043478
470,c6bbd001c206c1214e5aa2fdd12a7e35,2020-05-17T13:51:00.000Z,4001487383951324,89.21,0,1,89.21,23,1259.2,0.070847,0.070847,0.043478
471,2b2bb2583079a6b8f737b40140d7754a,2020-05-17T16:41:19.000Z,4001487383951324,25.71,0,1,25.71,24,1207.804583,0.021287,0.021287,0.041667


In [45]:
frauds=test_df[test_df.fraud_label==1.0]

In [50]:
payload_df = frauds.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)
payload_df.head()
payload = payload_df.head(1).to_csv(index=False, header=False).strip()
payload

'6336.79,10.03688920567039,10.03688920567039,0.03448275862068965'

In [53]:
payload_df.iloc[4]

amount         343.700000
amt_ratio1       2.639644
amt_ratio2       0.568465
count_ratio      0.151515
Name: 33185, dtype: float64

In [65]:
payload = payload_df.iloc[4].to_csv(index=False,header=False).strip().replace('\n', ',')
float(predictor.predict(payload).decode('utf-8'))

0.8377783298492432

In [63]:
print(sagemaker.__version__)

2.16.4.dev0


In [89]:
preds = []
acts = []

frauds=test_df[test_df.fraud_label==1.0]
payload_df = frauds.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)

for t in range(200):
    acts.append(1)
    payload = payload_df.iloc[t].to_csv(index=False,header=False).strip().replace('\n', ',')
    amt_ratio1 = float(payload.split(',')[1])
    amt_ratio2 = float(payload.split(',')[2])
    count_ratio = float(payload.split(',')[3])
    is_fraud = float(predictor.predict(payload).decode('utf-8'))
    if is_fraud > 0.50:
        preds.append(1)
        print(f'FRAUD,     count ratio: {count_ratio:.3f}, amt ratio1: {amt_ratio1:.3f}, amt ratio2: {amt_ratio2:.3f}')
    else:
        preds.append(0)
        print(f'NOT FRAUD, count ratio: {count_ratio:.3f}, amt ratio1: {amt_ratio1:.3f}, amt ratio2: {amt_ratio2:.3f}')

NOT FRAUD, count ratio: 0.034, amt ratio1: 10.037, amt ratio2: 10.037
NOT FRAUD, count ratio: 0.067, amt ratio1: 5.664, amt ratio2: 1.446
NOT FRAUD, count ratio: 0.097, amt ratio1: 3.903, amt ratio2: 0.003
NOT FRAUD, count ratio: 0.125, amt ratio1: 3.115, amt ratio2: 0.604
FRAUD,     count ratio: 0.152, amt ratio1: 2.640, amt ratio2: 0.568
FRAUD,     count ratio: 0.176, amt ratio1: 2.282, amt ratio2: 0.161
FRAUD,     count ratio: 0.200, amt ratio1: 2.025, amt ratio2: 0.131
NOT FRAUD, count ratio: 0.036, amt ratio1: 1.488, amt ratio2: 1.488
NOT FRAUD, count ratio: 0.069, amt ratio1: 0.822, amt ratio2: 0.109
NOT FRAUD, count ratio: 0.100, amt ratio1: 0.737, amt ratio2: 0.540
FRAUD,     count ratio: 0.129, amt ratio1: 0.614, amt ratio2: 0.187
FRAUD,     count ratio: 0.156, amt ratio1: 0.525, amt ratio2: 0.094
FRAUD,     count ratio: 0.182, amt ratio1: 0.475, amt ratio2: 0.157
FRAUD,     count ratio: 0.206, amt ratio1: 0.576, amt ratio2: 1.200
FRAUD,     count ratio: 0.229, amt ratio1: 1.4

NOT FRAUD, count ratio: 0.120, amt ratio1: 1.854, amt ratio2: 3.365
FRAUD,     count ratio: 0.154, amt ratio1: 1.465, amt ratio2: 0.095
FRAUD,     count ratio: 0.185, amt ratio1: 1.252, amt ratio2: 0.227
FRAUD,     count ratio: 0.214, amt ratio1: 1.124, amt ratio2: 0.327
FRAUD,     count ratio: 0.241, amt ratio1: 1.038, amt ratio2: 0.373
FRAUD,     count ratio: 0.267, amt ratio1: 0.986, amt ratio2: 0.499
FRAUD,     count ratio: 0.290, amt ratio1: 0.936, amt ratio2: 0.366
FRAUD,     count ratio: 0.281, amt ratio1: 0.910, amt ratio2: 0.004
NOT FRAUD, count ratio: 0.034, amt ratio1: 0.128, amt ratio2: 0.128
NOT FRAUD, count ratio: 0.067, amt ratio1: 4.467, amt ratio2: 8.839
NOT FRAUD, count ratio: 0.097, amt ratio1: 3.548, amt ratio2: 2.014
NOT FRAUD, count ratio: 0.125, amt ratio1: 2.871, amt ratio2: 0.757
FRAUD,     count ratio: 0.152, amt ratio1: 3.007, amt ratio2: 4.975
FRAUD,     count ratio: 0.176, amt ratio1: 2.583, amt ratio2: 0.015
FRAUD,     count ratio: 0.200, amt ratio1: 2.279

In [88]:
from sklearn.metrics import classification_report
target_names = ['NOT FRAUD', 'FRAUD']
print(classification_report(acts, preds, target_names=target_names))

              precision    recall  f1-score   support

   NOT FRAUD       0.00      0.00      0.00         0
       FRAUD       1.00      0.45      0.62       200

    accuracy                           0.45       200
   macro avg       0.50      0.22      0.31       200
weighted avg       1.00      0.45      0.62       200
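
Because every transaction in this sample is an actual fraud, the report needs careful reading: there can be no false positives, so FRAUD precision is 1.00 by construction, and the only informative number is recall. The arithmetic can be sketched as below, assuming roughly 90 of the 200 frauds were flagged, which is what a recall of 0.45 implies (the exact count is not shown). The helper name is hypothetical.

```python
def positive_class_metrics(acts, preds):
    """Precision, recall and F1 for the positive (fraud) class."""
    tp = sum(1 for a, p in zip(acts, preds) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(acts, preds) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(acts, preds) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Assumed counts: 200 actual frauds, 90 of them flagged (order does not matter here)
acts = [1] * 200
preds = [1] * 90 + [0] * 110

precision, recall, f1 = positive_class_metrics(acts, preds)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

In other words, at a 0.50 threshold the model misses a little over half of the frauds in this sample.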



In [84]:
preds = []
acts = []

payload_df = test_df.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)

for t in range(2000):
    act_label = test_df['fraud_label'].iloc[t]
    acts.append(act_label)
    payload = payload_df.iloc[t].to_csv(index=False,header=False).strip().replace('\n', ',')
    amt_ratio1 = float(payload.split(',')[1])
    amt_ratio2 = float(payload.split(',')[2])
    count_ratio = float(payload.split(',')[3])
    is_fraud = float(predictor.predict(payload).decode('utf-8'))
    if is_fraud > 0.25:
        preds.append(1)
        print(act_label)
        print(f'FRAUD,     count ratio: {count_ratio:.3f}, amt ratio1: {amt_ratio1:.3f}, amt ratio2: {amt_ratio2:.3f}')
    else:
        preds.append(0)
#        print(f'NOT FRAUD, count ratio: {count_ratio:.3f}, amt ratio1: {amt_ratio1:.3f}, amt ratio2: {amt_ratio2:.3f}')

from sklearn.metrics import classification_report
target_names = ['NOT FRAUD', 'FRAUD']
#print(acts)
print(classification_report(acts, preds, target_names=target_names))

0
FRAUD,     count ratio: 0.103, amt ratio1: 0.109, amt ratio2: 0.297
0
FRAUD,     count ratio: 0.097, amt ratio1: 0.190, amt ratio2: 0.155
              precision    recall  f1-score   support

   NOT FRAUD       1.00      1.00      1.00      2000
       FRAUD       0.00      0.00      0.00         0

    accuracy                           1.00      2000
   macro avg       0.50      0.50      0.50      2000
weighted avg       1.00      1.00      1.00      2000
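
This second report is dominated by the sample: the first 2,000 rows of `test_df` contain no frauds at all (FRAUD support is 0), so accuracy says little about detection. Assuming the two flagged transactions printed above are the only positive predictions, the numbers work out as follows:

```python
total = 2000          # transactions scored, all genuine (FRAUD support is 0)
false_positives = 2   # assumed: the two genuine transactions flagged above

true_negatives = total - false_positives
accuracy = true_negatives / total
false_positive_rate = false_positives / total

print(f'accuracy={accuracy:.4f}')
print(f'false positive rate={false_positive_rate:.4f}')
```

The report rounds this accuracy to 1.00, yet the sample cannot measure fraud recall at all; a sample containing both classes would be needed for that.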



In [40]:
len(test_df[test_df.datetime<'2020-05-15T00:30:00'])

716

In [41]:
fraud_cc = frauds.sort_values(by='datetime').head(1).cc_num.values[0]

In [42]:
%store fraud_cc

Stored 'fraud_cc' (int64)


In [43]:
frauds.sort_values(by='datetime').head()

Unnamed: 0,tid,datetime,cc_num,amount,fraud_label,num_trans_last_10m,avg_amt_last_10m,num_trans_last_1w,avg_amt_last_1w,amt_ratio1,amt_ratio2,count_ratio
5241815,4e62503e378ab1882f4a2c534ebd00aa,2020-05-15T00:00:37.000Z,4594494719397878,4962.6,1,8,1510.84375,31,1118.065161,1.351302,4.438561,0.258065
5241816,6eae20058a6992e0536b8137a403fddb,2020-05-15T00:02:09.000Z,4594494719397878,74.18,1,8,1513.235,32,1085.44375,1.394116,0.068341,0.25
5241817,7570a6dbcdb097fe30002dd43a87a238,2020-05-15T00:03:43.000Z,4594494719397878,75.09,1,8,1504.7625,33,1054.82697,1.426549,0.071187,0.242424
3805567,fb15ecfb20e8be2f4789c9c36c5b76a4,2020-05-15T00:20:50.000Z,4834624440192608,71.68,1,1,71.68,19,640.638421,0.111888,0.111888,0.052632
3805568,ec2aab2411fb524524712eaf72f3811b,2020-05-15T00:21:46.000Z,4834624440192608,85.31,1,2,78.495,20,612.872,0.128077,0.139197,0.1
