# Training and Deploying the Fraud Detection Model

In this notebook, we take the output of the Processing Job from the previous step and use it to train and deploy an XGBoost model. Our historical transaction dataset initially consists of fields such as timestamp, card number, and transaction amount. We enriched each transaction with features describing that card number's recent history, including:

- `num_trans_last_10m`
- `num_trans_last_1w`
- `avg_amt_last_10m`
- `avg_amt_last_1w`

Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 
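
This notebook does not define the ratio features. As a minimal sketch, assuming each ratio compares a card's last 10 minutes against its last week, they could be derived from the four aggregates above as follows. The names `amt_ratio1`, `amt_ratio2`, and `count_ratio` match the columns queried later, but the formulas, the helper `ratio_features`, and the input values are assumptions made for illustration.

```python
def ratio_features(amount, num_trans_last_10m, num_trans_last_1w,
                   avg_amt_last_10m, avg_amt_last_1w):
    """Return ratio features for one transaction.

    Assumed definitions (not confirmed by this notebook):
      amt_ratio1  = 10-minute average amount / 1-week average amount
      amt_ratio2  = this transaction's amount / 1-week average amount
      count_ratio = 10-minute transaction count / 1-week transaction count

    Also assumes the weekly aggregates include the current transaction,
    so the denominators are never zero.
    """
    return {
        "amt_ratio1": avg_amt_last_10m / avg_amt_last_1w,
        "amt_ratio2": amount / avg_amt_last_1w,
        "count_ratio": num_trans_last_10m / num_trans_last_1w,
    }


# Hypothetical card: 2 transactions in the last 10 minutes, 24 in the last week.
features = ratio_features(
    amount=4.56,
    num_trans_last_10m=2,
    num_trans_last_1w=24,
    avg_amt_last_10m=31.35,
    avg_amt_last_1w=650.0,
)
print(features)
```

Because each ratio divides by the card's own weekly baseline, a card that usually spends small amounts and one that usually spends large amounts both produce values near 1.0 under normal behavior.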

### Imports 

In [8]:
from sklearn.model_selection import train_test_split
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
from sagemaker import get_execution_role
from sagemaker import image_uris
import pandas as pd
import numpy as np
import sagemaker
import logging
import boto3
import time
import io
import re
import os
import sys

In [9]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

### Essentials 

In [11]:
LOCAL_DIR = './data'
BUCKET = 'chime-fs-demo'
PREFIX = 'training'

sagemaker_role = sagemaker.get_execution_role()
s3_client = boto3.Session().client('s3')
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
train_feature_group_name = 'cc_train_fg'

In [12]:
boto_session = boto3.Session(region_name=region)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=region)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=region)

feature_store_session = sagemaker.Session(boto_session=boto_session, 
                                          sagemaker_client=sagemaker_client, 
                                          sagemaker_featurestore_runtime_client=featurestore_runtime)

In [13]:
train_fg = FeatureGroup(name=train_feature_group_name, sagemaker_session=feature_store_session)  

In [15]:
train_query = train_fg.athena_query()
train_table = train_query.table_name

In [21]:
query_string = f'SELECT * FROM "{train_table}"'
%store query_string
query_string

Stored 'query_string' (str)


'SELECT * FROM "cc_train_fg_1680709766"'

In [22]:
query_results= 'sagemaker-fs-demo'
output_location = f's3://{BUCKET}/{query_results}/query_results/'
print(f'Athena query output location: \n{output_location}')

Athena query output location: 
s3://chime-fs-demo/sagemaker-fs-demo/query_results/


In [23]:
train_query.run(query_string=query_string, output_location=output_location)
train_query.wait()
query_df = train_query.as_dataframe()
query_df.head(5)

Unnamed: 0,write_time,api_invocation_time,is_deleted,tid,datetime,fraud_label,amount,amt_ratio1,amt_ratio2,count_ratio
0,2023-04-05 15:50:02.164 UTC,2023-04-05 15:50:02.164 UTC,False,01d85107fb32540768c2403c5d1da0c7,2020-03-01T00:05:50.000Z,0,804.4,1.705822,1.705822,0.034483
1,2023-04-05 15:50:02.164 UTC,2023-04-05 15:50:02.164 UTC,False,dc090421a69f9a7aaed7ff4282509a9d,2020-03-01T00:11:12.000Z,0,58.14,0.085644,0.085644,0.043478
2,2023-04-05 15:50:02.164 UTC,2023-04-05 15:50:02.164 UTC,False,4b260f55d100b9ce29416db4a9dc6aff,2020-03-01T00:11:25.000Z,0,4.56,0.048174,0.007007,0.083333
3,2023-04-05 15:50:02.164 UTC,2023-04-05 15:50:02.164 UTC,False,f81ed0282ddf81d523c2ac3fbd498d79,2020-03-01T00:13:27.000Z,0,295.96,0.821689,0.821689,0.037037
4,2023-04-05 15:50:02.164 UTC,2023-04-05 15:50:02.164 UTC,False,a0f38e7ca03665ca7a0b85218bcf9c92,2020-03-01T00:14:04.000Z,0,486.5,0.602198,0.602198,0.041667


In [20]:
query_df.columns

Index(['write_time', 'api_invocation_time', 'is_deleted', 'tid', 'datetime',
       'fraud_label', 'amount', 'amt_ratio1', 'amt_ratio2', 'count_ratio'],
      dtype='object')

First, let's load the results of the SageMaker Processing Job run in the previous step into a pandas DataFrame.

In [3]:
# df = pd.read_csv(f'{LOCAL_DIR}/aggregated/processing_output.csv')
# #df.dropna(inplace=True)
# df['cc_num'] = df['cc_num'].astype(np.int64)
# df['fraud_label'] = df['fraud_label'].astype(np.int64)
# df.head()
# len(df)

5400000

### Split DataFrame into Train & Test Sets

The artificially generated dataset contains transactions from `2020-01-01` to `2020-06-01`. We will create a training and validation set from transactions between `2020-01-15` and `2020-05-15`. We discard the first two weeks so that the aggregated features have built up sufficient history for each card, and we keep the last two weeks as a holdout test set.

In [33]:
training_start = '2020-01-15'
training_end = '2020-05-15'

training_df = query_df[(query_df.datetime > training_start) & (query_df.datetime < training_end)]
test_df = query_df[query_df.datetime >= training_end]

test_df.to_csv(f'{LOCAL_DIR}/test.csv', index=False)
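
The filters above compare the `datetime` column against plain date strings. As a small sketch, assuming that column holds ISO 8601 strings like those in the query output, the comparison is lexicographic and the window boundaries fall as shown below. The timestamps and the helper `assign_split` are hypothetical and only illustrate the boundary behavior.

```python
training_start = "2020-01-15"
training_end = "2020-05-15"

# Hypothetical timestamps in the same ISO 8601 layout as the `datetime` column.
timestamps = [
    "2020-01-14T23:59:59.000Z",  # warm-up period
    "2020-01-15T00:00:00.000Z",  # first instant of the training window
    "2020-05-14T23:59:59.000Z",  # last second of the training window
    "2020-05-15T00:00:00.000Z",  # first instant of the holdout window
]


def assign_split(ts):
    """Mirror the notebook's filters using plain string comparison."""
    if training_start < ts < training_end:
        return "training"
    if ts >= training_end:
        return "test"
    return "discarded"


splits = [assign_split(ts) for ts in timestamps]
for ts, split in zip(timestamps, splits):
    print(ts, "->", split)
```

A full timestamp that starts with `2020-01-15` sorts after the bare string `2020-01-15`, so the strict `>` still keeps transactions from midnight on January 15 onward.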

In [35]:
training_df.count()

write_time             4299147
api_invocation_time    4299147
is_deleted             4299147
tid                    4299147
datetime               4299147
fraud_label            4299147
amount                 4299147
amt_ratio1             4299147
amt_ratio2             4299147
count_ratio            4299147
dtype: int64

In [36]:
test_df.count()

write_time             603210
api_invocation_time    603210
is_deleted             603210
tid                    603210
datetime               603210
fraud_label            603210
amount                 603210
amt_ratio1             603210
amt_ratio2             603210
count_ratio            603210
dtype: int64

Although we now have lots of information about each transaction in our training dataset, we don't want to pass everything as features to the XGBoost algorithm for training because some elements are not useful for detecting fraud or creating a performant model:
- A transaction ID and timestamp are unique to the transaction and never seen again.
- A card number, if included in the feature set at all, should be a categorical variable. But we don't want our model to learn that specific card numbers are associated with fraud as this might lead to our system blocking genuine behaviour. Instead we should only have the model learn to detect shifting patterns in a card's spending history. 
- Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 

Given all of the above, we drop all columns except for the normalized ratio features and transaction amount from our training dataset.

In [5]:
# training_df.drop(['tid','datetime','cc_num','num_trans_last_10m', 'avg_amt_last_10m',
#        'num_trans_last_1w', 'avg_amt_last_1w'], axis=1, inplace=True)


In [37]:
# Reassign instead of dropping in place: training_df is a slice of query_df,
# and an in-place drop on a slice triggers pandas' SettingWithCopyWarning.
training_df = training_df.drop(columns=['tid', 'datetime'])


The [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) requires the label to be the first column in the training data:

In [38]:
training_df = training_df[['fraud_label', 'amount', 'amt_ratio1','amt_ratio2','count_ratio']]
training_df.head()

Unnamed: 0,fraud_label,amount,amt_ratio1,amt_ratio2,count_ratio
0,0,804.4,1.705822,1.705822,0.034483
1,0,58.14,0.085644,0.085644,0.043478
2,0,4.56,0.048174,0.007007,0.083333
3,0,295.96,0.821689,0.821689,0.037037
4,0,486.5,0.602198,0.602198,0.041667


In [39]:
train, val = train_test_split(training_df, test_size=0.3)
train.to_csv(f'{LOCAL_DIR}/train.csv', header=False, index=False)
val.to_csv(f'{LOCAL_DIR}/val.csv', header=False, index=False)
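
The files written above carry no header row and put the label in the first column. A minimal standard-library sketch of that layout, using two rows from the table shown earlier, might look like this. The variable names are illustrative, and pandas' `to_csv(header=False, index=False)` produces the same shape.

```python
import csv
import io

# Two rows taken from the training table shown earlier in this notebook.
rows = [
    {"amount": 804.4, "amt_ratio1": 1.705822, "amt_ratio2": 1.705822,
     "count_ratio": 0.034483, "fraud_label": 0},
    {"amount": 58.14, "amt_ratio1": 0.085644, "amt_ratio2": 0.085644,
     "count_ratio": 0.043478, "fraud_label": 0},
]

# The label goes first, followed by the features.
column_order = ["fraud_label", "amount", "amt_ratio1", "amt_ratio2", "count_ratio"]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # No header row is written, only the values in the required order.
    writer.writerow([row[name] for name in column_order])

csv_text = buffer.getvalue()
print(csv_text)
```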

In [40]:
!aws s3 cp {LOCAL_DIR}/train.csv s3://{BUCKET}/{PREFIX}/
!aws s3 cp {LOCAL_DIR}/val.csv s3://{BUCKET}/{PREFIX}/

upload: data/train.csv to s3://chime-fs-demo/training/train.csv     
upload: data/val.csv to s3://chime-fs-demo/training/val.csv      


In [41]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"100"}

output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)

# look up the URI of the built-in XGBoost container image for this region.
# the third argument is the container version; change it to suit your needs.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", sagemaker.Session().boto_region_name, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "csv"
train_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'train.csv'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'val.csv'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

2023-04-05 19:51:47 Starting - Starting the training job...ProfilerReport-1680724306: InProgress
...
2023-04-05 19:52:47 Starting - Preparing the instances for training...
2023-04-05 19:53:11 Downloading - Downloading input data...
2023-04-05 19:53:47 Training - Downloading the training image...
2023-04-05 19:54:07 Training - Training image download completed. Training in progress..
[2023-04-05 19:54:16.142 ip-10-0-242-183.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined

Ideally we would perform hyperparameter tuning before deployment, but for the purposes of this example we will deploy the model produced by the Training Job directly to a SageMaker hosted endpoint.

In [42]:
predictor = estimator.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    serializer=sagemaker.serializers.CSVSerializer(), wait=True)

----------!

In [13]:
endpoint_name=predictor.endpoint_name
#Store the endpoint name for later cleanup 
%store endpoint_name
endpoint_name

Stored 'endpoint_name' (str)


'sagemaker-xgboost-2023-04-03-02-05-40-978'

Now to check that our endpoint is working, let's call it directly with a record from our test hold-out set. 

In [14]:
# Keep only the four model features, in the same order used for training.
payload_df = test_df[['amount', 'amt_ratio1', 'amt_ratio2', 'count_ratio']]
payload = payload_df.head(1).to_csv(index=False, header=False).strip()
payload

'485.75,0.4152517518202095,0.4152517518202095,0.0384615384615384'

In [15]:
float(predictor.predict(payload).decode('utf-8'))

0.0002807091223075986

## Show that the model predicts FRAUD / NOT FRAUD

In [16]:
count_ratio = 0.30
payload = f'1.00,1.0,1.0,{count_ratio:.2f}'
is_fraud = float(predictor.predict(payload).decode('utf-8'))
print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')

With transaction count ratio of: 0.30, fraud score: 0.978


In [17]:
count_ratio = 0.06
payload = f'1.00,1.0,1.0,{count_ratio:.2f}'
is_fraud = float(predictor.predict(payload).decode('utf-8'))
print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')

With transaction count ratio of: 0.06, fraud score: 0.004
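
With the `binary:logistic` objective the endpoint returns a score between 0 and 1 rather than a class label. As a sketch, the score can be turned into FRAUD / NOT FRAUD by applying a cutoff. The 0.5 threshold and the helper names `parse_score` and `label_for` are assumptions for illustration, and the byte strings stand in for the values returned by `predictor.predict(payload)`.

```python
def parse_score(response: bytes) -> float:
    """Decode a single-score response body into a float."""
    return float(response.decode("utf-8"))


def label_for(score: float, threshold: float = 0.5) -> str:
    """Map a fraud score to a label; the 0.5 threshold is an assumed cutoff."""
    return "FRAUD" if score >= threshold else "NOT FRAUD"


# Stand-ins for endpoint responses, using the rounded scores printed above.
for raw in (b"0.978", b"0.004"):
    score = parse_score(raw)
    print(f"score={score:.3f} -> {label_for(score)}")
```

In practice the threshold would be chosen on the validation set, trading missed fraud against blocked genuine transactions.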
