# Training and Deploying the Fraud Detection Model

In this notebook, we take the output of the Processing Job from the previous step and use it to train and deploy an XGBoost model. Our historical transaction dataset initially contains fields such as timestamp, card number, and transaction amount. We enriched each transaction with features describing that card number's recent history, including:

- `num_trans_last_10m`
- `num_trans_last_1w`
- `avg_amt_last_10m`
- `avg_amt_last_1w`

The new table contains the following columns:

- `orders_last_5m`
- `page_views_last_5m`
- `clicks_last_5m`
- `user_id`
- `event_time`



Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud.
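
As a rough illustration of what "normalized ratio features" could look like, the sketch below divides each short-window aggregate by its one-week counterpart. The column names `amount`, `amt_ratio1`, `amt_ratio2`, and `count_ratio` appear later in this notebook, but the exact formulas here are an assumption, and `add_ratio_features` is a hypothetical helper rather than part of the Processing Job.

```python
import numpy as np
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with per-card ratio features added (assumed formulas)."""
    out = df.copy()
    # Replace zero denominators with NaN so the division yields NaN instead of inf.
    avg_1w = out["avg_amt_last_1w"].replace(0, np.nan)
    num_1w = out["num_trans_last_1w"].replace(0, np.nan)
    out["amt_ratio1"] = out["avg_amt_last_10m"] / avg_1w    # recent average vs. weekly average
    out["amt_ratio2"] = out["amount"] / avg_1w              # this transaction vs. weekly average
    out["count_ratio"] = out["num_trans_last_10m"] / num_1w # recent count vs. weekly count
    return out

# Two made-up transactions, for illustration only.
sample = pd.DataFrame({
    "amount": [50.0, 900.0],
    "num_trans_last_10m": [1, 6],
    "num_trans_last_1w": [20, 12],
    "avg_amt_last_10m": [50.0, 850.0],
    "avg_amt_last_1w": [40.0, 100.0],
})
features = add_ratio_features(sample)
print(features[["amt_ratio1", "amt_ratio2", "count_ratio"]])
```

Because each ratio compares a card against its own history, a card that usually spends little and a card that usually spends a lot end up on a comparable scale.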

**Recommended settings to run this notebook in SageMaker Studio:**

- Image: Data Science
- Kernel: Python3
- Instance type: <font color='blue'>ml.m5.large (2 vCPU + 8 GiB)</font>

<font color='red'>Do not proceed with this notebook unless 1_setup.ipynb and 2_batch_ingestion.ipynb are fully executed including the manual steps.</font>

### Imports 

In [1]:
from sklearn.model_selection import train_test_split
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
from sagemaker import image_uris
import pandas as pd
import numpy as np
import sagemaker
import boto3
import io



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


### Essentials 

In [2]:
LOCAL_DIR = './data'
BUCKET = sagemaker.Session().default_bucket()
PREFIX = 'training_clicks'

sagemaker_role = sagemaker.get_execution_role()
s3_client = boto3.Session().client('s3')

First, let's load the results of the SageMaker Processing Job run in the previous step into a pandas DataFrame.

In [5]:
df = pd.read_csv(f'{LOCAL_DIR}/aggregated_clicks/processing_output.csv')
#df.dropna(inplace=True)
# df['cc_num'] = df['cc_num'].astype(np.int64)
# df['fraud_label'] = df['fraud_label'].astype(np.int64)
df['total_orders_last_1w'] = df['total_orders_last_1w'].astype(np.int64)
df['avg_order_value_last_1w'] = df['avg_order_value_last_1w'].astype(np.float32)
df['event_time'] = df['event_time'].values.astype('datetime64[ns]')
df.head()
len(df)

805

### Split DataFrame into Train & Test Sets

The artificially generated dataset contains transactions from `2020-01-01` to `2020-06-01`. We will create a training and validation set from transactions between `2020-01-15` and `2020-05-15`, discarding the first two weeks so that our aggregated features have built up sufficient history for each card, and leaving the last two weeks as a holdout test set.

In [None]:
training_start = '2022-01-15'
training_end = '2022-05-15'

training_df = df[(df.datetime > training_start) & (df.datetime < training_end)]
test_df = df[df.datetime >= training_end]

test_df.to_csv(f'{LOCAL_DIR}/test.csv', index=False)

Although we now have lots of information about each transaction in our training dataset, we don't want to pass everything as features to the XGBoost algorithm for training because some elements are not useful for detecting fraud or creating a performant model:
- A transaction ID and timestamp are unique to the transaction and are never seen again.
- A card number, if included in the feature set at all, should be a categorical variable. But we don't want our model to learn that specific card numbers are associated with fraud as this might lead to our system blocking genuine behaviour. Instead we should only have the model learn to detect shifting patterns in a card's spending history. 
- Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 

Given all of the above, we drop all columns except for the normalised ratio features and transaction amount from our training dataset.

In [None]:
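# Reassign instead of dropping in place: training_df is a slice of df,
# so an in-place drop can trigger pandas' SettingWithCopyWarning.
training_df = training_df.drop(['tid','datetime','cc_num','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)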

The [built-in XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) requires the label to be the first column in the training data:

In [None]:
training_df = training_df[['fraud_label', 'amount', 'amt_ratio1','amt_ratio2','count_ratio']]
training_df.head()

In [None]:
train, val = train_test_split(training_df, test_size=0.3)
train.to_csv(f'{LOCAL_DIR}/train.csv', header=False, index=False)
val.to_csv(f'{LOCAL_DIR}/val.csv', header=False, index=False)

In [None]:
!aws s3 cp {LOCAL_DIR}/train.csv s3://{BUCKET}/{PREFIX}/
!aws s3 cp {LOCAL_DIR}/val.csv s3://{BUCKET}/{PREFIX}/

In [None]:
# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"binary:logistic",
        "num_round":"100"}

output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)

# look up the URI of the built-in XGBoost container image for the current region.
# the third argument is the container version; change it if you need a different release.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", sagemaker.Session().boto_region_name, "1.2-1")

# construct a SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "csv"
train_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'train.csv'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}".format(BUCKET, PREFIX, 'val.csv'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})

Ideally we would perform hyperparameter tuning before deployment, but for the purposes of this example we will deploy the model produced by the Training Job directly to a SageMaker hosted endpoint.

In [None]:
predictor = estimator.deploy(
    initial_instance_count=1, 
    instance_type='ml.t2.medium',
    serializer=sagemaker.serializers.CSVSerializer(), wait=True)

In [None]:
endpoint_name=predictor.endpoint_name
#Store the endpoint name for later cleanup 
%store endpoint_name
endpoint_name

Now to check that our endpoint is working, let's call it directly with a record from our test hold-out set. 

In [None]:
payload_df = test_df.drop(['tid','datetime','cc_num','fraud_label','num_trans_last_10m', 'avg_amt_last_10m',
       'num_trans_last_1w', 'avg_amt_last_1w'], axis=1)
payload = payload_df.head(1).to_csv(index=False, header=False).strip()
payload

In [None]:
float(predictor.predict(payload).decode('utf-8'))
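
CSV payloads carry no header, so the endpoint reads feature values by position, in the same order as the training columns minus the label. A minimal sketch of enforcing that order, assuming the order `amount, amt_ratio1, amt_ratio2, count_ratio` used in the training cell; `FEATURE_ORDER` and `build_payload` are hypothetical names, not part of the SageMaker SDK:

```python
# Feature order assumed to match the training columns after the label.
FEATURE_ORDER = ["amount", "amt_ratio1", "amt_ratio2", "count_ratio"]

def build_payload(record: dict, feature_order=FEATURE_ORDER) -> str:
    """Serialize one record to a CSV line, ignoring any extra keys."""
    missing = [name for name in feature_order if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return ",".join(str(record[name]) for name in feature_order)

# Keys arrive in arbitrary order and include a column the model never saw.
record = {"count_ratio": 0.3, "amount": 1.0, "amt_ratio2": 1.0,
          "amt_ratio1": 1.0, "cc_num": 1234}
example_payload = build_payload(record)
print(example_payload)  # 1.0,1.0,1.0,0.3
```

Selecting columns by name rather than dropping the unwanted ones keeps the payload stable if the test DataFrame's column order ever differs from the training set's.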

## Show that the model predicts FRAUD / NOT FRAUD

In [None]:
count_ratio = 0.30
payload = f'1.00,1.0,1.0,{count_ratio:.2f}'
is_fraud = float(predictor.predict(payload).decode('utf-8'))
print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')

In [None]:
count_ratio = 0.06
payload = f'1.00,1.0,1.0,{count_ratio:.2f}'
is_fraud = float(predictor.predict(payload).decode('utf-8'))
print(f'With transaction count ratio of: {count_ratio:.2f}, fraud score: {is_fraud:.3f}')
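
With the `binary:logistic` objective the endpoint returns a score between 0 and 1, not a label. A small sketch of turning that score into FRAUD / NOT FRAUD follows; the 0.5 threshold is an assumption that would normally be tuned on validation data, `parse_score` and `to_label` are hypothetical helpers, and the byte strings are stand-ins for endpoint responses rather than real model output.

```python
def parse_score(response: bytes) -> float:
    """Decode the endpoint's UTF-8 response body into a float score."""
    return float(response.decode("utf-8"))

def to_label(score: float, threshold: float = 0.5) -> str:
    """Map a score to a label; the default threshold is an assumed value."""
    return "FRAUD" if score >= threshold else "NOT FRAUD"

# Placeholder responses for illustration only.
for raw in (b"0.912", b"0.034"):
    score = parse_score(raw)
    print(f"score={score:.3f} -> {to_label(score)}")
```

Lowering the threshold flags more transactions at the cost of more false positives, so the right value depends on how expensive a missed fraud is compared with a blocked genuine purchase.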

In [17]:
!pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp311-cp311-linux_x86_64.whl size=448300 sha256=739a8ae0a1272ad3ba60932a8adcf3063441e8b6d8f519004f3bcbb9983eb0b9
  Stored in directory: /home/sagemaker-user/.cache/pip/wheels/b9/0d/8a/0729d2e6e3ca2a898ba55201f905da7db3f838a33df5b3fcdd
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


In [18]:
# !pwd
print(BUCKET)

sagemaker-ap-southeast-1-850995562355


In [28]:
from sagemaker.sklearn.estimator import SKLearn
import sagemaker

# Path to your training script that uses LightFM
script_path = './train_lightfm.py'

# Optionally, if your training script requires additional packages, you can include a requirements.txt file
# (or add installation steps in your script)
hyperparameters = {
    "no_components": 30,
    "epochs": 30,
    "num_threads": 4
}

# Construct a SageMaker estimator for your LightFM training script
estimator = SKLearn(
    entry_point=script_path,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    framework_version='0.23-1',  # choose a version compatible with your code
    py_version='py3',
    hyperparameters=hyperparameters,  # any hyperparameters you want to pass
    output_path=f"s3://{BUCKET}/{PREFIX}/model/",
    source_dir= "../custom/"
)


In [29]:
!aws s3 cp ../data/train_clicks.csv s3://{BUCKET}/{PREFIX}/
!aws s3 cp ../data/val_clicks.csv s3://{BUCKET}/{PREFIX}/

upload: ../data/train_clicks.csv to s3://sagemaker-ap-southeast-1-850995562355/training/train_clicks.csv
upload: ../data/val_clicks.csv to s3://sagemaker-ap-southeast-1-850995562355/training/val_clicks.csv


In [30]:

# define the data type and paths to the training and validation datasets
content_type = "csv"
# use the same PREFIX the files were uploaded under, so the channels point at existing objects
train_input = TrainingInput(f"s3://{BUCKET}/{PREFIX}/train_clicks.csv", content_type=content_type)
validation_input = TrainingInput(f"s3://{BUCKET}/{PREFIX}/val_clicks.csv", content_type=content_type)

# execute the LightFM training job (runs train_lightfm.py inside the scikit-learn container)
estimator.fit({'train': train_input, 'validation': validation_input})

INFO:sagemaker:Creating training-job with name: sagemaker-scikit-learn-2025-03-03-15-26-13-320


2025-03-03 15:26:13 Starting - Starting the training job......
2025-03-03 15:27:06 Downloading - Downloading input data...
2025-03-03 15:27:47 Training - Training image download completed. Training in progress....
2025-03-03 15:28:11,072 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2025-03-03 15:28:11,075 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2025-03-03 15:28:11,117 sagemaker_sklearn_container.training INFO     Invoking user training script.
2025-03-03 15:28:11,298 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/miniconda3/bin/python -m pip install -r requirements.txt
Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 316.4/316.4 kB 23.7 MB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2025-03-03-15-26-13-320: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python ./train_lightfm.py --epochs 30 -

In [35]:
import argparse
import pandas as pd
import numpy as np
from lightfm import LightFM
from scipy.sparse import coo_matrix
import joblib

def load_data(csv_file):
    # Read CSV into a DataFrame
    df = pd.read_csv(csv_file)
    return df

def build_interaction_matrix(df):
    # Get unique users and items
    user_ids = df['user_id'].unique()
    item_ids = df['item_id'].unique()
    
    # Build mapping from original IDs to 0-indexed values
    user_map = {user: idx for idx, user in enumerate(user_ids)}
    item_map = {item: idx for idx, item in enumerate(item_ids)}
    
    # Map user and item IDs to indices
    df['user_idx'] = df['user_id'].map(user_map)
    df['item_idx'] = df['item_id'].map(item_map)
    
    # Create a sparse matrix for interactions
    interactions = coo_matrix(
        (df['interaction'], (df['user_idx'], df['item_idx'])),
        shape=(len(user_ids), len(item_ids))
    )
    return interactions, user_map, item_map

def main(args):
    # Load CSV file from training input channel (SageMaker downloads it to /opt/ml/input/data/train)
    csv_file = args.interactions_data
    print(f"Loading data from: {csv_file}")
    df = load_data(csv_file)
    
    # Build the interactions matrix and mappings
    interactions, user_map, item_map = build_interaction_matrix(df)
    
    # Initialize and train the LightFM model (using WARP loss for implicit data)
    model = LightFM(loss='warp', no_components=args.no_components)
    model.fit(interactions, epochs=args.epochs, num_threads=args.num_threads)
    
    # Save the trained model and mappings using joblib
    joblib.dump({
        'model': model,
        'user_map': user_map,
        'item_map': item_map
    }, args.model_output)
    print(f"Model saved to {args.model_output}")

def main2():
    hyperparameters = {
            "no_components": 30,
            "epochs": 30,
            "num_threads": 4
        }

    # Load the CSV from a local path (this variant runs in the notebook, not in a training job)
    csv_file = "../data/train_clicks.csv"
    print(f"Loading data from: {csv_file}")
    df = load_data(csv_file)
    
    # Build the interactions matrix and mappings
    interactions, user_map, item_map = build_interaction_matrix(df)
    
    # Initialize and train the LightFM model (using WARP loss for implicit data)
    model = LightFM(loss='warp', no_components=hyperparameters["no_components"])
    model.fit(interactions, epochs=hyperparameters["epochs"], num_threads=hyperparameters["num_threads"])
    
    # Save the trained model and mappings using joblib
    joblib.dump({
        'model': model,
        'user_map': user_map,
        'item_map': item_map
    }, "../data/lightfm_model.pkl")
    print(f"Model saved to ../data")


if __name__ == '__main__':
    # parser = argparse.ArgumentParser()
    # # Set default to the local channel directory for train data
    # parser.add_argument('--interactions_data', type=str, default='/opt/ml/input/data/train/train_clicks.csv')
    # parser.add_argument('--no_components', type=int, default=30)
    # parser.add_argument('--epochs', type=int, default=30)
    # parser.add_argument('--num_threads', type=int, default=4)
    # # The model will be saved to the directory provided by SageMaker's /opt/ml/model channel
    # parser.add_argument('--model_output', type=str, default='/opt/ml/model/lightfm_model.pkl')
    
    # args = parser.parse_args()
    # main(args)
    main2()


Loading data from: ../data/train_clicks.csv
Model saved to ../data
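
The commented-out parser in the script hard-codes the container paths. SageMaker training containers also expose those locations through the `SM_CHANNEL_TRAIN` and `SM_MODEL_DIR` environment variables, and pass each hyperparameter to the entry point as a `--name value` argument. Below is a hedged sketch of a parser that works both inside a training job and locally; the `--train_dir`, `--model_dir`, and `--train_file` options are hypothetical names chosen for this example.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters, falling back to SageMaker paths when env vars are unset."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_dir",
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--model_dir",
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train_file", default="train_clicks.csv")
    parser.add_argument("--no_components", type=int, default=30)
    parser.add_argument("--epochs", type=int, default=30)
    parser.add_argument("--num_threads", type=int, default=4)
    # parse_known_args ignores arguments this script does not define.
    args, _unknown = parser.parse_known_args(argv)
    # Derive the attribute names that main(args) in the script above reads.
    args.interactions_data = os.path.join(args.train_dir, args.train_file)
    args.model_output = os.path.join(args.model_dir, "lightfm_model.pkl")
    return args

args = parse_args(["--epochs", "5"])
print(args.interactions_data, args.model_output, args.epochs)
```

With a parser like this, `main(parse_args())` could replace the `main2()` call so that the same script serves both the notebook and the training job.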


In [32]:
load_data("../data/train_clicks.csv")

Unnamed: 0,user_id,item_id,interaction
0,1,1,10
1,1,2,3
2,1,3,0
3,1,4,1
4,2,1,0
5,2,2,7
6,2,3,1
7,2,4,14
8,3,1,2
9,3,2,0
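
The preview includes rows whose `interaction` value is 0. If a zero means "no interaction" (an assumption about this dataset), one option is to drop those rows before building the index maps, so that only observed interactions become entries in the sparse matrix. The sketch below uses plain Python and the ten rows shown above; `build_triplets` is a hypothetical helper, and the resulting `(row, col, value)` triplets are the same shape of input that `coo_matrix` receives in the script.

```python
# (user_id, item_id, interaction) rows copied from the preview above.
rows = [
    (1, 1, 10), (1, 2, 3), (1, 3, 0), (1, 4, 1),
    (2, 1, 0), (2, 2, 7), (2, 3, 1), (2, 4, 14),
    (3, 1, 2), (3, 2, 0),
]

def build_triplets(rows):
    """Drop zero interactions, then map raw IDs to 0-based indices."""
    positive = [r for r in rows if r[2] > 0]
    user_map, item_map, triplets = {}, {}, []
    for user_id, item_id, value in positive:
        # setdefault assigns the next free index the first time an ID is seen.
        u = user_map.setdefault(user_id, len(user_map))
        i = item_map.setdefault(item_id, len(item_map))
        triplets.append((u, i, value))
    return triplets, user_map, item_map

triplets, user_map, item_map = build_triplets(rows)
print(len(triplets), user_map, item_map)
```

One trade-off: a user or item that only ever appears with a zero interaction gets no index under this approach, so it would need separate handling at prediction time.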
