# Training and Deploying the Fraud Detection Model

In this notebook, we take the output of the Processing Job from the previous step and use it to train and deploy an XGBoost model. Our historic transaction dataset initially consists of fields such as timestamp, card number, and transaction amount. We enriched each transaction with features describing that card number's recent history, including:

- `num_trans_last_10m`
- `num_trans_last_1w`
- `avg_amt_last_10m`
- `avg_amt_last_1w`

Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 
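As a rough illustration of what "normalized ratio features" could look like, the sketch below divides a transaction's amount and short-window aggregates by the same card's one-week aggregates. The four aggregate column names come from the list above; the `amount` column name, the output column names, and the helper `add_ratio_features` are assumptions made for this example and may not match the features produced by the actual Processing Job.

```python
import numpy as np
import pandas as pd


def add_ratio_features(df: pd.DataFrame, amount_col: str = "amount") -> pd.DataFrame:
    """Return a copy of df with per-card ratio features added (hypothetical names)."""
    out = df.copy()
    # Replace zero denominators with NaN so the division yields NaN instead of inf.
    avg_1w = out["avg_amt_last_1w"].replace(0, np.nan)
    num_1w = out["num_trans_last_1w"].replace(0, np.nan)
    # How large is this transaction relative to the card's weekly average?
    out["amt_ratio_1w"] = out[amount_col] / avg_1w
    # How does the last 10 minutes compare with the last week?
    out["avg_amt_ratio_10m_1w"] = out["avg_amt_last_10m"] / avg_1w
    out["num_trans_ratio_10m_1w"] = out["num_trans_last_10m"] / num_1w
    return out


# Two toy transactions: one typical, one far above the card's usual spending.
example = pd.DataFrame(
    {
        "amount": [20.0, 500.0],
        "avg_amt_last_10m": [20.0, 450.0],
        "avg_amt_last_1w": [25.0, 50.0],
        "num_trans_last_10m": [1, 5],
        "num_trans_last_1w": [10, 10],
    }
)
example_with_ratios = add_ratio_features(example)
```

Because each ratio compares a card against its own history, a low-spending card and a high-spending card with similar relative behaviour end up with similar feature values.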

### Imports 

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import sagemaker
import boto3
import io

### Essentials 

First, let's load the results of the SageMaker Processing Job run in the previous step into a Pandas DataFrame.

In [None]:
df = pd.read_csv(f'{LOCAL_DIR}/aggregated/processing_output.csv')
#df.dropna(inplace=True)
df['cc_num'] = df['cc_num'].astype(np.int64)
df['fraud_label'] = df['fraud_label'].astype(np.int64)
print(len(df))  # row count; printed explicitly because only a cell's last expression is displayed
df.head()

### Split DataFrame into Train & Test Sets

The artificially generated dataset contains transactions from `2020-01-01` to `2020-06-01`. We will create a training and validation set out of transactions between `2020-01-15` and `2020-05-15`, discarding the first two weeks so that our aggregated features have built up sufficient history for each card, and leaving the last two weeks as a holdout test set.

In [None]:
training_start = '2020-01-15'
training_end = '2020-05-15'

training_df = df[(df.datetime > training_start) & (df.datetime < training_end)]
test_df = df[df.datetime >= training_end]

test_df.to_csv(f'{LOCAL_DIR}/test.csv', index=False)
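One possible way to carve a validation set out of the training window is `train_test_split` from scikit-learn, which is already imported at the top of this notebook. The sketch below runs on a small synthetic DataFrame so that it is self-contained; the 80/20 split, the fixed `random_state`, and stratifying on `fraud_label` are assumptions for illustration, not necessarily the settings used later in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for training_df: 100 rows, 10% of them labelled as fraud.
toy_training_df = pd.DataFrame(
    {
        "amount": [float(i) for i in range(100)],
        "fraud_label": [1 if i % 10 == 0 else 0 for i in range(100)],
    }
)

# Stratifying keeps the fraud rate similar in both splits, which matters
# when the positive class is rare.
train_df, val_df = train_test_split(
    toy_training_df,
    test_size=0.2,
    random_state=42,
    stratify=toy_training_df["fraud_label"],
)
```

A random split like this mixes transactions from the whole training window; a purely time-based validation split, mirroring how the holdout test set is built, would be a reasonable alternative.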

Although we now have lots of information about each transaction in our training dataset, we don't want to pass everything as features to the XGBoost algorithm, because some elements are not useful for detecting fraud or for building a well-performing model:
- A transaction ID and timestamp are unique to the transaction and are never seen again.
- A card number, if included in the feature set at all, should be a categorical variable. But we don't want our model to learn that specific card numbers are associated with fraud as this might lead to our system blocking genuine behaviour. Instead we should only have the model learn to detect shifting patterns in a card's spending history. 
- Individual card numbers may have radically different spending patterns, so we will want to use normalized ratio features to train our XGBoost model to detect fraud. 

Given all of the above, we drop all columns except for the normalized ratio features and transaction amount from our training dataset.
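The column selection described above can be sketched as follows. The feature names in `FEATURE_COLUMNS`, the `amount` and `trans_id` column names, and the helper `select_model_columns` are hypothetical and stand in for whatever the Processing Job actually produced; `cc_num`, `datetime`, and `fraud_label` are the columns used earlier in this notebook. The label is kept as the first column, since it is still needed for training, and SageMaker's built-in XGBoost algorithm expects the label first in CSV input.

```python
import pandas as pd

# Hypothetical feature names; substitute the real ratio columns.
FEATURE_COLUMNS = ["amount", "amt_ratio_1w", "num_trans_ratio_10m_1w"]
LABEL_COLUMN = "fraud_label"


def select_model_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the label (first) and the chosen feature columns."""
    return df[[LABEL_COLUMN] + FEATURE_COLUMNS]


# Toy rows containing identifier columns that should not reach the model.
toy_df = pd.DataFrame(
    {
        "trans_id": ["a1", "b2"],
        "cc_num": [1111, 2222],
        "datetime": ["2020-02-01 10:00:00", "2020-02-01 10:05:00"],
        "amount": [20.0, 500.0],
        "amt_ratio_1w": [0.8, 10.0],
        "num_trans_ratio_10m_1w": [0.1, 0.5],
        "fraud_label": [0, 1],
    }
)
model_df = select_model_columns(toy_df)
```

Selecting an explicit allow-list of columns, rather than dropping known-bad ones, means any new identifier column added upstream is excluded by default.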