Card fraud is a massive source of financial loss for businesses. As there are so many transactions, detecting them manually is an impossible task. We need to rely on automated models to do so.

A model that detects credit card fraud is an example of a classification model. Given some data what is the most likely class (or label) to assign to the data? In the case of credit card fraud, or data is transactions and associated properties (time of day, amount, location, etc.) and the classes are fraudulent or legitimate.

This notebook will walk through how to build a classification model for detecting credit card fraud, by:

1. Obtaining some sample data.
2. Cleaning the sample data.
3. Splitting the data up into training, validation, and test sets.
4. Creating a feed-forward neural network using TensorFlow and Keras, accounting for imbalanced data.
5. Evaluating the model by looking at its ROC curve

**Two very important sources**

1. Main Source: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html
2. Secondary Source: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

In [23]:
import warnings
warnings.filterwarnings('ignore')

In [24]:
#import os
#import pandas as pd

#directory = 'C:\Abhinav\Test\DL\data'
#dfs = []

#for filename in os.listdir(directory):
#    if filename.endswith('.pkl'):
#        filepath = os.path.join(directory, filename)
#        dfs.append(pd.read_pickle(filepath))
        
#combined_df = pd.concat(dfs, ignore_index=True)
#combined_df.to_csv('final_data.csv')

In [25]:
from datetime import date, datetime, timedelta
import os
import math

import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import matplotlib as mpl

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
import sklearn
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [26]:
mpl.rcParams['figure.figsize'] = (12, 10)
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

In [27]:
df = pd.read_csv("C:/Abhinav/Test/DL/final_data.csv")

In [28]:
df

Unnamed: 0.1,Unnamed: 0,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO
0,0,0,2018-04-01 00:00:31,596,3156,57.16,31,0,0,0
1,1,1,2018-04-01 00:02:10,4961,3412,81.51,130,0,0,0
2,2,2,2018-04-01 00:07:56,2,1365,146.00,476,0,0,0
3,3,3,2018-04-01 00:09:29,4128,8737,64.49,569,0,0,0
4,4,4,2018-04-01 00:10:34,927,9906,50.99,634,0,0,0
...,...,...,...,...,...,...,...,...,...,...
288057,288057,288057,2018-04-30 23:56:58,818,7690,48.92,2591818,29,0,0
288058,288058,288058,2018-04-30 23:57:38,3763,7460,99.06,2591858,29,0,0
288059,288059,288059,2018-04-30 23:57:39,2000,8998,83.24,2591859,29,0,0
288060,288060,288060,2018-04-30 23:58:01,2566,6688,109.55,2591881,29,0,0


Each row of data has the following parameters:

- TRANSACTION_ID: a unique ID identifying the transaction
- TX_DATETIME: when the transaction occurred
- CUSTOMER_ID: the unique ID of the customer performing the transaction
- TERMINAL_ID: the unique ID of the terminal (aka merchant) where the transaction took place
- TX_AMOUNT: the amount of the transaction
- TX_TIME_SECONDS: when the transaction occurred, in seconds since the start of the simulation (2018-04-02)
- TX_TIME_DAYS: when the transaction occurred, in days, since the start of the simulation
- TX_FRAUD: whether this transaction was fraudulent or not, where 1 indicates a fraudulent transaction and 0 indicates a legitimate one
- TX_FRAUD_SCENARIO: this is unused

The data has far more legitimate transactions than fraudulent ones. This is a phenomenon called imbalanced data and will become very important later. For now, we can see what proportion of the dataset is fraudulent (less than 1%).

In [29]:
not_fraud_count, fraud_count = np.bincount(df['TX_FRAUD'])

total_count = not_fraud_count + fraud_count
print(
    (
        f"Data:\n"
        f"Total: {total_count}\n"
        f"Fraud: {fraud_count} ({100 * fraud_count / total_count:.2f}% of total)\n"
    )
)

Data:
Total: 288062
Fraud: 1702 (0.59% of total)



Taking a quick look at the data, we can see there is a difference in distributions between fraudulent and legitimate transactions. For example, the distribution of amounts seems to be much wider for fraudulent transactions in the chart below.

If we can find enough of these statistical differences (rather, if our model can), we can exploit them to detect fraudulent transactions.

In [30]:
df = pd.concat(
    [
        df[df['TX_FRAUD'] == 0].sample(1000, random_state=14),
        df[df['TX_FRAUD'] == 1].sample(1000, random_state=14)
    ]
)

fig = px.histogram(df, title='Transaction count for different amount', x='TX_AMOUNT', color='TX_FRAUD', marginal="box")
fig.update_traces(opacity=0.75)
fig.update_layout(barmode='overlay')
fig.show()

**Feature engineering**:
The model we are building is going to be a "feed-forward neural network". We could just use each transaction row as-is, but the model will perform better if we can extract more features that a naive neural network won't be able to learn.

In [110]:
clean_df = pd.DataFrame()

First, let's add some straightforward features

- amount: the amount of the transaction
- is_fraud: whether the transaction was fraudulent or not (this is technically the label, not a feature)
- is_weekend: whether the transaction was on a weekend or not
- is_night: whether the transaction was at night or not (we take "night" to mean late at night: after midnight and before 6am)

In [31]:
df["TX_DATETIME"] = pd.to_datetime(df["TX_DATETIME"])

clean_df["amount"] = df["TX_AMOUNT"]
clean_df["is_fraud"] = df["TX_FRAUD"]

clean_df["is_weekend"] = df["TX_DATETIME"].dt.weekday >= 5
clean_df["is_night"] = df["TX_DATETIME"].dt.hour <=6

In [32]:
clean_df

Unnamed: 0,amount,is_fraud,is_weekend,is_night
145348,4.39,0,False,True
56044,105.25,0,False,False
59646,36.99,0,True,False
60594,41.67,0,True,False
195052,83.81,0,True,False
...,...,...,...,...
113799,206.95,1,False,False
101028,102.56,1,False,False
266749,3.85,1,True,False
282067,295.00,1,False,False


One thing that this neural network definitely can't learn is customer behavior over multiple transactions. We need to look at a window over some period and, for each transaction, calculate the customer behavior we care about up to that point.

- customer_num_transactions_1_day: the number of transactions made by this customer over the past day (24 hours). There are similar features for a 7-day and 30-day period.
- customer_avg_amount_1_day: the average value of each transaction over the past day (24 hours). Just like the number of transactions, there are extra features for 7-day and 30-day periods.

These are calculated using the Pandas groupby and rolling functions.

In [37]:
clean_df["customer_num_transactions_1_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("1d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)
clean_df["customer_num_transactions_7_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("7d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)
clean_df["customer_num_transactions_30_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("30d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)

In [39]:
clean_df["customer_avg_amount_1_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("1d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)
clean_df["customer_avg_amount_7_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("7d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)
clean_df["customer_avg_amount_30_day"] = df.groupby("CUSTOMER_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("30d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)

For terminals, we do something similar.

- terminal_num_transactions_1_day: the number of transactions made over this period, just like the equivalent customer feature.
- terminal_fraud_risk_1_day: this is the fraud risk for the terminal. More on this below 👇

**Fraud risk**: 
The fraud risk is the proportion of transactions for this terminal that were fraudulent or not—over the past 1, 7, or 30 days.

However, we assume that we can only know for sure whether a transaction was fraudulent or not after some delay (7 days in this case) after some manual review. Therefore, **risk calculations are delayed by 7 days**. This means the risk calculation for a transaction on day N is actually for day N-7.

The function get_count_risk_rolling_window below shows how this is calculated.

For a deeper description of fraud risk, check out
https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/BaselineFeatureTransformation.html#terminal-id-transformations

In [64]:
DAY_DELAY = 7

def get_count_risk_rolling_window(terminal_transactions, window_size, delay_period=DAY_DELAY):
    terminal_transactions = terminal_transactions.sort_values("TX_DATETIME")
    frauds_in_delay = terminal_transactions.rolling(str(delay_period) + "d", on="TX_DATETIME")["TX_FRAUD"].sum()
    transactions_in_delay = terminal_transactions.rolling(str(delay_period) + "d", on="TX_DATETIME")["TX_FRAUD"].count()
    
    frauds_until_window = terminal_transactions.rolling(str(delay_period + window_size) + "d", on="TX_DATETIME")["TX_FRAUD"].sum()
    transactions_until_window = terminal_transactions.rolling(str(delay_period + window_size) + "d", on="TX_DATETIME")["TX_FRAUD"].count()
    
    frauds_in_window = frauds_until_window - frauds_in_delay
    transactions_in_window = transactions_until_window - transactions_in_delay
    
    terminal_transactions["fraud_risk"] = (frauds_in_window / transactions_in_window).fillna(0)

    return terminal_transactions

# Group-wise operations for terminal_num_transactions_X_day
clean_df["terminal_num_transactions_1_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("1d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)

clean_df["terminal_num_transactions_7_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("7d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)

clean_df["terminal_num_transactions_30_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: x.sort_values("TX_DATETIME")[["TX_DATETIME", "TX_AMOUNT"]].rolling("30d", on="TX_DATETIME").count()
)["TX_AMOUNT"].reset_index(level=0, drop=True)

# Group-wise operations for terminal_fraud_risk_X_day
clean_df["terminal_fraud_risk_1_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: get_count_risk_rolling_window(x, 1, 7)
)["fraud_risk"].reset_index(level=0, drop=True)

clean_df["terminal_fraud_risk_7_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: get_count_risk_rolling_window(x, 7, 7)
)["fraud_risk"].reset_index(level=0, drop=True)

clean_df["terminal_fraud_risk_30_day"] = df.groupby("TERMINAL_ID").apply(
    lambda x: get_count_risk_rolling_window(x, 30, 7)
)["fraud_risk"].reset_index(level=0, drop=True)

The remaining features are important for slicing the data later, and aren't actually used for training.

In [65]:
clean_df["day"] = df["TX_TIME_DAYS"]
clean_df["datetime"] = df["TX_DATETIME"]
clean_df["customer_id"] = df["CUSTOMER_ID"]
clean_df["id"] = df["TRANSACTION_ID"]

Let's take a look at a sample of our transformed rows:

In [66]:
pd.concat(
    [
        clean_df[clean_df["is_fraud"] == 1].sample(5, random_state=14),
        clean_df[clean_df["is_fraud"] == 0].sample(5, random_state=14),
    ]
).sample(10, random_state=14)

Unnamed: 0,amount,is_fraud,is_weekend,is_night,customer_num_transactions_1_day,customer_num_transactions_7_day,customer_num_transactions_30_day,customer_avg_amount_1_day,customer_avg_amount_7_day,customer_avg_amount_30_day,terminal_num_transactions_1_day,terminal_num_transactions_7_day,terminal_num_transactions_30_day,terminal_fraud_risk_1_day,terminal_fraud_risk_7_day,terminal_fraud_risk_30_day,day,datetime,customer_id,id
121848,90.39,1,False,False,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,0.0,0.0,0.0,12,2018-04-13 15:00:20,2318,121848
149034,15.56,0,False,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,15,2018-04-16 12:38:45,3061,149034
19200,46.44,1,False,True,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,2,2018-04-03 01:20:37,4672,19200
11977,162.93,0,False,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1,2018-04-02 08:30:46,2807,11977
287437,13.63,1,False,False,1.0,1.0,2.0,1.0,1.0,2.0,4.0,7.0,13.0,0.0,1.0,1.0,29,2018-04-30 20:00:44,1423,287437
61592,129.98,1,True,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,0.0,0.0,0.0,6,2018-04-07 10:50:42,211,61591
230975,98.1,1,False,True,1.0,3.0,3.0,1.0,3.0,3.0,1.0,1.0,1.0,0.0,0.0,0.0,24,2018-04-25 04:19:09,2439,230975
202412,31.4,0,True,True,1.0,2.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,0.0,0.0,0.0,21,2018-04-22 05:48:33,860,202412
75523,104.76,0,True,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,7,2018-04-08 18:52:31,1843,75523
62457,28.81,0,True,False,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,6,2018-04-07 12:05:29,2505,62455


In [80]:
df = clean_df

In [92]:
#def get_train_test_set(
#    df,
#    start_date_training,
#    delta_train=7,
#    delta_delay=DAY_DELAY,
#    delta_test=7,
#    random_state=14,
#):

    # Get the training set data
#    train_df = df[
#        (df["datetime"] >= start_date_training)
#        & (df["datetime"] < start_date_training + timedelta(days=delta_train))
#    ]
    
#    print("Number of rows in train_df:", len(train_df))

    # Get the test set data
#    test_df = []

    # Note: Cards known to be compromised after the delay period are removed from the test set
    # That is, for each test day, all frauds known at (test_day-delay_period) are removed

    # First, get known defrauded customers from the training set
#    known_defrauded_customers = set(train_df[train_df["is_fraud"] == 1]["customer_id"])

    # Get the relative starting day of training set (easier than TX_DATETIME to collect test data)
#    start_tx_time_days_training = train_df["day"].min()

    # Then, for each day of the test set
#    for day in range(delta_test):

        # Get test data for that day
#        test_df_day = df[
#            df["day"] == start_tx_time_days_training + delta_train + delta_delay + day
#        ]
        
#        print("Number of rows in test_df_day for day", day, ":", len(test_df_day))

        # Compromised cards from that test day, minus the delay period, are added to the pool of known defrauded customers
#        test_df_day_delay_period = df[
#            df["day"] == start_tx_time_days_training + delta_train + day - 1
#        ]

#        new_defrauded_customers = set(
#            test_df_day_delay_period[test_df_day_delay_period["is_fraud"] == 1][
#                "customer_id"
#            ]
#        )
#        known_defrauded_customers = known_defrauded_customers.union(
#            new_defrauded_customers
#        )

#        test_df_day = test_df_day[
#            ~test_df_day["customer_id"].isin(known_defrauded_customers)
#        ]

#        test_df.append(test_df_day)

#    test_df = pd.concat(test_df)
#    print("Number of rows in test_df:", len(test_df))

    # Sort data sets by ascending order of transaction ID
#    train_df = train_df.sort_values("id")
#    test_df = test_df.sort_values("id")

#    return (train_df, test_df)

#train_df, test_df = get_train_test_set(
#    clean_df, datetime(2018, 4, 1), delta_train=21
#)
#train_df, val_df = get_train_test_set(train_df, datetime(2018, 4, 1))

Slicing the dataset
We need to split the dataset up into train, validation, and test sets.

The training set is used for training the model and is not used to evaluate it.
The validation set is used to track metrics while training but is not used for training.
The test set is not used at all during the training process but used at the end to see how the model generalizes to new data.
Most of the time, you can randomly sample from the dataset to split it up into each of these sets. In this case, we want to make sure each set comes from different points in time to reduce the chances of overfitting. We also take out customers known to be fraudulent in earlier sets to prevent overfitting.

This is all in the function get_train_test_set below.

![alt text](image.png "Title")

In [93]:
# this is adapted from get_train_test_set at
# https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_References/shared_functions.html#get-train-test-set
def get_train_test_set(
    df,
    start_date_training,
    delta_train=7,
    delta_delay=DAY_DELAY,
    delta_test=7,
    random_state=14,
):

    # Get the training set data
    train_df = df[
        (df["datetime"] >= start_date_training)
        & (df["datetime"] < start_date_training + timedelta(days=delta_train))
    ]

    # Get the test set data
    test_df = []

    # Note: Cards known to be compromised after the delay period are removed from the test set
    # That is, for each test day, all frauds known at (test_day-delay_period) are removed

    # First, get known defrauded customers from the training set
    known_defrauded_customers = set(train_df[train_df["is_fraud"] == 1]["customer_id"])

    # Get the relative starting day of training set (easier than TX_DATETIME to collect test data)
    start_tx_time_days_training = train_df["day"].min()

    # Then, for each day of the test set
    for day in range(delta_test):

        # Get test data for that day
        test_df_day = df[
            df["day"] == start_tx_time_days_training + delta_train + delta_delay + day
        ]

        # Compromised cards from that test day, minus the delay period, are added to the pool of known defrauded customers
        test_df_day_delay_period = df[
            df["day"] == start_tx_time_days_training + delta_train + day - 1
        ]

        new_defrauded_customers = set(
            test_df_day_delay_period[test_df_day_delay_period["is_fraud"] == 1][
                "customer_id"
            ]
        )
        known_defrauded_customers = known_defrauded_customers.union(
            new_defrauded_customers
        )

        test_df_day = test_df_day[
            ~test_df_day["customer_id"].isin(known_defrauded_customers)
        ]

        test_df.append(test_df_day)

    test_df = pd.concat(test_df)

    # Sort data sets by ascending order of transaction ID
    train_df = train_df.sort_values("id")
    test_df = test_df.sort_values("id")

    return (train_df, test_df)


train_df, test_df = get_train_test_set(
    clean_df, datetime(2018, 4, 1), delta_train=21
)
train_df, val_df = get_train_test_set(train_df, datetime(2018, 4, 1))

For each of these sets, we make arrays of features (the properties we want to train on) and labels (the things we want to predict).

In [94]:
label_columns = ["is_fraud"]
feature_columns = [
    "amount",
    "is_weekend",
    "is_night",
    "customer_num_transactions_1_day",
    "customer_num_transactions_7_day",
    "customer_num_transactions_30_day",
    "customer_avg_amount_1_day",
    "customer_avg_amount_7_day",
    "customer_avg_amount_30_day",
    "terminal_num_transactions_1_day",
    "terminal_num_transactions_7_day",
    "terminal_num_transactions_30_day",
    "terminal_fraud_risk_1_day",
    "terminal_fraud_risk_7_day",
    "terminal_fraud_risk_30_day",
]

train_labels = np.array(train_df[label_columns])
val_labels = np.array(val_df[label_columns])
test_labels = np.array(test_df[label_columns])

train_features = np.array(train_df[feature_columns])
val_features = np.array(val_df[feature_columns])
test_features = np.array(test_df[feature_columns])

Finally, we just want to make sure all of the values are in a similar scale. This makes learning a little more predictable. The scaler is tuned on the training data, and then used to scale the validation and test data.

In [95]:
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)

val_features = scaler.transform(val_features)
test_features = scaler.transform(test_features)

print('Training labels shape:', train_labels.shape)
print('Validation labels shape:', val_labels.shape)
print('Test labels shape:', test_labels.shape)

print('Training features shape:', train_features.shape)
print('Validation features shape:', val_features.shape)
print('Test features shape:', test_features.shape)

Training labels shape: (329, 1)
Validation labels shape: (462, 1)
Test labels shape: (161, 1)
Training features shape: (329, 15)
Validation features shape: (462, 15)
Test features shape: (161, 15)


**THE MODEL**: Our model needs to take each row of features and make a prediction of whether a transaction is fradulent or not.

A simple feed-forward neural network is good for this. It performs a multitude of mathematical calculations on each input row. The parameters of each calculations are what end up getting tuned during training.

This model will be built with *Keras*, which is a user-friendly deep learning API on top of *TensorFlow*

**Imabalanced Dataset**
We are working with *imbalamced data*. Less than 1% of our dataset contains fradulent transactions. THis makes naive techniques for learning problematic. A model predicts everything as legitimate would be correct over 99% of the time , but that would be a useless model.

One method to accomodate for this is class weighting. This means penalizing misclassifications of one class (in this case, fradulent transactions) than the other. Below, we calculate the weights for each class that we will pass to *Keras*. The weights make fradulent labels 120 times "more important" than the non-fraudulent labels.

In [96]:
weight_for_not_fraud = (1.0 / not_fraud_count) * total_count / 2.0
weight_for_fraud = (1.0 / fraud_count) * total_count / 2.0

class_weight = {0: weight_for_not_fraud, 1: weight_for_fraud}
class_weight

{0: 0.5029717837686828, 1: 84.62455934195064}

**MODEL STRUCTURE**

This model is made up of two hidden layers with 500 nodes each. There is a dropout layer to prevent overfitting. This means that for each node in the final hidden layer, there is a 0.2 chance in each training run of it not being used.

The loss function (what training will aim to minimize) is binary cross entropy. This is standard for binary classification problems.

**Where did these hyperparameters come from?**

Hyperparameters determine how our model learns, rather than the parameters of the model that are learned through training. Often, hyperparameters can be chosen arbitrarily by their model's authors. There are more quantitative approaches to finding the correct ones. In this case, several hyperparameters have come from analysis in the Handbook.

1. Batch size: 64
2. Epochs: 40
3. Number of hidden layers: 2
4. Nodes per hidden layer: 500
5. Probability of dropout layer: 0.2
6. Learning rate: 0.001

Finding the correct hyperparameters can be a lot of work, so I'm very thankful that I don't have to. If you're interested in the process, see Chapter 7.2.7 of the Handbook.

In [97]:
output_bias = tf.keras.initializers.Constant(np.log([fraud_count / not_fraud_count]))

model = keras.Sequential(
    [
        keras.layers.Dense(
            500, activation="relu", input_shape=(train_features.shape[-1],)
        ),
        keras.layers.Dense(
            500, activation="relu", input_shape=(train_features.shape[-1],)
        ),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(1, activation="sigmoid", bias_initializer=output_bias)
    ]
)

model.compile(
    optimizer = keras.optimizers.Adam(learning_rate=1e-3),
    loss = keras.losses.BinaryCrossentropy(),
    metrics=[
        keras.metrics.Precision(name="precision"),
        keras.metrics.Recall(name="recall"),
        keras.metrics.AUC(name="AUC"),
        keras.metrics.AUC(name="prc", curve="PR")
    ]
)
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 500)               8000      
                                                                 
 dense_1 (Dense)             (None, 500)               250500    
                                                                 
 dropout (Dropout)           (None, 500)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 501       
                                                                 
Total params: 259001 (1011.72 KB)
Trainable params: 259001 (1011.72 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


**MODEL TRAINING**: 
Everything has been working towards this. Now we can train the model on our training set. Additionally, we use early stopping to prevent overfitting. This just means that if the model starts performing worse on the validation set, we revert to an earlier epoch when the performance was better.

In [98]:
BATCH_SIZE = 64

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_prc", verbose=1, patience=10, mode="max", restore_best_weights=True
)

training_history = model.fit(
    train_features,
    train_labels,
    batch_size = BATCH_SIZE,
    epochs = 40,
    callbacks = [early_stopping],
    validation_data = (val_features, val_labels),
    class_weight = class_weight,
)

Epoch 1/40

Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
1/6 [====>.........................] - ETA: 0s - loss: 1.6408 - precision: 0.2812 - recall: 1.0000 - AUC: 0.8321 - prc: 0.5951Restoring model weights from the end of the best epoch: 1.
Epoch 11: early stopping



Below is a plot of each of the metrics we are tracking over the course of the training. We'll dive into some of these metrics in the next section when we evaluate how the model performs on the test set.

In [103]:
res = []

metrics_to_plot = [
    ("loss", "Loss"),
    ("precision", "Precision"),
    ("recall", "Recall"),
    ("AUC", "Area under ROC curve"),
    ("prc", "Area under PR curve"),
]
fig = make_subplots(rows=len(metrics_to_plot), cols=1)

for metric, name in metrics_to_plot:
    fig = go.Figure(
        data=[
            go.Scatter(
                x=training_history.epoch,
                y=training_history.history[metric],
                mode="lines",
                name="Training",
            ),
            go.Scatter(
                x=training_history.epoch,
                y=training_history.history["val_" + metric],
                mode="lines",
                line={"dash": "dash"},
                name="Validation",
            ),
        ]
    )
    fig.update_yaxes(title=name)
    fig.update_xaxes(title="Epoch")

    if (metric, name) == metrics_to_plot[0]:
        fig.update_layout(
            height=250, title="Training history", margin={"b": 0, "t": 50}
        )
    else:
        fig.update_layout(height=200, margin={"b": 0, "t": 0})
    fig.show()

**Model performance**: 
The model has been trained to maximize its performance on our training data. How does it actually perform when making predictions on the training and test (aka unseen) data?

This isn't a matter of looking at a single number like the loss function.

The model outputs a number between 0 and 1. Although it isn't technically true, it helps to think of this as the probability that a transaction is fraudulent. A histogram of the outputs of predictions on the training set is below. The correct label is also shown (0 for legitimate, 1 for fraudulent). The amount of samples for each label is slightly biased to make the scales similar.

In [108]:
train_predictions = model.predict(train_features, batch_size=BATCH_SIZE)
test_predictions = model.predict(test_features, batch_size=BATCH_SIZE)

predictions_df = pd.DataFrame(
    {"Prediction": train_predictions.ravel(), "Label": train_labels.ravel()}
)

legitimate_sample_size = min(5000, len(predictions_df[predictions_df["Label"] == 0]))
fraudulent_sample_size = min(500, len(predictions_df[predictions_df["Label"] == 1]))

predictions_df = pd.concat([
    predictions_df[predictions_df["Label"] == 0].sample(legitimate_sample_size, random_state=0),
    predictions_df[predictions_df["Label"] == 1].sample(fraudulent_sample_size, random_state=0)
])

fig = px.histogram(
    predictions_df,
    x="Prediction",
    title="Prediction values",
    color="Label",
    marginal="box",
    labels={"0": "Legitimate", "1": "Fraudulent"},
)
fig.update_traces(opacity=0.75)
fig.update_layout(barmode="overlay")
fig.show()



In reality, we'd need to decide a threshold for a transaction being marked as fraudulent. Rather than working with this number, we can use a metric we actually care about. This is where something like the receiver operating characteristic (ROC) curve comes in.

The ROC curve maps the rate of true positives against the rate of false positives. This allows us to ask the question: "how many false positives are we willing to tolerate?" or "how many false negatives are we willing to tolerate?". In the context of fraud, this means either restricting a legitimate person's credit card, or letting a fraudster get away.

This ROC curve is one way to evaluate models. A perfect ROC curve starts at 100% true positives with 0% false negatives. In practice, models can't do this.

Looking at the ROC curve below, some things are immediately obvious:

1. The training set performs considerably better than the test set (this is expected)
2. We can catch a majority of fraudulent transactions with a very low false positive rate.
3. If we set the highest false-positive rate we are willing to tolerate to 1% (which is really high for something like credit card transactions), we can catch around 65% of fraudulent transactions. Anything beyond this point is diminishing returns.

In [109]:
def make_roc_df(name, predictions, labels):
    fp, tp, _ = sklearn.metrics.roc_curve(labels, predictions)
    return pd.DataFrame({"fp": fp*100, "tp": tp*100, "Dataset": name})

roc_df = pd.concat(
    [
        make_roc_df("Training", train_predictions, train_labels),
        make_roc_df("Test", test_predictions, test_labels),
    ]
)

fig = px.line(
    roc_df,
    title="ROC Curve",
    x="fp",
    y="tp",
    color="Dataset",
    labels={"fp": "False Positives (%)", "tp": "True Positives (%)"},
)

fig.update_yaxes(range=[60, 100])
fig.update_traces(line={"dash": "dash"}, selector={"name": "test"})
fig.show()