# Credit Card Fraud Detection
## Final Project for the Codigo Facilito's Machine Learning 2023 Bootcamp

## Problem Definition and Objectives

### Problem Definition

Nowadays it is very easy for malicious actors to gain access to illegaly obtained banking accounts authentication databases that allows them to access unsuspected victim's financial assets without their knowledge, until it's too late. To minimize the impact of this, different techniques can be applied to detect when a user's identity has been comprompised and their assets are being accessed illegaly.

One of them are fraud detection systems, which are able to learn users' baking transactions behaviour - which means learning when (at what time of day), where (the type of stores they commonly visit) and how (purchasing online vs swiping a physical card) they usually perform transactions with their credit cards - in order to detect when a new transaction doen't follow the pattern previously learned, flagging such transactions as fraudulent, and require the user to perform additional verification for the transaction to go through.

### Problem Relevance


Just to highlight the importance of fraud detection systems, according to the [Security.org 2023 Credit Card Fraud Report](https://www.security.org/digital-safety/credit-card-fraud-report/):
- 65% of credit and credit card holders have been fraud victims at some point in their lives, up from 58 percent in 2022. This equates to about 151 million victims of fraud in the United States alone.
- An increasing number of Americans have been victimized multiple times: in 2022, 44 percent of credit card users reported having two or more fraudulent charges, compared to 35 percent in 2021.
- Since 2021, the median fraudulent charge has climbed by about 27 percent (rising to $79 in 2023). This equates to about $12 billion in total attempted fraudulent charges.



### Key Stakeholders

The main stakeholders in this project are:

1) The banking institution(s) that would provide banking transaction data required to train the machine learning model.
2) The user(s) allowing for their banking transactions data to be used to train the model.
3) The FTC (in the US) and other regulatory institutions that would need to verify and approve the use of the users data to train the model, and approve the use of the model.

### Objectives

The goal of this project is to build a machine learning model that allows the detection of fraudulent credit card transactions by training it with a credit card transaction dataset, and build a feature engineering and training pipeline that will allow the model to be re-trained in the future.

- The final machine learning model should provide at least 80% of fraud detection accuracy.
- A feature egineering and training pipeline should be used to allow future training of the model.
- An application that allows a user to enter a dummy transaction and verify its authenticity.

### Preparation Steps

1) Identify a public credit card transaction dataset suitable for an Exploratory Data Analysis, that allows the clear and easy identification of each column's information. Some datasets available in Kaggle contain columns that were already scaled or processed using PCA analysis, and therefore are not useful for this project's goals.
2) Research the different machine learning models that are best suited for detecting fraudulent credit card transactions.
3) Select a suitable online platform to deploy the machine learning model that is free to use.

### Dataset

- We will use the [Credit Card Transactions Kaggle Dataset](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions), because it contains a good amount of data to train our model - approximately 24M transactions of 2000 users generated by a multi-agent virtual world simulation performed by IBM - and its columns are easy to identify and work with because they are not scaled or obfuscated in any way that could result in us not being able to find correlations in the data.

- The dataset contains the following columns:
    1) 'User': An ID of the user.
    2) 'Card': An ID for the user's card, some users have multiple cards.
    3) 'Year', 'Month', 'Day', 'Time': The timestamp of the transaction. 
    4) 'Amount': The amount of the transaction.
    5) 'Use Chip': 'Swipe Transaction' if a physical card was used to perform the transaction, or 'Online Transaction' if the transaction was performed online.
    6) 'Merchant Name': The ID of the store where the transaction was made.
    7) 'Merchant City', 'Merchant State', 'Zip': The store's location.
    8) 'MCC': The [Merchant Category Code](https://www.investopedia.com/terms/m/merchant-category-codes-mcc.asp).
    9) 'Errors?': Any error(s) during the transaction, eg. 'Insufficient Balance', 'Technical Glitch', etc.
    10) 'Is Fraud?: A label indicating if the transaction was fraudulent or not.

### Deployment Plan

For deploying our model we will use BentoML, because it provides a very robust framework to serve and deploy machine learning models in the cloud. We will deploy our model into a free-tier virtual machine in the Google Cloud's Compute Engine, provided it has enough resources to run our model and server our model. In case it doesn't, then we will not deploy our model to the cloud, and we will store our model locally instead.

## Exploratory Data Analysis

We will begin by loading the credit card transaction dataset into a polars DataFrame and confirm the contents of the file have been loaded successfully.

In [1]:
import polars as pl
pl.Config(tbl_rows=10)

data_df = pl.read_csv("../data/credit_card_transactions-ibm_v2.csv")
data_df

User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
i64,i64,i64,i64,i64,str,str,str,i64,str,str,f64,i64,str,str
0,0,2002,9,1,"""06:21""","""$134.09""","""Swipe Transact…",3527213246127876953,"""La Verne""","""CA""",91750.0,5300,,"""No"""
0,0,2002,9,1,"""06:42""","""$38.48""","""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""",91754.0,5411,,"""No"""
0,0,2002,9,2,"""06:22""","""$120.34""","""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""",91754.0,5411,,"""No"""
0,0,2002,9,2,"""17:45""","""$128.95""","""Swipe Transact…",3414527459579106770,"""Monterey Park""","""CA""",91754.0,5651,,"""No"""
0,0,2002,9,3,"""06:23""","""$104.71""","""Swipe Transact…",5817218446178736267,"""La Verne""","""CA""",91750.0,5912,,"""No"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1999,1,2020,2,27,"""22:23""","""$-54.00""","""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""",3054.0,5541,,"""No"""
1999,1,2020,2,27,"""22:24""","""$54.00""","""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""",3054.0,5541,,"""No"""
1999,1,2020,2,28,"""07:43""","""$59.15""","""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""",3054.0,4121,,"""No"""
1999,1,2020,2,28,"""20:10""","""$43.12""","""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""",3054.0,4121,,"""No"""


First we can identify a few columns that need some work.
- The Time column needs to be transformed into "Hour" and "Minute" columns, instead of a string.
- We need to convert the Amount column into an actual number instead of a string.
- Let's make the Merchant Name a string, and then make it categorical.
- The Merchant State is empty when the transaction was online, so we'll fill the nulls with "ONLINE".
- The Zip column is empty when the transaction was online, so let's convert it to a string and replace null with "ONLINE".
- Let's change the null values in Errors? to "N/A".
- Finally, let's convert the IsFraud? column to 1 if it's fraud, and 0 if not.

In [2]:
new_data_df = data_df.with_columns(
    Hour = pl.col("Time").map_elements(
        lambda x: x[:2]).cast(pl.Int64, strict=True),
    Minute = pl.col("Time").map_elements(
        lambda x: x[3:]).cast(pl.Int64, strict=True),
).drop("Time").with_columns(
    pl.col("Amount").map_elements(
        lambda x: x.replace("$", "")).cast(pl.Float64, strict=True)
).with_columns(
    pl.col("Merchant State").fill_null("ONLINE")
).with_columns(
    pl.col("Zip").cast(pl.String, strict=True).fill_null("ONLINE")
).with_columns(
    pl.col("Errors?").fill_null(value="No")
).with_columns(
    pl.col("Is Fraud?").map_elements(
        lambda x: 0 if x == "No" else 1
    )
)
new_data_df

User,Card,Year,Month,Day,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?,Hour,Minute
i64,i64,i64,i64,i64,f64,str,i64,str,str,str,i64,str,i64,i64,i64
0,0,2002,9,1,134.09,"""Swipe Transact…",3527213246127876953,"""La Verne""","""CA""","""91750.0""",5300,"""No""",0,6,21
0,0,2002,9,1,38.48,"""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""","""91754.0""",5411,"""No""",0,6,42
0,0,2002,9,2,120.34,"""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""","""91754.0""",5411,"""No""",0,6,22
0,0,2002,9,2,128.95,"""Swipe Transact…",3414527459579106770,"""Monterey Park""","""CA""","""91754.0""",5651,"""No""",0,17,45
0,0,2002,9,3,104.71,"""Swipe Transact…",5817218446178736267,"""La Verne""","""CA""","""91750.0""",5912,"""No""",0,6,23
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1999,1,2020,2,27,-54.0,"""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""","""3054.0""",5541,"""No""",0,22,23
1999,1,2020,2,27,54.0,"""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""","""3054.0""",5541,"""No""",0,22,24
1999,1,2020,2,28,59.15,"""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""","""3054.0""",4121,"""No""",0,7,43
1999,1,2020,2,28,43.12,"""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""","""3054.0""",4121,"""No""",0,20,10


Now, let's see how the Amount varies between legitimate and fraudulent transactions. For that we will use altair to chart the top 100 amounts used for transactions, with a function called amount_counts_chart.

In [3]:
import altair as alt
from altair import Chart
from polars import DataFrame


def amount_counts_chart(data_df: DataFrame, source: str) -> Chart:
    data_df = data_df.top_k(100, by="Count")
    top_amount = data_df.top_k(1, by="Count").to_numpy()[0][1]
    return alt.Chart(
        data_df,
        title=f"Top 100 {source} Amounts").mark_bar().encode(
        x=alt.X('Amount', axis=alt.Axis(labelAngle=-45)),
        y="Count",
        color=alt.condition(
            alt.datum.Count == top_amount,
            alt.value('orange'),
            alt.value('steelblue'))
    ).properties(width=400)

legit = new_data_df.select(
    pl.col("Amount"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).select(
        pl.col("Amount").unique(), pl.col("Amount").unique_counts().alias("Count")
    )

amount_counts_chart(legit, "Legitimate")

In [4]:
fraud = new_data_df.select(
    pl.col("Amount"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).select(
        pl.col("Amount").unique(), pl.col("Amount").unique_counts().alias("Count")
    )
amount_counts_chart(fraud, "Fraudulent")

As we can see, the amounts in fraudulent transactions are rarely repeated in comparison with the legitimate ones.

Now let's see how the Use Chip column varies between legitimate and fraudulent transactions with the following use_chip_chart function.

In [5]:
def use_chip_chart(data_df: DataFrame, source: str) -> Chart:
    return alt.Chart(
        data_df,
        title=f"{source} Use Chip").mark_arc(innerRadius=70).encode(
        color=alt.Color("Use Chip", title="Use Chip", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_chip = new_data_df.select(
    pl.col("Use Chip"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Use Chip").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_chip = new_data_df.select(
    pl.col("Use Chip"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Use Chip").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
use_chip_chart(legit_chip, "Legitimate")

In [6]:
use_chip_chart(fraud_chip, "Fraudulent")

As we can see, cards are used online more frequently for fraudulent transactions than for legitimate ones, where swipe transactions are more common.

Now let's see if the locations vary between legitimate and fraudulent transactions with the following top_10_merchant_state_chart and top_10_merchant_city_chart functions.

In [7]:
def top_10_merchant_state_chart(data_df: DataFrame, source: str) -> Chart:
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} Merchant State").mark_arc(innerRadius=70).encode(
        color=alt.Color("Merchant State", title="Merchant State", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

def top_10_merchant_city_chart(data_df: DataFrame, source: str) -> Chart:
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} Merchant City").mark_arc(innerRadius=70).encode(
        color=alt.Color("Merchant City", title="Merchant State", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_state = new_data_df.select(
    pl.col("Merchant State"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Merchant State").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
legit_city = new_data_df.select(
    pl.col("Merchant City"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Merchant City").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_state = new_data_df.select(
    pl.col("Merchant State"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Merchant State").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_city = new_data_df.select(
    pl.col("Merchant City"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Merchant City").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
print(legit_state.sort("Count", descending=True))
top_10_merchant_state_chart(legit_state, "Legitimate")

shape: (223, 2)
┌────────────────┬─────────┐
│ Merchant State ┆ Count   │
│ ---            ┆ ---     │
│ str            ┆ u32     │
╞════════════════╪═════════╡
│ ONLINE         ┆ 2702472 │
│ CA             ┆ 2591079 │
│ TX             ┆ 1792993 │
│ FL             ┆ 1458385 │
│ NY             ┆ 1446624 │
│ …              ┆ …       │
│ Togo           ┆ 2       │
│ Tonga          ┆ 2       │
│ Kiribati       ┆ 1       │
│ Paraguay       ┆ 1       │
│ Botswana       ┆ 1       │
└────────────────┴─────────┘


In [8]:
print(legit_city.sort("Count", descending=True))
top_10_merchant_city_chart(legit_city, "Legitimate")

shape: (13_397, 2)
┌───────────────┬─────────┐
│ Merchant City ┆ Count   │
│ ---           ┆ ---     │
│ str           ┆ u32     │
╞═══════════════╪═════════╡
│ ONLINE        ┆ 2702472 │
│ Houston       ┆ 246027  │
│ Los Angeles   ┆ 180460  │
│ Miami         ┆ 178628  │
│ Brooklyn      ┆ 155411  │
│ …             ┆ …       │
│ Lyndon Center ┆ 1       │
│ Allerton      ┆ 1       │
│ Asuncion      ┆ 1       │
│ Poyen         ┆ 1       │
│ Hungry Horse  ┆ 1       │
└───────────────┴─────────┘


In [9]:
print(fraud_state.sort("Count", descending=True))
top_10_merchant_state_chart(fraud_state, "Fraudulent")

shape: (61, 2)
┌────────────────┬───────┐
│ Merchant State ┆ Count │
│ ---            ┆ ---   │
│ str            ┆ u32   │
╞════════════════╪═══════╡
│ ONLINE         ┆ 18349 │
│ Italy          ┆ 4682  │
│ OH             ┆ 878   │
│ CA             ┆ 751   │
│ Algeria        ┆ 629   │
│ …              ┆ …     │
│ WV             ┆ 6     │
│ ME             ┆ 5     │
│ India          ┆ 5     │
│ ND             ┆ 2     │
│ DE             ┆ 2     │
└────────────────┴───────┘


In [10]:
print(fraud_city.sort("Count", descending=True))
top_10_merchant_city_chart(fraud_city, "Fraudulent")

shape: (1_973, 2)
┌────────────────┬───────┐
│ Merchant City  ┆ Count │
│ ---            ┆ ---   │
│ str            ┆ u32   │
╞════════════════╪═══════╡
│ ONLINE         ┆ 18349 │
│ Rome           ┆ 4683  │
│ Algiers        ┆ 629   │
│ Port au Prince ┆ 375   │
│ Strasburg      ┆ 322   │
│ …              ┆ …     │
│ Sioux Falls    ┆ 1     │
│ Roscoe         ┆ 1     │
│ Munroe Falls   ┆ 1     │
│ Westland       ┆ 1     │
│ Dekalb         ┆ 1     │
└────────────────┴───────┘


With the previous charts we can see that online transactions are dominant for the merchant state and city in both leigitimate and fraudulent transactions. However we can also see that for fraudulent transactions, the states and cities include locations outside of the US, while for legitimate transactions they are mostly in the US except for the online transactions. This can be mostly explained because this dataset was generated to simulate US based people, so we can expect most of their legitimate transactions to come from US locations.

Now let's see if the Errors? column varies significantly between legitimate and fraudulent transactions, using an errors_chart function.

In [26]:
def errors_chart(data_df: DataFrame, source: str) -> Chart:
    return alt.Chart(
        data_df,
        title=f"{source} Errors?").mark_arc(innerRadius=70).encode(
        color=alt.Color("Errors?", title="Errors?", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_errors = new_data_df.select(
    pl.col("Errors?"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0).filter(pl.col("Errors?") != "No"
    ).group_by("Errors?").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_errors = new_data_df.select(
    pl.col("Errors?"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1).filter(pl.col("Errors?") != "No"
    ).group_by("Errors?").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
print(legit_errors.sort("Count", descending=True))
errors_chart(legit_errors, "Legitimate")

shape: (23, 2)
┌───────────────────────────────────┬────────┐
│ Errors?                           ┆ Count  │
│ ---                               ┆ ---    │
│ str                               ┆ u32    │
╞═══════════════════════════════════╪════════╡
│ Insufficient Balance              ┆ 242387 │
│ Bad PIN                           ┆ 58616  │
│ Technical Glitch                  ┆ 48094  │
│ Bad Card Number                   ┆ 13216  │
│ Bad Expiration                    ┆ 10596  │
│ …                                 ┆ …      │
│ Bad CVV,Technical Glitch          ┆ 20     │
│ Bad Zipcode,Insufficient Balance  ┆ 13     │
│ Bad Zipcode,Technical Glitch      ┆ 7      │
│ Bad Card Number,Bad Expiration,I… ┆ 2      │
│ Bad Card Number,Bad Expiration,T… ┆ 1      │
└───────────────────────────────────┴────────┘


In [27]:
print(fraud_errors.sort("Count", descending=True))
errors_chart(fraud_errors, "Fraudulent")

shape: (14, 2)
┌───────────────────────────────────┬───────┐
│ Errors?                           ┆ Count │
│ ---                               ┆ ---   │
│ str                               ┆ u32   │
╞═══════════════════════════════════╪═══════╡
│ Insufficient Balance              ┆ 396   │
│ Bad PIN                           ┆ 302   │
│ Bad CVV                           ┆ 280   │
│ Bad Expiration                    ┆ 120   │
│ Bad Card Number                   ┆ 105   │
│ …                                 ┆ …     │
│ Bad Expiration,Bad CVV            ┆ 2     │
│ Bad Expiration,Technical Glitch   ┆ 2     │
│ Bad PIN,Technical Glitch          ┆ 1     │
│ Bad CVV,Technical Glitch          ┆ 1     │
│ Bad Expiration,Insufficient Bala… ┆ 1     │
└───────────────────────────────────┴───────┘


With the previous charts we can see that Insufficient Balance is the most common error for both legitimate and fraudulent transactions.

Now let's see how the Merchant Category Codes (MCC) vary between fraudulent and legitimate transactions using a top_10_mcc_chart function.

In [13]:
def top_10_mcc_chart(data_df: DataFrame, source: str) -> Chart:
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} MCC").mark_arc(innerRadius=70).encode(
        color=alt.Color("MCC", title="MCC", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_mcc = new_data_df.select(
    pl.col("MCC"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("MCC").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_mcc = new_data_df.select(
    pl.col("MCC"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("MCC").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )

print(legit_mcc.sort("Count", descending=True))
top_10_mcc_chart(legit_mcc, "Legitimate")

shape: (109, 2)
┌──────┬─────────┐
│ MCC  ┆ Count   │
│ ---  ┆ ---     │
│ i64  ┆ u32     │
╞══════╪═════════╡
│ 5411 ┆ 2859795 │
│ 5499 ┆ 2680349 │
│ 5541 ┆ 2638628 │
│ 5812 ┆ 1797593 │
│ 5912 ┆ 1406579 │
│ …    ┆ …       │
│ 3075 ┆ 626     │
│ 3007 ┆ 625     │
│ 3144 ┆ 579     │
│ 5733 ┆ 369     │
│ 4411 ┆ 317     │
└──────┴─────────┘


In [14]:
print(fraud_mcc.sort("Count", descending=True))
top_10_mcc_chart(fraud_mcc, "Fraudulent")

shape: (98, 2)
┌──────┬───────┐
│ MCC  ┆ Count │
│ ---  ┆ ---   │
│ i64  ┆ u32   │
╞══════╪═══════╡
│ 5311 ┆ 4824  │
│ 5300 ┆ 2201  │
│ 5310 ┆ 2152  │
│ 4829 ┆ 1607  │
│ 5912 ┆ 1057  │
│ …    ┆ …     │
│ 7802 ┆ 15    │
│ 7230 ┆ 14    │
│ 7531 ┆ 11    │
│ 8041 ┆ 5     │
│ 8049 ┆ 3     │
└──────┴───────┘


We can see that the most used Merchant Category Codes vary between fraudulent and legitimate transactions.

## Model Creation and Evaluation

### Splitting our Dataset

To be able to create, train and deploy our machine learning model, first we need to create a function that will split our dataset in training, validation and test samples.

For that purpose we will create a split_dataset function that receives our dataset, and the percentages of data that we will use to train our model, and to validate and test our model's performance.

The first thing this function does is separating our dependent variable in the "Is Fraud?" column from the rest of the columns.

This function uses the train_test_split function from sklearn.model_selection, and we will call it with a random_state=0 value so it always returns the same data split every time we call it.

We will need to convert the columns with categorical data that are not integer values into their categorical "physical" (integer) values.

In [15]:
from sklearn.model_selection import train_test_split


def split_dataset(data_df: DataFrame, train_proportion: float, test_proportion: float):
    data_df = data_df.with_columns(
        pl.col("Use Chip").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant Name").cast(pl.String).cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant City").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant State").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Zip").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Errors?").cast(pl.Categorical).to_physical()
    )
    is_fraud = data_df.select(pl.col('Is Fraud?'))
    features = data_df.drop('Is Fraud?')

    original_count = len(data_df)
    training_size = int(original_count * train_proportion)
    test_size = int((1 - train_proportion) * test_proportion * training_size)

    train_x, rest_x, train_y, rest_y = train_test_split(features,
                                                        is_fraud,
                                                        train_size=training_size,
                                                        random_state=0)
    validate_x, test_x, validate_y, test_y  = train_test_split(rest_x,
                                                               rest_y,
                                                               train_size=test_size,
                                                               random_state=0)

    return (train_x, train_y), (validate_x, validate_y), (test_x, test_y)

### Pipeline Creation

Next, we will create our feature engineering and model training pipeline.

#### Feature Engineering

- We will use a RobustScaler to scale the Amount column.
- Normally, we would use OneHotEncoder for the categorical columns, however, using it would add too many columns to our data because the categorical columns have way too many categories, so to keep things more manageable for our model we will use a [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder) to scale the Card, Merchant Name, Merchant City, Merchant State, Zip, MCC, Errors?, Hour, Minute, and Use Chip colums.

#### Modeling
- For our model, we will use a RandomForestClassifier with 10 estimators.
- We will measure both the accuracy and recall of our model, however, the most important metric will be the recall, because we really need to be able to find all the positive fraudulent transactions in our dataset, accuracy alone wouldn't be enough for this case.

In [28]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import RobustScaler, TargetEncoder


def build_pipeline():
    # Target encoder
    internal_target_encoding = TargetEncoder(smooth="auto")
    columns_to_encode = [
        "Card",
        "Merchant Name",
        "Merchant State",
        "Merchant City",
        "Zip",
        "MCC",
        "Errors?",
        "Hour",
        "Minute",
        "Use Chip"
    ]

    target_encoding = ColumnTransformer([
        (
            'target_encode',
            internal_target_encoding,
            columns_to_encode
        )
    ])

    # Scaler
    internal_scaler = RobustScaler()
    columns_to_scale = ["Amount"]

    scaler = ColumnTransformer([
        ("scaler", internal_scaler, columns_to_scale)
    ])

    # Full pipeline
    feature_engineering_pipeline  = Pipeline([
        (
            "features",
            FeatureUnion([
                ('categories', target_encoding),
                ('scaled', scaler)
            ])
        )
    ])

    # Machine learning model
    model = RandomForestClassifier(n_estimators=10, verbose=1, n_jobs=10)


    # Full pipeline
    final_pipeline = Pipeline([
        ("feature_engineering", feature_engineering_pipeline),
        ("model", model)
    ])

    return final_pipeline

Before creating a function that will take our pipeline to train and validate our model, let's see how many cases of fraud we have vs no fraud in our dataset, to see if there's an imbalance of cases.

In [29]:
new_data_df['Is Fraud?'].value_counts()

Is Fraud?,count
i64,u32
0,24357143
1,29757


As we can see the legitimate transactions are way overreperesented, so we need to fix that in our model_training_validation function. For that we will use SMOTE to oversample the fraudulent cases in our dataset.

In [30]:
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, recall_score


def model_training_validation(final_pipeline: Pipeline,
                              train_x: DataFrame, train_y: DataFrame,
                              validate_x: DataFrame, validate_y: DataFrame,
                              test_x: DataFrame, test_y: DataFrame):
    """ Before training and testing our model we will need to fix the dataset's
    fraud/legit case imbalance using SMOTE's fit_resample method.
    """
    sm = SMOTE(random_state=0)

    train_x_res, train_y_res = sm.fit_resample(train_x.to_pandas(), train_y.to_pandas())
    validate_x_res, validate_y_res = sm.fit_resample(validate_x.to_pandas(), validate_y.to_pandas())
    test_x_res, test_y_res = sm.fit_resample(test_x.to_pandas(), test_y.to_pandas())

    final_pipeline.fit(train_x_res, train_y_res.to_numpy().ravel())

    train_pred_y = final_pipeline.predict(train_x_res)
    validate_pred_y = final_pipeline.predict(validate_x_res)
    test_pred_y = final_pipeline.predict(test_x_res)

    train_accuracy = accuracy_score(train_pred_y, train_y_res.to_numpy().ravel())
    train_recall = recall_score(train_pred_y, train_y_res.to_numpy().ravel())

    validate_accuracy = accuracy_score(validate_pred_y, validate_y_res.to_numpy().ravel())
    validate_recall = recall_score(validate_pred_y, validate_y_res.to_numpy().ravel())

    test_accuracy = accuracy_score(test_pred_y, test_y_res.to_numpy().ravel())
    test_recall = recall_score(test_pred_y, test_y_res.to_numpy().ravel())

    print('Train accuracy', train_accuracy)
    print('Train recall', train_recall)

    print('Validate accuracy', validate_accuracy)
    print('Validate recall', validate_recall)

    print('Test accuracy', test_accuracy)
    print('Test recall', test_recall)

    metrics = {
        'train_accuracy': train_accuracy,
        'train_recall': train_recall,
        'validate_accuracy': validate_accuracy,
        'validate_recall': validate_recall,
        'test_accuracy': test_accuracy,
        'test_recall': test_recall,
    }

    return final_pipeline, metrics

Finally, we will build a full_training_run function that will call our split_dataset, build_pipeline and model_training_validation functions to train our model and obtain the validation and test results. It will also use joblib's dump bethod to save our model under /model/inference_pipeline.joblib when the write_model parameter is True, otherwise the model won't be re-written.

In [31]:
import os

from joblib import dump


def full_training_run(write_model: bool):
    training_data, validate_data, test_data = split_dataset(new_data_df,
                                                            train_proportion=0.6,
                                                            test_proportion=0.5)

    training_pipeline = build_pipeline()

    training_pipeline, metrics = model_training_validation(
        training_pipeline,
        train_x=training_data[0],
        train_y=training_data[1],
        validate_x=validate_data[0],
        validate_y=validate_data[1],
        test_x=test_data[0],
        test_y=test_data[1]
    )
    if write_model:
        if os.path.exists("../model/inference_pipeline.joblib"):
            os.remove("../model/inference_pipeline.joblib")
        dump(training_pipeline, "../model/inference_pipeline.joblib",
             compress=9)

    return training_pipeline

In [32]:
full_training_run(False)

[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:  1.5min remaining:  5.9min
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:  1.5min finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    3.0s remaining:   12.1s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    3.3s finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    0.9s remaining:    3.5s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    1.0s finished
[Parallel(n_jobs=10)]: Using backend ThreadingBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   2 out of  10 | elapsed:    7.1s remaining:   28.3s
[Parallel(n_jobs=10)]: Done  10 out of  10 | elapsed:    7.7s finished


Train accuracy 0.9977676677492584
Train recall 0.9979673410981533
Validate accuracy 0.9873954639936312
Validate recall 0.9968447720311919
Test accuracy 0.9791608691257095
Test recall 0.9967904310433081
