# Credit Card Fraud Detection
## Final Project for the Codigo Facilito's Machine Learning 2023 Bootcamp

## Problem Definition and Objectives

### Problem Definition

Nowadays it is very easy for malicious actors to illegaly obtain banking accounts' authentication information that allows them to access unsuspected victim's financial assets without their knowledge, until it's too late. To minimize the impact of this, different techniques can be applied to detect when a user's identity has been comprompised and their assets are being accessed illegaly.

One of them are fraud detection systems, which are able to learn users' banking transactions' behaviour - which means learning the usual amounts, and when (at what time of day), where (the type of stores they commonly visit), and how (purchasing online vs swiping a physical card) they usually perform transactions with their credit cards - in order to detect when a new transaction doen't follow the previously learned patterns, flagging such transactions as fraudulent, and require the user to perform additional verification for the transaction to go through.

### Problem Relevance


Just to highlight the importance of fraud detection systems, according to the [Security.org 2023 Credit Card Fraud Report](https://www.security.org/digital-safety/credit-card-fraud-report/):
- 65% of credit and credit card holders have been fraud victims at some point in their lives, up from 58 percent in 2022. This equates to about 151 million fraud victims in the United States alone.
- An increasing number of Americans have been victimized multiple times: in 2022, 44 percent of credit card users reported having two or more fraudulent charges, compared to 35 percent in 2021.
- Since 2021, the median fraudulent charge has climbed by about 27 percent (rising to $79 in 2023). This equates to about $12 billion in total attempted fraudulent charges.



### Key Stakeholders

The main stakeholders in this project are:

1) The banking institution(s) that would use their clients' banking transaction data required to train the machine learning model.
2) The banks' clients allowing for their transactions' data to be used to train the model.
3) The FTC (in the US) and other regulatory institutions that would need to verify and approve the use of the banks' clients' data to train the model, and also approve the use of such model.

### Objectives

The goal of this project is to build a machine learning model that allows the detection of fraudulent credit card transactions by training it with a credit card transaction dataset, and build a feature engineering and training pipeline that will allow the model to be re-trained in the future.

- The final machine learning model should provide at least 10% better recall than a baseline model to be defined later in the project.
- A feature egineering and training pipeline should be used to allow future training of the model.

### Preparation Steps

1) Decide if using a machine learning model is the right solution for this problem. In this case, because we want to be more proactive when fraudulent transactions happen and not needing to wait for the user to report them, we believe using a machine learning model is an appropiate solution for this problem.
2) Identify a public credit card transaction dataset suitable for an Exploratory Data Analysis, that allows the clear and easy identification of each column's information. Some datasets available in Kaggle contain columns that were already scaled or processed using PCA analysis, and therefore are not useful for this project's goals.
3) Research the different machine learning models that are best suited for detecting fraudulent credit card transactions.
4) Research different methods to improve the dataset if it is unbalanced in terms of the number of fraudulent transactions vs legitimate ones.
5) Research different feature engineering techniques that would allow us to reduce the size of the datasets for our model.
6) Evaluate the different deployment deployment frequencies and strategies available, and design and build the model's feature engineering and training pipeline model with them in mind.

### Dataset

- We will use the [Credit Card Transactions Kaggle Dataset](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions), because it contains a good amount of data to train our model - approximately 24M transactions of 2000 users generated by a multi-agent virtual world simulation performed by IBM - and its columns are easy to identify and work with because they are not scaled or obfuscated in any way that could result in us not being able to find correlations in the data.

- The dataset contains the following columns:
    1) 'User': An ID of the user.
    2) 'Card': An ID for the user's card, some users have multiple cards.
    3) 'Year', 'Month', 'Day', 'Time': The timestamp of the transaction. 
    4) 'Amount': The amount of the transaction.
    5) 'Use Chip': 'Swipe Transaction' if a physical card was used to perform the transaction, or 'Online Transaction' if the transaction was performed online.
    6) 'Merchant Name': The ID of the store where the transaction was made.
    7) 'Merchant City', 'Merchant State', 'Zip': The store's location.
    8) 'MCC': The [Merchant Category Code](https://www.investopedia.com/terms/m/merchant-category-codes-mcc.asp).
    9) 'Errors?': Any error(s) during the transaction, eg. 'Insufficient Balance', 'Technical Glitch', etc.
    10) 'Is Fraud?: A label indicating if the transaction was fraudulent or not.

### Deployment Plan

#### Deployment Frequency

Because we don't need to process the bank transactions and tell if they are legitimate or fraudulent as soon as they happen, we can use an Streaming Deployment for this solution. This will allow us to have more flexibility when choosing the right model for this problem, and also the response times are better than with other Deployment approaches.

One of the other two options, making a Batch Deployment of the model, would not be fit for this purpose since taking groups of transactions and making predictions about them later would probably take longer than in a streaming deployment, and would probably make for a poor user experience, so using Batch Deployment for this problem would fall short for our needs.

And the remaining option, using Online Deployment, may work for this purpose but it would be an overkill for this scenario, since we really don't need to decide if a transaction is fraudulent or not immediately after it has been made - the final decision could be done even later, after a more through investigation, so using an Online Deployment for this solution would not be a good use of our resources to achieve it.

#### Deployment Strategy

After considering the different deployment strategies available, using a Rolling/Ramped Update strategy should be optimal for this case, because it would allow the users to keep making transactions without downtime, and for this case in particular it's not crucial to guarantee the user is using a particular version of our model (the new one "B" vs the old one "A"), as long as the users can keep performing transactions and them getting evaluated.

## Exploratory Data Analysis

We will begin by loading the credit card transaction dataset into a polars DataFrame and confirm the contents of the file have been loaded successfully.

In [1]:
import polars as pl  # noqa: D100
from polars import DataFrame

pl.Config(tbl_rows=10)

def load_data(filename: str) -> DataFrame:
    """ Read CSV file with our dataset. """
    data_df = pl.read_csv(filename)
    return data_df

data_df = load_data("../data/credit_card_transactions-ibm_v2.csv")
data_df

User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?
i64,i64,i64,i64,i64,str,str,str,i64,str,str,f64,i64,str,str
0,0,2002,9,1,"""06:21""","""$134.09""","""Swipe Transact…",3527213246127876953,"""La Verne""","""CA""",91750.0,5300,,"""No"""
0,0,2002,9,1,"""06:42""","""$38.48""","""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""",91754.0,5411,,"""No"""
0,0,2002,9,2,"""06:22""","""$120.34""","""Swipe Transact…",-727612092139916043,"""Monterey Park""","""CA""",91754.0,5411,,"""No"""
0,0,2002,9,2,"""17:45""","""$128.95""","""Swipe Transact…",3414527459579106770,"""Monterey Park""","""CA""",91754.0,5651,,"""No"""
0,0,2002,9,3,"""06:23""","""$104.71""","""Swipe Transact…",5817218446178736267,"""La Verne""","""CA""",91750.0,5912,,"""No"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
1999,1,2020,2,27,"""22:23""","""$-54.00""","""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""",3054.0,5541,,"""No"""
1999,1,2020,2,27,"""22:24""","""$54.00""","""Chip Transacti…",-5162038175624867091,"""Merrimack""","""NH""",3054.0,5541,,"""No"""
1999,1,2020,2,28,"""07:43""","""$59.15""","""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""",3054.0,4121,,"""No"""
1999,1,2020,2,28,"""20:10""","""$43.12""","""Chip Transacti…",2500998799892805156,"""Merrimack""","""NH""",3054.0,4121,,"""No"""


First we can identify a few columns that need some work, for that we will create a function called preprocess_dataset that will do the following:
- Transform the Time column into "Hour" and "Minute" columns, instead of a string.
- Convert the Amount column into an actual number instead of a string.
- Make the Merchant Name a string, and then make it categorical.
- Fill the nulls with "ONLINE" in the Merchant State column.
- Convert the Zip to a string and replace the nulls with "ONLINE".
- Fill the nulls in Errors? with "No".
- Convert the IsFraud? column to binary format: 1 if it's fraud, and 0 if legitimate.

In [2]:
def preprocess_dataset(data_df: DataFrame) -> DataFrame:
    """ Preprocess dataset for EDA. """
    new_data_df = data_df.with_columns(
        Hour = pl.col("Time").map_elements(
            lambda x: x[:2]).cast(pl.Int64, strict=True),
        Minute = pl.col("Time").map_elements(
            lambda x: x[3:]).cast(pl.Int64, strict=True),
    ).drop("Time").with_columns(
        pl.col("Amount").map_elements(
            lambda x: x.replace("$", "")).cast(pl.Float64, strict=True)
    ).with_columns(
        pl.col("Merchant State").fill_null("ONLINE")
    ).with_columns(
        pl.col("Zip").cast(pl.String, strict=True).fill_null("ONLINE")
    ).with_columns(
        pl.col("Errors?").fill_null(value="No")
    ).with_columns(
        pl.col("Is Fraud?").map_elements(
            lambda x: 0 if x == "No" else 1
        )
    )
    return new_data_df
new_data_df = preprocess_dataset(data_df)

Now, let's see how the Amount varies between legitimate and fraudulent transactions. For that we will use altair to chart the top 100 amounts used for transactions, with a function called top_100_amount_counts_chart.

In [8]:
import altair as alt  # noqa: E402
from altair import Chart  # noqa: E402
from polars import DataFrame  # noqa: E402


def top_100_amount_counts_chart(data_df: DataFrame, source: str) -> Chart:
    """ Calculate and chart the top 100 Amounts used in transactions. """
    data_df = data_df.top_k(100, by="Count")
    top_amount = data_df.top_k(1, by="Count").to_numpy()[0][1]
    return alt.Chart(
        data_df,
        title=f"Top 100 {source} Amounts").mark_bar().encode(
        x=alt.X('Amount', axis=alt.Axis(labelAngle=-45)),
        y="Count",
        color=alt.condition(
            alt.datum.Count == top_amount,
            alt.value('orange'),
            alt.value('steelblue'))
    ).properties(width=400)

legit = new_data_df.select(
    pl.col("Amount"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).select(
        pl.col("Amount").unique(), pl.col("Amount").unique_counts().alias("Count")
    )

top_100_amount_counts_chart(legit, "Legitimate")

In [4]:
fraud = new_data_df.select(
    pl.col("Amount"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).select(
        pl.col("Amount").unique(), pl.col("Amount").unique_counts().alias("Count")
    )
top_100_amount_counts_chart(fraud, "Fraudulent")

As we can see, the amounts in fraudulent transactions are rarely repeated in comparison with the legitimate ones.

Now let's see how the Use Chip column varies between legitimate and fraudulent transactions with the following use_chip_chart function.

In [None]:
def use_chip_chart(data_df: DataFrame, source: str) -> Chart:
    """ Chart the proportion of Use chip in transactions. """
    return alt.Chart(
        data_df,
        title=f"{source} Use Chip").mark_arc(innerRadius=70).encode(
        color=alt.Color("Use Chip", title="Use Chip", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_chip = new_data_df.select(
    pl.col("Use Chip"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Use Chip").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_chip = new_data_df.select(
    pl.col("Use Chip"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Use Chip").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
use_chip_chart(legit_chip, "Legitimate")

In [None]:
use_chip_chart(fraud_chip, "Fraudulent")

As we can see, cards are used online more frequently for fraudulent transactions than for legitimate ones, where swipe transactions are more common.

Now let's see if the locations vary between legitimate and fraudulent transactions with the following top_10_merchant_state_chart and top_10_merchant_city_chart functions.

In [None]:
def top_10_merchant_state_chart(data_df: DataFrame, source: str) -> Chart:
    """ Chart the proportion of the top 10s Merchant State in transactions. """
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} Merchant State").mark_arc(innerRadius=70).encode(
        color=alt.Color("Merchant State", title="Merchant State", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

def top_10_merchant_city_chart(data_df: DataFrame, source: str) -> Chart:
    """ Chart the proportion of the top 10 Merchant Cities in transactions. """
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} Merchant City").mark_arc(innerRadius=70).encode(
        color=alt.Color("Merchant City", title="Merchant State", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_state = new_data_df.select(
    pl.col("Merchant State"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Merchant State").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
legit_city = new_data_df.select(
    pl.col("Merchant City"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("Merchant City").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_state = new_data_df.select(
    pl.col("Merchant State"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Merchant State").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_city = new_data_df.select(
    pl.col("Merchant City"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("Merchant City").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
print(legit_state.sort("Count", descending=True))
top_10_merchant_state_chart(legit_state, "Legitimate")

In [None]:
print(legit_city.sort("Count", descending=True))
top_10_merchant_city_chart(legit_city, "Legitimate")

In [None]:
print(fraud_state.sort("Count", descending=True))
top_10_merchant_state_chart(fraud_state, "Fraudulent")

In [None]:
print(fraud_city.sort("Count", descending=True))
top_10_merchant_city_chart(fraud_city, "Fraudulent")

With the previous charts we can see that online transactions are dominant for the merchant state and city in both leigitimate and fraudulent transactions. However we can also see that for fraudulent transactions, the states and cities include locations outside of the US, while for legitimate transactions they are mostly in the US except for the online transactions. This can be mostly explained because this dataset was generated to simulate US based people, so we can expect most of their legitimate transactions to come from US locations.

Now let's see if the Errors? column varies significantly between legitimate and fraudulent transactions, using an errors_chart function.

In [None]:
def errors_chart(data_df: DataFrame, source: str) -> Chart:
    """ Chart the proportion of Errors? in transactions. """
    return alt.Chart(
        data_df,
        title=f"{source} Errors?").mark_arc(innerRadius=70).encode(
        color=alt.Color("Errors?", title="Errors?", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_errors = new_data_df.select(
    pl.col("Errors?"), pl.col("Is Fraud?")).filter(
        pl.col("Is Fraud?") == 0).filter(pl.col("Errors?") != "No"
    ).group_by("Errors?").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_errors = new_data_df.select(
    pl.col("Errors?"), pl.col("Is Fraud?")).filter(
        pl.col("Is Fraud?") == 1).filter(pl.col("Errors?") != "No"
    ).group_by("Errors?").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
print(legit_errors.sort("Count", descending=True))
errors_chart(legit_errors, "Legitimate")

In [None]:
print(fraud_errors.sort("Count", descending=True))
errors_chart(fraud_errors, "Fraudulent")

With the previous charts we can see that Insufficient Balance is the most common error for both legitimate and fraudulent transactions.

Now let's see how the Merchant Category Codes (MCC) vary between fraudulent and legitimate transactions using a top_10_mcc_chart function.

In [None]:
def top_10_mcc_chart(data_df: DataFrame, source: str) -> Chart:
    """ Chart the proportion of Merchant Category Codes in transactions. """
    data_df = data_df.top_k(10, by="Count")
    return alt.Chart(
        data_df,
        title=f"Top 10 {source} MCC").mark_arc(innerRadius=70).encode(
        color=alt.Color("MCC", title="MCC", type="nominal",
                    sort='ascending',
                    scale=alt.Scale(scheme='plasma')),
        theta="Count"
    ).properties(width=400)

legit_mcc = new_data_df.select(
    pl.col("MCC"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 0
    ).group_by("MCC").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )
fraud_mcc = new_data_df.select(
    pl.col("MCC"), pl.col("Is Fraud?")).filter(pl.col("Is Fraud?") == 1
    ).group_by("MCC").agg(
        pl.col("Is Fraud?").count().alias("Count")
    )

print(legit_mcc.sort("Count", descending=True))
top_10_mcc_chart(legit_mcc, "Legitimate")

In [None]:
print(fraud_mcc.sort("Count", descending=True))
top_10_mcc_chart(fraud_mcc, "Fraudulent")

We can see that the most used Merchant Category Codes vary between fraudulent and legitimate transactions.

## Model Creation and Evaluation

### Baseline Model

Before creating our final model, we will develop a very simple model to decide if a transaction is legitimate or not, solely based on the merchant state, city, zip, MCC and Error values in transactions that were previously marked as fraudulent, and then calculate the model's recall. The final model should improve this baseline model's recall by at least 70%.

In [10]:
from sklearn.metrics import accuracy_score, recall_score  # noqa: E402
from sklearn.model_selection import train_test_split  # noqa: E402


def predict_evaluate(fraud_merchant_states: list, fraud_merchant_cities: list,
            fraud_zips: list, fraud_mccs: list, fraud_errors: list,
            x_values: DataFrame, y_values: DataFrame) -> (float, float):
    """ Take the learned fraudulent values and predict based on them, calculate
    and return metrics.
    """

    pred_train_y = x_values.select(
        pred_train_y = (
            pl.col("Merchant State").is_in(fraud_merchant_states) &
            pl.col("Merchant City").is_in(fraud_merchant_cities) &
            pl.col("Zip").is_in(fraud_zips) &
            pl.col("MCC").is_in(fraud_mccs) &
            pl.col("Errors?").is_in(fraud_errors)
        ).map_elements(lambda x: 1 if x is True else 0)
    )
    train_accuracy = accuracy_score(pred_train_y, y_values)
    train_recall = recall_score(pred_train_y, y_values)
    return train_accuracy, train_recall


def baseline_model(data_df: DataFrame) -> dict:
    """ This model learns the values used in fraudulent transactions and if the
    values in the provided transactions are in the learned ones the transaction
    is deemed as fraudulent, otherwise it is deemed legitimate.

    Args:
        data_df (DataFrame): A polars DataFrame with the transaction database.

    Returns:
        metrics (dict): The baseline model's metrics.
    """

    is_fraud = data_df.select(pl.col('Is Fraud?'))
    features = data_df.drop('Is Fraud?')

    original_count = len(data_df)
    training_size = int(original_count * .6)
    test_size = int((1 - .6) * .5 * training_size)

    train_x, rest_x, train_y, rest_y = train_test_split(features,
                                                        is_fraud,
                                                        train_size=training_size,
                                                        random_state=0)
    validate_x, test_x, validate_y, test_y  = train_test_split(rest_x,
                                                               rest_y,
                                                               train_size=test_size,
                                                               random_state=0)
    # Learn fraudulent values
    train_x = train_x.with_row_index()
    fraud_index = train_y.with_row_index().filter(
        pl.col("Is Fraud?") == 1).select(
            pl.col('index')).to_series().to_list()
    fraud_merchant_states = train_x.filter(
        pl.col("index").is_in(fraud_index)
    ).select(
        pl.col("Merchant State").unique()
    ).to_series().to_list()
    fraud_merchant_cities = train_x.filter(
        pl.col("index").is_in(fraud_index)
    ).select(
        pl.col("Merchant City").unique()
    ).to_series().to_list()
    fraud_zips = train_x.filter(
        pl.col("index").is_in(fraud_index)
    ).select(
        pl.col("Zip").unique()
    ).to_series().to_list()
    fraud_mccs = train_x.filter(
        pl.col("index").is_in(fraud_index)
    ).select(
        pl.col("MCC").unique()
    ).to_series().to_list()
    fraud_errors = train_x.filter(
        pl.col("index").is_in(fraud_index)
    ).select(
        pl.col("Errors?").unique()
    ).to_series().to_list()

    # Predict and evaluate vs train, validate and test data
    train_accuracy, train_recall = predict_evaluate(
        fraud_merchant_states, fraud_merchant_cities, fraud_zips, fraud_mccs,
        fraud_errors, train_x, train_y
    )
    validate_accuracy, validate_recall = predict_evaluate(
        fraud_merchant_states, fraud_merchant_cities, fraud_zips, fraud_mccs,
        fraud_errors, validate_x, validate_y
    )
    test_accuracy, test_recall = predict_evaluate(
        fraud_merchant_states, fraud_merchant_cities, fraud_zips, fraud_mccs,
        fraud_errors, test_x, test_y
    )

    metrics = {
        'train_accuracy': train_accuracy,
        'train_recall': train_recall,
        'validate_accuracy': validate_accuracy,
        'validate_recall': validate_recall,
        'test_accuracy': test_accuracy,
        'test_recall': test_recall,
    }
    return metrics

baseline_model(new_data_df)

{'train_accuracy': 0.7799038281481725,
 'train_recall': 0.005539741935722937,
 'validate_accuracy': 0.7795626613742077,
 'validate_recall': 0.0050274208929613024,
 'test_accuracy': 0.7799762811767208,
 'test_recall': 0.005025468557859475}

As we can see, our baseline model's recall is very low, so we should expect our final model to improve the recall by at least 70%.

### Splitting our Dataset

To be able to create, train and deploy our final machine learning model, first we need to create a function that will split our dataset in training, validation and test samples.

For that purpose we will create a split_dataset function that receives our dataset, and the percentages of data that we will use to train our model, and to validate and test our model's performance.

The first thing this function does is convert the columns with categorical data that are not integer values into their categorical "physical" (integer) values.

Then it separates our dependent variable in the "Is Fraud?" column from the rest of the columns.

Finally, it uses the train_test_split function from sklearn.model_selection, and we will call it with a random_state=0 value so it returns the same data split every time we call it.

In [3]:
from sklearn.model_selection import train_test_split  # noqa: E402


def split_dataset(data_df: DataFrame, train_proportion: float,
                  test_proportion: float):
    """ Convert categorical columns and split dataset. """
    data_df = data_df.with_columns(
        pl.col("Use Chip").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant Name").cast(pl.String).cast(
            pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant City").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Merchant State").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Zip").cast(pl.Categorical).to_physical()
    ).with_columns(
        pl.col("Errors?").cast(pl.Categorical).to_physical()
    )
    is_fraud = data_df.select(pl.col('Is Fraud?'))
    features = data_df.drop('Is Fraud?')

    original_count = len(data_df)
    training_size = int(original_count * train_proportion)
    test_size = int((1 - train_proportion) * test_proportion * training_size)

    train_x, rest_x, train_y, rest_y = train_test_split(features,
                                                        is_fraud,
                                                        train_size=training_size,
                                                        random_state=0)
    validate_x, test_x, validate_y, test_y  = train_test_split(rest_x,
                                                               rest_y,
                                                               train_size=test_size,
                                                               random_state=0)

    return (train_x, train_y), (validate_x, validate_y), (test_x, test_y)

### Feature Engineering and Training Pipeline

Next, we will create our feature engineering and model training pipeline. For that we will create a build_pipeline function that will peform both the feature engineering and train our model.

#### Feature Engineering

- We will use an scikit-learn [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) to scale the Amount column, this will help us reduce the effect of amount outliers (values that are too high or too low) in our model.
- At the beginning we were using OneHotEncoder for the categorical columns, however, we noticed it was adding too many columns to our dataset because the categorical columns have too many categories, so after investigating alternatives we decided to use a scikit-learn [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html#sklearn.preprocessing.TargetEncoder) instead, to encode the Card, Merchant Name, Merchant City, Merchant State, Zip, MCC, Errors?, Hour, Minute, and Use Chip colums without adding new ones. The use of TargetEncoder does have its [trade-offs](https://www.pythonprog.com/sklearn-preprocessing-targetencoder/), so we need to keep them in mind as we build and test our model.

#### Modeling
- For our model, we will use an scikit-learn [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) with 10 estimators, 10 parallel jobs, and set it to run in verbose mode. We added 10 estimators only so the model doesn't take too long to train (going from the default of 100 to 10 saves us around 50 minutes in training with this dataset), 10 parallel jobs to speed up the training process, and we run it in verbose mode so we can follow the training and predicting process as it happens.
- We will measure both the accuracy and recall of our model, however, the most important metric will be the recall, because for this case in particular we need to be able to find all the positive fraudulent transactions in our dataset, which means reducing the number of false negatives, so accuracy alone would not be enough.

In [4]:
from sklearn.compose import ColumnTransformer  # noqa: E402
from sklearn.ensemble import RandomForestClassifier  # noqa: E402
from sklearn.pipeline import FeatureUnion, Pipeline  # noqa: E402
from sklearn.preprocessing import RobustScaler, TargetEncoder  # noqa: E402


def build_pipeline():
    """ Build pipeline with feature encoders and model. """
    # Target encoder
    internal_target_encoding = TargetEncoder(smooth="auto")
    columns_to_encode = [
        "Card",
        "Merchant Name",
        "Merchant State",
        "Merchant City",
        "Zip",
        "MCC",
        "Errors?",
        "Hour",
        "Minute",
        "Use Chip"
    ]

    target_encoding = ColumnTransformer([
        (
            'target_encode',
            internal_target_encoding,
            columns_to_encode
        )
    ])

    # Scaler
    internal_scaler = RobustScaler()
    columns_to_scale = ["Amount"]

    scaler = ColumnTransformer([
        ("scaler", internal_scaler, columns_to_scale)
    ])

    # Full pipeline
    feature_engineering_pipeline  = Pipeline([
        (
            "features",
            FeatureUnion([
                ('categories', target_encoding),
                ('scaled', scaler)
            ])
        )
    ])

    # Machine learning model
    model = RandomForestClassifier(n_estimators=10, verbose=1, n_jobs=10)


    # Full pipeline
    final_pipeline = Pipeline([
        ("feature_engineering", feature_engineering_pipeline),
        ("model", model)
    ])

    return final_pipeline

### Model Training and Validation

- We will create a model_training_validation function that will use the pipeline created with the build_pipeline function, and with it we will call the model's fit function to train the model with the training split of the dataset. 
- After that, we will use the pipeline to predict if the transactions included in the training, validation and testing splits are fraudulent or legitimate.
- With the predicted values, we will calculate the accuracy and recall obtained with each split and save all the metrics in a 'metrics' dictionary that we will return along with our trained pipeline.

In [6]:
from sklearn.metrics import accuracy_score, recall_score  # noqa: E402


def model_training_validation(final_pipeline: Pipeline,
                              train_x: DataFrame, train_y: DataFrame,
                              validate_x: DataFrame, validate_y: DataFrame,
                              test_x: DataFrame, test_y: DataFrame):
    """ Train and use the model to predict values for the validation and test
    splits.
    """
    final_pipeline.fit(train_x, train_y.to_numpy().ravel())

    train_pred_y = final_pipeline.predict(train_x)
    validate_pred_y = final_pipeline.predict(validate_x)
    test_pred_y = final_pipeline.predict(test_x)

    train_accuracy = accuracy_score(train_pred_y, train_y.to_numpy().ravel())
    train_recall = recall_score(train_pred_y, train_y.to_numpy().ravel())

    validate_accuracy = accuracy_score(validate_pred_y, validate_y.to_numpy().ravel())
    validate_recall = recall_score(validate_pred_y, validate_y.to_numpy().ravel())

    test_accuracy = accuracy_score(test_pred_y, test_y.to_numpy().ravel())
    test_recall = recall_score(test_pred_y, test_y.to_numpy().ravel())

    metrics = {
        'train_accuracy': train_accuracy,
        'train_recall': train_recall,
        'validate_accuracy': validate_accuracy,
        'validate_recall': validate_recall,
        'test_accuracy': test_accuracy,
        'test_recall': test_recall,
    }

    return final_pipeline, metrics

### Full Training Run

- Finally, we will build a full_training_run function that will call our split_dataset, build_pipeline and model_training_validation functions to train our model and obtain the validation and test results.
- It will also use joblib's dump bethod to save our model under /model/inference_pipeline.joblib when the write_model parameter is True, otherwise the model won't be re-written.

In [29]:
import os  # noqa: E402

from joblib import dump  # noqa: E402


def full_training_run(write_model: bool):
    """ Split dataset, build pipeline, train and validate model, and write
    model if write_model is True.
    """
    training_data, validate_data, test_data = split_dataset(new_data_df,
                                                            train_proportion=0.6,
                                                            test_proportion=0.5)

    training_pipeline = build_pipeline()

    training_pipeline, metrics = model_training_validation(
        training_pipeline,
        train_x=training_data[0],
        train_y=training_data[1],
        validate_x=validate_data[0],
        validate_y=validate_data[1],
        test_x=test_data[0],
        test_y=test_data[1]
    )

    print(metrics)

    if write_model:
        if os.path.exists("../model/inference_pipeline.joblib"):
            os.remove("../model/inference_pipeline.joblib")
        dump(training_pipeline, "../model/inference_pipeline.joblib",
             compress=9)

    return training_pipeline

Now let's run our full_training_run function with write_model set to True once, so it saves our final model to disk. After that, all subsequents runs should be run using write_model set to False so the saved model doesn't get overwritten.

In [None]:
full_training_run(write_model=False)

We can see that our final model's recall is very good compared to our baseline model, however, we can also see that it seems to suffer a bit from overfitting because the recall with the validation and testing splits is worse than that with the training one.

## Error Analysis

Let's take a look at the cases where the model misclassified legitimate transactions as fraudulent (false positives) and fraudulent transactions as legitimate (false negatives) and see if we can find a trend that could help us improve the model.

For that we will modify our model_training_validation function a bit so it returns the predicted values and the values it should have predicted for each split of our dataset, so we can compare them and find the errors in the predictions.

In [None]:
def model_training_validation(final_pipeline: Pipeline,
                              train_x: DataFrame, train_y: DataFrame,
                              validate_x: DataFrame, test_x: DataFrame):

    final_pipeline.fit(train_x, train_y.to_numpy().ravel())

    train_pred_y = final_pipeline.predict(train_x)
    validate_pred_y = final_pipeline.predict(validate_x)
    test_pred_y = final_pipeline.predict(test_x)

    return train_pred_y, validate_pred_y, test_pred_y

training_data, validate_data, test_data = split_dataset(new_data_df,
                                                            train_proportion=0.6,
                                                            test_proportion=0.5)

training_pipeline = build_pipeline()

train_pred_y, validate_pred_y, test_pred_y = model_training_validation(
    training_pipeline,
    train_x=training_data[0],
    train_y=training_data[1],
    validate_x=validate_data[0],
    test_x=test_data[0],
)


Now that we have the predicted values for the validation and testing splits, let's compare them with the real values and find the false positives and false negatives of each split.

In [13]:
validate_x = validate_data[0].with_row_index()
validate_y = validate_data[1]
test_x = test_data[0].with_row_index()
test_y = test_data[1]

# transform predicted values into DataFrames
validate_pred_y = pl.DataFrame({"predicted_Is_Fraud?": validate_pred_y})
test_pred_y = pl.DataFrame({"predicted_Is_Fraud?": test_pred_y})

# get new DataFrame with predicted vs real values
compare_validate_y = pl.concat([validate_y, validate_pred_y], how="horizontal")
compare_test_y = pl.concat([test_y, test_pred_y], how="horizontal")

# errors = where the predicted value is different than real one
errors_validate_y = compare_validate_y.select(
    pl.col("predicted_Is_Fraud?"),
    errors=(pl.col("Is Fraud?") != pl.col("predicted_Is_Fraud?"))
).with_row_index().filter(
    pl.col("errors")
)
errors_test_y = compare_test_y.select(
    pl.col("predicted_Is_Fraud?"),
    errors=(pl.col("Is Fraud?") != pl.col("predicted_Is_Fraud?"))
).with_row_index().filter(
    pl.col("errors")
)

# false positives = where predicted_Is_Fraud is 1
false_pos_idx_validate_y = errors_validate_y.filter(
    pl.col("index") & pl.col("predicted_Is_Fraud?") == 1).to_series().to_list()
false_pos_idx_test_y = errors_test_y.filter(
    pl.col("index") & pl.col("predicted_Is_Fraud?") == 1).to_series().to_list()

# false positives = where predicted_Is_Fraud is 0
false_neg_idx_validate_y = errors_validate_y.filter(
    pl.col("index") & pl.col("predicted_Is_Fraud?") == 0).to_series().to_list()
false_neg_idx_test_y = errors_test_y.filter(
    pl.col("index") & pl.col("predicted_Is_Fraud?") == 0).to_series().to_list()


# get features corresponding with the error idx's
false_pos_validate_x = validate_x.filter(
    pl.col("index").is_in(false_pos_idx_validate_y))
false_neg_validate_x = validate_x.filter(
    pl.col("index").is_in(false_neg_idx_validate_y))

# get features corresponding with the error idx's
false_pos_test_x = test_x.filter(pl.col("index").is_in(false_pos_idx_test_y))
false_neg_test_x = test_x.filter(pl.col("index").is_in(false_neg_idx_test_y))


Now let's display each dataset and see if we can find any patterns or obvious things to fix.

In [None]:
false_neg_validate_x

In [None]:
false_pos_validate_x

In [None]:
false_neg_test_x

In [None]:
false_pos_test_x

While there doesn't seem to be an obvious pattern in the features for the false positives and false negatives, there's a disproportionately higher number of false negatives vs false positives. This could be caused by a disproportion in the number of positive cases (fraudulent transactions) vs negative cases (legitimate transactions) in our dataset. To see if that's the case, let's evaluate how many fraudulent trasactions we have in our dataset in comparison with the legitimate ones.

In [None]:
def fraud_vs_legitimate_chart(data_df: DataFrame) -> Chart:
    """ Bar chart with the legitimate vs fraudulent transaction numbers. """
    top_amount = data_df.top_k(1, by="count").to_numpy()[0][1]
    print(top_amount)
    return alt.Chart(
        data_df,
        title="Fraudulent vs Legitimate Transactions").mark_bar().encode(
        x=alt.X('Is Fraud?:O', axis=alt.Axis(labelAngle=-45)),
        y="count",
        color=alt.condition(
            alt.datum.count == top_amount,
            alt.value('orange'),
            alt.value('steelblue'))
    ).properties(width=400)

fraud_vs_legit = new_data_df.select(
    pl.col('Is Fraud?')).to_series().value_counts().cast(
    {"count": pl.Int64})
print(new_data_df.select(
    pl.col('Is Fraud?')).to_series().value_counts())
fraud_vs_legitimate_chart(fraud_vs_legit)

### Error Analysis Results and Conclusions

- As we can see, the fraudulent transactions are way underrepresented in comparison with the legitimate ones.
- As it was mentioned in the article we included earlier about [TargetEncoder](https://www.pythonprog.com/sklearn-preprocessing-targetencoder/), one of its limitations was that it could be sensitive to imbalanced classes like the imbalance we are seeing precisely here, so it is probably the reason behind the reduced recall we are seeing, so we could try to fix that in our model_training_validation function and see if that improves the recall of our final model.
- For that we will use [Synthetic Minority Over-sampling TEchnique (SMOTE)](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) to resample the fraudulent cases in our dataset so they are more balanced with the legitimate ones.
- SMOTE is a very powerful tool that is used to [address class imbalance in a dataset](https://medium.com/@corymaklin/synthetic-minority-over-sampling-technique-smote-7d419696b88c).

In [None]:
from imblearn.over_sampling import SMOTE  # noqa: E402
from sklearn.metrics import accuracy_score, recall_score  # noqa: E402, F811


def model_training_validation(final_pipeline: Pipeline,
                              train_x: DataFrame, train_y: DataFrame,
                              validate_x: DataFrame, validate_y: DataFrame,
                              test_x: DataFrame, test_y: DataFrame):
    """ Before training and testing our model we will try to fix the dataset's
    fraud/legit case imbalance using SMOTE's fit_resample method.
    """
    sm = SMOTE(random_state=0)

    train_x_res, train_y_res = sm.fit_resample(
        train_x.to_pandas(), train_y.to_pandas())
    validate_x_res, validate_y_res = sm.fit_resample(
        validate_x.to_pandas(), validate_y.to_pandas())
    test_x_res, test_y_res = sm.fit_resample(
        test_x.to_pandas(), test_y.to_pandas())

    final_pipeline.fit(train_x_res, train_y_res.to_numpy().ravel())

    train_pred_y = final_pipeline.predict(train_x_res)
    validate_pred_y = final_pipeline.predict(validate_x_res)
    test_pred_y = final_pipeline.predict(test_x_res)

    train_accuracy = accuracy_score(train_pred_y, train_y_res.to_numpy().ravel())
    train_recall = recall_score(train_pred_y, train_y_res.to_numpy().ravel())

    validate_accuracy = accuracy_score(validate_pred_y, validate_y_res.to_numpy().ravel())
    validate_recall = recall_score(validate_pred_y, validate_y_res.to_numpy().ravel())

    test_accuracy = accuracy_score(test_pred_y, test_y_res.to_numpy().ravel())
    test_recall = recall_score(test_pred_y, test_y_res.to_numpy().ravel())

    metrics = {
        'train_accuracy': train_accuracy,
        'train_recall': train_recall,
        'validate_accuracy': validate_accuracy,
        'validate_recall': validate_recall,
        'test_accuracy': test_accuracy,
        'test_recall': test_recall,
    }

    return final_pipeline, metrics

full_training_run(write_model=False)

As we can see, after using SMOTE we get better recall values with our training, validation and test splits.

## Next Steps

In order to be able to use all the functions in this notebook, we will save them all in separate python files (fraud_detection_flow.py and feature_pipeline.py) to achieve the following:
- We will build a training pipeline using Metaflow.
- We will use mlflow to build and keep model registry.
- We will use BentoML to create a contained model in Docker ready to be used in Production environments.