# Credit Card Fraud Detection
## Final Project for Codigo Facilito's Machine Learning 2023 Bootcamp

## Problem Definition and Objectives

### Problem Definition

Nowadays it is very easy for malicious actors to obtain illegally acquired databases of bank account credentials, which let them access unsuspecting victims' financial assets without the victims' knowledge until it is too late. To minimize the impact, different techniques can be applied to detect when a user's identity has been compromised and their assets are being accessed illegally.

One such technique is a fraud detection system, which learns a user's banking transaction behaviour: when (at what time of day), where (the type of stores they commonly visit) and how (purchasing online vs. swiping a physical card) they usually use their credit card. When a new transaction doesn't follow the learned pattern, the system flags it as potentially fraudulent and requires the user to perform additional verification before the transaction goes through.
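The when/where/how idea can be illustrated with a small rule-based toy. This is only a sketch of the concept, not the machine learning model this project trains: the `UserProfile` class, the `needs_verification` helper and the threshold of two unfamiliar traits are all hypothetical names and values chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical per-user summary of past transactions (illustration only)."""
    hours: Counter = field(default_factory=Counter)     # when: hour of day
    mccs: Counter = field(default_factory=Counter)      # where: merchant category
    channels: Counter = field(default_factory=Counter)  # how: online vs. swipe

    def learn(self, hour, mcc, channel):
        """Record one known-good transaction."""
        self.hours[hour] += 1
        self.mccs[mcc] += 1
        self.channels[channel] += 1

    def unfamiliar_traits(self, hour, mcc, channel):
        """Count how many of the three traits were never seen for this user."""
        return sum([
            self.hours[hour] == 0,
            self.mccs[mcc] == 0,
            self.channels[channel] == 0,
        ])


def needs_verification(profile, hour, mcc, channel, threshold=2):
    """Flag a transaction when at least `threshold` traits are unfamiliar (assumed rule)."""
    return profile.unfamiliar_traits(hour, mcc, channel) >= threshold


profile = UserProfile()
for hour, mcc, channel in [
    (6, 5411, "Swipe Transaction"),
    (6, 5300, "Swipe Transaction"),
    (17, 5651, "Swipe Transaction"),
]:
    profile.learn(hour, mcc, channel)

print(needs_verification(profile, 6, 5411, "Swipe Transaction"))   # False
print(needs_verification(profile, 3, 4829, "Online Transaction"))  # True
```

A trained model replaces the hand-written rule with patterns learned from data, but the inputs (time, merchant category, channel) play the same role.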

### Problem Relevance


Just to highlight the importance of fraud detection systems, according to the [Security.org 2023 Credit Card Fraud Report](https://www.security.org/digital-safety/credit-card-fraud-report/):
- 65 percent of credit and debit card holders have been fraud victims at some point in their lives, up from 58 percent in 2022. This equates to about 151 million victims of fraud in the United States alone.
- An increasing number of Americans have been victimized multiple times: in 2022, 44 percent of credit card users reported having two or more fraudulent charges, compared to 35 percent in 2021.
- Since 2021, the median fraudulent charge has climbed by about 27 percent (rising to $79 in 2023). This equates to about $12 billion in total attempted fraudulent charges.



### Key Stakeholders

The main stakeholders in this project are:

1) The banking institution(s) that would provide the banking transaction data required to train the machine learning model.
2) The user(s) who allow their banking transaction data to be used to train the model.
3) The FTC (in the US) and other regulatory institutions that would need to verify and approve the use of the users' data to train the model, and approve the use of the model itself.

### Objectives

The goal of this project is to build a machine learning model that detects fraudulent credit card transactions by training it on a credit card transaction dataset, and to build a feature engineering and training pipeline that allows the model to be re-trained in the future.

- The final machine learning model should provide at least 80% fraud detection accuracy.
- A feature engineering and training pipeline should be used to allow future training of the model.
- An application should allow a user to enter a dummy transaction and verify its authenticity.
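The 80% target does not say how accuracy is measured. The sketch below shows two possible readings: overall accuracy, and the share of fraudulent transactions that are caught (recall). It assumes labels encoded as 0 (legitimate) and 1 (fraud); the function names and the labels are made up for illustration and are not results from this project.

```python
def accuracy(y_true, y_pred):
    """Fraction of all transactions whose label was predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual fraud cases (label 1) that were predicted as fraud."""
    actual = sum(t == positive for t in y_true)
    caught = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return caught / actual if actual else 0.0


# Made-up labels: 98 legitimate (0) and 2 fraudulent (1) transactions.
y_true = [0] * 98 + [1] * 2
y_never_fraud = [0] * 100  # a "model" that never flags anything

print(accuracy(y_true, y_never_fraud))  # 0.98
print(recall(y_true, y_never_fraud))    # 0.0
```

With heavily imbalanced labels, a model that never flags fraud can still exceed 80% overall accuracy, so it may be worth stating which of the two readings the objective refers to.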

### Preparation Steps

1) Identify a public credit card transaction dataset suitable for an Exploratory Data Analysis, one that allows the clear and easy identification of each column's information. Some datasets available on Kaggle contain columns that were already scaled or processed using PCA, and are therefore not useful for this project's goals.
2) Research the different machine learning models that are best suited for detecting fraudulent credit card transactions.
3) Select a free-to-use online platform that is suitable for deploying the machine learning model.

### Dataset

- We will use the [Credit Card Transactions Kaggle Dataset](https://www.kaggle.com/datasets/ealtman2019/credit-card-transactions). It contains a good amount of data to train our model: approximately 24M transactions from 2,000 users, generated by a multi-agent virtual world simulation performed by IBM. Its columns are also easy to identify and work with, because they are not scaled or obfuscated in any way that could prevent us from finding correlations in the data.

- The dataset contains the following columns:
    1) 'User': An ID of the user.
    2) 'Card': An ID for the user's card. Some users have multiple cards.
    3) 'Year', 'Month', 'Day', 'Time': The timestamp of the transaction. 
    4) 'Amount': The amount of the transaction.
    5) 'Use Chip': 'Swipe Transaction' if a physical card was used to perform the transaction, or 'Online Transaction' if the transaction was performed online.
    6) 'Merchant Name': The ID of the store where the transaction was made.
    7) 'Merchant City', 'Merchant State', 'Zip': The store's location.
    8) 'MCC': The [Merchant Category Code](https://www.investopedia.com/terms/m/merchant-category-codes-mcc.asp).
    9) 'Errors?': Any error(s) during the transaction, e.g. 'Insufficient Balance', 'Technical Glitch', etc.
    10) 'Is Fraud?': A label indicating if the transaction was fraudulent or not.
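The column list above can be turned into a quick header check before loading the full file. This is a minimal sketch using only the standard library; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the sample header is typed in by hand rather than read from the real CSV.

```python
import csv
import io

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "User", "Card", "Year", "Month", "Day", "Time", "Amount", "Use Chip",
    "Merchant Name", "Merchant City", "Merchant State", "Zip", "MCC",
    "Errors?", "Is Fraud?",
]


def missing_columns(csv_file):
    """Return the expected column names that are absent from the file's header row."""
    header = next(csv.reader(csv_file), [])  # empty file -> empty header
    return [name for name in EXPECTED_COLUMNS if name not in header]


# Hand-written header standing in for the first line of the real file.
sample = io.StringIO(
    "User,Card,Year,Month,Day,Time,Amount,Use Chip,Merchant Name,"
    "Merchant City,Merchant State,Zip,MCC,Errors?,Is Fraud?\n"
)
print(missing_columns(sample))  # []
```

For the real dataset, the same function could be called on an open file handle, for example `open(path, newline="")`.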

### Deployment Plan

To deploy our model we will use BentoML, because it provides a robust framework for serving and deploying machine learning models in the cloud. We will deploy our model to a free-tier virtual machine on Google Cloud's Compute Engine, provided it has enough resources to run and serve our model. If it doesn't, we will not deploy our model to the cloud, and we will store our model locally instead.

## Exploratory Data Analysis

We will begin by loading the credit card transaction dataset into a polars DataFrame and confirming that the contents of the file have been loaded successfully.

In [1]:
import polars as pl

data_df = pl.read_csv("../data/credit_card_transactions-ibm_v2.csv")
data_df.head()

| User | Card | Year | Month | Day | Time | Amount | Use Chip | Merchant Name | Merchant City | Merchant State | Zip | MCC | Errors? | Is Fraud? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | str | str | str | i64 | str | str | f64 | i64 | str | str |
| 0 | 0 | 2002 | 9 | 1 | 06:21 | $134.09 | Swipe Transact… | 3527213246127876953 | La Verne | CA | 91750.0 | 5300 | null | No |
| 0 | 0 | 2002 | 9 | 1 | 06:42 | $38.48 | Swipe Transact… | -727612092139916043 | Monterey Park | CA | 91754.0 | 5411 | null | No |
| 0 | 0 | 2002 | 9 | 2 | 06:22 | $120.34 | Swipe Transact… | -727612092139916043 | Monterey Park | CA | 91754.0 | 5411 | null | No |
| 0 | 0 | 2002 | 9 | 2 | 17:45 | $128.95 | Swipe Transact… | 3414527459579106770 | Monterey Park | CA | 91754.0 | 5651 | null | No |
| 0 | 0 | 2002 | 9 | 3 | 06:23 | $104.71 | Swipe Transact… | 5817218446178736267 | La Verne | CA | 91750.0 | 5912 | null | No |


First we can identify a few columns that need some work.
- The Time column is a string, so it needs to be split into integer "Hour" and "Minute" columns.
- We need to convert the Amount column from a string (with a leading "$") into an actual number.
- Let's make the Use Chip column categorical.
- Let's make the Merchant Name a string, then a categorical value.
- The Merchant State is empty when the transaction was online, so we'll fill the nulls with "N/A", then make it categorical.
- Let's make Merchant City categorical, filling any nulls with "N/A" first.
- The Zip column is empty when the transaction was online, so let's convert it to a string and replace null with "N/A", then make it categorical.
- Let's turn Errors? into a binary Errors flag: 0 when the transaction had no error (null), 1 otherwise.
- Finally, let's rename the Is Fraud? column to IsFraud and encode it as 0 ("No") or 1.

Every categorical column is stored as its integer code rather than as the original value.

In [49]:
# Split the "HH:MM" string into integer Hour and Minute columns.
new_data_df = data_df.with_columns(
    Hour = pl.col("Time").str.slice(0, 2).cast(pl.Int64, strict=True),
    Minute = pl.col("Time").str.slice(3).cast(pl.Int64, strict=True),
).drop("Time")
# Strip the "$" (literal match, not a regex anchor) and parse the amount as a float.
new_data_df = new_data_df.with_columns(
    pl.col("Amount").str.replace("$", "", literal=True).cast(pl.Float64, strict=True)
)
# Encode the categorical columns as their integer codes.
new_data_df = new_data_df.with_columns(
    pl.col("Use Chip").cast(pl.Categorical).to_physical()
)
new_data_df = new_data_df.with_columns(
    pl.col("Merchant Name").cast(pl.String, strict=True).cast(pl.Categorical).to_physical()
)
new_data_df = new_data_df.with_columns(
    pl.col("Merchant State").fill_null(value="N/A").cast(pl.Categorical).to_physical()
)
new_data_df = new_data_df.with_columns(
    pl.col("Merchant City").fill_null(value="N/A").cast(pl.Categorical).to_physical()
)
new_data_df = new_data_df.with_columns(
    pl.col("Zip").cast(pl.String, strict=True).fill_null("N/A").cast(pl.Categorical).to_physical()
)
# Flag transactions with any error as 1 and error-free ones (null) as 0.
new_data_df = new_data_df.with_columns(
    (pl.col("Errors?").fill_null(value="No") != "No").cast(pl.Int64).alias("Errors")
).drop("Errors?")
# Rename the label column and encode it as 0 ("No") or 1 (anything else).
# The old name no longer exists after the rename, so there is nothing left to drop.
new_data_df = new_data_df.rename({"Is Fraud?": "IsFraud"}).with_columns(
    (pl.col("IsFraud") != "No").cast(pl.Int64)
)

new_data_df

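| User | Card | Year | Month | Day | Amount | Use Chip | Merchant Name | Merchant City | Merchant State | Zip | MCC | IsFraud | Hour | Minute | Errors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | f64 | u32 | u32 | u32 | u32 | u32 | i64 | i64 | i64 | i64 | i64 |
| 0 | 0 | 2002 | 9 | 1 | 134.09 | 0 | 0 | 0 | 0 | 0 | 5300 | 0 | 6 | 21 | 0 |
| 0 | 0 | 2002 | 9 | 1 | 38.48 | 0 | 1 | 1 | 0 | 1 | 5411 | 0 | 6 | 42 | 0 |
| 0 | 0 | 2002 | 9 | 2 | 120.34 | 0 | 1 | 1 | 0 | 1 | 5411 | 0 | 6 | 22 | 0 |
| 0 | 0 | 2002 | 9 | 2 | 128.95 | 0 | 2 | 1 | 0 | 1 | 5651 | 0 | 17 | 45 | 0 |
| 0 | 0 | 2002 | 9 | 3 | 104.71 | 0 | 3 | 0 | 0 | 0 | 5912 | 0 | 6 | 23 | 0 |
| 0 | 0 | 2002 | 9 | 3 | 86.19 | 0 | 4 | 1 | 0 | 2 | 5970 | 0 | 13 | 53 | 0 |
| 0 | 0 | 2002 | 9 | 4 | 93.84 | 0 | 1 | 1 | 0 | 1 | 5411 | 0 | 5 | 51 | 0 |
| 0 | 0 | 2002 | 9 | 4 | 123.5 | 0 | 1 | 1 | 0 | 1 | 5411 | 0 | 6 | 9 | 0 |
| 0 | 0 | 2002 | 9 | 5 | 61.72 | 0 | 1 | 1 | 0 | 1 | 5411 | 0 | 6 | 14 | 0 |
| 0 | 0 | 2002 | 9 | 5 | 57.1 | 0 | 5 | 0 | 0 | 0 | 7538 | 0 | 9 | 35 | 0 |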
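The integer codes in the categorical columns above (for example 0, 1, 1, 2, 3 in Merchant Name) are consistent with numbering each distinct value in order of first appearance. The pure-Python sketch below illustrates that idea on the merchant IDs from the first five rows of the raw data. It is an illustration only, not polars' actual implementation, and `first_appearance_codes` is a hypothetical helper.

```python
def first_appearance_codes(values):
    """Map each distinct value to an integer, numbered in order of first appearance."""
    codes = {}
    # setdefault stores len(codes) only when the value has not been seen before.
    return [codes.setdefault(value, len(codes)) for value in values]


# Merchant Name values from the first five rows of the raw dataset.
merchant_names = [
    3527213246127876953,
    -727612092139916043,
    -727612092139916043,
    3414527459579106770,
    5817218446178736267,
]
print(first_appearance_codes(merchant_names))  # [0, 1, 1, 2, 3]
```

If the codes do depend on row order in this way, the same merchant could receive a different integer when the pipeline is re-run on different data, so it may be worth persisting the value-to-code mapping alongside the trained model.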


We will now use YData Profiling to identify which columns can best help us determine whether a transaction is legitimate or not. We will create a report with only the first 10,000 rows so the process doesn't take too long.

In [51]:
from ydata_profiling import ProfileReport

# Profile only the first 10,000 rows to keep report generation fast.
report = ProfileReport(new_data_df.head(10_000).to_pandas())
report.to_file("../data/profile_report.html")
