# Predicting Fraud with Tree-Based Models

In the model development phase of our fraud prediction project, we began with Exploratory Data Analysis (EDA). It showed that several categorical features, namely cardholder names, cities, states, and merchant types, are associated with fraudulent transactions. These observations guided our modeling strategy: the categorical features appear to play a key role in distinguishing fraudulent from legitimate transactions.

We chose CatBoost and LightGBM because both can handle categorical variables efficiently without extensive preprocessing. These gradient-boosted tree models expose regularization parameters that help limit overfitting, and boosting improves predictive accuracy iteratively. We aim to use these strengths through iterative training, hyperparameter tuning, and evaluation with metrics such as precision, recall, F1-score, AUC-ROC, and the confusion matrix. The goal is a set of reliable, accurate models that can identify fraudulent transactions in our dataset.

## Contents
- [Imports](#Imports)
- [Functions](#Functions)
- [Baseline Score](#Baseline-Score)
- [Catboost Model](#Catboost-Model)
- [LightGBM Model](#LightGBM-Model)
- [Save Models](#Save-Models)

## Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA, TruncatedSVD
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.metrics import classification_report
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay



In [2]:
# Determine the data path
data_path = "../data/final_df.csv"

In [3]:
# Load the dataset
df = pd.read_csv(data_path).drop(columns='Unnamed: 0')
df.head()

|   | cardholder_name | card_number | card_type | merchant_name | merchant_category | merchant_state | merchant_city | transaction_amount | merchant_category_code | fraud_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Meagan Smith | 4408914864277480 | visa | KFC | Fast Food | New Jersey | Jersey City | 38.055684 | MCC 5814 | 0 |
| 1 | Miss Vanessa Briggs MD | 4533948622139044 | visa | McDonald's | Fast Food | Montana | Missoula | 11.516379 | MCC 5814 | 0 |
| 2 | Casey Lyons | 4350240875308199 | visa | Domino's Pizza | Fast Food | Ohio | Cleveland | 12.739792 | MCC 5814 | 0 |
| 3 | Cynthia Munoz | 4756687869818916 | visa | McDonald's | Fast Food | Massachusetts | Worcester | 20.899888 | MCC 5814 | 0 |
| 4 | Lynn Pham | 4813038430033752 | visa | Papa John's | Fast Food | Minnesota | Saint Paul | 11.073323 | MCC 5814 | 0 |


## Functions

In [4]:
def evaluation(model, model_name):
    """Plot the confusion matrix and print metrics for `model` on the global X_test and y_test."""

    # Get predictions
    preds = model.predict(X_test)

    # Compute the confusion matrix once and unpack its values
    conf_matrix = confusion_matrix(y_test, preds)
    tn, fp, fn, tp = conf_matrix.ravel()

    # Display the confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
    disp.plot(values_format='d', cmap=plt.cm.Blues)

    # Define the title
    plt.title(f"The Confusion Matrix of {model_name}")

    print(" Evaluation Metrics ".center(34, "="))
    print(f"Accuracy -------------- {accuracy_score(y_test, preds)}")
    print(f"Precision ------------- {precision_score(y_test, preds)}")
    print(f"Sensitivity ----------- {recall_score(y_test, preds)}")
    print(f"Specificity ----------- {tn/(tn+fp)}")
    print(f"F1 score -------------- {f1_score(y_test, preds)}")
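
The metrics printed by `evaluation` are all simple ratios of the four confusion-matrix counts. As a minimal sketch (the helper name `metrics_from_counts` and the example counts are hypothetical and not taken from this dataset), they could be computed by hand like this:

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    # Guard against empty denominators (e.g. no positive predictions at all)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, for illustration only
example = metrics_from_counts(tn=80, fp=5, fn=10, tp=5)
```

With these made-up counts the accuracy is 0.85 while the sensitivity is only one third, which is why accuracy alone can look better than the model really is on an imbalanced fraud dataset.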

## Baseline Score

### Train-Test Split

In [5]:
# Define X features and y target
X = df.drop('fraud_flag', axis=1)
y = df['fraud_flag']

print(f"X shape ----------- {X.shape}")
print(f"y shape ----------- {y.shape}")

X shape ----------- (100000, 9)
y shape ----------- (100000,)


In [6]:
# Define training and test Xs and ys
# Note: test_size=0.8 holds out 80% of the rows for testing and trains on the remaining 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=42)

In [7]:
# Get the ratio of classes
y_test.value_counts(normalize=True)

fraud_flag
0    0.836638
1    0.163362
Name: proportion, dtype: float64

**The baseline accuracy is about 0.84 (0.8366), the proportion of the majority (non-fraud) class.**
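
The baseline is the accuracy of a model that always predicts the most frequent class. A minimal sketch of that calculation, using a hypothetical helper `majority_baseline_accuracy` and made-up toy labels rather than the project data:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Hypothetical toy labels: five legitimate (0) and one fraudulent (1)
toy_labels = [0, 0, 0, 0, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)
```

On the notebook's data the same number should correspond to `y_test.value_counts(normalize=True).max()`. Any model worth keeping has to beat this figure, and because such a baseline never flags a fraud, its recall on the fraud class is zero.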

## Catboost Model

In [8]:
# Specify categorical features
cats = ['cardholder_name', 'card_type', 'merchant_name', 'merchant_category',
       'merchant_state', 'merchant_city', 'merchant_category_code']

# Instantiate the model
cb = CatBoostClassifier(iterations=100, cat_features=cats, random_state=42)

# Fit the model
cb.fit(X_train, y_train)

# Get the accuracy score on the training set
print(f"Training accuracy score is: {cb.score(X_train, y_train)}")

Learning rate set to 0.305826
0:	learn: 0.5619432	total: 212ms	remaining: 21s
1:	learn: 0.4942667	total: 245ms	remaining: 12s
2:	learn: 0.4564945	total: 262ms	remaining: 8.46s
3:	learn: 0.4344617	total: 283ms	remaining: 6.8s
4:	learn: 0.4227252	total: 296ms	remaining: 5.62s
5:	learn: 0.4145799	total: 327ms	remaining: 5.13s
6:	learn: 0.4096423	total: 362ms	remaining: 4.81s
7:	learn: 0.4080681	total: 374ms	remaining: 4.3s
8:	learn: 0.4060546	total: 413ms	remaining: 4.18s
9:	learn: 0.4046868	total: 446ms	remaining: 4.01s
10:	learn: 0.4030114	total: 479ms	remaining: 3.88s
11:	learn: 0.4018921	total: 513ms	remaining: 3.76s
12:	learn: 0.4012109	total: 552ms	remaining: 3.69s
13:	learn: 0.4009210	total: 602ms	remaining: 3.7s
14:	learn: 0.4003081	total: 651ms	remaining: 3.69s
15:	learn: 0.3999409	total: 720ms	remaining: 3.78s
16:	learn: 0.3990904	total: 771ms	remaining: 3.76s
17:	learn: 0.3987277	total: 821ms	remaining: 3.74s
18:	learn: 0.3987214	total: 841ms	remaining: 3.59s
19:	learn: 0.39847

In [None]:
evaluation(model=cb, model_name="Catboost Model")

### The Catboost Model Evaluation

In [None]:
# Get prediction 
preds = cb.predict(X_test)

# Compute the confusion matrix once and unpack its values
conf_matrix = confusion_matrix(y_test, preds)
tn, fp, fn, tp = conf_matrix.ravel()

# Create display confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
disp.plot(values_format='d', cmap=plt.cm.Blues)

# Define the title
plt.title(f"The Confusion Matrix of Catboost Model");

print(f" Evaluation Metrics ".center(34, "="))
print(f"Accuracy -------------- {accuracy_score(y_test, preds)}")
print(f"Precision ------------- {precision_score(y_test, preds)}")
print(f"Sensitivity ----------- {recall_score(y_test, preds)}")
print(f"Specificity ----------- {tn/(tn+fp)}")
print(f"F1 score -------------- {f1_score(y_test, preds)}")

### The Catboost Model Feature Importance

In [None]:
# Get the feature importance values
feature_importance = cb.feature_importances_

# Get feature names
feature_names = cb.feature_names_

# Sort features in descending order of importance
sorted_indices = np.argsort(feature_importance)[::-1]
sorted_feature_importance = feature_importance[sorted_indices]
sorted_feature_names = np.array(feature_names)[sorted_indices]

# Create a bar plot
fig = go.Figure(data=[go.Bar(x=sorted_feature_names, y=sorted_feature_importance)])

fig.update_layout(title="Catboost Model Feature Importances", xaxis_title="Features", yaxis_title="Importance", xaxis_tickangle=-45)
fig.show()

## LightGBM Model

In [None]:
#X_encoded = pd.get_dummies(X, drop_first=True)

In [None]:
cats = ['cardholder_name', 'card_type', 'merchant_name', 'merchant_category',
       'merchant_state', 'merchant_city', 'merchant_category_code']

for col in cats:
    print(df[col].value_counts())

In [None]:
# Keep one fitted encoder per column so the mapping can be reused later
label_encoders = {}

for col in cats:
    label_encoders[col] = LabelEncoder()
    X[col] = label_encoders[col].fit_transform(X[col].astype(str))
    print(X[col].value_counts())
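
For intuition, label encoding replaces each distinct string with an integer code assigned in sorted order of the values. The sketch below mimics that idea in plain Python; `label_encode` and the sample merchant names are hypothetical and only illustrate the mapping, not scikit-learn's internals:

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Hypothetical sample of merchant names
codes, mapping = label_encode(["KFC", "McDonald's", "Domino's Pizza", "KFC"])
```

The resulting codes carry no meaningful order. A tree model that receives them as plain numbers will still split on them as if they were ordered, unless the columns are explicitly declared as categorical.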

In [None]:
# Apply TruncatedSVD for dimensionality reduction
# svd = TruncatedSVD(n_components=100)
# X_svd = svd.fit_transform(X_encoded)

# Apply PCA
# pca = PCA(n_components=0.95)  # Retain 95% of variance
# X_pca = pca.fit_transform(X_encoded)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize data (fit the scaler on the training set only)
# sc = StandardScaler()
# X_train_sc = sc.fit_transform(X_train)
# X_test_sc = sc.transform(X_test)

# Set lightgbm train and test data
# categorical_feature = [0, 2, 3, 4, 5, 6, 8]

train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Define parameters
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_error',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

num_round = 100

bst = lgb.train(params, train_data, num_round, valid_sets=[test_data])

preds = bst.predict(X_test, num_iteration=bst.best_iteration)

preds_binary = [1 if x>= 0.5 else 0 for x in preds]
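
The 0.5 cutoff used to binarize the predicted probabilities is a choice rather than a requirement. As a rough sketch with made-up scores and labels (the helper `precision_recall_at` is hypothetical), this is how precision and recall could be compared at different thresholds:

```python
def precision_recall_at(probs, labels, threshold):
    """Binarize probabilities at `threshold` and return (precision, recall)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities and true labels, for illustration only
toy_probs = [0.9, 0.6, 0.4, 0.35, 0.2, 0.1]
toy_labels = [1, 0, 1, 1, 0, 0]

at_half = precision_recall_at(toy_probs, toy_labels, 0.5)
at_lower = precision_recall_at(toy_probs, toy_labels, 0.3)
```

In this toy data, lowering the threshold from 0.5 to 0.3 raises recall from one third to 1.0. Whether a similar trade-off helps on the real fraud data would need to be checked on a validation set.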


### The LightGBM Model Evaluation

In [None]:
# Compute the confusion matrix once and unpack its values
conf_matrix = confusion_matrix(y_test, preds_binary)
tn, fp, fn, tp = conf_matrix.ravel()

# Create display confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix)
disp.plot(values_format='d', cmap=plt.cm.Blues)

# Define the title
plt.title(f"The Confusion Matrix of LightGBM");

print(f" Evaluation Metrics ".center(34, "="))
print(f"Accuracy -------------- {accuracy_score(y_test, preds_binary)}")
print(f"Precision ------------- {precision_score(y_test, preds_binary)}")
print(f"Sensitivity ----------- {recall_score(y_test, preds_binary)}")
print(f"Specificity ----------- {tn/(tn+fp)}")
print(f"F1 score -------------- {f1_score(y_test, preds_binary)}")