# **Model Selection Analysis**: Transaction Fraud Detection

### Carina Carino

### 02/24/2024

 Analyze the performance of three candidate models of your choosing. Argue for the model you select using the criteria you design. Report how the selected model's performance will influence the system's design.

## Install and Import Libraries

In [1]:
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install matplotlib
!pip install xgboost
!pip install scikit-learn-intelex
!pip install Flask
!pip install catboost
!pip install imbalanced-learn

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearnex import patch_sklearn 

patch_sklearn()

import calendar

import warnings
warnings.filterwarnings('ignore')


pd.set_option("display.precision", 2)





[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip





Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Data Extraction and Transformation

In [24]:
data = pd.read_csv('../transactions.csv', header=[0], index_col=[0])

label_encoder = LabelEncoder()

data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'])
data['trans_date'] = data['trans_date_trans_time'].dt.date
data['trans_date'] = pd.to_datetime(data['trans_date'])
data['dob']=pd.to_datetime(data['dob'])
# Age at transaction time; casting the timedelta to int64 stores it as a nanosecond count
data["age"] = data["trans_date"]-data["dob"]
data["age"] = data["age"].astype('int64')
data['trans_month'] = pd.DatetimeIndex(data['trans_date']).month
data['trans_year'] = pd.DatetimeIndex(data['trans_date']).year
data['month_name'] = data['trans_month'].apply(lambda x: calendar.month_abbr[x])
data['lat_dist_diff'] = abs(round(data['merch_lat']-data['lat'],3))
data['long_dist_diff'] = abs(round(data['merch_long']-data['long'],3))
data['city_state'] = data['city'] + ', ' + data['state']

data['trans_hour'] = data['trans_date_trans_time'].dt.hour
data['trans_minute'] = data['trans_date_trans_time'].dt.minute

data = data.drop(['cc_num','first','last','street','trans_num','trans_date_trans_time','city','lat','long','dob','merch_lat',
'merch_long','trans_date','month_name','trans_year'],axis=1)

data['city_state'] = label_encoder.fit_transform(data['city_state'])
data['zip'] = label_encoder.fit_transform(data['zip'])
data['job'] = label_encoder.fit_transform(data['job'])
data['merchant'] = label_encoder.fit_transform(data['merchant'])
data['sex'] = label_encoder.fit_transform(data['sex'])

data = pd.get_dummies(data, columns=['category'], drop_first=True)
data = pd.get_dummies(data, columns=['state'], drop_first=True)


### Data Splitting

In [5]:
X = data.drop(['is_fraud'], axis=1)
y = data['is_fraud']  # Ensure y is a Series for compatibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale your data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## Random Forest


### I. Default Hyperparameters

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Random Forest with default parameters
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train_scaled, y_train)

# Predictions
y_pred_default = rf_default.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_default))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.98      0.72      0.83      1953

    accuracy                           1.00    370479
   macro avg       0.99      0.86      0.91    370479
weighted avg       1.00      1.00      1.00    370479
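
The accuracy of 1.00 in the report above mostly reflects class imbalance: only 1,953 of the 370,479 test transactions are fraudulent. As a rough sanity check, the sketch below computes the accuracy of a hypothetical classifier that always predicts the non-fraud class, using the support counts from the report. The variable names are illustrative and not part of the original notebook.

```python
# Support counts taken from the classification report above.
n_legit = 368526
n_fraud = 1953
n_total = n_legit + n_fraud

# Share of fraudulent transactions in the test split.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial baseline that labels every transaction as legitimate.
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline reaches roughly 99.5% accuracy while catching no fraud at all, which is why the fraud-class precision, recall, and F1 are the more informative figures in these reports.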


### II. With Class Weights

In [8]:
# Random Forest with class weight balancing
rf_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
rf_balanced.fit(X_train_scaled, y_train)

# Predictions
y_pred_balanced = rf_balanced.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_balanced))


              precision    recall  f1-score   support

           0       1.00      0.95      0.97    368526
           1       0.09      0.97      0.17      1953

    accuracy                           0.95    370479
   macro avg       0.55      0.96      0.57    370479
weighted avg       1.00      0.95      0.97    370479
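
To make the precision and recall trade-off in the report above more concrete, the sketch below converts the fraud-class precision and recall into approximate alert counts. The helper `implied_alert_volume` is a hypothetical name introduced for illustration, and the results are only approximate because the report rounds its metrics to two decimals.

```python
def implied_alert_volume(precision, recall, n_positive):
    """Approximate alert counts implied by precision, recall, and class support."""
    true_positives = recall * n_positive        # frauds that get flagged
    flagged = true_positives / precision        # all transactions that get flagged
    false_positives = flagged - true_positives  # legitimate transactions flagged
    return true_positives, flagged, false_positives


# Fraud-class metrics and support from the class-weighted Random Forest report above.
tp, flagged, fp = implied_alert_volume(precision=0.09, recall=0.97, n_positive=1953)
print(f"Fraud caught: ~{tp:.0f}, flagged: ~{flagged:.0f}, false alarms: ~{fp:.0f}")
```

Under these rounded inputs, roughly 21,000 test transactions would be flagged in order to catch about 1,900 frauds, which matters for any downstream review process.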


### III. With SMOTE

In [9]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline

# Define pipeline with SMOTE and Random Forest
pipeline_smote = ImPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline_smote.fit(X_train_scaled, y_train)

# Predictions
y_pred_smote = pipeline_smote.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_smote))



              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.86      0.81      0.83      1953

    accuracy                           1.00    370479
   macro avg       0.93      0.90      0.92    370479
weighted avg       1.00      1.00      1.00    370479


### IV. With Borderline SMOTE

In [10]:
from imblearn.over_sampling import BorderlineSMOTE

# Define pipeline with Borderline SMOTE and Random Forest
pipeline_bsmote = ImPipeline([
    ('bsmote', BorderlineSMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline_bsmote.fit(X_train_scaled, y_train)

# Predictions
y_pred_bsmote = pipeline_bsmote.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_bsmote))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.89      0.78      0.83      1953

    accuracy                           1.00    370479
   macro avg       0.95      0.89      0.92    370479
weighted avg       1.00      1.00      1.00    370479


## XGBoost

### I. Default Hyperparameters

In [17]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score

# XGBoost with default parameters
xgb_default = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)
xgb_default.fit(X_train_scaled, y_train)

# Predictions
y_pred_default = xgb_default.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_default))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.96      0.84      0.89      1953

    accuracy                           1.00    370479
   macro avg       0.98      0.92      0.95    370479
weighted avg       1.00      1.00      1.00    370479


### II. With Class Weights

In [16]:
# Calculate scale_pos_weight
# It's a good practice to set it as sum(negative instances) / sum(positive instances)
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])

# XGBoost with class weight balancing
xgb_balanced = XGBClassifier(use_label_encoder=False, eval_metric='logloss', scale_pos_weight=scale_pos_weight, random_state=42)
xgb_balanced.fit(X_train_scaled, y_train)

# Predictions
y_pred_balanced = xgb_balanced.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_balanced))


              precision    recall  f1-score   support

           0       1.00      0.99      1.00    368526
           1       0.49      0.96      0.65      1953

    accuracy                           0.99    370479
   macro avg       0.74      0.98      0.82    370479
weighted avg       1.00      0.99      1.00    370479
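
The `scale_pos_weight` value used above is the ratio of negative to positive training labels. A minimal sketch of that ratio on toy labels follows; `y_toy` is made up for illustration, and the second ratio uses the test-set support counts from the report because the training split's exact counts are not shown in this notebook.

```python
import numpy as np

# Toy labels for illustration only: 8 legitimate (0) and 2 fraudulent (1).
y_toy = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0])

n_negative = int((y_toy == 0).sum())
n_positive = int((y_toy == 1).sum())
scale_pos_weight_toy = n_negative / n_positive  # 8 / 2

# Same ratio computed from the test-set support counts in the report above.
test_ratio = 368526 / 1953

print(scale_pos_weight_toy, round(test_ratio, 1))
```

A ratio near 189 means each fraud example is weighted about as heavily as 189 legitimate ones, which is consistent with the jump in recall and the drop in precision seen in the report.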


### III. With SMOTE

In [13]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImPipeline

# Define pipeline with SMOTE and XGBoost
pipeline_smote_xgb = ImPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42))
])

pipeline_smote_xgb.fit(X_train_scaled, y_train)

# Predictions
y_pred_smote = pipeline_smote_xgb.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_smote))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.72      0.89      0.80      1953

    accuracy                           1.00    370479
   macro avg       0.86      0.94      0.90    370479
weighted avg       1.00      1.00      1.00    370479


### IV. With Borderline SMOTE

In [14]:
from imblearn.over_sampling import BorderlineSMOTE

# Define pipeline with Borderline SMOTE and XGBoost
pipeline_bsmote_xgb = ImPipeline([
    ('bsmote', BorderlineSMOTE(random_state=42)),
    ('classifier', XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42))
])

pipeline_bsmote_xgb.fit(X_train_scaled, y_train)

# Predictions
y_pred_bsmote = pipeline_bsmote_xgb.predict(X_test_scaled)

# Evaluation
print(classification_report(y_test, y_pred_bsmote))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00    368526
           1       0.80      0.86      0.83      1953

    accuracy                           1.00    370479
   macro avg       0.90      0.93      0.91    370479
weighted avg       1.00      1.00      1.00    370479


## XGBoost with Hyperparameter Tuning

In [18]:
from sklearn.model_selection import GridSearchCV

# Parameter grid for XGBoost
param_grid = {
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Initialize the XGBoost classifier
xgb_classifier = XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42)

# Initialize the GridSearchCV object
grid_search = GridSearchCV(estimator=xgb_classifier, param_grid=param_grid, scoring='recall', n_jobs=-1, cv=3, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and best score
print("Best parameters found: ", grid_search.best_params_)
print("Best recall found: ", grid_search.best_score_)


Fitting 3 folds for each of 324 candidates, totalling 972 fits
Best parameters found:  {'colsample_bytree': 0.8, 'gamma': 0.5, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 1.0}
Best recall found:  0.7929332294102366
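
The "324 candidates, totalling 972 fits" line in the output follows directly from the size of the parameter grid. The short check below reproduces that arithmetic with the standard library only; it copies `param_grid` from the cell above and does not run any training.

```python
import math

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# An exhaustive grid search tries every combination of values once per fold.
n_candidates = math.prod(len(values) for values in param_grid.values())
cv_folds = 3
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```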


## CatBoost
Documentation: https://catboost.ai/en/docs/

In [32]:
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
data = pd.read_csv('../transactions.csv', header=[0], index_col=[0])


data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'])
data['trans_date'] = data['trans_date_trans_time'].dt.date
data['trans_date'] = pd.to_datetime(data['trans_date'])
data['dob']=pd.to_datetime(data['dob'])
data["age"] = data["trans_date"]-data["dob"]
data["age"] = data["age"].astype('int64')
data['trans_month'] = pd.DatetimeIndex(data['trans_date']).month
data['trans_year'] = pd.DatetimeIndex(data['trans_date']).year
data['month_name'] = data['trans_month'].apply(lambda x: calendar.month_abbr[x])
data['lat_dist_diff'] = abs(round(data['merch_lat']-data['lat'],3))
data['long_dist_diff'] = abs(round(data['merch_long']-data['long'],3))
data['city_state'] = data['city'] + ', ' + data['state']

data['trans_hour'] = data['trans_date_trans_time'].dt.hour
data['trans_minute'] = data['trans_date_trans_time'].dt.minute

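# NOTE: `bins` and `labels` are not defined anywhere in this notebook as exported.
# The values below are assumed for illustration (four six-hour blocks of the day).
bins = [0, 6, 12, 18, 24]
labels = ['night', 'morning', 'afternoon', 'evening']
data['time_category'] = pd.cut(data['trans_hour'], bins=bins, labels=labels, right=False)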

data = data.drop(['cc_num','first','last','street','trans_num','trans_date_trans_time','city','lat','long','dob','merch_lat',
'merch_long','trans_date','month_name'],axis=1)

# Define categorical features
categorical_features = ['city_state', 'zip', 'job', 'merchant', 'sex', 'category', 'state', 'time_category']

# Split the data into features (X) and target variable (y)
X = data.drop(['is_fraud'], axis=1)
y = data['is_fraud']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoostClassifier
catboost_model = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.1,
    depth=8,
    loss_function='Logloss',
    eval_metric='AUC',
    use_best_model=True,
    random_seed=42,
    verbose=500
)

# Create a Pool object for the training and test data
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

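# Train the model. The test pool doubles as the evaluation set for use_best_model,
# so the metrics reported below are not computed on fully held-out data.
catboost_model.fit(train_pool, eval_set=test_pool)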

# Predict
y_pred = catboost_model.predict(X_test)

# Evaluate model performance
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

0:	test: 0.7373576	best: 0.7373576 (0)	total: 282ms	remaining: 9m 24s
500:	test: 0.9989258	best: 0.9989258 (500)	total: 5m 22s	remaining: 16m 6s
1000:	test: 0.9992824	best: 0.9992851 (986)	total: 10m 46s	remaining: 10m 45s
1500:	test: 0.9994531	best: 0.9994537 (1496)	total: 16m 17s	remaining: 5m 24s
1999:	test: 0.9994817	best: 0.9994833 (1985)	total: 21m 40s	remaining: 0us

bestTest = 0.9994832724
bestIteration = 1985

Shrink model to first 1986 iterations.
Accuracy: 0.9992874090029394
Precision: 0.9768492377188029
Recall: 0.8858166922683052
F1 Score: 0.9291084854994629


In [34]:
# Get feature importance scores
feature_importance = catboost_model.get_feature_importance(train_pool)

# Create a DataFrame to display feature importance scores
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print(feature_importance_df)


           Feature  Importance
2              amt       22.04
1         category       19.69
15      trans_hour       11.06
9              age        8.70
8        unix_time        7.83
17   time_category        4.99
6         city_pop        4.46
0         merchant        3.69
16    trans_minute        2.47
13  long_dist_diff        2.41
7              job        2.24
12   lat_dist_diff        2.21
3              sex        1.93
10     trans_month        1.84
5              zip        1.82
4            state        1.48
14      city_state        1.13
11      trans_year        0.03


In [12]:

{
    "trans_date_trans_time": "2022-10-12 14:32:21",
    "cc_num": 2703186189652095,
    "merchant": "fraud_Kilback LLC",
    "category": "shopping_net",
    "amt": 105.89,
    "first": "John",
    "last": "Doe",
    "sex": "M",
    "street": "123 Elm Street",
    "city": "Washington",
    "state": "DC",
    "zip": 62704,
    "lat": 39.7817,
    "long": -89.6508,
    "city_pop": 116250,
    "job": "IT trainer",
    "dob": "1990-05-19",
    "trans_num": "e9c2d8a2bb342bc446df5f578cddf8ac",
    "unix_time": 1634055141,
    "merch_lat": 39.7957,
    "merch_long": -89.6433
}
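
A raw record like the one above has to pass through the same feature derivations as the training data before it can be scored. The sketch below shows those derivations for a single record using only the standard library. The function name `derive_features` and the `age_days` field are hypothetical: the notebook itself uses pandas and stores age as a nanosecond count.

```python
from datetime import datetime


def derive_features(txn):
    """Derive the engineered fields used in this notebook from one raw record."""
    ts = datetime.strptime(txn["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S")
    dob = datetime.strptime(txn["dob"], "%Y-%m-%d")
    return {
        "trans_month": ts.month,
        "trans_hour": ts.hour,
        "trans_minute": ts.minute,
        # Days are used here for readability instead of nanoseconds.
        "age_days": (ts.date() - dob.date()).days,
        "lat_dist_diff": abs(round(txn["merch_lat"] - txn["lat"], 3)),
        "long_dist_diff": abs(round(txn["merch_long"] - txn["long"], 3)),
        "city_state": f'{txn["city"]}, {txn["state"]}',
    }


# Subset of the sample record shown above.
sample = {
    "trans_date_trans_time": "2022-10-12 14:32:21",
    "city": "Washington",
    "state": "DC",
    "lat": 39.7817,
    "long": -89.6508,
    "dob": "1990-05-19",
    "merch_lat": 39.7957,
    "merch_long": -89.6433,
}

features = derive_features(sample)
print(features)
```

Encoding and scaling would still have to be applied afterwards with the objects fitted on the training data, which this sketch does not cover.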


## Note

CatBoost performed best on the fraud class, with about 98% precision and 89% recall (F1 of 0.93). However, CatBoost handles categorical features natively and does not require feature scaling, so it needs no encoding or scaling step. Because this case study focuses heavily on feature transformations, XGBoost with default hyperparameters (96% precision, 84% recall, F1 of 0.89) is chosen instead to demonstrate proficiency in feature engineering techniques.
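
For a side-by-side view, the fraud-class metrics printed in the sections above can be collected and ranked. The sketch below does this by F1 score with plain Python. The values are copied from the reports in this notebook (CatBoost rounded to two decimals), and the model labels are shorthand introduced here.

```python
# (model, precision, recall, f1) for the fraud class, copied from the reports above.
results = [
    ("Random Forest (default)", 0.98, 0.72, 0.83),
    ("Random Forest (class weights)", 0.09, 0.97, 0.17),
    ("Random Forest (SMOTE)", 0.86, 0.81, 0.83),
    ("Random Forest (Borderline SMOTE)", 0.89, 0.78, 0.83),
    ("XGBoost (default)", 0.96, 0.84, 0.89),
    ("XGBoost (class weights)", 0.49, 0.96, 0.65),
    ("XGBoost (SMOTE)", 0.72, 0.89, 0.80),
    ("XGBoost (Borderline SMOTE)", 0.80, 0.86, 0.83),
    ("CatBoost", 0.98, 0.89, 0.93),
]

# Rank by F1, highest first; ties keep the order in which they are listed.
ranked = sorted(results, key=lambda row: row[3], reverse=True)

for model, precision, recall, f1 in ranked:
    print(f"{model:<34} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Ranking by F1 is only one possible criterion; a deployment that prioritizes catching fraud over limiting false alarms might rank by recall instead.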