In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Introduction

I've created a GitHub repository for this group project; you can find it by searching for athemaris on GitHub.
The repository is named `FraudDetection_banksim`, and it contains the CSV dataset (`BankSim Dataset.csv`).

In [3]:
# getting the dataset from github
url_df = 'https://raw.githubusercontent.com/athemaris/FraudDetection_banksim/refs/heads/main/BankSim%20Dataset.csv'

banksim_df = pd.read_csv(url_df)

In [4]:
banksim_df.head()

   step     customer  age  gender  zipcodeOri     merchant  zipMerchant              category  amount  fraud
0     0   C583110837    3       M       28007   M480139044        28007             es_health    4426      1
1     0  C1332295774    3       M       28007   M480139044        28007             es_health    3245      1
2     0  C1160421902    3       M       28007   M857378720        28007      es_hotelservices   17632      1
3     0   C966214713    3       M       28007   M857378720        28007      es_hotelservices   33741      1
4     0  C1450140987    4       F       28007  M1198415165        28007  es_wellnessandbeauty   22011      1


In [5]:
# quick look at unique values of categorical variables

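# double quotes outside, single quotes inside: nesting the same quote type in an f-string fails before Python 3.12
print(f"Values age: {banksim_df['age'].unique()}")
print(f"Values gender: {banksim_df['gender'].unique()}")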
print(banksim_df['zipcodeOri'].unique())
print(banksim_df['zipMerchant'].unique())
print(banksim_df['category'].unique())
print(banksim_df['merchant'].unique())

Values age: ['3' '4' '2' '5' '1' '6' '0' 'U']
Values gender: ['M' 'F' 'E' 'U']
[28007]
[28007]
['es_health' 'es_hotelservices' 'es_wellnessandbeauty' 'es_sportsandtoys'
 'es_home' 'es_otherservices' 'es_fashion' 'es_leisure' 'es_travel'
 'es_barsandrestaurants' 'es_tech' 'es_hyper' 'es_transportation'
 'es_food' 'es_contents']
['M480139044' 'M857378720' 'M1198415165' 'M980657600' 'M1741626453'
 'M1535107174' 'M2122776122' 'M209847108' 'M1888755466' 'M547558035'
 'M3697346' 'M1649169323' 'M923029380' 'M495352832' 'M17379832'
 'M1294758098' 'M1748431652' 'M151143676' 'M732195782' 'M840466850'
 'M855959430' 'M50039827' 'M2080407379' 'M2011752106' 'M692898500'
 'M1353266412' 'M1873032707' 'M78078399' 'M933210764' 'M348875670'
 'M348934600' 'M1823072687' 'M1053599405' 'M85975013' 'M349281107'
 'M1352454843' 'M117188757' 'M1946091778' 'M97925176' 'M1842530320'
 'M677738360' 'M1313686961' 'M1600850729' 'M1872033263' 'M1400236507'
 'M1913465890' 'M45060432' 'M1788569036' 'M1416436880' 'M172640

The values in `gender` are:
* E: ENTERPRISE
* U: UNKNOWN
* M: MALE
* F: FEMALE

The value U in `age` also stands for UNKNOWN.

In [6]:
# collect info about the type of each variable

banksim_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594643 entries, 0 to 594642
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   step         594643 non-null  int64 
 1   customer     594643 non-null  object
 2   age          594643 non-null  object
 3   gender       594643 non-null  object
 4   zipcodeOri   594643 non-null  int64 
 5   merchant     594643 non-null  object
 6   zipMerchant  594643 non-null  int64 
 7   category     594643 non-null  object
 8   amount       594643 non-null  object
 9   fraud        594643 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 45.4+ MB


The dtype of one variable is incorrect:
- `amount` is an object and needs to be converted to a float

In [7]:
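# let's convert amount type from object to float by stripping the commas
# note: this treats ',' as a thousands separator; if it is a decimal comma, replace it with '.' instead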

banksim_df['amount'] = banksim_df['amount'].str.replace(',', '').astype(float)

print(banksim_df['amount'].dtype)

float64
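
Removing the comma gives the right number only if the comma is a thousands separator. As a minimal sketch on made-up strings (the values in `raw` are hypothetical and not taken from the dataset), the two readings of the same text differ as follows:

```python
import pandas as pd

# Hypothetical amount strings, for illustration only (not rows from the dataset)
raw = pd.Series(['44,26', '4,5', '1,200'])

# Reading 1: ',' is a thousands separator, so it is simply removed
as_thousands = raw.str.replace(',', '', regex=False).astype(float)

# Reading 2: ',' is a decimal separator, so it becomes '.'
as_decimal = raw.str.replace(',', '.', regex=False).astype(float)

print(pd.DataFrame({'raw': raw, 'thousands': as_thousands, 'decimal': as_decimal}))
```

It may be worth checking a few raw values against the data source before settling on one reading. `pd.read_csv` also accepts `thousands` and `decimal` arguments, which let the parser handle either convention at load time.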


## Feature Engineering

To set up the dataset for LightGBM, we first need to make sure that the categorical variables have the `category` dtype. LightGBM can use categorical features directly, but its processing relies on integer codes.
The pandas `category` dtype provides these integer codes.

This conversion has to happen before the feature derivation process.
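
To make the idea of integer codes concrete, here is a small sketch on a toy column (the values are illustrative, not drawn from the dataset). It shows the labels pandas stores once and the code assigned to each row:

```python
import pandas as pd

# Toy column, for illustration only
s = pd.Series(['es_food', 'es_tech', 'es_food', 'es_health']).astype('category')

print(s.cat.categories)  # the distinct labels, sorted alphabetically
print(s.cat.codes)       # the integer code stored for each row
```

The codes depend on the set of categories present when the column is converted, so the same label can receive a different code in a differently filtered DataFrame.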

In [8]:
# creating a working copy of the dataset

df = banksim_df.copy()

In [9]:
# changing the type of categorical variables to 'category' for LightGBM

for c in ['customer','age','gender','merchant','category']:
    if c in df.columns:
        df[c] = df[c].astype('category')



In [10]:
# check if the conversion worked properly

print(df.dtypes)

step              int64
customer       category
age            category
gender         category
zipcodeOri        int64
merchant       category
zipMerchant       int64
category       category
amount          float64
fraud             int64
dtype: object


The new dataset `df` contains all the original variables correctly converted. 

### Features Building

In order to extract some useful data from the raw dataset, the following variables will be computed:
* `log_amount`: stabilizes skewed amounts
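* `cust_txn_count_prev`: number of transactions the customer made before the current one
* `cust_amt_mean_prev`: customer’s average amount over all previous transactions (0 if there are none)
* `amt_minus_cust_mean_diff`: difference between the current amount and the customer's historical mean amount
* `time_since_last_cust_txn`: steps since the customer’s last transaction (0 if there is none)
* `time_since_last_cust_merchant`: steps since the customer last used this merchant (0 if never)
* `cust_merchant_txn_count_prev`: number of previous transactions by the customer at this merchant (familiarity with the merchant)
* `cust_category_count_prev`: number of previous transactions by the customer in this category (familiarity with the category)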

Before doing so, the dataset will be sorted so every feature is computed on the same sorted timeline. 

In [11]:
# adding a unique row identifier (_row_id) to save the original order of the dataset

df['_row_id'] = np.arange(len(df), dtype=np.int64)
print(df['_row_id'].head())


0    0
1    1
2    2
3    3
4    4
Name: _row_id, dtype: int64


After preserving the original row order, we now sort the working dataset by `['customer', 'step', 'merchant', 'category']`. 
This establishes a time-aware sequence per customer so ‘previous’ features only use past information.
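
A toy example may help show why the sort matters. In the sketch below (illustrative values only), `cumcount` on unsorted rows follows row order rather than time, while the sorted version counts the transactions that really came earlier:

```python
import pandas as pd

# Toy transactions in arbitrary order (illustrative values only)
toy = pd.DataFrame({
    'customer': ['A', 'B', 'A', 'A', 'B'],
    'step':     [5,   1,   2,   9,   4],
})

# Without sorting, cumcount follows row order, not time
toy['count_unsorted'] = toy.groupby('customer').cumcount()

# After sorting by customer and step, cumcount equals the number of earlier transactions
toy_sorted = toy.sort_values(['customer', 'step']).copy()
toy_sorted['count_sorted'] = toy_sorted.groupby('customer').cumcount()

print(toy_sorted)
```

For customer A at step 5, the unsorted count is 0 even though a transaction at step 2 exists; the sorted count is 1.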


In [12]:
# creating a sorted version of the dataset for a robust feature engineering
df_sort = df.sort_values(['customer', 'step', 'merchant', 'category']).copy()

# this way the dataset is sorted by customer, step, merchant and category
# it is useful because now we can group by customer and step to see all transactions made by a customer at a certain step

In [13]:
# let's add the first new variable: log_amount
# it is the logarithm of the amount + 1 (to avoid log(0))

df_sort['log_amount'] = np.log1p(df_sort['amount'])

# let's check the new variable and compare it to amount

print(df_sort[['amount', 'log_amount']].head())
print(f"\n Here's a description of amount: \n {df_sort['amount'].describe()}")
print(f"\n Here's a description of log_amount: \n {df_sort['log_amount'].describe()}")

         amount  log_amount
86563   14387.0    9.574150
111048   1669.0    7.420579
122828   5618.0    8.633909
125868   1474.0    7.296413
130310   4742.0    8.464425

 Here's a description of amount: 
 count    594643.000000
mean       3452.935302
std       10674.403405
min           0.000000
25%         951.000000
50%        2399.000000
75%        4047.000000
max      832996.000000
Name: amount, dtype: float64

 Here's a description of log_amount: 
 count    594643.000000
mean          7.465628
std           1.293303
min           0.000000
25%           6.858565
50%           7.783224
75%           8.305978
max          13.632785
Name: log_amount, dtype: float64
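
The effect of the transform can be checked on synthetic data. The sketch below uses log-normal random numbers as a stand-in for skewed amounts (an assumption for illustration, not a claim about the real distribution) and compares skewness before and after `np.log1p`:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (log-normal), for illustration only
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=10_000))

log_amount = np.log1p(amount)

# Skewness is much smaller after the transform
print(f"skew(amount)     = {amount.skew():.2f}")
print(f"skew(log_amount) = {log_amount.skew():.2f}")

# log1p maps 0 to 0 and is inverted by expm1
print(np.log1p(0.0))
print(np.allclose(np.expm1(log_amount), amount))
```

Because `np.expm1` inverts the transform, values on the original scale can always be recovered.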


In [14]:
# adding cust_txn_count_prev: number of previous transactions by the customer

df_sort['cust_txn_count_prev'] = df_sort.groupby('customer', observed=True).cumcount().astype('int64')

print(df_sort[['customer', 'cust_txn_count_prev']].head())


           customer  cust_txn_count_prev
86563   C1000148617                    0
111048  C1000148617                    1
122828  C1000148617                    2
125868  C1000148617                    3
130310  C1000148617                    4


In [15]:
# adding cust_amt_mean_prev: customer's historical mean amount up to previous txn

prev_sum  = df_sort.groupby('customer', observed=True)['amount'].cumsum() - df_sort['amount'] # cumulative sum of amounts per customer, excluding current txn
prev_count = df_sort.groupby('customer', observed=True).cumcount() # number of previous transactions per customer

# calculating the mean amount of previous transactions
df_sort['cust_amt_mean_prev'] = (
    prev_sum / prev_count.replace(0, np.nan)
).fillna(0).astype('float64').round(3)  # replacing NaN with 0 for customers with no previous transactions

# checking the new variable
print(df_sort[['customer', 'amount', 'cust_amt_mean_prev']].head(10))

           customer   amount  cust_amt_mean_prev
86563   C1000148617  14387.0               0.000
111048  C1000148617   1669.0           14387.000
122828  C1000148617   5618.0            8028.000
125868  C1000148617   1474.0            7224.667
130310  C1000148617   4742.0            5787.000
132931  C1000148617    171.0            5578.000
135609  C1000148617   3479.0            4676.833
137616  C1000148617   5514.0            4505.714
141258  C1000148617   1484.0            4631.750
143862  C1000148617   1323.0            4282.000


In [16]:
# descriptive statistics of the new variable cust_amt_mean_prev

df_sort['cust_amt_mean_prev'].describe()

count    594643.000000
mean       3736.603881
std        4962.501779
min           0.000000
25%        2605.057000
50%        2872.771000
75%        3313.141000
max      614259.000000
Name: cust_amt_mean_prev, dtype: float64
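
As a sanity check on the cumulative-sum formula, the same quantity can be derived a second way: shift each customer's amounts by one row and take an expanding mean. The sketch below compares both on toy data (illustrative values only, assumed already sorted by customer and time):

```python
import numpy as np
import pandas as pd

# Toy data, already sorted by customer and time (illustrative values only)
toy = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'B', 'B'],
    'amount':   [10.0, 20.0, 60.0, 5.0, 15.0],
})

g = toy.groupby('customer')['amount']

# Approach used above: (cumulative sum - current amount) / number of previous rows
prev_sum = g.cumsum() - toy['amount']
prev_count = toy.groupby('customer').cumcount()
mean_a = (prev_sum / prev_count.replace(0, np.nan)).fillna(0)

# Alternative: shift by one row within each customer, then take an expanding mean
mean_b = g.transform(lambda s: s.shift(1).expanding().mean()).fillna(0)

print(pd.DataFrame({'cumsum_based': mean_a, 'shift_based': mean_b}))
```

Both columns agree on this toy input. The cumulative-sum version avoids a Python-level lambda, which tends to matter on a dataset with many customers.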

In [17]:
# adding amt_minus_cust_mean_diff: difference between current amount and customer's historical mean amount

df_sort['amt_minus_cust_mean_diff'] = df_sort['amount'] - df_sort['cust_amt_mean_prev'].astype('float64')
print(df_sort[['amount', 'cust_amt_mean_prev', 'amt_minus_cust_mean_diff']].head())

         amount  cust_amt_mean_prev  amt_minus_cust_mean_diff
86563   14387.0               0.000                 14387.000
111048   1669.0           14387.000                -12718.000
122828   5618.0            8028.000                 -2410.000
125868   1474.0            7224.667                 -5750.667
130310   4742.0            5787.000                 -1045.000


In [18]:
# adding time_since_last_cust_txn: time since last transaction by the customer

df_sort['time_since_last_cust_txn'] = (
    df_sort['step'] - df_sort.groupby('customer', observed=True)['step'].shift(1)
).fillna(0).astype('float64') #fillna(0) for customers with no previous transactions

print(df_sort[['customer', 'step', 'time_since_last_cust_txn']].head(10))

           customer  step  time_since_last_cust_txn
86563   C1000148617    30                       0.0
111048  C1000148617    38                       8.0
122828  C1000148617    42                       4.0
125868  C1000148617    43                       1.0
130310  C1000148617    44                       1.0
132931  C1000148617    45                       1.0
135609  C1000148617    46                       1.0
137616  C1000148617    47                       1.0
141258  C1000148617    48                       1.0
143862  C1000148617    49                       1.0
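
One consequence of `fillna(0)` is that a customer's first transaction looks the same as a transaction made in the same step as the previous one. A possible variant, sketched below on toy data, keeps the missing value and adds an explicit flag; the column name `is_first_cust_txn` is hypothetical and not part of the original feature set:

```python
import pandas as pd

# Toy data sorted by customer and step (illustrative values only)
toy = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'B'],
    'step':     [3,   3,   7,   2],
})

delta = toy['step'] - toy.groupby('customer')['step'].shift(1)

# Variant used above: first transactions get 0, same as a same-step repeat
toy['time_since_last_filled'] = delta.fillna(0)

# Alternative: keep NaN for "no previous transaction" and flag it explicitly
toy['time_since_last_nan'] = delta
toy['is_first_cust_txn'] = delta.isna().astype('int8')  # hypothetical helper column

print(toy)
```

Whether this helps is an empirical question; LightGBM can handle missing values in numeric features, so either encoding is usable.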


In [19]:
# adding time_since_last_cust_merchant: time since last transaction by the customer at this merchant

df_sort['time_since_last_cust_merchant'] = (
    df_sort['step'] - df_sort.groupby(['customer', 'merchant'], observed=True)['step'].shift(1)
).fillna(0).astype('float64') #fillna(0) for customers with no previous transactions at this merchant

print(df_sort[['customer', 'merchant', 'step', 'time_since_last_cust_merchant']].head(10))

           customer     merchant  step  time_since_last_cust_merchant
86563   C1000148617  M1888755466    30                            0.0
111048  C1000148617  M1741626453    38                            0.0
122828  C1000148617  M1888755466    42                           12.0
125868  C1000148617   M840466850    43                            0.0
130310  C1000148617  M1823072687    44                            0.0
132931  C1000148617  M1823072687    45                            1.0
135609  C1000148617  M1823072687    46                            1.0
137616  C1000148617  M1823072687    47                            1.0
141258  C1000148617    M85975013    48                            0.0
143862  C1000148617  M1823072687    49                            2.0


In [20]:
# adding cust_merchant_txn_count_prev: number of previous transactions by the customer at this merchant

df_sort['cust_merchant_txn_count_prev'] = df_sort.groupby(['customer', 'merchant'], observed=True).cumcount().astype('int64')

print(df_sort[['customer', 'merchant', 'cust_merchant_txn_count_prev']].head())

           customer     merchant  cust_merchant_txn_count_prev
86563   C1000148617  M1888755466                             0
111048  C1000148617  M1741626453                             0
122828  C1000148617  M1888755466                             1
125868  C1000148617   M840466850                             0
130310  C1000148617  M1823072687                             0


In [21]:
# adding cust_category_count_prev: number of previous transactions by the customer in this category

df_sort['cust_category_count_prev'] = df_sort.groupby(['customer', 'category'], observed=True).cumcount().astype('int64')

print(df_sort[['customer', 'category', 'cust_category_count_prev']].head(10))


           customer           category  cust_category_count_prev
86563   C1000148617   es_otherservices                         0
111048  C1000148617   es_sportsandtoys                         0
122828  C1000148617   es_otherservices                         1
125868  C1000148617            es_tech                         0
130310  C1000148617  es_transportation                         0
132931  C1000148617  es_transportation                         1
135609  C1000148617  es_transportation                         2
137616  C1000148617  es_transportation                         3
141258  C1000148617            es_food                         0
143862  C1000148617  es_transportation                         4


### Resulting dataset after building the new features

Now we have all the new variables in the dataset `df_sort`, which is still sorted by customer and step.
The next step is to restore the original row order, as in `df`, using `_row_id`.

In [22]:
# restoring the original order of the dataset using _row_id
df_final = df_sort.sort_values('_row_id').copy()
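
The round trip (sort, derive features, restore order) can be verified on a toy frame. This sketch uses illustrative values and checks that the original columns come back unchanged:

```python
import numpy as np
import pandas as pd

# Toy frame in its "original" order (illustrative values only)
toy = pd.DataFrame({'customer': ['B', 'A', 'B', 'A'], 'step': [2, 5, 1, 3]})
toy['_row_id'] = np.arange(len(toy), dtype=np.int64)

# Sort for feature engineering, derive a feature, then restore the original order
toy_sorted = toy.sort_values(['customer', 'step']).copy()
toy_sorted['cust_txn_count_prev'] = toy_sorted.groupby('customer').cumcount()
restored = toy_sorted.sort_values('_row_id')

# The original columns come back in their original order
print(restored[['customer', 'step']].equals(toy[['customer', 'step']]))
print(restored)
```

Since `sort_values` keeps the original index labels, `sort_index()` would restore the order as well when the index is the default range; the explicit `_row_id` column stays valid even if the index is reset along the way.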

# LightGBM

In [23]:
# dividing the dataset into features and target variable

X = df_final.drop(columns=['fraud', '_row_id'])
y = df_final['fraud']

In [24]:
# importing all necessary libraries for model training and lightGBM

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    accuracy_score, precision_score, recall_score, f1_score,
)

In [25]:
# splitting the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) 

# `stratify=y` to maintain the same proportion of fraud and non-fraud cases in both sets
# useful for imbalanced datasets
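
The effect of `stratify` can be seen on synthetic labels. The sketch below generates roughly 1% positives (an arbitrary rate chosen for illustration, not the dataset's fraud rate) and compares the positive rate across the splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: about 1% positives (illustrative only)
rng = np.random.default_rng(42)
y_toy = (rng.random(20_000) < 0.01).astype(int)
X_toy = rng.normal(size=(20_000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, the positive rate is nearly identical in both splits
print(f"overall: {y_toy.mean():.4f}  train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```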

Before training, the data needs to be converted into a LightGBM `Dataset` object, the structure that the training API is designed to operate on.

LightGBM does not search for splits over the raw continuous feature values. Instead, it discretizes each continuous feature into histogram bins (an approach known as the **histogram-based algorithm**).
This binning reduces memory usage and speeds up training, because the algorithm only needs to evaluate the boundaries of a limited number of discrete bins.
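
The binning idea can be illustrated with plain NumPy. The sketch below is not LightGBM's internal code; it uses simple quantile bins on a synthetic feature to show how many candidate thresholds remain after discretization:

```python
import numpy as np

# Synthetic continuous feature (illustrative only)
rng = np.random.default_rng(0)
x = rng.exponential(scale=100.0, size=100_000)

n_bins = 16  # small for readability; LightGBM's `max_bin` defaults to 255

# Quantile-based bin edges, then map each value to its bin index
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.digitize(x, edges)

# Per-bin counts: the kind of summary a histogram-based learner scans for split points
hist = np.bincount(bin_idx, minlength=n_bins)

print(f"distinct raw values:        {np.unique(x).size}")
print(f"distinct bins:              {np.unique(bin_idx).size}")
print(f"candidate split thresholds: {edges.size}")
```

A split search over the raw values would have to consider up to one threshold per distinct value, whereas the binned version considers only the bin edges.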

In [26]:
# creating LightGBM datasets for training and testing

train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data) #reference to training data for consistent binning

Now we need to define the LightGBM parameters, but before doing so we can identify a good value for `max_depth` using k-fold cross-validation.