In [1]:
!pip install gdown

Collecting gdown
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Installing collected packages: gdown
Successfully installed gdown-4.7.1


In [2]:
import gdown

url = 'https://drive.google.com/uc?id=1syawlOQLjgMkmA-Ck28VsuAS8JmLdnEf'

output = 'Fraud_Detection_Dataset.csv'

gdown.download(url, output, quiet=False)

Downloading...
From (uriginal): https://drive.google.com/uc?id=1syawlOQLjgMkmA-Ck28VsuAS8JmLdnEf
From (redirected): https://drive.google.com/uc?id=1syawlOQLjgMkmA-Ck28VsuAS8JmLdnEf&confirm=t&uuid=23301d6d-98cf-488e-990b-36c4bc628d9e
To: /kaggle/working/Fraud_Detection_Dataset.csv
100%|██████████| 1.65G/1.65G [00:07<00:00, 218MB/s]


'Fraud_Detection_Dataset.csv'

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

In [4]:
df = pd.read_csv('/kaggle/working/Fraud_Detection_Dataset.csv')

In [5]:
df

Unnamed: 0,Transaction ID,User ID,Transaction Amount,Transaction Date and Time,Merchant ID,Payment Method,Country Code,Transaction Type,Device Type,IP Address,...,User's Transaction History,Merchant's Reputation Score,User's Device Location,Transaction Currency,Transaction Purpose,User's Credit Score,User's Email Domain,Merchant's Business Age,Transaction Authentication Method,Fraudulent Flag
0,51595306,9822,163.08,2023-01-02 07:47:54,4044,ACH Transfer,KOR,Charity,GPS Device,42.23.223.120,...,26,2.71,United Kingdom,NOK,Consultation Fee,343,cox.co.uk,3,Bluetooth Authentication,0
1,85052974,4698,430.74,2021-09-12 15:15:41,4576,2Checkout,VNM,Cashback,Medical Device,39.52.212.120,...,60,3.95,Mexico,EGP,Cashback Reward,688,gmail.com,13,NFC Tag,1
2,23954324,8666,415.74,2023-01-12 17:25:58,4629,Google Wallet,MEX,Reward,Vehicle Infotainment System,243.180.236.29,...,81,3.81,Qatar,MXN,Acquisition,371,rocketmail.com,7,Token,1
3,44108303,9012,565.89,2021-02-27 11:31:00,3322,Check,SGP,Purchase,Kiosk,212.186.227.14,...,18,2.67,Spain,CLP,Loan Repayment,687,roadrunner.co.uk,15,Time-Based OTP,1
4,66622683,5185,955.49,2022-09-24 04:06:38,7609,Worldpay,HKG,Acquisition,Smart Mirror,166.113.10.199,...,98,3.19,Israel,RUB,Dividend Reinvestment,605,protonmail.co.uk,17,Password,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5999995,61037029,7480,448.99,2021-10-20 15:56:32,3346,Discover,SGP,Scholarship,Server,255.134.160.201,...,34,2.78,Russia,CHF,Invoice Payment,679,aim.com,14,Retina Scan,0
5999996,56515851,5636,841.39,2021-06-14 02:10:00,8415,Alipay,ZAF,Loan,Digital Camera,48.190.84.14,...,80,2.60,Malaysia,HUF,Membership,706,cox.net,10,Social Media Login,1
5999997,66863972,5554,197.28,2021-11-06 22:33:19,4231,Afterpay,CAN,Service Charge,Barcode Scanner,7.21.196.39,...,12,1.35,Egypt,HKD,Admission,310,live.co.uk,14,Mobile App Notification,0
5999998,13449701,1275,358.33,2022-03-13 15:02:35,9614,JCB,UK,Fine,Robot,211.202.242.100,...,57,1.29,China,AED,Expense Reimbursement,460,rediffmail.com,16,Authentication App,0


Due to limited RAM in my Kaggle cloud environment, I needed to work with a smaller subset of the original dataset. To achieve this, I took the first 120,000 rows of the roughly six million in the full file: the first 60,000 rows as one sample and the next 60,000 rows as a second sample. This subset was chosen so that it fits comfortably within the available memory.

This downsized dataset allows me to perform data exploration, model development, and testing without running into memory-related issues. Because the rows are taken from the top of the file rather than drawn at random, the subset reflects the full dataset only if the file has no meaningful row order. It is nonetheless suitable for prototyping and experimentation.
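
As a minimal sketch of how a random subset could be drawn and split into two disjoint samples, the following uses `DataFrame.sample` on a small stand-in frame. The toy data, the scaled-down sizes, and the seed are assumptions for illustration only; the cell that follows slices rows by position instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset (the real file has millions of rows).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "Transaction Amount": rng.uniform(1, 1000, size=1000).round(2),
    "Fraudulent Flag": rng.integers(0, 2, size=1000),
})

train_size = 60  # illustrative sizes, scaled down from 60,000
test_size = 60

# Draw train_size + test_size rows at random, without replacement.
subset = toy_df.sample(n=train_size + test_size, random_state=42)

# Split the shuffled subset into two disjoint samples.
train_sample = subset.iloc[:train_size]
test_sample = subset.iloc[train_size:]

print(len(train_sample), len(test_sample))
```

Sampling once and then splitting by position guarantees that no row lands in both samples, which two independent calls to `sample` would not.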

In [6]:
# Specify the number of samples for training and testing
train_size = 60000
test_size = 60000

# Check that the dataset has enough samples
if len(df) < (train_size + test_size):
    raise ValueError("Not enough samples in the dataset for the specified train and test sizes.")

# Create the training sample by slicing the first 'train_size' samples
train_sample = df[:train_size]

# Create the testing sample by slicing the next 'test_size' samples
test_sample = df[train_size:train_size+test_size]

# save the sample data to new CSV files
train_sample.to_csv('train_sample.csv', index=False)
test_sample.to_csv('test_sample.csv', index=False)


In [7]:
train_df = pd.read_csv('/kaggle/working/train_sample.csv')
test_df = pd.read_csv('/kaggle/working/test_sample.csv')

In [8]:
train_df

Unnamed: 0,Transaction ID,User ID,Transaction Amount,Transaction Date and Time,Merchant ID,Payment Method,Country Code,Transaction Type,Device Type,IP Address,...,User's Transaction History,Merchant's Reputation Score,User's Device Location,Transaction Currency,Transaction Purpose,User's Credit Score,User's Email Domain,Merchant's Business Age,Transaction Authentication Method,Fraudulent Flag
0,51595306,9822,163.08,2023-01-02 07:47:54,4044,ACH Transfer,KOR,Charity,GPS Device,42.23.223.120,...,26,2.71,United Kingdom,NOK,Consultation Fee,343,cox.co.uk,3,Bluetooth Authentication,0
1,85052974,4698,430.74,2021-09-12 15:15:41,4576,2Checkout,VNM,Cashback,Medical Device,39.52.212.120,...,60,3.95,Mexico,EGP,Cashback Reward,688,gmail.com,13,NFC Tag,1
2,23954324,8666,415.74,2023-01-12 17:25:58,4629,Google Wallet,MEX,Reward,Vehicle Infotainment System,243.180.236.29,...,81,3.81,Qatar,MXN,Acquisition,371,rocketmail.com,7,Token,1
3,44108303,9012,565.89,2021-02-27 11:31:00,3322,Check,SGP,Purchase,Kiosk,212.186.227.14,...,18,2.67,Spain,CLP,Loan Repayment,687,roadrunner.co.uk,15,Time-Based OTP,1
4,66622683,5185,955.49,2022-09-24 04:06:38,7609,Worldpay,HKG,Acquisition,Smart Mirror,166.113.10.199,...,98,3.19,Israel,RUB,Dividend Reinvestment,605,protonmail.co.uk,17,Password,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,15897278,3856,7.82,2022-02-18 23:51:43,5913,Check,ITA,Loan,Industrial Controller,6.244.14.65,...,25,1.31,Italy,ILS,Auction Bid,476,live.com,19,Time-Based OTP,0
59996,49545040,8997,418.72,2022-04-12 21:02:28,9556,Cash,ARG,Reimbursement,Home Security System,246.61.109.217,...,67,3.48,China,SAR,Retail Purchase,709,verizon.co.uk,10,Face ID,1
59997,68379028,1664,882.16,2023-05-25 05:39:06,4843,Cash,HKG,Compensation,Smartwatch,144.130.16.105,...,33,1.90,Argentina,HUF,Admission,467,yandex.com,11,QR Code,0
59998,29067316,3325,143.11,2023-03-29 15:56:40,3895,Debit Card,IND,Charity,Industrial Controller,149.80.189.157,...,82,3.15,Taiwan,CHF,Payout,837,outlook.co.uk,6,QR Code,0


In [10]:
test_df

Unnamed: 0,Transaction ID,User ID,Transaction Amount,Transaction Date and Time,Merchant ID,Payment Method,Country Code,Transaction Type,Device Type,IP Address,...,User's Transaction History,Merchant's Reputation Score,User's Device Location,Transaction Currency,Transaction Purpose,User's Credit Score,User's Email Domain,Merchant's Business Age,Transaction Authentication Method,Fraudulent Flag
0,69414003,9939,929.19,2022-04-08 13:14:54,1091,Discover,QAT,Donation,E-Reader,216.97.124.253,...,53,3.77,United Arab Emirates,AED,Compensation,638,aol.com,20,Certificate-based Authentication,1
1,49235678,8112,403.19,2021-07-23 22:02:18,7760,Klarna,UK,Transfer,Kiosk,107.155.189.168,...,7,1.07,Hong Kong,QAR,Buyback,338,rocketmail.co.uk,13,Signature Verification,1
2,40696190,4844,669.28,2021-11-21 18:36:11,8096,Discover,NOR,Tax,Vending Machine,68.74.184.0,...,91,1.92,Hong Kong,TRY,Retail Purchase,384,gmail.com,1,Iris Scan,1
3,52083541,8862,558.31,2022-01-26 19:41:21,2397,Stripe,IND,Contribution,POS Terminal,182.49.242.200,...,29,1.18,Malaysia,RUB,Cashback Reward,666,aol.com,15,Voice Recognition,1
4,12933782,2687,691.77,2021-07-04 11:56:34,2849,Klarna,NOR,Settlement,GPS Device,92.93.32.160,...,91,3.77,Poland,USD,Scholarship,645,aim.com,3,QR Code,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59995,71652081,4455,519.72,2023-01-21 16:08:53,3721,Skrill,KOR,Buyback,Laptop,156.220.192.210,...,99,3.10,Canada,EGP,Acquisition,549,gmail.co.uk,10,Voiceprint,0
59996,28744902,8455,957.31,2021-04-29 13:00:33,3544,Cryptocurrency Wallet,IDN,Transfer,Gaming Console,223.248.110.39,...,49,4.84,Turkey,INR,Registration Fee,303,rocketmail.com,2,Smart Card,1
59997,42650837,3826,783.36,2023-06-01 13:01:56,5217,Google Wallet,ZAF,Auction,Vehicle Infotainment System,244.55.87.164,...,10,1.15,Egypt,VND,Invoice Payment,721,aol.com,6,CAPTCHA,1
59998,96541074,6937,983.64,2022-06-01 03:00:52,7683,Neteller,UAE,Cashback,Vending Machine,122.53.140.151,...,57,2.59,Greece,KES,Transfer to Family,597,tutanota.com,20,Behavioral Analytics,1


In [11]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 32 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Transaction ID                     60000 non-null  int64  
 1   User ID                            60000 non-null  int64  
 2   Transaction Amount                 60000 non-null  float64
 3   Transaction Date and Time          60000 non-null  object 
 4   Merchant ID                        60000 non-null  int64  
 5   Payment Method                     60000 non-null  object 
 6   Country Code                       60000 non-null  object 
 7   Transaction Type                   60000 non-null  object 
 8   Device Type                        60000 non-null  object 
 9   IP Address                         60000 non-null  object 
 10  Browser Type                       60000 non-null  object 
 11  Operating System                   60000 non-null  obj

In [12]:
# Concatenate the datasets vertically (along rows)
concat_data = pd.concat([train_df, test_df], axis=0)

# Optionally, you can reset the index of the concatenated dataset
concat_data.reset_index(drop=True, inplace=True)

## **Feature Engineering**

In [14]:
# Convert the "Transaction Date and Time" column to datetime type
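# (parse the column of concat_data itself, not of the full df, so the result does not depend on index alignment)
concat_data['Transaction Date and Time'] = pd.to_datetime(concat_data['Transaction Date and Time'])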

# Extract various datetime components
concat_data['Transaction Year'] = concat_data['Transaction Date and Time'].dt.year
concat_data['Transaction Month'] = concat_data['Transaction Date and Time'].dt.month
concat_data['Transaction Day'] = concat_data['Transaction Date and Time'].dt.day
concat_data['Transaction Hour'] = concat_data['Transaction Date and Time'].dt.hour
concat_data['Transaction Minute'] = concat_data['Transaction Date and Time'].dt.minute
concat_data['Transaction Second'] = concat_data['Transaction Date and Time'].dt.second

concat_data.drop(columns=['Transaction Date and Time'], inplace=True)

In [15]:
# List of columns to one-hot encode
categorical_columns = ['Payment Method', 'Country Code', 'Transaction Type', 'Device Type',
                       'Browser Type', 'Operating System', 'Merchant Category', 'User Occupation',
                       'User Gender', 'User Account Status', 'Transaction Status',
                       'Transaction Time of Day', "User's Device Location", 'Transaction Currency',
                       'Transaction Purpose', "User's Email Domain", 'Transaction Authentication Method']

# Perform one-hot encoding for the selected columns
concat_data = pd.get_dummies(concat_data, columns=categorical_columns)

In [16]:
concat_data.drop(['Transaction ID', 'User ID', 'Merchant ID', 'IP Address'], axis=1, inplace=True)

In [17]:
print(concat_data.dtypes)


Transaction Amount                                                   float64
User Age                                                               int64
User Income                                                          float64
Location Distance                                                    float64
Time Taken for Transaction                                           float64
                                                                      ...   
Transaction Authentication Method_Transaction Confirmation Number       bool
Transaction Authentication Method_Two-Factor Authentication             bool
Transaction Authentication Method_USB Security Key                      bool
Transaction Authentication Method_Voice Recognition                     bool
Transaction Authentication Method_Voiceprint                            bool
Length: 582, dtype: object


In [18]:
from sklearn.model_selection import train_test_split

# The concatenated data is already loaded as 'concat_data'

# Split the data into features (X) and target (y)
X = concat_data.drop(columns=['Fraudulent Flag'])  # Features
y = concat_data['Fraudulent Flag']  # Target variable

# Split the data into training and testing sets
test_size = 0.2  # Adjust the test size as needed (e.g., 0.2 for 80% train, 20% test)
random_state = 42  # Set a random seed for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

# Now, you have X_train (features for training), y_train (target for training),
# X_test (features for testing), and y_test (target for testing) ready for model prediction.

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test data
rf_predictions = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, rf_predictions)
print(f"Accuracy: {accuracy:.2f}")

# Generate a classification report
class_report = classification_report(y_test, rf_predictions)
print("Classification Report:\n", class_report)

# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, rf_predictions)
print("Confusion Matrix:\n", conf_matrix)


Accuracy: 0.50
Classification Report:
               precision    recall  f1-score   support

           0       0.49      0.50      0.50     11923
           1       0.50      0.49      0.50     12077

    accuracy                           0.50     24000
   macro avg       0.50      0.50      0.50     24000
weighted avg       0.50      0.50      0.50     24000

Confusion Matrix:
 [[5996 5927]
 [6151 5926]]


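Accuracy: The accuracy of the model is 0.50, which means that it correctly predicted 50% of the instances in the test set. Accuracy is the ratio of correctly predicted instances to the total instances. Because the two classes are almost equally frequent (11,923 vs. 12,077 instances), this is no better than random guessing.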

Classification Report: The classification report provides more detailed information about the model's performance, including precision, recall, and F1-score for each class (0 and 1).

Precision: Precision is the ratio of true positives to the total predicted positives. For class 0, the precision is 0.49, and for class 1, it's 0.50. This indicates how many of the predicted positive cases are actually correct.

Recall: Recall, also known as sensitivity or true positive rate, is the ratio of true positives to the total actual positives. For class 0, the recall is 0.50, and for class 1, it's 0.49. This indicates how many of the actual positive cases were correctly predicted.

F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. For both classes, the F1-score is 0.50.

Support: Support represents the number of instances in each class.

Confusion Matrix: The confusion matrix is a tabular representation of the model's predictions. It shows how many instances were correctly or incorrectly classified.

The top-left value (5996) represents the true negatives (TN), meaning instances correctly classified as class 0.
The top-right value (5927) represents false positives (FP), meaning instances incorrectly classified as class 1.
The bottom-left value (6151) represents false negatives (FN), meaning instances incorrectly classified as class 0.
The bottom-right value (5926) represents true positives (TP), meaning instances correctly classified as class 1.
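
As a quick check, the headline metrics can be recomputed by hand from the four confusion-matrix counts above. This is a minimal sketch in plain Python; the variable names are illustrative, and class 1 is treated as the positive class.

```python
# Confusion-matrix counts reported above (rows: actual, columns: predicted).
tn, fp = 5996, 5927
fn, tp = 6151, 5926

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted class 1 that is truly class 1
recall = tp / (tp + fn)     # share of actual class 1 that was found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.50 precision=0.50 recall=0.49 f1=0.50
```

The results agree with the class 1 row of the classification report.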

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report


# Create a Logistic Regression model
logistic_regression_model = LogisticRegression(random_state=42)

# Train the model on the training data
logistic_regression_model.fit(X_train, y_train)

# Make predictions on the test data
logistic_regression_predictions = logistic_regression_model.predict(X_test)

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, logistic_regression_predictions)
precision = precision_score(y_test, logistic_regression_predictions)
recall = recall_score(y_test, logistic_regression_predictions)
f1 = f1_score(y_test, logistic_regression_predictions)
conf_matrix = confusion_matrix(y_test, logistic_regression_predictions)
class_report = classification_report(y_test, logistic_regression_predictions)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Accuracy: 0.50
Precision: 0.50
Recall: 1.00
F1-score: 0.67
Confusion Matrix:
 [[    0 11923]
 [    0 12077]]
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00     11923
           1       0.50      1.00      0.67     12077

    accuracy                           0.50     24000
   macro avg       0.25      0.50      0.33     24000
weighted avg       0.25      0.50      0.34     24000



Accuracy: The accuracy of the model is 0.50, which means that it correctly predicted 50% of the total instances in the dataset. However, accuracy alone doesn't provide a complete picture of the model's performance.

Precision: The precision for class 0 is 0.00, and for class 1, it's 0.50. Precision is a measure of how many of the predicted positive cases are actually correct. In this case, for class 0, there are no true positives (TP), resulting in a precision of 0.00.

Recall: The recall for class 0 is 0.00, and for class 1, it's 1.00. Recall, also known as sensitivity, is the ratio of true positives (TP) to the total actual positives. In this case, for class 0, there are no true positives, leading to a recall of 0.00, while for class 1, all actual positives were correctly predicted, resulting in a recall of 1.00.

F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. For class 0, the F1-score is 0.00, and for class 1, it's 0.67.

Confusion Matrix: The confusion matrix provides details about how the model's predictions compare to the actual class labels.

For class 0, there are no true positives (0) and a high number of false negatives (11923), resulting in the low precision, recall, and F1-score.
For class 1, there are a high number of true positives (12077) and no false positives, leading to a relatively higher precision, recall, and F1-score.
Classification Report: The classification report summarizes the model's performance for each class (0 and 1) as well as some overall metrics.

The precision, recall, and F1-score for class 0 are low due to the absence of true positives.
The precision for class 1 is relatively better because all positive predictions belong to this class.
The accuracy is 0.50, indicating that the model correctly classified half of the instances, primarily driven by the correct classification of class 1 instances.
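
The pattern in these numbers is what any classifier that always outputs class 1 would produce on this test set. A minimal sketch, using only the class supports from the report (the variable names are illustrative):

```python
# Class supports from the classification report above.
n_class0, n_class1 = 11923, 12077

# A classifier that always predicts class 1:
tp = n_class1  # every actual class 1 instance is predicted as 1
fp = n_class0  # every actual class 0 instance is also predicted as 1
tn = fn = 0    # class 0 is never predicted

accuracy = (tp + tn) / (n_class0 + n_class1)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.2f} {precision:.2f} {recall:.2f} {f1:.2f}")
# 0.50 0.50 1.00 0.67
```

These values match the logistic regression output, so the fitted model is behaving like a constant predictor here.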

In [19]:
from catboost import CatBoostClassifier

# Create a CatBoostClassifier
catboost_model = CatBoostClassifier(iterations=1000, depth=6, learning_rate=0.1, loss_function='Logloss', random_seed=42)

# Train the model on the training data
catboost_model.fit(X_train, y_train)

# Make predictions on the test data
catboost_predictions = catboost_model.predict(X_test)

# Calculate and print evaluation metrics
accuracy = accuracy_score(y_test, catboost_predictions)
conf_matrix = confusion_matrix(y_test, catboost_predictions)
class_report = classification_report(y_test, catboost_predictions)

print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

0:	learn: 0.6930375	total: 78.8ms	remaining: 1m 18s
1:	learn: 0.6929497	total: 101ms	remaining: 50.5s
2:	learn: 0.6928583	total: 122ms	remaining: 40.5s
3:	learn: 0.6927699	total: 143ms	remaining: 35.7s
4:	learn: 0.6926467	total: 165ms	remaining: 32.8s
5:	learn: 0.6925880	total: 186ms	remaining: 30.9s
6:	learn: 0.6924867	total: 207ms	remaining: 29.4s
7:	learn: 0.6924123	total: 233ms	remaining: 28.8s
8:	learn: 0.6923226	total: 257ms	remaining: 28.3s
9:	learn: 0.6922397	total: 281ms	remaining: 27.8s
10:	learn: 0.6921525	total: 301ms	remaining: 27.1s
11:	learn: 0.6920871	total: 326ms	remaining: 26.8s
12:	learn: 0.6920156	total: 346ms	remaining: 26.3s
13:	learn: 0.6919498	total: 367ms	remaining: 25.8s
14:	learn: 0.6918734	total: 391ms	remaining: 25.7s
15:	learn: 0.6917876	total: 417ms	remaining: 25.6s
16:	learn: 0.6917172	total: 440ms	remaining: 25.4s
17:	learn: 0.6916482	total: 461ms	remaining: 25.1s
18:	learn: 0.6915587	total: 484ms	remaining: 25s
19:	learn: 0.6914848	total: 504ms	remaini

Accuracy: The accuracy of the model is 0.50, which means it correctly predicted 50% of the total instances in the dataset.

Confusion Matrix:

For class 0, there are 5,604 true positives (TP) and 6,319 false negatives (FN).
For class 1, there are 6,625 true positives (TP) and 6,452 false negatives (FN).
The confusion matrix provides details about how the model's predictions compare to the actual class labels.

Classification Report:

For class 0:

Precision: 0.50 (50% of the instances predicted as class 0 are correct)
Recall: 0.47 (47% of the actual class 0 instances are correctly predicted)
F1-score: 0.48 (the harmonic mean of precision and recall)
Support: 11,923 instances in class 0.

For class 1:

Precision: 0.51 (51% of the instances predicted as class 1 are correct)
Recall: 0.53 (53% of the actual class 1 instances are correctly predicted)
F1-score: 0.52 (the harmonic mean of precision and recall)
Support: 12,077 instances in class 1.

Macro Average:

Macro-average precision, recall, and F1-score are the unweighted means of the per-class values.
In this case, precision, recall, and F1-score are all around 0.50.

Weighted Average:

Weighted-average precision, recall, and F1-score weight each class's value by its support.
These values are also around 0.50; because the two classes are almost the same size, the weighted and macro averages nearly coincide.
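
To make the two kinds of average concrete, here is a small sketch that recomputes them from the rounded per-class precision and recall quoted above. The helper names `macro_avg` and `weighted_avg` are illustrative, not scikit-learn functions.

```python
# Per-class values and supports as quoted in the report above.
support = {0: 11923, 1: 12077}
precision = {0: 0.50, 1: 0.51}
recall = {0: 0.47, 1: 0.53}

def macro_avg(per_class):
    # Unweighted mean over classes.
    return sum(per_class.values()) / len(per_class)

def weighted_avg(per_class, support):
    # Mean over classes, weighted by the number of instances in each class.
    total = sum(support.values())
    return sum(per_class[c] * support[c] for c in per_class) / total

for name, values in [("precision", precision), ("recall", recall)]:
    print(name, round(macro_avg(values), 4), round(weighted_avg(values, support), 4))
```

With supports of 11,923 and 12,077, the class weights are very close to one half each, so the two averages differ only in the fourth decimal place.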