**Model Selection and Algorithm Testing**

In [None]:
import pandas as pd

In [None]:

df = pd.read_csv('cleaned_fraud_data.csv')

In [None]:
# Prepare features (X): drop the raw timestamp, date of birth, and target columns
X = df.drop(columns=['trans_date_trans_time', 'dob', 'is_fraud'])

In [None]:
# Handle categorical columns using one-hot encoding
categorical_columns = ['category', 'gender', 'job_sector', 'Region', 'age_group', 'day_period']
X = pd.get_dummies(X, columns=categorical_columns, drop_first=True, dtype=int)

In [None]:
# Target column
y = df['is_fraud']

In [None]:
y.value_counts()

| is_fraud | count |
|----------|---------|
| 0 | 1042569 |
| 1 | 6006 |
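The counts above imply a fraud rate of roughly 0.57%. As a quick worked check (plain arithmetic on the printed counts, no new data), the snippet below computes that rate together with the accuracy a trivial model would get by labelling every transaction as legitimate. Any model's accuracy should be read against that baseline.

```python
# Class counts taken from the value_counts() output above
n_legit = 1_042_569
n_fraud = 6_006
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial model that labels every transaction as legitimate
baseline_accuracy = n_legit / n_total

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4%}")
print(f"Legitimate transactions per fraud case: {n_legit / n_fraud:.1f}")
```

An accuracy near 99.4% is therefore not evidence of a useful fraud detector on its own, which is why precision, recall, and F1-score are reported for each model.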


In [None]:
from sklearn.model_selection import train_test_split


In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
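With a positive class this rare, a plain random split can leave the train and test sets with slightly different fraud rates. One common option is the `stratify` argument of `train_test_split`, which preserves the class ratio in both parts. The sketch below uses hypothetical toy arrays (`X_toy`, `y_toy`), not the notebook's data, and is not the split used for the results reported here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10,000 rows with a 1% positive (fraud) rate
y_toy = np.array([1] * 100 + [0] * 9_900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the 1% fraud rate in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Train fraud rate: {y_tr.mean():.4f}")
print(f"Test fraud rate:  {y_te.mean():.4f}")
```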

In [None]:
# Standardize the feature data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
# Check the size of each group
fraud_df = df[df['is_fraud'] == 1]
no_fraud_df = df[df['is_fraud'] == 0]
fraud_count = len(fraud_df)
no_fraud_count = len(no_fraud_df)

# Print counts for debugging
print(f"Fraud cases: {fraud_count}, No Fraud cases: {no_fraud_count}")

# Cap each subset size at the number of available rows
fraud_subset_size = min(7000, fraud_count)
no_fraud_subset_size = min(10000, no_fraud_count)

# Sampling from fraud and no fraud subsets
fraud_subset = fraud_df.sample(n=fraud_subset_size, random_state=42, replace=False)
no_fraud_subset = no_fraud_df.sample(n=no_fraud_subset_size, random_state=42, replace=False)

# Combine the subsets
new_df = pd.concat([fraud_subset, no_fraud_subset])

# Reset index of the new dataframe
new_df.reset_index(drop=True, inplace=True)

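# Assign the balanced dataframe back to df
# Note: X_train, X_test, y_train, and y_test were created before this step,
# so the models below still use the original imbalanced split unless X and y
# are rebuilt from this df
df = new_df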


Fraud cases: 6006, No Fraud cases: 1042569
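The sampling step above draws from the full dataset. A common alternative is to undersample only the training portion, so that the test set keeps the real class ratio and the reported metrics stay realistic. The following is a minimal sketch of that idea, assuming a hypothetical helper `undersample_majority` and a toy frame `toy`; neither is part of the notebook.

```python
import pandas as pd

def undersample_majority(train_df, target="is_fraud", ratio=1.0, random_state=42):
    """Keep all fraud rows and at most ratio * n_fraud legitimate rows (hypothetical helper)."""
    fraud = train_df[train_df[target] == 1]
    legit = train_df[train_df[target] == 0]
    n_legit = min(len(legit), int(len(fraud) * ratio))
    legit_sample = legit.sample(n=n_legit, random_state=random_state)
    # Shuffle so the two classes are not stored in contiguous blocks
    combined = pd.concat([fraud, legit_sample])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy training frame: 10 fraud rows and 990 legitimate rows
toy = pd.DataFrame({"amt": range(1000), "is_fraud": [1] * 10 + [0] * 990})
balanced = undersample_majority(toy, ratio=2.0)
print(balanced["is_fraud"].value_counts().to_dict())
```

In the notebook's setting, the helper would be applied to the training rows only, and features and target would then be rebuilt from the returned frame.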


**Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# Logistic Regression Model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predictions
y_pred_lr = log_reg.predict(X_test)

# Evaluation metrics
accuracy_lr = accuracy_score(y_test, y_pred_lr)
precision_lr = precision_score(y_test, y_pred_lr)
recall_lr = recall_score(y_test, y_pred_lr)
f1_lr = f1_score(y_test, y_pred_lr)

print(f"Logistic Regression - Accuracy: {accuracy_lr}, Precision: {precision_lr}, Recall: {recall_lr}, F1-Score: {f1_lr}")


Logistic Regression - Accuracy: 0.9939203204348759, Precision: 0.0, Recall: 0.0, F1-Score: 0.0


The Logistic Regression model reaches 99.39% accuracy, but this only reflects the class imbalance: about 99.4% of all transactions are legitimate. Precision, recall, and F1-score are all 0, meaning the model did not correctly identify a single fraud case in the test set.
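One standard way to make a linear model pay attention to a rare class is the `class_weight="balanced"` option of `LogisticRegression`, which reweights errors inversely to class frequency. The sketch below compares the default and balanced settings on hypothetical synthetic data from `make_classification`. It illustrates the option only; it was not run on the fraud dataset, and its numbers say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly 2% positives, standing in for the fraud set
X_syn, y_syn = make_classification(
    n_samples=5_000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Fit the same model with and without class reweighting and compare recall
recalls = {}
for label, weight in [("default", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=weight, max_iter=1000)
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te))
    print(f"class_weight={weight}: recall={recalls[label]:.3f}")
```

Reweighting typically raises recall at the cost of precision, so both should be checked before adopting it.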

**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree Model
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Predictions
y_pred_dt = decision_tree.predict(X_test)

# Evaluation metrics
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)

print(f"Decision Tree - Accuracy: {accuracy_dt}, Precision: {precision_dt}, Recall: {recall_dt}, F1-Score: {f1_dt}")


Decision Tree - Accuracy: 0.9970197649190569, Precision: 0.7245557350565428, Recall: 0.7595258255715496, F1-Score: 0.7416287722199256


The Decision Tree model achieves an accuracy of about 99.70%. Given the class imbalance, the more informative figures are its precision of about 72.46% and recall of about 75.95%, which give an F1-score of about 74.16%. Unlike Logistic Regression, it detects roughly three out of four fraud cases, making it a viable candidate for credit card fraud detection.
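To make the relationship between these metrics concrete, the snippet below recomputes them from confusion-matrix counts. The counts are an inference: they were back-calculated from the reported Decision Tree metrics, assuming the 20% test split contains 209,715 of the 1,048,575 rows, and were not printed by the notebook.

```python
# Counts back-calculated from the reported Decision Tree metrics (an inference,
# not notebook output), assuming a 20% test split of 1,048,575 rows
n_test = 209_715
tp, fp, fn = 897, 341, 284
tn = n_test - tp - fp - fn

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that are flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n_test

print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, "
      f"F1: {f1:.4f}, Accuracy: {accuracy:.4f}")
```

Under these assumed counts, the 625 errors barely move accuracy because true negatives dominate the total, while precision and recall depend only on the much smaller fraud-related counts.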

**Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Model
rf = RandomForestClassifier(n_estimators=50, max_depth=10, class_weight='balanced', random_state=42)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)

# Evaluation metrics for Random Forest
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, Precision: {precision_rf:.4f}, Recall: {recall_rf:.4f}, F1-Score: {f1_rf:.4f}")


Random Forest - Accuracy: 0.9982, Precision: 0.9318, Recall: 0.7290, F1-Score: 0.8181


The Random Forest model achieves an accuracy of 99.82% and a precision of 93.18%, so most of the transactions it flags are actual fraud. Its recall of 72.90% means it still misses about 27% of fraud cases. With an F1-score of 81.81%, it offers the best balance of precision and recall among the models tested so far.
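When recall matters more than precision, one option is to flag transactions at a lower probability threshold than the default 0.5. The sketch below shows the trade-off on hypothetical scores and labels, for illustration only. With the notebook's model, the scores could come from `rf.predict_proba(X_test)[:, 1]`, but that step was not part of the original analysis.

```python
# Hypothetical scores and labels, for illustration only (not from the notebook)
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.70, 0.45, 0.20, 0.60, 0.40, 0.30, 0.10, 0.05, 0.02]

def precision_recall_at(threshold):
    """Return (precision, recall) when scores >= threshold are flagged as fraud."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.5, 0.35):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.35 raises recall from 0.50 to 0.75 while precision drops from about 0.67 to 0.60.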

**Support Vector Machine (SVM)**

In [None]:
from sklearn.svm import LinearSVC  # Faster alternative to SVC
from sklearn.model_selection import RandomizedSearchCV

# X_train and X_test were already standardized above, so reuse them as-is
X_train_scaled = X_train
X_test_scaled = X_test

# Subset the data for faster initial testing
# Note: with a fraud rate of about 0.6%, the first 1000 rows contain very few fraud cases
X_train_scaled_subset = X_train_scaled[:1000]
y_train_subset = y_train.iloc[:1000]

# Hyperparameter tuning with RandomizedSearchCV
param_distributions = {'C': [0.1, 1, 10]}
svm_model = LinearSVC(random_state=42)

random_search = RandomizedSearchCV(svm_model, param_distributions, n_iter=3, cv=2, random_state=42, n_jobs=-1)
random_search.fit(X_train_scaled_subset, y_train_subset)


best_svm_model = random_search.best_estimator_


y_pred_svm = best_svm_model.predict(X_test_scaled)

# Evaluating the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
precision_svm = precision_score(y_test, y_pred_svm)
recall_svm = recall_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm)


print(f"SVM (Tuned) - Accuracy: {accuracy_svm}, Precision: {precision_svm}, Recall: {recall_svm}, F1-Score: {f1_svm}")



SVM (Tuned) - Accuracy: 0.9935722289774217, Precision: 0.0, Recall: 0.0, F1-Score: 0.0


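Precision, recall, and F1-score are all 0, so the SVM does not correctly identify any fraud case in the test set. A likely cause is that it is tuned and trained on only the first 1,000 training rows, which at a fraud rate of about 0.6% contain very few fraud examples. In addition, the search optimizes accuracy by default, which favors the majority class (no fraud).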
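Several adjustments might help the SVM see the minority class: drawing the tuning subset with stratification, weighting classes with `class_weight="balanced"`, and scoring the search on F1 instead of accuracy. The sketch below combines them on hypothetical synthetic data from `make_classification`. It is an illustration under those assumptions, not a rerun of the notebook, so its output does not describe the fraud dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC

# Hypothetical synthetic data (about 3% positives), standing in for the scaled fraud features
X_syn, y_syn = make_classification(
    n_samples=20_000, n_features=10, weights=[0.97, 0.03], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Draw a 2,000-row tuning subset that keeps the class ratio, instead of taking the first rows
X_sub, _, y_sub, _ = train_test_split(
    X_tr, y_tr, train_size=2_000, random_state=42, stratify=y_tr
)

search = RandomizedSearchCV(
    LinearSVC(class_weight="balanced", random_state=42, max_iter=5_000),
    {"C": [0.1, 1, 10]},
    n_iter=3,
    cv=2,
    scoring="f1",  # optimize F1 rather than the default accuracy
    random_state=42,
)
search.fit(X_sub, y_sub)

y_pred = search.best_estimator_.predict(X_te)
print(f"Best C: {search.best_params_['C']}, recall: {recall_score(y_te, y_pred):.3f}")
```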

## Conclusion


After applying the different models to the dataset, Random Forest is the best-performing model for credit card fraud detection, with the highest F1-score. Its metrics are:

*  Accuracy: 0.9982
*  Precision: 0.9318
*  Recall: 0.7290
*  F1-Score: 0.8181

Its precision of 93.18% means most flagged transactions are actual fraud, while its recall of 72.90% means about 27% of fraud cases are still missed. The Decision Tree has slightly higher recall (75.95%) but much lower precision (72.46%), and Logistic Regression and the SVM fail to detect any fraud cases.