# Fraud Vision ML: A Predictive Machine Learning Framework for Credit Card Transaction Risk Assessment


### Problem Statement
Credit card fraud is a significant problem for banks and financial institutions now that most transactions occur online.
This project uses the Kaggle Credit Card Fraud Detection dataset (by gpreda) to design a predictive machine learning system
that can detect fraudulent transactions in real time using Logistic Regression, Random Forest, and XGBoost models.


In [None]:

# Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')


In [None]:

# Step 2: Load Dataset
url = "https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv"
df = pd.read_csv(url)
print("✅ Data Loaded Successfully! Shape:", df.shape)
df.head()


In [None]:

# Step 3: Exploratory Analysis
print("Class Distribution:")
print(df['Class'].value_counts())

plt.figure(figsize=(6,4))
sns.countplot(x='Class', data=df)
plt.title('Fraud vs Non-Fraud Transactions')
plt.show()

plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


In [None]:

# Step 4: Data Preprocessing
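X = df.drop('Class', axis=1)
y = df['Class']

# Split first so the scaler and SMOTE only ever see training data (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training split, then apply the same transform to the test split
scaler = StandardScaler()
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']]).ravel()
X_test['Amount'] = scaler.transform(X_test[['Amount']]).ravel()

# Balance the training set only, using SMOTE; the test set keeps its real class ratio
sm = SMOTE(random_state=42)
X_train_s, y_train_s = sm.fit_resample(X_train, y_train)
print("After SMOTE:", X_train_s.shape)
print(y_train_s.value_counts())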
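
To make the SMOTE step less of a black box, here is a minimal sketch of its core idea: a synthetic minority sample is placed at a random point on the line segment between a real minority sample and one of its minority-class neighbors. This is an illustration only, not the imbalanced-learn implementation. The helper name `smote_like_sample` and the two toy points are assumptions made up for the example, and the nearest-neighbor search that real SMOTE performs is skipped.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor.

    Hypothetical helper for illustration; real SMOTE first picks
    `neighbor` from the k nearest minority-class neighbors of `x`.
    """
    lam = rng.random()  # interpolation weight in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(42)
fraud_a = [1.0, 2.0]  # toy minority-class point (made up)
fraud_b = [3.0, 6.0]  # toy neighbor of fraud_a (made up)

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Because every synthetic point is built from training samples, resampling must happen after the train/test split, so that no test information ends up in the training set.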


In [None]:

# Step 5: Model Training and Evaluation

# Logistic Regression
lr = LogisticRegression(max_iter=500)
lr.fit(X_train_s, y_train_s)
y_pred_lr = lr.predict(X_test)
print("\n--- Logistic Regression ---")
print(classification_report(y_test, y_pred_lr))
# ROC-AUC is a ranking metric: score it on probabilities, not hard 0/1 labels
print("ROC-AUC:", roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1]))

# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_s, y_train_s)
y_pred_rf = rf.predict(X_test)
print("\n--- Random Forest ---")
print(classification_report(y_test, y_pred_rf))
# ROC-AUC is a ranking metric: score it on probabilities, not hard 0/1 labels
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# XGBoost
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train_s, y_train_s)
y_pred_xgb = xgb.predict(X_test)
y_prob_xgb = xgb.predict_proba(X_test)[:,1]

print("\n--- XGBoost ---")
print(classification_report(y_test, y_pred_xgb))
# Use the predicted fraud probabilities so the AUC reflects ranking quality
print("ROC-AUC:", roc_auc_score(y_test, y_prob_xgb))


In [None]:

# Step 6: Risk Scoring
risk_scores = (y_prob_xgb * 100).round(2)
results = pd.DataFrame({'Risk_Score': risk_scores, 'Actual': y_test.values})
print(results.head())
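
A 0-100 risk score is usually turned into an action before it is useful operationally. The sketch below shows one way that mapping could look. The function name `risk_band`, the three actions, and the cutoffs of 50 and 90 are all assumptions chosen for illustration; in practice the cutoffs would be tuned against the cost of missed fraud versus the cost of false alarms.

```python
def risk_band(score, review_cutoff=50.0, block_cutoff=90.0):
    """Map a 0-100 risk score to an action (hypothetical cutoffs)."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= block_cutoff:
        return "block"
    if score >= review_cutoff:
        return "review"
    return "approve"


# Toy scores, made up for the example
for s in (3.25, 61.0, 97.8):
    print(s, risk_band(s))
```

With a DataFrame like `results`, the same function could be applied column-wise, for example `results['Risk_Score'].apply(risk_band)`.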


In [None]:

# Step 7: Visualization and ROC Curve
def plot_conf_matrix(y_true, y_pred, title):
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(title)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

plot_conf_matrix(y_test, y_pred_xgb, "XGBoost Confusion Matrix")

fpr, tpr, _ = roc_curve(y_test, y_prob_xgb)
# Label the curve with the AUC of the same probabilities used to draw it
plt.plot(fpr, tpr, label='XGBoost (AUC = {:.2f})'.format(roc_auc_score(y_test, y_prob_xgb)))
plt.plot([0,1],[0,1],'k--')
plt.legend()
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()



### ✅ Conclusion
The XGBoost model provides the best balance between recall and precision of the three models evaluated above.
This notebook demonstrates a complete predictive ML workflow: SMOTE balancing of the training set, model comparison, risk scoring on a 0-100 scale, and visualization, which together can serve as a baseline for real-world fraud screening.
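
The balance between recall and precision is not fixed by the model alone; it also depends on the threshold applied to the risk score. The following sketch, on made-up toy scores and labels, shows how raising the threshold trades recall for precision. The helper `precision_recall_at` is a hypothetical name, and the numbers are not results from this dataset.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when flagging every score >= threshold as fraud."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy risk scores (0-100) and true labels, made up for the example
scores = [5.0, 20.0, 45.0, 60.0, 80.0, 95.0]
labels = [0, 0, 1, 0, 1, 1]

for t in (30.0, 50.0, 90.0):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:>5}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data the lowest threshold catches every fraud case at the cost of a false alarm, while the highest threshold raises no false alarms but misses two of the three fraud cases.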
