<a href="https://colab.research.google.com/github/Wezz-git/AI-samples/blob/main/Financial_Fraud_Detection_with_Imbalanced_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**The Business Problem:**

You're a data scientist at a bank. The fraud department is losing millions. They need a model that can, in real time, identify whether a credit card transaction is fraudulent (1) or legitimate (0).

**The "Real-World" Challenge:**

This is a "needle in a haystack" problem: over 99% of transactions are legitimate, which is known as imbalanced data. If you're not careful, you'll build a model that is 99% accurate but catches 0% of the fraud, simply because it predicts "legitimate" every time. This is the same "Accuracy Trap" we discussed on Day 1, now in a real-world scenario.

**Your Goal:**

Build a model that is actually good at finding the "needle" (the fraud).
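
To make the trap concrete, here is a minimal sketch using made-up numbers (1,000 transactions, 5 of them fraud), not the real dataset. The "model" is a hypothetical one that labels every transaction as legitimate; it looks excellent on accuracy and useless on fraud recall.

```python
# Toy illustration of the accuracy trap (made-up data, not the real dataset).
# Assumption: 1,000 transactions, 5 of which are fraud (label 1).
y_true = [1] * 5 + [0] * 995

# A hypothetical "model" that labels every transaction as legitimate (0).
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Fraud recall: share of the actual fraud cases that the model caught.
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_fraud = fraud_caught / sum(y_true)

print(f"Accuracy:     {accuracy:.3f}")
print(f"Fraud recall: {recall_fraud:.3f}")
```

This is why the rest of the notebook judges models by precision and recall on the fraud class rather than by overall accuracy.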

In [1]:
import pandas as pd

# This path assumes the CSV was uploaded to Colab's /content/sample_data folder
file_name = '/content/sample_data/PS_20174392719_1491204439457_log.csv'

# Load the DataFrame
df = pd.read_csv(file_name)

# Print the first 5 rows
print(df.head())

# Print the technical summary (info() prints directly and returns None)
df.info()

   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 co

The 'Imbalance' Check

In [2]:
## First, let's see the RAW count (how many of each)
print("--- Raw Fraud Counts ---")

print(df['isFraud'].value_counts())


## Now, let's see each class as a share of the total (normalize=True returns proportions, not percentages)
print("\n--- Percentage of Total ---")

print(df['isFraud'].value_counts(normalize=True))

--- Raw Fraud Counts ---
isFraud
0    6354407
1       8213
Name: count, dtype: int64

--- Percentage of Total ---
isFraud
0    0.998709
1    0.001291
Name: proportion, dtype: float64


Data Preprocessing

In [3]:
# 1. Drop the columns we won't use as features (the account IDs and the 'isFlaggedFraud' flag)

df_clean = df.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])

# 2. Use 'get_dummies' to one-hot encode the 'type' column
# drop_first=True drops the first category (alphabetically) to avoid a redundant column
df_processed = pd.get_dummies(df_clean, drop_first=True)

print("--- Processed Data ---")
print(df_processed.head())
print("\n--- New Data Summary ---")
df_processed.info()

--- Processed Data ---
   step    amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  \
0     1   9839.64       170136.0       160296.36             0.0   
1     1   1864.28        21249.0        19384.72             0.0   
2     1    181.00          181.0            0.00             0.0   
3     1    181.00          181.0            0.00         21182.0   
4     1  11668.14        41554.0        29885.86             0.0   

   newbalanceDest  isFraud  type_CASH_OUT  type_DEBIT  type_PAYMENT  \
0             0.0        0          False       False          True   
1             0.0        0          False       False          True   
2             0.0        1          False       False         False   
3             0.0        1           True       False         False   
4             0.0        0          False       False          True   

   type_TRANSFER  
0          False  
1          False  
2           True  
3          False  
4          False  

--- New Data Summary ---
<

Split Data & Baseline Model

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# --- 1. Load the Data ---
file_name = '/content/sample_data/PS_20174392719_1491204439457_log.csv'
print("Loading 6.3 million rows... This may take 20-30 seconds.")
df = pd.read_csv(file_name)

# --- 2. Data Cleaning & Preprocessing ---
# Drop the columns we don't need
df_cleaned = df.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])

# Use 'get_dummies' to one-hot encode the 'type' column so the model receives only numeric features
print("Processing data...")
df_processed = pd.get_dummies(df_cleaned, drop_first=True)

# --- 3. Create 'y' and 'X' ---
y = df_processed['isFraud']
X = df_processed.drop(columns=['isFraud'])

# --- 4. Split the data ---
print("Splitting the data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 5. Create and Train the Baseline Model ---
model = LogisticRegression(max_iter=1000)
print("Training the baseline Logistic Regression model... (This may take a minute)")
model.fit(X_train, y_train)
print("Training complete!")

# --- 6. Make predictions and print the classification report ---
predictions = model.predict(X_test)
print("\n--- Baseline Model Classification Report ---")
print(classification_report(y_test, predictions, target_names=['Legit (0)', 'Fraud (1)']))

Loading 6.3 million rows... This may take 20-30 seconds.
Processing data...
Splitting the data...
Training the baseline Logistic Regression model... (This may take a minute)
Training complete!

--- Baseline Model Classification Report ---
              precision    recall  f1-score   support

   Legit (0)       1.00      1.00      1.00   1906351
   Fraud (1)       0.90      0.49      0.64      2435

    accuracy                           1.00   1908786
   macro avg       0.95      0.75      0.82   1908786
weighted avg       1.00      1.00      1.00   1908786
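
The fraud row of the report is the one that matters here. As a minimal sketch of how those columns are derived, the helper below (a hypothetical `fraud_metrics` function, run on made-up labels rather than the notebook's predictions) computes precision, recall, and F1 for the positive class from true positives, false positives, and false negatives.

```python
def fraud_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # fraud, flagged
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # legit, flagged (false alarm)
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # fraud, missed

    # Guard against division by zero when nothing is flagged or no fraud exists.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up example: 4 real frauds; the model flags 3 transactions, 2 correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = fraud_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Read this way, the baseline's fraud recall of 0.49 means it misses about half of the 2,435 fraud cases in the test set, even though overall accuracy rounds to 1.00.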



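The "Real" Solution: SMOTE (Synthetic Minority Over-sampling Technique), a data-prep step that creates synthetic examples of the minority class (fraud) so the training set is balanced. It is applied to the training split only; the test set stays untouched.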
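
SMOTE is less mysterious than it sounds. The core idea is to pick a minority-class point, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line between them. The sketch below illustrates that idea in plain Python on made-up 2-feature points; `smote_like_sample` is a hypothetical helper for illustration, not the `imblearn` implementation used in the cell that follows.

```python
import random


def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority point
    and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    base = rng.choice(minority)

    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))

    # Pick one of the k closest neighbours and interpolate towards it.
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # random fraction in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Made-up minority (fraud) points with two features each.
fraud_points = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]

rng = random.Random(42)
synthetic = [smote_like_sample(fraud_points, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic point is an interpolation between two real fraud points, it always falls inside the region the real fraud cases already occupy.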

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# --- 1. Load the Data ---
file_name = '/content/sample_data/PS_20174392719_1491204439457_log.csv'
print("Loading 6.3 million rows...")
df = pd.read_csv(file_name)

# --- 2. Take a 10% sample to make it run faster ---
print("Taking a 10% sample to speed up the process...")
df_sample = df.sample(frac=0.1, random_state=42)

# --- 3. Data Cleaning & Preprocessing (on the 10% sample, not the full df) ---
print("Processing data...")
df_cleaned = df_sample.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])
df_processed = pd.get_dummies(df_cleaned, drop_first=True)

# --- 4. Create 'y' and 'X' ---
y = df_processed['isFraud']
X = df_processed.drop(columns=['isFraud'])

# --- 5. Split the data ---
print("Splitting the data...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 6. THE 'PRO' STEP: FIX THE IMBALANCE WITH SMOTE ---
print("Applying SMOTE... (This will be much faster now)")
sm = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = sm.fit_resample(X_train, y_train)

print(f"Original training shape: {len(y_train)}")
print(f"New balanced training shape: {len(y_train_balanced)}")

# --- 7. Train the ADVANCED Model (on the BALANCED data) ---
model_rf = RandomForestClassifier(random_state=42, n_jobs=-1)

print("Training the RandomForest model on the balanced data...")
model_rf.fit(X_train_balanced, y_train_balanced)
print("Training complete!")

# --- 8. Make predictions with model_rf and print the classification report ---
predictions_rf = model_rf.predict(X_test)
print("\n--- ADVANCED Model Classification Report (with SMOTE) ---")
print(classification_report(y_test, predictions_rf, target_names=['Legit (0)', 'Fraud (1)']))

Loading 6.3 million rows...
Taking a 10% sample to speed up the process...
Processing data...
Splitting the data...
Applying SMOTE... (This will be much faster now)
Original training shape: 445383
New balanced training shape: 889608
Training the RandomForest model on the balanced data...
Training complete!

--- ADVANCED Model Classification Report (with SMOTE) ---
              precision    recall  f1-score   support

   Legit (0)       1.00      1.00      1.00    190641
   Fraud (1)       0.56      0.91      0.69       238

    accuracy                           1.00    190879
   macro avg       0.78      0.95      0.85    190879
weighted avg       1.00      1.00      1.00    190879



| Model | Precision (Fraud) | Recall (Fraud) |
|---|---|---|
| **Baseline (Logistic Regression)** | 0.90 (90%) | 0.49 (49%) |
| **Advanced (Random Forest + SMOTE)** | 0.56 (56%) | 0.91 (91%) |
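
To see what these rates mean in terms of transactions, the sketch below converts each model's precision and recall into approximate counts, using the fraud support from each report (2,435 and 238). The inputs are the rounded report values, so the results are estimates only, and `approx_counts` is a hypothetical helper. The two models were also evaluated on test sets of different sizes (the second run uses a 10% sample), so compare the proportions, not the raw counts.

```python
def approx_counts(precision, recall, n_fraud):
    """Estimate (caught, missed, false_alarms) from rounded report values."""
    caught = recall * n_fraud        # true positives
    missed = n_fraud - caught        # false negatives
    flagged = caught / precision     # everything the model predicted as fraud
    false_alarms = flagged - caught  # false positives
    return round(caught), round(missed), round(false_alarms)


# Precision, recall, and fraud support copied from the two reports above.
baseline = approx_counts(precision=0.90, recall=0.49, n_fraud=2435)
smote_rf = approx_counts(precision=0.56, recall=0.91, n_fraud=238)

for name, (caught, missed, false_alarms) in [
    ("Baseline (Logistic Regression)", baseline),
    ("Advanced (Random Forest + SMOTE)", smote_rf),
]:
    print(f"{name}: caught ~{caught}, missed ~{missed}, "
          f"false alarms ~{false_alarms}")
```

The trade-off is the point: the advanced model catches far more of the fraud but raises more false alarms for every fraud case it catches.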