# AI Anomaly Detection Training (Real Dataset)

This notebook trains an **Isolation Forest** model to detect unusual spending patterns using the provided `personal_finance_tracker_dataset.csv`.

## 1. Setup and Libraries

In [6]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import joblib
import os
import json

print("Libraries loaded successfully.")

Libraries loaded successfully.


## 2. Load Dataset

We use `category` and `monthly_expense_total` as features, renaming the latter to `amount`. The categories are strings, so they are label-encoded in the next section to make them numeric.

In [7]:
dataset_path = 'personal_finance_tracker_dataset.csv'

if os.path.exists(dataset_path):
    print(f"Loading dataset from {dataset_path}...")
    full_df = pd.read_csv(dataset_path)
    df = full_df[['category', 'monthly_expense_total']].copy()
    df.columns = ['category', 'amount']
else:
    # Stop execution if the CSV is missing; later cells depend on `df`
    raise FileNotFoundError(f"{dataset_path} not found!")

Loading dataset from personal_finance_tracker_dataset.csv...
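
As a variation on the loading step, the sketch below reads only the two feature columns and drops incomplete rows. The in-memory CSV, its `user_id` column, and the decision to drop rows with missing values are assumptions made for illustration; the real dataset may need different handling.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for personal_finance_tracker_dataset.csv
csv_text = io.StringIO(
    "user_id,category,monthly_expense_total\n"
    "1,Rent,3100.5\n"
    "2,Groceries,\n"
)

required = ["category", "monthly_expense_total"]

# usecols makes read_csv raise a ValueError if a required column is absent
loaded = pd.read_csv(csv_text, usecols=required)
loaded = loaded.rename(columns={"monthly_expense_total": "amount"})

# Assumption: rows missing either feature are dropped rather than imputed
clean = loaded.dropna(subset=["category", "amount"])
print(clean)
```

Passing `usecols` also surfaces a renamed or missing column at load time instead of in a later cell.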


## 3. Label Encoding

We map each category string to a unique ID and save this mapping for the detection script.

In [8]:
unique_categories = df['category'].unique().tolist()
cat_to_id = {cat: i for i, cat in enumerate(unique_categories)}

with open('category_mapping.json', 'w') as f:
    json.dump(cat_to_id, f)

df['category_label'] = df['category'].map(cat_to_id)
print(f"Categories mapped: {list(cat_to_id.keys())}")
print(df.head())

Categories mapped: ['Investments', 'Healthcare', 'Groceries', 'Utilities', 'Transportation', 'Entertainment', 'Education', 'Insurance', 'Dining Out', 'Rent']
      category   amount  category_label
0  Investments  3212.07               0
1  Investments  3732.81               0
2   Healthcare  3335.58               1
3    Groceries  2327.59               2
4    Utilities  2182.58               3
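
The saved mapping is meant to be reused by the detection script. A minimal sketch of that reuse might look like the following; the helper `encode_categories`, the sentinel `unknown_id=-1`, and the example mapping are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the file the notebook writes as category_mapping.json
example_mapping = {"Investments": 0, "Healthcare": 1, "Groceries": 2}
mapping_path = os.path.join(tempfile.mkdtemp(), "category_mapping.json")
with open(mapping_path, "w") as f:
    json.dump(example_mapping, f)


def encode_categories(frame, path, unknown_id=-1):
    """Return a copy of frame with a category_label column (hypothetical helper)."""
    with open(path) as f:
        mapping = json.load(f)
    encoded = frame.copy()
    # Categories absent from the mapping become NaN after map();
    # replace them with a sentinel so the column stays integer-typed
    labels = encoded["category"].map(mapping).fillna(unknown_id)
    encoded["category_label"] = labels.astype(int)
    return encoded


new_rows = pd.DataFrame(
    {"category": ["Groceries", "Travel"], "amount": [120.0, 900.0]}
)
encoded_rows = encode_categories(new_rows, mapping_path)
print(encoded_rows)
```

Without the `fillna` step, an unseen category such as `Travel` would produce a `NaN` label, so the detection script should decide explicitly how to treat it.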


## 4. Train Isolation Forest

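The model isolates points with random splits on `category_label` and `amount` jointly; points that can be separated in few splits are scored as anomalous. It does not fit a separate distribution per category. `contamination=0.05` sets the decision threshold so that roughly 5% of the training rows are flagged.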

In [9]:
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(df[['category_label', 'amount']])

print("Model trained.")

Model trained.
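
To illustrate how `contamination` relates to what `predict` returns, here is a minimal sketch on synthetic data. The `synthetic` frame and `demo_model` are assumptions, not the notebook's dataset or model. In scikit-learn, `decision_function` is negative for rows treated as outliers, and `predict` returns `-1` for those same rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the training frame (assumption, not the real data)
rng = np.random.default_rng(0)
synthetic = pd.DataFrame(
    {
        "category_label": rng.integers(0, 10, size=1000),
        "amount": rng.normal(3000, 500, size=1000),
    }
)

demo_model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
demo_model.fit(synthetic)

# Negative decision scores correspond to a prediction of -1 (anomaly)
scores = demo_model.decision_function(synthetic)
labels = demo_model.predict(synthetic)

flagged_share = (labels == -1).mean()
print(f"Share of training rows flagged: {flagged_share:.3f}")
```

The raw scores can be more useful than the binary label when the detection script needs to rank transactions by how unusual they are.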


## 5. Test Anomaly Detection

In [10]:
test_data = pd.DataFrame([
    ['Groceries', 2000],    # Likely normal
    ['Groceries', 500000],  # SUSPICIOUS (Too high)
    ['Rent', 50000],        # Likely normal
    ['Entertainment', 250000] # SUSPICIOUS (Too high)
], columns=['category', 'amount'])

test_data['category_label'] = test_data['category'].map(cat_to_id)
predictions = model.predict(test_data[['category_label', 'amount']])
test_data['is_anomaly'] = ['YES' if p == -1 else 'NO' for p in predictions]

print(test_data[['category', 'amount', 'is_anomaly']])

        category  amount is_anomaly
0      Groceries    2000         NO
1      Groceries  500000        YES
2           Rent   50000        YES
3  Entertainment  250000        YES
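
One way to interpret a flag such as the `Rent` row is to compare each test amount with the range seen for that category during training. The sketch below does this with hypothetical training rows; the amounts are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical training rows shaped like the notebook's df
train = pd.DataFrame(
    {
        "category": ["Rent", "Rent", "Rent", "Groceries", "Groceries", "Groceries"],
        "amount": [3100.0, 2900.0, 3400.0, 2300.0, 2100.0, 2500.0],
    }
)

# Smallest and largest amount observed per category
ranges = train.groupby("category")["amount"].agg(["min", "max"])

test_rows = pd.DataFrame(
    {"category": ["Groceries", "Rent"], "amount": [2200.0, 50000.0]}
)

# Attach each category's observed range to the matching test row
checked = test_rows.join(ranges, on="category")
checked["outside_training_range"] = (checked["amount"] < checked["min"]) | (
    checked["amount"] > checked["max"]
)
print(checked)
```

An amount far outside anything seen in training is likely to be flagged regardless of whether it looks plausible to a human, so the expected labels for test rows should be chosen with the training ranges in mind.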


In [12]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Define the Ground Truth (What the answer SHOULD be)
# Based on your test_data logic: 
# [2000 is Normal, 500000 is Anomaly, 50000 is Normal, 250000 is Anomaly]
true_labels = ['NO', 'YES', 'NO', 'YES'] 

# 2. Add them to your DataFrame for comparison
test_data['actual'] = true_labels

# 3. Calculate "Accuracy"
# This is the percentage of total guesses that were correct
acc = accuracy_score(test_data['actual'], test_data['is_anomaly'])

# 4. Generate a detailed report
# For each class, this shows Precision (what fraction of predicted 'YES' were truly 'YES')
# and Recall (what fraction of actual 'YES' were found)
report = classification_report(test_data['actual'], test_data['is_anomaly'])

print(f"Overall Accuracy: {acc * 100:.2f}%")
print("\nDetailed Classification Report:")
print(report)

# 5. Print the confusion matrix (rows = actual, columns = predicted; order: NO, YES)
print("\nConfusion Matrix:")
print(confusion_matrix(test_data['actual'], test_data['is_anomaly'], labels=['NO', 'YES']))

Overall Accuracy: 75.00%

Detailed Classification Report:
              precision    recall  f1-score   support

          NO       1.00      0.50      0.67         2
         YES       0.67      1.00      0.80         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4


Confusion Matrix:
[[1 1]
 [0 2]]
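
The reported figures can be recomputed by hand from the confusion matrix, which helps when checking what each metric means. This sketch assumes the row and column order `['NO', 'YES']` used in the call above, with `'YES'` treated as the positive class.

```python
import numpy as np

# Rows are actual labels, columns are predicted labels, both ordered NO, YES
cm = np.array([[1, 1], [0, 2]])

# With YES as the positive class: TN, FP in the first row; FN, TP in the second
tn, fp, fn, tp = cm.ravel()

precision_yes = tp / (tp + fp)  # correct YES predictions / all YES predictions
recall_yes = tp / (tp + fn)     # correct YES predictions / all actual YES rows
accuracy = (tp + tn) / cm.sum()

print(f"Precision (YES): {precision_yes:.2f}")
print(f"Recall (YES):    {recall_yes:.2f}")
print(f"Accuracy:        {accuracy:.2f}")
```

With only four test rows, a single changed prediction moves accuracy by 25 percentage points, so these numbers are a sanity check rather than a reliable estimate of model quality.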


## 6. Export Model

In [13]:
joblib.dump(model, 'anomaly_model.pkl')
print("Model saved as anomaly_model.pkl")

Model saved as anomaly_model.pkl
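
A detection script would typically reload the saved file with `joblib.load`. The following round-trip sketch uses a small model fitted on synthetic data and a temporary directory, both of which are assumptions for illustration; the notebook's own model would be loaded from `anomaly_model.pkl` instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Small synthetic model standing in for the notebook's trained model (assumption)
rng = np.random.default_rng(1)
X = rng.normal(3000, 500, size=(200, 2))
original = IsolationForest(n_estimators=50, random_state=42).fit(X)

# Save to a temporary location, then load it back as a detection script would
model_path = os.path.join(tempfile.mkdtemp(), "anomaly_model.pkl")
joblib.dump(original, model_path)
restored = joblib.load(model_path)

# The restored model should give the same predictions as the original
same = np.array_equal(original.predict(X), restored.predict(X))
print(f"Predictions match after reload: {same}")
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, so it can help to record the version alongside the file.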
