Insurance Fraud Detection System

2. Load Required Libraries
Import the necessary Python libraries.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
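from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE  # used in the SMOTE step below; provided by the imbalanced-learn package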

3. Load Dataset

In [2]:

data = pd.read_csv('data.csv')

# Display first 5 rows
print("First 5 rows of dataset:")
print(data.head())

# Check for missing values
print("\nMissing Values in Dataset:")
print(data.isnull().sum())

# Forward-fill missing values (if any): each gap takes the value from the row above
data = data.ffill()

First 5 rows of dataset:
  PolicyHolder  ClaimAmount ClaimType  Fraudulent
0     John Doe         5000    Health           0
1   Jane Smith        12000      Auto           1
2  Alice Brown         7000      Home           0
3    Bob White        15000      Auto           1
4    Eve Black         9000    Health           0

Missing Values in Dataset:
PolicyHolder    0
ClaimAmount     0
ClaimType       0
Fraudulent      0
dtype: int64
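Forward-fill copies whatever value sits in the previous row, which is arbitrary when claims have no natural order. As a possible alternative, the sketch below fills numeric columns with the median and other columns with the most frequent value. The helper name `impute_by_dtype` and the toy frame are hypothetical and not part of the original notebook; the toy frame only borrows the column names shown above.

```python
import numpy as np
import pandas as pd


def impute_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Hypothetical toy data with gaps added for illustration
toy_claims = pd.DataFrame({
    "ClaimAmount": [5000, np.nan, 7000, 15000, 9000],
    "ClaimType": ["Health", "Auto", None, "Auto", "Home"],
})
filled_claims = impute_by_dtype(toy_claims)
print(filled_claims)
```

The dataset loaded above has no missing values, so either approach leaves it unchanged.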


4. Preprocessing Data

In [3]:
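# Encode categorical variables as integer codes.
# A separate encoder per column keeps each mapping available for inverse_transform.
policy_holder_encoder = LabelEncoder()
claim_type_encoder = LabelEncoder()
data['PolicyHolder'] = policy_holder_encoder.fit_transform(data['PolicyHolder'])
data['ClaimType'] = claim_type_encoder.fit_transform(data['ClaimType'])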
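Integer codes imply an order between categories (for example Auto < Health < Home), which distance-based and linear models such as KNN and logistic regression will treat as meaningful. The sketch below shows one possible alternative: one-hot encode `ClaimType` and leave out `PolicyHolder`, which is a name rather than a property of the claim. The frame `sample_claims` is rebuilt from the five rows printed earlier purely for illustration; it is not the full dataset.

```python
import pandas as pd

# The five rows shown by data.head(), re-typed for a self-contained example
sample_claims = pd.DataFrame({
    "PolicyHolder": ["John Doe", "Jane Smith", "Alice Brown", "Bob White", "Eve Black"],
    "ClaimAmount": [5000, 12000, 7000, 15000, 9000],
    "ClaimType": ["Health", "Auto", "Home", "Auto", "Health"],
    "Fraudulent": [0, 1, 0, 1, 0],
})

# Drop the identifier and the target, then expand ClaimType into 0/1 columns
features = sample_claims.drop(columns=["PolicyHolder", "Fraudulent"])
encoded = pd.get_dummies(features, columns=["ClaimType"], dtype=int)
print(encoded)
```

Tree-based models are less sensitive to the encoding choice, so this mainly matters for the KNN and logistic regression models trained later.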


5. Train and Evaluate Models

In [4]:
# Define features (X) and target (y)
X = data.drop(columns=['Fraudulent'])
y = data['Fraudulent']

# Check class distribution (to see if it's imbalanced)
print("\nClass Distribution:")
print(y.value_counts())

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



Class Distribution:
Fraudulent
0    5
1    5
Name: count, dtype: int64
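With only 10 rows, a plain 20% split can put both test rows in the same class. Passing `stratify` to `train_test_split` keeps the class proportions in both splits. The sketch below uses hypothetical arrays `X_demo` and `y_demo` with the same 5/5 class balance; it is an illustration, not the split used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataset: 10 rows, 5 per class
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each split keeps the 50/50 class ratio
print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```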


In [5]:
# Apply SMOTE only if the minority class has fewer than 30% as many training samples as the majority class
if y_train.value_counts().min() < 0.3 * y_train.value_counts().max():
    print("\nApplying SMOTE to balance data...")
    smote = SMOTE(random_state=42)
    X_train, y_train = smote.fit_resample(X_train, y_train)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
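On a dataset this small, the SMOTE branch could fail if it were ever triggered: to my understanding, SMOTE interpolates between same-class neighbours and defaults to `k_neighbors=5`, so the minority class needs at least six training samples. The sketch below is a hypothetical helper, `smote_plan`, that repeats the 30% check from the cell above and caps `k_neighbors` at the minority count minus one. It does not import imbalanced-learn, and the two label series are made up for illustration.

```python
import pandas as pd


def smote_plan(y: pd.Series, ratio_threshold: float = 0.3, default_k: int = 5):
    """Return (apply_smote, k_neighbors) for a label series.

    apply_smote mirrors the notebook's check: minority count below
    ratio_threshold times the majority count. k_neighbors is capped at
    minority_count - 1 so every minority sample has enough neighbours.
    """
    counts = y.value_counts()
    minority, majority = int(counts.min()), int(counts.max())
    if minority < 2:
        # A single minority sample has no neighbour to interpolate with
        return False, 0
    apply_smote = minority < ratio_threshold * majority
    k_neighbors = min(default_k, minority - 1)
    return apply_smote, k_neighbors


balanced_labels = pd.Series([0, 1] * 4)            # 4 vs 4
imbalanced_labels = pd.Series([0] * 12 + [1] * 3)  # 12 vs 3

print(smote_plan(balanced_labels))
print(smote_plan(imbalanced_labels))

# With imbalanced-learn installed, the result could be used as:
#   apply, k = smote_plan(y_train)
#   if apply:
#       X_train, y_train = SMOTE(k_neighbors=k, random_state=42).fit_resample(X_train, y_train)
```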


In [6]:
def train_model(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and print evaluation metrics."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\nModel: {model.__class__.__name__}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    # zero_division=1 silences the undefined-metric warning by reporting 1.00
    # for the precision of a class that is never predicted
    print(classification_report(y_test, y_pred, zero_division=1))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("-" * 50)

# List of models
models = [
    RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42),
    DecisionTreeClassifier(random_state=42),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression()
]

# Train and evaluate each model
for model in models:
    train_model(model, X_train, y_train, X_test, y_test)

print("\n✅ Model training and evaluation complete!")


Model: RandomForestClassifier
Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

Confusion Matrix:
 [[1 0]
 [0 1]]
--------------------------------------------------

Model: DecisionTreeClassifier
Accuracy: 0.5000
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.75      0.50      0.33         2
weighted avg       0.75      0.50      0.33         2

Confusion Matrix:
 [[1 0]
 [1 0]]
--------------------------------------------------

Model: KNeighborsClassifier
Accuracy: 0.5000
              precision    recall 

6. Conclusion
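This notebook trains four classifiers (random forest, decision tree, k-nearest neighbours, and logistic regression) to flag fraudulent claims. The dataset holds only 10 claims and the test split only 2, so the reported accuracies (1.00 for the random forest, 0.50 for the decision tree and KNN) are not reliable estimates of real-world performance.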
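The classification reports above show a support of 2, so each score rests on two predictions. One common way to get a steadier estimate is stratified k-fold cross-validation, with the scaler wrapped in a pipeline so it is refit on each training fold. The sketch below assumes synthetic data from `make_classification` in place of `data.csv`; the sample size, feature count, and choice of F1 as the metric are illustrative assumptions, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a larger claims dataset (assumption for illustration)
X_synth, y_synth = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# The pipeline refits StandardScaler inside every fold, so no test-fold
# statistics leak into training
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_synth, y_synth, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much a single small test split can vary.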