# NGO Fraud Detection — PyCaret + EDA

**Notebook:** `NGO_Fraud_Detection_PyCaret.ipynb`

**Purpose:** This notebook performs Exploratory Data Analysis (EDA) and builds a fraud-detection model using PyCaret. It is written in professional English for use in a GitHub portfolio.

**Dataset:** `pakistan_ngo_fraud_perfect_dataset.csv` (target column: `Is_Fraud` with values TRUE/FALSE)

---

## 1. Setup & Imports

This section installs required packages (if needed) and imports Python libraries used across the notebook.


In [None]:
# Install PyCaret (uncomment if running in a new environment)
# !pip install pycaret

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Display settings
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 200)

print('Libraries imported successfully')

## 2. Load Dataset

Load the CSV file into a pandas DataFrame and preview basic information. `DATA_PATH` in the cell below points to `/mnt/data/`; change it if `pakistan_ngo_fraud_perfect_dataset.csv` is stored elsewhere.

In [None]:
DATA_PATH = '/mnt/data/pakistan_ngo_fraud_perfect_dataset.csv'

# Load
if not Path(DATA_PATH).exists():
    print(f'Warning: {DATA_PATH} not found. Update DATA_PATH or upload the file, then rerun this cell.')
else:
    df = pd.read_csv(DATA_PATH)
    display(df.head())
    print('\nShape:', df.shape)
    print('\nColumns:', list(df.columns))

## 3. Exploratory Data Analysis (EDA)

The following cells examine the data distribution, fraud vs non-fraud counts, and top categories linked to fraud.

In [None]:
# Basic info & null counts
try:
    df.info()  # info() prints its own report and returns None, so display() would only add a stray "None"
    display(df.isnull().sum())
except NameError:
    print('Load the dataset first (run the cell above).')

In [None]:
# Convert Date column to datetime if present
try:
    if 'Date' in df.columns:
        df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
        df['year'] = df['Date'].dt.year
        df['month'] = df['Date'].dt.month
        print('Date converted. Sample:')
        display(df[['Date']].head())
except Exception as e:
    print('Date conversion skipped or failed:', e)

In [None]:
# Ensure target column is boolean / binary
try:
    if 'Is_Fraud' in df.columns:
        print('Unique values in Is_Fraud:', df['Is_Fraud'].unique())
        # Normalize values to boolean 1/0
        df['Is_Fraud_bool'] = df['Is_Fraud'].map({True:1, False:0, 'TRUE':1, 'True':1, 'true':1, 'FALSE':0, 'False':0, 'false':0, '1':1, '0':0}).astype('Int64')
        print('\nValue counts (Is_Fraud_bool):')
        display(df['Is_Fraud_bool'].value_counts(dropna=False))
    else:
        print('Is_Fraud column not found. Please check dataset.')
except Exception as e:
    print('Error handling Is_Fraud column:', e)
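
The explicit mapping above only covers the spellings it lists, so a value such as `' TRUE '` or `'tRuE'` would become missing. A minimal sketch of a case-insensitive alternative follows; the helper name `normalize_fraud_label` and the sample values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical helper (not part of the notebook): case-insensitive label parsing.
_LABEL_MAP = {"true": 1, "false": 0, "1": 1, "0": 0}

def normalize_fraud_label(series: pd.Series) -> pd.Series:
    """Map TRUE/FALSE-style labels to nullable 0/1 integers; unknown values become <NA>."""
    # Cast everything to text first so booleans, ints and strings share one code path.
    cleaned = series.astype(str).str.strip().str.lower()
    return cleaned.map(_LABEL_MAP).astype("Int64")

# Illustrative values only.
sample = pd.Series([True, "FALSE", " true ", "0", None, "maybe"])
normalized = normalize_fraud_label(sample)
print(normalized.tolist())
```

Unrecognized values stay missing rather than being silently coerced, so `value_counts(dropna=False)` still reveals how many labels could not be parsed.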

In [None]:
# Plot: Fraud vs Non-Fraud count
try:
    fig, ax = plt.subplots(figsize=(6,4))
    # Fix the category order so the tick labels below always match the bars
    sns.countplot(x='Is_Fraud_bool', data=df, order=[0, 1], palette='pastel', ax=ax)
    ax.set_xticks([0, 1])
    ax.set_xticklabels(['Non-Fraud (0)', 'Fraud (1)'])
    ax.set_title('Fraud vs Non-Fraud Count')
    plt.show()
except NameError:
    print('Load the dataset first (run the cell above).')

In [None]:
# Top 10 NGOs by number of fraud cases
try:
    if 'NGO_Name' in df.columns:
        top_ngos = df[df['Is_Fraud_bool']==1]['NGO_Name'].value_counts().head(10)
        display(top_ngos)
        plt.figure(figsize=(10,4))
        sns.barplot(x=top_ngos.values, y=top_ngos.index, palette='magma')
        plt.title('Top 10 NGOs by Fraud Count')
        plt.xlabel('Fraud Count')
        plt.show()
except Exception as e:
    print('NGO plot skipped:', e)

In [None]:
# Top 10 Vendors by fraud cases
try:
    if 'Vendor_Name' in df.columns:
        top_vendors = df[df['Is_Fraud_bool']==1]['Vendor_Name'].value_counts().head(10)
        display(top_vendors)
        plt.figure(figsize=(10,4))
        sns.barplot(x=top_vendors.values, y=top_vendors.index)
        plt.title('Top 10 Vendors by Fraud Count')
        plt.xlabel('Fraud Count')
        plt.show()
except Exception as e:
    print('Vendor plot skipped:', e)
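
Raw fraud counts favor NGOs and vendors that simply have many transactions. A complementary view is the fraud rate per category. The sketch below assumes the `Is_Fraud_bool` column created earlier; the helper `fraud_rate_by`, its `min_count` cutoff, and the toy data are hypothetical.

```python
import pandas as pd

def fraud_rate_by(df: pd.DataFrame, column: str, label: str = "Is_Fraud_bool",
                  min_count: int = 2) -> pd.DataFrame:
    """Per-category transaction count, fraud count and fraud rate, highest rate first."""
    stats = df.groupby(column)[label].agg(transactions="count", fraud_cases="sum")
    stats["fraud_rate"] = stats["fraud_cases"] / stats["transactions"]
    # Drop categories with too few rows for the rate to be meaningful.
    stats = stats[stats["transactions"] >= min_count]
    return stats.sort_values("fraud_rate", ascending=False)

# Toy data for illustration only (not drawn from the real dataset).
toy = pd.DataFrame({
    "Vendor_Name": ["A", "A", "A", "A", "B", "B", "C"],
    "Is_Fraud_bool": [1, 1, 0, 0, 1, 1, 1],
})
rates = fraud_rate_by(toy, "Vendor_Name")
print(rates)
```

On the real data the call would look like `fraud_rate_by(df, 'Vendor_Name', min_count=20)`, with the cutoff chosen to suit the dataset size.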

In [None]:
# Requested vs Legitimate Amount comparison
try:
    if {'Requested_Amount_PKR','Legitimate_Estimate_PKR'}.issubset(df.columns):
        df['Amount_Gap'] = df['Requested_Amount_PKR'] - df['Legitimate_Estimate_PKR']
        plt.figure(figsize=(6,4))
        sns.boxplot(x='Is_Fraud_bool', y='Amount_Gap', data=df)
        plt.xticks([0,1], ['Non-Fraud (0)', 'Fraud (1)'])
        plt.title('Amount Gap by Fraud Label')
        plt.ylabel('Requested - Legitimate (PKR)')
        plt.show()
    else:
        print('Amount columns not found in dataset.')
except Exception as e:
    print('Amount gap plot failed:', e)

In [None]:
# Correlation heatmap for numeric features
try:
    num_df = df.select_dtypes(include=[np.number])
    plt.figure(figsize=(10,8))
    sns.heatmap(num_df.corr(), annot=False, cmap='coolwarm')
    plt.title('Correlation Heatmap (numeric features)')
    plt.show()
except Exception as e:
    print('Heatmap failed:', e)
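
The heatmap shows every pairwise correlation; for modeling it can be more useful to rank the numeric features by their correlation with the label. A minimal sketch, assuming the `Is_Fraud_bool` column from earlier; `target_correlations` and the toy columns `x` and `y` are hypothetical.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, label: str = "Is_Fraud_bool") -> pd.Series:
    """Pearson correlation of each numeric column with the label, strongest first."""
    numeric = df.select_dtypes(include="number").astype(float)
    corr = numeric.corr()[label].drop(label)
    # Sort by absolute value so strong negative correlations also rank high.
    return corr.sort_values(key=abs, ascending=False)

# Toy data for illustration only.
toy = pd.DataFrame({
    "Is_Fraud_bool": [0, 0, 1, 1],
    "x": [1, 2, 3, 4],   # rises with the label
    "y": [1, 0, 0, 1],   # unrelated to the label
})
ranked = target_correlations(toy)
print(ranked)
```

A feature that correlates almost perfectly with the label is worth checking for leakage before it is passed to a model.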

## 4. PyCaret Model Building

This section uses PyCaret for automated preprocessing and model selection. The `setup()` call will perform encoding, imputation, transformation, and train/test splitting automatically.

In [None]:
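# Import PyCaret classification
try:
    from pycaret.classification import *
except Exception as e:
    print('PyCaret import failed. To install: !pip install pycaret.', e)

# Setup
try:
    # Drop the raw label so it cannot leak into the features, and keep only
    # rows with a known target, stored as a plain integer column.
    model_df = df.drop(columns=['Is_Fraud']).dropna(subset=['Is_Fraud_bool'])
    model_df['Is_Fraud_bool'] = model_df['Is_Fraud_bool'].astype(int)

    # Note: silent, combine_rare_levels and ignore_low_variance are PyCaret 2.x
    # arguments; PyCaret 3.x no longer accepts them.
    s = setup(
        data=model_df, target='Is_Fraud_bool', session_id=42, silent=True,
        n_jobs=-1, normalize=True, transformation=True, combine_rare_levels=True,
        remove_multicollinearity=True, ignore_low_variance=True,
    )

    print('PyCaret setup completed. Use compare_models() to find best model.')
except Exception as e:
    print('PyCaret setup could not run (likely because PyCaret is not installed or dataset not loaded).', e)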
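
The `setup()` arguments in the cell above follow the PyCaret 2.x signature. Under PyCaret 3.x, `silent` was removed and the rare-level and low-variance switches were replaced by threshold parameters. The sketch below shows one plausible translation; the helper names and the threshold values are assumptions, so check them against the installed version before relying on them.

```python
# Assumption: PyCaret 3.x, where `silent` no longer exists and
# `combine_rare_levels` / `ignore_low_variance` were replaced by
# `rare_to_value` / `low_variance_threshold`.
def pycaret3_setup_kwargs(target: str = "Is_Fraud_bool") -> dict:
    """Hypothetical helper: keyword arguments for a PyCaret 3.x setup() call."""
    return {
        "target": target,
        "session_id": 42,
        "n_jobs": -1,
        "normalize": True,
        "transformation": True,
        "rare_to_value": 0.05,          # illustrative: merge categories below 5% frequency
        "remove_multicollinearity": True,
        "low_variance_threshold": 0.0,  # illustrative: drop zero-variance features
        "verbose": False,
    }

def run_setup(data):
    """Call setup() with the arguments above; PyCaret is imported only when this runs."""
    from pycaret.classification import setup
    return setup(data=data, **pycaret3_setup_kwargs())

kwargs = pycaret3_setup_kwargs()
print(sorted(kwargs))
```

Keeping the arguments in a dictionary makes it easy to switch between the two versions without editing the call itself.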

In [None]:
# Compare models (automated) - UNCOMMENT to run
# best = compare_models(n_select=1)
# print(best)

print('Run compare_models() when PyCaret is available in your environment.')

In [None]:
# Example: Create, tune and finalize a model (uncomment to run when ready)
# model = create_model('rf')
# tuned = tune_model(model)
# final = finalize_model(tuned)  # refits on the full dataset; keep the returned model
# save_model(final, 'fraud_rf_pycaret')

print('Example model commands provided. Uncomment to run in your environment.')

## 5. Insights & Next Steps

**Key Actions After Running This Notebook:**

- Review EDA outputs (fraud concentration by NGO and vendor; the same `value_counts()` pattern extends to bank or geography columns if the dataset includes them).
- Run PyCaret `compare_models()` to find the best model automatically.
- Evaluate the selected model using `evaluate_model()` and `predict_model()`.
- Save the final model with `save_model()` for deployment in production or an API.
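
Accuracy alone can mislead when fraud cases are rare, so it helps to check precision and recall on the fraud class directly. The sketch below computes them from a predictions table. It assumes the prediction column is named `prediction_label`, as `predict_model()` returns in PyCaret 3.x (older releases use `Label`); `fraud_metrics` and the toy values are hypothetical.

```python
import pandas as pd

def fraud_metrics(preds: pd.DataFrame, actual: str = "Is_Fraud_bool",
                  predicted: str = "prediction_label") -> dict:
    """Precision, recall and F1 for the fraud class (label 1)."""
    y_true = preds[actual].astype(int)
    y_pred = preds[predicted].astype(int)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())  # fraud correctly flagged
    fp = int(((y_true == 0) & (y_pred == 1)).sum())  # legitimate rows flagged as fraud
    fn = int(((y_true == 1) & (y_pred == 0)).sum())  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy predictions for illustration only.
toy_preds = pd.DataFrame({
    "Is_Fraud_bool":    [1, 1, 1, 0, 0, 0, 0, 0],
    "prediction_label": [1, 1, 0, 1, 0, 0, 0, 0],
})
metrics = fraud_metrics(toy_preds)
print(metrics)
```

Recall is usually the number to watch in fraud work, since a missed fraud case tends to cost more than a false alarm.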

**Potential Improvements:**

- Generate synthetic examples (SMOTE) if the dataset is imbalanced, for example through the `fix_imbalance=True` option of PyCaret's `setup()`.
- Create time-series features (lag counts, rolling fraud-rate) from the `Date` column.
- Build an inference pipeline (Flask/FastAPI) for live fraud scoring.
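
The rolling fraud-rate idea can be sketched as a per-vendor "prior fraud rate" that uses only earlier transactions, so the current row's own label never leaks into its feature. The helper `add_prior_fraud_rate`, the new column names, and the toy data are hypothetical; the sketch assumes `Date` is already parsed and `Is_Fraud_bool` holds plain 0/1 integers.

```python
import pandas as pd

def add_prior_fraud_rate(df: pd.DataFrame, key: str = "Vendor_Name",
                         date: str = "Date", label: str = "Is_Fraud_bool") -> pd.DataFrame:
    """Add, per row, the count and fraud rate of the key's earlier transactions."""
    out = df.sort_values([key, date], kind="stable").copy()
    grouped = out.groupby(key)[label]
    prior_count = grouped.cumcount()             # number of earlier rows for this key
    prior_fraud = grouped.cumsum() - out[label]  # fraud cases among those earlier rows
    out["prior_txn_count"] = prior_count
    # Rows with no history get a rate of 0.0 instead of a division by zero.
    rate = prior_fraud / prior_count.where(prior_count > 0)
    out["prior_fraud_rate"] = rate.fillna(0.0)
    return out.sort_index()  # restore the original row order

# Toy data for illustration only.
toy = pd.DataFrame({
    "Vendor_Name": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01",
                            "2024-01-15", "2024-02-15"]),
    "Is_Fraud_bool": [1, 0, 1, 0, 1],
})
features = add_prior_fraud_rate(toy)
print(features[["Vendor_Name", "prior_txn_count", "prior_fraud_rate"]])
```

Transactions that share a date are treated in their existing row order, which is a simplification; a stricter version would exclude same-day rows from each other's history.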

---

*Notebook created for Tehreem — professional, English documentation suitable for GitHub.*