<a href="https://colab.research.google.com/github/atsuvovor/Pub_Data_Analytics_Project/blob/main/Fraud_Detection_Using_LLMs_RAG_and_GenAI_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Project: Fraud Detection Using Machine Learning, Retrieval-Augmented Generation (RAG), and Generative AI**

**Toronto, January 17, 2025**  
**Author: Atsu Vovor**
>Master of Management in Artificial Intelligence    
>Consultant Data Analytics Specialist | Machine Learning |  
Data Science | Quantitative Analysis | French & English Bilingual  


---

### Abstract
This project focuses on designing and developing generative AI solutions for finance and cybersecurity use cases, specifically fraud detection. By integrating machine learning models with retrieval-augmented generation (RAG) pipelines, the project enhances retrieval and generation tasks while optimizing large language model (LLM) performance through fine-tuning and strategic prompting. The system addresses challenges related to speed, performance, and cost-effectiveness, offering a scalable and efficient fraud detection framework. Detailed documentation of methodologies, results, and limitations ensures accessibility for diverse audiences.

---

### 1. Introduction
Fraud detection is critical for mitigating financial losses and ensuring the integrity of digital transactions. The increasing sophistication of fraud tactics necessitates innovative solutions that combine machine learning, retrieval-augmented generation, and generative AI technologies. This project enhances fraud detection accuracy and interpretability by optimizing LLMs and RAG pipelines, addressing challenges such as latency, cost-effectiveness, and real-time decision-making.

---

### 2. Objectives
1. **Design and develop generative AI solutions** for finance and cybersecurity use cases, focusing on fraud detection.
2. **Implement RAG pipelines** to enhance retrieval and generation tasks by integrating knowledge bases with LLMs.
3. **Optimize LLM performance** through fine-tuning, prompting strategies, and rigorous model evaluation.
4. **Address challenges** related to speed, performance, and cost-effectiveness in deploying LLMs for real-time fraud detection.
5. **Document methodologies, results, and limitations** for technical and non-technical audiences.

---

### 3. Project Description
The fraud detection system includes:
1. A **knowledge base** embedded and indexed for retrieval tasks.
2. A machine learning model trained on transaction data for fraud prediction.
3. RAG pipelines that integrate predictions with contextual knowledge explanations.
4. Real-time optimization of LLMs to improve latency and cost-effectiveness.
5. Comprehensive evaluation metrics, including precision, recall, F1-score, latency, and confusion matrix analysis.

---

### 4. Project Scope
This project aims to:
1. Detect fraudulent financial transactions with high accuracy.
2. Enhance interpretability through a retrieval-augmented generation pipeline.
3. Optimize LLMs for real-time fraud detection tasks.
4. Provide detailed documentation for reproducibility and diverse audience comprehension.
5. Ensure extensibility for additional datasets and use cases.

---

### 5. Data Collection
#### 5.1 Data Sources
1. **Transaction Data**:
   - Simulated or real transaction datasets containing attributes like transaction ID, user ID, amount, location, device type, and timestamp.
   - Fraud labels (binary: 0 for legitimate, 1 for fraud).
2. **Knowledge Base**:
   - Fraud-related insights, such as high-risk transaction thresholds and suspicious patterns.

#### 5.2 Data Characteristics
1. High class imbalance (~5% fraud cases).
2. Diverse transaction contexts (e.g., geographic locations, device types).

---

### 6. Exploratory Data Analysis (EDA)
#### 6.1 Steps
1. **Descriptive Statistics**:
   - Summarize distributions of transaction attributes.
2. **Fraud Analysis**:
   - Compare features across fraudulent and legitimate transactions.
3. **Class Imbalance Analysis**:
   - Evaluate the extent of fraud cases and propose mitigation strategies.
4. **Visualization**:
   - Generate heatmaps, histograms, and boxplots to uncover patterns.

---

### 7. Feature Engineering
#### 7.1 Existing Features
1. Transaction amount
2. User ID
3. Location
4. Device type
5. Timestamp

#### 7.2 Engineered Features
1. **Temporal Features**:
   - Hour of transaction, day of the week.
2. **Risk Indicators**:
   - High-risk locations and amounts.
3. **Behavioral Features**:
   - User transaction frequency and deviations from historical patterns.
4. **Encoded Variables**:
   - One-hot or label encoding for categorical features.
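
The temporal and behavioral features above can be sketched without any third-party libraries. This is a minimal illustration only: the sample transactions, the `engineer_features` helper, and the feature names `user_txn_count` and `amount_vs_user_mean` are assumptions made for the example, not part of the project's dataset.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical transactions: (user_id, amount, ISO timestamp).
transactions = [
    (1, 120.0, "2025-01-06T09:15:00"),
    (1, 80.0, "2025-01-06T23:40:00"),
    (2, 5000.0, "2025-01-07T03:05:00"),
    (1, 1000.0, "2025-01-08T14:00:00"),
]

# First pass: collect each user's amounts to derive behavioral baselines.
amounts_by_user = defaultdict(list)
for user_id, amount, _ in transactions:
    amounts_by_user[user_id].append(amount)


def engineer_features(user_id, amount, timestamp):
    """Return temporal and behavioral features for one transaction."""
    ts = datetime.fromisoformat(timestamp)
    user_amounts = amounts_by_user[user_id]
    return {
        "hour": ts.hour,                        # temporal: hour of transaction
        "day_of_week": ts.weekday(),            # temporal: Monday = 0
        "user_txn_count": len(user_amounts),    # behavioral: transaction frequency
        "amount_vs_user_mean": amount / mean(user_amounts),  # deviation from history
    }


features = [engineer_features(*t) for t in transactions]
for row in features:
    print(row)
```

Note that this sketch computes each user's baseline from all of that user's transactions, including later ones. In a real pipeline the baseline would likely need to use only transactions that precede the one being scored, to avoid leaking future information.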

---

### 8. Model Development
#### 8.1 Machine Learning Models
1. **Base Models**:
   - Random Forest, Gradient Boosting.
2. **Advanced Models**:
   - XGBoost, LightGBM for improved accuracy.

#### 8.2 Retrieval-Augmented Generation (RAG) Pipeline
1. **Knowledge Embedding**:
   - Use SentenceTransformer for semantic encoding of knowledge.
2. **FAISS Index**:
   - Store and retrieve knowledge embeddings for contextual explanations.
3. **Integration**:
   - Combine RAG outputs with model predictions for interpretability.
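
The retrieval step can be illustrated with a dependency-free sketch. It assumes toy three-dimensional vectors in place of real sentence embeddings and ranks by cosine similarity, whereas the notebook's implementation uses `SentenceTransformer` embeddings with a FAISS L2 (Euclidean) index. The `retrieve` helper and the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, knowledge_vecs, knowledge_texts, k=2):
    """Return the k knowledge texts whose vectors are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(knowledge_vecs, knowledge_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


knowledge_texts = [
    "Transactions above $5000 are considered high-risk.",
    "Transactions from unknown locations are flagged.",
    "Multiple transactions in a short time period may indicate fraud.",
]
# Toy vectors standing in for sentence embeddings (assumption for illustration).
knowledge_vecs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]
query_vec = [0.8, 0.3, 0.1]  # pretend embedding of a large-amount transaction

top = retrieve(query_vec, knowledge_vecs, knowledge_texts, k=2)
print(top)
```

The retrieved entries are then attached to the classifier's prediction as context. They describe related rules and do not by themselves prove why the model produced a given score.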

#### 8.3 LLM Optimization
1. Fine-tune models on domain-specific datasets.
2. Develop and test prompting strategies for effective retrieval and generation.
3. Evaluate model performance using latency, accuracy, and cost metrics.
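
One possible prompting strategy is to give the LLM the model's score and the retrieved knowledge, and to restrict it to that information. The sketch below only builds the prompt text. The `build_explanation_prompt` helper, its wording, and the 0.5 threshold are assumptions for illustration, and no LLM is called.

```python
def build_explanation_prompt(transaction, fraud_probability, retrieved_rules, threshold=0.5):
    """Assemble a grounded prompt from a prediction and retrieved knowledge entries."""
    verdict = (
        "flagged as potentially fraudulent"
        if fraud_probability >= threshold
        else "not flagged"
    )
    rules = "\n".join(f"- {rule}" for rule in retrieved_rules)
    return (
        "You are a fraud analyst assistant.\n"
        f"Transaction: ${transaction['transaction_amount']:.2f} in {transaction['location']} "
        f"using {transaction['device_type']}.\n"
        f"Model fraud probability: {fraud_probability:.2f} ({verdict}).\n"
        "Relevant knowledge-base entries:\n"
        f"{rules}\n"
        "Explain in two sentences whether the entries support the model's assessment. "
        "Use only the information above."
    )


# Hypothetical inputs for illustration.
transaction = {"transaction_amount": 7500.0, "location": "US", "device_type": "Mobile"}
rules = ["Transactions above $5000 are considered high-risk."]
prompt = build_explanation_prompt(transaction, 0.82, rules)
print(prompt)
```

Keeping the template in a single function makes it straightforward to compare alternative wordings on the same set of transactions when evaluating latency, accuracy, and cost.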

#### 8.4 Training and Testing
1. Train-test split (80/20).
2. Cross-validation for hyperparameter tuning.
3. Address class imbalance with SMOTE or cost-sensitive learning.
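
The split and the cost-sensitive weighting can be sketched in plain Python to show what they compute. The helpers below are illustrative assumptions. In scikit-learn, similar behavior is available through `train_test_split(..., stratify=y)` and `class_weight="balanced"`, which uses the same `n_samples / (n_classes * class_count)` formula.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector with ~5% fraud, matching the data characteristics.
labels = [0] * 950 + [1] * 50
train_idx, test_idx = stratified_split(labels)
weights = balanced_class_weights([labels[i] for i in train_idx])

print(len(train_idx), len(test_idx))           # sizes of the two splits
print(sum(labels[i] for i in test_idx))        # fraud cases held out for testing
print(weights)                                 # per-class training weights
```

With these weights, misclassifying a fraud case costs far more during training than misclassifying a legitimate one, which counteracts the tendency to predict the majority class everywhere.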

---

### 9. Model Evaluation
#### 9.1 Metrics
1. **Accuracy**
2. **Precision**
3. **Recall**
4. **F1-score**
5. **Latency (ms)**
6. **Cost-effectiveness**
7. **Confusion Matrix**

#### 9.2 Visualization
1. Precision-recall curve.
2. ROC curve with AUC.
3. Confusion matrix heatmap.
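
As a worked example of how these metrics relate, the sketch below derives them from confusion-matrix counts. The counts are hypothetical, chosen to resemble a 200-transaction test set with 10 fraud cases, and `metrics_from_confusion` is an illustrative helper.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical model: catches 4 of 10 fraud cases with 5 false alarms.
model_metrics = metrics_from_confusion(tn=185, fp=5, fn=6, tp=4)

# Baseline that labels every transaction as legitimate.
baseline_metrics = metrics_from_confusion(tn=190, fp=0, fn=10, tp=0)

print(model_metrics)
print(baseline_metrics)
```

The baseline reaches 95% accuracy while detecting no fraud at all, which is why precision, recall, and F1-score are more informative than accuracy on data with roughly 5% fraud.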

---

### 10. Deployment Plan
#### 10.1 Infrastructure
1. Cloud-based deployment for scalability.
2. Integration with real-time data streams for fraud detection.

#### 10.2 Monitoring and Maintenance
1. Periodic model retraining.
2. Continuous monitoring of latency, accuracy, and cost metrics.
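
Latency monitoring could start from something as simple as the sketch below, which times each prediction call and summarizes the distribution. The `dummy_predict` function is a stand-in for a real model call, and the choice of p50 and p95 as summary statistics is an assumption.

```python
import time
from statistics import quantiles


def measure_latencies_ms(predict_fn, inputs):
    """Time each call to predict_fn and return the latencies in milliseconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        predict_fn(item)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def latency_summary(latencies_ms):
    """Summarize latencies with median, 95th percentile, and maximum."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points; cuts[i] is percentile i + 1
    return {"p50": cuts[49], "p95": cuts[94], "max": max(latencies_ms)}


def dummy_predict(amount):
    """Placeholder for a model call (hypothetical threshold rule)."""
    return amount > 5000


latencies = measure_latencies_ms(dummy_predict, [100.0 * i for i in range(200)])
summary = latency_summary(latencies)
print(summary)
```

Tail percentiles such as p95 tend to be more useful than the mean for real-time systems, because a small share of slow requests can dominate the user-facing experience.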

---

### 11. Timeline and Milestones
1. **Week 1-2**: Data collection and knowledge base creation.
2. **Week 3-4**: EDA and feature engineering.
3. **Week 5-6**: Model training and RAG pipeline development.
4. **Week 7**: LLM fine-tuning and optimization.
5. **Week 8**: Model evaluation and deployment preparation.
6. **Week 9-10**: Deployment and documentation.

---

### 12. Deliverables
1. **Fraud detection model with detailed metrics**.
2. **Final transaction table** with predictions, metrics, and explanations.
3. **Deployment-ready fraud detection system**.

---

### 13. Project Limitations
1. **Data Quality**:
   - Performance may vary with incomplete or inconsistent transaction data.
2. **Real-Time Constraints**:
   - High transaction volumes may affect system latency and performance.
3. **Domain Generalization**:
   - Models trained on specific datasets may not generalize well to other domains without retraining.
4. **Cost**:
   - Fine-tuning and deploying LLMs can be resource-intensive.
5. **Interpretability**:
   - Complex models and RAG pipelines may require additional effort to explain predictions to non-technical stakeholders.

---

### 14. Conclusion
This project presents an approach to fraud detection that integrates machine learning, RAG pipelines, and generative AI. By optimizing LLM performance and addressing deployment challenges, the system aims to deliver accurate and interpretable fraud predictions for finance and cybersecurity applications.



### **Python Implementation**

### Key Features of the Script:
1. **Modular Functions**: Each component (data simulation, RAG, ML model training, evaluation) is encapsulated in functions or classes.
2. **Machine Learning (ML)**: A `RandomForestClassifier` model is trained for fraud detection.
3. **RAG Pipeline**: Knowledge is embedded and retrieved using `SentenceTransformer` and `FAISS`.
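4. **Metrics and Visualization**: Includes precision, recall, F1-score, and confusion matrix. Results are appended to the transaction table and displayed.
5. **Generative AI (LLMs)**: `SentenceTransformer` provides knowledge embedding and contextual retrieval; in this script, explanations are assembled from the retrieved entries with a text template rather than a generative LLM call.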

### Environment update

In [None]:
#!pip install transformers sentence-transformers faiss-cpu scikit-learn pandas numpy
#!pip install transformers sentence-transformers
!pip install faiss-cpu
#!pip install faiss-gpu
#!pip install farm-haystack

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score
from sentence_transformers import SentenceTransformer
import faiss
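import time
from IPython.display import display  # explicit import so display() also works outside a notebook cell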

import warnings
warnings.filterwarnings('ignore', category=UserWarning, message="X does not have valid feature names")

# Generate simulated transaction data
def generate_transaction_data():
    np.random.seed(42)
    data = {
        "transaction_id": range(1, 1001),
        "user_id": np.random.randint(1, 100, 1000),
        "transaction_amount": np.random.uniform(10, 10000, 1000),
        "location": np.random.choice(["US", "UK", "CA", "IN", "AU"], 1000),
        "device_type": np.random.choice(["Mobile", "Desktop", "Tablet"], 1000),
        "is_fraud": np.random.choice([0, 1], 1000, p=[0.95, 0.05]),
        "transaction_time": pd.date_range("2025-01-01", periods=1000, freq="min").to_series().sample(1000).values
    }
    df = pd.DataFrame(data)
    df["hour"] = pd.to_datetime(df["transaction_time"]).dt.hour
    df["is_high_risk_location"] = df["location"].isin(["IN", "AU"]).astype(int)
    df["device_type_encoded"] = df["device_type"].astype("category").cat.codes
    display(df)
    return df

#--------------Exploratory Data Analysis (EDA)---------------------------------------
# Descriptive statistics
def descriptive_statistics(df):
    print("\n--- Descriptive Statistics ---")
    display(df.describe(include='all'))
    print("\nUnique Values per Column:")
    display(df.nunique())


def fraud_analysis(df):
    print("\n--- Fraud Analysis ---")

    # Validate if the required columns are present
    required_columns = ["is_fraud", "device_type", "transaction_amount", "location", "hour"]
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        print(f"Missing columns for analysis: {missing_columns}")
        return

    # Fraud statistics: Calculate mean for numerical columns grouped by fraud status
    fraud_stats = df.groupby("is_fraud").mean(numeric_only=True)
    print("\nFraud Statistics (Mean Values by Fraud Status):")
    display(fraud_stats)

    # Plot: Fraud Statistics (Mean Values by Fraud Status)
    fraud_stats.plot(kind='bar', figsize=(10, 6))
    plt.title("Fraud Statistics (Mean Values by Fraud Status)")
    plt.ylabel("Mean Value")
    plt.show()

    # Proportions of fraud by category (device type)
    print("\nFraud Proportions by Device Type:")
    fraud_device_proportions = df.groupby("is_fraud")["device_type"].value_counts(normalize=True)
    display(fraud_device_proportions)

    # Plot: Fraud Proportions by Device Type (unstack so each device type becomes a stacked segment)
    fraud_device_proportions.unstack(fill_value=0).plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title("Fraud Proportions by Device Type")
    plt.ylabel("Proportion")
    plt.show()

    # Fraud by location
    print("\nFraud Proportions by Location:")
    fraud_location_proportions = df.groupby("is_fraud")["location"].value_counts(normalize=True)
    display(fraud_location_proportions)

    # Plot: Fraud Proportions by Location (unstack so each location becomes a stacked segment)
    fraud_location_proportions.unstack(fill_value=0).plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title("Fraud Proportions by Location")
    plt.ylabel("Proportion")
    plt.show()

    # Insights: High-risk locations vs. fraud occurrences
    high_risk_stats = df.groupby("is_high_risk_location")["is_fraud"].mean()
    print("\nFraud Rate in High-Risk Locations:")
    display(high_risk_stats)

    # Plot: Fraud Rate in High-Risk Locations
    high_risk_stats.plot(kind='bar', figsize=(10, 6))
    plt.title("Fraud Rate in High-Risk Locations")
    plt.ylabel("Fraud Rate")
    plt.show()

    # Fraud Rate by Hour of Transaction
    print("\nFraud Rate by Hour of Transaction:")
    hour_fraud_rate = df.groupby("hour")["is_fraud"].mean()
    print(hour_fraud_rate)

    # Plot: Fraud Rate by Hour of Transaction
    plt.figure(figsize=(10, 6))
    sns.lineplot(x=hour_fraud_rate.index, y=hour_fraud_rate.values)
    plt.title("Fraud Rate by Hour of Transaction")
    plt.xlabel("Hour of Day")
    plt.ylabel("Fraud Rate")
    plt.show()

# Class imbalance analysis
def class_imbalance_analysis(df):
    print("\n--- Class Imbalance Analysis ---")
    fraud_count = df["is_fraud"].value_counts()
    print(fraud_count)
    fraud_ratio = fraud_count[1] / fraud_count[0]
    print(f"Fraud-to-Legitimate Ratio: {fraud_ratio:.4f}")
    if fraud_ratio < 0.1:
        print("Significant class imbalance detected. Consider using SMOTE or class weighting.")

    # Class distribution of 'is_fraud'
    class_distribution = df["is_fraud"].value_counts(normalize=True)
    print("\nClass Distribution of 'is_fraud':")
    print(class_distribution)

    # Plot: Class Distribution of 'is_fraud'
    plt.figure(figsize=(8, 6))
    sns.barplot(x=class_distribution.index, y=class_distribution.values)
    plt.title("Class Distribution of 'is_fraud'")
    plt.xlabel("Fraud Status")
    plt.ylabel("Proportion")
    plt.show()

    # Plot: Histogram of Transaction Amounts by Fraud Status
    plt.figure(figsize=(8, 6))
    sns.histplot(data=df, x="transaction_amount", hue="is_fraud", bins=30, kde=True)
    plt.title("Transaction Amount Distribution by Fraud Status")
    plt.xlabel("Transaction Amount")
    plt.ylabel("Frequency")
    plt.show()

    # Plot: Transaction Amounts Distribution for Fraud vs Non-Fraud
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, x="is_fraud", y="transaction_amount")
    plt.title("Transaction Amounts by Fraud Status")
    plt.xlabel("Fraud Status")
    plt.ylabel("Transaction Amount")
    plt.show()

# Visualization
def create_visualizations(df):
    print("\n--- Generating Visualizations ---")

    # Exclude non-numeric columns for the correlation heatmap
    numeric_df = df.select_dtypes(include=[np.number])

    # Heatmap of correlations
    plt.figure(figsize=(10, 6))
    sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
    plt.title("Correlation Heatmap")
    plt.show()

    # Histogram of transaction amounts
    plt.figure(figsize=(8, 6))
    sns.histplot(data=df, x="transaction_amount", hue="is_fraud", bins=30, kde=True)
    plt.title("Transaction Amount Distribution")
    plt.xlabel("Transaction Amount")
    plt.ylabel("Frequency")
    plt.show()

    # Boxplot of transaction amounts by fraud status
    plt.figure(figsize=(8, 6))
    sns.boxplot(data=df, x="is_fraud", y="transaction_amount")
    plt.title("Transaction Amounts by Fraud Status")
    plt.xlabel("Is Fraud")
    plt.ylabel("Transaction Amount")
    plt.show()

#-------------------------------------------------------------------------
def setup_knowledge_base():
    """Set up the knowledge base and return the knowledge texts, embedding model, and FAISS index."""
    knowledge = [
        "Transactions above $5000 are considered high-risk.",
        "Transactions from unknown locations are flagged.",
        "Multiple transactions in a short time period may indicate fraud.",
        "Transaction patterns deviating from history are suspicious."
    ]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(knowledge)
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)
    return knowledge, model, index

def train_model(df, features, target):
    """Train a Random Forest classifier and return the trained model and split data."""
    X = df[features]
    y = df[target]
    # Stratify so the small fraud share is preserved in both the training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    clf = RandomForestClassifier(random_state=42, n_estimators=100)
    clf.fit(X_train, y_train)
    # Return the targets as single-column DataFrames so they display as tables.
    y_test = y_test.to_frame(name=target)
    y_train = y_train.to_frame(name=target)
    print("\nModel trained successfully!")

    return clf, X_train, X_test, y_train, y_test

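def evaluate_model(y_test, y_pred):
    """Calculate and return precision, recall, F1-score, and the confusion matrix."""
    # zero_division=0 reports 0 (rather than a misleading 1) when no fraud is predicted.
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    f1 = f1_score(y_test, y_pred, zero_division=0)
    cm = confusion_matrix(y_test, y_pred)
    return precision, recall, f1, cm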

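def detect_fraud(transaction, clf, model, knowledge, faiss_index):
    """Detect fraud for a given transaction and return prediction with explanations."""
    # Describe the transaction in text and retrieve the two closest knowledge entries.
    query = f"Transaction of ${transaction['transaction_amount']:.2f} in {transaction['location']} using {transaction['device_type']}."
    query_embedding = np.asarray(model.encode([query]), dtype="float32")  # FAISS expects float32
    _, indices = faiss_index.search(query_embedding, 2)
    related_knowledge = [knowledge[i] for i in indices[0]]
    # Build a one-row DataFrame so feature names and order match the training data.
    feature_names = ["transaction_amount", "user_id", "hour", "is_high_risk_location", "device_type_encoded"]
    feature_values = pd.DataFrame([[transaction[name] for name in feature_names]], columns=feature_names)
    # Latency covers classifier inference only, not embedding or retrieval.
    start_time = time.perf_counter()
    fraud_prediction = clf.predict(feature_values)
    fraud_probability = clf.predict_proba(feature_values)[0][1]
    latency = time.perf_counter() - start_time
    # Word the explanation according to the prediction instead of always saying "flagged".
    status = "flagged" if fraud_prediction[0] else "not flagged"
    explanation = f"The transaction is {status}. Related knowledge: {' '.join(related_knowledge)}"
    return bool(fraud_prediction[0]), fraud_probability, explanation, latency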

def append_results_to_table(df, clf, model, knowledge, faiss_index):
    """Append fraud detection results and metrics to the transaction table."""
    results = []
    for _, row in df.iterrows():
        prediction, probability, explanation, latency = detect_fraud(row, clf, model, knowledge, faiss_index)
        results.append({
            "is_fraud_predicted": prediction,
            "fraud_probability": probability,
            "explanation": explanation,
            "latency_ms": latency * 1000  # Convert to milliseconds
        })
    results_df = pd.DataFrame(results)
    return pd.concat([df.reset_index(drop=True), results_df], axis=1)

def calculate_cost_effectiveness(df):
    """Calculate the cost-effectiveness metric based on predictions."""
    cost_per_fraud = 100  # Arbitrary cost for a fraudulent transaction
    # Count true positives: rows predicted as fraud that are actually fraud.
    true_positives = (df["is_fraud_predicted"].astype(bool) & (df["is_fraud"] == 1)).sum()
    cost_saved = cost_per_fraud * int(true_positives)
    return cost_saved


#---------------------------------------------------------------------
# Main script
def main():
    # Generate transaction data
    transactions = generate_transaction_data()

    # EDA steps
    descriptive_statistics(transactions)
    fraud_analysis(transactions)
    class_imbalance_analysis(transactions)
    create_visualizations(transactions)

    # Set up knowledge base
    knowledge, embed_model, faiss_index = setup_knowledge_base()

    # Define features and target
    features = ["transaction_amount", "user_id", "hour", "is_high_risk_location", "device_type_encoded"]
    target = "is_fraud"

    # Train the model
    clf, X_train, X_test, y_train, y_test = train_model(transactions, features, target)
    #print("\nModel trained successfully!")
    # Evaluate the model
    y_pred = pd.DataFrame(clf.predict(X_test), columns=["is_fraud_predicted"])
    #y_pred = clf.predict(X_test)
    if y_pred["is_fraud_predicted"].sum() == 0:
        print("\nWarning: No positive predictions made!")

    print("\ndisplay(X_train)")
    display(X_train)
    print("\ndisplay(X_test)")
    display(X_test)
    print("\ndisplay(y_test)")
    display(y_test)
    print("\ndisplay(y_pred)")
    display(y_pred)

    print("\nModel evaluation results:")
    precision, recall, f1, confusion_mat = evaluate_model(y_test, y_pred)
    print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
    print(f"Confusion Matrix:\n{confusion_mat}")
    # Classification report
    print(f"\nClassification report:\n{classification_report(y_test, y_pred)}")

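    # Append fraud detection results to table (all rows, including those used for training,
    # so the row-level predictions and the cost metric below are optimistic)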
    transactions_with_results = append_results_to_table(transactions, clf, embed_model, knowledge, faiss_index)

    # Add model metrics to every row in the table
    transactions_with_results["precision"] = precision
    transactions_with_results["recall"] = recall
    transactions_with_results["f1_score"] = f1
    transactions_with_results["confusion_matrix"] = [confusion_mat.tolist()] * len(transactions_with_results)

    # Calculate cost-effectiveness
    cost_saved = calculate_cost_effectiveness(transactions_with_results)
    transactions_with_results["cost_effectiveness"] = cost_saved

    # Print the final table with all metrics
    print("Final Transactions with Fraud Detection Results and Metrics:")
    display(transactions_with_results.head())

if __name__ == "__main__":
    main()
