# Document Chunking and Vector Database Setup

This notebook demonstrates how to load, chunk, and index documents for the retrieval-augmented generation (RAG) pipeline in the Multi-Agent AML Investigation System.

## What this notebook covers:
- Document loading and preprocessing
- Text chunking strategies
- Embedding generation
- ChromaDB vector database setup
- Document similarity search
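
To give a feel for the text chunking step listed above, here is a minimal sketch of fixed-size character chunking with overlap. The helper name `chunk_text` and the default sizes are assumptions made for illustration; the project's actual chunking logic may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    # Collapse runs of whitespace so indentation does not inflate chunk sizes.
    normalized = " ".join(text.split())
    step = chunk_size - overlap

    chunks = []
    for start in range(0, len(normalized), step):
        chunks.append(normalized[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(normalized):
            break
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, which tends to help retrieval at the cost of some duplicated text.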


In [3]:
# Import necessary libraries
import sys
import os
import json
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any
import chromadb
from chromadb.config import Settings as ChromaSettings

# Add the parent directory (project root) to the path so the `app` package can be imported
sys.path.append('../')

from app.core.config_simple import settings
from app.core.logger import get_logger
from app.agents.tools.vector_tools import VectorRetrievalTool

logger = get_logger(__name__)
print("✅ Imports successful!")


✅ Imports successful!


## 1. Creating Sample KYC Documents


In [4]:
# Create sample KYC documents
sample_kyc_documents = [
    {
        "customer_id": "C001",
        "document_type": "passport",
        "content": """
        PASSPORT INFORMATION
        Name: John Doe
        Passport Number: P123456789
        Date of Birth: March 15, 1985
        Place of Birth: New York, USA
        Nationality: American
        Issue Date: January 15, 2020
        Expiry Date: January 15, 2030
        Issuing Authority: US Department of State
        
        This passport is valid for international travel and contains biometric data.
        The holder has been verified through standard security procedures.
        """,
        "metadata": {
            "document_id": "DOC001",
            "upload_date": "2024-01-15",
            "verification_status": "verified",
            "source": "government_database"
        }
    },
    {
        "customer_id": "C001",
        "document_type": "address_proof",
        "content": """
        ADDRESS VERIFICATION DOCUMENT
        Customer: John Doe
        Address: 123 Main Street, Apt 4B, New York, NY 10001
        Document Type: Utility Bill
        Service Provider: ConEdison
        Account Number: 1234567890
        Billing Period: December 2023
        Amount Due: $156.78
        
        This document serves as proof of residence for the above customer.
        The address has been verified through utility company records.
        """,
        "metadata": {
            "document_id": "DOC002",
            "upload_date": "2024-01-10",
            "verification_status": "verified",
            "source": "utility_company"
        }
    },
    {
        "customer_id": "C002",
        "document_type": "source_of_wealth",
        "content": """
        SOURCE OF WEALTH DECLARATION
        Customer: Jane Smith
        Declaration Date: January 20, 2024
        
        I, Jane Smith, declare that my primary sources of wealth are:
        1. Employment Income: Senior Software Engineer at TechCorp Inc.
           - Annual Salary: $180,000
           - Employment Duration: 5 years
           - Position: Senior Software Engineer
        
        2. Investment Income: 
           - Stock Portfolio: $250,000
           - Mutual Funds: $75,000
           - Real Estate Investment: $300,000
        
        3. Inheritance:
           - Received from family trust: $100,000 (2022)
           - Property inheritance: $200,000 (2023)
        
        All sources of wealth are legitimate and properly documented.
        I understand that any false declarations may result in account closure.
        """,
        "metadata": {
            "document_id": "DOC003",
            "upload_date": "2024-01-20",
            "verification_status": "pending",
            "source": "customer_declaration"
        }
    },
    {
        "customer_id": "C003",
        "document_type": "business_license",
        "content": """
        BUSINESS LICENSE INFORMATION
        Business Name: Johnson Consulting LLC
        License Number: BL-2023-789456
        Business Type: Professional Services
        Owner: Robert Johnson
        Address: 789 Pine Street, Miami, FL 33101
        Issue Date: March 1, 2023
        Expiry Date: March 1, 2025
        License Status: Active
        
        This business is licensed to provide consulting services in the state of Florida.
        The business has been operating for 2 years and maintains good standing.
        """,
        "metadata": {
            "document_id": "DOC004",
            "upload_date": "2024-01-25",
            "verification_status": "verified",
            "source": "state_business_registry"
        }
    }
]

print(f"✅ Created {len(sample_kyc_documents)} sample KYC documents")
for doc in sample_kyc_documents:
    print(f"- {doc['customer_id']}: {doc['document_type']} ({doc['metadata']['document_id']})")


✅ Created 4 sample KYC documents
- C001: passport (DOC001)
- C001: address_proof (DOC002)
- C002: source_of_wealth (DOC003)
- C003: business_license (DOC004)
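
Records shaped like the ones above usually need light preparation before they are chunked or indexed: the triple-quoted `content` strings carry leading indentation, and ChromaDB's `collection.add` takes parallel `ids`, `documents`, and `metadatas` lists with flat metadata values. The following is a minimal sketch under those assumptions. The helper `prepare_for_vector_store` and the shortened stand-in record are hypothetical and not part of the project code.

```python
import textwrap
from typing import Any, Dict, List, Tuple


def prepare_for_vector_store(
    docs: List[Dict[str, Any]],
) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Flatten KYC records into parallel lists (hypothetical helper)."""
    ids, texts, metadatas = [], [], []
    for doc in docs:
        # Remove the common leading indentation left by the triple-quoted strings.
        text = textwrap.dedent(doc["content"]).strip()
        # Keep metadata flat: merge top-level fields with the nested metadata dict.
        meta = {
            "customer_id": doc["customer_id"],
            "document_type": doc["document_type"],
            **doc["metadata"],
        }
        ids.append(doc["metadata"]["document_id"])
        texts.append(text)
        metadatas.append(meta)
    return ids, texts, metadatas


# Shortened stand-in for one entry of sample_kyc_documents.
example_docs = [
    {
        "customer_id": "C001",
        "document_type": "passport",
        "content": """
        PASSPORT INFORMATION
        Name: John Doe
        """,
        "metadata": {"document_id": "DOC001", "verification_status": "verified"},
    }
]

ids, texts, metadatas = prepare_for_vector_store(example_docs)
print(ids)
print(texts[0])
```

The resulting lists could then be passed to a ChromaDB collection as `collection.add(ids=ids, documents=texts, metadatas=metadatas)`. If each document is split into several chunks first, every chunk would need its own unique ID, for example by appending a chunk index to the document ID.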
