# Generic File Ingestion RAG Pipeline

This notebook builds a RAG pipeline that analyzes a data file and ranks the five database tables best suited to ingest it, combining semantic search with LLM analysis.

## Features:
- **Multi-format support**: CSV, Excel, JSON, TSV, TXT files
- **Automatic domain detection** from column names
- **Context-aware semantic query generation**
- **Output format optimized for a SQL agent**
- **Confidence scoring for automation decisions**
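
For the delimited text formats above (CSV, TSV, TXT), the column delimiter is not always known in advance. One option, sketched here with Python's standard `csv.Sniffer` rather than the first-line check the pipeline class uses, is to guess it from the header line. The helper name `sniff_delimiter`, the candidate set and the comma fallback are assumptions made for illustration:

```python
import csv

def sniff_delimiter(first_line, candidates="\t,;|"):
    """Guess the delimiter of a header line (hypothetical helper).

    Falls back to a comma when csv.Sniffer cannot decide, for example
    when the line holds a single column.
    """
    try:
        return csv.Sniffer().sniff(first_line, delimiters=candidates).delimiter
    except csv.Error:
        return ","
```

The returned character could then be passed to `pd.read_csv(file_path, delimiter=...)`.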

## Setup and Dependencies

In [88]:
# Import required libraries
import os
import json
import pandas as pd
import hashlib
from uuid import uuid4
from dotenv import load_dotenv

# Vector store and embeddings
from qdrant_client import QdrantClient, models
from qdrant_client.models import PointStruct, PayloadSchemaType
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from openai import OpenAI

# Load environment variables
load_dotenv(r'C:\Users\axel.grille\Documents\rules-engine-agent\Agent\.env')
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

print("✅ Dependencies loaded successfully")

✅ Dependencies loaded successfully


## Initialize Vector Store and Clients

In [89]:
# Initialize OpenAI and Qdrant clients
client = OpenAI()
encoder = OpenAIEmbeddings()

# Initialize Qdrant client
qdrant_client = QdrantClient(
    url="https://456cac0f-558d-40b4-ab89-e103423d7d7e.eu-central-1-0.aws.cloud.qdrant.io:6333", 
    api_key="eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2Nlc3MiOiJtIiwiZXhwIjoxODE3MzkwMzg0fQ.Hi3c7w5CVjn_Xdpvh4Z9HEEYPbQXXPznxuXKyEEVTLg",
)

print("✅ Clients initialized")
print(f"📊 Available collections: {[col.name for col in qdrant_client.get_collections().collections]}")

✅ Clients initialized
📊 Available collections: ['maxo_vector_store_v2', 'maxo_vector_store']
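
The cell above passes the Qdrant URL and API key as literals. A minimal sketch of reading them from the environment instead, in the same way `OPENAI_API_KEY` is loaded from `.env`; the variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helper `qdrant_settings` are assumptions, not names defined by this notebook:

```python
import os

def qdrant_settings(env=os.environ):
    """Collect Qdrant connection kwargs from a mapping (hypothetical helper).

    QDRANT_URL and QDRANT_API_KEY are assumed variable names.
    """
    missing = [k for k in ("QDRANT_URL", "QDRANT_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"url": env["QDRANT_URL"], "api_key": env["QDRANT_API_KEY"]}

# Usage (needs qdrant-client installed and both variables set):
# qdrant_client = QdrantClient(**qdrant_settings())
```

Failing early with a named variable tends to be easier to diagnose than an authentication error on the first request.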


## Database Setup (Run Once)

In [90]:
# DICO API call to get database schema
import requests

def efficy_api_call():
    """Fetch database schema from Efficy API"""
    session = requests.Session()
    
    try:
        # Login
        login_response = session.post(
            "https://sandbox-5.efficytest.cloud/crm/logon",
            headers={
                'X-Efficy-Customer': 'SANDBOX05',
                'X-Requested-By': 'User',
                'X-Requested-With': 'XMLHttpRequest',
                'Content-Type': 'application/x-www-form-urlencoded'
            },
            data='user=paul&password=Eff1cyDemo!'
        )
        
        if login_response.status_code == 200:
            print("✅ Login successful")
            
            # DICO request
            dico_response = session.get(
                "https://sandbox-5.efficytest.cloud/crm/system/dico",
                headers={
                    'X-Requested-By': 'User',
                    'X-Requested-With': 'XMLHttpRequest'
                }
            )
            
            if dico_response.status_code == 200:
                print("✅ DICO data retrieved")
                return dico_response.json()
            else:
                print(f"❌ DICO request failed: {dico_response.status_code}")
                
        else:
            print(f"❌ Login failed: {login_response.status_code}")
            
    except Exception as e:
        print(f"❌ Request error: {e}")
    
    return None

# Execute the API call (comment out if dico_data already exists)
dico_data = efficy_api_call()
if dico_data:
    print(f"✅ Retrieved {len(dico_data['data']['tables'])} tables from database schema")

✅ Login successful
✅ DICO data retrieved
✅ Retrieved 409 tables from database schema
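
`efficy_api_call()` returns `None` when the login or the DICO request fails, and later cells index into `dico_data['data']['tables']`. A small guard, sketched under the assumption that the payload always carries its tables under `data.tables` as in the cell above (the helper name `table_count` is hypothetical), turns a missing or malformed payload into one clear error:

```python
def table_count(dico_data):
    """Return the number of tables in a DICO payload (hypothetical helper).

    Raises ValueError when the payload is None or lacks data.tables.
    """
    try:
        tables = dico_data["data"]["tables"]
    except (TypeError, KeyError) as exc:
        raise ValueError("DICO payload is missing data.tables") from exc
    return len(tables)
```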


In [91]:
# Create stable ID function
def stable_id(*parts, length=32): 
    base = '|'.join(str(p) for p in parts)
    return hashlib.sha256(base.encode()).hexdigest()[:length]

# Generate table chunks for vector store
from chunk_generator import generate_table_ingestion_chunks

if 'table_chunks' not in locals():
    table_chunks = generate_table_ingestion_chunks(dico_data)

print(f"✅ Generated {len(table_chunks)} table chunks")

# Create vector store collection
collection_name = "maxo_vector_store_v2"

existing_collections = [col.name for col in qdrant_client.get_collections().collections]
if collection_name not in existing_collections: 
    qdrant_client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=len(encoder.embed_query("Hello world")),
            distance=models.Distance.COSINE,
        ),
    )
    print(f"✅ Created collection: {collection_name}")
else:
    print(f"✅ Using existing collection: {collection_name}")

✅ Generated 286 table chunks
✅ Using existing collection: maxo_vector_store_v2
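
`stable_id` derives the point ID from chunk metadata, so re-running the notebook overwrites existing points instead of adding duplicates. Qdrant accepts unsigned integers or UUIDs as point IDs, and a 32-character hex digest is a valid UUID in its unhyphenated form. The sketch below repeats `stable_id` to stay self-contained and adds a hypothetical `stable_uuid` that renders the same digest in hyphenated form; the argument values are made up:

```python
import hashlib
import uuid

def stable_id(*parts, length=32):
    """Deterministic hex ID built from the given parts (as in the notebook)."""
    base = "|".join(str(p) for p in parts)
    return hashlib.sha256(base.encode()).hexdigest()[:length]

def stable_uuid(*parts):
    """Same 32-hex digest, rendered as a hyphenated UUID string (hypothetical)."""
    return str(uuid.UUID(hex=stable_id(*parts, length=32)))

# Illustrative values only; real ones come from chunk.metadata.
first = stable_id("table_ingestion", "Contacts", "CONT")
second = stable_id("table_ingestion", "Contacts", "CONT")
same_inputs_same_id = first == second
```

Note that any `length` other than 32 would no longer parse as a UUID.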


In [92]:
# Create and upload table points to vector store
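table_points = []

# Embed every chunk in one batched call rather than one request per chunk
embeddings = encoder.embed_documents([chunk.page_content for chunk in table_chunks])

for chunk, embedding in zip(table_chunks, embeddings):
    chunk_id = stable_id(
        chunk.metadata['chunk_type'],
        chunk.metadata['primary_table'],
        chunk.metadata['table_code']
    )
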
    point = PointStruct(
        id=chunk_id, 
        vector=embedding,
        payload={
            'content': chunk.page_content,
            'chunk_type': chunk.metadata['chunk_type'],
            'primary_table': chunk.metadata['primary_table'],
            'table_code': chunk.metadata['table_code'],
            'table_kind': chunk.metadata['table_kind'],
            'field_count': chunk.metadata['field_count'],
            'metadata': chunk.metadata
        }
    )
    table_points.append(point)

# Upsert points to Qdrant
try:
    result = qdrant_client.upsert(
        collection_name=collection_name, 
        points=table_points
    )
    print(f"✅ Successfully upserted {len(table_points)} table chunks")
    
    collection_info = qdrant_client.get_collection(collection_name)
    print(f"📊 Collection now contains {collection_info.points_count} total points")
except Exception as e:
    print(f"❌ Error during upsert: {e}")

# Create payload indexes for efficient filtering
try:
    qdrant_client.create_payload_index(
        collection_name=collection_name,
        field_name="chunk_type",
        field_schema=PayloadSchemaType.KEYWORD
    )
    qdrant_client.create_payload_index(
        collection_name=collection_name,
        field_name="primary_table",
        field_schema=PayloadSchemaType.KEYWORD
    )
    print("✅ Payload indexes created")
except Exception as e:
    print(f"⚠️  Index creation (may already exist): {e}")

✅ Successfully upserted 286 table chunks
📊 Collection now contains 286 total points
✅ Payload indexes created
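
The cell above sends all 286 points in a single `upsert` call. For a larger schema it may be safer to send them in smaller batches. A minimal sketch of a batching helper; the name `batched` and the batch size in the usage comment are arbitrary choices, not values taken from this notebook:

```python
from itertools import islice

def batched(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage with the notebook's objects:
# for batch in batched(table_points, 64):
#     qdrant_client.upsert(collection_name=collection_name, points=batch)
```

Because the point IDs are deterministic, a batch that fails can be retried without creating duplicates.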


## Generic File Ingestion RAG Pipeline
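
Domain detection in the class below scores each domain by matching its keywords against lowercased column names: 3 for an exact match, 2.5 for an `_id` keyword found inside a longer name, 1 for any other substring match, with each keyword counted once. A reduced sketch of that scoring, assuming only two domains and short keyword lists chosen for illustration (the names `DOMAIN_PATTERNS`, `score_domains` and `detect_domain` are hypothetical); it returns `general` when nothing matches:

```python
# Two illustrative domains only; the notebook's full pattern table is larger.
DOMAIN_PATTERNS = {
    "leads": ["lead", "prospect", "lead_id", "source"],
    "contacts": ["contact", "first_name", "last_name", "email"],
}

def score_domains(columns):
    """Return {domain: score} using the exact / *_id / substring weights."""
    cols = [c.lower() for c in columns]
    scores = {}
    for domain, keywords in DOMAIN_PATTERNS.items():
        score = 0.0
        for kw in keywords:
            for col in cols:
                if kw not in col:
                    continue
                if kw == col:
                    score += 3      # exact column-name match
                elif kw.endswith("_id"):
                    score += 2.5    # ID keyword inside a longer name
                else:
                    score += 1      # plain substring match
                break               # count each keyword at most once
        scores[domain] = score
    return scores

def detect_domain(columns):
    """Return (domain, confidence), with confidence capped at 1.0."""
    scores = score_domains(columns)
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "general", 0.0
    return best, min(scores[best] / 5.0, 1.0)
```

With columns `lead_id`, `source` and `email`, `leads` scores 7 (1 + 3 + 3) and `contacts` scores 3, so `detect_domain` returns `('leads', 1.0)` once the confidence is capped.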

In [None]:
class GenericFileIngestionRAGPipeline:
    """
    Generic RAG pipeline for analyzing any data file and finding the top 5 best
    database tables for data ingestion using semantic search and LLM analysis.
    """
    
    def __init__(self, qdrant_client, encoder, collection_name="maxo_vector_store_v2"):
        self.qdrant_client = qdrant_client
        self.encoder = encoder
        self.collection_name = collection_name
        self.llm_client = OpenAI()
        self.supported_formats = ['.csv', '.xlsx', '.xls', '.json', '.txt', '.tsv']
    
    def analyze_file_structure(self, file_path):
        """Analyze any supported file structure and content"""
        try:
            if not os.path.exists(file_path):
                return {'error': f'File not found: {file_path}'}
            
            file_extension = os.path.splitext(file_path)[1].lower()
            file_name = os.path.basename(file_path)
            
            if file_extension not in self.supported_formats:
                return {'error': f'Unsupported file format: {file_extension}'}
            
            # Handle different file types
            if file_extension == '.csv':
                return self._analyze_csv(file_path, file_name)
            elif file_extension in ['.xlsx', '.xls']:
                return self._analyze_excel(file_path, file_name)
            elif file_extension == '.json':
                return self._analyze_json(file_path, file_name)
            elif file_extension in ['.txt', '.tsv']:
                return self._analyze_text(file_path, file_name)
            
        except Exception as e:
            return {'error': f'Failed to analyze file: {str(e)}'}
    
    def _analyze_csv(self, file_path, file_name):
        """Analyze CSV files"""
        df = pd.read_csv(file_path)
        return self._create_file_analysis(df, file_name, 'CSV')
    
    def _analyze_excel(self, file_path, file_name):
        """Analyze Excel files"""
        df = pd.read_excel(file_path)
        return self._create_file_analysis(df, file_name, 'Excel')
    
    def _analyze_json(self, file_path, file_name):
        """Analyze JSON files"""
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        if isinstance(data, list) and len(data) > 0 and isinstance(data[0], dict):
            df = pd.DataFrame(data)
            return self._create_file_analysis(df, file_name, 'JSON Array')
        elif isinstance(data, dict):
            df = pd.DataFrame([data])
            return self._create_file_analysis(df, file_name, 'JSON Object')
        else:
            return {'error': 'JSON structure not suitable for tabular analysis'}
    
    def _analyze_text(self, file_path, file_name):
        """Analyze text/TSV files"""
        with open(file_path, 'r', encoding='utf-8') as f:
            first_line = f.readline()
        
        delimiter = '\t' if '\t' in first_line else ',' if ',' in first_line else ';'
        df = pd.read_csv(file_path, delimiter=delimiter)
        return self._create_file_analysis(df, file_name, 'Text/TSV')
    
    def _create_file_analysis(self, df, file_name, file_type):
        """Create standardized file analysis from DataFrame"""
        file_info = {
            'file_name': file_name,
            'file_type': file_type,
            'total_rows': len(df),
            'total_columns': len(df.columns),
            'columns': df.columns.tolist(),
            'sample_data': df.head(2).to_dict('records') if len(df) > 0 else []
        }
        
        # Analyze column types and content
        column_analysis = {}
        for col in df.columns:
            column_analysis[col] = {
                'dtype': str(df[col].dtype),
                # Cast NumPy integers to int so the analysis stays JSON-serializable
                'non_null_count': int(df[col].notna().sum()),
                'null_count': int(df[col].isna().sum()),
                'unique_values': int(df[col].nunique()),
                'sample_values': df[col].dropna().head(3).tolist()
            }
        
        file_info['column_analysis'] = column_analysis
        return file_info
    
    def _infer_data_domain(self, columns):
        """Enhanced data domain inference with comprehensive business entity detection"""
        columns_lower = [col.lower() for col in columns]
        
        # Comprehensive domain detection patterns
        domain_patterns = {
            # Core CRM entities
            'leads': [
                'lead', 'prospect', 'lead_status', 'source', 'qualification', 'score', 'conversion',
                'lead_id', 'prospect_id', 'qualified', 'unqualified', 'mql', 'sql', 'nurture'
            ],
            'opportunities': [
                'opportunity', 'deal', 'pipeline', 'stage', 'probability', 'close_date', 'forecast',
                'opp_id', 'deal_id', 'sales_stage', 'win_probability', 'expected_revenue', 'deal_value'
            ],
            'contacts': [
                'contact', 'person', 'individual', 'first_name', 'last_name', 'title', 'relationship',
                'contact_id', 'person_id', 'full_name', 'job_title', 'phone', 'mobile', 'email'
            ],
            'companies': [
                'company', 'organization', 'enterprise', 'business', 'industry', 'sector', 'headquarters',
                'company_id', 'org_id', 'business_name', 'company_name', 'industry_type', 'company_size'
            ],
            'activities': [
                'activity', 'action', 'event', 'log', 'history', 'timeline', 'interaction',
                'activity_id', 'event_id', 'action_type', 'activity_type', 'interaction_type', 'follow_up'
            ],
            'meetings': [
                'meeting', 'appointment', 'schedule', 'calendar', 'attendee', 'agenda', 'duration',
                'meeting_id', 'appointment_id', 'scheduled', 'start_time', 'end_time', 'location'
            ],
            'campaigns': [
                'campaign', 'marketing', 'promotion', 'advertising', 'channel', 'target', 'response',
                'campaign_id', 'promo_id', 'marketing_campaign', 'campaign_name', 'campaign_type'
            ],
            'tickets': [
                'ticket', 'issue', 'support', 'incident', 'priority', 'resolution', 'escalation',
                'ticket_id', 'issue_id', 'support_ticket', 'incident_id', 'case_id', 'help_desk'
            ],
            'users': [
                'user', 'username', 'login', 'profile', 'role', 'permission', 'access',
                'user_id', 'account', 'user_name', 'login_name', 'user_role', 'access_level'
            ],
            
            # Extended business domains
            'communication': [
                'message', 'email', 'mail', 'subject', 'sender', 'recipient', 'date',
                'corps du message', 'mail expéditeur', 'mail destinataire', 'objet', 'visite_mail'
            ],
            'sales_orders': ['order', 'product', 'price', 'quantity', 'total', 'invoice', 'payment'],
            'financial': ['amount', 'cost', 'revenue', 'budget', 'transaction', 'account', 'currency'],
            'hr_employee': ['employee', 'staff', 'salary', 'department', 'position', 'hire', 'manager'],
            'inventory': ['item', 'stock', 'warehouse', 'supplier', 'category', 'sku', 'unit'],
            'project': ['project', 'task', 'milestone', 'deadline', 'status', 'resource', 'team'],
            'logistics': ['shipment', 'delivery', 'tracking', 'carrier', 'destination', 'weight']
        }
        
        # Calculate domain scores with weighted importance
        domain_scores = {}
        for domain, keywords in domain_patterns.items():
            score = 0
            matched_keywords = []
            
            for keyword in keywords:
                for col in columns_lower:
                    if keyword in col:
                        # Weight exact matches higher
                        if keyword == col:
                            score += 3
                        # Weight ID fields higher (strong indicators)
                        elif keyword.endswith('_id') and keyword in col:
                            score += 2.5
                        # Regular substring matches
                        else:
                            score += 1
                        matched_keywords.append(keyword)
                        break
            
            domain_scores[domain] = {
                'score': score,
                'matched_keywords': list(set(matched_keywords))
            }
        
        # Get best matching domain; fall back to 'general' when no keyword matched
        best_domain = max(domain_scores, key=lambda x: domain_scores[x]['score'])
        best_score = domain_scores[best_domain]['score']
        if best_score == 0:
            best_domain = 'general'
        
        # Enhanced domain mapping with confidence indicators
        domain_mapping = {
            'leads': {
                'primary_domain': 'lead management and prospecting', 
                'business_area': 'sales lead generation', 
                'data_category': 'leads',
                'table_hints': ['lead', 'prospect', 'qualification']
            },
            'opportunities': {
                'primary_domain': 'sales opportunity tracking', 
                'business_area': 'sales pipeline management', 
                'data_category': 'opportunities',
                'table_hints': ['opportunity', 'deal', 'sales_pipeline']
            },
            'contacts': {
                'primary_domain': 'contact and person management', 
                'business_area': 'relationship management', 
                'data_category': 'contacts',
                'table_hints': ['contact', 'person', 'individual']
            },
            'companies': {
                'primary_domain': 'company and organization management', 
                'business_area': 'corporate data management', 
                'data_category': 'companies',
                'table_hints': ['company', 'organization', 'enterprise']
            },
            'activities': {
                'primary_domain': 'activity and event tracking', 
                'business_area': 'interaction management', 
                'data_category': 'activities',
                'table_hints': ['activity', 'event', 'action', 'visit']
            },
            'meetings': {
                'primary_domain': 'meeting and calendar management', 
                'business_area': 'scheduling and appointments', 
                'data_category': 'meetings',
                'table_hints': ['meeting', 'appointment', 'schedule']
            },
            'campaigns': {
                'primary_domain': 'marketing campaign management', 
                'business_area': 'marketing operations', 
                'data_category': 'campaigns',
                'table_hints': ['campaign', 'marketing', 'promotion']
            },
            'tickets': {
                'primary_domain': 'ticketing and support management', 
                'business_area': 'customer support', 
                'data_category': 'tickets',
                'table_hints': ['ticket', 'support', 'incident']
            },
            'users': {
                'primary_domain': 'user and account management', 
                'business_area': 'system administration', 
                'data_category': 'users',
                'table_hints': ['user', 'account', 'profile']
            },
            'communication': {
                'primary_domain': 'communication and messaging', 
                'business_area': 'correspondence', 
                'data_category': 'communication',
                'table_hints': ['mail', 'email', 'message', 'visit']
            },
            'sales_orders': {'primary_domain': 'sales and order management', 'business_area': 'sales operations', 'data_category': 'transactional', 'table_hints': ['order', 'sale', 'invoice']},
            'financial': {'primary_domain': 'financial and accounting', 'business_area': 'finance', 'data_category': 'financial', 'table_hints': ['financial', 'accounting', 'budget']},
            'hr_employee': {'primary_domain': 'human resources', 'business_area': 'HR management', 'data_category': 'employee', 'table_hints': ['employee', 'staff', 'hr']},
            'inventory': {'primary_domain': 'inventory and stock management', 'business_area': 'supply chain', 'data_category': 'inventory', 'table_hints': ['inventory', 'stock', 'product']},
            'project': {'primary_domain': 'project management', 'business_area': 'project operations', 'data_category': 'project', 'table_hints': ['project', 'task', 'milestone']},
            'logistics': {'primary_domain': 'logistics and shipping', 'business_area': 'operations', 'data_category': 'logistics', 'table_hints': ['shipment', 'delivery', 'logistics']},
            'general': {'primary_domain': 'business data', 'business_area': 'general operations', 'data_category': 'business', 'table_hints': ['data', 'general']}
        }
        
        result = domain_mapping.get(best_domain, domain_mapping['general'])
        
        # Add detection metadata for debugging and confidence assessment
        result['detection_confidence'] = min(best_score / 5.0, 1.0)  # Normalize to 0-1
        result['matched_keywords'] = domain_scores.get(best_domain, {}).get('matched_keywords', [])
        result['all_scores'] = {k: v['score'] for k, v in domain_scores.items() if v['score'] > 0}
        
        return result

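The selection and normalization steps above can be tried in isolation. The following is a minimal sketch, not the pipeline's own method: `pick_domain` and the sample scores are hypothetical, the `saturation` parameter simply mirrors the `best_score / 5.0` normalization, and treating an all-zero score table as `'general'` is an assumption of this sketch.

```python
from typing import Dict, Tuple


def pick_domain(domain_scores: Dict[str, int], saturation: float = 5.0) -> Tuple[str, float]:
    """Return (domain, confidence) for a mapping of domain -> keyword score.

    Hypothetical helper that mirrors min(score / 5.0, 1.0) from the cell above.
    """
    if not domain_scores:
        return "general", 0.0
    best = max(domain_scores, key=domain_scores.get)
    score = domain_scores[best]
    if score <= 0:
        # Assumption: with no keyword match, no domain should win by position alone
        return "general", 0.0
    # Scores at or above `saturation` are capped at a confidence of 1.0
    return best, min(score / saturation, 1.0)


# Made-up sample scores, for illustration only
print(pick_domain({"leads": 0, "communication": 7, "contacts": 2}))
print(pick_domain({"leads": 0, "contacts": 0}))
```

With these sample inputs the first call selects `communication` at full confidence, and the second falls back to `general` with a confidence of 0.
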
In [None]:
# Create and upload table points to vector store
table_points = []

for chunk in table_chunks: 
    chunk_id = stable_id(
        chunk.metadata['chunk_type'],
        chunk.metadata['primary_table'],
        chunk.metadata['table_code']
    )

    embedding = encoder.embed_query(chunk.page_content)

    point = PointStruct(
        id=chunk_id, 
        vector=embedding,
        payload={
            'content': chunk.page_content,
            'chunk_type': chunk.metadata['chunk_type'],
            'primary_table': chunk.metadata['primary_table'],
            'table_code': chunk.metadata['table_code'],
            'table_kind': chunk.metadata['table_kind'],
            'field_count': chunk.metadata['field_count'],
            'metadata': chunk.metadata
        }
    )
    table_points.append(point)

# Upsert points to Qdrant
try:
    result = qdrant_client.upsert(
        collection_name=collection_name, 
        points=table_points
    )
    print(f"✅ Successfully upserted {len(table_points)} table chunks")
    
    collection_info = qdrant_client.get_collection(collection_name)
    print(f"📊 Collection now contains {collection_info.points_count} total points")
except Exception as e:
    print(f"❌ Error during upsert: {e}")

# Create payload indexes for efficient filtering
try:
    qdrant_client.create_payload_index(
        collection_name=collection_name,
        field_name="chunk_type",
        field_schema=PayloadSchemaType.KEYWORD
    )
    qdrant_client.create_payload_index(
        collection_name=collection_name,
        field_name="primary_table",
        field_schema=PayloadSchemaType.KEYWORD
    )
    print("✅ Payload indexes created")
except Exception as e:
    print(f"⚠️  Index creation skipped (indexes may already exist): {e}")

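`stable_id` is defined elsewhere in the notebook and is not shown in this cell. As a rough illustration of why deterministic IDs matter for `upsert` (re-running the cell overwrites existing points instead of duplicating them), here is a minimal sketch. `make_point_id` and `batched` are hypothetical names, the `uuid5`-based scheme is an assumption and not necessarily what `stable_id` does, and the sample values and batch sizes are arbitrary.

```python
import uuid
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_point_id(*parts: str) -> str:
    """Derive a repeatable UUID string from the identifying parts of a chunk."""
    key = "|".join(str(p) for p in parts)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


def batched(items: Sequence[T], size: int = 64) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Same inputs always give the same ID, so a second upsert replaces the first.
# "table", "Mail", "MAIL" are made-up sample values.
first = make_point_id("table", "Mail", "MAIL")
second = make_point_id("table", "Mail", "MAIL")
print(first == second)

# Ten items in batches of four give batch sizes 4, 4, 2
print([len(b) for b in batched(list(range(10)), size=4)])
```

Each batch could then be passed to `qdrant_client.upsert(collection_name=collection_name, points=batch)`, which keeps individual requests small when the schema contains many tables.
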
## Usage Example

In [94]:
# Initialize the pipeline
pipeline = GenericFileIngestionRAGPipeline(qdrant_client, encoder, "maxo_vector_store_v2")

# Analyze a file (example with Mails.csv)
file_path = r"C:\Users\axel.grille\Documents\rules-engine-agent\Mails.csv"
user_context = "email correspondence and customer communication tracking"

# Run the complete pipeline
results = pipeline.run_complete_pipeline(file_path, user_context)

# Display results
pipeline.display_results_summary(results)

# Export for SQL agent
if 'error' not in results:
    export_result = pipeline.export_for_sql_agent(results)
    if 'success' in export_result:
        print(f"✅ Ready for SQL Agent: {export_result['output_file']}")

=== GENERIC FILE INGESTION RAG PIPELINE ===
📁 Analyzing: Mails.csv

Step 1: Analyzing file structure...
✓ CSV file: 25 columns, 5 rows
✓ Detected domain: communication and messaging

Step 2: Generating semantic search queries...
🔍 Enhanced with 5 email-specific queries (confidence: 1.00)
✓ Generated 11 queries for database search

Step 3: Searching for relevant database tables...
✓ Found 43 unique tables across all queries

Step 4: Ranking tables by relevance...
✓ Ranked 43 tables by composite relevance score

🎯 INGESTION ANALYSIS: Mails.csv
📊 File: CSV | 5 rows | 25 columns
🏷️  Domain: communication and messaging
🎯 Best Table: Mail
🔍 Confidence: High
⚠️  Review Required: No
🤖 SQL Agent Ready: Yes

📋 TOP 10 DATABASE TABLES:
1. Mail (Entity)
   Score: 2.032 | Fields: 31 | Matches: 7
2. Contact (Entity)
   Score: 1.996 | Fields: 70 | Matches: 7
3. Sms (Entity)
   Score: 1.815 | Fields: 9 | Matches: 6
4. Docu_Camp (Relation)
   Score: 1.802 | Fields: 5 | Matches: 6
5. Docu_Docu (Relation)