## 🔧 Code Quality: Single Source of Truth

This notebook follows the **Single Source of Truth** principle for all configuration values:

### 📍 **Configuration Variables** (Defined Once, Used Everywhere)

- `COLLECTION_NAME = "seatbelt_comments"` - Vector database collection name
- `PERSIST_DIRECTORY = os.path.join(os.getcwd(), "chroma_db")` - Database storage location  
- `SIMILARITY_THRESHOLD = 0.7` - Default similarity threshold for searches
- `embedding_model` - OpenAI embedding model from `.env` file
- `llm_model` - LLM model for validation from `.env` file

### ✅ **Benefits of This Approach**

1. **Consistency**: All functions use the same configuration values
2. **Maintainability**: Change values in one place to update everywhere
3. **No Hardcoding**: Print statements and function calls reference variables
4. **Flexibility**: Easy to modify thresholds and paths for different use cases

### 🚫 **What We Avoided**

- ❌ Hardcoded `"chroma_db"` strings in print statements
- ❌ Hardcoded `0.7`, `0.6`, `0.5` threshold values in functions
- ❌ Hardcoded `"seatbelt_comments"` collection names
- ❌ Duplicate configuration values scattered throughout code

This ensures reliable, maintainable code that's easy to configure and modify.

---

# Semantic Search Demo with Excel Data

This notebook will:
1. Read and understand the Excel file structure
2. Prepare the data for semantic search
3. Create a vector database
4. Demonstrate semantic search capabilities

In [19]:
# Import required libraries
import pandas as pd
import os
from pathlib import Path
import numpy as np

# Display pandas dataframes in a more readable format
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 1000)

In [20]:
# Read the Excel file
file_path = "Lm_orders_seatbelt_v2.xlsx"
df = pd.read_excel(file_path)

# Display basic information about the dataset
print(f"File: {file_path}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("\nColumn names:")
for col in df.columns:
    print(f"- {col}")

# Show the first few rows
print("\nFirst 5 rows of data:")
df.head()

File: Lm_orders_seatbelt_v2.xlsx
Number of rows: 237
Number of columns: 11

Column names:
- orderid
- Orderstatus
- Completedate
- Compcode
- Compcodedescrip
- Sectioncomments
- Linetype
- Hours
- Qty
- Unitcst
- linetotal

First 5 rows of data:


Unnamed: 0,orderid,Orderstatus,Completedate,Compcode,Compcodedescrip,Sectioncomments,Linetype,Hours,Qty,Unitcst,linetotal
0,12383330,REPAIR,2025-03-05 16:23:00,002-011-003,Seat Belt & Retainer Assembly ...,<Complaint>Seatbelt</Complaint><Cause>Check va...,LABOR,1.0,0,110.0,110.0
1,12383330,REPAIR,2025-03-05 16:23:00,002-011-003,Seat Belt & Retainer Assembly ...,<Complaint>Seatbelt</Complaint><Cause>Check va...,PART,0.0,1,369.7,369.7
2,12383330,REPAIR,2025-03-05 16:23:00,002-011-003,Seat Belt & Retainer Assembly ...,<Complaint>Seatbelt</Complaint><Cause>Check va...,PART,0.0,1,142.1,142.1
3,12482815,REPAIR,2025-03-05 04:17:00,002-011-003,Seat Belt & Retainer Assembly ...,<Complaint>Seat belt buckle open circuit </Com...,LABOR,1.0,0,110.0,110.0
4,12482815,REPAIR,2025-03-05 04:17:00,002-011-003,Seat Belt & Retainer Assembly ...,<Complaint>Seat belt buckle open circuit </Com...,PART,0.0,1,134.85,134.85


## Data Exploration and Preparation

Before creating our vector database, let's analyze the "Sectioncomments" column that we'll be indexing to better understand our data.

In [21]:
# Explore the "Sectioncomments" column that we'll be indexing
if "Sectioncomments" in df.columns:
    # Check for missing values
    missing_comments = df["Sectioncomments"].isna().sum()
    print(f"Number of missing values in Sectioncomments: {missing_comments} ({missing_comments/len(df)*100:.2f}%)")
    
    # Get statistics on comment length
    df["comment_length"] = df["Sectioncomments"].astype(str).apply(len)
    print(f"\nComment length statistics:")
    print(f"Min: {df['comment_length'].min()} characters")
    print(f"Max: {df['comment_length'].max()} characters")
    print(f"Mean: {df['comment_length'].mean():.2f} characters")
    print(f"Median: {df['comment_length'].median()} characters")
    
    # Display some sample comments of different lengths
    print("\nSample comments:")
    
    # Short comment (if available)
    short_comments = df[df["comment_length"] < 50].sort_values("comment_length")
    if not short_comments.empty:
        idx = short_comments.index[0]
        print(f"\nShort comment ({df.loc[idx, 'comment_length']} chars):")
        print(df.loc[idx, "Sectioncomments"])
    
    # Medium comment
    med_comments = df[(df["comment_length"] >= 50) & (df["comment_length"] < 200)]
    if not med_comments.empty:
        idx = med_comments.index[0]
        print(f"\nMedium comment ({df.loc[idx, 'comment_length']} chars):")
        print(df.loc[idx, "Sectioncomments"])
    
    # Long comment
    long_comments = df[df["comment_length"] >= 200].sort_values("comment_length", ascending=False)
    if not long_comments.empty:
        idx = long_comments.index[0]
        print(f"\nLong comment ({df.loc[idx, 'comment_length']} chars):")
        print(df.loc[idx, "Sectioncomments"])
else:
    print("Column 'Sectioncomments' not found in the dataset. Available columns:")
    print(df.columns.tolist())

Number of missing values in Sectioncomments: 0 (0.00%)

Comment length statistics:
Min: 159 characters
Max: 1696 characters
Mean: 401.57 characters
Median: 310.0 characters

Sample comments:

Medium comment (182 chars):
<Correction>Replace seatbelt buckle</Correction><Notes></Notes><Cause>Casing missing off of buckle</Cause><Complaint>Seatbelt buckle broken</Complaint><CauseOfRepair></CauseOfRepair>

Long comment (1696 chars):


## Creating a Vector Database

Now we'll create a vector database using:
1. **ChromaDB** as our vector store
2. **OpenAI Embeddings** to convert text into vector embeddings
3. **LangChain** to tie everything together

We'll index only the "Sectioncomments" column, while keeping the other columns as metadata for retrieval.

In [22]:
# Load environment variables from .env file
import os
from dotenv import load_dotenv

# Load the .env file
load_dotenv()

# Check if the OpenAI API key is set
openai_api_key = os.getenv("OPENAI_API_KEY")
embedding_model = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")

if openai_api_key == "your_openai_api_key_here" or not openai_api_key:
    print("⚠️ WARNING: You need to set your OpenAI API key in the .env file!")
    print("Please update the .env file with your actual API key.")
else:
    print(f"✅ OpenAI API key is set")
    print(f"✅ Using embedding model: {embedding_model}")

✅ OpenAI API key is set
✅ Using embedding model: text-embedding-3-small


## Vector Database Structure Verification

Let's verify the structure of our vector database to confirm it's correctly storing embeddings for "Sectioncomments" only, with all other columns as metadata.

In [23]:
# Create a visual diagram explaining the vector database structure
from IPython.display import display, Markdown

explanation = """
## Vector Database Structure Explanation

```
┌─────────────────────────────────────────────────────────────┐
│                      Vector Database                         │
└─────────────────────────────────────────────────────────────┘
               │                         │
┌──────────────┴──────────────┐ ┌───────┴───────────────────┐
│       Embeddings            │ │        Metadata            │
│ (from "Sectioncomments")    │ │  (all other columns)       │
│                             │ │                            │
│ • Dense vectors             │ │ • Row ID                   │
│ • High-dimensional space    │ │ • Vehicle info             │
│ • Capture semantic meaning  │ │ • Timestamps               │
│                             │ │ • Any other columns        │
└─────────────────────────────┘ └────────────────────────────┘
               │                          │
               └──────────┬──────────────┘
                          │
          ┌───────────────┴───────────────┐
          │       Search Process           │
          │                                │
          │ 1. Query is embedded           │
          │ 2. Find similar embeddings     │
          │ 3. Return matched documents    │
          │    with their metadata         │
          └────────────────────────────────┘
```

**Important points:**
1. Only the "Sectioncomments" text is converted to embeddings
2. All other columns are stored as metadata (not embedded)
3. Searches match against embedded "Sectioncomments" only
4. Search results include both the matching comment and all metadata
"""

display(Markdown(explanation))


## Vector Database Structure Explanation

```
┌─────────────────────────────────────────────────────────────┐
│                      Vector Database                         │
└─────────────────────────────────────────────────────────────┘
               │                         │
┌──────────────┴──────────────┐ ┌───────┴───────────────────┐
│       Embeddings            │ │        Metadata            │
│ (from "Sectioncomments")    │ │  (all other columns)       │
│                             │ │                            │
│ • Dense vectors             │ │ • Row ID                   │
│ • High-dimensional space    │ │ • Vehicle info             │
│ • Capture semantic meaning  │ │ • Timestamps               │
│                             │ │ • Any other columns        │
└─────────────────────────────┘ └────────────────────────────┘
               │                          │
               └──────────┬──────────────┘
                          │
          ┌───────────────┴───────────────┐
          │       Search Process           │
          │                                │
          │ 1. Query is embedded           │
          │ 2. Find similar embeddings     │
          │ 3. Return matched documents    │
          │    with their metadata         │
          └────────────────────────────────┘
```

**Important points:**
1. Only the "Sectioncomments" text is converted to embeddings
2. All other columns are stored as metadata (not embedded)
3. Searches match against embedded "Sectioncomments" only
4. Search results include both the matching comment and all metadata


In [24]:
# Check and install required libraries
import subprocess
import sys

def install_if_missing(package):
    try:
        __import__(package)
        print(f"✅ {package} is already installed")
    except ImportError:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check required libraries
required_libs = [
    'langchain',
    'langchain_chroma', 
    'chromadb',
    'langchain_openai',
    'python-dotenv'
]

for lib in required_libs:
    install_if_missing(lib)

print("\n✅ All required libraries are available!")

✅ langchain is already installed
✅ langchain_chroma is already installed
✅ chromadb is already installed
✅ langchain_openai is already installed
📦 Installing python-dotenv...

✅ All required libraries are available!

✅ All required libraries are available!


In [25]:
# Import required libraries for vector database
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
import tempfile
import shutil
from datetime import datetime
import json

# Configuration
COLLECTION_NAME = "seatbelt_comments"
# Ensure vector database is stored in the current working directory
import os
PERSIST_DIRECTORY = os.path.join(os.getcwd(), "chroma_db")
SIMILARITY_THRESHOLD = 0.7  # Default threshold, can be modified by user

# Initialize embeddings
embeddings = OpenAIEmbeddings(
    model=embedding_model,
    openai_api_key=openai_api_key
)

print(f"✅ Vector database configuration:")
print(f"   - Collection name: {COLLECTION_NAME}")
print(f"   - Persist directory: {PERSIST_DIRECTORY}")
print(f"   - Current working directory: {os.getcwd()}")
print(f"   - Embedding model: {embedding_model}")
print(f"   - Default similarity threshold: {SIMILARITY_THRESHOLD}")

✅ Vector database configuration:
   - Collection name: seatbelt_comments
   - Persist directory: /Users/ahb/CaylentDocs/Projects/AmeritFleetSolutions-AIStrategy/SampleData/chroma_db
   - Current working directory: /Users/ahb/CaylentDocs/Projects/AmeritFleetSolutions-AIStrategy/SampleData
   - Embedding model: text-embedding-3-small
   - Default similarity threshold: 0.7


In [26]:
# Function to create or load vector database
def create_or_load_vectordb():
    """Create a new vector database or load existing one"""
    
    # Use a temporary directory approach to avoid settings conflicts
    import tempfile
    import os
    
    # Try to connect to existing database or create new one
    try:
        # Check if vectordb already exists in current scope
        try:
            # Try to access existing vectordb from globals
            existing_db = globals().get('vectordb')
            if existing_db is not None:
                print("🔄 Using existing vector database from current session")
                collection_count = existing_db._collection.count()
                return existing_db, collection_count == 0
        except:
            pass
        
        # Create new vector database
        print("📝 Creating new vector database...")
        vectordb = Chroma(
            collection_name=COLLECTION_NAME,
            embedding_function=embeddings,
            persist_directory=PERSIST_DIRECTORY
        )
        
        # Check if collection has any documents
        collection_count = vectordb._collection.count()
        
        if collection_count > 0:
            print(f"✅ Loaded existing vector database with {collection_count} documents")
            return vectordb, False  # False indicates it's not empty
        else:
            print(f"📊 Found empty vector database, will populate with data")
            return vectordb, True  # True indicates it's empty
            
    except Exception as e:
        print(f"⚠️ Database connection issue: {e}")
        print("🔄 Using temporary database for this session...")
        
        # Use a temporary directory
        temp_dir = tempfile.mkdtemp()
        vectordb = Chroma(
            collection_name=COLLECTION_NAME,
            embedding_function=embeddings,
            persist_directory=temp_dir
        )
        print(f"📁 Temporary database created at: {temp_dir}")
        return vectordb, True  # True indicates it's empty

# Create or load the database
vectordb, is_empty = create_or_load_vectordb()

🔄 Using existing vector database from current session


In [27]:
# Verify vector database location and recreate if needed
def ensure_vectordb_in_current_directory():
    """Ensure the vector database is stored in the current directory"""
    import os
    
    current_dir = os.getcwd()
    expected_db_path = os.path.join(current_dir, os.path.basename(PERSIST_DIRECTORY))
    
    print(f"🔍 Checking vector database location...")
    print(f"   - Expected path: {expected_db_path}")
    print(f"   - Directory exists: {os.path.exists(expected_db_path)}")
    
    # Check if current vectordb is using the correct directory
    if 'vectordb' in globals():
        try:
            current_persist_dir = getattr(vectordb._client, 'path', None)
            print(f"   - Current vectordb path: {current_persist_dir}")
            
            # If the database is not in the expected location, recreate it
            if current_persist_dir != expected_db_path:
                print(f"⚠️ Vector database is not in the current directory")
                print(f"🔄 Recreating database in current directory...")
                
                # Create new database in correct location
                new_vectordb = Chroma(
                    collection_name=COLLECTION_NAME,
                    embedding_function=embeddings,
                    persist_directory=PERSIST_DIRECTORY
                )
                
                # If old database had data, we'll need to re-populate
                old_count = vectordb._collection.count()
                new_count = new_vectordb._collection.count()
                
                if old_count > 0 and new_count == 0:
                    print(f"📋 Old database had {old_count} documents, will need re-population")
                    return new_vectordb, True  # Needs population
                else:
                    print(f"✅ Database recreated with {new_count} documents")
                    return new_vectordb, False  # Already has data
            else:
                print(f"✅ Vector database is already in the correct location")
                return vectordb, False
                
        except Exception as e:
            print(f"⚠️ Error checking database location: {e}")
            print(f"🔄 Creating new database in current directory...")
            new_vectordb = Chroma(
                collection_name=COLLECTION_NAME,
                embedding_function=embeddings,
                persist_directory=PERSIST_DIRECTORY
            )
            return new_vectordb, True
    else:
        print(f"📝 Creating new vector database in current directory...")
        new_vectordb = Chroma(
            collection_name=COLLECTION_NAME,
            embedding_function=embeddings,
            persist_directory=PERSIST_DIRECTORY
        )
        return new_vectordb, True

# Ensure database is in current directory
vectordb, needs_population = ensure_vectordb_in_current_directory()

🔍 Checking vector database location...
   - Expected path: /Users/ahb/CaylentDocs/Projects/AmeritFleetSolutions-AIStrategy/SampleData/chroma_db
   - Directory exists: True
   - Current vectordb path: None
⚠️ Vector database is not in the current directory
🔄 Recreating database in current directory...
✅ Database recreated with 237 documents


In [28]:
# Function to add Excel data to vector database
def add_excel_data_to_vectordb(df, vectordb):
    """Add Excel data to vector database, embedding only Sectioncomments"""
    
    # Filter out rows with missing Sectioncomments
    df_filtered = df.dropna(subset=['Sectioncomments']).copy()
    
    # Convert Sectioncomments to string and filter out very short comments
    df_filtered['Sectioncomments'] = df_filtered['Sectioncomments'].astype(str)
    df_filtered = df_filtered[df_filtered['Sectioncomments'].str.len() > 10]  # Minimum 10 characters
    
    print(f"📊 Processing {len(df_filtered)} records with valid comments...")
    
    # Prepare documents
    documents = []
    for idx, row in df_filtered.iterrows():
        # Create metadata from all columns except Sectioncomments
        metadata = {}
        for col in df_filtered.columns:
            if col != 'Sectioncomments':
                value = row[col]
                # Handle different data types for metadata
                if pd.isna(value):
                    metadata[col] = None
                elif isinstance(value, (pd.Timestamp, datetime)):
                    metadata[col] = value.isoformat() if not pd.isna(value) else None
                else:
                    metadata[col] = str(value)
        
        # Add row index for unique identification
        metadata['row_index'] = int(idx)
        
        # Create document with Sectioncomments as content and everything else as metadata
        doc = Document(
            page_content=str(row['Sectioncomments']),
            metadata=metadata
        )
        documents.append(doc)
    
    # Add documents to vector database in batches
    batch_size = 50
    total_added = 0
    
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        vectordb.add_documents(batch)
        total_added += len(batch)
        print(f"✅ Added batch {i//batch_size + 1}: {total_added}/{len(documents)} documents")
    
    print(f"🎉 Successfully added {total_added} documents to vector database!")
    return total_added

# Add data if database is empty or needs population
if is_empty or needs_population:
    print("📝 Adding Excel data to vector database...")
    total_added = add_excel_data_to_vectordb(df, vectordb)
else:
    print("📊 Database already contains data. Skipping data addition.")
    print(f"💡 To force re-indexing, delete the {os.path.basename(PERSIST_DIRECTORY)} folder and run again.")

📊 Database already contains data. Skipping data addition.
💡 To force re-indexing, delete the chroma_db folder and run again.


In [29]:
# Create LCEL pipeline for semantic search with date ranking and similarity threshold
from langchain_core.runnables import RunnableLambda
from datetime import datetime
import re

def parse_date_from_metadata(metadata):
    """Extract and parse date from metadata for sorting"""
    # Look for common date fields
    date_fields = ['date', 'Date', 'created_date', 'timestamp', 'created_at', 'Date Time']
    
    for field in date_fields:
        if field in metadata and metadata[field]:
            try:
                date_str = str(metadata[field])
                # Try parsing ISO format first
                if 'T' in date_str:
                    return datetime.fromisoformat(date_str.replace('Z', '+00:00'))
                # Try other common formats
                for fmt in ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d', '%m/%d/%Y', '%d/%m/%Y']:
                    try:
                        return datetime.strptime(date_str, fmt)
                    except ValueError:
                        continue
            except:
                continue
    
    # If no date found, return a very old date for sorting
    return datetime(1900, 1, 1)

def filter_and_rank_results(results_with_scores, similarity_threshold=SIMILARITY_THRESHOLD):
    """Filter results by similarity threshold and rank by date descending"""
    
    # Filter by similarity threshold
    filtered_results = [
        (doc, score) for doc, score in results_with_scores 
        if score >= similarity_threshold
    ]
    
    if not filtered_results:
        return []
    
    # Sort by date descending (most recent first)
    def get_sort_key(doc_score_tuple):
        doc, score = doc_score_tuple
        date = parse_date_from_metadata(doc.metadata)
        return (date, score)  # Sort by date first, then by score
    
    sorted_results = sorted(filtered_results, key=get_sort_key, reverse=True)
    
    return sorted_results

def semantic_search_with_ranking(query: str, k: int = 10, similarity_threshold: float = SIMILARITY_THRESHOLD):
    """
    Perform semantic search with date ranking and similarity threshold filtering
    
    Args:
        query (str): Search query
        k (int): Number of results to retrieve initially (before filtering)
        similarity_threshold (float): Minimum similarity score (0.0 to 1.0)
    
    Returns:
        list: Filtered and ranked results
    """
    print(f"🔍 Searching for: '{query}'")
    print(f"📊 Similarity threshold: {similarity_threshold}")
    print(f"📅 Results will be ranked by date (newest first)")
    
    # Perform similarity search with scores
    results_with_scores = vectordb.similarity_search_with_score(query, k=k*2)  # Get more to account for filtering
    
    # Filter and rank results
    final_results = filter_and_rank_results(results_with_scores, similarity_threshold)
    
    # Limit to requested number
    final_results = final_results[:k]
    
    print(f"✅ Found {len(final_results)} results above threshold {similarity_threshold}")
    
    return final_results

# Create the LCEL pipeline
search_pipeline = RunnableLambda(semantic_search_with_ranking)

print("🚀 LCEL Pipeline created successfully!")
print("💡 Use semantic_search_with_ranking(query, k, similarity_threshold) to search")

🚀 LCEL Pipeline created successfully!
💡 Use semantic_search_with_ranking(query, k, similarity_threshold) to search


## Demonstration of Semantic Search Functionality

### Complete end-to-end search demonstration

In [30]:
# Complete end-to-end search demonstration with structured table display
print("🎯 Complete Search Demonstration")
print("=" * 50)

# Perform a realistic search
query = "broken seatbelt needs replacement"
print(f"🔍 Searching for: '{query}'")
print(f"📋 This will show results in both original format and structured table format")

# Get results with moderate threshold (slightly lower than default for demo)
demo_threshold = SIMILARITY_THRESHOLD - 0.1
results = semantic_search_with_ranking(query, k=3, similarity_threshold=demo_threshold)

# Display detailed results in original format
if results:
    print(f"\n📊 Found {len(results)} relevant results")
    print("=" * 60)
    print("📋 Results in ORIGINAL DETAILED FORMAT:")
    print("=" * 60)
    
    for i, (doc, score) in enumerate(results, 1):
        print(f"\n🏆 Result #{i} (Similarity Score: {score:.3f})")
        print("-" * 40)
        
        # Show the comment content
        comment = doc.page_content
        print(f"💬 Comment: {comment[:150]}{'...' if len(comment) > 150 else ''}")
        
        # Show relevant metadata
        metadata = doc.metadata
        print(f"📝 Metadata:")
        
        # Show key fields if they exist
        key_fields = ['Compcode', 'Completedate', 'Orderstatus', 'Qty', 'Unitcst']
        for field in key_fields:
            if field in metadata and metadata[field] and str(metadata[field]) != 'nan':
                print(f"   • {field}: {metadata[field]}")
        
        print(f"   • Comment Length: {metadata.get('comment_length', 'N/A')} characters")
        print(f"   • Row Index: {metadata.get('row_index', 'N/A')}")
    
    # Now display same results in structured table format using pandas
    print(f"\n" + "="*80)
    print("📊 Same results in STRUCTURED TABLE FORMAT:")
    print("=" * 60)
    
    # Create a simple table using pandas for basic semantic search results
    import pandas as pd
    from IPython.display import display
    
    table_data = []
    for i, (doc, score) in enumerate(results, 1):
        metadata = doc.metadata
        row = {
            'Result #': i,
            'Similarity Score': f"{score:.3f}",
            'Comment Preview': doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content,
            'Date': metadata.get('Completedate', 'N/A'),
            'Status': metadata.get('Orderstatus', 'N/A'),
            'Cost': metadata.get('Unitcst', 'N/A'),
            'Code': metadata.get('Compcode', 'N/A'),
            'Comment Length': metadata.get('comment_length', 'N/A')
        }
        table_data.append(row)
    
    # Create and display DataFrame
    df_results = pd.DataFrame(table_data)
    display(df_results)
    
    # Also show detailed view for each result
    print(f"\n📋 Detailed Results with All Metadata:")
    print("=" * 60)
    
    for i, (doc, score) in enumerate(results, 1):
        metadata = doc.metadata
        
        print(f"\n📄 Result #{i} Details:")
        print("-" * 40)
        print(f"🎯 Similarity Score: {score:.3f}")
        
        # Full comment
        print(f"\n💬 Full Comment:")
        print(f"   {doc.page_content}")
        
        # Complete metadata
        print(f"\n📝 Complete Metadata:")
        for key, value in metadata.items():
            if value is not None and str(value) != 'nan':
                clean_key = key.replace('_', ' ').title()
                print(f"   • {clean_key}: {value}")
        
        print("-" * 40)
        
else:
    print("❌ No results found. Try lowering the similarity threshold.")

print("\n" + "=" * 80)
print("✅ End-to-end search demonstration complete!")
print("💡 The system successfully:")
print("   - Embedded the query using OpenAI")
print("   - Searched the vector database")
print("   - Applied similarity threshold filtering")
print("   - Ranked results appropriately")
print("   - Returned complete records with metadata")
print("   - Displayed results in both original and table formats")

🎯 Complete Search Demonstration
🔍 Searching for: 'broken seatbelt needs replacement'
📋 This will show results in both original format and structured table format
🔍 Searching for: 'broken seatbelt needs replacement'
📊 Similarity threshold: 0.6
📅 Results will be ranked by date (newest first)
✅ Found 3 results above threshold 0.6

📊 Found 3 relevant results
📋 Results in ORIGINAL DETAILED FORMAT:

🏆 Result #1 (Similarity Score: 0.654)
----------------------------------------
💬 Comment: <Correction>Replaced Seat belt</Correction><Notes></Notes><Cause>Seat belt is frayed and needs replaced</Cause><Complaint>Mechanic Inspection</Complai...
📝 Metadata:
   • Compcode: 002-011-003 
   • Completedate: 2025-06-20T10:12:00
   • Orderstatus: REPAIR      
   • Qty: 1
   • Unitcst: 291.95
   • Comment Length: 184 characters
   • Row Index: 218

🏆 Result #2 (Similarity Score: 0.654)
----------------------------------------
💬 Comment: <Correction>Replaced Seat belt</Correction><Notes></Notes><Cause>Seat

Unnamed: 0,Result #,Similarity Score,Comment Preview,Date,Status,Cost,Code,Comment Length
0,1,0.654,<Correction>Replaced Seat belt</Correction><No...,2025-06-20T10:12:00,REPAIR,291.95,002-011-003,184
1,2,0.654,<Correction>Replaced Seat belt</Correction><No...,2025-06-20T10:12:00,REPAIR,110.0,002-011-003,184
2,3,0.649,<Correction>Replaced Seat Belt</Correction><No...,2025-05-07T10:28:00,REPAIR,110.0,002-011-003,184



📋 Detailed Results with All Metadata:

📄 Result #1 Details:
----------------------------------------
🎯 Similarity Score: 0.654

💬 Full Comment:
   <Correction>Replaced Seat belt</Correction><Notes></Notes><Cause>Seat belt is frayed and needs replaced</Cause><Complaint>Mechanic Inspection</Complaint><CauseOfRepair></CauseOfRepair>

📝 Complete Metadata:
   • Compcode: 002-011-003 
   • Hours: 0.0
   • Qty: 1
   • Compcodedescrip: Seat Belt & Retainer Assembly                               
   • Linetotal: 291.95
   • Orderstatus: REPAIR      
   • Comment Length: 184
   • Linetype: PART        
   • Row Index: 218
   • Orderid: 13176686
   • Unitcst: 291.95
   • Completedate: 2025-06-20T10:12:00
----------------------------------------

📄 Result #2 Details:
----------------------------------------
🎯 Similarity Score: 0.654

💬 Full Comment:
   <Correction>Replaced Seat belt</Correction><Notes></Notes><Cause>Seat belt is frayed and needs replaced</Cause><Complaint>Mechanic Inspection</Com

## LLM Validation Pipeline

Now we'll add an advanced LLM validation layer that takes the semantic search results and validates whether the diagnostic descriptions in the user's query actually match the maintenance issues found in the search results.

This validation step uses a Language Model to:
1. **Analyze the user's query** to extract the diagnostic intent
2. **Compare search results** against the user's diagnostic description
3. **Filter and rank results** based on diagnostic relevance
4. **Provide explanations** for why results match or don't match

In [31]:
# Set up LLM validation pipeline
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
import json
import re

# Get LLM model configuration from environment
llm_model = os.getenv("LLM_MODEL", "gpt-4o")  # Default fallback
llm_temperature = float(os.getenv("TEMPERATURE", "0.1"))  # Default for validation

# Initialize the LLM for validation
validation_llm = ChatOpenAI(
    model=llm_model,
    temperature=llm_temperature,  # Low temperature for consistent validation
    openai_api_key=openai_api_key
)

# Create the validation prompt template
validation_prompt = ChatPromptTemplate.from_template("""
You are an expert automotive maintenance analyst. Your task is to validate if search results match the user's diagnostic query.

USER QUERY: "{user_query}"

SEARCH RESULTS TO VALIDATE:
{search_results}

For each search result, analyze:
1. Does the maintenance comment describe the same or similar diagnostic issue as the user's query?
2. How relevant is this result to the user's specific problem?
3. What is the confidence level of this match?

Provide your analysis in the following JSON format:
{{
    "overall_assessment": "brief summary of how well results match the query",
    "validated_results": [
        {{
            "result_index": 1,
            "relevance_score": 0.85,
            "is_diagnostic_match": true,
            "explanation": "detailed explanation of why this result matches or doesn't match",
            "key_similarities": ["list", "of", "key", "matching", "elements"],
            "concerns": ["any", "concerns", "about", "the", "match"]
        }}
    ],
    "recommendations": "suggestions for the user based on the validated results"
}}

Be thorough but concise. Focus on diagnostic accuracy and practical relevance.
""")

# Create the validation chain
validation_chain = validation_prompt | validation_llm | StrOutputParser()

print("🧠 LLM Validation Pipeline configured successfully!")
print(f"   - Model: {llm_model}")
print(f"   - Temperature: {llm_temperature} (consistent validation)")
print(f"   - Validation focus: Diagnostic accuracy")

🧠 LLM Validation Pipeline configured successfully!
   - Model: gpt-4o
   - Temperature: 0.1 (consistent validation)
   - Validation focus: Diagnostic accuracy


In [32]:
# Main LLM validation function
def validate_search_results_with_llm(user_query: str, search_results: list, max_results_to_validate: int = 5):
    """
    Validate search results using LLM to ensure diagnostic relevance
    
    Args:
        user_query (str): The original user query
        search_results (list): List of (document, score) tuples from semantic search
        max_results_to_validate (int): Maximum number of results to send to LLM
    
    Returns:
        dict: Validation results with filtered and ranked results
    """
    if not search_results:
        return {
            "validated_results": [],
            "overall_assessment": "No search results to validate",
            "recommendations": "Try a different query or lower the similarity threshold"
        }
    
    print(f"🧠 Validating {min(len(search_results), max_results_to_validate)} results with LLM...")
    
    # Prepare search results for LLM analysis
    results_for_validation = []
    for i, (doc, score) in enumerate(search_results[:max_results_to_validate], 1):
        result_data = {
            "index": i,
            "similarity_score": score,
            "comment": doc.page_content,
            "metadata": {
                "date": doc.metadata.get('Completedate', 'No date'),
                "status": doc.metadata.get('Orderstatus', 'Unknown'),
                "cost": doc.metadata.get('Unitcst', 'No cost'),
                "compcode": doc.metadata.get('Compcode', 'No code')
            }
        }
        results_for_validation.append(result_data)
    
    # Format results for the LLM prompt
    formatted_results = ""
    for result in results_for_validation:
        formatted_results += f"""
Result {result['index']} (Similarity: {result['similarity_score']:.3f}):
Comment: {result['comment'][:300]}{'...' if len(result['comment']) > 300 else ''}
Date: {result['metadata']['date']}
Status: {result['metadata']['status']}
Cost: {result['metadata']['cost']}
---
"""
    
    try:
        # Run the validation chain
        validation_response = validation_chain.invoke({
            "user_query": user_query,
            "search_results": formatted_results
        })
        
        # Try to parse the JSON response
        try:
            # Clean the response if it contains markdown code blocks
            clean_response = validation_response.strip()
            if clean_response.startswith('```json'):
                clean_response = clean_response[7:]  # Remove ```json
            if clean_response.endswith('```'):
                clean_response = clean_response[:-3]  # Remove ```
            clean_response = clean_response.strip()
            
            validation_data = json.loads(clean_response)
        except json.JSONDecodeError:
            # Fallback: extract key information from text response
            print("⚠️ LLM response wasn't valid JSON, parsing as text...")
            validation_data = {
                "overall_assessment": validation_response[:200] + "..." if len(validation_response) > 200 else validation_response,
                "validated_results": [],
                "recommendations": "LLM validation completed but format needs review"
            }
        
        return validation_data
        
    except Exception as e:
        print(f"❌ LLM validation failed: {e}")
        return {
            "overall_assessment": f"Validation failed: {str(e)}",
            "validated_results": [],
            "recommendations": "Validation system encountered an error"
        }

print("✅ LLM validation function created successfully!")

✅ LLM validation function created successfully!


In [33]:
# Enhanced search with LLM validation
def semantic_search_with_llm_validation(
    query: str, 
    k: int = 10, 
    similarity_threshold: float = SIMILARITY_THRESHOLD,
    enable_llm_validation: bool = True,
    max_validate: int = 5
):
    """
    Complete search pipeline with LLM validation
    
    Args:
        query (str): Search query
        k (int): Number of initial results to retrieve
        similarity_threshold (float): Minimum similarity score
        enable_llm_validation (bool): Whether to use LLM validation
        max_validate (int): Maximum results to validate with LLM
    
    Returns:
        dict: Complete search and validation results
    """
    print(f"🚀 Enhanced Search Pipeline Starting...")
    print(f"   Query: '{query}'")
    print(f"   LLM Validation: {'Enabled' if enable_llm_validation else 'Disabled'}")
    
    # Step 1: Semantic search
    print(f"\n1️⃣ Running semantic search...")
    semantic_results = semantic_search_with_ranking(query, k, similarity_threshold)
    
    if not semantic_results:
        return {
            "query": query,
            "semantic_results": [],
            "validation_results": None,
            "final_results": [],
            "recommendations": "No semantic matches found. Try lowering similarity threshold or different keywords."
        }
    
    # Step 2: LLM validation (if enabled)
    validation_results = None
    if enable_llm_validation:
        print(f"\n2️⃣ Running LLM validation...")
        validation_results = validate_search_results_with_llm(query, semantic_results, max_validate)
    
    # Step 3: Combine results
    print(f"\n3️⃣ Processing final results...")
    final_results = []
    
    if enable_llm_validation and validation_results and 'validated_results' in validation_results:
        # Use LLM validation to filter and rank results
        for validated_item in validation_results['validated_results']:
            if validated_item.get('is_diagnostic_match', False) and validated_item.get('relevance_score', 0) > 0.5:
                result_index = validated_item['result_index'] - 1  # Convert to 0-based index
                if result_index < len(semantic_results):
                    doc, original_score = semantic_results[result_index]
                    final_results.append({
                        'document': doc,
                        'semantic_score': original_score,
                        'llm_relevance': validated_item['relevance_score'],
                        'explanation': validated_item['explanation'],
                        'key_similarities': validated_item.get('key_similarities', []),
                        'concerns': validated_item.get('concerns', [])
                    })
    else:
        # Fall back to semantic results only
        for doc, score in semantic_results:
            final_results.append({
                'document': doc,
                'semantic_score': score,
                'llm_relevance': None,
                'explanation': 'LLM validation not performed',
                'key_similarities': [],
                'concerns': []
            })
    
    return {
        "query": query,
        "semantic_results": semantic_results,
        "validation_results": validation_results,
        "final_results": final_results,
        "total_semantic_matches": len(semantic_results),
        "total_validated_matches": len(final_results),
        "recommendations": validation_results.get('recommendations', 'Search completed successfully') if validation_results else "Semantic search completed"
    }

print("🎯 Enhanced search function with LLM validation ready!")

🎯 Enhanced search function with LLM validation ready!


In [34]:
# Display function for enhanced validation results
def display_validated_results(search_results: dict):
    """Display enhanced search results with LLM validation"""
    
    print(f"\n🎯 Enhanced Search Results for: '{search_results['query']}'")
    print("=" * 80)
    
    # Show summary
    semantic_count = search_results['total_semantic_matches']
    validated_count = search_results['total_validated_matches']
    
    print(f"📊 Summary:")
    print(f"   • Semantic matches found: {semantic_count}")
    print(f"   • LLM validated matches: {validated_count}")
    print(f"   • Validation enabled: {'Yes' if search_results['validation_results'] else 'No'}")
    
    # Show overall assessment if available
    if search_results['validation_results'] and 'overall_assessment' in search_results['validation_results']:
        assessment = search_results['validation_results']['overall_assessment']
        print(f"   • LLM Assessment: {assessment}")
    
    # Display validated results
    final_results = search_results['final_results']
    
    if not final_results:
        print(f"\n❌ No validated results found")
        print(f"💡 {search_results['recommendations']}")
        return
    
    print(f"\n🏆 Validated Results:")
    print("=" * 60)
    
    for i, result in enumerate(final_results, 1):
        doc = result['document']
        semantic_score = result['semantic_score']
        llm_relevance = result['llm_relevance']
        explanation = result['explanation']
        
        print(f"\n📄 Result #{i}")
        print("-" * 40)
        
        # Scores
        scores_info = f"Semantic: {semantic_score:.3f}"
        if llm_relevance is not None:
            scores_info += f" | LLM Relevance: {llm_relevance:.3f}"
        print(f"🎯 Scores: {scores_info}")
        
        # Comment
        comment = doc.page_content
        print(f"💬 Comment: {comment[:200]}{'...' if len(comment) > 200 else ''}")
        
        # LLM explanation
        if explanation and explanation != 'LLM validation not performed':
            print(f"🧠 LLM Analysis: {explanation[:300]}{'...' if len(explanation) > 300 else ''}")
        
        # Key similarities
        if result['key_similarities']:
            similarities = ', '.join(result['key_similarities'][:3])
            print(f"🔗 Key Matches: {similarities}")
        
        # Concerns
        if result['concerns']:
            concerns = ', '.join(result['concerns'][:2])
            print(f"⚠️ Concerns: {concerns}")
        
        # Metadata
        metadata = doc.metadata
        print(f"📝 Metadata:")
        if 'Completedate' in metadata and metadata['Completedate']:
            print(f"   • Date: {metadata['Completedate']}")
        if 'Orderstatus' in metadata and metadata['Orderstatus']:
            print(f"   • Status: {metadata['Orderstatus']}")
        if 'Unitcst' in metadata and metadata['Unitcst']:
            print(f"   • Cost: ${metadata['Unitcst']}")
    
    # Show recommendations
    print(f"\n💡 Recommendations: {search_results['recommendations']}")
    print("=" * 80)

print("📋 Enhanced results display function ready!")

📋 Enhanced results display function ready!


In [35]:
# Enhanced table display function for LLM validation results
import pandas as pd
from IPython.display import display, HTML

def display_validated_results_table(search_results: dict):
    """Display enhanced search results with LLM validation in a structured table format"""
    
    print(f"\n🎯 Enhanced Search Results for: '{search_results['query']}'")
    print("=" * 80)
    
    # Show summary
    semantic_count = search_results['total_semantic_matches']
    validated_count = search_results['total_validated_matches']
    
    print(f"📊 Summary:")
    print(f"   • Semantic matches found: {semantic_count}")
    print(f"   • LLM validated matches: {validated_count}")
    print(f"   • Validation enabled: {'Yes' if search_results['validation_results'] else 'No'}")
    
    # Show overall assessment if available
    if search_results['validation_results'] and 'overall_assessment' in search_results['validation_results']:
        assessment = search_results['validation_results']['overall_assessment']
        print(f"   • LLM Assessment: {assessment}")
    
    # Display validated results
    final_results = search_results['final_results']
    
    if not final_results:
        print(f"\n❌ No validated results found")
        print(f"💡 {search_results['recommendations']}")
        return
    
    print(f"\n🏆 Validated Results Table:")
    print("=" * 80)
    
    # Prepare data for table
    table_data = []
    
    for i, result in enumerate(final_results, 1):
        doc = result['document']
        semantic_score = result['semantic_score']
        llm_relevance = result['llm_relevance']
        explanation = result['explanation']
        metadata = doc.metadata
        
        # Create a comprehensive row with all available metadata
        row = {
            'Result #': i,
            'Semantic Score': f"{semantic_score:.3f}",
            'LLM Relevance': f"{llm_relevance:.3f}" if llm_relevance is not None else "N/A",
            'Comment Preview': doc.page_content[:100] + "..." if len(doc.page_content) > 100 else doc.page_content,
            'LLM Explanation': explanation[:150] + "..." if len(explanation) > 150 else explanation,
            'Key Similarities': ', '.join(result['key_similarities'][:3]) if result['key_similarities'] else "None",
            'Concerns': ', '.join(result['concerns'][:2]) if result['concerns'] else "None"
        }
        
        # Add all metadata fields dynamically
        for key, value in metadata.items():
            if value is not None and str(value) != 'nan':
                # Clean up key names for better display
                clean_key = key.replace('_', ' ').title()
                if isinstance(value, str) and len(value) > 50:
                    row[clean_key] = value[:50] + "..."
                else:
                    row[clean_key] = str(value)
        
        table_data.append(row)
    
    # Create DataFrame
    df_results = pd.DataFrame(table_data)
    
    # Display the table
    display(df_results)
    
    # Also show detailed view for each result
    print(f"\n📋 Detailed Results:")
    print("=" * 80)
    
    for i, result in enumerate(final_results, 1):
        doc = result['document']
        semantic_score = result['semantic_score']
        llm_relevance = result['llm_relevance']
        explanation = result['explanation']
        metadata = doc.metadata
        
        print(f"\n📄 Result #{i} Details:")
        print("-" * 50)
        
        # Scores section
        print(f"🎯 Scores:")
        print(f"   • Semantic Score: {semantic_score:.3f}")
        if llm_relevance is not None:
            print(f"   • LLM Relevance: {llm_relevance:.3f}")
        
        # Full comment
        print(f"\n💬 Full Comment:")
        print(f"   {doc.page_content}")
        
        # LLM Analysis
        if explanation and explanation != 'LLM validation not performed':
            print(f"\n🧠 LLM Analysis:")
            print(f"   {explanation}")
        
        # Key similarities and concerns
        if result['key_similarities']:
            print(f"\n🔗 Key Similarities: {', '.join(result['key_similarities'])}")
        if result['concerns']:
            print(f"\n⚠️ Concerns: {', '.join(result['concerns'])}")
        
        # Complete metadata
        print(f"\n📝 Complete Metadata:")
        for key, value in metadata.items():
            if value is not None and str(value) != 'nan':
                clean_key = key.replace('_', ' ').title()
                print(f"   • {clean_key}: {value}")
        
        print("-" * 50)
    
    # Show recommendations
    print(f"\n💡 Recommendations: {search_results['recommendations']}")
    print("=" * 80)

print("📊 Enhanced table display function created successfully!")

📊 Enhanced table display function created successfully!


### 📊 Enhanced Table Display for LLM Validation Results

The new **structured table display** function (`display_validated_results_table()`) provides a comprehensive view of LLM validation results in multiple formats:

#### 🏗️ **Table Structure**

1. **📊 Pandas DataFrame**: Comprehensive overview table with all key information
   - Semantic and LLM scores
   - Comment previews
   - LLM explanations
   - Key similarities and concerns
   - All metadata fields from the Excel data

2. **📋 Detailed Individual Results**: Complete information for each result
   - Full comment text (not truncated)
   - Complete LLM analysis
   - All metadata fields with clean formatting
   - Scores and validation details

#### 🔍 **Key Features**

- **Dynamic Metadata**: Automatically includes all available Excel columns
- **Clean Formatting**: Removes underscores, handles long text gracefully
- **Export Ready**: Pandas DataFrame can be easily exported to CSV/Excel
- **Complete Information**: No data is lost in the display
- **Visual Hierarchy**: Clear separation between overview and details

#### 💡 **When to Use**

- **Table Format**: When you need a quick overview and want to analyze multiple results
- **Original Format**: When you prefer a narrative, easy-to-read display
- **Both**: For comprehensive analysis and documentation

This dual approach gives you the best of both worlds: structured data analysis and human-readable results.

In [36]:
# Demonstration: Structured Table Display for LLM Validation Results
print("📊 Demonstration: Enhanced Table Display")
print("=" * 50)

# Example query for demonstration
demo_query = "seatbelt buckle mechanism failure"
print(f"🔍 Demo Query: '{demo_query}'")
print(f"📋 This will show results in both table and detailed formats")

# Use a lower threshold for demonstration to ensure we get results
demo_threshold = SIMILARITY_THRESHOLD - 0.2

try:
    # Run enhanced search with LLM validation
    demo_results = semantic_search_with_llm_validation(
        query=demo_query,
        k=3,  # Get 3 results for demonstration
        similarity_threshold=demo_threshold,
        enable_llm_validation=True,
        max_validate=3
    )
    
    if demo_results['final_results']:
        print(f"\n✅ Found {len(demo_results['final_results'])} validated results")
        print("📊 Displaying in structured table format...")
        
        # Display using the new table format
        display_validated_results_table(demo_results)
        
    else:
        print(f"\n⚠️ No validated results found for demonstration")
        print(f"   Trying with original display format...")
        
        # Fallback to original display
        display_validated_results(demo_results)
    
except Exception as e:
    print(f"❌ Demo failed: {e}")
    print("💡 This might be due to API limits or connectivity issues")

print(f"\n" + "="*50)
print("💡 The table format provides:")
print("   • 📊 Comprehensive overview in pandas DataFrame")
print("   • 🔍 Detailed view with complete metadata")
print("   • 📋 All available fields from the Excel data")
print("   • 🧠 LLM analysis and explanations")
print("   • ⚡ Easy to export or further analyze")

📊 Demonstration: Enhanced Table Display
🔍 Demo Query: 'seatbelt buckle mechanism failure'
📋 This will show results in both table and detailed formats
🚀 Enhanced Search Pipeline Starting...
   Query: 'seatbelt buckle mechanism failure'
   LLM Validation: Enabled

1️⃣ Running semantic search...
🔍 Searching for: 'seatbelt buckle mechanism failure'
📊 Similarity threshold: 0.49999999999999994
📅 Results will be ranked by date (newest first)
✅ Found 3 results above threshold 0.49999999999999994

2️⃣ Running LLM validation...
🧠 Validating 3 results with LLM...
✅ Found 3 results above threshold 0.49999999999999994

2️⃣ Running LLM validation...
🧠 Validating 3 results with LLM...

3️⃣ Processing final results...

✅ Found 3 validated results
📊 Displaying in structured table format...

🎯 Enhanced Search Results for: 'seatbelt buckle mechanism failure'
📊 Summary:
   • Semantic matches found: 3
   • LLM validated matches: 3
   • Validation enabled: Yes
   • LLM Assessment: All search results describ

Unnamed: 0,Result #,Semantic Score,LLM Relevance,Comment Preview,LLM Explanation,Key Similarities,Concerns,Completedate,Unitcst,Hours,Comment Length,Qty,Orderid,Linetotal,Compcode,Row Index,Orderstatus,Linetype,Compcodedescrip
0,1,0.732,0.85,<Correction>Replaced seatbelt buckle</Correcti...,The result describes a seatbelt buckle replace...,"seatbelt buckle, broke, mechanism failure",,2025-04-10T07:32:00,74.0,0.55,173,0,12729042,40.7,002-011-003,67,REPAIR,LABOR,Seat Belt & Retainer Assembly ...
1,2,0.732,0.85,<Correction>Replaced seatbelt buckle</Correcti...,This result also involves replacing a seatbelt...,"seatbelt buckle, broke, mechanism failure",,2025-04-17T11:42:00,109.3225,0.0,173,1,12795355,109.32,002-011-003,99,REPAIR,PART,Seat Belt & Retainer Assembly ...
2,3,0.732,0.85,<Correction>Replaced seatbelt buckle</Correcti...,The replacement of a broken seatbelt buckle su...,"seatbelt buckle, broke, mechanism failure",,2025-05-19T09:38:00,109.5494,0.0,173,1,12971816,109.55,002-011-003,157,REPAIR,PART,Seat Belt & Retainer Assembly ...



📋 Detailed Results:

📄 Result #1 Details:
--------------------------------------------------
🎯 Scores:
   • Semantic Score: 0.732
   • LLM Relevance: 0.850

💬 Full Comment:
   <Correction>Replaced seatbelt buckle</Correction><Notes></Notes><Cause>Seatbelt buckle broke</Cause><Complaint>Mechanic inspection</Complaint><CauseOfRepair></CauseOfRepair>

🧠 LLM Analysis:
   The result describes a seatbelt buckle replacement due to a broken buckle, which directly relates to a mechanism failure.

🔗 Key Similarities: seatbelt buckle, broke, mechanism failure

⚠️ Concerns: None

📝 Complete Metadata:
   • Completedate: 2025-04-10T07:32:00
   • Unitcst: 74.0
   • Hours: 0.55
   • Comment Length: 173
   • Qty: 0
   • Orderid: 12729042
   • Linetotal: 40.7
   • Compcode: 002-011-003 
   • Row Index: 67
   • Orderstatus: REPAIR      
   • Linetype: LABOR       
   • Compcodedescrip: Seat Belt & Retainer Assembly                               
--------------------------------------------------

📄 Resu

## ✅ LLM Validation System Complete!

### 🎯 **What We've Built**

The enhanced semantic search system now includes a sophisticated **LLM Validation Pipeline** that adds intelligence to the search results:

### 🔄 **Two-Stage Process**

1. **Stage 1 - Semantic Search**: 
   - Uses OpenAI embeddings to find semantically similar maintenance comments
   - Applies similarity thresholds and date-based ranking
   - Returns initial candidates based on vector similarity

2. **Stage 2 - LLM Validation**:
   - Uses **GPT-4o** (configured via `.env` file) to analyze search results
   - Validates if the diagnostic descriptions actually match the user's query
   - Provides detailed explanations and confidence scores
   - Filters out false positives from semantic search

### 🧠 **LLM Validation Features**

- **Diagnostic Accuracy**: Validates that maintenance issues truly match the user's problem
- **Relevance Scoring**: Provides confidence scores (0.0-1.0) for each result
- **Detailed Explanations**: Explains why results match or don't match
- **Key Similarities**: Identifies specific matching elements
- **Concern Flagging**: Highlights potential issues with matches
- **Recommendations**: Provides actionable advice based on validated results

### ⚙️ **Configuration**

- **Model**: Reads from `.env` file (`LLM_MODEL=gpt-4o`)
- **Temperature**: Optimized for consistent validation (0.1)
- **Fallback**: Graceful degradation if LLM validation fails
- **Customizable**: Can be enabled/disabled per search

### 🚀 **Usage Examples**

```python
# With LLM validation (recommended)
results = semantic_search_with_llm_validation(
    query="broken seatbelt buckle won't latch",
    k=5,
    similarity_threshold=0.6,
    enable_llm_validation=True
)

# Display enhanced results
display_validated_results(results)

# Without LLM validation (faster, semantic only)
results = semantic_search_with_llm_validation(
    query="broken seatbelt buckle won't latch", 
    enable_llm_validation=False
)
```

This creates a **production-ready intelligent search system** that combines the speed of vector search with the accuracy of LLM validation!