# Complete RAG System with Company-Aware Retrieval

## Overview
This notebook implements a Retrieval-Augmented Generation (RAG) system that:
1. Loads company data and creates a vector database
2. Detects company names in user queries
3. Expands each query into up to 5 sub-queries covering different perspectives
4. Retrieves documents with metadata filtering (company-aware)
5. Generates answers strictly from retrieved context

## 1. Environment Setup

In [5]:
# !pip install -q langchain langchain-community langchain-openai chromadb openai pandas

In [2]:
import pandas as pd
import os
from typing import List
from pydantic import BaseModel, Field
from langchain_core.documents import Document
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

print("✓ All imports successful")

✓ All imports successful


## 2. Configuration

In the following cell, we configure the API keys for OpenAI (embedding model and default LLM) and OpenRouter (optional alternative LLMs). The `K_DOCUMENTS = 5` variable sets the number of documents retrieved for each query. The active configuration uses `gpt-5-nano-2025-08-07` as the LLM; the OpenRouter-hosted alternatives, including a DeepSeek model, are left commented out.

In [None]:
# IMPORTANT: Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "API_KEY"
Open_Router_API_Key = "API_KEY"
# Configuration
K_DOCUMENTS = 5
# LLM_MODEL = "nex-agi/deepseek-v3.1-nex-n1:free" # From OpenRouter
# LLM_MODEL = "amazon/nova-2-lite-v1:free"
LLM_MODEL = 'gpt-5-nano-2025-08-07'
CHROMA_DIR = "/Users/alihamzeh/Downloads/ILI-Digital/chroma_db"
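
Hard-coded keys are easy to leak when a notebook is shared. The following is a minimal sketch, assuming the keys are exported as environment variables before the notebook starts. `require_env` is a hypothetical helper and `OPENROUTER_API_KEY` is an assumed variable name; neither appears in the original notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper: the original notebook assigns the keys inline.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Usage (assumed variable names):
# require_env("OPENAI_API_KEY")  # only checked; langchain_openai reads it from the environment
# Open_Router_API_Key = require_env("OPENROUTER_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error raised later by the first LLM call.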

## 3. Load Data

The vector database will be constructed utilizing the refined and enriched dataset. To make the vector database more comprehensive, I enriched the descriptions by adding additional structured information using `DeepSeek v3.1`, including Primary Industry, Related Industries, Products and Services, and Expanded Keywords. The purpose of this step was to increase contextual depth by expanding key terms. For example, terms such as Wind Power were expanded to include Energy, Renewable Energy, Green Energy, and similar concepts, thereby improving the likelihood of more accurate and relevant retrieval.

In [7]:
# Load company dataset
df = pd.read_csv('/Users/alihamzeh/Downloads/ILI-Digital/Data Manipulation/companies_enriched_deepseek.csv')
print(f"✓ Loaded {len(df)} companies")
df.head()

✓ Loaded 105 companies


Unnamed: 0,companyName,original_description,primary_industry,related_industries,products_services,expanded_keywords,searchable_text
0,Traton SE,Traton SE manufactures commercial vehicles wor...,commercial vehicle manufacturing,"[""automotive"", ""transportation equipment""]","[""trucks"", ""buses"", ""vans"", ""construction vehi...","[""MAN"", ""Scania"", ""Navistar"", ""Volkswagen Cami...",Company: Traton SE\nDescription: Traton SE man...
1,2G Energy AG,"2G Energy AG, together with its subsidiaries, ...",Energy Systems,"[""Power Generation"", ""Combined Heat and Power""...","[""Combined Heat and Power (CHP) Systems"", ""g-b...","[""CHP Systems"", ""Cogeneration"", ""Decentralized...",Company: 2G Energy AG\nDescription: 2G Energy ...
2,MTU Aero Engines AG,"MTU Aero Engines AG, together with its subsidi...",Aerospace and Defense,"[""Aviation"", ""Aerospace Manufacturing"", ""Defen...","[""Commercial aircraft engines"", ""Military airc...","[""Aero engines"", ""Jet engines"", ""Gas turbines""...",Company: MTU Aero Engines AG\nDescription: MTU...
3,Deutsche Lufthansa AG,Deutsche Lufthansa AG operates as an aviation ...,Aviation,"[""Airlines"", ""Aerospace"", ""Transportation""]","[""Passenger services"", ""Cargo transport servic...","[""Lufthansa"", ""German airline"", ""International...",Company: Deutsche Lufthansa AG\nDescription: D...
4,Siemens Energy AG,Siemens Energy AG operates as an energy techno...,Energy Technology,"[""Power Generation"", ""Renewable Energy"", ""Oil ...","[""gas turbines"", ""steam turbines"", ""generators...","[""energy technology"", ""power generation"", ""gas...",Company: Siemens Energy AG\nDescription: Sieme...
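
The enrichment script itself is not part of this notebook, so the following is only a sketch of how a `searchable_text` value could be assembled from the other columns. The `Company:` and `Description:` labels are visible in the output above; the remaining labels, the JSON encoding of the list columns, and both function names are assumptions.

```python
import json
from typing import Dict, List


def parse_list_field(raw: str) -> List[str]:
    """Parse a JSON-encoded list column such as '["trucks", "buses"]'."""
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        return []
    return [str(item) for item in value] if isinstance(value, list) else []


def build_searchable_text(record: Dict[str, str]) -> str:
    """Concatenate the enriched fields into one text block per company.

    Only the leading 'Company:' and 'Description:' labels are confirmed by
    the notebook output; the other labels are assumed for illustration.
    """
    lines = [
        f"Company: {record['companyName']}",
        f"Description: {record['original_description']}",
        f"Primary Industry: {record['primary_industry']}",
    ]
    for label, column in [
        ("Related Industries", "related_industries"),
        ("Products and Services", "products_services"),
        ("Expanded Keywords", "expanded_keywords"),
    ]:
        items = parse_list_field(record.get(column, "[]"))
        if items:  # skip empty or unparseable list columns
            lines.append(f"{label}: {', '.join(items)}")
    return "\n".join(lines)
```

Applied row by row (for example with `df.apply(lambda r: build_searchable_text(r.to_dict()), axis=1)`), this would yield one text block per company.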


## 4. Create Vector Database

The vector database is built from the `searchable_text` column, which holds the original description plus the details and keywords added by an LLM. `companyName` is stored as metadata so that search results can be filtered when a question mentions specific companies. The OpenAI embedding model creates the vectors, and the same model is used in all later steps for consistency. Each row is treated as a single chunk: the rows are short (at most 384 words, as the statistics below show), and each one contains enough context about its company.

In [8]:
df["description_word_count"] = (
    df["searchable_text"]
    .astype(str)
    .str.split()
    .str.len()
)
df.describe()

Unnamed: 0,description_word_count
count,105.0
mean,223.761905
std,69.790693
min,72.0
25%,176.0
50%,219.0
75%,271.0
max,384.0


In [9]:
# Prepare documents with metadata
documents = [
    Document(
        page_content=row['searchable_text'],
        metadata={'company_name': row['companyName']}
    )
    for _, row in df.iterrows()
]

# Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory=CHROMA_DIR
)

print(f"✓ Vector database created with {len(documents)} documents")

✓ Vector database created with 105 documents


## 5. Company Detection System

In this section, to filter companies mentioned in the user query, we first use an LLM to analyze the query and identify any referenced companies. The model extracts the correct spelling of each company based on a provided list, after which the output is parsed to construct a metadata-based search query. This approach enables precise filtering of the vector database, allowing us to retrieve the exact companies prior to performing similarity-based searches.

In [10]:
VALID_COMPANIES = df['companyName'].unique().tolist()

class CompanyExtraction(BaseModel):
    company_names: List[str] = Field(
        default_factory=list,
        description=f"Company names from this list: {VALID_COMPANIES}. Empty if none found."
    )

# llm = ChatOpenAI(
#     model=LLM_MODEL, # For changing the model refer to the Config section
#     temperature=0,
#     openai_api_base="https://openrouter.ai/api/v1",
#     openai_api_key= Open_Router_API_Key,
# )


llm = ChatOpenAI(
    model=LLM_MODEL, # For changing the model refer to the Config section
    temperature=0,
)

company_parser = JsonOutputParser(pydantic_object=CompanyExtraction)

company_extraction_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Your task is to extract company names explicitly mentioned in the user query.\n\n"
        "Rules:\n"
        "- Do NOT infer, guess, or predict company names.\n"
        "- Only extract a company if it is explicitly mentioned in the query.\n"
        "- Only return companies that exist in the following list: {company_list}.\n"
        "- If the query mentions a partial, lowercase, or variant form of a company name "
        "(e.g., 'traton'), map it to the exact company name as written in the company list "
        "(e.g., 'Traton SE').\n"
        "- If no company from the list is explicitly mentioned, return an empty list.\n\n"
        "{format_instructions}"
    ),
    ("human", "{query}")
]).partial(
    format_instructions=company_parser.get_format_instructions(),
    company_list=str(VALID_COMPANIES)
)

company_extraction_chain = company_extraction_prompt | llm | company_parser
print("✓ Company detection system ready")

✓ Company detection system ready
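
The prompt asks the model to return only names from the reference list, but nothing in the chain enforces that. As a hedged sketch, the parsed output could be checked against `VALID_COMPANIES` before it is used as a metadata filter. `validate_companies` is a hypothetical helper, not part of the original notebook.

```python
from typing import Iterable, List


def validate_companies(extracted: Iterable[str], valid: Iterable[str]) -> List[str]:
    """Keep only names that match the reference list, ignoring case and padding.

    Returns the canonical spelling from `valid`, without duplicates, in the
    order the names were extracted.
    """
    canonical = {name.strip().casefold(): name for name in valid}
    result: List[str] = []
    for name in extracted:
        match = canonical.get(str(name).strip().casefold())
        if match is not None and match not in result:
            result.append(match)
    return result
```

A call such as `validate_companies(extraction.get("company_names", []), VALID_COMPANIES)` would drop any name that is not in the list, so a hallucinated company cannot produce an empty filtered search.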


## 6. Query Expansion System

To further improve the precision and comprehensiveness of the RAG system, I implemented a query expansion mechanism that generates up to five alternative queries from the original input. This approach ensures broader coverage of the query's different aspects, enabling more focused and effective retrieval.

In [11]:
class ExpandedQueries(BaseModel):
    queries: List[str] = Field(
        ...,
        description="Up to 5 expanded queries rephrasing the original from different perspectives"
    )

query_expansion_parser = JsonOutputParser(pydantic_object=ExpandedQueries)

query_expansion_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Your task is to expand the original user query into up to five distinct, non-overlapping "
        "sub-queries by decomposing it into smaller, well-defined components.\n\n"
        "Each generated sub-query must focus on a DIFFERENT aspect, condition, or constraint "
        "explicitly implied by the original query. For example, if the original query is "
        "`companies in renewable energy and financial services`, you must generate at least one "
        "sub-query targeting the renewable energy aspect and one targeting the financial services "
        "aspect.\n\n"
        "Do NOT generate multiple sub-queries with the same meaning.\n\n"
        "The generated sub-queries will be used for semantic search. Therefore:\n"
        "- Make each sub-query long, detailed, and information-dense.\n"
        "- Preserve the original intent exactly.\n"
        "- Include all relevant context from the original query in each sub-query where applicable.\n"
        "- Prefer explicit, descriptive phrasing over short or vague questions.\n\n"
        "When the query contains industrial, sectoral, or domain-specific terms, expand them using "
        "closely related synonyms or equivalent industry terminology "
        "(e.g., 'wind power' → 'wind energy', 'renewable energy', 'green energy', 'energy sector').\n\n"
        "RULES:\n"
        "1. Generate between 1 and 5 sub-queries (maximum of 5).\n"
        "2. Do NOT introduce new facts, entities, industries, or constraints that are not implied "
        "by the original query.\n"
        "3. Each sub-query must represent a unique aspect or constraint of the original query; "
        "avoid semantic overlap.\n"
        "4. Paraphrasing and decomposition are allowed only within the scope of the original intent.\n"
        "5. If multiple conditions are implied (e.g., sector A AND sector B), separate them into "
        "individual sub-queries.\n"
        "6. If company names are explicitly mentioned, keep their spelling unchanged.\n"
        "7. Avoid speculative, hypothetical, or predictive wording.\n\n"
        "{format_instructions}"
    ),
    ("human", "Original query: {query}")
]).partial(
    format_instructions=query_expansion_parser.get_format_instructions()
)


query_expansion_chain = query_expansion_prompt | llm | query_expansion_parser
print("✓ Query expansion system ready")

✓ Query expansion system ready
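
The limit of five sub-queries is stated only in the prompt, so the model may still return more, repeat itself, or echo the original query. A small post-processing step could enforce the limit in code. This is a sketch under that assumption; `clean_expanded_queries` and `MAX_EXPANDED_QUERIES` are hypothetical names.

```python
from typing import Iterable, List

MAX_EXPANDED_QUERIES = 5  # mirrors the limit stated in the expansion prompt


def clean_expanded_queries(
    original: str,
    expanded: Iterable[str],
    limit: int = MAX_EXPANDED_QUERIES,
) -> List[str]:
    """Drop blanks, duplicates, and copies of the original; keep at most `limit`."""
    # Compare on whitespace-normalized, case-folded text.
    seen = {" ".join(original.split()).casefold()}
    cleaned: List[str] = []
    for query in expanded:
        normalized = " ".join(str(query).split())
        key = normalized.casefold()
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)
    return cleaned[:limit]
```

Only exact repeats (after normalizing case and whitespace) are removed here; detecting paraphrases with the same meaning would need an embedding comparison, which is outside this sketch.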


## 7. Contextual Retriever

In the following section, we integrate all the components developed so far to retrieve the documents relevant to the user query and prepare them for input to the model.

In [12]:
def contextual_retriever(queries: List[str], k: int = K_DOCUMENTS) -> List[Document]:
    retrieved_docs = []
    seen_contents = set()

    for query in queries:
        extraction = company_extraction_chain.invoke({"query": query})
        companies = extraction.get('company_names', [])

        if companies:
            print(f"🔍 '{query[:50]}...' → Companies: {companies}")
            for company in companies:
                retriever = vectorstore.as_retriever(
                    search_kwargs={"filter": {"company_name": company}, "k": k}
                )
                docs = retriever.invoke(query)
                for doc in docs:
                    if doc.page_content not in seen_contents:
                        seen_contents.add(doc.page_content)
                        retrieved_docs.append(doc)
        else:
            print(f"🔍 '{query[:50]}...' → Broad search")
            retriever = vectorstore.as_retriever(search_kwargs={"k": k})
            docs = retriever.invoke(query)
            for doc in docs:
                if doc.page_content not in seen_contents:
                    seen_contents.add(doc.page_content)
                    retrieved_docs.append(doc)

    print(f"✓ Retrieved {len(retrieved_docs)} unique documents\n")
    return retrieved_docs
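
The retriever above issues one filtered search per detected company. As an alternative sketch, the filter could be built once per query, using an `$in` condition when several companies are detected. This assumes the installed Chroma version supports the `$in` metadata operator; `build_company_filter` and `build_search_kwargs` are hypothetical helpers.

```python
from typing import Any, Dict, List, Optional


def build_company_filter(companies: List[str]) -> Optional[Dict[str, Any]]:
    """Translate detected company names into a metadata filter.

    Returns None for a broad search, an equality filter for one company, and
    an `$in` filter for several (assumes the vector store supports `$in`).
    """
    unique = list(dict.fromkeys(companies))  # preserve order, drop repeats
    if not unique:
        return None
    if len(unique) == 1:
        return {"company_name": unique[0]}
    return {"company_name": {"$in": unique}}


def build_search_kwargs(companies: List[str], k: int) -> Dict[str, Any]:
    """Assemble the `search_kwargs` dict passed to `vectorstore.as_retriever`."""
    kwargs: Dict[str, Any] = {"k": k}
    company_filter = build_company_filter(companies)
    if company_filter is not None:
        kwargs["filter"] = company_filter
    return kwargs
```

Note the behavioral difference: a single `$in` search returns at most `k` documents in total, whereas the loop above returns up to `k` per company. With one document per company and `k = 5`, the two only diverge when more than five companies are detected in one query.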

## 8. Answer Generation

Here, we define the model input prompt by specifying the system message, the retrieved context, and the original user query.

In [16]:
answer_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are a precise QA assistant. CRITICAL:\n"
        "1. Answer ONLY from the provided context\n"
        "2. No external knowledge\n"
        "3. Say 'Cannot answer' if context insufficient\n"
        "4. Keep the answer short and concise\n"
        "5. Be specific and cite details\n\n"
        "Context:\n{context}"
    ),
    ("human", "{question}")
])

## 9. Complete RAG Pipeline

In [17]:
def rag_pipeline(user_query: str, verbose: bool = True) -> str:
    if verbose:
        print("="*80)
        print(f"Query: {user_query}\n")

    # Expand query
    expansion = query_expansion_chain.invoke({"query": user_query})
    expanded_queries = expansion['queries']

    if verbose:
        print("Expanded queries:")
        for i, q in enumerate(expanded_queries, 1):
            print(f"  {i}. {q}")
        print()

    # Retrieve documents
    all_queries = [user_query] + expanded_queries
    docs = contextual_retriever(all_queries)

    # Format context
    context = "\n\n---\n\n".join(d.page_content for d in docs)

    if verbose:
        companies = set(d.metadata['company_name'] for d in docs)
        print(f"Context: {len(context)} chars from companies: {companies}\n")

    # Generate answer
    answer_chain = answer_prompt | llm | StrOutputParser()
    answer = answer_chain.invoke({"context": context, "question": user_query})

    if verbose:
        print("="*80)
        print(f"context: {context}\n")
        print(f"question: {user_query}\n")
        print("ANSWER:\n")

    return answer
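
The pipeline joins the raw `page_content` of every retrieved document. A possible refinement, sketched here, is to label each block with its company and stop before a character budget is exceeded, so a broad query cannot grow the prompt without bound. The `Doc` class is a stand-in for LangChain's `Document`, and `format_context` and the `max_chars` default are assumptions, not values from the notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for langchain's Document so the sketch runs on its own."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def format_context(docs: List[Doc], max_chars: int = 20000) -> str:
    """Join documents under a company header, stopping before `max_chars`.

    `max_chars` is an assumed budget, not a value taken from the notebook.
    """
    separator = "\n\n---\n\n"
    parts: List[str] = []
    used = 0
    for doc in docs:
        name = doc.metadata.get("company_name", "unknown")
        block = f"[Source: {name}]\n{doc.page_content}"
        # The separator only counts once there is a previous block.
        cost = len(block) + (len(separator) if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(block)
        used += cost
    return separator.join(parts)
```

Because the function only reads `page_content` and `metadata`, it should work unchanged with the `Document` objects returned by `contextual_retriever`.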

## 10. Test Cases

In [15]:
# Test 1: Single company
q1 = "What are the main products of Siemens Energy AG?"
a1 = rag_pipeline(q1)
print(a1)

Query: What are the main products of Siemens Energy AG?

Expanded queries:
  1. Provide a comprehensive, detailed listing of Siemens Energy AG's main product categories as presented by the company, enumerating the core hardware and equipment lines that constitute its primary product portfolio (for example, power generation equipment, power transmission and grid technology), and identify representative sub-products within each category along with any notable flagship offerings.
  2. Describe Siemens Energy AG's main products from the vantage point of customer segments, distinguishing between products aimed at utility-scale power generation and those designed for industrial energy infrastructure projects, and specify the principal product offerings within each market segment.
  3. Explain how Siemens Energy AG organizes and presents its main product lineup across technology families (for instance gas turbines, steam turbines, generators, transformers, and grid solutions), including the r

In [18]:
# Test 2: Multi-company comparison
q2 = "Compare Traton SE and Deutsche Lufthansa AG"
a2 = rag_pipeline(q2)
print(a2)

Query: Compare Traton SE and Deutsche Lufthansa AG

Expanded queries:
  1. Provide a comprehensive side-by-side financial comparison of Traton SE and Deutsche Lufthansa AG, detailing the latest full-year results and the most recent quarterly figures, including revenue, operating income (EBIT) and EBITDA, net income, gross and operating margins, segment contributions (Traton's trucks and industrial mobility divisions, including vehicle systems and services, vs Lufthansa's passenger transport, cargo, and related service segments), year-over-year deltas, and any notable one-off items or non-recurring adjustments that affect comparability.
  2. Analyze and compare the business models, value chains, and competitive positioning of Traton SE and Deutsche Lufthansa AG, outlining their core products and services (commercial vehicles and fleet solutions for Traton; airline operations, network reach, and ancillary services for Lufthansa), primary customer bases, revenue streams and cost structure

In [None]:
# Test 3: Broad query
q3 = "Which company is involved in both energy and transportation sectors?"
a3 = rag_pipeline(q3)
print(a3)

## 11. Interactive Interface

In [None]:
def ask(question: str):
    """Simple Q&A interface"""
    answer = rag_pipeline(question, verbose=False)
    print(f"\nQ: {question}")
    print(f"A: {answer}\n")

# Usage: ask("Your question here")

In [None]:
ask("Which company is involved in both energy and transportation sectors?")

## System Architecture

```
User Query → Query Expansion (up to 5) → Company Detection
                                          ↓
                    ┌─────────────────────┴─────────────────────┐
                    ↓                                           ↓
            Companies Found                              No Companies
                    ↓                                           ↓
        Filtered Retrieval (metadata)                  Broad Retrieval
                    └─────────────────────┬─────────────────────┘
                                          ↓
                              Deduplicate Documents
                                          ↓
                              Aggregate Context
                                          ↓
                           Generate Answer (context-only)
```

# Evaluation Pipeline

In this section, we construct a test pipeline that uses an LLM to compare the response generated by the RAG system with the ground-truth answer.

First, we configure the evaluator LLM and its prompt.

In [91]:
# Initialize LLM Judge
judge_llm = llm

# judge_llm = ChatOpenAI(
#     model=LLM_MODEL, # For changing the model refer to the Config section
#     temperature=0,
#     openai_api_base="https://openrouter.ai/api/v1",
#     openai_api_key= Open_Router_API_Key,
# )

# Create Judge Prompt
judge_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are an evaluation judge. Compare the expected answer with the predicted answer.\n"
        "Determine if the predicted answer is correct (captures the same meaning/information).\n\n"
        "Format your response as:\n"
        "Decision: True/False\n"
        "Reason: [One short sentence explaining why]"
    ),
    (
        "human",
        "Question: {question}\n\n"
        "Expected Answer: {expected}\n\n"
        "Predicted Answer: {predicted}\n\n"
        "Evaluate if the predicted answer is correct:"
    )
])

# Create Judge Chain
judge_chain = judge_prompt | judge_llm | StrOutputParser()

In this step, we define a function to evaluate the accuracy of the RAG system.

The function begins by loading a test CSV file that contains two required fields: `question` and `expected_answer`. It then iterates over each row of the dataset, generates a response using the RAG system for the given question, and employs a judge LLM to compare the generated answer with the expected answer, returning a Boolean result (`True` or `False`) for each comparison.



In [92]:
def evaluate_with_llm_judge(rag_function, test_csv_path: str) -> float:
    """
    Evaluate RAG system using LLM as judge.

    Args:
        rag_function: Your RAG pipeline function
        test_csv_path: Path to test CSV with 'question' and 'expected_answer' columns

    Returns:
        Accuracy (0 to 1)
    """
    # Load test data
    test_df = pd.read_csv(test_csv_path)

    results = []
    correct_count = 0
    total_count = len(test_df)

    print(f"Evaluating {total_count} test cases...\n")

    # Evaluate each test case
    for idx, row in test_df.iterrows():
        question = row['question']
        expected = row['expected_answer']

        # Get RAG prediction
        predicted = rag_function(question, verbose=False)

        # Get LLM judge verdict
        verdict = judge_chain.invoke({
            "question": question,
            "expected": expected,
            "predicted": predicted
        })

        # Parse verdict and reason
        lines = verdict.strip().split('\n')
        decision_line = lines[0] if len(lines) > 0 else ""
        reason_line = lines[1] if len(lines) > 1 else ""

        is_correct = "true" in decision_line.lower()
        reason = reason_line.replace("Reason:", "").strip()

        results.append({
            'question': question,
            'expected_answer': expected,
            'rag_answer': predicted,
            'decision': is_correct,
            'reason': reason
        })

        if is_correct:
            correct_count += 1

        # Print progress
        status = "✓" if is_correct else "✗"
        print(f"{status} [{idx+1}/{total_count}] {question[:50]}...")
        print(f"   Reason: {reason}\n")

    # Calculate accuracy
    accuracy = correct_count / total_count if total_count else 0.0  # avoid division by zero on an empty test set

    # Print summary
    print("\n" + "="*80)
    print("EVALUATION RESULTS")
    print("="*80)
    print(f"Total Tests:     {total_count}")
    print(f"Correct:         {correct_count}")
    print(f"Incorrect:       {total_count - correct_count}")
    print(f"Accuracy:        {accuracy:.1%}")
    print("="*80 + "\n")

    # Save detailed results
    results_df = pd.DataFrame(results)
    results_df.to_csv('llm_judge_results.csv', index=False)
    print("✓ Detailed results saved to 'llm_judge_results.csv'")
    print("  Columns: question | expected_answer | rag_answer | decision | reason\n")

    return accuracy
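
The verdict parsing above takes the first line of the judge output and checks whether it contains the substring "true". That can misfire if the model adds a preamble or extra words to the decision line. A stricter parser could look for the labeled lines explicitly. This is a sketch; `parse_verdict` is a hypothetical helper that treats a missing decision as incorrect.

```python
import re
from typing import Tuple

_DECISION_RE = re.compile(r"^\s*Decision:\s*(true|false)\b", re.IGNORECASE | re.MULTILINE)
_REASON_RE = re.compile(r"^\s*Reason:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def parse_verdict(verdict: str) -> Tuple[bool, str]:
    """Extract (is_correct, reason) from the judge output.

    A missing or malformed 'Decision:' line counts as incorrect.
    """
    decision = _DECISION_RE.search(verdict)
    reason = _REASON_RE.search(verdict)
    is_correct = decision is not None and decision.group(1).lower() == "true"
    reason_text = reason.group(1).strip() if reason else ""
    return is_correct, reason_text
```

Inside the evaluation loop, `is_correct, reason = parse_verdict(verdict)` could replace the manual line splitting.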

In [93]:
accuracy = evaluate_with_llm_judge(
    rag_function=rag_pipeline,
    test_csv_path='/content/references.csv'
)

print(f"Final Accuracy: {accuracy:.1%}")

Evaluating 10 test cases...

🔍 'What does Traton SE specialize in?...' → Companies: ['Traton SE']
🔍 'Provide a comprehensive explanation of the core bu...' → Companies: ['Traton SE']
🔍 'Describe the brand architecture of Traton SE, incl...' → Companies: ['Traton SE']
🔍 'Explain the corporate structure and ownership cont...' → Companies: ['Traton SE']
🔍 'Identify Traton SE's geographic and production foo...' → Companies: ['Traton SE']
🔍 'Outline the target customer segments and market po...' → Companies: ['Traton SE']
✓ Retrieved 1 unique documents

✓ [1/10] What does Traton SE specialize in?...
   Reason: The predicted answer preserves the core: TRATON SE manufactures commercial vehicles worldwide, and it also provides accurate extra details (spare parts/services, RIO platform, financial services, and brand lineup).

🔍 'Which company operates in the aviation industry?...' → Broad search
🔍 'Which company operates in the aviation industry, c...' 

TypeError: cannot unpack non-iterable float object

In [95]:
results_df = pd.read_csv('llm_judge_results.csv')
results_df.head()

Unnamed: 0,question,expected_answer,rag_answer,decision,reason
0,What does Traton SE specialize in?,Traton SE manufactures commercial vehicles wor...,Traton SE specializes in manufacturing commerc...,True,The predicted answer preserves the core: TRATO...
1,Which company operates in the aviation industry?,Deutsche Lufthansa AG operates in the aviation...,- Deutsche Lufthansa AG: Described as an aviat...,True,It correctly identifies Deutsche Lufthansa AG ...
2,What type of products does 2G Energy AG produce?,2G Energy AG produces combined heat and power ...,2G Energy AG produces combined heat and power ...,True,It correctly states that 2G Energy AG produces...
3,Name a company that is involved in energy tech...,Siemens Energy AG is involved in energy techno...,Siemens Energy AG. It is described as an energ...,True,The predicted answer correctly names Siemens E...
4,Which company is involved in the manufacturing...,MTU Aero Engines AG manufactures aircraft engi...,"MTU Aero Engines AG. It develops, manufactures...",True,It correctly identifies MTU Aero Engines AG an...
