**Notebook 03 is a "run-once" setup**

- 📝 NOTEBOOK 3 - SETUP ONLY
- ✅ LLM client configured
- ✅ Prompt templates defined
- ✅ Answer generator ready

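No model or index files are saved, so this notebook needs to run once per session. The pipeline test cell further down does write JSON result files to its output directory.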

# LLM Response Generation

**Why we're doing this:**
We take the retrieved document chunks and use a language model to generate coherent answers from them.

**What we're doing:**

- Setting up first prototype - done
- Setting up the LLM client (Groq/Llama) - done
- Creating prompt templates for TRL questions - done
- Generating answers from retrieved context - done 

In [1]:
# PERMANENT WORKING IMPORT - USE THIS EVERYWHERE
import sys
import os
import importlib.util

def _load_component(name, base_dir):
    """Load rag_components/<name>.py from base_dir and return the module"""
    path = os.path.join(base_dir, 'rag_components', f'{name}.py')
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

def import_rag_components():
    """Import RAG components"""
    current_dir = os.getcwd()
    
    # Load each component file by path
    retriever_module = _load_component('retriever', current_dir)
    query_interface_module = _load_component('query_interface', current_dir)
    answer_generator_module = _load_component('answer_generator', current_dir)
    
    return (retriever_module.DocumentAwareRetriever, 
            query_interface_module.SimpleQueryInterface,
            answer_generator_module.RAGAnswerGenerator)

# Import the components
DocumentAwareRetriever, SimpleQueryInterface, RAGAnswerGenerator = import_rag_components()
print("🎉 COMPONENTS IMPORTED SUCCESSFULLY!")

# Continue with code
VECTOR_INDEX_PATH = "../../04_models/vector_index"
retriever = DocumentAwareRetriever(VECTOR_INDEX_PATH)
query_interface = SimpleQueryInterface(retriever)
answer_generator = RAGAnswerGenerator(query_interface)
print("✅ Generation pipeline ready!")

🎉 COMPONENTS IMPORTED SUCCESSFULLY!
✓ TF-IDF retriever loaded successfully
✓ Template-based RAG answer generator initialized
✅ Generation pipeline ready!


In [2]:
%pip install groq

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
# CELL: LLM Client Setup
import os
from groq import Groq
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Initialize Groq client
def setup_groq_client():
    """Set up and return Groq client with error handling"""
    api_key = os.getenv('GROQ_API_KEY')
    
    if not api_key:
        raise ValueError("❌ GROQ_API_KEY not found in environment variables")
    
    client = Groq(api_key=api_key)
    print("✅ Groq client initialized successfully")
    return client

# Test the client
try:
    groq_client = setup_groq_client()
    print("🎉 LLM client ready for integration!")
except Exception as e:
    print(f"❌ Failed to initialize LLM client: {e}")

✅ Groq client initialized successfully
🎉 LLM client ready for integration!


In [4]:
# CELL: Test LLM Connection
# Why: Verify Groq API works and model responds correctly
# What: Send simple test query to confirm setup is functional
def test_llm_connection():
    try:
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",  # Fast, free model for testing
            messages=[{"role": "user", "content": "Reply only with 'API connected'"}],
            max_tokens=10,
            temperature=0.1
        )
        print(f"✅ LLM Connected: {response.choices[0].message.content}")
        return True
    except Exception as e:
        print(f"❌ LLM Failed: {e}")
        return False

test_llm_connection()

✅ LLM Connected: API connected


True

In [5]:
# CELL: Integrate with Your Generator
def generate_with_llm(query, context):
    """Generate answer using Groq/Llama"""
    prompt = f"""
    Based on the following context, answer the user's question.
    
    Context: {context}
    
    Question: {query}
    
    Answer:
    """
    
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # same model as the connection test and pipeline cells
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.3
    )
    
    return response.choices[0].message.content

print("🚀 LLM integration code ready!")

🚀 LLM integration code ready!
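
The same call can be exercised without network access if the client is passed in as a parameter. The sketch below is a variant written under assumptions, not the notebook's own function: `generate_with_client`, `build_basic_prompt`, and the stub classes are hypothetical names introduced here. Only the `chat.completions.create(...)` call shape and its arguments come from the cell above.

```python
from types import SimpleNamespace

DEFAULT_MODEL = "llama-3.1-8b-instant"  # model name used in the connection test above


def build_basic_prompt(query, context):
    """Hypothetical helper: same wording as the prompt in generate_with_llm."""
    return (
        "Based on the following context, answer the user's question.\n\n"
        f"Context: {context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


def generate_with_client(client, query, context, model=DEFAULT_MODEL):
    """Send the prompt through any client exposing chat.completions.create."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_basic_prompt(query, context)}],
        max_tokens=500,
        temperature=0.3,
    )
    return response.choices[0].message.content


class _StubCompletions:
    """Offline stand-in that records the request and returns a canned reply."""

    def __init__(self):
        self.last_request = None

    def create(self, **kwargs):
        self.last_request = kwargs
        message = SimpleNamespace(content="stub answer")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def make_stub_client():
    """Build an object shaped like the real client, for tests only."""
    completions = _StubCompletions()
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


stub = make_stub_client()
print(generate_with_client(stub, "What is TRL?", "TRL means technology readiness level."))
```

Passing the real `groq_client` instead of the stub should behave like `generate_with_llm`, assuming the client was initialized successfully earlier in the session.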


In [None]:
# CELL: Universal Prompt Template with Patent Definitions
# Why: Single template that adapts to TRL, patent, and regular queries automatically
# What: Smart template that detects when to include maturity analysis AND patent definitions

UNIVERSAL_PROMPT_TEMPLATE = """
CONTEXT:
{context}

USER QUESTION:
{question}

ANALYSIS INSTRUCTIONS:
1. Provide a comprehensive answer based strictly on the context provided
2. Cite specific sources for each key point using [Source: filename]
3. If the context is insufficient, acknowledge what cannot be answered

{trl_section}
{patent_section}
{startup_section}

ADDITIONAL GUIDELINES:
- For technology maturity questions: assess development stage and transition evidence
- For patent questions: consider jurisdiction and document type implications
- For trend questions: identify velocity, drivers, and key players  
- For forecasting: distinguish near-term vs long-term developments
- For descriptive questions: provide specific examples and entities

ANSWER:
"""

def build_smart_prompt(question, context):
    """Build adaptive prompt that includes TRL and patent guidance only when needed"""
    
    # Detect if this is a technology maturity question
    maturity_keywords = ['trl', 'mature', 'transition', 'academy to application', 
                        'commercial', 'moving from academy', 'readiness', 'development stage']
    
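    # Detect if this is a patent-related question
    # NOTE: keywords are matched as substrings, so short codes such as 'ip', 'ep',
    # 'us' and 'wo' also match inside ordinary words like 'startups' or 'work'
    patent_keywords = ['patent', 'intellectual property', 'ip', 'jurisdiction', 'ep', 'us', 'wo',
                      'kind', 'a1', 'b2', 'filing', 'protection', 'patent office', 'lens']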
    
    # Detect if this is a startup-related question
    startup_keywords = ['startup', 'startups', 'company', 'companies', 'venture', 'business', 
                       'funding', 'investment', 'series a', 'series b', 'series c', 'backed']
    
    question_lower = question.lower()
    is_maturity_question = any(keyword in question_lower for keyword in maturity_keywords)
    is_patent_question = any(keyword in question_lower for keyword in patent_keywords)
    is_startup_question = any(keyword in question_lower for keyword in startup_keywords)
    
    # Include TRL section only for maturity questions
    if is_maturity_question:
        trl_section = """
TECHNOLOGY MATURITY ASSESSMENT:
- When discussing technology readiness, reference these stages:
  * Research Phase (TRL 1-4): Basic research, lab validation
  * Development Phase (TRL 5-6): Prototyping, testing  
  * Commercialization Phase (TRL 7-9): Deployment, scaling
- Assess current stage based on evidence in context
- Identify transition indicators and timelines
- Include a definition of TRL stages in the answer
"""
    else:
        trl_section = ""
    
    # Include patent definitions only for patent questions
    if is_patent_question:
        patent_section = """
PATENT DOCUMENT INTERPRETATION:
- JURISDICTION indicates geographic protection scope:
  * EP: European Patent Office (multiple European countries)
  * US: United States Patent and Trademark Office
  * WO: World Intellectual Property Organization (PCT international applications)
  
- KIND CODES indicate document type and status:
  * A1: Patent application with search report
  * A2: Patent application without search report  
  * A3: Search report published separately
  * B1: Granted patent (examined and approved)
  * B2: Amended/revised granted patent
  
- Consider jurisdiction for market focus and protection scope
- Use kind codes to distinguish between applications (A) and granted patents (B)
"""
    else:
        patent_section = ""
    
    # Include startup guidance only for startup questions
    if is_startup_question:
        startup_section = """
CRITICAL INSTRUCTIONS FOR STARTUP QUERIES:
1. **EXTRACT ALL SPECIFIC STARTUP/COMPANY NAMES** mentioned in the context
2. **FOCUS ON STARTUP DATABASES**: Pay special attention to sections from "Automotive Startup Profiles & Tracker" and "Automotive Industry Startups to Watch in 2025"
3. **FOR EACH STARTUP FOUND**:
   * State the company name clearly and prominently
   * Describe their primary technology or business focus
   * Include location information if available
   * Mention any funding details (rounds raised, investors)
   * Note their automotive/AI specialization
4. **REQUIRED ANSWER STRUCTURE**:
   - Start with a summary of findings
   - Then provide a CLEAR, NUMBERED LIST of startups
   - Format: "1. **Company Name**: [description] [Source: filename]"
   - Cite the specific source file for each piece of information
5. **IF STARTUPS EXIST IN CONTEXT BUT AREN'T EXPLICITLY MENTIONED**, still extract them
6. **IF NO STARTUPS ARE FOUND**, clearly state: "No specific startup companies were found in the available documents."
7. **PRIORITIZE INFORMATION FROM STARTUP DATABASES** over general reports when answering startup questions

EXAMPLE FORMAT:
"Based on the startup databases, I found these automotive AI companies:

1. **Company X**: Develops AI perception systems for autonomous vehicles. Based in Berlin. [Source: Automotive Startup Profiles & Tracker]
2. **Company Y**: Specializes in battery management AI for electric vehicles. Raised $20M Series A. [Source: Automotive Industry Startups to Watch in 2025]"
"""
    else:
        startup_section = ""
    
    prompt = UNIVERSAL_PROMPT_TEMPLATE.format(
        context=context,
        question=question,
        trl_section=trl_section,
        patent_section=patent_section,
        startup_section=startup_section
    )
    
    return prompt

# Test the universal template
def test_universal_prompt():
    """Test that the template adapts to different question types"""
    
    test_context = "Sample context about technology development and patents..."
    
    print("🧪 TESTING UNIVERSAL PROMPT TEMPLATE:")
    print("=" * 50)
    
    # Test startup question
    regular_question = "Which startups work on AI for automotive?"
    regular_prompt = build_smart_prompt(regular_question, test_context)
    print("🔹 STARTUP QUESTION:")
    print(f"Question: {regular_question}")
    print("Includes TRL section:", "TECHNOLOGY MATURITY ASSESSMENT" in regular_prompt)
    print("Includes Patent section:", "PATENT DOCUMENT INTERPRETATION" in regular_prompt)
    # NOTE: the startup section is headed "CRITICAL INSTRUCTIONS FOR STARTUP QUERIES",
    # so the marker below never matches and this check (and the ones that follow) prints False
    print("Includes Startup section:", "STARTUP INFORMATION EXTRACTION" in regular_prompt)
    print("---")
    
    # Test TRL question  
    trl_question = "Which quantum computing research is moving from academy to application?"
    trl_prompt = build_smart_prompt(trl_question, test_context)
    print("🔹 TRL QUESTION:")
    print(f"Question: {trl_question}")
    print("Includes TRL section:", "TECHNOLOGY MATURITY ASSESSMENT" in trl_prompt)
    print("Includes Patent section:", "PATENT DOCUMENT INTERPRETATION" in trl_prompt)
    print("Includes Startup section:", "STARTUP INFORMATION EXTRACTION" in trl_prompt)
    print("---")
    
    # Test patent question
    patent_question = "What are the recent US patents in autonomous driving?"
    patent_prompt = build_smart_prompt(patent_question, test_context)
    print("🔹 PATENT QUESTION:")
    print(f"Question: {patent_question}")
    print("Includes TRL section:", "TECHNOLOGY MATURITY ASSESSMENT" in patent_prompt)
    print("Includes Patent section:", "PATENT DOCUMENT INTERPRETATION" in patent_prompt)
    print("Includes Startup section:", "STARTUP INFORMATION EXTRACTION" in patent_prompt)
    print("---")
    
    # Test combined question
    combined_question = "Which AI startups show commercial readiness with significant funding?"
    combined_prompt = build_smart_prompt(combined_question, test_context)
    print("🔹 COMBINED QUESTION:")
    print(f"Question: {combined_question}")
    print("Includes TRL section:", "TECHNOLOGY MATURITY ASSESSMENT" in combined_prompt)
    print("Includes Patent section:", "PATENT DOCUMENT INTERPRETATION" in combined_prompt)
    print("Includes Startup section:", "STARTUP INFORMATION EXTRACTION" in combined_prompt)
    
    return regular_prompt, trl_prompt, patent_prompt, combined_prompt

# Run test
regular_prompt, trl_prompt, patent_prompt, combined_prompt = test_universal_prompt()

print("\n" + "=" * 50)
print("✅ Universal prompt template ready!")
print("✅ Automatically includes TRL guidance for maturity questions")
print("✅ Automatically includes patent definitions for IP questions")
print("✅ Automatically includes startup extraction for company questions")
print("✅ Single template adapts to all query types")

🧪 TESTING UNIVERSAL PROMPT TEMPLATE:
🔹 STARTUP QUESTION:
Question: Which startups work on AI for automotive?
Includes TRL section: False
Includes Patent section: True
Includes Startup section: False
---
🔹 TRL QUESTION:
Question: Which quantum computing research is moving from academy to application?
Includes TRL section: True
Includes Patent section: False
Includes Startup section: False
---
🔹 PATENT QUESTION:
Question: What are the recent US patents in autonomous driving?
Includes TRL section: False
Includes Patent section: True
Includes Startup section: False
---
🔹 COMBINED QUESTION:
Question: Which AI startups show commercial readiness with significant funding?
Includes TRL section: True
Includes Patent section: False
Includes Startup section: False

✅ Universal prompt template ready!
✅ Automatically includes TRL guidance for maturity questions
✅ Automatically includes patent definitions for IP questions
✅ Automatically includes startup extraction for company
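
The recorded output shows the patent section switching on for the startup question. One plausible cause, given the keyword list, is that short codes such as `us` and `wo` are matched as substrings (`us` occurs in "startups", `wo` in "work"). A minimal sketch of whole-word matching follows. `matches_any_keyword` is a hypothetical helper, not part of the notebook, and the keyword list is a shortened copy of the one in `build_smart_prompt`.

```python
import re


def matches_any_keyword(question, keywords):
    """Return True if any keyword occurs in the question as a whole word or phrase."""
    text = question.lower()
    for keyword in keywords:
        # \b anchors the match at word boundaries, so 'us' no longer hits 'startups'
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, text):
            return True
    return False


patent_keywords = ['patent', 'intellectual property', 'ip', 'jurisdiction', 'ep', 'us', 'wo']

startup_q = "Which startups work on AI for automotive?"
patent_q = "What are the recent US patents in autonomous driving?"

print(any(k in startup_q.lower() for k in patent_keywords))  # substring check: True
print(matches_any_keyword(startup_q, patent_keywords))       # whole-word check: False
print(matches_any_keyword(patent_q, patent_keywords))        # whole-word check: True
```

Whole-word matching is stricter in both directions: `patent` no longer matches `patents`, so plural forms would need to be listed, and `us` still matches the pronoun in a question like "show us the trends".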

# Response Quality Setup

**Why we're doing this:** 
Ensure answers are relevant and properly cite sources.

**What we're doing:**

- Checking that the pipeline runs end to end and that the LLM integration and prompt template return relevant, source-cited answers.
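
One way to make "properly cite sources" checkable is to pull the `[Source: ...]` tags requested by the prompt template out of an answer and compare them with the names of the retrieved sources. The sketch below assumes the model follows that tag format exactly. `extract_citations` and `check_citations` are hypothetical helpers, and the sample answer and source names are invented for illustration.

```python
import re

CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def extract_citations(answer):
    """Return the source names cited in an answer, in order of first appearance."""
    seen = []
    for name in CITATION_PATTERN.findall(answer):
        name = name.strip()
        if name not in seen:
            seen.append(name)
    return seen


def check_citations(answer, retrieved_names):
    """Split cited names into those backed by retrieval and those that are not."""
    cited = extract_citations(answer)
    known = set(retrieved_names)
    return {
        "cited": cited,
        "unknown": [name for name in cited if name not in known],
        "uncited": [name for name in retrieved_names if name not in cited],
    }


sample_answer = (
    "Perception stacks are maturing [Source: Report A]. "
    "Funding is concentrated in Europe [Source: Report C]."
)
report = check_citations(sample_answer, ["Report A", "Report B"])
print(report)
```

An entry under `unknown` suggests the model cited something that was not in the retrieved context; an entry under `uncited` only means a retrieved chunk went unused, which may be fine.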


In [8]:
# CELL: Test All User Queries with Dynamic Source Count & Startup Booster (aggressive filtering)
# Why: Validate pipeline performance with intelligent source retrieval and startup boosting
# What: Run all 8 user queries with dynamic k-value and startup file enhancement

import json
import os
from datetime import datetime

def determine_source_count(question):
    """Dynamically determine how many sources to retrieve based on question type"""
    question_lower = question.lower()
    
    if any(keyword in question_lower for keyword in ['summarize', 'trends', 'overview', 'comprehensive']):
        return 5  # More sources for comprehensive questions
    elif any(keyword in question_lower for keyword in ['which', 'list', 'show me']):
        return 4  # Medium for listing questions
    elif any(keyword in question_lower for keyword in ['specific', 'exact', 'precise']):
        return 2  # Fewer for very specific questions
    else:
        return 3  # Default

def format_source_name(source_file):
    """Convert file names to human-readable format for better UX"""
    name_mapping = {
        # Automotive Papers
        'a_benchmark_framework_for_AL_models_in_automotive_aerodynamics.txt': 'Benchmark Framework for AI Models in Automotive Aerodynamics',
        'AL_agents_in_engineering_design_a_multiagent_framework_for_aesthetic_and_aerodynamic_car_design.txt': 'AI Agents in Engineering Design: Multiagent Framework for Car Design',
        'automating_automotive_software_development_a_synergy_of_generative_AL_and_formal_methods.txt': 'Automating Automotive Software Development: Generative AI and Formal Methods',
        'automotive-software-and-electronics-2030-full-report.txt': 'Automotive Software and Electronics 2030 Report',
        'drive_disfluency-rich_synthetic_dialog_data_generation_framework_for_intelligent_vehicle_environments.txt': 'DRIVE Framework: Synthetic Dialog Data for Intelligent Vehicles',
        'Embedded_acoustic_intelligence_for_automotive_systems.txt': 'Embedded Acoustic Intelligence for Automotive Systems',
        'enhanced_drift_aware_computer_vision_achitecture_for_autonomous_driving.txt': 'Enhanced Drift-Aware Computer Vision for Autonomous Driving',
        'Gen_AL_in_automotive_applications_challenges_and_opportunities_with_a_case_study_on_in-vehicle_experience.txt': 'Generative AI in Automotive: Applications and Challenges',
        'generative_AL_for_autonomous_driving_a_review.txt': 'Generative AI for Autonomous Driving: A Review',
        'leveraging_vision_language_models_for_visual_grounding_and_analysis_of_automative_UI.txt': 'Vision-Language Models for Automotive UI Analysis',
        
        # Tech Reports
        'bog_ai_value_2025.txt': 'Boston Consulting Group: AI Value Creation 2025',
        'mckinsey_tech_trends_2025.txt': 'McKinsey Technology Trends Outlook 2025',
        'wef_emerging_tech_2025.txt': 'World Economic Forum: Emerging Technologies 2025',
        
        # New Processed Files
        'autotechinsight_startups_processed.txt': 'Automotive Startup Profiles & Tracker',
        'seedtable_startups_processed.txt': 'Automotive Industry Startups to Watch in 2025',
        'automotive_papers_processed.txt': 'Automotive Research Papers Database',
        'automotive_patents_processed.txt': 'Automotive Technology Patents Database',
    }
    return name_mapping.get(source_file, source_file.replace('.txt', '').replace('_', ' ').title())

# Define user queries - UPDATED to include patent and automotive-specific questions
USER_QUERIES = {
    1: "Which startups work on AI for automotive?",
    2: "Summarize the latest research on autonomous driving.",
    3: "What are the latest tech trends in development of AI agents",
    4: "Summarize the key pain points/use cases in automotive AI.",
    5: "Show me recent patents on AI for automotive.",
    6: "Which technologies are likely to mature next year?",
    7: "Which AI research topics in automotive are growing fastest?",
    8: "Which automotive technologies are moving from academy to application?"
}

def test_complete_pipeline(question, query_id):
    """Test the full RAG pipeline with dynamic source count and startup boosting"""
    print(f"🧪 QUERY {query_id}: '{question}'")
    print("=" * 60)
    
    try:
        # Step 1: Determine optimal source count
        k = determine_source_count(question)
        print(f"1. 🔍 Retrieving documents (k={k})...")
        
        # Step 2: Retrieve documents
        retrieved_data = retriever.retrieve_with_sources(question, k=k)
        
        # 🚀 STARTUP BOOSTER: FORCE-INCLUDE startup files for startup-related queries
        startup_boost_applied = False
        if any(keyword in question.lower() for keyword in ['startup', 'company', 'venture', 'business', 'firm']):
            print("   🚀 FORCING STARTUP FILES for this query...")
            
            # FIRST: Get startup-specific results with expanded query
            expanded_query = question + " automotive AI technology machine learning companies"
            startup_data = retriever.retrieve_with_sources(expanded_query, k=4)
            
            # Filter to ONLY include our startup files
            startup_items = []
            for item in startup_data:
                if any(startup_file in item['source_file'] for startup_file in ['autotechinsight_startups_processed.txt', 'seedtable_startups_processed.txt']):
                    # Check if this content is already in retrieved_data
                    is_duplicate = any(
                        item['content'][:100] == existing['content'][:100]  # Check first 100 chars for duplicates
                        for existing in retrieved_data
                    )
                    if not is_duplicate:
                        startup_items.append(item)
            
            # SECOND: If we still don't have enough startup results, force a generic search on startup files
            if len(startup_items) < 2:
                print("   🔍 Force-searching startup files directly...")
                # Search specifically in startup files
                for startup_file in ['autotechinsight_startups_processed.txt', 'seedtable_startups_processed.txt']:
                    # Create a query that should match startup content
                    generic_startup_query = "automotive AI technology startup company"
                    force_results = retriever.retrieve_with_sources(generic_startup_query, k=3)
                    
                    for item in force_results:
                        if startup_file in item['source_file']:
                            # Check for duplicates
                            is_duplicate = any(
                                item['content'][:100] == existing['content'][:100]
                                for existing in retrieved_data + startup_items
                            )
                            if not is_duplicate:
                                startup_items.append(item)
            
            # Add startup items to the beginning of results
            if startup_items:
                # Take up to 2 startup items (prioritize them)
                startup_to_add = startup_items[:2]
                retrieved_data = startup_to_add + retrieved_data
                retrieved_data = retrieved_data[:k]  # Keep original k limit
                startup_boost_applied = True
                
                # Debug info
                startup_files = set(item['source_file'] for item in startup_to_add)
                print(f"   ✅ FORCED {len(startup_to_add)} startup chunks into results from:")
                for file in startup_files:
                    readable = format_source_name(file)
                    count = sum(1 for item in startup_to_add if item['source_file'] == file)
                    print(f"      - {readable}: {count} chunks")
            else:
                print("   ⚠️ WARNING: Could not find any startup content despite forcing")
        
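        # 🆕 PATENT BOOSTER: Enhance results for patent-related queries
        # NOTE: substring matching means 'ep', 'us' and 'wo' also match inside words
        # such as 'startups' and 'work', so this booster runs for Query 1 as well
        patent_boost_applied = False
        if any(keyword in question.lower() for keyword in ['patent', 'jurisdiction', 'ep', 'us', 'wo', 'intellectual property']):
            print("   📜 Boosting patents file for this query...")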
            # Get additional results focusing on patents
            patent_data = retriever.retrieve_with_sources(question + " patents intellectual property", k=2)
            
            # Filter to only include patents file and avoid duplicates
            patent_items = []
            for item in patent_data:
                if 'automotive_patents_processed.txt' in item['source_file']:
                    # Check if this content is already in retrieved_data
                    is_duplicate = any(
                        item['content'] == existing['content'] 
                        for existing in retrieved_data
                    )
                    if not is_duplicate:
                        patent_items.append(item)
            
            # Add patent items to the beginning of results
            if patent_items:
                retrieved_data = patent_items + retrieved_data
                retrieved_data = retrieved_data[:k]  # Keep original k limit
                patent_boost_applied = True
                print(f"   ✅ Added {len(patent_items)} patent-specific results")
        
        print(f"   ✅ Found {len(retrieved_data)} relevant chunks")
        
        # Step 3: Format context with human-readable source names
        context = "\n\n".join([
            f"Source: {format_source_name(item['source_file'])} | Type: {item['doc_type']}\nContent: {item['content']}"
            for item in retrieved_data
        ])
        
        # Step 4: Build smart prompt (now includes patent definitions when needed)
        print("2. 📝 Building prompt...")
        prompt = build_smart_prompt(question, context)
        
        # Step 5: Generate answer using LLM
        print("3. 🤖 Generating answer with LLM...")
        response = groq_client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=500,
            temperature=0.3
        )
        
        answer = response.choices[0].message.content
        
        # Step 6: Prepare results
        result = {
            'query_id': query_id,
            'question': question,
            'answer': answer,
            'sources': retrieved_data,
            'retrieved_chunks': len(retrieved_data),
            'source_count_used': k,
            'startup_boost_applied': startup_boost_applied,
            'patent_boost_applied': patent_boost_applied,  # Track if patent booster was used
            'timestamp': datetime.now().isoformat(),
            'model_used': 'llama-3.1-8b-instant'
        }
        
        # Display results
        print("4. 📊 RESULTS:")
        print(f"ANSWER: {answer}")
        print(f"SOURCES: {len(retrieved_data)} documents (k={k})")
        
        # Show boost indicators
        boost_info = []
        if startup_boost_applied:
            boost_info.append("🚀 Startup boost")
        if patent_boost_applied:
            boost_info.append("📜 Patent boost")
        if boost_info:
            print(f"   {' + '.join(boost_info)} applied")
            
        for i, item in enumerate(retrieved_data):
            readable_name = format_source_name(item['source_file'])
            # Add boost indicators to source listing
            boost_indicator = ""
            if any(startup_file in item['source_file'] for startup_file in ['autotechinsight_startups_processed.txt', 'seedtable_startups_processed.txt']) and startup_boost_applied:
                boost_indicator = "🚀 "
            elif 'automotive_patents_processed.txt' in item['source_file'] and patent_boost_applied:
                boost_indicator = "📜 "
                
            print(f"   {i+1}. {boost_indicator}{readable_name} (Score: {item['similarity_score']:.3f})")
        
        print("✅ Query completed successfully!\n")
        return result
        
    except Exception as e:
        print(f"❌ Pipeline error: {e}")
        import traceback
        traceback.print_exc()
        return None

# Create output directory
output_dir = "../../07_testsdemo/test_outputs/demo_results"
os.makedirs(output_dir, exist_ok=True)

# Test all queries
print("🚀 TESTING ALL USER QUERIES WITH DYNAMIC SOURCE COUNT & MULTI-BOOSTER SYSTEM")
print("Note: Now includes patent boosting and updated query set\n")

all_results = []
successful_queries = 0

for query_id, question in USER_QUERIES.items():
    result = test_complete_pipeline(question, query_id)
    if result:
        all_results.append(result)
        successful_queries += 1
        
        # Save individual query result
        individual_file = f"{output_dir}/user_query_{query_id}_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
        with open(individual_file, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=2, ensure_ascii=False)

# Save consolidated results
if all_results:
    consolidated_file = f"{output_dir}/all_user_queries_with_multi_boost_{datetime.now().strftime('%Y%m%d_%H%M')}.json"
    with open(consolidated_file, 'w', encoding='utf-8') as f:
        json.dump(all_results, f, indent=2, ensure_ascii=False)
    
    print("🎉 TESTING COMPLETE!")
    print(f"✅ Successful queries: {successful_queries}/{len(USER_QUERIES)}")
    print(f"📁 Individual results saved to: {output_dir}/")
    print(f"📊 Consolidated results: {consolidated_file}")
    
    # Summary with source count and boost info
    print("\n📈 QUERY PERFORMANCE SUMMARY:")
    for result in all_results:
        boost_info = []
        if result['startup_boost_applied']:
            boost_info.append("🚀")
        if result['patent_boost_applied']:
            boost_info.append("📜")
        boost_str = " " + "".join(boost_info) if boost_info else ""
        
        print(f"  Q{result['query_id']}: k={result['source_count_used']}, {len(result['sources'])} sources{boost_str}, {len(result['answer'])} chars")
        
else:
    print("💥 No queries completed successfully")

print("\n📝 Enhanced pipeline with patent definitions and multi-booster system ready!")

🚀 TESTING ALL USER QUERIES WITH DYNAMIC SOURCE COUNT & MULTI-BOOSTER SYSTEM
Note: Now includes patent boosting and updated query set

🧪 QUERY 1: 'Which startups work on AI for automotive?'
1. 🔍 Retrieving documents (k=4)...
   🚀 FORCING STARTUP FILES for this query...
   🔍 Force-searching startup files directly...
   ✅ FORCED 2 startup chunks into results from:
      - Automotive Startup Profiles & Tracker: 2 chunks
   📜 Boosting patents file for this query...
   ✅ Found 4 relevant chunks
2. 📝 Building prompt...
3. 🤖 Generating answer with LLM...
4. 📊 RESULTS:
ANSWER: Based on the provided context, the following startups work on AI for automotive:

1. **2021.AI** (Europe)
	* Primary focus area or technology specialization: AI acceleration solution for organizations
	* Location and key business details: Europe, specifically Austria
	* Funding status: Not available in the context
	* Notable products or services: GRACE AI platform for standardizing processes
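
The booster step in the log above (prepend forced chunks, skip near-duplicates, trim back to `k`) can be isolated as a pure function. This is a sketch under assumptions, not the notebook's code: `merge_boosted` is a hypothetical name, chunks are assumed to be dicts with a `content` key as in the pipeline cell, and duplicates are detected by the first 100 characters as the startup booster does. Unlike the pipeline cell, a boosted chunk that already appears in the base results is moved to the front rather than skipped.

```python
def merge_boosted(base_results, boosted_results, k, max_boost=2, prefix_len=100):
    """Prepend up to max_boost boosted chunks, drop near-duplicates, keep k items."""
    seen = set()
    merged = []

    def add(item):
        key = item['content'][:prefix_len]  # cheap duplicate key
        if key not in seen:
            seen.add(key)
            merged.append(item)

    # Boosted chunks go first, capped at max_boost distinct items
    boosted_added = 0
    for item in boosted_results:
        if boosted_added >= max_boost:
            break
        before = len(merged)
        add(item)
        boosted_added += len(merged) - before

    # Then the regular retrieval results, skipping anything already present
    for item in base_results:
        add(item)

    return merged[:k]  # keep the original k limit


base = [{'content': 'paper A'}, {'content': 'paper B'}, {'content': 'paper C'}]
boost = [{'content': 'startup X'}, {'content': 'paper B'}, {'content': 'startup Y'}]
print([c['content'] for c in merge_boosted(base, boost, k=3)])
```

Keeping the merge separate from retrieval makes the trade-off visible: every boosted chunk that is added pushes one regular result out of the final `k`.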

**Embedding Model check**

In [13]:
# SIMPLE DIAGNOSIS - NO NUMPY
print("🔍 SIMPLE EMBEDDING DIAGNOSIS")

# 1. Check what embedding model you're using
print("\n1. Checking embedding model...")
# Look in your notebook 02 - what model did you use?
# Common ones: 'all-MiniLM-L6-v2', 'BAAI/bge-small-en', 'sentence-transformers/...'

# 2. Check actual query results
print("\n2. Checking query results...")

test_query = "autonomous vehicles AI"
print(f"Query: '{test_query}'")

results = retriever.retrieve_with_sources(test_query, k=3)

if not results:
    print("❌ No results at all!")
else:
    print(f"✅ Found {len(results)} results")
    
    for i, doc in enumerate(results, 1):
        score = doc.get('similarity_score', 0)
        source = doc.get('source_file', 'unknown')
        
        print(f"\nResult {i}:")
        print(f"  Source: {source}")
        print(f"  Score: {score:.3f}")
        
        # Check content relevance
        content_lower = doc['content'].lower()
        
        # Check for key terms
        # NOTE: the " ai " test needs a space on both sides, so forms such as
        # "AI-induced" or "AI." are reported as missing
        checks = [
            ("autonomous", "autonomous" in content_lower),
            ("vehicle", "vehicle" in content_lower or "car" in content_lower),
            ("AI", " ai " in content_lower or "artificial intelligence" in content_lower),
            ("self-driving", "self-driving" in content_lower or "self driving" in content_lower)
        ]
        
        print("  Contains:")
        for term, found in checks:
            if found:
                print(f"    ✓ {term}")
            else:
                print(f"    ✗ {term}")
        
        # Preview
        preview = doc['content'][:150].replace('\n', ' ')
        print(f"  Preview: {preview}...")

# 3. Check your vector store type
print("\n3. Checking vector store type...")
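try:
    # Check if ChromaDB is installed
    # NOTE: a successful import only shows the library is available; the first
    # cell reports that a TF-IDF retriever was loaded
    import chromadb
    print("✅ Using ChromaDB")
    
    # Count documents, using the index path defined in the first cell
    client = chromadb.PersistentClient(path=VECTOR_INDEX_PATH)
    collection = client.get_or_create_collection(name="documents")
    count = collection.count()
    print(f"   Documents in index: {count}")
    
except ImportError:
    try:
        # Check if FAISS is installed
        import faiss
        print("✅ Using FAISS")
    except ImportError:
        print("❓ Unknown vector store")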

🔍 SIMPLE EMBEDDING DIAGNOSIS

1. Checking embedding model...

2. Checking query results...
Query: 'autonomous vehicles AI'
✅ Found 3 results

Result 1:
  Source: automotive_papers_processed.txt
  Score: 0.539
  Contains:
    ✓ autonomous
    ✓ vehicle
    ✓ AI
    ✗ self-driving
  Preview: RESEARCH PAPER #304:   Title: The Road to Autonomy: A Systematic Review Through AI in Autonomous Vehicles   Year Published: 2025   Authors: Adrian Dom...

Result 2:
  Source: automotive_papers_processed.txt
  Score: 0.504
  Contains:
    ✓ autonomous
    ✗ vehicle
    ✗ AI
    ✗ self-driving
  Preview: RESEARCH PAPER #1702:   Title: Unintended Consequences: Investigating AI-Induced Fatalities in Autonomous System   Year Published: 2025   Authors: Mr....

Result 3:
  Source: automotive_patents_processed.txt
  Score: 0.458
  Contains:
    ✓ autonomous
    ✓ vehicle
    ✗ AI
    ✗ self-driving
  Preview: PATENT #1574:   Lens ID: 018-487-472-994-877   Jurisdiction: US   Kind:

In [14]:
# TEST QUERY EXPANSION MANUALLY
print("🧪 TESTING QUERY EXPANSION MANUALLY")

original_query = "autonomous vehicles AI"
expanded_queries = [
    "self-driving cars artificial intelligence",
    "automated vehicles machine learning", 
    "AI for driverless automobiles",
    "autonomous automotive technology"
]

all_results = []
for query in [original_query] + expanded_queries:
    results = retriever.retrieve_with_sources(query, k=2)
    all_results.extend(results)

# Remove duplicates
unique_results = []
seen = set()
for doc in all_results:
    key = doc['content'][:100]  # First 100 chars as ID
    if key not in seen:
        seen.add(key)
        unique_results.append(doc)

# The baseline of 3 is hard-coded from the earlier k=3 run; this cell queries with k=2
print("\nOriginal query found: 3 documents")
print(f"With expansion found: {len(unique_results)} unique documents")
print(f"Improvement: {((len(unique_results)-3)/3*100):.0f}% more documents!")

🧪 TESTING QUERY EXPANSION MANUALLY

Original query found: 3 documents
With expansion found: 10 unique documents
Improvement: 233% more documents!


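Query expansion finds more unique documents: 10 instead of 3, a 233% increase. Part of that increase is expected by construction, since five queries at k=2 are compared with a single query at k=3.
But similarity scores remain low (<0.7).
This suggests that embedding quality, not terminology, is the bottleneck.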
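
To judge query expansion on equal terms, the merged results can be compared against a single-query baseline with the same total retrieval budget. The sketch below is a minimal illustration, not the notebook's implementation: `merge_expanded_results` and `fake_retrieve` are hypothetical names, and `fake_retrieve` only stands in for `retriever.retrieve_with_sources` with made-up scores. It keeps the best score per document instead of the first one seen.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def merge_expanded_results(
    retrieve: Callable[[str, int], List[Doc]],
    queries: List[str],
    k: int,
) -> List[Doc]:
    """Run every query, dedupe by content prefix, keep the best score per document."""
    best: Dict[str, Doc] = {}
    for query in queries:
        for doc in retrieve(query, k):
            key = doc["content"][:100]  # same ad-hoc ID as the cell above
            if key not in best or doc["similarity_score"] > best[key]["similarity_score"]:
                best[key] = doc
    return sorted(best.values(), key=lambda d: d["similarity_score"], reverse=True)


def fake_retrieve(query: str, k: int) -> List[Doc]:
    """Hypothetical stand-in for retriever.retrieve_with_sources (invented scores)."""
    corpus = {
        "autonomous vehicles AI": [("doc A", 0.54), ("doc B", 0.50), ("doc C", 0.46)],
        "self-driving cars artificial intelligence": [("doc B", 0.61), ("doc D", 0.40)],
    }
    return [{"content": c, "similarity_score": s} for c, s in corpus.get(query, [])[:k]]


queries = ["autonomous vehicles AI", "self-driving cars artificial intelligence"]
baseline = fake_retrieve(queries[0], 4)                        # one query, budget of 4
merged = merge_expanded_results(fake_retrieve, queries, k=2)   # two queries x k=2 = 4
print(f"baseline: {len(baseline)} docs, expanded: {len(merged)} unique docs")
print([(d["content"], d["similarity_score"]) for d in merged])
```

With equal budgets, a gain in unique documents or in top scores reflects the expanded wording and not simply the number of retrieval calls.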

In [15]:
# DIAGNOSE EMBEDDING QUALITY
print("🔍 DIAGNOSING EMBEDDING QUALITY ISSUE")

# 1. Test with VERY SIMILAR text
test_pairs = [
    ("autonomous vehicles use AI", "self-driving cars use artificial intelligence"),
    ("electric vehicle battery", "EV battery technology"),
    ("automotive startup funding", "car company venture capital"),
    ("lidar sensor for cars", "light detection and ranging for automobiles")
]

print("\n1. Testing semantic similarity of paraphrases:")
# You need your embedding model here. If using sentence-transformers:
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the model once, outside the loop
model = SentenceTransformer('all-MiniLM-L6-v2')  # Change to your model

for text1, text2 in test_pairs:
    emb1 = model.encode(text1)
    emb2 = model.encode(text2)
    
    # Manual cosine similarity
    similarity = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    
    print(f"   '{text1[:20]}...' vs '{text2[:20]}...': {similarity:.3f}")
    
    if similarity < 0.7:
        print("   ⚠️ LOW: Model doesn't see these as similar!")
    else:
        print("   ✅ OK: Model recognizes similarity")

# 2. Check your actual embedding model
print("\n2. What embedding model are you using?")
# Look in notebook 02 where you created embeddings
# Possible issue: a small general-purpose model such as all-MiniLM-L6-v2 may handle domain-specific terms poorly

# 3. Test document-chunk similarity
print("\n3. Testing document-to-query similarity")
query = "autonomous vehicles AI"
results = retriever.retrieve_with_sources(query, k=1)

if results:
    doc_content = results[0]['content'][:500]  # First 500 chars
    print(f"Query: {query}")
    print(f"Top document preview: {doc_content[:200]}...")
    
    # Manually check overlap
    query_words = set(query.lower().split())
    doc_words = set(doc_content.lower().split())
    overlap = query_words.intersection(doc_words)
    
    print(f"\nWord overlap: {overlap}")
    print(f"Overlap ratio: {len(overlap)/len(query_words):.1%}")
    
    if len(overlap) > 0:
        print("✅ At least some word overlap")
    else:
        print("❌ NO word overlap - embedding model failing!")

🔍 DIAGNOSING EMBEDDING QUALITY ISSUE

1. Testing semantic similarity of paraphrases:


RuntimeError: Numpy is not available

**The problem**

- TF-IDF: bag-of-words, no semantic understanding
- Word2Vec/Transformers: semantic understanding, contextual meaning

**Why scores are low (<0.7) with TF-IDF**

TF-IDF gives high scores ONLY for exact word matches. For automotive AI:

- Query: "AI for autonomous vehicles"
- Document: "artificial intelligence in self-driving cars"
- TF-IDF score: LOW (no word overlap!)
- Embedding score: HIGH (semantic match!)
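
The zero-overlap case can be checked with the standard library alone. The sketch below uses raw term counts instead of full TF-IDF weights, which is a simplification: IDF weighting changes the size of non-zero scores, but a pair with no shared terms scores zero either way. `bow_cosine` is a hypothetical helper, and the `literal` sentence is invented for contrast.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors (no IDF weighting)."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "AI for autonomous vehicles"
paraphrase = "artificial intelligence in self-driving cars"  # pair from the text above
literal = "autonomous vehicles rely on AI"                   # invented contrast example

no_overlap = bow_cosine(query, paraphrase)
some_overlap = bow_cosine(query, literal)
print(f"paraphrase: {no_overlap:.3f}")
print(f"literal:    {some_overlap:.3f}")
```

The paraphrase shares no token with the query, so its score is exactly zero no matter how close the meaning is; only the sentence that repeats the query's words gets a non-zero score.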

In [17]:
# NO NUMPY NEEDED - TF-IDF vs EMBEDDINGS PROOF
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

print("🧪 TF-IDF vs EMBEDDINGS - SIMPLE PROOF")
print("=" * 60)

# Sample automotive texts
texts = [
    "autonomous vehicles use AI for perception",
    "self-driving cars employ artificial intelligence systems",
]

queries = [
    "AI for self-driving cars",
]

print("\n1. TF-IDF SCORES (Your current system):")
print("-" * 40)

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

for query in queries:
    query_vec = vectorizer.transform([query])
    
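    # Manual calculation without numpy
    scores = []
    for i in range(len(texts)):
        # Sparse row for this document
        doc_vec = tfidf_matrix[i]
        score = 0
        if query_vec.nnz > 0 and doc_vec.nnz > 0:
            # Partial dot product over whitespace-split query words.
            # Caveat: the vectorizer lowercases and splits on punctuation, so raw
            # tokens such as "AI" and "self-driving" are not in vocabulary_ and
            # are skipped; the scores below understate the true TF-IDF similarity.
            for word in query.split():
                if word in vectorizer.vocabulary_:
                    word_id = vectorizer.vocabulary_[word]
                    score += query_vec[0, word_id] * doc_vec[0, word_id]
        scores.append(score)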
    
    print(f"\nQuery: '{query}'")
    for i, (text, score) in enumerate(zip(texts, scores)):
        print(f"  Text {i+1}: {score:.3f} - '{text}'")
    
    print(f"\n  ❌ PROBLEM: 'AI' ≠ 'artificial intelligence' for TF-IDF")
    print(f"  Text 1 has 'AI', Text 2 has 'artificial intelligence'")
    print(f"  TF-IDF sees them as DIFFERENT words!")

print("\n\n2. WHY EMBEDDINGS ARE BETTER:")
print("-" * 40)
print("""
Embeddings understand SEMANTIC meaning:
- 'AI' and 'artificial intelligence' → SIMILAR vectors
- 'autonomous' and 'self-driving' → SIMILAR vectors  
- 'vehicle' and 'car' → SIMILAR vectors

Even with basic embeddings:
Query: "AI for self-driving cars"

Will match:
✓ "autonomous vehicles use AI for perception" 
✓ "self-driving cars employ artificial intelligence systems"

TF-IDF only matches:
✓ "autonomous vehicles use AI for perception" (has 'AI')
✗ "self-driving cars employ artificial intelligence systems" (no 'AI')
""")

print("\n" + "=" * 60)
print("🎯 CONCLUSION:")
print("Your low scores (<0.7) are because:")
print("1. TF-IDF = exact word matching only")
print("2. Automotive AI uses varied terminology")
print("3. Embeddings understand semantic similarity")
print("\n💡 SOLUTION: Switch to embeddings for 2-3x better results!")

🧪 TF-IDF vs EMBEDDINGS - SIMPLE PROOF

1. TF-IDF SCORES (Your current system):
----------------------------------------

Query: 'AI for self-driving cars'
  Text 1: 0.183 - 'autonomous vehicles use AI for perception'
  Text 2: 0.169 - 'self-driving cars employ artificial intelligence systems'

  ❌ PROBLEM: 'AI' ≠ 'artificial intelligence' for TF-IDF
  Text 1 has 'AI', Text 2 has 'artificial intelligence'
  TF-IDF sees them as DIFFERENT words!


2. WHY EMBEDDINGS ARE BETTER:
----------------------------------------

Embeddings understand SEMANTIC meaning:
- 'AI' and 'artificial intelligence' → SIMILAR vectors
- 'autonomous' and 'self-driving' → SIMILAR vectors  
- 'vehicle' and 'car' → SIMILAR vectors

Even with basic embeddings:
Query: "AI for self-driving cars"

Will match:
✓ "autonomous vehicles use AI for perception" 
✓ "self-driving cars employ artificial intelligence systems"

TF-IDF only matches:
✓ "autonomous vehicles use AI for perception" (has 'AI')
✗ "sel
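
One caveat about the TF-IDF scores above: the manual loop splits the query on whitespace, while `TfidfVectorizer` by default lowercases text and extracts tokens of two or more word characters. The sketch below reproduces only that tokenization step with the standard library (the regular expression is scikit-learn's documented default `token_pattern`; `vectorizer_style_tokens` is a hypothetical helper name) to show which query terms each approach can look up in the vocabulary.

```python
import re

# Default token pattern of scikit-learn's TfidfVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vectorizer_style_tokens(text: str) -> list:
    """Lowercase the text, then extract tokens the way the default analyzer does."""
    return TOKEN_PATTERN.findall(text.lower())


texts = [
    "autonomous vehicles use AI for perception",
    "self-driving cars employ artificial intelligence systems",
]
query = "AI for self-driving cars"

# Vocabulary as the vectorizer would build it from the two texts
vocabulary = {tok for text in texts for tok in vectorizer_style_tokens(text)}

whitespace_hits = [w for w in query.split() if w in vocabulary]
analyzer_hits = [t for t in vectorizer_style_tokens(query) if t in vocabulary]

print("whitespace split matches:", whitespace_hits)
print("analyzer-style matches:  ", analyzer_hits)
```

With whitespace splitting, only `for` and `cars` are found, which is why each text receives a single-term score. With analyzer-style tokens the query also matches `ai` in Text 1 and `self` and `driving` in Text 2, so a full TF-IDF cosine would rank the two texts differently from the near-tie printed above.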

**Test call after new embedding and retriever**

In [20]:
# IN NOTEBOOK 03 - USE THIS:
print("🔍 Testing NEW embedding-based retriever")

index_path = "../../04_models/vector_index"
retriever = DocumentAwareRetriever(index_path)

print(f"Retrieval method: {retriever.retrieval_method}")

# Use the correct attribute name (chunks_metadata, not chunks)
if hasattr(retriever, 'chunks_metadata'):
    print(f"Chunks loaded: {len(retriever.chunks_metadata)}")
elif hasattr(retriever, 'chunks'):
    print(f"Chunks loaded: {len(retriever.chunks)}")
else:
    print("⚠️ Could not find chunks attribute")

# Test
results = retriever.retrieve_with_sources("autonomous vehicles", k=2)

if results:
    for i, doc in enumerate(results, 1):
        print(f"{i}. Score: {doc['similarity_score']:.3f} | Source: {doc['source_file']}")
        print(f"   Preview: {doc['content'][:80]}...")
else:
    print("❌ No results found")

🔍 Testing NEW embedding-based retriever
❌ TF-IDF loading failed: [Errno 2] No such file or directory: '../../04_models/vector_index/tfidf_embeddings.pkl'
Retrieval method: none
Chunks loaded: 18717
❌ No results found
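
The output above shows the retriever still trying to load `tfidf_embeddings.pkl` and ending up with retrieval method `none`, even though the chunks load. Before constructing the retriever, it can help to list what the index directory actually contains. The sketch below is a generic check and not part of `DocumentAwareRetriever`: `list_index_files` and `missing_files` are hypothetical helpers, and `chunks_metadata.json` is a placeholder file name used only for the demo.

```python
import tempfile
from pathlib import Path
from typing import List


def list_index_files(index_dir: str) -> List[str]:
    """Return the sorted file names in index_dir, or an empty list if it is missing."""
    path = Path(index_dir)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())


def missing_files(index_dir: str, required: List[str]) -> List[str]:
    """Return the required file names that are not present in index_dir."""
    present = set(list_index_files(index_dir))
    return [name for name in required if name not in present]


# Demo on a temporary directory; in the notebook, pass index_path instead
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chunks_metadata.json").write_text("[]")  # placeholder file name
    found = list_index_files(tmp)
    missing = missing_files(tmp, ["tfidf_embeddings.pkl"])

print("Files present:", found)
print("Files missing:", missing)
```

If the new embedding file is present but the TF-IDF pickle is reported missing, the retriever class is likely still the old version and the notebook kernel may need a restart to pick up the updated module.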
