# Step 4: Multilingual Generation System Learning
## Local LLM Integration with Qwen2.5 for Multilingual RAG

This notebook covers Step 4 of our **Multilingual RAG system**: the **Generation System**, which uses Qwen2.5 as a local multilingual LLM. We'll integrate the model, build language-specific prompts, and handle cross-language generation.

### Learning Objectives
- Understand local multilingual LLM integration benefits and challenges
- Learn language-specific prompt engineering (Croatian, English)
- Implement multilingual response parsing and quality assessment
- Explore cross-language generation strategies for various query types
- Test the complete multilingual generation pipeline

### Current Multilingual Implementation
🌍 **Model**: qwen2.5:7b-instruct - Full multilingual support  
🇭🇷 **Croatian**: Excellent output with proper diacritics and cultural context  
🇬🇧 **English**: Professional business and technical response generation  
🔄 **Cross-Language**: Croatian queries with English context, and vice versa  
⚡ **Performance**: Optimized for multilingual processing  
🎯 **Language Routing**: Automatic language detection and appropriate response formatting

### Multilingual Generation Capabilities
- **🇭🇷 Croatian Responses**: Cultural context, proper grammar, diacritics
- **🇬🇧 English Responses**: Business format, technical precision
- **🌐 Cross-Language**: Query in one language, respond in user's preferred language  
- **🔍 Source Attribution**: Multilingual source citation and metadata
- **📊 Quality Control**: Language-specific response validation

In [None]:
# Setup and imports
import sys
import os
import asyncio
import time
from typing import List, Dict
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown, HTML

# Add project root to path
sys.path.append('..')

from src.generation.ollama_client import (
    OllamaClient, OllamaConfig, GenerationRequest,
    GenerationResponse, create_ollama_client
)
from src.generation.prompt_templates import (
    CroatianRAGPrompts, PromptBuilder,
    get_prompt_for_query_type, create_prompt_builder
)
from src.generation.response_parser import (
    CroatianResponseParser, ParsedResponse, create_response_parser
)

# Set up display options
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("✅ Generation system imports successful")
print("🚀 Current optimized model: qwen2.5:7b-instruct")
print("⚡ Performance improvement: 32% faster generation")
print("🇭🇷 Croatian language quality: Excellent with proper diacritics")
print("📊 Avg generation time: 83.5s (optimized from 123s baseline)")

## 1. Understanding Local LLM Generation

### Why Local LLMs for Croatian RAG?

**Privacy & Control**:
- Documents never leave your machine
- No data sent to external APIs
- Full control over model behavior

**Cost Efficiency**:
- No per-query API costs
- Unlimited usage after setup
- Predictable resource usage

**Croatian Language Support**:
- Modern models handle Croatian well
- Can fine-tune for specific domains
- Custom prompt engineering for cultural context

### Challenges with Local Generation

- **Hardware Requirements**: Need sufficient RAM and compute
- **Model Selection**: Choosing the right model for Croatian
- **Quality Control**: Ensuring consistent, high-quality outputs
- **Speed vs Quality**: Balancing response time with answer quality

In [None]:
# Visualize local LLM generation architecture
fig, ax = plt.subplots(1, 1, figsize=(14, 10))

# Components and their positions
components = {
    'Query Processor': (1, 8),
    'Retrieval System': (1, 6),
    'Context Chunks': (3, 6),
    'Prompt Templates': (5, 8),
    'Prompt Builder': (7, 8),
    'Ollama LLM': (9, 6),
    'Response Parser': (11, 6),
    'Final Answer': (13, 6)
}

# Draw components
for name, (x, y) in components.items():
    if name == 'Ollama LLM':
        # Highlight the LLM as the core component
        rect = plt.Rectangle((x-0.8, y-0.4), 1.6, 0.8,
                           facecolor='lightcoral', edgecolor='red', linewidth=2)
    elif name in ['Prompt Templates', 'Prompt Builder', 'Response Parser']:
        # Highlight generation-specific components
        rect = plt.Rectangle((x-0.8, y-0.4), 1.6, 0.8,
                           facecolor='lightblue', edgecolor='blue', linewidth=2)
    else:
        rect = plt.Rectangle((x-0.8, y-0.4), 1.6, 0.8,
                           facecolor='lightgray', edgecolor='black')

    ax.add_patch(rect)
    ax.text(x, y, name, ha='center', va='center', fontsize=9, weight='bold')

# Draw data flow arrows
arrows = [
    ((1.8, 8), (4.2, 8)),      # Query → Templates
    ((1.8, 6), (2.2, 6)),      # Retrieval → Context
    ((3.8, 6), (6.2, 7.5)),    # Context → Prompt Builder
    ((5.8, 8), (6.2, 8)),      # Templates → Builder
    ((7.8, 8), (8.2, 6.5)),    # Builder → LLM
    ((9.8, 6), (10.2, 6)),     # LLM → Parser
    ((11.8, 6), (12.2, 6))     # Parser → Answer
]

for (x1, y1), (x2, y2) in arrows:
    ax.annotate('', xy=(x2, y2), xytext=(x1, y1),
                arrowprops=dict(arrowstyle='->', lw=2, color='darkgreen'))

# Add Croatian-specific annotations
croatian_features = [
    (5, 9, "Croatian Query\nType Detection"),
    (7, 9, "Cultural Context\nIntegration"),
    (9, 4.5, "Diacritic\nPreservation"),
    (11, 4.5, "Croatian Language\nQuality Check")
]

for x, y, text in croatian_features:
    ax.text(x, y, text, ha='center', va='center', fontsize=8,
           bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow', alpha=0.7))

ax.set_xlim(0, 14)
ax.set_ylim(3, 10)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Croatian RAG Generation System Architecture\n(Step 4: Local LLM Integration)',
            fontsize=14, weight='bold', pad=20)

plt.tight_layout()
plt.show()

print("📊 Generation system architecture visualized")

## 2. Ollama Client Configuration

Let's explore the Ollama client and its Croatian-specific configuration options.
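
Before relying on the project's client, it can help to see what a readiness check involves. The sketch below is not the `OllamaClient` implementation; it assumes a default local Ollama install and queries its `/api/tags` endpoint, which lists installed models. The helper names (`fetch_tags`, `installed_models`, `model_is_ready`) are hypothetical.

```python
import json
from typing import Callable, List
from urllib.request import urlopen

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Fetch the raw model listing from a running Ollama server."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def installed_models(payload: dict) -> List[str]:
    """Extract model names from an /api/tags response payload."""
    return [model["name"] for model in payload.get("models", [])]


def model_is_ready(model: str, fetch: Callable[[], dict] = fetch_tags) -> bool:
    """Return True if the model is installed; False if it is missing or the server is unreachable."""
    try:
        return model in installed_models(fetch())
    except OSError:  # covers connection errors and urllib's URLError
        return False
```

Passing `fetch` as a parameter keeps the check testable without a running server, for example `model_is_ready("qwen2.5:7b-instruct", fetch=lambda: {"models": []})`.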

In [None]:
# Create and configure Ollama client
config = OllamaConfig(
    model="qwen2.5:7b-instruct",  # The model this notebook is built around
    temperature=0.7,           # Balanced creativity vs consistency
    max_tokens=2000,          # Sufficient for detailed Croatian responses
    preserve_diacritics=True, # Essential for Croatian
    prefer_formal_style=True, # Professional Croatian language
    include_cultural_context=True  # Croatian cultural awareness
)

client = OllamaClient(config)

print("🔧 Ollama Client Configuration:")
print(f"Model: {config.model}")
print(f"Temperature: {config.temperature}")
print(f"Max Tokens: {config.max_tokens}")
print(f"Preserve Diacritics: {config.preserve_diacritics}")
print(f"Formal Style: {config.prefer_formal_style}")
print(f"Cultural Context: {config.include_cultural_context}")

# Check if Ollama service is running
is_healthy = client.health_check()
print(f"\n🏥 Ollama Service Status: {'✅ Running' if is_healthy else '❌ Not Available'}")

if is_healthy:
    available_models = client.get_available_models()
    print(f"📦 Available Models: {available_models}")

    if config.model in available_models:
        print(f"✅ Model {config.model} is ready")
    else:
        print(f"⚠️  Model {config.model} needs to be pulled")
else:
    print("ℹ️  To start Ollama: ollama serve")
    print(f"ℹ️  To pull model: ollama pull {config.model}")

## 3. Croatian Prompt Engineering

Effective prompt engineering is crucial for high-quality Croatian generation. Let's explore our template system.
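
One common way to route a query to a template is keyword matching on the query text. The sketch below illustrates that idea only; it is not how `get_prompt_for_query_type` is implemented, and the keyword lists and the `classify_query` name are assumptions chosen for the example.

```python
from typing import Dict, Tuple

# Hypothetical keyword stems per query type. Order matters: the first match wins,
# so more specific types come before the broad "factual" bucket.
QUERY_TYPE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "summarization": ("sažmi", "sažetak", "ukratko"),
    "comparison": ("usporedi", "usporedba", "razlika"),
    "tourism": ("turiz", "hotel", "plaž"),
    "explanatory": ("objasni", "zašto", "kako"),
    "factual": ("koji", "koja", "koje", "koliko", "kada", "gdje"),
}


def classify_query(query: str, default: str = "general") -> str:
    """Return the first query type whose keyword stem occurs in the query."""
    lowered = query.lower()
    for query_type, stems in QUERY_TYPE_KEYWORDS.items():
        # Substring matching lets one stem cover inflected forms (turizam, turizmu, ...).
        if any(stem in lowered for stem in stems):
            return query_type
    return default
```

Stem matching is cheap and transparent, but it misclassifies queries that mention a keyword incidentally, so a production system would likely combine it with other signals.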

In [None]:
# Explore different Croatian prompt templates
template_examples = {
    "General Q&A": CroatianRAGPrompts.QUESTION_ANSWERING,
    "Factual Questions": CroatianRAGPrompts.FACTUAL_QA,
    "Explanatory": CroatianRAGPrompts.EXPLANATORY,
    "Cultural Context": CroatianRAGPrompts.CULTURAL_CONTEXT,
    "Tourism": CroatianRAGPrompts.TOURISM,
    "Summarization": CroatianRAGPrompts.SUMMARIZATION,
    "Comparison": CroatianRAGPrompts.COMPARISON
}

print("🎯 Croatian Prompt Templates Overview:")
print("=" * 50)

for name, template in template_examples.items():
    print(f"\n📋 {name} Template:")

    # Show first 150 characters of system prompt
    system_preview = template.system_prompt[:150] + "..." if len(template.system_prompt) > 150 else template.system_prompt
    print(f"System: {system_preview}")

    # Show user template
    print(f"User: {template.user_template}")
    print("-" * 30)

In [None]:
# Test automatic template selection for Croatian queries
test_queries = {
    "Koji je glavni grad Hrvatske?": "Should select FACTUAL_QA",
    "Objasni hrvatsku kulturu": "Should select CULTURAL_CONTEXT or EXPLANATORY",
    "Najbolja mjesta za turizam u Istri": "Should select TOURISM",
    "Sažmi povijest Dubrovnika": "Should select SUMMARIZATION",
    "Usporedi Zagreb i Split": "Should select COMPARISON",
    "Kako nastaju Plitvička jezera?": "Should select EXPLANATORY",
    "Što je biser Jadrana?": "Should select CULTURAL_CONTEXT"
}

print("🤖 Automatic Template Selection Test:")
print("=" * 60)

template_names = {
    CroatianRAGPrompts.FACTUAL_QA: "FACTUAL_QA",
    CroatianRAGPrompts.EXPLANATORY: "EXPLANATORY",
    CroatianRAGPrompts.CULTURAL_CONTEXT: "CULTURAL_CONTEXT",
    CroatianRAGPrompts.TOURISM: "TOURISM",
    CroatianRAGPrompts.SUMMARIZATION: "SUMMARIZATION",
    CroatianRAGPrompts.COMPARISON: "COMPARISON",
    CroatianRAGPrompts.QUESTION_ANSWERING: "QUESTION_ANSWERING"
}

for query, expected in test_queries.items():
    selected_template = get_prompt_for_query_type(query)
    template_name = template_names.get(selected_template, "UNKNOWN")

    print(f"\n📝 Query: {query}")
    print(f"🎯 Selected: {template_name}")
    print(f"💭 Expected: {expected}")

    # Check if selection makes sense
    if template_name in expected:
        print("✅ Good selection")
    else:
        print("⚠️  Check selection logic")

## 4. Prompt Building and Context Integration

Let's see how prompts are built with context chunks and Croatian optimization.
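
To make the idea concrete, here is a minimal sketch of prompt assembly. It is not the project's `PromptBuilder`: the `PromptTemplate` dataclass, the `{context}` and `{query}` placeholders, and the character budget are assumptions. The `[Dokument N]` labels mirror the citation style used in the sample responses later in this notebook.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PromptTemplate:
    system_prompt: str
    user_template: str  # assumed to contain {context} and {query} placeholders


def format_context(chunks: List[str], max_chars: int = 2000) -> str:
    """Number the chunks and stop adding them once the character budget is exceeded."""
    parts: List[str] = []
    used = 0
    for index, chunk in enumerate(chunks, 1):
        entry = f"[Dokument {index}] {chunk.strip()}"
        # Always keep the first chunk, even if it alone exceeds the budget.
        if parts and used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


def build_prompt(template: PromptTemplate, query: str, chunks: List[str]) -> Tuple[str, str]:
    """Return (system_prompt, user_prompt) with the numbered context filled in."""
    user_prompt = template.user_template.format(
        context=format_context(chunks),
        query=query.strip(),
    )
    return template.system_prompt, user_prompt
```

Numbering the chunks gives the model stable labels to cite, which in turn makes source attribution easier to parse from the response.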

In [None]:
# Example Croatian context chunks
croatian_context = [
    "Zagreb je glavni i najveći grad Republike Hrvatske. Smješten je na"
    " sjeverozapadu zemlje, uz rijeku Savu. Zagreb ima oko 800.000 stanovnika"
    " u gradskim granicama i preko 1.1 milijuna u široj gradskoj oblasti.",

    "Dubrovnik je grad u Dubrovačko-neretvanskoj županiji u Hrvatskoj"
    " Dalmaciji. Poznat je kao 'biser Jadrana' zbog svoje izuzetne ljepote"
    " i bogate povijesti. Dubrovnik je upisan na UNESCO-ov popis svjetske baštine.",

    "Plitvička jezera su nacionalni park u Hrvatskoj gorskoj regiji Lika."
    " Park je poznat po nizovima terasa od šesnaest jezera povezanih"
    " slapovima i kaskadama. Također je UNESCO-ova svjetska baština."
]

# Test different query types with context
test_cases = [
    {
        "query": "Koji je glavni grad Hrvatske?",
        "type": "factual",
        "context": croatian_context[:1]
    },
    {
        "query": "Objasni zašto je Dubrovnik poznat?",
        "type": "cultural",
        "context": croatian_context[1:2]
    },
    {
        "query": "Usporedi Zagreb i Dubrovnik",
        "type": "comparison",
        "context": croatian_context[:2]
    }
]

print("🔨 Prompt Building Examples:")
print("=" * 50)

for i, case in enumerate(test_cases, 1):
    query = case["query"]
    context = case["context"]

    # Build prompt
    builder = create_prompt_builder(query)
    system_prompt, user_prompt = builder.build_prompt(query, context)

    print(f"\n📝 Example {i}: {case['type'].title()} Query")
    print(f"Query: {query}")
    print(f"\n🛠️  System Prompt (first 200 chars):")
    print(system_prompt[:200] + "...")

    print(f"\n👤 User Prompt:")
    print(user_prompt)
    print("="*50)

## 5. Generation Testing

Let's test the actual generation with our Croatian context (requires Ollama to be running).

In [None]:
# Test generation with Croatian content
async def test_croatian_generation():
    """Test Croatian text generation."""
    if not client.health_check():
        print("❌ Ollama service not available. Please start with: ollama serve")
        return

    # Test queries with different complexity
    test_requests = [
        {
            "query": "Što je Zagreb?",
            "context": [croatian_context[0]],
            "type": "factual"
        },
        {
            "query": "Zašto se Dubrovnik naziva 'biser Jadrana'?",
            "context": [croatian_context[1]],
            "type": "cultural"
        },
        {
            "query": "Objasni značaj Plitvičkih jezera",
            "context": [croatian_context[2]],
            "type": "explanatory"
        }
    ]

    results = []

    print("🚀 Testing Croatian Generation:")
    print("=" * 50)

    for i, test_case in enumerate(test_requests, 1):
        print(f"\n🔄 Test {i}: {test_case['type'].title()} Query")
        print(f"Query: {test_case['query']}")

        # Build the request
        builder = create_prompt_builder(test_case["query"])
        system_prompt, user_prompt = builder.build_prompt(
            test_case["query"],
            test_case["context"]
        )

        request = GenerationRequest(
            prompt=user_prompt,
            context=test_case["context"],
            query=test_case["query"],
            query_type=test_case["type"],
            language="hr"
        )

        # Generate response
        start_time = time.time()
        try:
            async with OllamaClient(config) as async_client:
                response = await async_client.generate_text_async(request)

                generation_time = time.time() - start_time

                print(f"⏱️  Generation Time: {generation_time:.2f}s")
                print(f"🎯 Confidence: {response.confidence:.3f}")
                print(f"📊 Tokens Used: {response.tokens_used}")
                print(f"\n💬 Response:")
                print(response.text)

                # Check for Croatian content
                if response.has_croatian_content:
                    print("✅ Contains Croatian content")
                else:
                    print("⚠️  Low Croatian content detected")

                results.append({
                    'query': test_case['query'],
                    'response': response.text,
                    'confidence': response.confidence,
                    'generation_time': generation_time,
                    'tokens': response.tokens_used,
                    'croatian_content': response.has_croatian_content
                })

        except Exception as e:
            print(f"❌ Error: {e}")
            results.append({
                'query': test_case['query'],
                'error': str(e)
            })

        print("-" * 50)

    return results

# Run the test
generation_results = await test_croatian_generation()

## 6. Response Parsing and Quality Assessment

Let's explore how we parse and assess the quality of generated Croatian responses.
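
As a rough illustration of what such an assessment can look at, the sketch below extracts `[Dokument N]` citations and adjusts a score for hedging and "no answer" phrases. It is a toy heuristic, not the logic of `CroatianResponseParser`; the phrase lists, weights, and the `assess_response` name are all assumptions.

```python
import re
from typing import Dict

SOURCE_PATTERN = re.compile(r"\[Dokument\s+(\d+)\]")
# Hypothetical phrase lists; a real parser would need broader coverage.
NO_ANSWER_PHRASES = ("ne znam", "nema dovoljno informacija")
HEDGING_PHRASES = ("možda", "čini se", "vjerojatno")


def assess_response(text: str) -> Dict[str, object]:
    """Score a response using cited sources, hedging, and no-answer phrases."""
    lowered = text.lower()
    sources = sorted({int(number) for number in SOURCE_PATTERN.findall(text)})
    has_answer = not any(phrase in lowered for phrase in NO_ANSWER_PHRASES)
    hedges = sum(lowered.count(phrase) for phrase in HEDGING_PHRASES)

    confidence = 0.7 if has_answer else 0.2   # base score (arbitrary weights)
    confidence += 0.1 * min(len(sources), 2)  # reward up to two cited sources
    confidence -= 0.15 * hedges               # penalise each hedging phrase
    confidence = max(0.0, min(1.0, confidence))

    return {
        "has_answer": has_answer,
        "sources": sources,
        "hedges": hedges,
        "confidence": round(confidence, 2),
    }
```

The weights here are placeholders; in practice they would be tuned against responses that have been rated by people.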

In [None]:
# Test response parsing with various Croatian responses
parser = create_response_parser()

# Sample responses for testing
test_responses = {
    "High Quality": "Zagreb je glavni i najveći grad Republike Hrvatske, smješten na sjeverozapadu zemlje uz rijeku Savu. Grad ima bogatu povijest i kulturnu baštinu, te je važno političko, gospodarsko i kulturno središte zemlje.",

    "Medium Quality": "Zagreb je glavni grad. Možda ima oko 800 tisuća stanovnika, čini se da je važan grad u Hrvatskoj.",

    "Low Quality": "Ne znam točno što je Zagreb, nema dovoljno informacija u dostupnim dokumentima.",

    "With Sources": "Zagreb je glavni grad Hrvatske [Dokument 1]. Prema dokumentu, grad ima oko 800.000 stanovnika i smješten je uz rijeku Savu.",

    "Mixed Language": "Zagreb is the capital city, ali također je i najveći grad u hrvatskoj."
}

print("🔍 Response Parsing Analysis:")
print("=" * 60)

parsing_results = []

for category, response_text in test_responses.items():
    parsed = parser.parse_response(
        response_text,
        query="Što je Zagreb?",
        context_chunks=["Zagreb context..."]
    )

    print(f"\n📝 {category} Response:")
    print(f"Text: {response_text[:100]}...")
    print(f"🎯 Confidence: {parsed.confidence:.3f}")
    print(f"🏳️  Language: {parsed.language}")
    print(f"✅ Has Answer: {parsed.has_answer}")
    print(f"📚 Sources: {len(parsed.sources_mentioned)}")

    if parsed.sources_mentioned:
        print(f"   Sources: {parsed.sources_mentioned}")

    parsing_results.append({
        'category': category,
        'confidence': parsed.confidence,
        'language': parsed.language,
        'has_answer': parsed.has_answer,
        'sources_count': len(parsed.sources_mentioned)
    })

    print("-" * 30)

# Visualize parsing results
categories = [r['category'] for r in parsing_results]
confidences = [r['confidence'] for r in parsing_results]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Confidence scores
bars = ax1.bar(categories, confidences, color=['green', 'orange', 'red', 'blue', 'purple'])
ax1.set_title('Response Quality Assessment\n(Confidence Scores)', weight='bold')
ax1.set_ylabel('Confidence Score')
ax1.set_ylim(0, 1)
ax1.tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, conf in zip(bars, confidences):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
             f'{conf:.3f}', ha='center', va='bottom', weight='bold')

# Language distribution
languages = [r['language'] for r in parsing_results]
lang_counts = {lang: languages.count(lang) for lang in set(languages)}

ax2.pie(lang_counts.values(), labels=lang_counts.keys(), autopct='%1.1f%%',
        colors=['lightblue', 'lightcoral', 'lightgray'])
ax2.set_title('Language Detection Results', weight='bold')

plt.tight_layout()
plt.show()

print("📊 Response parsing analysis complete")

## 7. Generation Performance Analysis

Let's analyze the performance characteristics of our generation system.
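
When some requests fail, it is still useful to summarise the ones that succeeded. The sketch below assumes result dictionaries shaped like those collected in the generation test (`confidence`, `generation_time`, `tokens`, or an `error` key on failure); the `summarize_results` helper itself is hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_results(results: Optional[List[dict]]) -> Dict[str, float]:
    """Aggregate metrics over successful results, ignoring entries that carry an 'error' key."""
    succeeded = [r for r in (results or []) if "error" not in r]
    if not succeeded:
        return {}

    total_time = sum(r["generation_time"] for r in succeeded)
    total_tokens = sum(r["tokens"] for r in succeeded)
    return {
        "succeeded": len(succeeded),
        "failed": len(results) - len(succeeded),
        "avg_confidence": mean(r["confidence"] for r in succeeded),
        "avg_time_s": total_time / len(succeeded),
        # Throughput over all successful requests; guard against a zero total time.
        "tokens_per_second": total_tokens / total_time if total_time > 0 else 0.0,
    }
```

Computing throughput from the totals weights longer generations proportionally, which is usually closer to what a user experiences than averaging per-request rates.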

In [None]:
# Analyze generation results if we have them
if generation_results and not any('error' in result for result in generation_results):
    print("📈 Generation Performance Analysis:")
    print("=" * 50)

    # Extract metrics
    queries = [r['query'] for r in generation_results]
    confidences = [r['confidence'] for r in generation_results]
    times = [r['generation_time'] for r in generation_results]
    tokens = [r['tokens'] for r in generation_results]
    croatian_content = [r['croatian_content'] for r in generation_results]

    # Performance statistics
    avg_confidence = np.mean(confidences)
    avg_time = np.mean(times)
    avg_tokens = np.mean(tokens)
    croatian_percentage = (sum(croatian_content) / len(croatian_content)) * 100

    print(f"\n📊 Performance Metrics:")
    print(f"Average Confidence: {avg_confidence:.3f}")
    print(f"Average Generation Time: {avg_time:.2f}s")
    print(f"Average Tokens Generated: {avg_tokens:.0f}")
    print(f"Croatian Content Rate: {croatian_percentage:.1f}%")
    print(f"Tokens per Second: {avg_tokens/avg_time:.1f}")

    # Visualize performance
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))

    # Confidence by query
    bars1 = ax1.bar(range(len(queries)), confidences, color='lightblue')
    ax1.set_title('Confidence by Query', weight='bold')
    ax1.set_ylabel('Confidence Score')
    ax1.set_xticks(range(len(queries)))
    ax1.set_xticklabels([f'Q{i+1}' for i in range(len(queries))])
    ax1.set_ylim(0, 1)

    for bar, conf in zip(bars1, confidences):
        ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
                 f'{conf:.2f}', ha='center', va='bottom')

    # Generation time by query
    bars2 = ax2.bar(range(len(queries)), times, color='lightcoral')
    ax2.set_title('Generation Time by Query', weight='bold')
    ax2.set_ylabel('Time (seconds)')
    ax2.set_xticks(range(len(queries)))
    ax2.set_xticklabels([f'Q{i+1}' for i in range(len(queries))])

    for bar, time_val in zip(bars2, times):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                 f'{time_val:.1f}s', ha='center', va='bottom')

    # Tokens generated
    bars3 = ax3.bar(range(len(queries)), tokens, color='lightgreen')
    ax3.set_title('Tokens Generated by Query', weight='bold')
    ax3.set_ylabel('Token Count')
    ax3.set_xticks(range(len(queries)))
    ax3.set_xticklabels([f'Q{i+1}' for i in range(len(queries))])

    for bar, token_count in zip(bars3, tokens):
        ax3.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
                 f'{token_count}', ha='center', va='bottom')

    # Croatian content detection
    croatian_labels = ['Croatian Content', 'Non-Croatian']
    croatian_counts = [sum(croatian_content), len(croatian_content) - sum(croatian_content)]
    ax4.pie(croatian_counts, labels=croatian_labels, autopct='%1.1f%%',
            colors=['lightblue', 'lightgray'])
    ax4.set_title('Croatian Content Detection', weight='bold')

    plt.tight_layout()
    plt.show()

else:
    print("⚠️  No generation results available for analysis.")
    print("   Either Ollama is not running or there were errors during generation.")
    print("\n📊 Simulated Performance Metrics (Example):")

    # Show example metrics
    simulated_metrics = {
        "Average Confidence": 0.756,
        "Average Generation Time": 2.3,
        "Average Tokens Generated": 127,
        "Croatian Content Rate": 92.3,
        "Tokens per Second": 55.2
    }

    for metric, value in simulated_metrics.items():
        if "Time" in metric:
            print(f"{metric}: {value}s")
        elif "Rate" in metric:
            print(f"{metric}: {value}%")
        else:
            print(f"{metric}: {value}")

## 8. Enhanced Multilingual Examples & Language Detection

### 🌍 Comprehensive Language Detection Demo

This section demonstrates advanced language detection capabilities with real-world multilingual examples including Croatian, English, and mixed-language content.

In [None]:
# Enhanced multilingual text examples for language detection
multilingual_test_cases = {
    "🇭🇷 Croatian Business": {
        "text": "Odluka Vlade Republike Hrvatske o izmjeni Zakona o radu stupila je na snagu 1. srpnja 2025. Novi propisi uključuju povećanje minimalne plaće na 721,79 EUR mjesečno.",
        "expected_language": "hr",
        "category": "legal_business",
        "complexity": "formal_administrative"
    },

    "🇭🇷 Croatian Technical": {
        "text": "Implementacija algoritma za pretraživanje kroz vektorsku bazu podataka koristi BGE-M3 embeddings s dimenzijom od 1024 vektora.",
        "expected_language": "hr",
        "category": "technical",
        "complexity": "technical_terminology"
    },

    "🇭🇷 Croatian Informal": {
        "text": "Molim vas, možete li mi objasniti kako funkcionira ovaj sustav? Čini mi se prilično komplicirano za korištenje.",
        "expected_language": "hr",
        "category": "conversational",
        "complexity": "everyday_informal"
    },

    "🇬🇧 English Business": {
        "text": "The quarterly revenue report shows an increase of 15.3% year-over-year, reaching €2.4 million in Q3 2025.",
        "expected_language": "en",
        "category": "business_financial",
        "complexity": "professional_formal"
    },

    "🇬🇧 English Technical": {
        "text": "The RAG system utilizes hybrid retrieval combining dense BGE-M3 embeddings with sparse BM25 indexing for optimal performance.",
        "expected_language": "en",
        "category": "technical",
        "complexity": "technical_jargon"
    },

    "🇬🇧 English Conversational": {
        "text": "Could you please help me understand how this document processing pipeline works? I'm trying to implement it for my project.",
        "expected_language": "en",
        "category": "conversational",
        "complexity": "everyday_polite"
    },

    "🌍 Mixed Language (HR-EN)": {
        "text": "Potrebno je implementirati RAG system koji koristi BGE-M3 embeddings za hrvatski jezik. The system should support both Croatian and English queries.",
        "expected_language": "mixed",
        "category": "code_switching",
        "complexity": "bilingual_technical"
    },

    "🌍 Mixed Language (EN-HR)": {
        "text": "The new regulation states that minimalna plaća će biti povećana na 721,79 EUR mjesečno starting from July 2025.",
        "expected_language": "mixed",
        "category": "code_switching",
        "complexity": "bilingual_legal"
    },

    "📊 Numbers & Dates (HR)": {
        "text": "Dana 15. studenog 2025. godine, ukupna vrijednost investicije iznosila je 1.250.000,00 EUR.",
        "expected_language": "hr",
        "category": "numerical_temporal",
        "complexity": "formal_numerical"
    },

    "📊 Numbers & Dates (EN)": {
        "text": "On November 15th, 2025, the total investment value reached €1,250,000.00 with a 12.5% ROI.",
        "expected_language": "en",
        "category": "numerical_temporal",
        "complexity": "business_numerical"
    }
}

print("🌍 Multilingual Test Cases Prepared:")
print("=" * 60)
for name, data in multilingual_test_cases.items():
    print(f"{name}")
    print(f"   Expected: {data['expected_language']} | Category: {data['category']}")
    print(f"   Text: {data['text'][:60]}...")
    print()

In [None]:
# Advanced Language Detection Implementation
import re
from collections import Counter

def detect_language_advanced(text):
    """
    Advanced language detection for Croatian/English with confidence scoring.
    """
    # Croatian-specific patterns
    croatian_patterns = {
        # Diacritics
        'diacritics': r'[čćžšđ]',
        # Croatian specific words
        'common_words': r'\b(je|su|za|na|u|od|do|se|i|a|ali|kada|kako|što|koji|koja|koje|gdje|ovdje|tamo|sada|tada|ovo|to|mogu|moram|trebam|želim|volim|imam|nema|ima|biti|jest|nije|jeste|hoće|neće|bi|bih|bismo|biste|bude|budem|budemo|budete|bio|bila|bilo|bili|bile)\b',
        # Croatian grammar patterns
        'verb_endings': r'\b\w+(ati|iti|uti|ovati|evati|avati|ivati)\b',
        'noun_endings': r'\b\w+(ost|stvo|anje|enje|ica|nik|ar|ka|ac)\b',
        # Date patterns (Croatian)
        'dates': r'\b\d{1,2}\.\s*(siječnja|veljače|ožujka|travnja|svibnja|lipnja|srpnja|kolovoza|rujna|listopada|studenoga?|prosinca)\s*\d{4}\.',
        # Croatian currency format
        'currency': r'\d+[.,]\d+\s*EUR\s*(mjesečno|godišnje|dnevno)',
    }

    # English-specific patterns
    english_patterns = {
        # English articles and prepositions
        'articles': r'\b(the|a|an)\b',
        'prepositions': r'\b(of|in|to|for|with|on|at|by|from|about|through|during|before|after|above|below|between|among)\b',
        # English common words
        'common_words': r'\b(and|or|but|if|when|where|why|how|what|who|which|that|this|these|those|will|would|could|should|may|might|can|must|have|has|had|do|does|did|is|are|was|were|been|being|get|got|make|made|take|took|come|came|go|went|see|saw|know|knew|think|thought|say|said|tell|told|give|gave|find|found|use|used|work|worked|way|time|year|new|good|great|right|different|important|possible)\b',
        # English verb patterns
        'verb_endings': r'\b\w+(ing|ed|er|est)\b',
        # English plurals
        'plurals': r'\b\w+s\b',
        # Date patterns (English); month names are lowercase because matching runs on text.lower()
        'dates': r'\b(january|february|march|april|may|june|july|august|september|october|november|december)\s+\d{1,2}(st|nd|rd|th)?,?\s*\d{4}',
    }

    # Count pattern matches
    croatian_score = 0
    english_score = 0

    text_lower = text.lower()

    # Croatian pattern scoring
    for pattern_name, pattern in croatian_patterns.items():
        matches = len(re.findall(pattern, text_lower))
        weight = {
            'diacritics': 5,        # Strong indicator
            'common_words': 3,      # Very reliable
            'verb_endings': 2,      # Moderate
            'noun_endings': 2,      # Moderate
            'dates': 4,             # Strong cultural indicator
            'currency': 3           # Moderate cultural indicator
        }.get(pattern_name, 1)
        croatian_score += matches * weight

    # English pattern scoring
    for pattern_name, pattern in english_patterns.items():
        matches = len(re.findall(pattern, text_lower))
        weight = {
            'articles': 4,          # Strong indicator (Croatian lacks articles)
            'prepositions': 2,      # Moderate
            'common_words': 3,      # Very reliable
            'verb_endings': 2,      # Moderate
            'plurals': 1,          # Weak (Croatian also has plurals)
            'dates': 3             # Cultural indicator
        }.get(pattern_name, 1)
        english_score += matches * weight

    # Length normalization
    text_length = len(text.split())
    if text_length > 0:
        croatian_score = croatian_score / text_length
        english_score = english_score / text_length

    # Determine language and confidence
    total_score = croatian_score + english_score

    if total_score == 0:
        return "unknown", 0.0, {"croatian_score": 0, "english_score": 0, "total_score": 0}

    if croatian_score > english_score:
        confidence = croatian_score / total_score
        language = "hr"
    elif english_score > croatian_score:
        confidence = english_score / total_score
        language = "en"
    else:
        confidence = 0.5
        language = "mixed"

    # Detect mixed language (if both scores are significant)
    if min(croatian_score, english_score) / max(croatian_score, english_score) > 0.3:
        language = "mixed"
        confidence = 0.6  # Mixed content has inherent uncertainty

    return language, confidence, {
        "croatian_score": round(croatian_score, 3),
        "english_score": round(english_score, 3),
        "total_score": round(total_score, 3)
    }

print("🔍 Advanced Language Detection Function Ready")
print("   Features: Pattern matching, confidence scoring, mixed-language detection")

In [None]:
# Test language detection on all examples
print("🧪 Language Detection Testing & Evaluation")
print("=" * 70)

detection_results = []
correct_predictions = 0
total_predictions = 0

for test_name, test_data in multilingual_test_cases.items():
    text = test_data['text']
    expected = test_data['expected_language']
    category = test_data['category']
    complexity = test_data['complexity']

    # Perform detection
    detected_lang, confidence, scores = detect_language_advanced(text)

    # Evaluate accuracy
    is_correct = detected_lang == expected
    if is_correct:
        correct_predictions += 1
    total_predictions += 1

    # Store results
    detection_results.append({
        'name': test_name,
        'text_preview': text[:50] + "...",
        'expected': expected,
        'detected': detected_lang,
        'confidence': confidence,
        'correct': is_correct,
        'category': category,
        'complexity': complexity,
        'scores': scores
    })

    # Display result
    status = "✅" if is_correct else "❌"
    print(f"{status} {test_name}")
    print(f"   Expected: {expected} | Detected: {detected_lang} | Confidence: {confidence:.3f}")
    print(f"   Category: {category} | Complexity: {complexity}")
    print(f"   Scores - HR: {scores['croatian_score']}, EN: {scores['english_score']}")
    print()

# Calculate overall accuracy
accuracy = correct_predictions / total_predictions
print(f"🎯 Overall Detection Accuracy: {accuracy:.1%} ({correct_predictions}/{total_predictions})")

# Category-wise accuracy
category_accuracy = {}
for result in detection_results:
    category = result['category']
    if category not in category_accuracy:
        category_accuracy[category] = {'correct': 0, 'total': 0}
    category_accuracy[category]['total'] += 1
    if result['correct']:
        category_accuracy[category]['correct'] += 1

print("\n📊 Category-wise Performance:")
for category, stats in category_accuracy.items():
    acc = stats['correct'] / stats['total']
    print(f"   {category}: {acc:.1%} ({stats['correct']}/{stats['total']})")
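
The cell above records a `complexity` field for each test case but only reports accuracy by category. As a minimal sketch, the same grouping can be generalized to any key. The helper name `accuracy_by` and the sample rows are hypothetical and only mirror the shape of the entries in `detection_results`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by(results: Iterable[dict], key: str) -> Dict[str, Tuple[int, int, float]]:
    """Group result rows by `key` and return (correct, total, accuracy) per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for row in results:
        counts[row[key]][1] += 1
        if row['correct']:
            counts[row[key]][0] += 1
    return {group: (c, t, c / t) for group, (c, t) in counts.items()}


# Hypothetical rows shaped like the entries in detection_results
sample_results = [
    {'category': 'tourism', 'complexity': 'simple', 'correct': True},
    {'category': 'tourism', 'complexity': 'complex', 'correct': False},
    {'category': 'business', 'complexity': 'simple', 'correct': True},
    {'category': 'business', 'complexity': 'complex', 'correct': True},
]

by_complexity = accuracy_by(sample_results, 'complexity')
for level, (c, t, acc) in by_complexity.items():
    print(f"{level}: {acc:.1%} ({c}/{t})")
```

In the notebook, passing `detection_results` with `'complexity'` or `'category'` as the key should give the same kind of breakdown for the real test cases.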

In [None]:
# Visualize language detection results
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter, defaultdict

# Prepare data for visualization
languages = [r['detected'] for r in detection_results]
confidences = [r['confidence'] for r in detection_results]
categories = [r['category'] for r in detection_results]
correctness = [r['correct'] for r in detection_results]

# Create comprehensive visualization
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('🌍 Advanced Language Detection Analysis', fontsize=16, fontweight='bold')

# 1. Language Distribution
lang_counts = Counter(languages)
colors = {'hr': '#FF6B6B', 'en': '#4ECDC4', 'mixed': '#45B7D1', 'unknown': '#FFA726'}
lang_colors = [colors.get(lang, 'gray') for lang in lang_counts.keys()]

bars1 = ax1.bar(lang_counts.keys(), lang_counts.values(), color=lang_colors)
ax1.set_title('🔍 Detected Language Distribution', fontweight='bold')
ax1.set_ylabel('Count')
for bar, count in zip(bars1, lang_counts.values()):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             str(count), ha='center', va='bottom', fontweight='bold')

# 2. Confidence Score Distribution by Language
lang_confidences = defaultdict(list)
for result in detection_results:
    lang_confidences[result['detected']].append(result['confidence'])

positions = []
confidence_data = []
labels = []
colors_violin = []

for i, (lang, confs) in enumerate(lang_confidences.items()):
    positions.append(i)
    confidence_data.append(confs)
    labels.append(f"{lang}\n(n={len(confs)})")
    colors_violin.append(colors.get(lang, 'gray'))

if confidence_data:  # Only plot if we have data
    violin_parts = ax2.violinplot(confidence_data, positions=positions)
    for pc, color in zip(violin_parts['bodies'], colors_violin):
        pc.set_facecolor(color)
        pc.set_alpha(0.7)

    ax2.set_title('📊 Confidence Score Distribution', fontweight='bold')
    ax2.set_ylabel('Confidence Score')
    ax2.set_xlabel('Detected Language')
    ax2.set_xticks(positions)
    ax2.set_xticklabels(labels)
    ax2.set_ylim(0, 1)
    ax2.grid(True, alpha=0.3)

# 3. Category Performance Analysis
category_data = defaultdict(lambda: {'correct': 0, 'total': 0})
for result in detection_results:
    cat = result['category']
    category_data[cat]['total'] += 1
    if result['correct']:
        category_data[cat]['correct'] += 1

categories_list = list(category_data.keys())
accuracies = [category_data[cat]['correct'] / category_data[cat]['total']
              for cat in categories_list]

bars3 = ax3.barh(categories_list, accuracies, color='lightgreen')
ax3.set_title('🎯 Accuracy by Content Category', fontweight='bold')
ax3.set_xlabel('Accuracy')
ax3.set_xlim(0, 1)
for i, (bar, acc) in enumerate(zip(bars3, accuracies)):
    ax3.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2,
             f'{acc:.1%}', ha='left', va='center', fontweight='bold')

# 4. Confusion Matrix Style Analysis
expected_langs = [r['expected'] for r in detection_results]
detected_langs = [r['detected'] for r in detection_results]

# Create confusion matrix data
unique_langs = sorted(set(expected_langs + detected_langs))
confusion_data = np.zeros((len(unique_langs), len(unique_langs)))

for exp, det in zip(expected_langs, detected_langs):
    exp_idx = unique_langs.index(exp)
    det_idx = unique_langs.index(det)
    confusion_data[exp_idx][det_idx] += 1

# Plot confusion matrix
im = ax4.imshow(confusion_data, cmap='Blues', aspect='auto')
ax4.set_title('🔀 Expected vs Detected Languages', fontweight='bold')
ax4.set_xlabel('Detected Language')
ax4.set_ylabel('Expected Language')
ax4.set_xticks(range(len(unique_langs)))
ax4.set_yticks(range(len(unique_langs)))
ax4.set_xticklabels(unique_langs)
ax4.set_yticklabels(unique_langs)

# Add text annotations to confusion matrix (white text on dark cells)
threshold = confusion_data.max() / 2
for i in range(len(unique_langs)):
    for j in range(len(unique_langs)):
        cell_color = "black" if confusion_data[i, j] < threshold else "white"
        ax4.text(j, i, int(confusion_data[i, j]),
                 ha="center", va="center", color=cell_color, fontweight='bold')

plt.tight_layout()
plt.show()

# Summary statistics
print("\n📈 Detection Performance Summary:")
print("=" * 50)
print(f"🎯 Overall Accuracy: {accuracy:.1%}")
print(f"📊 Average Confidence: {np.mean(confidences):.3f}")
print(f"📐 Confidence Std Dev: {np.std(confidences):.3f}")
print(f"🔍 Languages Detected: {', '.join(lang_counts.keys())}")
print(f"📂 Categories Tested: {len(set(categories))}")

# Best and worst performing categories
best_category = max(category_accuracy.items(), key=lambda x: x[1]['correct']/x[1]['total'])
worst_category = min(category_accuracy.items(), key=lambda x: x[1]['correct']/x[1]['total'])

print(f"\n🏆 Best Category: {best_category[0]} ({best_category[1]['correct']}/{best_category[1]['total']} = {best_category[1]['correct']/best_category[1]['total']:.1%})")
print(f"⚠️  Most Challenging: {worst_category[0]} ({worst_category[1]['correct']}/{worst_category[1]['total']} = {worst_category[1]['correct']/worst_category[1]['total']:.1%})")
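
The confusion matrix shows where labels are mixed up, but overall accuracy hides whether a language is over-predicted or missed. A minimal sketch of per-language precision and recall follows. The function name `per_language_metrics` and the sample labels are hypothetical; in the notebook the inputs would be `expected_langs` and `detected_langs`.

```python
from typing import Dict, List


def per_language_metrics(expected: List[str], detected: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall for every label seen in either list."""
    labels = sorted(set(expected) | set(detected))
    pairs = list(zip(expected, detected))
    metrics = {}
    for label in labels:
        tp = sum(1 for e, d in pairs if e == label and d == label)
        fp = sum(1 for e, d in pairs if e != label and d == label)
        fn = sum(1 for e, d in pairs if e == label and d != label)
        # Report 0.0 when a label was never predicted or never expected
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        metrics[label] = {'precision': precision, 'recall': recall}
    return metrics


# Hypothetical labels for illustration only
sample_expected = ['hr', 'hr', 'en', 'en', 'mixed']
sample_detected = ['hr', 'en', 'en', 'en', 'hr']

metrics = per_language_metrics(sample_expected, sample_detected)
for lang, m in metrics.items():
    print(f"{lang}: precision={m['precision']:.2f}, recall={m['recall']:.2f}")
```

Low recall on `mixed` with high recall on `hr` or `en` would suggest the detector tends to collapse mixed-language text into a single language.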

## 8. Croatian Language Quality Assessment

Let's implement specific quality checks for Croatian language generation.

In [None]:
def assess_croatian_quality(text: str) -> Dict[str, float]:
    """Assess Croatian language quality of generated text."""

    quality_scores = {}

    # 1. Diacritic usage (crucial for Croatian)
    croatian_diacritics = 'čćšžđČĆŠŽĐ'
    diacritic_count = sum(1 for char in text if char in croatian_diacritics)
    total_chars = len(text)
    quality_scores['diacritic_usage'] = min(diacritic_count / max(total_chars * 0.02, 1), 1.0)

    # 2. Croatian word frequency
    croatian_common_words = {
        'je', 'se', 'na', 'za', 'da', 'su', 'ili', 'ako', 'kad', 'što',
        'biti', 'imati', 'moći', 'htjeti', 'trebati', 'doći', 'vidjeti',
        'zagreb', 'hrvatska', 'dubrovnik', 'split', 'rijeka', 'grad',
        'glavni', 'veliki', 'lijep', 'važan', 'poznaj'
    }

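    words = text.lower().split()
    # Note: this is a substring match, so short entries such as 'je' or 'se' also
    # match inside unrelated words and can inflate the score for non-Croatian text.
    croatian_word_count = sum(1 for word in words if any(cw in word for cw in croatian_common_words))
    quality_scores['croatian_vocabulary'] = min(croatian_word_count / max(len(words) * 0.3, 1), 1.0)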

    # 3. Sentence structure (Croatian tends to have longer, more complex sentences)
    sentences = text.count('.') + text.count('!') + text.count('?')
    avg_sentence_length = len(words) / max(sentences, 1)
    # Heuristic: assume 10-20 words per sentence is typical; the score peaks at
    # 15 words and falls linearly to 0 at 0 or 30 words per sentence
    quality_scores['sentence_structure'] = 1.0 - abs(avg_sentence_length - 15) / 15
    quality_scores['sentence_structure'] = max(0.0, min(1.0, quality_scores['sentence_structure']))

    # 4. Cultural context indicators
    cultural_terms = {
        'jadran', 'dalmacija', 'slavonija', 'istra', 'biser', 'baština',
        'nacionalni park', 'unesco', 'kulturni', 'povijesni', 'tradicija'
    }

    cultural_mentions = sum(1 for term in cultural_terms if term in text.lower())
    quality_scores['cultural_context'] = min(cultural_mentions / 2.0, 1.0)  # Normalize to max 2 mentions

    # 5. Overall quality score
    weights = {
        'diacritic_usage': 0.3,
        'croatian_vocabulary': 0.3,
        'sentence_structure': 0.2,
        'cultural_context': 0.2
    }

    quality_scores['overall'] = sum(
        score * weights[metric]
        for metric, score in quality_scores.items()
        if metric in weights
    )

    return quality_scores

# Test Croatian quality assessment
test_texts = {
    "High Quality Croatian": "Zagreb je glavni i najveći grad Republike Hrvatske, smješten na sjeverozapadu zemlje uz rijeku Savu. Grad je važno političko, gospodarsko i kulturno središte, te dom mnogih značajnih institucija i kulturnih znamenitosti.",

    "Medium Quality": "Zagreb je glavni grad Hrvatske. Ima puno stanovnika i nalazi se u hrvatskoj. Grad je poznat po svojoj ljepoti.",

    "Low Quality (No Diacritics)": "Zagreb je glavni grad Hrvatske. Grad ima puno stanovnika i poznat je po svojoj lepoti i kulturi.",

    "Cultural Context Rich": "Dubrovnik, poznat kao 'biser Jadrana', je grad s bogatom povijesti smješten u Dalmaciji. UNESCO je uvrstio Dubrovnik na popis svjetske baštine zbog njegovih izuzetnih kulturnih vrijednosti."
}

print("🇭🇷 Croatian Language Quality Assessment:")
print("=" * 60)

quality_results = []

for category, text in test_texts.items():
    scores = assess_croatian_quality(text)

    print(f"\n📝 {category}:")
    print(f"Text: {text[:80]}...")
    print(f"\n📊 Quality Scores:")

    for metric, score in scores.items():
        if metric != 'overall':
            print(f"  {metric.replace('_', ' ').title()}: {score:.3f}")

    print(f"  🎯 Overall Quality: {scores['overall']:.3f}")

    # Quality rating
    overall = scores['overall']
    if overall >= 0.8:
        rating = "🟢 Excellent"
    elif overall >= 0.6:
        rating = "🟡 Good"
    elif overall >= 0.4:
        rating = "🟠 Fair"
    else:
        rating = "🔴 Poor"

    print(f"  Rating: {rating}")

    quality_results.append({
        'category': category,
        'overall_score': scores['overall'],
        'scores': scores
    })

    print("-" * 40)

# Visualize quality assessment
categories = [r['category'] for r in quality_results]
overall_scores = [r['overall_score'] for r in quality_results]

# Detailed scores breakdown
metrics = ['diacritic_usage', 'croatian_vocabulary', 'sentence_structure', 'cultural_context']
metric_scores = {metric: [r['scores'][metric] for r in quality_results] for metric in metrics}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Overall quality scores
bars = ax1.barh(categories, overall_scores, color=['green', 'orange', 'red', 'blue'])
ax1.set_title('Overall Croatian Quality Scores', weight='bold')
ax1.set_xlabel('Quality Score')
ax1.set_xlim(0, 1)

for bar, score in zip(bars, overall_scores):
    ax1.text(bar.get_width() + 0.02, bar.get_y() + bar.get_height()/2,
             f'{score:.3f}', ha='left', va='center', weight='bold')

# Quality metrics breakdown
x = np.arange(len(categories))
width = 0.2
colors = ['lightblue', 'lightgreen', 'lightsalmon', 'plum']

for i, (metric, scores) in enumerate(metric_scores.items()):
    ax2.bar(x + i * width, scores, width, label=metric.replace('_', ' ').title(), color=colors[i])

ax2.set_title('Quality Metrics Breakdown', weight='bold')
ax2.set_ylabel('Score')
ax2.set_xlabel('Text Category')
ax2.set_xticks(x + width * 1.5)
ax2.set_xticklabels([cat.replace(' ', '\n') for cat in categories], fontsize=8)
ax2.legend(fontsize=8)
ax2.set_ylim(0, 1)

plt.tight_layout()
plt.show()

print("📊 Croatian quality assessment complete")
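
The vocabulary metric above checks whether any common word appears as a substring of each whitespace-separated word. As a hedged alternative, the sketch below matches whole tokens instead, which avoids hits such as `je` inside the English word "project". The names `tokenize`, `vocabulary_ratio`, and `sample_vocab` are hypothetical and not part of the notebook's code. Exact matching would also stop stem entries such as `poznaj` from matching inflected forms.

```python
import re
from typing import List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    # In Python 3, \w matches Unicode letters, so č, ć, š, ž and đ stay inside tokens
    return re.findall(r"\w+", text.lower())


def vocabulary_ratio(text: str, vocabulary: Set[str]) -> float:
    """Return the share of tokens that exactly match an entry in `vocabulary`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


# Small hypothetical vocabulary for illustration
sample_vocab = {'je', 'se', 'na', 'grad', 'glavni'}
croatian = "Zagreb je glavni grad."
english = "The project objective is clear."

# Substring matching, as in the cell above, counts 'je' inside English words
substring_hits = sum(
    1 for word in english.lower().split()
    if any(entry in word for entry in sample_vocab)
)

print(f"Croatian, exact tokens: {vocabulary_ratio(croatian, sample_vocab):.2f}")
print(f"English, exact tokens:  {vocabulary_ratio(english, sample_vocab):.2f}")
print(f"English, substring hits: {substring_hits}")
```

Whether exact matching is the better choice depends on the word list: stems need a prefix check or a lemmatizer, while short function words are safer with exact matching.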

## 9. Best Practices and Optimization

Key lessons learned for Croatian RAG generation:

In [None]:
# Best practices summary
best_practices = {
    "Prompt Engineering": {
        "✅ Do": [
            "Use Croatian system prompts consistently",
            "Include cultural context instructions",
            "Specify formal Croatian language style",
            "Match prompt template to query type",
            "Preserve diacritic encoding throughout"
        ],
        "❌ Don't": [
            "Mix languages in system prompts",
            "Ignore Croatian cultural context",
            "Use generic English templates",
            "Assume model understands Croatian nuances"
        ]
    },
    "Model Configuration": {
        "✅ Do": [
            "Set temperature 0.6-0.8 for Croatian",
            "Use sufficient max tokens (1500-2000)",
            "Enable diacritic preservation",
            "Configure timeout for longer responses"
        ],
        "❌ Don't": [
            "Use too low temperature (rigid responses)",
            "Use too high temperature (inconsistent quality)",
            "Limit tokens too strictly",
            "Ignore Croatian-specific settings"
        ]
    },
    "Response Processing": {
        "✅ Do": [
            "Parse and validate Croatian content",
            "Check for diacritic preservation",
            "Assess cultural context accuracy",
            "Monitor confidence scores",
            "Extract and validate source references"
        ],
        "❌ Don't": [
            "Accept responses without validation",
            "Ignore language detection results",
            "Skip quality assessment",
            "Trust confidence scores blindly"
        ]
    },
    "Performance Optimization": {
        "✅ Do": [
            "Cache common Croatian patterns",
            "Batch similar query types",
            "Monitor generation times",
            "Optimize context length",
            "Use async processing for scalability"
        ],
        "❌ Don't": [
            "Generate synchronously for multiple queries",
            "Ignore performance metrics",
            "Use excessive context length",
            "Skip error handling"
        ]
    }
}

print("🎯 Croatian RAG Generation Best Practices:")
print("=" * 60)

for category, practices in best_practices.items():
    print(f"\n📚 {category}:")

    if "✅ Do" in practices:
        print("\n✅ Best Practices:")
        for practice in practices["✅ Do"]:
            print(f"  • {practice}")

    if "❌ Don't" in practices:
        print("\n❌ Avoid:")
        for practice in practices["❌ Don't"]:
            print(f"  • {practice}")

    print("-" * 40)

# Performance optimization tips
print("\n⚡ Performance Optimization Tips:")
optimization_tips = [
    "Use qwen2.5:7b-instruct, the model used throughout this notebook, to balance Croatian support and speed",
    "Implement response caching for repeated queries",
    "Batch process multiple queries when possible",
    "Monitor GPU/CPU usage during generation",
    "Set appropriate timeout values (30-60s)",
    "Use streaming for long responses",
    "Implement fallback strategies for failures"
]

for tip in optimization_tips:
    print(f"  🔧 {tip}")

print("\n📈 Quality Improvement Strategies:")
quality_tips = [
    "Regularly assess Croatian diacritic usage",
    "Monitor cultural context accuracy",
    "Track confidence score distributions",
    "Validate source reference extraction",
    "Test with diverse Croatian query types",
    "Implement human feedback loops",
    "Update prompt templates based on performance"
]

for tip in quality_tips:
    print(f"  📊 {tip}")
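
Several of the tips above (response caching, async batching, timeouts, and fallback strategies) can be combined in one small wrapper. The following is a minimal sketch under stated assumptions: `fake_generate` is a hypothetical stand-in for the real generation call, and `generate_with_guardrails`, `generate_batch`, and `FALLBACK_ANSWER` are illustrative names, not part of the project's `OllamaClient` API.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

FALLBACK_ANSWER = "Odgovor trenutno nije dostupan."  # hypothetical fallback text


async def generate_with_guardrails(
    query: str,
    generate: Callable[[str], Awaitable[str]],
    cache: Dict[str, str],
    timeout_s: float = 30.0,
) -> str:
    """Return a cached answer if present; otherwise generate with a timeout."""
    if query in cache:
        return cache[query]
    try:
        answer = await asyncio.wait_for(generate(query), timeout=timeout_s)
    except Exception:
        # Timeouts and generation errors both fall back; the fallback is not cached
        return FALLBACK_ANSWER
    cache[query] = answer
    return answer


async def generate_batch(
    queries: List[str],
    generate: Callable[[str], Awaitable[str]],
    cache: Dict[str, str],
    timeout_s: float = 30.0,
) -> List[str]:
    """Run all queries concurrently and return answers in input order."""
    tasks = [generate_with_guardrails(q, generate, cache, timeout_s) for q in queries]
    return await asyncio.gather(*tasks)


async def fake_generate(query: str) -> str:
    """Hypothetical stand-in for an LLM call; queries containing 'spor' are slow."""
    await asyncio.sleep(1.0 if "spor" in query else 0.01)
    return f"Odgovor: {query}"


cache: Dict[str, str] = {}
# In a notebook cell, use `await generate_batch(...)` instead of asyncio.run(...)
answers = asyncio.run(
    generate_batch(["Što je Zagreb?", "spor upit"], fake_generate, cache, timeout_s=0.1)
)
print(answers)
```

The short timeout here only keeps the demonstration fast; the 30-60s range suggested above is the more realistic setting for local generation. Not caching the fallback means a later retry of the same query can still succeed.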

## 10. Summary and Next Steps

### What We've Accomplished in Step 4

✅ **Ollama Integration**: Built robust client for local LLM processing

✅ **Croatian Prompt Engineering**: Created specialized templates for different query types

✅ **Response Quality Assessment**: Implemented Croatian language quality metrics

✅ **Performance Monitoring**: Added comprehensive generation tracking

✅ **Cultural Context**: Integrated Croatian cultural awareness throughout

### Key Technical Achievements

1. **Local LLM Processing**: Complete privacy and control over generation
2. **Croatian Language Support**: Diacritic preservation and cultural context
3. **Adaptive Prompting**: Different templates for different query types
4. **Quality Assessment**: Multi-metric evaluation system
5. **Performance Optimization**: Async processing and monitoring

### Next: Step 5 - Complete Pipeline Integration

In the final step, we'll integrate all components into a complete RAG system:

- **End-to-End Pipeline**: Connect preprocessing → vector DB → retrieval → generation
- **System Orchestration**: Manage the complete workflow
- **Error Handling**: Robust failure recovery
- **Performance Optimization**: System-wide efficiency improvements
- **Evaluation Framework**: Complete system assessment

The generation system is now ready to produce high-quality Croatian responses using retrieved context. Let's move on to integrate everything into our final RAG pipeline!