# Enhanced Multi-Judge AI System

This notebook implements an enhanced multi-judge AI system that sends a query to several language models, evaluates their responses, and synthesizes a combined answer.

**Key Features:**

- **Parallel Response Generation:** Generates responses from multiple models simultaneously.
- **Advanced Evaluation:** Employs detailed metrics (factual accuracy, coherence, relevance, etc.) and statistical analysis to evaluate model performance.
- **Targeted Regeneration:** Improves low-scoring responses based on evaluation feedback.
- **Collaborative Discussion Simulation:** Assigns roles (Knowledge Integrator, Perspective Analyst, Clarifier, Synthesizer) to models to simulate a discussion and generate a synthesized consensus answer.
- **Benchmark System:** Includes a comprehensive benchmark system to statistically evaluate model performance across different question categories and difficulties.
- **Optimized Debate System:** A faster, albeit simpler, system for rapid response generation and evaluation.

The system aims to provide more robust and reliable answers by aggregating insights and mitigating individual model weaknesses through a structured process.
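
The parallel-generation step can be illustrated with a minimal sketch. It assumes a hypothetical `fake_model_call` coroutine standing in for a real API request, and the model names are placeholders; the notebook's own client and model list differ. The point is only that `asyncio.gather(..., return_exceptions=True)` lets one failing model drop out without cancelling the rest.

```python
import asyncio
import random
from typing import Dict, List


async def fake_model_call(model: str, query: str) -> str:
    """Hypothetical stand-in for a real API request to one model."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulate network latency
    if model == "model-c":
        raise RuntimeError("simulated provider failure")
    return f"{model} answer to: {query}"


async def generate_parallel(models: List[str], query: str) -> Dict[str, str]:
    """Query every model concurrently and keep only the successful answers."""
    results = await asyncio.gather(
        *(fake_model_call(m, query) for m in models),
        return_exceptions=True,  # a failing model is returned as an exception object
    )
    # gather() preserves input order, so results line up with the model list
    return {
        model: result
        for model, result in zip(models, results)
        if not isinstance(result, Exception)
    }
```

In a notebook an event loop is already running, so the coroutine would be called as `await generate_parallel([...], query)`; in a plain script it would be wrapped in `asyncio.run(...)`.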

### Main version

In [None]:
!pip install language_tool_python

Collecting language_tool_python
  Downloading language_tool_python-2.9.4-py3-none-any.whl.metadata (55 kB)
Downloading language_tool_python-2.9.4-py3-none-any.whl (55 kB)
Installing collected packages: language_tool_python
Successfully installed language_tool_python-2.9.4


In [None]:
import asyncio
import json
import logging
from typing import List, Dict, Any, Optional, Deque, Tuple
import aiohttp
from dataclasses import dataclass, field
from enum import Enum
import time
from collections import deque, defaultdict
from concurrent.futures import ThreadPoolExecutor
import statistics
import re
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from difflib import SequenceMatcher
import random

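# Download required NLTK data; newer NLTK releases look up 'punkt_tab' for word_tokenize
for resource in ('punkt', 'punkt_tab'):
    try:
        nltk.data.find(f'tokenizers/{resource}')
    except LookupError:
        nltk.download(resource)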

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)

@dataclass
class KnowledgeGapAnalysis:
    """Analysis of knowledge gaps in model responses"""
    missing_keywords: List[str] = field(default_factory=list)
    factual_discrepancies: List[str] = field(default_factory=list)
    logical_inconsistencies: List[str] = field(default_factory=list)
    semantic_gaps: float = 0.0
    structural_differences: float = 0.0

@dataclass
class DiscussionMetrics:
    """Comprehensive metrics for model evaluation"""
    factual_accuracy: float = 0.0
    logical_consistency: float = 0.0
    reference_alignment: float = 0.0
    coherence_score: float = 0.0
    completeness_score: float = 0.0
    relevance_score: float = 0.0
    readability_score: float = 0.0
    grammar_score: float = 0.0
    vocabulary_diversity: float = 0.0
    bleu_score: float = 0.0
    semantic_similarity: float = 0.0
    response_time: float = 0.0
    token_efficiency: float = 0.0
    confidence_interval: Tuple[float, float] = (0.0, 0.0)
    standard_error: float = 0.0
    knowledge_gaps: KnowledgeGapAnalysis = field(default_factory=KnowledgeGapAnalysis)
    overall_score: float = 0.0
    category_scores: Dict[str, float] = field(default_factory=dict)
    collaboration_score: float = 0.0

@dataclass
class DiscussionQuestion:
    """Standardized discussion question structure"""
    id: str
    question: str
    category: str
    difficulty: str
    reference_answer: str
    evaluation_criteria: List[str]
    expected_keywords: List[str] = field(default_factory=list)
    max_tokens: int = 1500
    perspectives: List[str] = field(default_factory=list)

class DiscussionDataset:
    """Standardized discussion questions across different domains"""
    @staticmethod
    def get_comprehensive_dataset() -> List[DiscussionQuestion]:
        return [
            DiscussionQuestion(
                id="ethics_001",
                question="What are the main ethical concerns regarding AI decision-making in healthcare?",
                category="ethics",
                difficulty="hard",
                reference_answer="Key concerns include patient privacy, algorithmic bias, accountability for decisions, transparency in AI reasoning, consent for AI involvement, and ensuring human oversight in critical decisions.",
                evaluation_criteria=["completeness", "relevance", "coherence"],
                expected_keywords=["privacy", "bias", "accountability", "transparency", "consent", "oversight"],
                perspectives=["patient perspective", "medical professional perspective", "AI developer perspective", "regulatory perspective"]
            )
        ]

class AdvancedEvaluationSystem:
    def __init__(self):
        self.tfidf_vectorizer = TfidfVectorizer(stop_words='english')
        self.smoothing = SmoothingFunction().method1
        self.knowledge_cache = defaultdict(dict)

    def calculate_comprehensive_metrics(self, response: str, reference: str,
                                     question: DiscussionQuestion,
                                     response_time: float) -> DiscussionMetrics:
        metrics = DiscussionMetrics()
        metrics.response_time = response_time
        metrics.token_efficiency = len(response.split()) / max(response_time, 0.1)

        # Only calculate metrics if we have a reference answer
        if reference:
            metrics.factual_accuracy = self._calculate_factual_accuracy(response, reference, question)
            metrics.logical_consistency = self._calculate_logical_consistency(response)
            metrics.reference_alignment = self._calculate_reference_alignment(response, reference)
            metrics.completeness_score = self._calculate_completeness(response, question)
            metrics.semantic_similarity = self._calculate_semantic_similarity(response, reference)
            metrics.bleu_score = self._calculate_bleu_score(response, reference)
            metrics.knowledge_gaps = self._analyze_knowledge_gaps(response, reference, question)
            metrics.collaboration_score = self._calculate_collaboration_score(response, question)
        else:
            # Set high default values when no reference is available
            metrics.factual_accuracy = 1.0
            metrics.logical_consistency = 1.0
            metrics.reference_alignment = 1.0
            metrics.completeness_score = 1.0
            metrics.semantic_similarity = 1.0
            metrics.bleu_score = 1.0
            metrics.collaboration_score = 1.0
            metrics.knowledge_gaps = KnowledgeGapAnalysis()

        # Always calculate these metrics
        metrics.coherence_score = self._calculate_coherence(response)
        metrics.relevance_score = self._calculate_relevance(response, question.question)
        metrics.readability_score = self._calculate_readability(response)
        metrics.grammar_score = self._calculate_grammar_score(response)
        metrics.vocabulary_diversity = self._calculate_vocabulary_diversity(response)

        metrics.overall_score = self._calculate_overall_score(metrics)
        return metrics

    def _calculate_collaboration_score(self, response: str, question: DiscussionQuestion) -> float:
        if not question.perspectives:
            return 1.0

        perspective_matches = sum(
            1 for perspective in question.perspectives
            if perspective.lower() in response.lower()
        )
        return perspective_matches / len(question.perspectives)

    def _analyze_knowledge_gaps(self, response: str, reference: str,
                               question: DiscussionQuestion) -> KnowledgeGapAnalysis:
        analysis = KnowledgeGapAnalysis()
        if not reference:
            return analysis

        response_lower = response.lower()
        analysis.missing_keywords = [
            kw for kw in question.expected_keywords
            if kw.lower() not in response_lower
        ]

        analysis.structural_differences = 1 - SequenceMatcher(
            None, response, reference
        ).ratio()

        ref_sentences = [s.strip() for s in reference.split('.') if s.strip()]
        resp_sentences = [s.strip() for s in response.split('.') if s.strip()]

        # zip() stops at the shorter list, so no explicit bounds check is needed
        for i, (ref, resp) in enumerate(zip(ref_sentences, resp_sentences)):
            if ("not " + ref.lower() in resp.lower() or
                ref.lower() in "not " + resp.lower()):
                analysis.factual_discrepancies.append(
                    f"Contradiction in sentence {i+1}: '{ref}' vs '{resp}'"
                )

        try:
            docs = [response, reference]
            tfidf_matrix = self.tfidf_vectorizer.fit_transform(docs)
            analysis.semantic_gaps = 1 - cosine_similarity(
                tfidf_matrix[0:1], tfidf_matrix[1:2]
            )[0][0]
        except Exception:  # e.g. ValueError from an empty TF-IDF vocabulary
            analysis.semantic_gaps = 1.0

        return analysis

    def generate_improvement_prompt(self, response: str, metrics: DiscussionMetrics,
                                  question: DiscussionQuestion) -> str:
        gaps = metrics.knowledge_gaps
        prompt_parts = []

        if gaps.missing_keywords:
            prompt_parts.append(f"Include these key concepts: {', '.join(gaps.missing_keywords)}")

        if gaps.factual_discrepancies:
            prompt_parts.append("Correct these factual issues:\n- " + "\n- ".join(gaps.factual_discrepancies[:3]))

        if metrics.logical_consistency < 0.7:
            prompt_parts.append("Improve logical flow by ensuring all points connect coherently")

        if metrics.completeness_score < 0.7:
            missing_criteria = [
                crit for crit in question.evaluation_criteria
                if not any(kw in response.lower()
                          for kw in self._get_criterion_keywords(crit))
            ]
            if missing_criteria:
                prompt_parts.append("Address these missing evaluation criteria:\n- " + "\n- ".join(missing_criteria))

        if metrics.semantic_similarity < 0.6:
            prompt_parts.append("Align more closely with the reference answer's semantic meaning")

        if metrics.collaboration_score < 0.5 and question.perspectives:
            missing_perspectives = [
                p for p in question.perspectives
                if p.lower() not in response.lower()
            ]
            prompt_parts.append(f"Consider these additional perspectives: {', '.join(missing_perspectives)}")

        base_prompt = (
            f"Improve this response to '{question.question}':\n\n"
            f"Current response: {response[:500]}\n\n"
            "Specific improvements needed:\n"
        )

        if not prompt_parts:
            return base_prompt + "- No specific issues identified"
        return base_prompt + "\n".join(f"- {part}" for part in prompt_parts)

    def _calculate_factual_accuracy(self, response: str, reference: str,
                                  question: DiscussionQuestion) -> float:
        response_lower = response.lower()
        keyword_score = sum(1 for keyword in question.expected_keywords
                         if keyword.lower() in response_lower) / max(len(question.expected_keywords), 1)

        numbers_in_response = re.findall(r'\d+\.?\d*', response)
        numbers_in_reference = re.findall(r'\d+\.?\d*', reference)

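        # With no numbers in the reference there is nothing to check, so score 1.0
        if numbers_in_reference:
            numerical_accuracy = sum(
                1 for num in numbers_in_reference if num in numbers_in_response
            ) / len(numbers_in_reference)
        else:
            numerical_accuracy = 1.0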

        return (keyword_score + numerical_accuracy) / 2

    def _calculate_logical_consistency(self, response: str) -> float:
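        # Drop empty fragments (e.g. after a trailing period) so they do not dilute the ratio
        sentences = [s for s in response.split('.') if s.strip()]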
        contradiction_markers = ['however', 'but', 'although', 'despite', 'nevertheless']
        contradictions = sum(1 for sentence in sentences
                           for marker in contradiction_markers
                           if marker in sentence.lower())
        return max(0, 1 - (contradictions / max(len(sentences), 1)))

    def _calculate_reference_alignment(self, response: str, reference: str) -> float:
        try:
            docs = [response, reference]
            tfidf_matrix = self.tfidf_vectorizer.fit_transform(docs)
            return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        except Exception:  # e.g. ValueError from an empty TF-IDF vocabulary
            return 0.0

    def _calculate_coherence(self, response: str) -> float:
        sentences = [s.strip() for s in response.split('.') if s.strip()]
        if len(sentences) < 2:
            return 1.0

        transition_words = ['therefore', 'however', 'moreover', 'furthermore',
                          'additionally', 'consequently', 'thus', 'hence']
        transitions = sum(1 for sentence in sentences[1:]
                         for word in transition_words
                         if word in sentence.lower())

        return min(1.0, transitions / max(len(sentences) - 1, 1) + 0.5)

    def _calculate_completeness(self, response: str, question: DiscussionQuestion) -> float:
        criteria_coverage = sum(
            1 for criterion in question.evaluation_criteria
            if any(kw in response.lower()
                  for kw in self._get_criterion_keywords(criterion)))
        return criteria_coverage / max(len(question.evaluation_criteria), 1)

    def _get_criterion_keywords(self, criterion: str) -> List[str]:
        keyword_map = {
            'factual_accuracy': ['fact', 'accurate', 'correct', 'true'],
            'completeness': ['complete', 'comprehensive', 'thorough', 'detailed'],
            'logical_consistency': ['logic', 'consistent', 'coherent', 'reasoning'],
            'creativity': ['innovative', 'creative', 'novel', 'unique'],
            'technical_accuracy': ['technical', 'precise', 'specific', 'accurate'],
            'clarity': ['clear', 'understandable', 'simple', 'explain'],
            'examples_quality': ['example', 'instance', 'case', 'illustration']
        }
        return keyword_map.get(criterion, [])

    def _calculate_relevance(self, response: str, question: str) -> float:
        try:
            docs = [response, question]
            tfidf_matrix = self.tfidf_vectorizer.fit_transform(docs)
            return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
        except Exception:  # e.g. ValueError from an empty TF-IDF vocabulary
            return 0.0

    def _calculate_readability(self, text: str) -> float:
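        # Drop empty fragments so a trailing period does not count as an extra sentence
        sentences = [s for s in text.split('.') if s.strip()]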
        words = text.split()
        if not sentences or not words:
            return 0.0

        avg_sentence_length = len(words) / len(sentences)
        avg_word_length = sum(len(word) for word in words) / len(words)
        sentence_score = 1 - abs(avg_sentence_length - 17.5) / 17.5
        word_score = 1 - abs(avg_word_length - 5) / 5
        return max(0, (sentence_score + word_score) / 2)

    def _calculate_grammar_score(self, text: str) -> float:
        common_errors = ['teh', 'recieve', 'seperate', 'definately', 'occured']
        error_count = sum(1 for error in common_errors if error in text.lower())
        sentence_endings = text.count('.') + text.count('!') + text.count('?')
        sentences = len([s for s in text.split('.') if s.strip()])
        punctuation_score = min(1.0, sentence_endings / max(sentences, 1))
        error_penalty = max(0, 1 - error_count * 0.1)
        return (punctuation_score + error_penalty) / 2

    def _calculate_vocabulary_diversity(self, text: str) -> float:
        words = [word.lower() for word in re.findall(r'\b\w+\b', text)]
        return len(set(words)) / len(words) if words else 0.0

    def _calculate_bleu_score(self, response: str, reference: str) -> float:
        try:
            response_tokens = word_tokenize(response.lower())
            reference_tokens = word_tokenize(reference.lower())
            return sentence_bleu([reference_tokens], response_tokens,
                               smoothing_function=self.smoothing)
        except Exception:  # e.g. LookupError when NLTK tokenizer data is missing
            return 0.0

    def _calculate_semantic_similarity(self, response: str, reference: str) -> float:
        return self._calculate_reference_alignment(response, reference)

    def _calculate_overall_score(self, metrics: DiscussionMetrics) -> float:
        weights = {
            'factual_accuracy': 0.20,
            'logical_consistency': 0.15,
            'reference_alignment': 0.15,
            'coherence_score': 0.10,
            'completeness_score': 0.15,
            'relevance_score': 0.10,
            'readability_score': 0.05,
            'grammar_score': 0.05,
            'vocabulary_diversity': 0.05
        }
        return sum(getattr(metrics, metric_name, 0) * weight
                 for metric_name, weight in weights.items())

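class ModelType(Enum):
    # NOTE: member names do not describe the model IDs they map to
    # (e.g. GEMMA_3N_E2B points at a Tencent Hunyuan model).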
    GEMMA_3N_E2B = "tencent/hunyuan-a13b-instruct:free"
    QWEN3_4B = "mistralai/mistral-small-3.2-24b-instruct:free"
    GEMMA_3_12B = "z-ai/glm-4.5-air:free"
    KIMI_DEV = "moonshotai/kimi-dev-72b:free"
    DOLPHIN_MISTRAL = "cognitivecomputations/dolphin-mistral-24b-venice-edition:free"

class DiscussionRole(Enum):
    PERSPECTIVE_ANALYST = "perspective_analyst"
    KNOWLEDGE_INTEGRATOR = "knowledge_integrator"
    SYNTHESIZER = "synthesizer"
    FACT_CHECKER = "fact_checker"
    CLARIFIER = "clarifier"

@dataclass
class ModelResponse:
    model_name: str
    answer: str
    generation_time: float
    token_count: int = 0
    discussion_role: DiscussionRole = DiscussionRole.PERSPECTIVE_ANALYST
    reasoning_chain: Optional[List[str]] = None
    error: Optional[str] = None
    discussion_metrics: Optional[DiscussionMetrics] = None

@dataclass
class DiscussionRound:
    round_number: int
    contributions: List[str]
    perspectives_covered: List[str]
    synthesis: Optional[str] = None
    contribution_metrics: Dict[str, DiscussionMetrics] = field(default_factory=dict)
    synthesis_metrics: Optional[DiscussionMetrics] = None
    # Maps model name -> {'role': ..., 'reason': ...}
    role_assignments: Dict[str, Dict[str, str]] = field(default_factory=dict)
    improvement_scores: Dict[str, float] = field(default_factory=dict)
    discussion_metrics: Dict[str, float] = field(default_factory=dict)
    questions_raised: List[str] = field(default_factory=list)

@dataclass
class DiscussionMemory:
    history: Deque[DiscussionRound]
    current_consensus: Optional[str] = None
    knowledge_gaps: Optional[List[str]] = None
    open_questions: Optional[List[str]] = None

class OpenRouterClient:
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.api_key = api_key
        self.base_url = base_url
        self.session = None
        self.timeout = aiohttp.ClientTimeout(total=90, connect=15, sock_read=60)
        self.max_retries = 3
        self.base_delay = 2

    async def __aenter__(self):
        logger.info("Initializing OpenRouter client session...")
        connector = aiohttp.TCPConnector(
            limit=10,
            ttl_dns_cache=300,
            use_dns_cache=True,
            keepalive_timeout=60,
            enable_cleanup_closed=True
        )
        self.session = aiohttp.ClientSession(
            timeout=self.timeout,
            connector=connector
        )
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            logger.info("Closing OpenRouter client session...")
            await self.session.close()

    async def generate_response(self, model: str, messages: List[Dict],
                               temperature: float = 0.7, max_tokens: int = 1500) -> Dict:
        logger.info(f"Generating response from model: {model}")
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://your-app.com",
            "X-Title": "Enhanced Discussion System",
            "User-Agent": "Enhanced-Discussion-System/1.0"
        }

        payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "stream": False
        }

        for attempt in range(self.max_retries):
            try:
                if attempt > 0:
                    delay = self.base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
                    logger.info(f"Retrying {model} after {delay:.2f}s (attempt {attempt + 1}/{self.max_retries})")
                    await asyncio.sleep(delay)

                # Start the clock after any backoff sleep so response_time covers only the request
                start_time = time.time()

                async with self.session.post(
                    f"{self.base_url}/chat/completions",
                    headers=headers,
                    json=payload
                ) as response:
                    if response.status == 429:
                        retry_after = int(response.headers.get('Retry-After', 10))
                        logger.warning(f"Rate limited. Waiting {retry_after} seconds...")
                        await asyncio.sleep(retry_after)
                        continue

                    if 500 <= response.status < 600:
                        logger.warning(f"Server error {response.status} for {model}, will retry")
                        continue

                    if 400 <= response.status < 500 and response.status != 408:
                        error_text = await response.text()
                        logger.error(f"Client error {response.status} for {model}: {error_text}")
                        return {
                            "error": f"HTTP {response.status}: {error_text[:200]}...",
                            "status": response.status
                        }

                    if response.status == 408:
                        logger.warning(f"Request timeout (408) for {model}, will retry")
                        continue

                    response.raise_for_status()
                    result = await response.json()

                    return {
                        "content": result["choices"][0]["message"]["content"],
                        "response_time": time.time() - start_time,
                        "usage": result.get("usage", {}),
                        "tokens": result.get("usage", {}).get("total_tokens", 0),
                        "model": model,
                        "attempt": attempt + 1
                    }

            except asyncio.TimeoutError:
                logger.warning(f"Timeout for {model} on attempt {attempt + 1}")
                if attempt == self.max_retries - 1:
                    return {
                        "error": f"Timeout after {self.max_retries} attempts",
                        "status": 408
                    }
                continue

            except aiohttp.ClientError as e:
                logger.error(f"HTTP client error with model {model}: {str(e)}")
                if attempt == self.max_retries - 1:
                    return {
                        "error": f"HTTP client error: {str(e)}",
                        "status": getattr(e, "status", None)
                    }
                continue

            except Exception as e:
                logger.error(f"Unexpected error with model {model}: {str(e)}")
                if attempt == self.max_retries - 1:
                    return {
                        "error": f"Unexpected error: {str(e)}"
                    }
                continue

        return {
            "error": f"Failed after {self.max_retries} attempts",
            "status": None
        }

class DiscussionSystem:
    def __init__(self):
        self.memory = DiscussionMemory(
            history=deque(maxlen=3),
            knowledge_gaps=[],
            open_questions=[]
        )
        self.executor = ThreadPoolExecutor(max_workers=4)
        self.evaluation_system = AdvancedEvaluationSystem()
        self.benchmark_dataset = DiscussionDataset.get_comprehensive_dataset()
        self.regeneration_attempts = defaultdict(int)
        self.MAX_REGENERATIONS = 2

    async def conduct_discussion(self, query: str, initial_responses: List[ModelResponse]) -> DiscussionRound:
        current_round = DiscussionRound(
            round_number=len(self.memory.history) + 1,
            contributions=[],
            perspectives_covered=[]
        )

        benchmark_question = next(
            (q for q in self.benchmark_dataset if q.question.lower() in query.lower()),
            None
        )

        evaluated_responses = []
        for response in initial_responses:
            if not response.error:
                response.discussion_metrics = self.evaluation_system.calculate_comprehensive_metrics(
                    response.answer,
                    benchmark_question.reference_answer if benchmark_question else "",
                    benchmark_question if benchmark_question else DiscussionQuestion(
                        id="custom",
                        question=query,
                        category="custom",
                        difficulty="medium",
                        reference_answer="",
                        evaluation_criteria=[]
                    ),
                    response.generation_time
                )
                evaluated_responses.append(response)

        regenerated_responses = await self._perform_targeted_regeneration(
            evaluated_responses, query, benchmark_question
        )

        all_responses = evaluated_responses + regenerated_responses
        valid_responses = [r for r in all_responses if not r.error]
        if not valid_responses:
            logger.error("No valid responses for discussion")
            return current_round

        roles_assigned = self._assign_roles_with_explanations(valid_responses, benchmark_question)

        current_round.role_assignments = {
            response.model_name: {
                'role': response.discussion_role.value,
                'reason': f"Assigned {response.discussion_role.value} role due to benchmark score of {response.discussion_metrics.overall_score:.2f}"
                          if response.discussion_metrics else "Assigned based on response characteristics"
            }
            for response in roles_assigned
        }

        discussion_tasks = []
        for response in roles_assigned:
            if response.discussion_role == DiscussionRole.PERSPECTIVE_ANALYST and benchmark_question:
                task = self._analyze_perspective(
                    response, query, benchmark_question
                )
            elif response.discussion_role == DiscussionRole.KNOWLEDGE_INTEGRATOR:
                task = self._integrate_knowledge(
                    response, query, valid_responses, benchmark_question
                )
            elif response.discussion_role == DiscussionRole.CLARIFIER:
                task = self._generate_clarifying_questions(
                    response, query, valid_responses, benchmark_question
                )
            else:
                continue
            discussion_tasks.append(task)

        try:
            results = await asyncio.gather(*discussion_tasks, return_exceptions=True)

            for result in results:
                if isinstance(result, Exception):
                    logger.error(f"Discussion task failed: {str(result)}")
                    continue

                if 'contribution' in result:
                    current_round.contributions.append(result['contribution'])
                if 'questions' in result:
                    current_round.questions_raised.extend(result['questions'])
                if 'perspectives' in result and benchmark_question:
                    current_round.perspectives_covered.extend(result['perspectives'])

            if benchmark_question:
                self._evaluate_contributions(current_round, benchmark_question)

                synthesizer = next(
                    (r for r in roles_assigned if r.discussion_role == DiscussionRole.SYNTHESIZER),
                    None
                )
                if synthesizer:
                    await self._synthesize_discussion(
                        current_round, synthesizer, query, benchmark_question
                    )

        except Exception as e:
            logger.error(f"Discussion round failed: {str(e)}")

        self.memory.history.append(current_round)
        return current_round

    async def _perform_targeted_regeneration(self, responses: List[ModelResponse],
                                           query: str,
                                           benchmark_question: Optional[DiscussionQuestion]
                                          ) -> List[ModelResponse]:
        regenerated = []

        if not benchmark_question:
            return regenerated

        for response in responses:
            if (response.discussion_metrics and
                response.discussion_metrics.overall_score < 0.6 and
                self.regeneration_attempts[response.model_name] < self.MAX_REGENERATIONS):

                logger.info(f"Attempting regeneration for {response.model_name} (score: {response.discussion_metrics.overall_score:.2f})")

                prompt = self.evaluation_system.generate_improvement_prompt(
                    response.answer,
                    response.discussion_metrics,
                    benchmark_question
                )

                messages = [
                    {"role": "system", "content": "Please improve your previous response based on the feedback."},
                    {"role": "user", "content": prompt}
                ]

                try:
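                    # Read the API key from the environment instead of hard-coding a secret.
                    # The variable name OPENROUTER_API_KEY is an assumption; set it before running.
                    import os
                    async with OpenRouterClient(os.environ.get("OPENROUTER_API_KEY", "")) as client: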
                        result = await client.generate_response(
                            response.model_name,
                            messages,
                            temperature=0.5
                        )

                        if result and "error" not in result:
                            new_metrics = self.evaluation_system.calculate_comprehensive_metrics(
                                result["content"],
                                benchmark_question.reference_answer,
                                benchmark_question,
                                result["response_time"]
                            )

                            if (new_metrics.overall_score >
                                response.discussion_metrics.overall_score + 0.1):
                                regenerated.append(ModelResponse(
                                    model_name=response.model_name,
                                    answer=result["content"],
                                    generation_time=result["response_time"],
                                    token_count=result.get("tokens", 0),
                                    discussion_metrics=new_metrics,
                                    discussion_role=response.discussion_role
                                ))
                                self.regeneration_attempts[response.model_name] += 1
                                logger.info(f"Regeneration improved score from {response.discussion_metrics.overall_score:.2f} to {new_metrics.overall_score:.2f}")
                except Exception as e:
                    logger.error(f"Regeneration failed for {response.model_name}: {str(e)}")

        return regenerated

    def _evaluate_contributions(self, discussion_round: DiscussionRound,
                              benchmark_question: DiscussionQuestion):
        for i, contribution in enumerate(discussion_round.contributions):
            discussion_round.contribution_metrics[f"contribution_{i}"] = (
                self.evaluation_system.calculate_comprehensive_metrics(
                    contribution,
                    benchmark_question.reference_answer,
                    benchmark_question,
                    0
                )
            )

    async def _synthesize_discussion(self, discussion_round: DiscussionRound,
                                   synthesizer: ModelResponse, query: str,
                                   benchmark_question: DiscussionQuestion):
        try:
            prompt = self._create_synthesis_prompt(
                query,
                discussion_round.contributions,
                discussion_round.contribution_metrics,
                benchmark_question,
                discussion_round.questions_raised
            )

            messages = [
                {"role": "system", "content": "Synthesize the discussion by combining insights and filling knowledge gaps."},
                {"role": "user", "content": prompt}
            ]

            async with OpenRouterClient("YOUR_OPENROUTER_API_KEY") as client:  # placeholder; supply your own key
                result = await client.generate_response(
                    synthesizer.model_name,
                    messages,
                    temperature=0.3
                )

                if result and "error" not in result:
                    discussion_round.synthesis = result["content"]
                    self.memory.current_consensus = result["content"]

                    synthesis_metrics = self.evaluation_system.calculate_comprehensive_metrics(
                        result["content"],
                        benchmark_question.reference_answer,
                        benchmark_question,
                        result["response_time"]
                    )
                    discussion_round.synthesis_metrics = synthesis_metrics

                    discussion_round.discussion_metrics = self._calculate_discussion_metrics(
                        discussion_round.contributions,
                        result["content"],
                        benchmark_question
                    )

                    initial_scores = [
                        m.overall_score for m in
                        discussion_round.contribution_metrics.values()
                    ]
                    if initial_scores:
                        discussion_round.improvement_scores = {
                            'initial_avg': statistics.mean(initial_scores),
                            'synthesis_score': synthesis_metrics.overall_score,
                            'improvement': (synthesis_metrics.overall_score -
                                          statistics.mean(initial_scores))
                        }
        except Exception as e:
            logger.error(f"Synthesis failed: {str(e)}")
            discussion_round.synthesis = "Failed to generate synthesis"

    def _calculate_discussion_metrics(self, contributions: List[str],
                                    synthesis: str, question: DiscussionQuestion) -> Dict[str, float]:
        metrics = {}
        contribution_scores = []
        for contribution in contributions:
            cont_metrics = self.evaluation_system.calculate_comprehensive_metrics(
                contribution, question.reference_answer, question, 0
            )
            contribution_scores.append(cont_metrics.overall_score)
        metrics['contribution_quality'] = statistics.mean(contribution_scores) if contribution_scores else 0

        if synthesis:
            synth_metrics = self.evaluation_system.calculate_comprehensive_metrics(
                synthesis, question.reference_answer, question, 0
            )
            metrics['synthesis_quality'] = synth_metrics.overall_score
            metrics['perspective_coverage'] = synth_metrics.collaboration_score
        else:
            metrics['synthesis_quality'] = 0
            metrics['perspective_coverage'] = 0

        return metrics

    def _create_synthesis_prompt(self, query: str, contributions: List[str],
                               contribution_metrics: Dict[str, DiscussionMetrics],
                               benchmark_question: DiscussionQuestion,
                               questions: List[str]) -> str:

        questions_text = (
            "Questions raised during discussion:\n- " + "\n- ".join(questions)
            if questions else "No questions were raised"
        )

        perspectives_text = (
            "Available perspectives to consider:\n- " + "\n- ".join(benchmark_question.perspectives)
            if benchmark_question.perspectives else ""
        )
        prompt = f'''
Synthesize the discussion on: {query}

Key contributions:
{self._format_contributions(contributions, contribution_metrics)}

{questions_text}

{perspectives_text}

Guidelines for synthesis:
1. Combine the most valuable insights from all contributions
2. Address any knowledge gaps identified
3. Maintain factual accuracy
4. Acknowledge different perspectives
5. Note any remaining uncertainties or open questions
6. Provide a clear, coherent summary
'''

        return prompt

    def _format_contributions(self, contributions: List[str],
                         metrics: Dict[str, DiscussionMetrics],
                         prefix: str = "contribution") -> str:
        formatted = []
        for i, cont in enumerate(contributions):
            metric_key = f"{prefix}_{i}"
            if metric_key in metrics:
                score = metrics[metric_key].overall_score
                gaps = ", ".join(metrics[metric_key].knowledge_gaps.missing_keywords)
                formatted.append(
                    f"- [Score: {score:.2f}] {cont[:200]}...\n"
                    f"  Missing concepts: {gaps or 'None'}"
                )
            else:
                formatted.append(f"- {cont[:200]}...")
        return "\n".join(formatted)

    async def _analyze_perspective(self, model: ModelResponse, query: str,
                                 question: DiscussionQuestion) -> Dict:
        try:
            perspective = random.choice(question.perspectives) if question.perspectives else "general"
            prompt = f"Analyze the following question from a {perspective} perspective:\n\n{query}"

            messages = [
                {"role": "system", "content": "You are analyzing a question from a specific perspective."},
                {"role": "user", "content": prompt}
            ]

            async with OpenRouterClient("YOUR_OPENROUTER_API_KEY") as client:  # placeholder; supply your own key
                result = await client.generate_response(
                    model.model_name,
                    messages,
                    temperature=0.7
                )

                if result and "error" not in result:
                    return {
                        'contribution': result["content"],
                        'perspectives': [perspective]
                    }
                return {
                    'contribution': f"Failed to generate perspective analysis ({result.get('error', 'unknown')})",
                    'perspectives': []
                }
        except Exception as e:
            logger.error(f"Failed to analyze perspective: {str(e)}")
            return {'contribution': '', 'perspectives': []}

    async def _integrate_knowledge(self, model: ModelResponse, query: str,
                                responses: List[ModelResponse],
                                question: Optional[DiscussionQuestion]) -> Dict:
        try:
            context = "\n\n".join([r.answer[:500] for r in responses[:3]])
            prompt = f"""Integrate knowledge from multiple sources to address:
{query}

Available information:
{context}

Guidelines:
1. Combine the most valuable insights
2. Resolve any contradictions
3. Fill knowledge gaps where possible
4. Maintain a neutral, objective tone"""

            messages = [
                {"role": "system", "content": "You are integrating knowledge from multiple sources."},
                {"role": "user", "content": prompt}
            ]

            async with OpenRouterClient("YOUR_OPENROUTER_API_KEY") as client:  # placeholder; supply your own key
                result = await client.generate_response(
                    model.model_name,
                    messages,
                    temperature=0.5
                )

                if result and "error" not in result:
                    return {
                        'contribution': result["content"]
                    }
                return {
                    'contribution': f"Failed to integrate knowledge ({result.get('error', 'unknown')})"
                }
        except Exception as e:
            logger.error(f"Failed to integrate knowledge: {str(e)}")
            return {'contribution': ''}

    async def _generate_clarifying_questions(self, model: ModelResponse, query: str,
                                          responses: List[ModelResponse],
                                          question: Optional[DiscussionQuestion]) -> Dict:
        try:
            context = "\n".join([r.answer[:300] for r in responses[:3]])
            prompt = f"""Based on the following discussion about: {query}

Current contributions:
{context}

Generate 2-3 clarifying questions that would help improve the discussion by:
1. Identifying missing information
2. Resolving contradictions
3. Exploring alternative perspectives
4. Deepening the analysis"""

            messages = [
                {"role": "system", "content": "You generate clarifying questions to improve discussions."},
                {"role": "user", "content": prompt}
            ]

            async with OpenRouterClient("YOUR_OPENROUTER_API_KEY") as client:  # placeholder; supply your own key
                result = await client.generate_response(
                    model.model_name,
                    messages,
                    temperature=0.7
                )

                if result and "error" not in result:
                    questions = [q.strip() for q in result["content"].split('\n') if q.strip()]
                    return {
                        'questions': questions[:3]  # Limit to 3 questions
                    }
                return {
                    'questions': [f"Failed to generate questions ({result.get('error', 'unknown')})"]
                }
        except Exception as e:
            logger.error(f"Failed to generate questions: {str(e)}")
            return {'questions': []}

    def _assign_roles_with_explanations(self, responses: List[ModelResponse],
                                      question: Optional[DiscussionQuestion]) -> List[ModelResponse]:
        if not responses:
            return []

        # Sort by benchmark score, highest first; longer answers break ties.
        # Responses without metrics are treated as scoring 0.
        sorted_responses = sorted(
            responses,
            key=lambda x: (
                x.discussion_metrics.overall_score if x.discussion_metrics else 0,
                len(x.answer)
            ),
            reverse=True
        )[:5]  # Consider top 5 responses for roles

        print("\nRole Assignments:")
        assigned_roles = set()

        for i, response in enumerate(sorted_responses):
            score = response.discussion_metrics.overall_score if response.discussion_metrics else "N/A"
            print(f"Model: {response.model_name}")
            print(f"- Benchmark Score: {score}")
            print(f"- Answer Length: {len(response.answer)} chars")

            # Assign roles based on response characteristics
            if DiscussionRole.KNOWLEDGE_INTEGRATOR not in assigned_roles:
                role = DiscussionRole.KNOWLEDGE_INTEGRATOR
                assigned_roles.add(role)
                print("- Role: Knowledge Integrator (best at combining information)")
            elif question and question.perspectives and DiscussionRole.PERSPECTIVE_ANALYST not in assigned_roles:
                role = DiscussionRole.PERSPECTIVE_ANALYST
                assigned_roles.add(role)
                print("- Role: Perspective Analyst (will explore different viewpoints)")
            elif DiscussionRole.CLARIFIER not in assigned_roles:
                role = DiscussionRole.CLARIFIER
                assigned_roles.add(role)
                print("- Role: Clarifier (will generate probing questions)")
            elif DiscussionRole.SYNTHESIZER not in assigned_roles:
                role = DiscussionRole.SYNTHESIZER
                assigned_roles.add(role)
                print("- Role: Synthesizer (will create final summary)")
            else:
                role = DiscussionRole.PERSPECTIVE_ANALYST
                print("- Role: Perspective Analyst (default role)")

            response.discussion_role = role
            print("-" * 40)

        return sorted_responses[:len(assigned_roles)]  # Return only responses with assigned roles

class EnhancedMultiJudgeSystem:
    def __init__(self, openrouter_api_key: str):
        self.client = OpenRouterClient(openrouter_api_key)
        self.models = [model.value for model in ModelType]
        self.discussion_system = DiscussionSystem()
        self.evaluation_system = AdvancedEvaluationSystem()
        self.benchmark_dataset = DiscussionDataset.get_comprehensive_dataset()
        self.INITIAL_RESPONSE_TIMEOUT = 120
        self.DISCUSSION_TIMEOUT = 180
        self.MODEL_TIMEOUT = 45

    async def evaluate_single_question(self, question_text: str) -> Dict[str, Any]:
        start_time = time.time()
        result = {
            "final_answer": "",
            "processing_time": 0,
            "errors": [],
            "models_responded": 0,
            "models_failed": 0,
            "benchmark_scores": {},
            "discussion_metrics": {},
            "improvement_scores": {},
            "role_assignments": {},
            "contribution_metrics": {},
            "questions_raised": []
        }

        try:
            try:
                responses = await asyncio.wait_for(
                    self.generate_all_responses(question_text),
                    timeout=self.INITIAL_RESPONSE_TIMEOUT
                )
            except asyncio.TimeoutError:
                raise Exception(f"Timeout generating initial responses after {self.INITIAL_RESPONSE_TIMEOUT}s")

            valid_responses = [r for r in responses if not r.error]
            result["models_responded"] = len(valid_responses)
            result["models_failed"] = len(responses) - len(valid_responses)

            if not valid_responses:
                raise Exception("All models failed to respond")

            benchmark_question = next(
                (q for q in self.benchmark_dataset if q.question.lower() in question_text.lower()),
                None
            )

            result["benchmark_scores"] = {
                r.model_name: r.discussion_metrics.overall_score
                for r in valid_responses
                if r.discussion_metrics
            }

            try:
                discussion_round = await asyncio.wait_for(
                    self.discussion_system.conduct_discussion(question_text, valid_responses),
                    timeout=self.DISCUSSION_TIMEOUT
                )

                result.update({
                    "final_answer": discussion_round.synthesis or "No synthesis generated",
                    "discussion_metrics": getattr(discussion_round, 'discussion_metrics', {}),
                    "improvement_scores": getattr(discussion_round, 'improvement_scores', {}),
                    "role_assignments": getattr(discussion_round, 'role_assignments', {}),
                    "contribution_metrics": getattr(discussion_round, 'contribution_metrics', {}),
                    "questions_raised": getattr(discussion_round, 'questions_raised', []),
                    "discussion_rounds": len(self.discussion_system.memory.history)
                })

            except asyncio.TimeoutError:
                raise Exception(f"Timeout during discussion after {self.DISCUSSION_TIMEOUT}s")

            result.update({
                "processing_time": time.time() - start_time,
                "success": True
            })

        except Exception as e:
            logger.error(f"Processing failed: {str(e)}")
            result.update({
                "errors": [str(e)],
                "processing_time": time.time() - start_time,
                "success": False
            })

        return result

    async def generate_all_responses(self, user_query: str) -> List[ModelResponse]:
        responses = []

        async with self.client:
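            tasks = []
            for model in self.models:
                messages = self._create_discussion_prompt(user_query)
                # Schedule each request as a task so the models run concurrently;
                # a bare coroutine would not start until it is awaited below.
                task = asyncio.create_task(asyncio.wait_for(
                    self.client.generate_response(model, messages),
                    timeout=self.MODEL_TIMEOUT
                ))
                tasks.append((model, task))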

            for model, task in tasks:
                try:
                    result = await task
                    if result and "error" not in result:
                        benchmark_question = next(
                            (q for q in self.benchmark_dataset
                             if q.question.lower() in user_query.lower()),
                            None
                        )

                        metrics = None
                        if benchmark_question:
                            metrics = self.evaluation_system.calculate_comprehensive_metrics(
                                result["content"],
                                benchmark_question.reference_answer,
                                benchmark_question,
                                result["response_time"]
                            )

                        responses.append(ModelResponse(
                            model_name=model,
                            answer=result["content"],
                            generation_time=result["response_time"],
                            token_count=result.get("tokens", 0),
                            discussion_metrics=metrics
                        ))
                    else:
                        error_msg = result.get("error", "Unknown error") if result else "No response"
                        responses.append(ModelResponse(
                            model_name=model,
                            answer="",
                            generation_time=0,
                            error=error_msg
                        ))
                except asyncio.TimeoutError:
                    responses.append(ModelResponse(
                        model_name=model,
                        answer="",
                        generation_time=0,
                        error=f"Timeout after {self.MODEL_TIMEOUT}s"
                    ))
                except Exception as e:
                    responses.append(ModelResponse(
                        model_name=model,
                        answer="",
                        generation_time=0,
                        error=str(e)
                    ))

        return responses

    def _create_discussion_prompt(self, query: str) -> List[Dict]:
        return [
            {
                "role": "system",
                "content": "You are participating in a collaborative discussion. Provide thoughtful, well-reasoned contributions."
            },
            {
                "role": "user",
                "content": f"{query}\n\nPresent your perspective with supporting evidence in a discussion-friendly format."
            }
        ]

    async def process_query(self, user_query: str) -> Dict[str, Any]:
        start_time = time.time()
        result = {
            "final_answer": "",
            "processing_time": 0,
            "errors": [],
            "models_responded": 0,
            "models_failed": 0,
            "benchmark_scores": {},
            "discussion_metrics": {},
            "improvement_scores": {},
            "role_assignments": {},
            "contribution_metrics": {},
            "questions_raised": []
        }

        try:
            try:
                responses = await asyncio.wait_for(
                    self.generate_all_responses(user_query),
                    timeout=self.INITIAL_RESPONSE_TIMEOUT
                )
            except asyncio.TimeoutError:
                raise Exception(f"Timeout while generating initial responses after {self.INITIAL_RESPONSE_TIMEOUT}s")

            valid_responses = [r for r in responses if not r.error]
            result["models_responded"] = len(valid_responses)
            result["models_failed"] = len(responses) - len(valid_responses)

            if not valid_responses:
                raise Exception("All models failed to respond")

            benchmark_question = next(
                (q for q in self.benchmark_dataset if q.question.lower() in user_query.lower()),
                None
            )

            result["benchmark_scores"] = {
                r.model_name: r.discussion_metrics.overall_score
                for r in valid_responses
                if r.discussion_metrics
            }

            try:
                discussion_round = await asyncio.wait_for(
                    self.discussion_system.conduct_discussion(user_query, valid_responses),
                    timeout=self.DISCUSSION_TIMEOUT
                )

                result.update({
                    "final_answer": discussion_round.synthesis or "No synthesis generated",
                    "discussion_metrics": getattr(discussion_round, 'discussion_metrics', {}),
                    "improvement_scores": getattr(discussion_round, 'improvement_scores', {}),
                    "role_assignments": getattr(discussion_round, 'role_assignments', {}),
                    "contribution_metrics": getattr(discussion_round, 'contribution_metrics', {}),
                    "questions_raised": getattr(discussion_round, 'questions_raised', []),
                    "discussion_rounds": len(self.discussion_system.memory.history)
                })

            except asyncio.TimeoutError:
                raise Exception(f"Timeout during discussion after {self.DISCUSSION_TIMEOUT}s")

            result.update({
                "processing_time": time.time() - start_time,
                "success": True
            })

        except Exception as e:
            logger.error(f"Processing failed: {str(e)}")
            result.update({
                "errors": [str(e)],
                "processing_time": time.time() - start_time,
                "success": False
            })

        return result

async def main():
    API_KEY = "YOUR_OPENROUTER_API_KEY"  # Replace with your actual API key

    try:
        system = EnhancedMultiJudgeSystem(API_KEY)

        # Evaluate the ethics question in detail
        print("Evaluating ethics question in detail...")
        question = next(q for q in system.benchmark_dataset if q.id == "ethics_001")
        results = await system.evaluate_single_question(question.question)

        print("\nDetailed Results:")
        print(f"Question: {question.question}")
        print(f"Category: {question.category}")
        print(f"Difficulty: {question.difficulty}")

        print("\nModels Responded:", results["models_responded"])
        print("Models Failed:", results["models_failed"])

        print("\nBenchmark Scores:")
        for model, score in results["benchmark_scores"].items():
            print(f"- {model}: {score:.3f}")

        print("\nRole Assignments:")
        for model, info in results["role_assignments"].items():
            print(f"- {model}: {info['role']} ({info['reason']})")

        print("\nQuestions Raised:")
        for q in results["questions_raised"]:
            print(f"- {q}")

        print("\nFinal Answer:")
        print(results["final_answer"][:500] + ("..." if len(results["final_answer"]) > 500 else ""))

        print(results)

        if results["errors"]:
            print("\nErrors:")
            for error in results["errors"]:
                print(f"- {error}")

    except Exception as e:
        print(f"System failed: {str(e)}")

await main()

Evaluating ethics question in detail...

Role Assignments:
Model: tencent/hunyuan-a13b-instruct:free
- Benchmark Score: 0.5078727833839056
- Answer Length: 2417 chars
- Role: Knowledge Integrator (best at combining information)
----------------------------------------
Model: z-ai/glm-4.5-air:free
- Benchmark Score: 0.5745982192734528
- Answer Length: 2742 chars
- Role: Perspective Analyst (will explore different viewpoints)
----------------------------------------
Model: moonshotai/kimi-dev-72b:free
- Benchmark Score: 0.5751241882472853
- Answer Length: 6052 chars
- Role: Clarifier (will generate probing questions)
----------------------------------------
Model: mistralai/mistral-small-3.2-24b-instruct:free
- Benchmark Score: 0.6155821010327082
- Answer Length: 3874 chars
- Role: Synthesizer (will create final summary)
----------------------------------------
Model: cognitivecomputations/dolphin-mistral-24b-venice-edition:free
- Benchmark Score: 0.6177122992325546
- Answer Length: 3681
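
The log above lists the models in ascending score order, so the Knowledge Integrator role, described as "best at combining information", went to the lowest-scoring response. If the intent is to hand out roles from the strongest response downward, a descending sort is one option. The sketch below is a minimal illustration under that assumption, not the notebook's implementation: `Scored` and `rank_for_roles` are hypothetical names, and the scores are rounded from the run shown above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scored:
    """Hypothetical stand-in for ModelResponse with only the fields used for ranking."""
    model_name: str
    answer_length: int
    overall_score: Optional[float] = None  # None when no benchmark metrics exist


def rank_for_roles(responses: List[Scored], limit: int = 5) -> List[Scored]:
    """Return up to `limit` responses, highest score first; longer answers break ties."""
    return sorted(
        responses,
        key=lambda r: (
            r.overall_score if r.overall_score is not None else 0.0,
            r.answer_length,
        ),
        reverse=True,
    )[:limit]


# Scores rounded to four decimals from the recorded run; lengths copied as logged.
candidates = [
    Scored("tencent/hunyuan-a13b-instruct:free", 2417, 0.5079),
    Scored("z-ai/glm-4.5-air:free", 2742, 0.5746),
    Scored("moonshotai/kimi-dev-72b:free", 6052, 0.5751),
    Scored("mistralai/mistral-small-3.2-24b-instruct:free", 3874, 0.6156),
    Scored("cognitivecomputations/dolphin-mistral-24b-venice-edition:free", 3681, 0.6177),
]

ranked = rank_for_roles(candidates)
for position, response in enumerate(ranked, start=1):
    print(position, response.model_name, response.overall_score)
```

With this ordering the first role in the assignment chain would go to the highest-scoring response, and responses without metrics would sort last.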

### Paid models version (OpenRouter)

Disclaimer: These models are expensive to run and consume a large number of tokens, but their output quality is very good.
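
The cell below measures how much the models' answers differ by comparing them pairwise with `difflib.SequenceMatcher`. The standalone sketch that follows shows the same idea on plain strings. `pairwise_similarity` is a hypothetical helper, and the sample answers are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Dict, List


def pairwise_similarity(answers: List[str]) -> Dict[str, float]:
    """Average and maximum case-insensitive similarity over all answer pairs."""
    if len(answers) < 2:
        # With fewer than two answers there is nothing to compare.
        return {"avg_similarity": 0.0, "max_similarity": 0.0}
    ratios = [
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a, b in combinations(answers, 2)
    ]
    return {"avg_similarity": mean(ratios), "max_similarity": max(ratios)}


# Invented sample answers: the first two differ only in letter case.
answers = [
    "Use a cache to reduce latency.",
    "USE A CACHE TO REDUCE LATENCY.",
    "Shard the database across regions.",
]
stats = pairwise_similarity(answers)
print(f"avg={stats['avg_similarity']:.2f} max={stats['max_similarity']:.2f}")
```

`SequenceMatcher` compares character sequences, so it captures surface overlap rather than meaning: two answers that say the same thing in different words can still score as dissimilar.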

In [None]:
import asyncio
import json
import logging
from typing import List, Dict, Any, Optional, Tuple
import aiohttp
from dataclasses import dataclass
from enum import Enum
import statistics
import time
from difflib import SequenceMatcher

# ============== ORIGINAL CODE (COMPLETELY UNCHANGED) ==============
# Configure logging with better formatting
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)

class ModelType(Enum):
    # The 5 judge models (free OpenRouter models).
    # Note: the member names are legacy labels and do not all match the model IDs they map to.
    GEMMA_3N_E2B = "tencent/hunyuan-a13b-instruct:free"
    QWEN3_4B = "mistralai/mistral-small-3.2-24b-instruct:free"
    GEMMA_3_12B = "google/gemma-3-12b-it:free"
    KIMI_DEV = "moonshotai/kimi-dev-72b:free"
    DOLPHIN_MISTRAL = "cognitivecomputations/dolphin-mistral-24b-venice-edition:free"

@dataclass
class ModelResponse:
    model_name: str
    answer: str
    generation_time: float
    token_count: int = 0

@dataclass
class JudgeEvaluation:
    judge_model: str
    target_model: str
    rating: float
    critique: str
    confidence: float
    similarity_score: float = 0.0

@dataclass
class ConsensusResult:
    final_answer: str
    confidence_score: float
    agreement_level: str
    participating_models: List[str]
    processing_time: float

class OutputFormatter:
    """Handles all output formatting and display"""

    @staticmethod
    def print_header(title: str):
        """Print a formatted header"""
        print("\n" + "="*80)
        print(f" {title.upper()} ")
        print("="*80)

    @staticmethod
    def print_section(title: str):
        """Print a section header"""
        print(f"\n📋 {title}")
        print("-"*50)

    @staticmethod
    def print_progress(message: str):
        """Print progress with emoji"""
        print(f"🔄 {message}")

    @staticmethod
    def print_success(message: str):
        """Print success message"""
        print(f"✅ {message}")

    @staticmethod
    def print_warning(message: str):
        """Print warning message"""
        print(f"⚠️  {message}")

    @staticmethod
    def print_error(message: str):
        """Print error message"""
        print(f"❌ {message}")

    @staticmethod
    def print_metrics(result: Dict[str, Any]):
        """Print formatted metrics"""
        print(f"\n📊 PROCESSING METRICS")
        print("-"*30)
        print(f"⏱️  Processing Time: {result.get('processing_time', 0):.2f}s")
        print(f"🤖 Models Used: {result.get('total_responses', 0)}/5")
        print(f"⚖️  Evaluations: {result.get('evaluations_conducted', 0)}")
        print(f"🎯 Diversity Score: {result.get('diversity_score', 0):.2f}/1.0")
        print(f"🤝 Agreement: {result.get('agreement_level', 'UNKNOWN')}")
        print(f"📈 Confidence: {result.get('average_confidence', 0):.2f}/1.0")

class InteractiveDiscussion:
    """Handles interactive discussion and user engagement"""

    def __init__(self):
        self.discussion_points = []
        self.user_responses = {}

    def add_discussion_point(self, question: str, context: str = ""):
        """Add a discussion point for user interaction"""
        self.discussion_points.append({
            "question": question,
            "context": context,
            "timestamp": time.time()
        })

    async def engage_user(self, responses: List[ModelResponse]) -> Dict[str, str]:
        """Engage user in discussion based on model responses"""

        # Analyze responses to generate discussion points
        diversity = self._analyze_diversity(responses)

        if diversity['needs_clarification']:
            question = (
                f"🤔 The AI models provided different perspectives on your question. "
                f"Would you like me to focus on any specific aspect? "
                f"(e.g., technical details, cost analysis, security considerations)"
            )

            print(f"\n💬 DISCUSSION")
            print("-"*30)
            print(question)

            user_input = input("\n👤 Your response (or press Enter to continue): ").strip()

            if user_input:
                self.user_responses['focus_area'] = user_input
                print(f"✅ Got it! Focusing on: {user_input}")
            else:
                print("✅ Continuing with comprehensive analysis...")

        return self.user_responses

    def _analyze_diversity(self, responses: List[ModelResponse]) -> Dict[str, Any]:
        """Analyze response diversity to determine if discussion is needed"""
        if len(responses) < 2:
            return {"needs_clarification": False}

        lengths = [len(resp.answer) for resp in responses]
        length_variance = statistics.variance(lengths) if len(lengths) > 1 else 0

        return {
            "needs_clarification": length_variance > 10000,
            "length_variance": length_variance
        }

class OpenRouterClient:
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1"):
        self.api_key = api_key
        self.base_url = base_url
        self.session = None

    async def __aenter__(self):
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            await self.session.close()

    async def generate_response(self, model: str, messages: List[Dict],
                               temperature: float = 0.7, max_tokens: int = 1500) -> Dict:
        """Generate response from OpenRouter model"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
            "HTTP-Referer": "https://your-app.com",
            "X-Title": "Efficient Multi-Judge System"
        }

        payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens
        }

        start_time = time.time()
        try:
            async with self.session.post(f"{self.base_url}/chat/completions",
                                       headers=headers, json=payload) as response:
                response.raise_for_status()
                result = await response.json()
                response_time = time.time() - start_time
                return {
                    "content": result["choices"][0]["message"]["content"],
                    "response_time": response_time,
                    "usage": result.get("usage", {}),
                    "tokens": result.get("usage", {}).get("total_tokens", 0)
                }
        except Exception as e:
            logger.error(f"Error with model {model}: {str(e)}")
            return {"content": "", "response_time": time.time() - start_time, "error": str(e)}
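
# Illustrative sketch (not part of the original system): fanning one prompt out to
# several models concurrently with asyncio.gather. `fake_generate` is a hypothetical
# stand-in for OpenRouterClient.generate_response so the sketch runs without network
# access; the model names and the delay are made up.
import asyncio

async def fake_generate(model: str, delay: float = 0.05) -> dict:
    await asyncio.sleep(delay)  # simulates network latency
    return {"content": f"reply from {model}", "response_time": delay}

async def fan_out(models):
    # gather runs all coroutines concurrently and returns results in input order
    results = await asyncio.gather(*(fake_generate(m) for m in models))
    return dict(zip(models, results))

# In a notebook cell: replies = await fan_out(["model-a", "model-b"])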

class EnhancedMultiJudgeSystem:
    def __init__(self, openrouter_api_key: str):
        self.client = OpenRouterClient(openrouter_api_key)
        self.models = [model.value for model in ModelType]
        self.formatter = OutputFormatter()
        self.discussion = InteractiveDiscussion()

        self.SIMILARITY_THRESHOLD = 0.7
        self.MIN_CONFIDENCE = 0.6
        self.CONSENSUS_THRESHOLD = 0.8

    def calculate_similarity(self, text1: str, text2: str) -> float:
        return SequenceMatcher(None, text1.lower(), text2.lower()).ratio()

    def analyze_response_diversity(self, responses: List[ModelResponse]) -> Dict[str, Any]:
        if len(responses) < 2:
            return {"avg_similarity": 0, "max_similarity": 0, "diverse": True}

        similarities = []
        for i in range(len(responses)):
            for j in range(i + 1, len(responses)):
                sim = self.calculate_similarity(responses[i].answer, responses[j].answer)
                similarities.append(sim)

        avg_similarity = statistics.mean(similarities)
        max_similarity = max(similarities)

        return {
            "avg_similarity": avg_similarity,
            "max_similarity": max_similarity,
            "diverse": avg_similarity < self.SIMILARITY_THRESHOLD,
            "similarities": similarities
        }

    def create_answer_prompt(self, user_query: str, focus_area: str = "") -> List[Dict]:
        system_content = """You are an expert assistant. Provide a comprehensive, accurate, and well-structured answer.
        Focus on clarity, accuracy, and practical value. Be concise but thorough."""

        if focus_area:
            system_content += f" Pay special attention to: {focus_area}"

        return [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_query}
        ]

    def create_judge_prompt(self, user_query: str, answer: str, model_name: str) -> List[Dict]:
        return [
            {
                "role": "system",
                "content": """You are an expert judge evaluating AI responses. Rate the answer on accuracy, completeness, clarity, and usefulness.

                Respond in this EXACT JSON format with brief, focused critique:
                {
                    "rating": <float 1.0-10.0>,
                    "critique": "<1-2 sentences highlighting key strengths/weaknesses>",
                    "confidence": <float 0.0-1.0 indicating confidence in your evaluation>
                }

                Keep critique concise and actionable."""
            },
            {
                "role": "user",
                "content": f"""Question: {user_query}

Answer from {model_name}:
{answer}

Please evaluate this answer using the JSON format specified."""
            }
        ]

    def create_consensus_prompt(self, user_query: str, responses: List[ModelResponse],
                              evaluations: List[JudgeEvaluation], focus_area: str = "") -> List[Dict]:
        # Group ratings by the model that was judged (avoids shadowing the builtin `eval`)
        model_ratings = {}
        for evaluation in evaluations:
            model_ratings.setdefault(evaluation.target_model, []).append(evaluation.rating)

        avg_ratings = {model: statistics.mean(ratings) for model, ratings in model_ratings.items()}

        top_responses = []
        for response in responses:
            rating = avg_ratings.get(response.model_name, 0)
            answer_preview = (response.answer[:500] + "...") if len(response.answer) > 500 else response.answer
            top_responses.append(f"""
{response.model_name} (Rating: {rating:.1f}/10):
{answer_preview}
---""")

        system_content = """You are tasked with creating the best possible answer by synthesizing multiple responses.
        Focus on accuracy, completeness, and clarity. Combine the strongest elements while eliminating weaknesses.
        Keep your final answer well-structured and comprehensive but not overly lengthy."""

        if focus_area:
            system_content += f" Give special emphasis to: {focus_area}"

        return [
            {"role": "system", "content": system_content},
            {
                "role": "user",
                "content": f"""Question: {user_query}

Available responses with ratings:
{''.join(top_responses)}

Create a synthesized answer that combines the best elements from these responses."""
            }
        ]

    async def generate_all_responses(self, user_query: str, focus_area: str = "") -> List[ModelResponse]:
        self.formatter.print_progress(f"Generating responses from {len(self.models)} AI models...")

        async with self.client:
            # Schedule every request up front so the models run concurrently
            tasks = []
            for model in self.models:
                messages = self.create_answer_prompt(user_query, focus_area)
                task = asyncio.create_task(
                    self.client.generate_response(model, messages, temperature=0.7)
                )
                tasks.append((model, task))

            responses = []
            for i, (model, task) in enumerate(tasks, 1):
                try:
                    result = await task
                    if "error" not in result and result["content"]:
                        responses.append(ModelResponse(
                            model_name=model,
                            answer=result["content"],
                            generation_time=result["response_time"],
                            token_count=result.get("tokens", 0)
                        ))
                        model_short = model.split('/')[-1].split(':')[0]
                        print(f"  ✅ {model_short} ({result['response_time']:.1f}s)")
                    else:
                        print(f"  ❌ {model.split('/')[-1].split(':')[0]} failed")
                except Exception as e:
                    logger.error(f"Unexpected failure for {model}: {e}")
                    print(f"  ❌ {model.split('/')[-1].split(':')[0]} error")

            self.formatter.print_success(f"Generated {len(responses)} responses successfully")
            return responses

    async def conduct_targeted_judging(self, user_query: str, responses: List[ModelResponse]) -> List[JudgeEvaluation]:
        diversity_analysis = self.analyze_response_diversity(responses)

        if not diversity_analysis['diverse']:
            self.formatter.print_progress("Responses show high consensus - skipping detailed evaluation")
            evaluations = []
            for response in responses:
                evaluations.append(JudgeEvaluation(
                    judge_model="consensus",
                    target_model=response.model_name,
                    rating=8.0,
                    critique="High consensus among models indicates strong agreement",
                    confidence=0.9,
                    similarity_score=diversity_analysis['avg_similarity']
                ))
            return evaluations

        self.formatter.print_progress("Responses show diversity - conducting peer evaluation...")

        async with self.client:
            evaluations = []

            for i, judge_response in enumerate(responses):
                max_diff = 0
                target_response = None
                for j, target in enumerate(responses):
                    if i != j:
                        similarity = self.calculate_similarity(judge_response.answer, target.answer)
                        if (1 - similarity) > max_diff:
                            max_diff = 1 - similarity
                            target_response = target

                if target_response:
                    messages = self.create_judge_prompt(user_query, target_response.answer, target_response.model_name)
                    try:
                        result = await self.client.generate_response(judge_response.model_name, messages, temperature=0.3)
                        if "error" not in result and result["content"]:
                            try:
                                eval_data = json.loads(result["content"])
                                evaluations.append(JudgeEvaluation(
                                    judge_model=judge_response.model_name,
                                    target_model=target_response.model_name,
                                    rating=float(eval_data["rating"]),
                                    critique=eval_data["critique"],
                                    confidence=float(eval_data["confidence"]),
                                    similarity_score=1 - max_diff
                                ))

                                judge_short = judge_response.model_name.split('/')[-1].split(':')[0]
                                target_short = target_response.model_name.split('/')[-1].split(':')[0]
                                print(f"  ⚖️ {judge_short} → {target_short}: {eval_data['rating']:.1f}/10")

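                            except (json.JSONDecodeError, KeyError, TypeError, ValueError) as parse_error:
                                # Skip judges whose reply is not the requested JSON, but leave a trace
                                logger.warning(f"Unparseable evaluation from {judge_response.model_name}: {parse_error}")
                    except Exception as e:
                        logger.error(f"Judging failed for {judge_response.model_name}: {e}")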

            return evaluations

    async def generate_final_consensus(self, user_query: str, responses: List[ModelResponse],
                                     evaluations: List[JudgeEvaluation], focus_area: str = "") -> str:
        self.formatter.print_progress("Generating final consensus answer...")

        best_model = ModelType.KIMI_DEV.value

        async with self.client:
            messages = self.create_consensus_prompt(user_query, responses, evaluations, focus_area)
            result = await self.client.generate_response(best_model, messages, temperature=0.5, max_tokens=2000)

            if "error" not in result and result["content"]:
                return result["content"]
            else:
                if evaluations:
                    # Fall back to the best-rated individual response
                    model_ratings = {}
                    for evaluation in evaluations:
                        model_ratings.setdefault(evaluation.target_model, []).append(evaluation.rating)

                    avg_ratings = {model: statistics.mean(ratings) for model, ratings in model_ratings.items()}

                    if avg_ratings:
                        best_model_name = max(avg_ratings.keys(), key=lambda m: avg_ratings[m])
                        best_response = next(r for r in responses if r.model_name == best_model_name)
                        return f"**Note**: Using best-rated individual response\n\n{best_response.answer}"

                return f"**Note**: Using first available response\n\n{responses[0].answer}"

    async def process_query(self, user_query: str) -> Dict[str, Any]:
        start_time = time.time()

        self.formatter.print_header("Multi-Judge AI System")
        print(f"📝 Query: {user_query}")

        try:
            responses = await self.generate_all_responses(user_query)

            if len(responses) < 1:
                self.formatter.print_error("No responses generated")
                return {"error": "No responses generated", "final_answer": "No answer available"}

            user_preferences = await self.discussion.engage_user(responses)
            focus_area = user_preferences.get('focus_area', '')

            evaluations = await self.conduct_targeted_judging(user_query, responses)

            final_answer = await self.generate_final_consensus(user_query, responses, evaluations, focus_area)

            total_time = time.time() - start_time
            diversity_analysis = self.analyze_response_diversity(responses)
            avg_confidence = statistics.mean([e.confidence for e in evaluations]) if evaluations else 0.8

            if diversity_analysis['avg_similarity'] > self.CONSENSUS_THRESHOLD:
                agreement_level = "HIGH_CONSENSUS"
            elif diversity_analysis['avg_similarity'] > self.SIMILARITY_THRESHOLD:
                agreement_level = "MODERATE_CONSENSUS"
            else:
                agreement_level = "DIVERSE_OPINIONS"

            result = {
                "query": user_query,
                "total_responses": len(responses),
                "evaluations_conducted": len(evaluations),
                "diversity_score": 1 - diversity_analysis['avg_similarity'],
                "agreement_level": agreement_level,
                "average_confidence": avg_confidence,
                "final_answer": final_answer,
                "processing_time": total_time,
                "user_focus": focus_area
            }

            self.formatter.print_metrics(result)
            self.formatter.print_section("FINAL ANSWER")
            print(final_answer)

            self.formatter.print_success(f"Processing completed in {total_time:.2f} seconds")

            return result

        except Exception as e:
            self.formatter.print_error(f"Processing failed: {str(e)}")
            return {"error": str(e), "final_answer": "Processing failed"}
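
# Illustrative sketch (not part of the original system): judge models often wrap the
# requested JSON in Markdown fences or add a sentence around it, which makes a bare
# json.loads fail. `parse_judge_reply` is a hypothetical helper that extracts the
# outermost {...} span and checks for the three keys named in the judge prompt.
import json
import re

def parse_judge_reply(raw: str):
    """Return the evaluation dict, or None if no usable JSON object is found."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    required = {"rating", "critique", "confidence"}
    if not isinstance(data, dict) or not required <= data.keys():
        return None
    return data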

# ============== MODERN ENHANCEMENTS ==============
class SemanticRouter:
    """Modern query routing using semantic analysis"""
    def __init__(self):
        try:
            from sentence_transformers import SentenceTransformer
            import numpy as np
            self.model = SentenceTransformer('all-MiniLM-L6-v2')
            self.np = np
            self.query_profiles = {
                "technical": self.model.encode("technical specifications requirements"),
                "comparative": self.model.encode("compare contrast differences between"),
                "creative": self.model.encode("creative ideas suggestions brainstorm")
            }
            self.initialized = True
        except ImportError:
            self.initialized = False
            print("⚠️  SentenceTransformers not available - falling back to simple routing")

    async def get_query_type(self, query: str) -> str:
        """Classify query type using semantic similarity"""
        if not self.initialized:
            return "standard"  # Fallback

        query_embed = self.model.encode(query)
        similarities = {
            k: self._cosine_similarity(query_embed, v)
            for k,v in self.query_profiles.items()
        }
        best_match = max(similarities.items(), key=lambda x: x[1])
        return best_match[0] if best_match[1] > 0.5 else "standard"

    def _cosine_similarity(self, a, b):
        return self.np.dot(a, b)/(self.np.linalg.norm(a)*self.np.linalg.norm(b))
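
# Illustrative sketch (not part of the original system): the routing rule used by
# SemanticRouter, shown on tiny hand-made vectors instead of sentence embeddings.
# `cosine_similarity` and `pick_query_type` are hypothetical standalone helpers;
# the 0.5 threshold and the "standard" fallback mirror get_query_type above.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pick_query_type(query_vec, profiles, threshold=0.5):
    # Score the query against every profile and keep the best match only if it
    # clears the threshold; otherwise fall back to "standard"
    scores = {name: cosine_similarity(query_vec, vec) for name, vec in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "standard"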

class DynamicParameterController:
    """Modern adaptive parameter tuning"""
    @staticmethod
    def get_model_parameters(model: str, query_type: str) -> dict:
        """Returns optimal parameters per model and query type"""
        base_params = {
            "temperature": 0.7,
            "max_tokens": 1500,
            "timeout": 30
        }

        # Model-specific adjustments
        if "gemma" in model.lower():
            base_params.update({"temperature": 0.5, "max_tokens": 2000})
        elif "kimi" in model.lower():
            base_params.update({"temperature": 0.6, "timeout": 45})

        # Query-type adjustments
        if query_type == "technical":
            base_params["temperature"] = max(0.3, base_params["temperature"] - 0.2)
        elif query_type == "creative":
            base_params["temperature"] = min(1.0, base_params["temperature"] + 0.2)

        return base_params

class ResponseAnalyzer:
    """Modern response analysis without API calls"""
    @staticmethod
    def estimate_quality(response: ModelResponse) -> float:
        """Heuristic quality estimation (0-1 scale)"""
        factors = {
            "length": min(1, len(response.answer.split())/300),
            "structure": 0.2 if '\n\n' in response.answer else 0.1,
            "certainty": 0.1 if 'may' not in response.answer.lower() else 0,
            "examples": 0.2 if 'example' in response.answer.lower() else 0
        }
        return min(1.0, sum(factors.values()))

class HybridEvaluator:
    """Modern hybrid evaluation combining API and local analysis"""
    def __init__(self, original_system):
        self.original = original_system

    async def evaluate_responses(self, query: str, responses: List[ModelResponse]) -> List[JudgeEvaluation]:
        """Combine API evaluations with local quality estimates"""
        # First try original API evaluation
        api_evals = await self.original.conduct_targeted_judging(query, responses)

        # Enhance with local quality estimates
        for evaluation in api_evals:
            target = next(r for r in responses if r.model_name == evaluation.target_model)
            quality = ResponseAnalyzer.estimate_quality(target)
            evaluation.confidence = (evaluation.confidence + quality) / 2  # Blend scores

        return api_evals

class ModernMultiJudgeSystem:
    """Wrapper that adds modern features without changing original code"""
    def __init__(self, openrouter_api_key: str):
        self.original = EnhancedMultiJudgeSystem(openrouter_api_key)
        self.router = SemanticRouter()
        self.evaluator = HybridEvaluator(self.original)

    async def process_query(self, user_query: str) -> Dict[str, Any]:
        """Enhanced processing with modern techniques"""
        # Modern: Semantic query analysis
        query_type = await self.router.get_query_type(user_query)
        print(f"🔍 Modern Analysis: Detected {query_type} query type")

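        # Modern: Dynamic parameter suggestions (reported only; the original flow below
        # still calls the models with its own fixed temperature and token limits)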
        for model in self.original.models:
            params = DynamicParameterController.get_model_parameters(model, query_type)
            print(f"⚙️ Modern Tuning: {model.split('/')[-1]} params: {params}")

        # Continue with original flow (unchanged)
        result = await self.original.process_query(user_query)

        # Modern: Post-processing analysis
        if 'final_answer' in result:
            quality = ResponseAnalyzer.estimate_quality(
                ModelResponse("final", result['final_answer'], 0)
            )
            result['modern_quality_score'] = quality
            print(f"🏆 Modern Quality Score: {quality:.2f}/1.0")

        return result

# ============== MAIN EXECUTION ==============
async def main():
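    # Load the API key from the environment instead of hardcoding it in the notebook.
    # The variable name OPENROUTER_API_KEY is an assumed convention; adjust as needed.
    import os
    API_KEY = os.environ.get("OPENROUTER_API_KEY", "")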

    if not API_KEY or API_KEY == "your_openrouter_api_key_here":
        print("❌ Please set your OpenRouter API key")
        return

    # Use modern wrapper instead of original system
    system = ModernMultiJudgeSystem(API_KEY)

    # Example queries
    queries = [
        "What are the most important factors to consider when choosing between different cloud computing platforms?",
        "How can small businesses effectively implement cybersecurity measures on a limited budget?",
        "What are the key differences between machine learning and artificial intelligence?"
    ]

    # Process a query
    test_query = queries[0]
    result = await system.process_query(test_query)

    # Optional: Show summary
    if not result.get('error'):
        print(f"\n🎯 Summary: Processed query with {result['total_responses']} models, "
              f"{result['evaluations_conducted']} evaluations, achieving {result['agreement_level'].lower().replace('_', ' ')} "
              f"in {result['processing_time']:.1f}s")

await main()

🔍 Modern Analysis: Detected standard query type
⚙️ Modern Tuning: gpt-4.1 params: {'temperature': 0.7, 'max_tokens': 1500, 'timeout': 30}
⚙️ Modern Tuning: gemini-2.0-flash-001 params: {'temperature': 0.7, 'max_tokens': 1500, 'timeout': 30}
⚙️ Modern Tuning: llama-3.1-8b-instruct params: {'temperature': 0.7, 'max_tokens': 1500, 'timeout': 30}
⚙️ Modern Tuning: claude-sonnet-4 params: {'temperature': 0.7, 'max_tokens': 1500, 'timeout': 30}
⚙️ Modern Tuning: grok-4 params: {'temperature': 0.7, 'max_tokens': 1500, 'timeout': 30}

 MULTI-JUDGE AI SYSTEM 
📝 Query: Given an input string s and a pattern p, implement regular expression matching with support for '.' and '*' where:

'.' Matches any single character.
'*' Matches zero or more of the preceding element.
The matching should cover the entire input string (not partial).

 

Example 1:

Input: s = "aa", p = "a"
Output: false
Explanation: "a" does not match the entire string "aa".
Example 2:

Input: s = "aa", p = "a*"
Output: true
Ex

ERROR:__main__:Error with model anthropic/claude-sonnet-4: 402, message='Payment Required', url='https://openrouter.ai/api/v1/chat/completions'


🔄 Generating final consensus answer...


ERROR:__main__:Error with model anthropic/claude-sonnet-4: 402, message='Payment Required', url='https://openrouter.ai/api/v1/chat/completions'



📊 PROCESSING METRICS
------------------------------
⏱️  Processing Time: 192.60s
🤖 Models Used: 4/5
⚖️  Evaluations: 2
🎯 Diversity Score: 0.91/1.0
🤝 Agreement: DIVERSE_OPINIONS
📈 Confidence: 0.96/1.0

📋 FINAL ANSWER
--------------------------------------------------
**Note**: Using best-rated individual response

```python
def isMatch(s: str, p: str) -> bool:
    """
    Implements regular expression matching with support for '.' and '*'.

    Args:
        s: The input string.
        p: The pattern string.

    Returns:
        True if the pattern matches the entire input string, False otherwise.
    """

    s_len = len(s)
    p_len = len(p)

    # dp[i][j] is True if s[0:i] matches p[0:j], False otherwise.
    dp = [[False] * (p_len + 1) for _ in range(s_len + 1)]

    # Empty string matches empty pattern
    dp[0][0] = True

    # Deal with patterns like a*, a*b*, a*b*c*
    for j in range(1, p_len + 1):
        if p[j - 1] == '*':
            dp[0][j] = dp[0][j - 2]

    # Fill 