Multi-Chat

Chat with multiple LLMs simultaneously using LiteLLM and async Python.

Setup

  1. Create a virtual environment and install dependencies (recommended):
# Run the bundled installer which creates .venv and installs the uv CLI
# The script will run `uv sync` inside the venv to install dependencies
./install.sh            # install runtime dependencies
./install.sh --dev      # install runtime + development dependencies (optional)

If you prefer to use uv directly without the installer, you can still run:

uv sync

  2. Copy the environment file and add your API keys:
cp .env.example .env
# Edit .env with your API keys

API Key Setup

Get your API keys from the providers listed under Supported Models below and add them to .env.

  3. Run the application:
uv run python main.py

Model Presets

The application includes three built-in presets for easy configuration:

  • fast: Quick, cost-effective models for rapid responses
    • groq/openai/gpt-oss-120b
    • cerebras/qwen-3-235b-a22b-thinking-2507
    • groq/moonshotai/kimi-k2-instruct-0905
  • balanced: Mid-tier models balancing speed and quality
    • openai/GLM-4.6
    • gemini/gemini-2.5-pro
    • openrouter/x-ai/grok-4-fast:free
    • qwen-cli
  • premium: High-end models for best quality responses
    • github_copilot/gpt-5
    • gemini/gemini-2.5-pro
    • openrouter/x-ai/grok-4-fast:free
    • qwen-cli

You can also define custom model combinations in the code.
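
For example, since MultiChat accepts a plain list of model strings (see Usage below), a custom combination is just another list; the "research" grouping here is made up for illustration:

# Illustrative only: a custom combination built from model IDs documented below.
research_models = [
    "gemini/gemini-2.5-pro",
    "openrouter/x-ai/grok-4-fast:free",
    "cerebras/qwen-3-235b-a22b-thinking-2507",
]
chat = MultiChat(research_models)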

Features

  • Chat with multiple LLMs simultaneously
  • Async processing for fast responses
  • Easy to add/remove models
  • Error handling for individual model failures
  • Built-in model presets (fast, balanced, premium)
  • API key availability checking
  • Support for 8 major LLM providers
  • Model-specific prompt adjustments for fine-tuning responses

Model-Specific Prompt Adjustments

Some models tend to give terse responses and benefit from explicit instructions. You can configure model-specific prompt prefixes in config/models.json:

"opencode/grok-code": {
  "display_name": "grok-code",
  "provider": "OpenCode CLI",
  "capabilities": ["coding", "deep-reasoning", "local-cli"],
  "description": "Grok code model via OpenCode CLI.",
  "prompt_prefix": "Respond comprehensively and explain in detail: "
}

When configured, the prefix is automatically prepended to user prompts for that model. For example:

  • User asks: "What is Python?"
  • Model receives: "Respond comprehensively and explain in detail: What is Python?"

This works in both the CLI and the web app: the prefix is applied transparently before the prompt is sent to the model.

This is particularly useful for:

  • Terse models like grok-code and code-supernova that need encouragement for detailed responses
  • Format instructions like "Answer in markdown format: "
  • Context setting like "As a Python expert: "

See config/models.md for more details and examples.
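
As an illustration of the mechanism (not necessarily the project's actual implementation), a prefix lookup could look like this, assuming config/models.json maps model IDs to entries like the one above:

import json

def apply_prompt_prefix(model_id: str, user_prompt: str) -> str:
    """Prepend the model's configured prompt_prefix, if any (illustrative sketch)."""
    with open("config/models.json") as f:
        model_config = json.load(f)
    prefix = model_config.get(model_id, {}).get("prompt_prefix", "")
    return prefix + user_prompt

# "What is Python?" becomes
# "Respond comprehensively and explain in detail: What is Python?"
print(apply_prompt_prefix("opencode/grok-code", "What is Python?"))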

Supported Models

The project supports multiple LLM providers through LiteLLM:

OpenAI

  • Requires: OPENAI_API_KEY

Anthropic

  • Requires: ANTHROPIC_API_KEY

Google Gemini

  • gemini/gemini-2.5-flash, gemini/gemini-2.5-pro
  • Requires: GOOGLE_API_KEY

Mistral

  • mistral/mistral-small, mistral/mistral-medium, mistral/mistral-large
  • Requires: MISTRAL_API_KEY

Cerebras

  • cerebras/qwen-3-235b-a22b-thinking-2507, cerebras/llama3.1-8b
  • Requires: CEREBRAS_API_KEY

Groq

  • groq/openai/gpt-oss-120b, groq/llama3-8b-8192, groq/mixtral-8x7b-32768
  • Requires: GROQ_API_KEY

GitHub Copilot

  • github_copilot/oswe-vscode-insiders, github_copilot/gpt-4o
  • Requires: GITHUB_TOKEN

OpenRouter

  • openrouter/x-ai/grok-4-fast:free
  • Requires: OPENROUTER_API_KEY

z.ai GLM

  • openai/GLM-4.6, openai/GLM-4.5
  • Configure .env with OPENAI_COMPAT_MODELS=GLM-4.6,GLM-4.5 (or ZAI_MODELS=...)
  • Provide the custom key via OPENAI_COMPAT_API_KEY (or ZAI_API_KEY) and an optional base URL via OPENAI_COMPAT_BASE_URL / ZAI_API_BASE_URL (defaults to https://api.z.ai/api/coding/paas/v4)
  • Keep OPENAI_API_KEY set to continue using the direct OpenAI endpoint alongside the z.ai models

Qwen CLI

  • qwen-cli: Qwen command-line interface tool
  • Requires: Qwen CLI installed and in PATH

Codex CLI

  • codex-cli: OpenAI Codex command-line interface tool
  • Requires: Codex CLI installed and in PATH

Gemini CLI

  • gemini-cli: Google Gemini command-line interface tool
  • Requires: Gemini CLI installed and in PATH

Usage

import asyncio
from src.multi_chat import MultiChat

async def main():
    # Mix and match models from different providers
    models = [
        "gemini/gemini-2.5-flash",                    # Google Gemini
        "cerebras/qwen-3-235b-a22b-thinking-2507",    # Cerebras
        "github_copilot/oswe-vscode-insiders",        # GitHub Copilot
        "groq/openai/gpt-oss-120b",                   # Groq
        "openrouter/x-ai/grok-4-fast:free",           # OpenRouter
        "openai/GLM-4.6",                            # z.ai GLM
        "qwen-cli",                                  # Qwen CLI tool
    ]

    chat = MultiChat(models)

    # Chat with all models simultaneously
    results = await chat.chat_all("Hello, how are you?")
    chat.print_results(results)

if __name__ == '__main__':
    asyncio.run(main())

POTENTIAL FEATURES TO ADD

Feature 1: Interactive Merge UI with Attribution Highlighting and Source Toggling

This feature combines visual interactivity, source transparency, and real-time control over the merging process.

Why it matters: Users need to understand which model contributed which part of the merged output. Without this, trust in the final result diminishes, especially in high-stakes domains like legal, medical, or technical writing. Visual cues and interactive controls allow users to inspect, validate, and refine the merge dynamically.

Implementation Details:

  • UI Layout: Use a three-panel layout:
    • Left: Original LLM responses (as collapsible cards).
    • Center: Merged output (editable).
    • Right: Metadata sidebar (source model, prompt, timestamp, etc.).
  • Color-Coding: Assign distinct colors to each model (e.g., GPT-4 = blue, Claude = purple, Gemini = green). Apply color highlights to corresponding segments in the merged output.
  • Hover Tooltips: On hover over a colored segment, show:
    • The exact original snippet from that model.
    • Model name, temperature, and prompt used.
  • Source Toggling: Add toggle switches in the sidebar to enable/disable contributions from specific models. When a model is toggled off, its segments are grayed out or removed from the merge in real time.
  • Drag-and-Drop Editing: Use React-Beautiful-DND or SortableJS to allow users to reorder response snippets directly in the merge panel.
  • Inline Editing: Use Quill or Slate.js for rich-text editing within the merged output. Allow users to edit, delete, or insert text directly.

Technical Stack:

  • Frontend: React + Redux/Context for state management.
  • Libraries: react-beautiful-dnd, quill, styled-components.
  • Backend: Store metadata per snippet in a JSONB field (PostgreSQL) or MongoDB document.

User Impact Example:

A researcher merges legal advice from GPT-4 and Claude. They notice a claim about "patent expiration timelines" is only from GPT-4. By toggling GPT-4 off, they see the claim disappears—prompting them to verify it independently.
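
To make the per-snippet metadata concrete, here is a minimal sketch of one attribution record as it might be stored in a PostgreSQL JSONB column or MongoDB document; the field names are illustrative, not a fixed schema:

from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class MergedSnippet:
    """One segment of the merged output plus its attribution metadata (illustrative)."""
    snippet_id: str
    text: str             # segment as it appears in the merged output
    source_model: str     # e.g. "gemini/gemini-2.5-pro"
    original_text: str    # exact snippet from that model's raw response
    prompt: str
    temperature: float
    color: str            # UI highlight color assigned to this model
    enabled: bool = True  # flipped to False when the model is toggled off
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = asdict(MergedSnippet(
    snippet_id="s1",
    text="Python is a general-purpose programming language.",
    source_model="gemini/gemini-2.5-pro",
    original_text="Python is a general-purpose programming language.",
    prompt="What is Python?",
    temperature=0.7,
    color="#34a853",
))
# `record` is a plain dict, ready to serialize into JSONB or a MongoDB document.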


Feature 2: Conflict Resolution Workbench with Confidence & Consistency Scoring

This feature introduces automated disagreement detection, confidence scoring, and interactive resolution tools to ensure factual accuracy and coherence.

Why it matters: LLMs often contradict each other on facts, tone, or technical details. A merged output that silently includes conflicting claims is unreliable. This feature surfaces disagreements and helps users resolve them efficiently.

Implementation Details:

  • Disagreement Detection:
    • Use semantic similarity metrics (e.g., cosine similarity via sentence embeddings) to compare model outputs.
    • Flag sentences or phrases where agreement is low (e.g., <60% similarity).
    • Use BLEURT, ROUGE-L, or BERTScore for more nuanced evaluation.
  • Confidence Scoring Engine:
    • Agreement Score: Compute entropy or KL-divergence across model outputs. Low entropy = high consensus.
    • Factual Consistency Score: For each claim, retrieve real-time snippets from trusted sources (e.g., Wikipedia, Google/Bing via API) using embeddings and a reranker (e.g., Cohere Rerank, ColBERT).
    • Fluency Score: Run the merged text through a lightweight LLM (e.g., DistilGPT-2) to compute perplexity—lower = more fluent.
  • UI Integration:
    • Color-code text in the merged output:
      • Green: High confidence (consensus ≥2 models, low perplexity).
      • Orange: Medium confidence (disagreement detected).
      • Red: Low confidence (only one model, high perplexity).
    • Clicking a flagged segment opens a split-view panel showing all model variants.
    • Include upvote/downvote buttons or a "Select Version" dropdown.
    • Allow users to write a custom resolution (e.g., "Paris is the capital of France (confirmed via Wikipedia)").
  • Auto-Merge Option: Provide a "Resolve Automatically" button that picks the most consistent version based on similarity scores.

Technical Stack:

  • NLP Libraries: sentence-transformers, bert-score, bleurt.
  • Search: Use You.com, Serper, or Bing API for real-time fact-checking.
  • Backend: Store diffs and scores in a versioned database.

User Impact Example:

A developer merges code solutions. The app flags a conflict: Model A suggests async/await, Model B uses callbacks. The user votes for async/await, and the merge updates instantly—no manual copy-paste.
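
As a rough sketch of the disagreement detection described above, the snippet below compares model answers pairwise with sentence-transformers (using the all-MiniLM-L6-v2 model recommended later under Model Selection & Best Practices) and flags pairs under a 0.6 cosine-similarity threshold:

from itertools import combinations

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def flag_disagreements(answers: dict[str, str], threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return model pairs whose answers fall below the similarity threshold."""
    names = list(answers)
    embeddings = embedder.encode([answers[n] for n in names], convert_to_tensor=True)
    flagged = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        score = util.cos_sim(embeddings[i], embeddings[j]).item()
        if score < threshold:
            flagged.append((a, b, score))
    return flagged

conflicts = flag_disagreements({
    "model_a": "Use async/await for concurrent I/O.",
    "model_b": "Callbacks are the recommended approach here.",
})
print(conflicts)  # e.g. [("model_a", "model_b", 0.42)] -> surface in the split-view panel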


Bonus Features (Highly Recommended):

  • Prompt Library: Let users save and tag prompts for reuse.
  • AI-Assisted Summarization: Add a "Smart Condense" button that runs a final LLM pass to generate a concise summary.
  • Model Weighting Sliders: Allow users to assign weights (e.g., 70% GPT-4, 30% Claude) for fine-grained control.

To build a best-in-class LLM response merging web app, implement the following integrated feature set, combining the strengths of all three sources:

Recommended Implementation Roadmap

Phase 1: Core Merging with Transparency (Weeks 1–4)

  1. Implement Delta-View UI (Source 2): Three-panel layout with model outputs and merged result.
  2. Add color-coded attribution (Sources 1 & 2): Highlight segments by model.
  3. Enable hover tooltips showing original snippets and metadata (Source 1).
  4. Integrate drag-and-drop reordering using React-Beautiful-DND (Source 0).

Phase 2: Quality & Conflict Resolution (Weeks 5–8)

  1. Build Conflict Resolution Workbench (Source 1):
    • Flag disagreements using sentence embeddings and cosine similarity.
    • Show split-view panel with upvote/select options.
  2. Add Confidence Scoring Engine (Source 2):
    • Compute agreement, factual consistency (via Bing API), and fluency (a perplexity sketch follows this list).
    • Color-code text (green/orange/red).
  3. Include "Resolve Automatically" button using similarity metrics.
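
The fluency part of the scoring engine can be sketched with Hugging Face transformers and the DistilGPT-2 model mentioned above; this is an illustrative snippet, not project code:

import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.eval()

def fluency_perplexity(text: str) -> float:
    """Perplexity of the merged text under DistilGPT-2; lower means more fluent."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())

print(fluency_perplexity("Paris is the capital of France."))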

Phase 3: Workflow & Collaboration (Weeks 9–12)

  1. Implement Task-Specific Merge Presets (Source 1):
    • Start with "Code Debugging", "Creative Brainstorming", "Factual Synthesis".
    • Allow user customization.
  2. Add Versioned Session Memory (Sources 0 & 2):
    • Store sessions as JSON-LD diff chains.
    • Use diff-match-patch for efficient storage (see the sketch after this list).
    • Build History Sidebar with diff previews and restore.
  3. Enable Export Options (Sources 0 & 2):
    • Markdown, HTML, PDF.
    • Include footnotes/bibliography.
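
A minimal sketch of the diff-chain storage idea using the Python diff-match-patch package; each saved version is stored as a patch against the previous one instead of full text:

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

def make_delta(previous: str, current: str) -> str:
    """Serialize the change between two merged-output versions as a patch string."""
    return dmp.patch_toText(dmp.patch_make(previous, current))

def restore(base: str, deltas: list[str]) -> str:
    """Replay a chain of stored patches on top of the base version."""
    text = base
    for delta in deltas:
        text, _applied = dmp.patch_apply(dmp.patch_fromText(delta), text)
    return text

v0 = "Paris is the capital."
v1 = "Paris is the capital of France."
delta_chain = [make_delta(v0, v1)]
assert restore(v0, delta_chain) == v1  # the History Sidebar can rebuild any version this way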

Phase 4: Advanced & Enterprise (Optional)

  1. Add REST API (Source 2) for CI/CD or automation.
  2. Integrate real-time collaboration (Source 0) with Liveblocks.
  3. Build Prompt Library (Source 0) with tagging and search.
  4. Add AI-Assisted Summarization (Source 0) for executive summaries.

Model Selection & Best Practices

  • For Confidence Scoring: Use all-MiniLM-L6-v2 for embeddings, bert-score for similarity.
  • For Fact-Checking: Use Serper API or You.com for low-cost web search.
  • For UI: Use Tailwind CSS + React for rapid development.
  • For Backend: Use FastAPI or Express.js with PostgreSQL.
  • For Real-Time: Use WebSockets or Liveblocks for collaboration.

🔧 1. Smarter Merging Algorithms (The "Brain")

The core differentiator of a high-quality merging app is not the number of models, but the intelligence of the fusion logic. All three sources agree that naive concatenation or majority voting is insufficient.

Recommended Merging Techniques

  • Confidence-Weighted Fusion
    • Implementation: Use logprobs or entropy to weight model outputs; normalize and calibrate per-model confidence using temperature scaling.
    • Tools & Libraries: openai.ChatCompletion(..., logprobs=5), Hugging Face transformers, torch.nn.functional
    • Impact: ↑ Accuracy by 20–40%; reduces hallucinations
  • Merge-Then-Refine (see the LiteLLM sketch below)
    • Implementation: Concatenate raw outputs → feed into a "refiner" LLM (e.g., GPT-4 or Mistral-7B) to produce a polished, coherent answer.
    • Tools & Libraries: LiteLLM, vLLM, transformers pipeline
    • Impact: Produces human-readable, logically consistent outputs
  • Token-Level Probability Fusion
    • Implementation: Combine probability distributions across models using weighted averaging or Bayesian methods.
    • Tools & Libraries: Custom Python logic, numpy, scipy
    • Impact: Optimal for models with shared tokenizers (e.g., the Llama family)
  • Chain-of-Thought Consensus
    • Implementation: Extract final answers from reasoning chains → apply voting on the extracted conclusions.
    • Tools & Libraries: Regex or an LLM prompt such as "Extract the final answer from the reasoning above."
    • Impact: ↑ Accuracy on math, coding, and policy tasks
  • Mixture-of-Experts (MoE) Gating
    • Implementation: Use a lightweight neural network (e.g., a 2-layer MLP) to dynamically select which model to trust per token or segment.
    • Tools & Libraries: sentence-transformers, torch, sklearn
    • Impact: Ideal for heterogeneous model ensembles (GPT-4, Claude, Llama)
  • Retrieval-Augmented Fusion (RAF)
    • Implementation: Retrieve top-k documents (Pinecone, Weaviate) → down-weight answers that contradict the retrieved evidence.
    • Tools & Libraries: Pinecone, FAISS, langchain, chromadb
    • Impact: Reduces hallucination in knowledge-intensive domains

💡 Pro Tip: Apply temperature calibration per model:

def calibrate_logits(logits, temperature):
    return logits / temperature

Fit temperature on a validation set (e.g., temp_gpt4=0.78, temp_claude=1.02) to improve confidence reliability.
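
For the merge-then-refine technique in the table above, here is a minimal sketch using LiteLLM (already this project's client library); the refiner model name and prompt wording are placeholders:

import litellm

REFINER_MODEL = "openai/gpt-4o"  # placeholder; any capable model can act as the refiner

async def merge_then_refine(prompt: str, raw_outputs: dict[str, str]) -> str:
    """Concatenate raw model answers and ask a refiner model for one coherent answer."""
    labeled = "\n\n".join(f"[{name}]\n{text}" for name, text in raw_outputs.items())
    response = await litellm.acompletion(
        model=REFINER_MODEL,
        messages=[
            {"role": "system", "content": "Merge the candidate answers into one accurate, coherent reply. Note any contradictions."},
            {"role": "user", "content": f"Question: {prompt}\n\nCandidate answers:\n{labeled}"},
        ],
    )
    return response.choices[0].message.content
    # call via asyncio.run(...) or from an existing async pipeline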


👁️ 2. User Control & Transparency (Trust Builder)

Users distrust black-box systems. All three sources emphasize explainability, customization, and interactive feedback.

Key UX Enhancements

  • Interactive Diff Viewer: Show side-by-side comparisons of raw model outputs vs. the merged result. Highlight consensus (green), contradictions (red), and excluded content. Tech: react-diff-viewer, monaco-editor, D3.js for visualizations.

  • Source Attribution & Confidence Display: Always show a summary such as:

    "Consensus formed from 3 sources. Primary source: GPT-4 (confidence: 92%). Claim 'X' from Claude-2 excluded due to low confidence."

  • Weight Sliders & Presets: Allow users to adjust model influence via UI sliders and offer domain-specific presets (see the sketch after this list):

    {
      "medical": {"priority": ["GPT-4", "Med-PaLM"], "confidence_threshold": 0.8},
      "coding": {"require_tests": true, "disregard": ["Claude-2"]}
    }
  • "Why This Answer?" Tooltip: Show token-level confidence heatmaps, model contributions, and reasoning snippets.

  • One-Click Export with Audit Trail: Export to PDF/Markdown with version history, source citations, and bias/confidence metadata.
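
A minimal sketch of how slider values and a domain preset could be combined into normalized per-model influence; the helper below is illustrative and reuses the preset shape shown above:

def effective_weights(slider_weights: dict[str, float], preset: dict) -> dict[str, float]:
    """Normalize UI slider weights, dropping models the preset disregards (illustrative)."""
    disregard = set(preset.get("disregard", []))
    kept = {m: w for m, w in slider_weights.items() if m not in disregard and w > 0}
    total = sum(kept.values()) or 1.0
    return {m: w / total for m, w in kept.items()}

coding_preset = {"require_tests": True, "disregard": ["Claude-2"]}
print(effective_weights({"GPT-4": 70, "Claude-2": 30, "Llama-3": 20}, coding_preset))
# {'GPT-4': 0.777..., 'Llama-3': 0.222...} -> used by the fusion step to weight contributions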


3. Performance & Scalability (Seamless UX)

Latency and cost are critical. All sources agree: streaming, caching, and parallelization are non-negotiable.

Performance Optimization Strategies

  • API Cost
    • Solution: Implement cost-aware routing: use cheaper models (e.g., gpt-4o-mini) first; escalate only if confidence is low.
    • Expected Impact: ↓ Costs by 30–70%
  • Throughput
    • Solution: Use continuous batching (vLLM, TensorRT-LLM) to process multiple prompts in parallel on GPU.
    • Expected Impact: 3–6× throughput increase
  • Cold Starts
    • Solution: Keep models warm with provisioned concurrency (AWS Lambda) or persistent workers.
    • Expected Impact: Eliminates 2–5 s cold-start delays
  • Redundant Queries (see the caching sketch below)
    • Solution: Request de-duplication: SHA256(prompt + params) → serve the cached response if it is <5 min old.
    • Expected Impact: 25–40% traffic reduction
  • Speculative Decoding
    • Solution: Use a small draft model (e.g., Mistral-7B) to generate tokens → verify with GPT-4.
    • Expected Impact: 1.8–2.3× speed-up with no quality loss
  • KV-Cache Reuse
    • Solution: Cache common prompt prefixes (e.g., system prompts) → compute only the deltas.
    • Expected Impact: 30–60% TTFT reduction
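
A minimal sketch of the request de-duplication row, assuming redis-py: the cache key is the SHA256 of the prompt plus parameters, and entries expire after five minutes:

import hashlib
import json

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(prompt: str, params: dict) -> str:
    """SHA256 over the prompt and its parameters, so identical requests collide."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return "merge:" + hashlib.sha256(payload.encode()).hexdigest()

def get_or_compute(prompt: str, params: dict, compute) -> str:
    key = cache_key(prompt, params)
    cached = cache.get(key)
    if cached is not None:
        return cached                    # served from cache, no model calls
    result = compute(prompt, params)     # expensive fan-out + merge
    cache.setex(key, 300, result)        # 300 s = the "<5 min old" rule above
    return result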

Must-Do: Add a cost and latency estimator before execution: "This query will cost ~$0.02 and take 1.2s."


🛠️ 4. Architecture & Infrastructure (Production-Grade)

All sources converge on a modular, scalable, and observable architecture.

Recommended Tech Stack

  • Frontend
    • Technology: Next.js 14 (React Server Components), Tailwind, HeadlessUI
    • Why: Fast first byte, modern UI
  • Backend (see the FastAPI + Celery sketch below)
    • Technology: FastAPI (async), Celery + Redis
    • Why: Decouples request handling from long-running tasks
  • Model Workers
    • Technology: Dockerized: OpenAI API, Anthropic, vLLM (local)
    • Why: Portable, scalable, API-agnostic
  • Caching
    • Technology: Redis (in-memory) with TTL (15–30 min)
    • Why: Reduces redundant API calls
  • Observability
    • Technology: Prometheus + Grafana, structured JSON logs, Loki/Elastic
    • Why: Real-time monitoring and debugging
  • CI/CD
    • Technology: GitHub Actions, ArgoCD, Docker multi-arch builds
    • Why: Automated testing and deployment
  • Secrets
    • Technology: AWS/GCP Secrets Manager, HashiCorp Vault
    • Why: Secure API key management
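
A minimal sketch of the backend row: a FastAPI endpoint hands the long-running fan-out-and-merge job to a Celery worker backed by Redis; the task body and route names are placeholders:

from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

celery_app = Celery("merger", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")
api = FastAPI()

class MergeRequest(BaseModel):
    prompt: str
    models: list[str]

@celery_app.task
def run_merge(prompt: str, models: list[str]) -> str:
    """Placeholder for the fan-out + merge pipeline; runs in a Celery worker, not the web process."""
    return f"merged answer for {prompt!r} from {len(models)} models"

@api.post("/merge")
async def start_merge(req: MergeRequest) -> dict:
    task = run_merge.delay(req.prompt, req.models)  # enqueue and return immediately
    return {"task_id": task.id}

@api.get("/merge/{task_id}")
async def merge_status(task_id: str) -> dict:
    result = celery_app.AsyncResult(task_id)
    return {"status": result.status, "output": result.result if result.ready() else None}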

🛡️ 5. Security, Compliance & Operations

Enterprise readiness requires privacy, auditability, and resilience.

Critical Ops Features

  • PII Redaction: Run spaCy or regex NER on raw responses → replace detected PII with {{redacted}} (see the sketch after this list).

  • Rate Limiting & Quotas: Use FastAPI limiter middleware → per-user or per-IP token buckets.

  • Circuit-Breaker Pattern: If p99 latency > 2 s or the error rate > 1%, auto-failover to a backup model or region.

  • Audit Trail: Persist a merge_log table: request_id, user_id, prompt_hash, model_weights, merged_output, timestamp.

  • Legal Compliance: Display a model-source disclaimer:

    "Answer generated by a blend of GPT-4, Claude, and Llama-2."

  • Disaster Recovery: Back up Redis snapshots every 6 h; maintain a hot-standby worker in another region.
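
A minimal sketch of the PII redaction item, assuming spaCy with the small English pipeline installed (python -m spacy download en_core_web_sm); which entity labels to redact is a policy decision:

import spacy

nlp = spacy.load("en_core_web_sm")
REDACT_LABELS = {"PERSON", "GPE", "ORG"}  # emails/phone numbers would need an extra regex pass

def redact_pii(text: str) -> str:
    """Replace detected entities with {{redacted}} before the response is stored or displayed."""
    doc = nlp(text)
    redacted = text
    # Replace from the end so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in REDACT_LABELS:
            redacted = redacted[:ent.start_char] + "{{redacted}}" + redacted[ent.end_char:]
    return redacted

print(redact_pii("Contact Jane Doe in Paris about the contract."))
# e.g. "Contact {{redacted}} in {{redacted}} about the contract."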

💡 Critical Insight: Users prefer slightly slower but explainable responses. Transparency > raw speed.


🚀 6. Implementation Roadmap (Prioritized)

  • Phase 1 (Weeks 1–2): Add confidence-weighted fusion, source attribution, SSE streaming, and Redis caching.
  • Phase 2 (Weeks 3–4): Implement the merge-then-refine pipeline, cost estimator, and circuit-breaker.
  • Phase 3 (Weeks 5–6): Build the interactive debugger, weight sliders, and feedback UI.
  • Phase 4 (Ongoing): Launch the feedback loop → fine-tune the merger model and add domain-specific tuning.

🎯 Final Insight: The Winning Edge

Stop building an "LLM aggregator" – build a "Truth Synthesizer." Your unique value: "I know which LLM is lying, and why." Start with confidence-weighted merging + transparency tools → outperform 90% of competitors. Then layer on domain expertise and feedback loops to create a defensible, outcome-focused product.

Synthesis and Recommendations

Step-by-Step Implementation Plan

Phase 1: Foundation (Week 1–2)

  1. Integrate confidence-weighted fusion using logprobs and temperature scaling (Source 2 code); see the weighting sketch after this list.
  2. Add Redis caching for prompt → merged answer (TTL=15 min) (All sources).
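
A minimal sketch of the weighting step: a softmax over temperature-calibrated mean token logprobs (the logprobs themselves must come from providers that expose them); this is illustrative, not project code:

import math

def fusion_weights(mean_logprobs: dict[str, float], temperatures: dict[str, float]) -> dict[str, float]:
    """Softmax over temperature-calibrated mean token logprobs, one weight per model."""
    calibrated = {m: lp / temperatures.get(m, 1.0) for m, lp in mean_logprobs.items()}
    z = max(calibrated.values())                       # subtract the max for numerical stability
    exp_scores = {m: math.exp(v - z) for m, v in calibrated.items()}
    total = sum(exp_scores.values())
    return {m: v / total for m, v in exp_scores.items()}

# e.g. fitted temperatures from a validation set, as suggested earlier
weights = fusion_weights(
    {"gpt-4": -0.21, "claude": -0.35, "llama-3": -0.60},
    {"gpt-4": 0.78, "claude": 1.02},
)
print(weights)  # higher weight -> larger influence on the merged answer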

Phase 2: Performance & UX (Week 3–4)

  1. Add continuous batching using vLLM (Source 1).
  2. Implement merge-then-refine with GPT-4 or Mistral-7B (Source 1).
  3. Build interactive diff viewer with color-coded conflicts (Source 0).
  4. Add weight sliders and presets (Source 2).
  5. Deploy Prometheus + Grafana for observability (All sources).

Phase 3: Intelligence & Learning (Week 5–6)

  1. Launch feedback system: paragraph-level 👍/👎 (Source 1).
  2. Track implicit feedback (hover time, corrections) (Source 0).
  3. Start fine-tuning a small merger model on user-approved outputs (Source 1).
  4. Add retrieval-augmented fusion with Pinecone (Source 0).

Phase 4: Enterprise Readiness (Ongoing)

  1. Implement PII redaction with spaCy (Source 2).
  2. Add circuit-breaker for API failures (All sources).
  3. Enable on-prem deployment via Docker (Source 0).
  4. Launch CI/CD pipeline with automated testing (Source 2).

Model Selection Criteria

  • For Merging: Mistral-7B (fast, cheap, fine-tunable).
  • For Refinement: GPT-4 or Claude-3 (high quality).
  • For Drafting: gpt-4o-mini or Llama-3-8B (speculative decoding).
  • For Routing: gpt-4o-mini as meta-LLM (Source 2).

Best Practices

  • Transparency > Speed: Users prefer explainable answers.
  • Cache Everything: Prompt, raw responses, merged output.
  • Measure User Satisfaction: Not just BLEU or latency.
  • Start Simple: Weighted voting → merge-then-refine → MoE.
  • Secure by Design: Secrets vault, PII redaction, audit logs.
