Chat with multiple LLMs simultaneously using LiteLLM and async Python.
- Create a virtual environment and install dependencies (recommended):
```bash
# Run the bundled installer, which creates .venv and installs the uv CLI.
# The script runs `uv sync` inside the venv to install dependencies.
./install.sh        # install runtime dependencies
./install.sh --dev  # install runtime + development dependencies (optional)
```

If you prefer to use uv directly without the installer, you can still run:

```bash
uv sync
```

- Copy the environment file and add your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

Get your API keys from:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/
- Google Gemini: https://aistudio.google.com/app/apikey
- Mistral: https://console.mistral.ai/
- Cerebras: https://cloud.cerebras.ai/
- Groq: https://console.groq.com/keys
- GitHub Copilot: Use your GitHub personal access token
- OpenRouter: https://openrouter.ai/keys
- Run the application:
```bash
uv run python main.py
```

The application includes three built-in presets for easy configuration:

- `fast`: Quick, cost-effective models for rapid responses
  - `groq/openai/gpt-oss-120b`
  - `cerebras/qwen-3-235b-a22b-thinking-2507`
  - `groq/moonshotai/kimi-k2-instruct-0905`
- `balanced`: Mid-tier models balancing speed and quality
  - `openai/GLM-4.6`
  - `gemini/gemini-2.5-pro`
  - `openrouter/x-ai/grok-4-fast:free`
  - `qwen-cli`
- `premium`: High-end models for the best-quality responses
  - `github_copilot/gpt-5`
  - `gemini/gemini-2.5-pro`
  - `openrouter/x-ai/grok-4-fast:free`
  - `qwen-cli`
You can also define custom model combinations in the code.
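A minimal sketch of a custom combination, assuming the `MultiChat` class shown in the usage example further below; any mix of the supported model identifiers listed later can be combined:

```python
from src.multi_chat import MultiChat

# Hypothetical custom combination mixing providers; swap in any supported identifiers.
custom_models = [
    "groq/openai/gpt-oss-120b",
    "gemini/gemini-2.5-pro",
    "openrouter/x-ai/grok-4-fast:free",
]
chat = MultiChat(custom_models)
```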
- Chat with multiple LLMs simultaneously
- Async processing for fast responses
- Easy to add/remove models
- Error handling for individual model failures
- Built-in model presets (fast, balanced, premium)
- API key availability checking
- Support for 8 major LLM providers
- Model-specific prompt adjustments for fine-tuning responses
Some models tend to give terse responses and benefit from explicit instructions. You can configure model-specific prompt prefixes in config/models.json:
"opencode/grok-code": {
"display_name": "grok-code",
"provider": "OpenCode CLI",
"capabilities": ["coding", "deep-reasoning", "local-cli"],
"description": "Grok code model via OpenCode CLI.",
"prompt_prefix": "Respond comprehensively and explain in detail: "
}When configured, the prefix is automatically prepended to user prompts for that model. For example:
- User asks: "What is Python?"
- Model receives: "Respond comprehensively and explain in detail: What is Python?"
This works in both the CLI and web app - the prefix is applied transparently before sending to the model.
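Conceptually, the prefix handling is a lookup followed by string concatenation. A minimal sketch, assuming a per-model dictionary loaded from config/models.json as shown above; the helper names and loading logic are illustrative, not the project's actual internals:

```python
import json


def load_model_config(path: str = "config/models.json") -> dict:
    # Hypothetical helper: the real project may load and validate this differently.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def apply_prompt_prefix(model: str, prompt: str, config: dict) -> str:
    """Prepend the model's configured prompt_prefix, if any."""
    prefix = config.get(model, {}).get("prompt_prefix", "")
    return f"{prefix}{prompt}"


config = load_model_config()
print(apply_prompt_prefix("opencode/grok-code", "What is Python?", config))
# -> "Respond comprehensively and explain in detail: What is Python?"
```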
This is particularly useful for:
- Terse models like `grok-code` and `code-supernova` that need encouragement for detailed responses
- Format instructions like "Answer in markdown format: "
- Context setting like "As a Python expert: "
See config/models.md for more details and examples.
The project supports multiple LLM providers through LiteLLM:
- OpenAI (requires `OPENAI_API_KEY`)
- Anthropic (requires `ANTHROPIC_API_KEY`)
- Google Gemini: `gemini/gemini-2.5-flash`, `gemini/gemini-2.5-pro` (requires `GOOGLE_API_KEY`)
- Mistral: `mistral/mistral-small`, `mistral/mistral-medium`, `mistral/mistral-large` (requires `MISTRAL_API_KEY`)
- Cerebras: `cerebras/qwen-3-235b-a22b-thinking-2507`, `cerebras/llama3.1-8b` (requires `CEREBRAS_API_KEY`)
- Groq: `groq/openai/gpt-oss-120b`, `groq/llama3-8b-8192`, `groq/mixtral-8x7b-32768` (requires `GROQ_API_KEY`)
- GitHub Copilot: `github_copilot/oswe-vscode-insiders`, `github_copilot/gpt-4o` (requires `GITHUB_TOKEN`)
- OpenRouter: `openrouter/x-ai/grok-4-fast:free` (requires `OPENROUTER_API_KEY`)
- z.ai GLM via an OpenAI-compatible endpoint: `openai/GLM-4.6`, `openai/GLM-4.5`
  - Configure `.env` with `OPENAI_COMPAT_MODELS=GLM-4.6,GLM-4.5` (or `ZAI_MODELS=...`)
  - Provide the custom key via `OPENAI_COMPAT_API_KEY` (or `ZAI_API_KEY`) and an optional base URL through `OPENAI_COMPAT_BASE_URL` / `ZAI_API_BASE_URL` (defaults to `https://api.z.ai/api/coding/paas/v4`)
  - Keep `OPENAI_API_KEY` set to continue using the direct OpenAI endpoint alongside the z.ai models
- `qwen-cli`: Qwen command-line interface tool (requires the Qwen CLI installed and on PATH; https://github.com/QwenLM/Qwen)
- `codex-cli`: OpenAI Codex command-line interface tool (requires the Codex CLI installed and on PATH)
- `gemini-cli`: Google Gemini command-line interface tool (requires the Gemini CLI installed and on PATH)
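A minimal sketch of checking CLI availability before enabling these models; the binary names are assumptions and should be matched to the executables your CLIs actually install:

```python
import shutil

# Hypothetical mapping of model identifiers to CLI executable names.
CLI_BINARIES = {
    "qwen-cli": "qwen",
    "codex-cli": "codex",
    "gemini-cli": "gemini",
}

for name, binary in CLI_BINARIES.items():
    status = "found on PATH" if shutil.which(binary) else "not installed"
    print(f"{name}: {status}")
```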
```python
import asyncio

from src.multi_chat import MultiChat


async def main():
    # Mix and match models from different providers
    models = [
        "gemini/gemini-2.5-flash",                  # Google Gemini
        "cerebras/qwen-3-235b-a22b-thinking-2507",  # Cerebras
        "github_copilot/oswe-vscode-insiders",      # GitHub Copilot
        "groq/openai/gpt-oss-120b",                 # Groq
        "openrouter/x-ai/grok-4-fast:free",         # OpenRouter
        "openai/GLM-4.6",                           # z.ai GLM
        "qwen-cli",                                 # Qwen CLI tool
    ]
    chat = MultiChat(models)

    # Chat with all models simultaneously
    results = await chat.chat_all("Hello, how are you?")
    chat.print_results(results)


if __name__ == "__main__":
    asyncio.run(main())
```

This feature combines visual interactivity, source transparency, and real-time control over the merging process.
Why it matters: Users need to understand which model contributed which part of the merged output. Without this, trust in the final result diminishes, especially in high-stakes domains like legal, medical, or technical writing. Visual cues and interactive controls allow users to inspect, validate, and refine the merge dynamically.
Implementation Details:
- UI Layout: Use a three-panel layout:
- Left: Original LLM responses (as collapsible cards).
- Center: Merged output (editable).
- Right: Metadata sidebar (source model, prompt, timestamp, etc.).
- Color-Coding: Assign distinct colors to each model (e.g., GPT-4 = blue, Claude = purple, Gemini = green). Apply color highlights to corresponding segments in the merged output.
- Hover Tooltips: On hover over a colored segment, show:
- The exact original snippet from that model.
- Model name, temperature, and prompt used.
- Source Toggling: Add toggle switches in the sidebar to enable/disable contributions from specific models. When a model is toggled off, its segments are grayed out or removed from the merge in real time.
- Drag-and-Drop Editing: Use React-Beautiful-DND or SortableJS to allow users to reorder response snippets directly in the merge panel.
- Inline Editing: Use Quill or Slate.js for rich-text editing within the merged output. Allow users to edit, delete, or insert text directly.
Technical Stack:
- Frontend: React + Redux/Context for state management.
- Libraries: `react-beautiful-dnd`, `quill`, `styled-components`.
- Backend: Store metadata per snippet in a JSONB field (PostgreSQL) or a MongoDB document.
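A minimal sketch of what the per-snippet metadata stored in the backend might look like; the field names are illustrative, not a fixed schema:

```python
# Illustrative per-snippet record; adapt the fields to your own schema.
snippet_metadata = {
    "snippet_id": "a1b2c3",
    "source_model": "gpt-4",          # which model produced this segment
    "color": "#3b82f6",               # color used for attribution highlighting
    "original_text": "Patents generally expire 20 years after filing.",
    "prompt": "Summarize patent expiration rules.",
    "temperature": 0.7,
    "timestamp": "2025-01-15T10:32:00Z",
    "enabled": True,                  # set to False when the model is toggled off in the sidebar
    "position": 4,                    # order within the merged output
}
```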
User Impact Example:
A researcher merges legal advice from GPT-4 and Claude. They notice a claim about "patent expiration timelines" is only from GPT-4. By toggling GPT-4 off, they see the claim disappears—prompting them to verify it independently.
This feature introduces automated disagreement detection, confidence scoring, and interactive resolution tools to ensure factual accuracy and coherence.
Why it matters: LLMs often contradict each other on facts, tone, or technical details. A merged output that silently includes conflicting claims is unreliable. This feature surfaces disagreements and helps users resolve them efficiently.
Implementation Details:
- Disagreement Detection:
- Use semantic similarity metrics (e.g., cosine similarity via sentence embeddings) to compare model outputs.
- Flag sentences or phrases where agreement is low (e.g., <60% similarity); see the sketch after this list.
- Use BLEURT, ROUGE-L, or BERTScore for more nuanced evaluation.
- Confidence Scoring Engine:
- Agreement Score: Compute entropy or KL-divergence across model outputs. Low entropy = high consensus.
- Factual Consistency Score: For each claim, retrieve real-time snippets from trusted sources (e.g., Wikipedia, Google/Bing via API) using embeddings and a reranker (e.g., Cohere Rerank, ColBERT).
- Fluency Score: Run the merged text through a lightweight LLM (e.g., DistilGPT-2) to compute perplexity—lower = more fluent.
- UI Integration:
- Color-code text in the merged output:
- Green: High confidence (consensus ≥2 models, low perplexity).
- Orange: Medium confidence (disagreement detected).
- Red: Low confidence (only one model, high perplexity).
- Clicking a flagged segment opens a split-view panel showing all model variants.
- Include upvote/downvote buttons or a "Select Version" dropdown.
- Allow users to write a custom resolution (e.g., "Paris is the capital of France (confirmed via Wikipedia)").
- Auto-Merge Option: Provide a "Resolve Automatically" button that picks the most consistent version based on similarity scores.
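A minimal sketch of the disagreement detection and "Resolve Automatically" steps above, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (also recommended later in this document):

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def pairwise_agreement(claims: dict[str, str]) -> float:
    """Mean cosine similarity across all pairs of model outputs for one claim."""
    emb = {name: model.encode(text, convert_to_tensor=True) for name, text in claims.items()}
    sims = [float(util.cos_sim(emb[a], emb[b])) for a, b in combinations(claims, 2)]
    return sum(sims) / len(sims)


def auto_resolve(claims: dict[str, str]) -> str:
    """Pick the variant that is, on average, most similar to the other models' variants."""
    emb = {name: model.encode(text, convert_to_tensor=True) for name, text in claims.items()}

    def avg_sim(name):
        others = [m for m in claims if m != name]
        return sum(float(util.cos_sim(emb[name], emb[m])) for m in others) / len(others)

    return claims[max(claims, key=avg_sim)]


claims = {
    "gpt-4": "The capital of Australia is Canberra.",
    "claude": "Canberra is Australia's capital city.",
    "llama": "Sydney is the capital of Australia.",
}
if pairwise_agreement(claims) < 0.6:  # flag low agreement (<60% similarity)
    print("Disagreement detected; auto-resolved to:", auto_resolve(claims))
```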
Technical Stack:
- NLP Libraries: `sentence-transformers`, `bert-score`, `bleurt`.
- Search: Use You.com, Serper, or the Bing API for real-time fact-checking.
- Backend: Store diffs and scores in a versioned database.
User Impact Example:
A developer merges code solutions. The app flags a conflict: Model A suggests `async/await`, Model B uses callbacks. The user votes for `async/await`, and the merge updates instantly, with no manual copy-paste.
Bonus Features (Highly Recommended):
- Prompt Library: Let users save and tag prompts for reuse.
- AI-Assisted Summarization: Add a "Smart Condense" button that runs a final LLM pass to generate a concise summary.
- Model Weighting Sliders: Allow users to assign weights (e.g., 70% GPT-4, 30% Claude) for fine-grained control.
To build a best-in-class LLM response merging web app, implement the following integrated feature set, combining the strengths of all three sources:
Phase 1: Core Merging with Transparency (Weeks 1–4)
- Implement Delta-View UI (Source 2): Three-panel layout with model outputs and merged result.
- Add color-coded attribution (Sources 1 & 2): Highlight segments by model.
- Enable hover tooltips showing original snippets and metadata (Source 1).
- Integrate drag-and-drop reordering using React-Beautiful-DND (Source 0).
Phase 2: Quality & Conflict Resolution (Weeks 5–8)
- Build Conflict Resolution Workbench (Source 1):
- Flag disagreements using sentence embeddings and cosine similarity.
- Show split-view panel with upvote/select options.
- Add Confidence Scoring Engine (Source 2):
- Compute agreement, factual consistency (via Bing API), and fluency.
- Color-code text (green/orange/red).
- Include "Resolve Automatically" button using similarity metrics.
Phase 3: Workflow & Collaboration (Weeks 9–12)
- Implement Task-Specific Merge Presets (Source 1):
- Start with "Code Debugging", "Creative Brainstorming", "Factual Synthesis".
- Allow user customization.
- Add Versioned Session Memory (Sources 0 & 2):
- Store sessions as JSON-LD diff chains.
- Use diff-match-patch for efficient storage (see the sketch after this list).
- Build History Sidebar with diff previews and restore.
- Enable Export Options (Sources 0 & 2):
- Markdown, HTML, PDF.
- Include footnotes/bibliography.
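A minimal sketch of storing session versions as diffs with the Python `diff-match-patch` package; the session text is hypothetical and the JSON-LD wrapping is omitted:

```python
from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

v1 = "Merged draft: Paris is the capital of France."
v2 = "Merged draft: Paris is the capital of France. Population: about 2.1 million."

# Store only the patch between versions instead of the full text.
patch_text = dmp.patch_toText(dmp.patch_make(v1, v2))

# Later: rebuild v2 from v1 plus the stored patch.
restored, results = dmp.patch_apply(dmp.patch_fromText(patch_text), v1)
assert restored == v2 and all(results)
```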
Phase 4: Advanced & Enterprise (Optional)
- Add REST API (Source 2) for CI/CD or automation.
- Integrate real-time collaboration (Source 0) with Liveblocks.
- Build Prompt Library (Source 0) with tagging and search.
- Add AI-Assisted Summarization (Source 0) for executive summaries.
- For Confidence Scoring: Use `all-MiniLM-L6-v2` for embeddings, `bert-score` for similarity.
- For Fact-Checking: Use the Serper API or You.com for low-cost web search.
- For UI: Use Tailwind CSS + React for rapid development.
- For Backend: Use FastAPI or Express.js with PostgreSQL.
- For Real-Time: Use WebSockets or Liveblocks for collaboration.
The core differentiator of a high-quality merging app is not the number of models, but the intelligence of the fusion logic. All three sources agree that naive concatenation or majority voting is insufficient.
| Technique | Implementation | Tools & Libraries | Impact |
|---|---|---|---|
| Confidence-Weighted Fusion | Use logprobs or entropy to weight model outputs. Normalize and calibrate per-model confidence using temperature scaling. | `openai.ChatCompletion(..., logprobs=5)`, Hugging Face `transformers`, `torch.nn.functional` | ↑ Accuracy by 20–40%; reduces hallucinations |
| Merge-Then-Refine | Concatenate raw outputs → feed into a "refiner" LLM (e.g., GPT-4 or Mistral-7B) to produce a polished, coherent answer. | LiteLLM, vLLM, `transformers` pipeline | Produces human-readable, logically consistent outputs |
| Token-Level Probability Fusion | Combine probability distributions across models using weighted averaging or Bayesian methods. | Custom Python logic, `numpy`, `scipy` | Optimal for models with shared tokenizers (e.g., Llama family) |
| Chain-of-Thought Consensus | Extract final answers from reasoning chains → apply voting on the extracted conclusions. | Regex or LLM prompt: "Extract the final answer from the reasoning above." | ↑ Accuracy on math, coding, and policy tasks |
| Mixture-of-Experts (MoE) Gating | Use a lightweight neural network (e.g., a 2-layer MLP) to dynamically select which model to trust per token or segment. | `sentence-transformers`, `torch`, `sklearn` | Ideal for heterogeneous model ensembles (GPT-4, Claude, Llama) |
| Retrieval-Augmented Fusion (RAF) | Retrieve top-k documents (Pinecone, Weaviate) → down-weight answers that contradict evidence. | Pinecone, FAISS, `langchain`, `chromadb` | Reduces hallucination in knowledge-intensive domains |
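A minimal sketch of the confidence-weighted fusion row above, applied at the answer level: each model's mean token log-probability is temperature-calibrated and turned into a normalized weight. This is a simplification and assumes you have already collected per-token logprobs from each provider:

```python
import math


def fuse_by_confidence(candidates: list[dict]) -> dict:
    """Weight candidate answers by calibrated mean token logprob and return the best one.

    Each candidate: {"model": str, "text": str, "logprobs": list[float], "temperature": float}
    """
    for c in candidates:
        calibrated = [lp / c["temperature"] for lp in c["logprobs"]]   # temperature calibration
        c["confidence"] = math.exp(sum(calibrated) / len(calibrated))  # mean token probability
    total = sum(c["confidence"] for c in candidates)
    for c in candidates:
        c["weight"] = c["confidence"] / total                          # normalized fusion weight
    return max(candidates, key=lambda c: c["weight"])


candidates = [
    {"model": "gpt-4", "text": "Answer A", "logprobs": [-0.1, -0.3, -0.2], "temperature": 0.78},
    {"model": "claude", "text": "Answer B", "logprobs": [-0.9, -1.2, -0.7], "temperature": 1.02},
]
best = fuse_by_confidence(candidates)
print(best["model"], round(best["weight"], 3))
```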
💡 Pro Tip: Apply temperature calibration per model:

```python
def calibrate_logits(logits, temperature):
    return logits / temperature
```

Fit `temperature` on a validation set (e.g., `temp_gpt4=0.78`, `temp_claude=1.02`) to improve confidence reliability.
Users distrust black-box systems. All three sources emphasize explainability, customization, and interactive feedback.
- Interactive Diff Viewer: Show side-by-side comparisons of raw model outputs vs. the merged result. Highlight consensus (green), contradictions (red), and excluded content. Tech: `react-diff-viewer`, `monaco-editor`, D3.js for visualizations.
- Source Attribution & Confidence Display: Always show: "Consensus formed from 3 sources. Primary source: GPT-4 (confidence: 92%). Claim 'X' from Claude-2 excluded due to low confidence."
- Weight Sliders & Presets: Allow users to adjust model influence via UI sliders. Offer domain-specific presets:

  ```json
  {
    "medical": {"priority": ["GPT-4", "Med-PaLM"], "confidence_threshold": 0.8},
    "coding": {"require_tests": true, "disregard": ["Claude-2"]}
  }
  ```

- "Why This Answer?" Tooltip: Show token-level confidence heatmaps, model contributions, and reasoning snippets.
- One-Click Export with Audit Trail: Export to PDF/Markdown with version history, source citations, and bias/confidence metadata.
Latency and cost are critical. All sources agree: streaming, caching, and parallelization are non-negotiable.
| Bottleneck | Solution | Expected Impact |
|---|---|---|
| API Cost | Implement cost-aware routing: use cheaper models (e.g., `gpt-4o-mini`) first; escalate only if confidence is low. | ↓ Costs by 30–70% |
| Throughput | Use continuous batching (vLLM, TensorRT-LLM) to process multiple prompts in parallel on GPU. | 3–6× throughput increase |
| Cold Starts | Keep models warm with provisioned concurrency (AWS Lambda) or persistent workers. | Eliminates 2–5 s delays |
| Redundant Queries | Request de-duplication: `SHA256(prompt + params)` → serve the cached response if < 5 min old. | 25–40% traffic reduction |
| Speculative Decoding | Use a small draft model (e.g., Mistral-7B) to generate tokens → verify with GPT-4. | 1.8–2.3× speed-up, no quality loss |
| KV-Cache Reuse | Cache common prompt prefixes (e.g., system prompts) → compute only deltas. | 30–60% TTFT reduction |
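A minimal sketch of the request de-duplication row above, using `redis-py` with a SHA-256 cache key; the key layout and TTL are illustrative:

```python
import hashlib
import json

import redis

r = redis.Redis()


def cache_key(prompt: str, params: dict) -> str:
    # Canonical JSON so identical requests always hash to the same key.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_call(prompt: str, params: dict, call_models) -> str:
    key = cache_key(prompt, params)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()                 # serve the cached response if still fresh
    response = call_models(prompt, params)  # expensive multi-model call
    r.setex(key, 300, response)             # keep for 5 minutes
    return response
```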
✅ Must-Do: Add a cost and latency estimator before execution: "This query will cost ~$0.02 and take 1.2 s."
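A minimal sketch of such an estimator; the per-million-token prices below are placeholders and must be replaced with your providers' current rates:

```python
# Placeholder prices (USD per 1M input/output tokens); replace with real, current rates.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gemini/gemini-2.5-flash": {"input": 0.30, "output": 2.50},
}


def estimate_cost(models: list[str], input_tokens: int, expected_output_tokens: int) -> float:
    total = 0.0
    for m in models:
        p = PRICES[m]
        total += (input_tokens * p["input"] + expected_output_tokens * p["output"]) / 1e6
    return total


print(f"~${estimate_cost(['gpt-4o-mini'], 1_200, 500):.4f}")
```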
All sources converge on a modular, scalable, and observable architecture.
| Layer | Technology | Why |
|---|---|---|
| Frontend | Next.js 14 (React Server Components), Tailwind, HeadlessUI | Fast first-byte, modern UI |
| Backend | FastAPI (async), Celery + Redis | Decouples request handling from long-running tasks |
| Model Workers | Dockerized: OpenAI API, Anthropic, vLLM (local) | Portable, scalable, API-agnostic |
| Caching | Redis (in-memory), with TTL (15–30 min) | Reduces redundant API calls |
| Observability | Prometheus + Grafana, structured JSON logs, Loki/Elastic | Real-time monitoring, debugging |
| CI/CD | GitHub Actions, ArgoCD, Docker multi-arch builds | Automated testing and deployment |
| Secrets | AWS/GCP Secrets Manager, HashiCorp Vault | Secure API key management |
Enterprise readiness requires privacy, auditability, and resilience.
- PII Redaction: Run spaCy or regex NER on raw responses → replace detected PII with `{{redacted}}` (see the sketch after this list).
- Rate Limiting & Quotas: Use FastAPI limiter middleware → per-user or per-IP token bucket.
- Circuit-Breaker Pattern: If p99 latency > 2 s or the error rate > 1%, auto-failover to a backup model or region.
- Audit Trail: Persist a `merge_log` table: `request_id`, `user_id`, `prompt_hash`, `model_weights`, `merged_output`, `timestamp`.
- Legal Compliance: Display a model-source disclaimer: "Answer generated by a blend of GPT-4, Claude, and Llama-2."
- Disaster Recovery: Back up Redis snapshots every 6 h; maintain a hot-standby worker in another region.
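A minimal sketch of the PII redaction step, assuming spaCy's small English model is installed (`python -m spacy download en_core_web_sm`); the label set is illustrative and should be adjusted to your compliance needs:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
PII_LABELS = {"PERSON", "GPE", "ORG", "DATE"}  # adjust to your compliance requirements


def redact(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace entities right-to-left so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in PII_LABELS:
            redacted = redacted[:ent.start_char] + "{{redacted}}" + redacted[ent.end_char:]
    return redacted


print(redact("Contact Jane Doe in Berlin before March 3rd."))
```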
💡 Critical Insight: Users prefer slightly slower but explainable responses. Transparency > raw speed.
| Phase | Key Actions | Timeframe |
|---|---|---|
| 1 (Week 1–2) | Add confidence-weighted fusion, source attribution, SSE streaming, Redis caching | 2 weeks |
| 2 (Week 3–4) | Implement merge-then-refine pipeline, cost estimator, circuit-breaker | 2 weeks |
| 3 (Week 5–6) | Build interactive debugger, weight sliders, feedback UI | 2 weeks |
| 4 (Ongoing) | Launch feedback loop → fine-tune merger model, add domain-specific tuning | Continuous |
Stop building an "LLM aggregator" – build a "Truth Synthesizer." Your unique value: "I know which LLM is lying, and why." Start with confidence-weighted merging + transparency tools → outperform 90% of competitors. Then layer on domain expertise and feedback loops to create a defensible, outcome-focused product.
- Integrate confidence-weighted fusion using logprobs and temperature scaling (Source 2 code).
- Add Redis caching for prompt → merged answer (TTL=15 min) (All sources).
- Add continuous batching using vLLM (Source 1).
- Implement merge-then-refine with GPT-4 or Mistral-7B (Source 1).
- Build interactive diff viewer with color-coded conflicts (Source 0).
- Add weight sliders and presets (Source 2).
- Deploy Prometheus + Grafana for observability (All sources).
- Launch feedback system: paragraph-level 👍/👎 (Source 1).
- Track implicit feedback (hover time, corrections) (Source 0).
- Start fine-tuning a small merger model on user-approved outputs (Source 1).
- Add retrieval-augmented fusion with Pinecone (Source 0).
- Implement PII redaction with spaCy (Source 2).
- Add circuit-breaker for API failures (All sources).
- Enable on-prem deployment via Docker (Source 0).
- Launch CI/CD pipeline with automated testing (Source 2).
- For Merging: Mistral-7B (fast, cheap, fine-tunable).
- For Refinement: GPT-4 or Claude-3 (high quality).
- For Drafting: gpt-4o-mini or Llama-3-8B (speculative decoding).
- For Routing: gpt-4o-mini as meta-LLM (Source 2).
- Transparency > Speed: Users prefer explainable answers.
- Cache Everything: Prompt, raw responses, merged output.
- Measure User Satisfaction: Not just BLEU or latency.
- Start Simple: Weighted voting → merge-then-refine → MoE.
- Secure by Design: Secrets vault, PII redaction, audit logs.