Chat with multiple LLMs simultaneously using LiteLLM and async Python.
- Create a virtual environment and install dependencies (recommended):
```bash
# Run the bundled installer, which creates .venv and installs the uv CLI.
# The script runs `uv sync` inside the venv to install dependencies.
./install.sh        # install runtime dependencies
./install.sh --dev  # install runtime + development dependencies (optional)
```

If you prefer to use uv directly without the installer, you can still run:

```bash
uv sync
```

- Copy the environment file and add your API keys:

```bash
cp .env.example .env
# Edit .env with your API keys
```

Get your API keys from:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/
- Google Gemini: https://aistudio.google.com/app/apikey
- Mistral: https://console.mistral.ai/
- Cerebras: https://cloud.cerebras.ai/
- Groq: https://console.groq.com/keys
- GitHub Copilot: Use your GitHub personal access token
- OpenRouter: https://openrouter.ai/keys
- Run the application:
```bash
uv run python main.py
```

The application includes three built-in presets for easy configuration:

- `fast`: Quick, cost-effective models for rapid responses
  - `groq/openai/gpt-oss-120b`
  - `cerebras/qwen-3-235b-a22b-thinking-2507`
  - `groq/moonshotai/kimi-k2-instruct-0905`
- `balanced`: Mid-tier models balancing speed and quality
  - `openai/GLM-4.6`
  - `gemini/gemini-2.5-pro`
  - `openrouter/x-ai/grok-4-fast:free`
  - `qwen-cli`
- `premium`: High-end models for the best-quality responses
  - `github_copilot/gpt-5`
  - `gemini/gemini-2.5-pro`
  - `openrouter/x-ai/grok-4-fast:free`
  - `qwen-cli`
You can also define custom model combinations in the code.
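A minimal sketch of a custom combination, assuming the `MultiChat` class shown in the usage example further below; any mix of the supported model identifiers listed later can be combined:

```python
from src.multi_chat import MultiChat

# Hypothetical custom combination mixing providers; swap in any supported identifiers.
custom_models = [
    "groq/openai/gpt-oss-120b",
    "gemini/gemini-2.5-pro",
    "openrouter/x-ai/grok-4-fast:free",
]
chat = MultiChat(custom_models)
```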
- Chat with multiple LLMs simultaneously
- Async processing for fast responses
- Easy to add/remove models
- Error handling for individual model failures
- Built-in model presets (fast, balanced, premium)
- API key availability checking
- Support for 8 major LLM providers
- Model-specific prompt adjustments for fine-tuning responses
Some models tend to give terse responses and benefit from explicit instructions. You can configure model-specific prompt prefixes in config/models.json:
"opencode/grok-code": {
"display_name": "grok-code",
"provider": "OpenCode CLI",
"capabilities": ["coding", "deep-reasoning", "local-cli"],
"description": "Grok code model via OpenCode CLI.",
"prompt_prefix": "Respond comprehensively and explain in detail: "
}When configured, the prefix is automatically prepended to user prompts for that model. For example:
- User asks: "What is Python?"
- Model receives: "Respond comprehensively and explain in detail: What is Python?"
This works in both the CLI and web app - the prefix is applied transparently before sending to the model.
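Conceptually, the prefix handling is a lookup followed by string concatenation. A minimal sketch, assuming a per-model dictionary loaded from config/models.json as shown above; the helper names and loading logic are illustrative, not the project's actual internals:

```python
import json


def load_model_config(path: str = "config/models.json") -> dict:
    # Hypothetical helper: the real project may load and validate this differently.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def apply_prompt_prefix(model: str, prompt: str, config: dict) -> str:
    """Prepend the model's configured prompt_prefix, if any."""
    prefix = config.get(model, {}).get("prompt_prefix", "")
    return f"{prefix}{prompt}"


config = load_model_config()
print(apply_prompt_prefix("opencode/grok-code", "What is Python?", config))
# -> "Respond comprehensively and explain in detail: What is Python?"
```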
This is particularly useful for:
- Terse models like `grok-code` and `code-supernova` that need encouragement for detailed responses
- Format instructions like "Answer in markdown format: "
- Context setting like "As a Python expert: "
See config/models.md for more details and examples.
The project supports multiple LLM providers through LiteLLM:
- OpenAI (requires `OPENAI_API_KEY`)
- Anthropic (requires `ANTHROPIC_API_KEY`)
- Google Gemini: `gemini/gemini-2.5-flash`, `gemini/gemini-2.5-pro` (requires `GOOGLE_API_KEY`)
- Mistral: `mistral/mistral-small`, `mistral/mistral-medium`, `mistral/mistral-large` (requires `MISTRAL_API_KEY`)
- Cerebras: `cerebras/qwen-3-235b-a22b-thinking-2507`, `cerebras/llama3.1-8b` (requires `CEREBRAS_API_KEY`)
- Groq: `groq/openai/gpt-oss-120b`, `groq/llama3-8b-8192`, `groq/mixtral-8x7b-32768` (requires `GROQ_API_KEY`)
- GitHub Copilot: `github_copilot/oswe-vscode-insiders`, `github_copilot/gpt-4o` (requires `GITHUB_TOKEN`)
- OpenRouter: `openrouter/x-ai/grok-4-fast:free` (requires `OPENROUTER_API_KEY`)
- z.ai GLM via an OpenAI-compatible endpoint: `openai/GLM-4.6`, `openai/GLM-4.5`
  - Configure `.env` with `OPENAI_COMPAT_MODELS=GLM-4.6,GLM-4.5` (or `ZAI_MODELS=...`)
  - Provide the custom key via `OPENAI_COMPAT_API_KEY` (or `ZAI_API_KEY`) and an optional base URL through `OPENAI_COMPAT_BASE_URL` / `ZAI_API_BASE_URL` (defaults to `https://api.z.ai/api/coding/paas/v4`)
  - Keep `OPENAI_API_KEY` set to continue using the direct OpenAI endpoint alongside the z.ai models
- `qwen-cli`: Qwen command-line interface tool (requires the Qwen CLI installed and on PATH; https://github.com/QwenLM/Qwen)
- `codex-cli`: OpenAI Codex command-line interface tool (requires the Codex CLI installed and on PATH)
- `gemini-cli`: Google Gemini command-line interface tool (requires the Gemini CLI installed and on PATH)
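A minimal sketch of checking CLI availability before enabling these models; the binary names are assumptions and should be matched to the executables your CLIs actually install:

```python
import shutil

# Hypothetical mapping of model identifiers to CLI executable names.
CLI_BINARIES = {
    "qwen-cli": "qwen",
    "codex-cli": "codex",
    "gemini-cli": "gemini",
}

for name, binary in CLI_BINARIES.items():
    status = "found on PATH" if shutil.which(binary) else "not installed"
    print(f"{name}: {status}")
```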
```python
import asyncio

from src.multi_chat import MultiChat


async def main():
    # Mix and match models from different providers
    models = [
        "gemini/gemini-2.5-flash",                  # Google Gemini
        "cerebras/qwen-3-235b-a22b-thinking-2507",  # Cerebras
        "github_copilot/oswe-vscode-insiders",      # GitHub Copilot
        "groq/openai/gpt-oss-120b",                 # Groq
        "openrouter/x-ai/grok-4-fast:free",         # OpenRouter
        "openai/GLM-4.6",                           # z.ai GLM
        "qwen-cli",                                 # Qwen CLI tool
    ]
    chat = MultiChat(models)

    # Chat with all models simultaneously
    results = await chat.chat_all("Hello, how are you?")
    chat.print_results(results)


if __name__ == "__main__":
    asyncio.run(main())
```

This feature combines visual interactivity, source transparency, and real-time control over the merging process.
Why it matters: Users need to understand which model contributed which part of the merged output. Without this, trust in the final result diminishes, especially in high-stakes domains like legal, medical, or technical writing. Visual cues and interactive controls allow users to inspect, validate, and refine the merge dynamically.
Implementation Details:
- UI Layout: Use a three-panel layout:
- Left: Original LLM responses (as collapsible cards).
- Center: Merged output (editable).
- Right: Metadata sidebar (source model, prompt, timestamp, etc.).
- Color-Coding: Assign distinct colors to each model (e.g., GPT-4 = blue, Claude = purple, Gemini = green). Apply color highlights to corresponding segments in the merged output.
- Hover Tooltips: On hover over a colored segment, show:
- The exact original snippet from that model.
- Model name, temperature, and prompt used.
- Source Toggling: Add toggle switches in the sidebar to enable/disable contributions from specific models. When a model is toggled off, its segments are grayed out or removed from the merge in real time.
- Drag-and-Drop Editing: Use React-Beautiful-DND or SortableJS to allow users to reorder response snippets directly in the merge panel.
- Inline Editing: Use Quill or Slate.js for rich-text editing within the merged output. Allow users to edit, delete, or insert text directly.
Technical Stack:
- Frontend: React + Redux/Context for state management.
- Libraries: `react-beautiful-dnd`, `quill`, `styled-components`.
- Backend: Store metadata per snippet in a JSONB field (PostgreSQL) or a MongoDB document.
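A minimal sketch of what the per-snippet metadata stored in the backend might look like; the field names are illustrative, not a fixed schema:

```python
# Illustrative per-snippet record; adapt the fields to your own schema.
snippet_metadata = {
    "snippet_id": "a1b2c3",
    "source_model": "gpt-4",          # which model produced this segment
    "color": "#3b82f6",               # color used for attribution highlighting
    "original_text": "Patents generally expire 20 years after filing.",
    "prompt": "Summarize patent expiration rules.",
    "temperature": 0.7,
    "timestamp": "2025-01-15T10:32:00Z",
    "enabled": True,                  # set to False when the model is toggled off in the sidebar
    "position": 4,                    # order within the merged output
}
```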
User Impact Example:
A researcher merges legal advice from GPT-4 and Claude. They notice a claim about "patent expiration timelines" is only from GPT-4. By toggling GPT-4 off, they see the claim disappears—prompting them to verify it independently.
This feature introduces automated disagreement detection, confidence scoring, and interactive resolution tools to ensure factual accuracy and coherence.
Why it matters: LLMs often contradict each other on facts, tone, or technical details. A merged output that silently includes conflicting claims is unreliable. This feature surfaces disagreements and helps users resolve them efficiently.
Implementation Details:
- Disagreement Detection:
- Use semantic similarity metrics (e.g., cosine similarity via sentence embeddings) to compare model outputs.
- Flag sentences or phrases where agreement is low (e.g., <60% similarity); see the sketch after this list.
- Use BLEURT, ROUGE-L, or BERTScore for more nuanced evaluation.
- Confidence Scoring Engine:
- Agreement Score: Compute entropy or KL-divergence across model outputs. Low entropy = high consensus.
- Factual Consistency Score: For each claim, retrieve real-time snippets from trusted sources (e.g., Wikipedia, Google/Bing via API) using embeddings and a reranker (e.g., Cohere Rerank, ColBERT).
- Fluency Score: Run the merged text through a lightweight LLM (e.g., DistilGPT-2) to compute perplexity—lower = more fluent.
- UI Integration:
- Color-code text in the merged output:
- Green: High confidence (consensus ≥2 models, low perplexity).
- Orange: Medium confidence (disagreement detected).
- Red: Low confidence (only one model, high perplexity).
- Clicking a flagged segment opens a split-view panel showing all model variants.
- Include upvote/downvote buttons or a "Select Version" dropdown.
- Allow users to write a custom resolution (e.g., "Paris is the capital of France (confirmed via Wikipedia)").
- Auto-Merge Option: Provide a "Resolve Automatically" button that picks the most consistent version based on similarity scores.
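A minimal sketch of the disagreement detection and "Resolve Automatically" steps above, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model (also recommended later in this document):

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")


def pairwise_agreement(claims: dict[str, str]) -> float:
    """Mean cosine similarity across all pairs of model outputs for one claim."""
    emb = {name: model.encode(text, convert_to_tensor=True) for name, text in claims.items()}
    sims = [float(util.cos_sim(emb[a], emb[b])) for a, b in combinations(claims, 2)]
    return sum(sims) / len(sims)


def auto_resolve(claims: dict[str, str]) -> str:
    """Pick the variant that is, on average, most similar to the other models' variants."""
    emb = {name: model.encode(text, convert_to_tensor=True) for name, text in claims.items()}

    def avg_sim(name):
        others = [m for m in claims if m != name]
        return sum(float(util.cos_sim(emb[name], emb[m])) for m in others) / len(others)

    return claims[max(claims, key=avg_sim)]


claims = {
    "gpt-4": "The capital of Australia is Canberra.",
    "claude": "Canberra is Australia's capital city.",
    "llama": "Sydney is the capital of Australia.",
}
if pairwise_agreement(claims) < 0.6:  # flag low agreement (<60% similarity)
    print("Disagreement detected; auto-resolved to:", auto_resolve(claims))
```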
Technical Stack:
- NLP Libraries: `sentence-transformers`, `bert-score`, `bleurt`.
- Search: Use You.com, Serper, or the Bing API for real-time fact-checking.
- Backend: Store diffs and scores in a versioned database.
User Impact Example:
A developer merges code solutions. The app flags a conflict: Model A suggests `async/await`, Model B uses callbacks. The user votes for `async/await`, and the merge updates instantly, with no manual copy-paste.
Bonus Features (Highly Recommended):
- Prompt Library: Let users save and tag prompts for reuse.
- AI-Assisted Summarization: Add a "Smart Condense" button that runs a final LLM pass to generate a concise summary.
- Model Weighting Sliders: Allow users to assign weights (e.g., 70% GPT-4, 30% Claude) for fine-grained control.
To build a best-in-class LLM response merging web app, implement the following integrated feature set, combining the strengths of all three sources:
Phase 1: Core Merging with Transparency (Weeks 1–4)
- Implement Delta-View UI (Source 2): Three-panel layout with model outputs and merged result.
- Add color-coded attribution (Sources 1 & 2): Highlight segments by model.
- Enable hover tooltips showing original snippets and metadata (Source 1).
- Integrate drag-and-drop reordering using React-Beautiful-DND (Source 0).
Phase 2: Quality & Conflict Resolution (Weeks 5–8)
- Build Conflict Resolution Workbench (Source 1):
- Flag disagreements using sentence embeddings and cosine similarity.
- Show split-view panel with upvote/select options.
- Add Confidence Scoring Engine (Source 2):
- Compute agreement, factual consistency (via Bing API), and fluency.
- Color-code text (green/orange/red).
- Include "Resolve Automatically" button using similarity metrics.
Phase 3: Workflow & Collaboration (Weeks 9–12)
- Implement Task-Specific Merge Presets (Source 1):
- Start with "Code Debugging", "Creative Brainstorming", "Factual Synthesis".
- Allow user customization.
- Add Versioned Session Memory (Sources 0 & 2):
- Store sessions as JSON-LD diff chains.
- Use diff-match-patch for efficient storage (see the sketch after this list).
- Build History Sidebar with diff previews and restore.
- Enable Export Options (Sources 0 & 2):
- Markdown, HTML, PDF.
- Include footnotes/bibliography.
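A minimal sketch of storing session versions as diffs with the Python `diff-match-patch` package; the session text is hypothetical and the JSON-LD wrapping is omitted:

```python
from diff_match_patch import diff_match_patch

dmp = diff_match_patch()

v1 = "Merged draft: Paris is the capital of France."
v2 = "Merged draft: Paris is the capital of France. Population: about 2.1 million."

# Store only the patch between versions instead of the full text.
patch_text = dmp.patch_toText(dmp.patch_make(v1, v2))

# Later: rebuild v2 from v1 plus the stored patch.
restored, results = dmp.patch_apply(dmp.patch_fromText(patch_text), v1)
assert restored == v2 and all(results)
```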
Phase 4: Advanced & Enterprise (Optional)
- Add REST API (Source 2) for CI/CD or automation.
- Integrate real-time collaboration (Source 0) with Liveblocks.
- Build Prompt Library (Source 0) with tagging and search.
- Add AI-Assisted Summarization (Source 0) for executive summaries.
- For Confidence Scoring: Use `all-MiniLM-L6-v2` for embeddings, `bert-score` for similarity.
- For Fact-Checking: Use the Serper API or You.com for low-cost web search.
- For UI: Use Tailwind CSS + React for rapid development.
- For Backend: Use FastAPI or Express.js with PostgreSQL.
- For Real-Time: Use WebSockets or Liveblocks for collaboration.
The core differentiator of a high-quality merging app is not the number of models, but the intelligence of the fusion logic. All three sources agree that naive concatenation or majority voting is insufficient.
| Technique | Implementation | Tools & Libraries | Impact |
|---|---|---|---|
| Confidence-Weighted Fusion | Use logprobs or entropy to weight model outputs. Normalize and calibrate per-model confidence using temperature scaling. | `openai.ChatCompletion(..., logprobs=5)`, Hugging Face `transformers`, `torch.nn.functional` | ↑ Accuracy by 20–40%; reduces hallucinations |
| Merge-Then-Refine | Concatenate raw outputs → feed into a "refiner" LLM (e.g., GPT-4 or Mistral-7B) to produce a polished, coherent answer. | LiteLLM, vLLM, `transformers` pipeline | Produces human-readable, logically consistent outputs |
| Token-Level Probability Fusion | Combine probability distributions across models using weighted averaging or Bayesian methods. | Custom Python logic, `numpy`, `scipy` | Optimal for models with shared tokenizers (e.g., Llama family) |
| Chain-of-Thought Consensus | Extract final answers from reasoning chains → apply voting on the extracted conclusions. | Regex or LLM prompt: "Extract the final answer from the reasoning above." | ↑ Accuracy on math, coding, and policy tasks |
| Mixture-of-Experts (MoE) Gating | Use a lightweight neural network (e.g., a 2-layer MLP) to dynamically select which model to trust per token or segment. | `sentence-transformers`, `torch`, `sklearn` | Ideal for heterogeneous model ensembles (GPT-4, Claude, Llama) |
| Retrieval-Augmented Fusion (RAF) | Retrieve top-k documents (Pinecone, Weaviate) → down-weight answers that contradict evidence. | Pinecone, FAISS, `langchain`, `chromadb` | Reduces hallucination in knowledge-intensive domains |
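A minimal sketch of the confidence-weighted fusion row above, applied at the answer level: each model's mean token log-probability is temperature-calibrated and turned into a normalized weight. This is a simplification and assumes you have already collected per-token logprobs from each provider:

```python
import math


def fuse_by_confidence(candidates: list[dict]) -> dict:
    """Weight candidate answers by calibrated mean token logprob and return the best one.

    Each candidate: {"model": str, "text": str, "logprobs": list[float], "temperature": float}
    """
    for c in candidates:
        calibrated = [lp / c["temperature"] for lp in c["logprobs"]]   # temperature calibration
        c["confidence"] = math.exp(sum(calibrated) / len(calibrated))  # mean token probability
    total = sum(c["confidence"] for c in candidates)
    for c in candidates:
        c["weight"] = c["confidence"] / total                          # normalized fusion weight
    return max(candidates, key=lambda c: c["weight"])


candidates = [
    {"model": "gpt-4", "text": "Answer A", "logprobs": [-0.1, -0.3, -0.2], "temperature": 0.78},
    {"model": "claude", "text": "Answer B", "logprobs": [-0.9, -1.2, -0.7], "temperature": 1.02},
]
best = fuse_by_confidence(candidates)
print(best["model"], round(best["weight"], 3))
```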
💡 Pro Tip: Apply temperature calibration per model:

```python
def calibrate_logits(logits, temperature):
    return logits / temperature
```

Fit `temperature` on a validation set (e.g., `temp_gpt4=0.78`, `temp_claude=1.02`) to improve confidence reliability.
Users distrust black-box systems. All three sources emphasize explainability, customization, and interactive feedback.
- Interactive Diff Viewer: Show side-by-side comparisons of raw model outputs vs. the merged result. Highlight consensus (green), contradictions (red), and excluded content. Tech: `react-diff-viewer`, `monaco-editor`, D3.js for visualizations.
- Source Attribution & Confidence Display: Always show: "Consensus formed from 3 sources. Primary source: GPT-4 (confidence: 92%). Claim 'X' from Claude-2 excluded due to low confidence."
- Weight Sliders & Presets: Allow users to adjust model influence via UI sliders. Offer domain-specific presets:

  ```json
  {
    "medical": {"priority": ["GPT-4", "Med-PaLM"], "confidence_threshold": 0.8},
    "coding": {"require_tests": true, "disregard": ["Claude-2"]}
  }
  ```

- "Why This Answer?" Tooltip: Show token-level confidence heatmaps, model contributions, and reasoning snippets.
- One-Click Export with Audit Trail: Export to PDF/Markdown with version history, source citations, and bias/confidence metadata.
Latency and cost are critical. All sources agree: streaming, caching, and parallelization are non-negotiable.
| Bottleneck | Solution | Expected Impact |
|---|---|---|
| API Cost | Implement cost-aware routing: use cheaper models (e.g., `gpt-4o-mini`) first; escalate only if confidence is low. | ↓ Costs by 30–70% |
| Throughput | Use continuous batching (vLLM, TensorRT-LLM) to process multiple prompts in parallel on GPU. | 3–6× throughput increase |
| Cold Starts | Keep models warm with provisioned concurrency (AWS Lambda) or persistent workers. | Eliminates 2–5 s delays |
| Redundant Queries | Request de-duplication: `SHA256(prompt + params)` → serve the cached response if < 5 min old. | 25–40% traffic reduction |
| Speculative Decoding | Use a small draft model (e.g., Mistral-7B) to generate tokens → verify with GPT-4. | 1.8–2.3× speed-up, no quality loss |
| KV-Cache Reuse | Cache common prompt prefixes (e.g., system prompts) → compute only deltas. | 30–60% TTFT reduction |
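A minimal sketch of the request de-duplication row above, using `redis-py` with a SHA-256 cache key; the key layout and TTL are illustrative:

```python
import hashlib
import json

import redis

r = redis.Redis()


def cache_key(prompt: str, params: dict) -> str:
    # Canonical JSON so identical requests always hash to the same key.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_call(prompt: str, params: dict, call_models) -> str:
    key = cache_key(prompt, params)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()                 # serve the cached response if still fresh
    response = call_models(prompt, params)  # expensive multi-model call
    r.setex(key, 300, response)             # keep for 5 minutes
    return response
```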
✅ Must-Do: Add a cost and latency estimator before execution: "This query will cost ~$0.02 and take 1.2 s."
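A minimal sketch of such an estimator; the per-million-token prices below are placeholders and must be replaced with your providers' current rates:

```python
# Placeholder prices (USD per 1M input/output tokens); replace with real, current rates.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gemini/gemini-2.5-flash": {"input": 0.30, "output": 2.50},
}


def estimate_cost(models: list[str], input_tokens: int, expected_output_tokens: int) -> float:
    total = 0.0
    for m in models:
        p = PRICES[m]
        total += (input_tokens * p["input"] + expected_output_tokens * p["output"]) / 1e6
    return total


print(f"~${estimate_cost(['gpt-4o-mini'], 1_200, 500):.4f}")
```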
All sources converge on a modular, scalable, and observable architecture.
| Layer | Technology | Why |
|---|---|---|
| Frontend | Next.js 14 (React Server Components), Tailwind, HeadlessUI | Fast first-byte, modern UI |
| Backend | FastAPI (async), Celery + Redis | Decouples request handling from long-running tasks |
| Model Workers | Dockerized: OpenAI API, Anthropic, vLLM (local) | Portable, scalable, API-agnostic |
| Caching | Redis (in-memory), with TTL (15–30 min) | Reduces redundant API calls |
| Observability | Prometheus + Grafana, structured JSON logs, Loki/Elastic | Real-time monitoring, debugging |
| CI/CD | GitHub Actions, ArgoCD, Docker multi-arch builds | Automated testing and deployment |
| Secrets | AWS/GCP Secrets Manager, HashiCorp Vault | Secure API key management |
Enterprise readiness requires privacy, auditability, and resilience.
- PII Redaction: Run spaCy or regex NER on raw responses → replace detected PII with `{{redacted}}` (see the sketch after this list).
- Rate Limiting & Quotas: Use FastAPI limiter middleware → per-user or per-IP token bucket.
- Circuit-Breaker Pattern: If p99 latency > 2 s or the error rate > 1%, auto-failover to a backup model or region.
- Audit Trail: Persist a `merge_log` table: `request_id`, `user_id`, `prompt_hash`, `model_weights`, `merged_output`, `timestamp`.
- Legal Compliance: Display a model-source disclaimer: "Answer generated by a blend of GPT-4, Claude, and Llama-2."
- Disaster Recovery: Back up Redis snapshots every 6 h; maintain a hot-standby worker in another region.
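A minimal sketch of the PII redaction step, assuming spaCy's small English model is installed (`python -m spacy download en_core_web_sm`); the label set is illustrative and should be adjusted to your compliance needs:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
PII_LABELS = {"PERSON", "GPE", "ORG", "DATE"}  # adjust to your compliance requirements


def redact(text: str) -> str:
    doc = nlp(text)
    redacted = text
    # Replace entities right-to-left so earlier character offsets stay valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ in PII_LABELS:
            redacted = redacted[:ent.start_char] + "{{redacted}}" + redacted[ent.end_char:]
    return redacted


print(redact("Contact Jane Doe in Berlin before March 3rd."))
```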
💡 Critical Insight: Users prefer slightly slower but explainable responses. Transparency > raw speed.
| Phase | Key Actions | Timeframe |
|---|---|---|
| 1 (Week 1–2) | Add confidence-weighted fusion, source attribution, SSE streaming, Redis caching | 2 weeks |
| 2 (Week 3–4) | Implement merge-then-refine pipeline, cost estimator, circuit-breaker | 2 weeks |
| 3 (Week 5–6) | Build interactive debugger, weight sliders, feedback UI | 2 weeks |
| 4 (Ongoing) | Launch feedback loop → fine-tune merger model, add domain-specific tuning | Continuous |
Stop building an "LLM aggregator" – build a "Truth Synthesizer." Your unique value: "I know which LLM is lying, and why." Start with confidence-weighted merging + transparency tools → outperform 90% of competitors. Then layer on domain expertise and feedback loops to create a defensible, outcome-focused product.
- Integrate confidence-weighted fusion using logprobs and temperature scaling (Source 2 code).
- Add Redis caching for prompt → merged answer (TTL=15 min) (All sources).
- Add continuous batching using vLLM (Source 1).
- Implement merge-then-refine with GPT-4 or Mistral-7B (Source 1).
- Build interactive diff viewer with color-coded conflicts (Source 0).
- Add weight sliders and presets (Source 2).
- Deploy Prometheus + Grafana for observability (All sources).
- Launch feedback system: paragraph-level 👍/👎 (Source 1).
- Track implicit feedback (hover time, corrections) (Source 0).
- Start fine-tuning a small merger model on user-approved outputs (Source 1).
- Add retrieval-augmented fusion with Pinecone (Source 0).
- Implement PII redaction with spaCy (Source 2).
- Add circuit-breaker for API failures (All sources).
- Enable on-prem deployment via Docker (Source 0).
- Launch CI/CD pipeline with automated testing (Source 2).
- For Merging: Mistral-7B (fast, cheap, fine-tunable).
- For Refinement: GPT-4 or Claude-3 (high quality).
- For Drafting: gpt-4o-mini or Llama-3-8B (speculative decoding).
- For Routing: gpt-4o-mini as meta-LLM (Source 2).
- Transparency > Speed: Users prefer explainable answers.
- Cache Everything: Prompt, raw responses, merged output.
- Measure User Satisfaction: Not just BLEU or latency.
- Start Simple: Weighted voting → merge-then-refine → MoE.
- Secure by Design: Secrets vault, PII redaction, audit logs.