CodeAgent-MCP

Multi-Agent Code Generation System with MCP Protocol

Three specialized agents collaborate through a feedback loop: the Planner decomposes the task, the Coder generates code with tool access, and the Reviewer scores the result and requests fixes until the score reaches 7 out of 10.

Live Demo | Evaluation Results | Project Series

Architecture

                        ┌───────────────────────────────────────┐
                        │            Orchestrator               │
                        │                                       │
  User Requirement ───► │  Planner ──► Coder ◄──── Reviewer     │
                        │              │    ▲        │          │
                        │              │    │ fix    │          │
                        │              │    └────────┘          │
                        │              │   score < 7 → retry    │
                        │              │   score ≥ 7 → done     │
                        └──────────────┼────────────────────────┘
                                       │ MCP Protocol (stdio)
                        ┌──────────────┼────────────────────────┐
                        │              ▼                        │
                        │  File Server  Shell Server  Git Server│
                        │  RAG Server (optional)                │
                        └───────────────────────────────────────┘
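The feedback loop in the diagram can be sketched as follows. This is a minimal illustration, not the project's orchestrator: the function names, the `Review` type, and the retry cap are assumptions, and the three agents are replaced by stubs so the loop runs without an LLM. Only the pass threshold of 7 comes from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

PASS_SCORE = 7    # from the diagram: score >= 7 -> done, score < 7 -> retry
MAX_RETRIES = 3   # assumption: the README does not state a retry limit


@dataclass
class Review:
    score: float    # reviewer score on a 0-10 scale
    feedback: str   # fix request passed back to the Coder


def run_pipeline(
    requirement: str,
    plan: Callable[[str], List[str]],
    code: Callable[[List[str], Optional[str]], str],
    review: Callable[[str, str], Review],
    max_retries: int = MAX_RETRIES,
) -> Tuple[str, Review]:
    """Plan once, then alternate Coder and Reviewer until the score passes."""
    subtasks = plan(requirement)
    draft = code(subtasks, None)           # first attempt, no feedback yet
    verdict = review(requirement, draft)
    retries = 0
    while verdict.score < PASS_SCORE and retries < max_retries:
        draft = code(subtasks, verdict.feedback)   # fix request goes to the Coder
        verdict = review(requirement, draft)
        retries += 1
    return draft, verdict


# Stub agents standing in for LLM calls, so the loop can run offline.
def stub_plan(requirement: str) -> List[str]:
    return [f"implement: {requirement}"]


def stub_code(subtasks: List[str], feedback: Optional[str]) -> str:
    return "v2" if feedback else "v1"


def stub_review(requirement: str, draft: str) -> Review:
    return Review(8.0, "ok") if draft == "v2" else Review(5.0, "add tests")


final_draft, final_review = run_pipeline("LRU cache", stub_plan, stub_code, stub_review)
print(final_draft, final_review.score)
```

With these stubs the first draft scores 5, the Coder is called once more with the feedback, and the second draft passes. Without a retry cap, a draft that never reaches the threshold would loop forever, which is why the sketch adds one.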

Highlights

Area	Feature	Detail
Multi-Agent	Planner + Coder + Reviewer feedback loop	Custom ~300-line orchestrator, no LangChain dependency
MCP Tools	4 tool servers via standard MCP protocol	File ops, sandboxed shell, Git, RAG retrieval
HumanEval	90.2% pass@1 (148/164)	Verified on the standard OpenAI benchmark
Benchmark	100% completion, avg score 8.9/10	8 tasks (easy/medium/hard), automated evaluation
Optimization	Prompt engineering for cost reduction	Planner: -85% tokens, Coder tools: -32% tokens
Security	Shell whitelist + path sandboxing	Only safe commands allowed, directory traversal blocked
Live Demo	Gradio app on HuggingFace Spaces	Enter a requirement, watch agents collaborate in real-time
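The Security row combines two checks: a command whitelist and a path sandbox. The sketch below shows one common way to implement each. The function names are hypothetical and the project's actual checks may differ; the allowed commands are the ones listed under Design Decisions.

```python
import shlex
from pathlib import Path

ALLOWED_COMMANDS = {"python", "pytest", "ruff"}  # per the Design Decisions table


def is_command_allowed(command_line: str) -> bool:
    """Allow a command only if its executable is on the whitelist."""
    try:
        tokens = shlex.split(command_line)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS


def resolve_in_sandbox(root: str, relative_path: str) -> Path:
    """Resolve a path and reject it if it escapes the sandbox root."""
    base = Path(root).resolve()
    target = (base / relative_path).resolve()  # collapses ".." and symlinks
    if target != base and base not in target.parents:
        raise PermissionError(f"path escapes sandbox: {relative_path}")
    return target


print(is_command_allowed("pytest -q tests/"))
print(is_command_allowed("rm -rf /"))
```

The whitelist only inspects the executable name, so it is meaningful only if the command is then run as an argument list without a shell (for example `subprocess.run(tokens, shell=False)`); otherwise `python x.py; rm -rf /` would pass the check and still reach a shell. Resolving before comparing is what blocks `../` traversal and absolute paths.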

Quick Start

# Install
pip install openai mcp pydantic pyyaml rich

# Configure (the variable is named OPENAI_API_KEY, but the value is a DeepSeek API key)
export OPENAI_API_KEY="your-deepseek-api-key"

# Run (multi-agent, no MCP tools)
python -m src.main --no-mcp "Implement an LRU Cache with O(1) get/put"

# Run (multi-agent + MCP tools)
python -m src.main "Implement an LRU Cache with O(1) get/put"

# Run evaluation
python -m eval.run_eval --tasks B1

Evaluation Results

HumanEval Benchmark

Metric	Value
pass@1	90.2% avg (148/164)
Stability	89.0% - 91.5% across 3 runs
Model	DeepSeek-chat
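As a quick check on the numbers above, pass@1 here is the share of the 164 HumanEval problems solved on the first attempt. The per-run counts of 146 and 150 are not stated in this README; they are inferred as the only counts that round to the reported 89.0% and 91.5%.

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of problems whose first generated solution passes all tests."""
    return 100.0 * passed / total


TOTAL_PROBLEMS = 164  # HumanEval problem count, as in the table

print(f"{pass_rate(148, TOTAL_PROBLEMS):.1f}%")  # 90.2%

# Inferred (not reported) pass counts for the worst and best of the 3 runs.
for inferred_passes in (146, 150):
    print(f"{pass_rate(inferred_passes, TOTAL_PROBLEMS):.1f}%")  # 89.0%, then 91.5%
```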

Custom Benchmark (8 tasks)

Configuration	Completion	Avg Score	Tokens
single_agent	3/3	N/A	2.5K
multi_agent (no-mcp)	8/8	8.89	136K
multi_agent (mcp)	8/8	8.75	2,905K

Difficulty	Tasks	Avg Score
Easy	2	8.85
Medium	3	9.07
Hard	3	8.73

Optimization Results

Optimization	Before	After	Improvement
Planner prompt tuning	6 subtasks, 400K tokens	2 subtasks, 57K tokens	-85% tokens, +0.4 score
Coder tool-call rules	1,165K tokens	787K tokens	-32% tokens, quality maintained
B7/B8 workspace fix	Empty workspace	config_manager.py + 3 async files	All 8/8 tasks now produce workspace files

Key Findings

Single agent cuts corners: without a Reviewer, task B7 produced only 227 tokens of output
MCP mode produces real engineering artifacts: source files + unit tests + integration demos
Prompt engineering has the highest ROI: One sentence change cut Planner tokens by 85%
Orchestrator state management is critical: Retry paths must re-inject workspace context
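The last finding, that retry paths must re-inject workspace context, can be illustrated with a small prompt builder. This is a sketch under assumptions: the function names and prompt wording are hypothetical, and the real orchestrator may pass file contents rather than just names.

```python
from pathlib import Path
from typing import List


def list_workspace(workspace: Path) -> List[str]:
    """Return workspace files as sorted, POSIX-style relative paths."""
    return sorted(
        p.relative_to(workspace).as_posix()
        for p in workspace.rglob("*")
        if p.is_file()
    )


def build_retry_prompt(feedback: str, workspace_files: List[str]) -> str:
    """Combine reviewer feedback with the current workspace listing."""
    listing = "\n".join(f"- {name}" for name in workspace_files) or "- (empty)"
    return (
        "The reviewer requested changes:\n"
        f"{feedback}\n\n"
        "Files currently in the workspace:\n"
        f"{listing}\n\n"
        "Update the existing files instead of starting over."
    )


print(build_retry_prompt("Add unit tests for reload()", ["config_manager.py"]))
```

If the retry prompt carried only the feedback, the Coder would have no record of what it already wrote, which is one plausible way to end up with the empty-workspace symptom listed under Optimization Results.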

Project Structure

CodeAgent-MCP/
├── src/
│   ├── core/           # LLM client, agent base (ReAct loop), orchestrator, config
│   ├── agents/         # Planner, Coder, Reviewer
│   ├── mcp/            # MCP client manager + 4 tool servers
│   └── main.py         # CLI entry point
├── eval/               # Benchmark tasks, evaluation runner, HumanEval adapter
├── config/             # LLM providers, agent prompts, MCP server config
├── deploy/             # HuggingFace Spaces deployment scripts
├── app.py              # Gradio demo
├── notebooks/          # Colab experiment notebooks + reports
└── tests/              # Unit tests

Design Decisions

Decision	Why
Custom framework, no LangChain	~300 lines, fully explainable in interviews
Only Coder gets MCP tools	Planner/Reviewer are pure LLM calls, reducing complexity
4-layer JSON fallback parser	LLM output format is unpredictable, so parsing must tolerate malformed responses
Shell command whitelist	Only python/pytest/ruff allowed, blocks rm/sudo/etc.
Explicit env var merging for MCP	Subprocess chains don't reliably inherit parent env
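The README does not say what the four parser layers are, so the sketch below shows one plausible layering: strict parse, fenced-block extraction, outermost-braces extraction, and a naive trailing-comma repair. All names are hypothetical and this is not the project's parser.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # markdown code fence, built this way to avoid a literal fence here
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _try_load(text: str) -> Optional[Any]:
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def _outer_braces(text: str) -> Optional[str]:
    """Return the substring from the first '{' to the last '}', if any."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else None


def parse_llm_json(raw: str) -> Optional[Any]:
    """Try progressively more forgiving strategies; return None if all fail."""
    # Layer 1: the whole response is valid JSON.
    parsed = _try_load(raw.strip())
    if parsed is not None:
        return parsed
    # Layer 2: JSON wrapped in a markdown code fence.
    match = FENCE_RE.search(raw)
    if match:
        parsed = _try_load(match.group(1).strip())
        if parsed is not None:
            return parsed
    span = _outer_braces(raw)
    if span is None:
        return None
    # Layer 3: JSON object surrounded by prose.
    parsed = _try_load(span)
    if parsed is not None:
        return parsed
    # Layer 4: naive repair, drop trailing commas before } or ].
    return _try_load(re.sub(r",\s*([}\]])", r"\1", span))


print(parse_llm_json('Sure! {"score": 9} Hope this helps.'))
```

Two limits of this sketch are worth noting: a response that is literally `null` is indistinguishable from a parse failure, and the trailing-comma repair would also alter a string value that happens to contain `,}`.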

Project Series

This project is the capstone of a four-project progression from single-skill research to system integration:

#	Project	Focus	Key Result
1	small-llms-tool-use	Function calling fine-tuning	86-89% exact match (QLoRA)
2	agenttune	Multi-step ReAct reasoning	100% task success rate
3	smallrag	RAG optimization	chunk_size=512 + MMR + top-k=5
4	CodeAgent-MCP (this repo)	Multi-Agent + MCP integration	90.2% HumanEval, 8.9/10 benchmark

License

MIT
