A fun and engaging multimodal vision agent powered by Joey Tribbiani's persona! This agent brings charm, humor, and technical expertise to web development by analyzing Figma designs and guiding users through full-stack frontend implementation step-by-step.
- Vision-Powered Design Analysis: Analyzes Figma designs and interprets visual layouts using Google's Gemini Vision AI
- Interactive Voice Communication: Speaks with engaging personality via ElevenLabs text-to-speech
- Speech Recognition: Understands user input via Deepgram speech-to-text
- Real-time Streaming Voice Calls: Powered by GetStream for seamless multimodal interaction
- Built-in MCP Tooling: Local Filesystem + Fetch MCP servers for file operations and URL/doc retrieval
- Knowledge-Aware Responses: Supports Gemini File Search over local markdown docs in `knowledge/`
- Optional GitHub MCP Integration: Automatically enabled when `GITHUB_PAT` is set
- Custom LLM Helper Functions: Includes timestamping, React boilerplate generation, package suggestions, and safe workspace command execution
- Step-by-Step Guidance: Breaks down web development projects into manageable, progressively-built features
- Joey's Personality: Enthusiastic, charming communication style that keeps users engaged and motivated
- Vision Language Model: Google Gemini Vision AI
- Speech-to-Text: Deepgram
- Text-to-Speech: ElevenLabs
- Real-time Communication: GetStream
- MCP: `@modelcontextprotocol/server-filesystem`, `@modelcontextprotocol/server-fetch`
- ML/Transformers: HuggingFace Transformers, NVIDIA/Ultralytics support
- Core: Python 3.12+, Vision Agents Framework
- Python 3.12 or higher
- Node.js 18+ (required for MCP servers launched through `npx`)
- API keys for:
- Google Gemini Vision
- ElevenLabs
- Deepgram
- Stream / GetStream
Optional:
- GitHub Personal Access Token (`GITHUB_PAT`) to enable remote GitHub MCP
- Clone the repository:
git clone <repository-url>
cd vision-agent-hack
- Create a Python virtual environment:
python -m venv venv
# Windows (PowerShell)
venv\Scripts\Activate.ps1
# Windows (cmd)
venv\Scripts\activate.bat
# macOS/Linux
source venv/bin/activate
- Install dependencies:
pip install -e .
Or install directly:
pip install -r requirements.txt
# 1) copy env template
cp .env.example .env
# 2) fill in API keys inside .env
# 3) run agent
python full-stack-joey.py
Create a `.env` file in the project root with your API keys:
GOOGLE_API_KEY=your_gemini_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GITHUB_PAT=your_github_pat_optional
Note: If your local setup expects `GETSTREAM_API_KEY`/`GETSTREAM_API_SECRET`, keep both pairs in `.env`.
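The variables above can be checked before startup so that a missing key fails early with a readable message. The following is a minimal sketch, not Joey's actual startup code: `missing_credentials` and `github_mcp_enabled` are hypothetical helper names, and the assumption that either the `STREAM_*` or the `GETSTREAM_*` pair is sufficient comes only from the note above.

```python
from typing import Mapping

# Variable names taken from the .env listing in this README.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "ELEVENLABS_API_KEY", "DEEPGRAM_API_KEY")

# Assumption: one complete Stream pair, under either prefix, is enough.
STREAM_PAIRS = (
    ("STREAM_API_KEY", "STREAM_API_SECRET"),
    ("GETSTREAM_API_KEY", "GETSTREAM_API_SECRET"),
)


def missing_credentials(env: Mapping[str, str]) -> list[str]:
    """Hypothetical helper: return the variable names that still need values."""
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    has_stream_pair = any(env.get(key) and env.get(secret) for key, secret in STREAM_PAIRS)
    if not has_stream_pair:
        # Report the primary pair so the message matches the .env template.
        missing.extend(STREAM_PAIRS[0])
    return missing


def github_mcp_enabled(env: Mapping[str, str]) -> bool:
    """Hypothetical helper: GitHub MCP is optional and keyed on GITHUB_PAT."""
    return bool(env.get("GITHUB_PAT"))
```

In practice this could be called as `missing_credentials(os.environ)` after `load_dotenv()` from `python-dotenv` has read the `.env` file.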
- On Windows, Joey uses `C:/joey-workspace`
- On macOS/Linux, Joey uses `/tmp/joey-workspace`
- MCP launcher scripts (`_mcp_*.py`) and generated component boilerplates are written there
Joey supports a local knowledge base backed by Gemini File Search.
- Add markdown files under `knowledge/` (for example: `knowledge/nextjs.md`, `knowledge/shadcn.md`)
- The loader scans `*.md` files from that folder
- `create_rag_from_directory()` builds a Gemini file-search store from `knowledge/`
- The store is attached to Gemini VLM via `gemini.tools.FileSearch(...)`
- During conversations, Joey can retrieve relevant snippets from indexed docs and ground responses in project knowledge
- If `knowledge/` is missing, startup logs a warning and no local knowledge is indexed
- To enable retrieval, ensure RAG initialization runs before agent creation in your startup flow
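The scan-and-warn behaviour described above might look roughly like the sketch below. It covers only the local file discovery step and deliberately does not call `create_rag_from_directory()`, whose signature is not shown in this README. `find_knowledge_docs` and the logger name are hypothetical, and the non-recursive `*.md` match is an assumption.

```python
import logging
from pathlib import Path

logger = logging.getLogger("joey.knowledge")  # hypothetical logger name


def find_knowledge_docs(root: Path) -> list[Path]:
    """Hypothetical helper: list markdown docs directly under the knowledge folder."""
    if not root.is_dir():
        # Mirrors the documented behaviour: warn and index nothing.
        logger.warning("Knowledge directory %s not found; no local docs will be indexed", root)
        return []
    # Assumption: only top-level *.md files are scanned, not subfolders.
    return sorted(root.glob("*.md"))
```

An empty result is a useful signal to skip RAG initialization entirely rather than build an empty store.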
Joey specializes in:
- React.js - Functional components, hooks, state management, performance optimization
- Next.js - Full-stack development, SSR, SSG, API routes
- TypeScript - Type-safe, maintainable code with strict checking
- Tailwind CSS - Utility-first responsive design and styling
- Analyzes Figma designs with enthusiasm
- Breaks projects into logical, manageable steps
- Builds incrementally (HTML → styling → interactivity)
- Follows best practices: semantic HTML, responsive design, accessibility
- Maintains Joey's engaging personality throughout!
- "How you doin'?"
- "Could I BE any more excited about this code?"
- "Oh my God!"
- "That's so...not good!"
- "Yeah, baby!"
Run the agent:
python full-stack-joey.py
The agent will:
- Initialize a real-time multimodal connection
- Wait for users to join the call
- Analyze uploaded Figma designs
- Guide through frontend implementation with Joey's personality
- Build features step-by-step with explanations
.
├── full-stack-joey.py # Main agent implementation
├── knowledge/ # Local knowledge base markdown docs for Gemini File Search
│ ├── nextjs.md
│ └── shadcn.md
├── main.py # Entry point
├── pyproject.toml # Project configuration and dependencies
├── requirements.txt # Direct dependencies
├── README.md # This file
└── .env # Configuration (not in repo)
- Knowledge Indexing (Optional): Builds Gemini file-search store from `knowledge/*.md`
- Agent Creation: Initializes `Full-Stack Joey` with Gemini VLM (`max_output_tokens=3000`) and FileSearch tool, plus ElevenLabs TTS and Deepgram STT
- MCP Server Bootstrapping: Generates Python launcher scripts and starts local Filesystem + Fetch MCP servers
- Remote MCP (Optional): Adds GitHub MCP automatically when `GITHUB_PAT` is present
- Function Registration: Registers custom helpers (`get_timestamp`, `generate_component_boilerplate`, `suggest_packages`, `run_workspace_command`)
- Call Lifecycle: Joins a GetStream call and handles participant-join events before finishing the session
Joey registers domain-specific functions directly on the VLM:
- `get_timestamp()` - returns the current datetime string
- `generate_component_boilerplate(component_name, props_json, use_typescript)` - writes a React/Next component file to the workspace
- `suggest_packages(use_case)` - recommends npm packages by use-case category
- `run_workspace_command(command)` - runs bounded shell commands in the Joey workspace with a timeout and captured output
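To illustrate what "bounded" can mean for `run_workspace_command`, here is a minimal sketch using only the standard library. It is not the project's implementation: the extra `workspace` and `timeout_s` parameters, the 30-second default, and the returned message formats are all assumptions made for this example.

```python
import shlex
import subprocess
from pathlib import Path


def run_workspace_command(command: str, workspace: Path, timeout_s: float = 30.0) -> str:
    """Sketch: run a command inside the workspace and return its captured output."""
    # Split into argv so the command runs without a shell (POSIX-style quoting).
    args = shlex.split(command)
    if not args:
        return "No command given"
    try:
        result = subprocess.run(
            args,
            cwd=workspace,        # confine the working directory to the workspace
            capture_output=True,  # collect stdout and stderr instead of printing
            text=True,
            timeout=timeout_s,    # the child is killed if it runs too long
        )
    except subprocess.TimeoutExpired:
        return f"Command timed out after {timeout_s} seconds"
    except FileNotFoundError:
        return f"Command not found: {args[0]}"
    output = (result.stdout + result.stderr).strip()
    return f"exit code {result.returncode}\n{output}"
```

Returning a string in every case, including timeouts, keeps the result easy for the model to read back; avoiding `shell=True` is a design choice here, and Windows paths with backslashes would need different splitting.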
- `npx` not found: install Node.js 18+ and reopen the terminal.
- MCP server startup fails: ensure `C:/joey-workspace` is writable on Windows.
- Missing credentials errors: verify keys in `.env` and restart the process.
- Voice/call connection issues: confirm Stream/GetStream API key + secret are valid for your app.
- Knowledge search returns nothing: verify markdown docs exist under `knowledge/` and RAG initialization is executed before agent startup.
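Several of the checks above can be automated before launching the agent. The sketch below is a hypothetical preflight script, not something shipped with the project; `preflight_problems` and its message wording are assumptions.

```python
import os
import shutil
from pathlib import Path


def preflight_problems(workspace: Path, knowledge: Path) -> list[str]:
    """Hypothetical helper: collect human-readable setup problems."""
    problems: list[str] = []
    # MCP servers are launched through npx, so it must be on PATH.
    if shutil.which("npx") is None:
        problems.append("npx not found: install Node.js 18+ and reopen the terminal")
    # Launcher scripts and boilerplates are written to the workspace.
    if not (workspace.is_dir() and os.access(workspace, os.W_OK)):
        problems.append(f"workspace problem: {workspace} is missing or not writable")
    # Knowledge search needs at least one markdown file to index.
    if not (knowledge.is_dir() and any(knowledge.glob("*.md"))):
        problems.append(f"knowledge problem: no markdown docs under {knowledge}")
    return problems
```

An empty list means none of these three issues was detected; credential and Stream connection problems still need to be checked separately.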
- `python-dotenv>=1.2.1` - Environment variable management
- `transformers>=4.57.6` - ML transformers for enhanced capabilities
- `vision-agents[deepgram,elevenlabs,gemini,getstream,huggingface,nvidia,openai,ultralytics]>=0.3.8` - Core vision agent framework with all plugins
- Caleb Chandrasekar (@calebjubal)
- S.Tharundhatri (@Tharun-10Dragneel)
- Rishav (@Rishav23av)
- Kushagra Chandok (@mengyokyu)
Made with ❤️ by bringing Joey Tribbiani to the web development world!



