calebjubal/vision-agent-hack

🎬 vision-agent-hack: Joey Tribbiani - Full-Stack Frontend Expert

A fun and engaging multimodal vision agent powered by Joey Tribbiani's persona! This agent brings charm, humor, and technical expertise to web development by analyzing Figma designs and guiding users through full-stack frontend implementation step-by-step.

✨ Features

  • Vision-Powered Design Analysis: Analyzes Figma designs and interprets visual layouts using Google's Gemini Vision AI
  • Interactive Voice Communication: Speaks with engaging personality via ElevenLabs text-to-speech
  • Speech Recognition: Understands user input via Deepgram speech-to-text
  • Real-time Streaming Voice Calls: Powered by GetStream for seamless multimodal interaction
  • Built-in MCP Tooling: Local Filesystem + Fetch MCP servers for file operations and URL/doc retrieval
  • Knowledge-Aware Responses: Supports Gemini File Search over local markdown docs in knowledge/
  • Optional GitHub MCP Integration: Automatically enabled when GITHUB_PAT is set
  • Custom LLM Helper Functions: Includes timestamping, React boilerplate generation, package suggestions, and safe workspace command execution
  • Step-by-Step Guidance: Breaks down web development projects into manageable, progressively-built features
  • Joey's Personality: Enthusiastic, charming communication style that keeps users engaged and motivated

🛠️ Technical Stack

  • Vision Language Model: Google Gemini Vision AI
  • Speech-to-Text: Deepgram
  • Text-to-Speech: ElevenLabs
  • Real-time Communication: GetStream
  • MCP: @modelcontextprotocol/server-filesystem, @modelcontextprotocol/server-fetch
  • ML/Transformers: Hugging Face Transformers, NVIDIA/Ultralytics support
  • Core: Python 3.12+, Vision Agents Framework

📋 Prerequisites

  • Python 3.12 or higher
  • Node.js 18+ (required for MCP servers launched through npx)
  • API keys for:
    • Google Gemini Vision
    • ElevenLabs
    • Deepgram
    • Stream / GetStream

Optional:

  • GitHub Personal Access Token (GITHUB_PAT) to enable remote GitHub MCP

🚀 Installation

  1. Clone the repository:
git clone <repository-url>
cd vision-agent-hack
  2. Create and activate a Python virtual environment:
python -m venv venv
# Windows (PowerShell)
venv\Scripts\Activate.ps1
# Windows (cmd)
venv\Scripts\activate.bat
# macOS/Linux
source venv/bin/activate
  3. Install dependencies:
pip install -e .

Or install from requirements.txt:

pip install -r requirements.txt

⚡ Quickstart

# 1) copy env template
cp .env.example .env

# 2) fill in API keys inside .env

# 3) run agent
python full-stack-joey.py

🔑 Configuration

Create a .env file in the project root with your API keys:

GOOGLE_API_KEY=your_gemini_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
STREAM_API_KEY=your_stream_api_key
STREAM_API_SECRET=your_stream_api_secret
GITHUB_PAT=your_github_pat_optional

Note: If your local setup expects GETSTREAM_API_KEY / GETSTREAM_API_SECRET, keep both pairs in .env.
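
Missing credentials are easier to diagnose before the agent starts than in the middle of a call. The sketch below is an illustration, not code from this repository: check_env, REQUIRED_KEYS, and STREAM_ALTERNATIVES are hypothetical names, and it assumes that either the STREAM_* or the GETSTREAM_* spelling satisfies the Stream requirement.

```python
import os
from typing import Mapping

# Keys from the .env template above. GITHUB_PAT is optional, so it is not checked.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "ELEVENLABS_API_KEY", "DEEPGRAM_API_KEY")

# Assumption: either naming scheme for the Stream credentials is acceptable.
STREAM_ALTERNATIVES = (
    ("STREAM_API_KEY", "GETSTREAM_API_KEY"),
    ("STREAM_API_SECRET", "GETSTREAM_API_SECRET"),
)


def check_env(env: Mapping[str, str]) -> list[str]:
    """Return the names of required settings that are missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    for primary, alternative in STREAM_ALTERNATIVES:
        if not env.get(primary) and not env.get(alternative):
            missing.append(primary)
    return missing


if __name__ == "__main__":
    # If the keys live in .env, load them first (for example with python-dotenv).
    problems = check_env(os.environ)
    if problems:
        print("Missing settings:", ", ".join(problems))
    else:
        print("All required settings are present.")
```

Running a check like this at startup turns a vague downstream failure into a list of the exact variable names to fix.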

Workspace behavior

  • On Windows, Joey uses C:/joey-workspace
  • On macOS/Linux, Joey uses /tmp/joey-workspace
  • MCP launcher scripts (_mcp_*.py) and generated component boilerplates are written there

📚 Knowledge & Gemini Search

Joey supports a local knowledge base backed by Gemini File Search.

Where knowledge lives

  • Add markdown files under knowledge/ (for example: knowledge/nextjs.md, knowledge/shadcn.md)
  • The loader scans *.md files from that folder

How Gemini search works in this project

  1. create_rag_from_directory() builds a Gemini file-search store from knowledge/
  2. The store is attached to Gemini VLM via gemini.tools.FileSearch(...)
  3. During conversations, Joey can retrieve relevant snippets from indexed docs and ground responses in project knowledge

Important note

  • If knowledge/ is missing, startup logs a warning and no local knowledge is indexed
  • To enable retrieval, ensure RAG initialization runs before agent creation in your startup flow
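
To see which files would be picked up before any indexing happens, the scan-and-warn behavior described above can be approximated with the standard library. The function find_knowledge_files and the logger name are hypothetical. The sketch only lists files and does not call create_rag_from_directory() or any Gemini API.

```python
import logging
from pathlib import Path

logger = logging.getLogger("joey.knowledge")


def find_knowledge_files(root: str = "knowledge") -> list[Path]:
    """Return the markdown files directly under root, sorted by name."""
    folder = Path(root)
    if not folder.is_dir():
        # Mirrors the documented behavior: warn and index nothing.
        logger.warning("Knowledge folder %s not found; nothing will be indexed", folder)
        return []
    return sorted(p for p in folder.glob("*.md") if p.is_file())


if __name__ == "__main__":
    for path in find_knowledge_files():
        print(path)
```

An empty result here is a likely explanation when knowledge search returns nothing.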

💬 Joey's Expertise

Joey specializes in:

Frontend Frameworks

  • React.js - Functional components, hooks, state management, performance optimization
  • Next.js - Full-stack development, SSR, SSG, API routes
  • TypeScript - Type-safe, maintainable code with strict checking
  • Tailwind CSS - Utility-first responsive design and styling

Development Approach

  • Analyzes Figma designs with enthusiasm
  • Breaks projects into logical, manageable steps
  • Builds incrementally (HTML → styling → interactivity)
  • Follows best practices: semantic HTML, responsive design, accessibility
  • Maintains Joey's engaging personality throughout!

Key Phrases

  • "How you doin'?"
  • "Could I BE any more excited about this code?"
  • "Oh my God!"
  • "That's so...not good!"
  • "Yeah, baby!"

📞 Usage

Run the agent:

python full-stack-joey.py

The agent will:

  1. Initialize a real-time multimodal connection
  2. Wait for users to join the call
  3. Analyze uploaded Figma designs
  4. Guide through frontend implementation with Joey's personality
  5. Build features step-by-step with explanations

📁 Project Structure

.
├── full-stack-joey.py      # Main agent implementation
├── knowledge/              # Local knowledge base markdown docs for Gemini File Search
│   ├── nextjs.md
│   └── shadcn.md
├── main.py                 # Entry point
├── pyproject.toml          # Project configuration and dependencies
├── requirements.txt        # Direct dependencies
├── README.md               # This file
└── .env                    # Configuration (not in repo)

🎯 How It Works

  1. Knowledge Indexing (Optional): Builds Gemini file-search store from knowledge/*.md
  2. Agent Creation: Initializes Full-Stack Joey with Gemini VLM (max_output_tokens=3000) and FileSearch tool, plus ElevenLabs TTS and Deepgram STT
  3. MCP Server Bootstrapping: Generates Python launcher scripts and starts local Filesystem + Fetch MCP servers
  4. Remote MCP (Optional): Adds GitHub MCP automatically when GITHUB_PAT is present
  5. Function Registration: Registers custom helpers (get_timestamp, generate_component_boilerplate, suggest_packages, run_workspace_command)
  6. Call Lifecycle: Joins a GetStream call and handles participant-join events before finishing the session
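
Steps 3 and 4 amount to "always start the local servers, add GitHub only when a token exists". A minimal sketch of that decision, assuming hypothetical string labels in place of the real MCP server objects (which are not shown here):

```python
import os
from typing import Mapping


def select_mcp_servers(env: Mapping[str, str]) -> list[str]:
    """Return labels for the MCP servers that should be started."""
    servers = ["filesystem", "fetch"]  # local servers are always started
    if env.get("GITHUB_PAT", "").strip():
        servers.append("github")  # remote server only when a token is set
    return servers


if __name__ == "__main__":
    print(select_mcp_servers(os.environ))
```

Treating a blank GITHUB_PAT the same as an unset one avoids enabling the remote server with an empty token left over from the .env template.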

🧠 Custom Helper Functions

Joey registers domain-specific functions directly on the VLM:

  • get_timestamp() — returns current datetime string
  • generate_component_boilerplate(component_name, props_json, use_typescript) — writes a React/Next component file to workspace
  • suggest_packages(use_case) — recommends npm packages by use-case category
  • run_workspace_command(command) — runs bounded shell commands in Joey workspace with timeout and captured output
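
As an illustration of what "bounded" means for run_workspace_command, here is one possible shape built on subprocess.run. It is a sketch, not the repository's implementation: the cwd and timeout parameters and the returned dictionary layout are assumptions, and it does no command allow-listing.

```python
import subprocess
import tempfile


def run_workspace_command(command: str, cwd: str, timeout: float = 30.0) -> dict:
    """Run a shell command inside cwd and return its exit code and captured output."""
    try:
        completed = subprocess.run(
            command,
            shell=True,           # the command arrives as a single string
            cwd=cwd,              # keep execution inside the workspace
            capture_output=True,  # collect stdout and stderr instead of printing
            text=True,
            timeout=timeout,      # bound how long the command may run
        )
    except subprocess.TimeoutExpired:
        return {"returncode": None, "stdout": "", "stderr": f"timed out after {timeout}s"}
    return {
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        print(run_workspace_command("echo hello", cwd=workspace))
```

Returning a plain dictionary keeps the result easy to serialize back to the model. Because shell=True executes whatever string it receives, a real deployment would want additional restrictions on which commands are accepted.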

🧯 Troubleshooting

  • npx not found: install Node.js 18+ and reopen terminal.
  • MCP server startup fails: ensure the workspace directory is writable (C:/joey-workspace on Windows, /tmp/joey-workspace on macOS/Linux).
  • Missing credentials errors: verify keys in .env and restart the process.
  • Voice/call connection issues: confirm Stream/GetStream API key + secret are valid for your app.
  • Knowledge search returns nothing: verify markdown docs exist under knowledge/ and RAG initialization is executed before agent startup.
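
Several of the checks above can be run in one pass before starting the agent. The preflight sketch below uses only the standard library. The helper names npx_available and is_writable are hypothetical, and the workspace paths are taken from the Workspace behavior section.

```python
import os
import shutil
import tempfile


def npx_available() -> bool:
    """Return True when an npx executable is found on PATH."""
    return shutil.which("npx") is not None


def is_writable(directory: str) -> bool:
    """Return True when a temporary file can be created inside directory."""
    if not os.path.isdir(directory):
        return False
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    workspace = "C:/joey-workspace" if os.name == "nt" else "/tmp/joey-workspace"
    print("npx on PATH:", npx_available())
    print(f"{workspace} writable:", is_writable(workspace))
    print("knowledge/ present:", os.path.isdir("knowledge"))
```

A False for the workspace line can also mean the directory has not been created yet, which is expected before the first run.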

📦 Dependencies

  • python-dotenv>=1.2.1 - Environment variable management
  • transformers>=4.57.6 - ML transformers for enhanced capabilities
  • vision-agents[deepgram,elevenlabs,gemini,getstream,huggingface,nvidia,openai,ultralytics]>=0.3.8 - Core vision agent framework with all plugins

👥 Meet the Team: status200

  • Caleb Chandrasekar (@calebjubal)
  • S.Tharundhatri (@Tharun-10Dragneel)
  • Rishav (@Rishav23av)
  • Kushagra Chandok (@mengyokyu)

Made with ❤️ by status200, bringing Joey Tribbiani to the web development world!
