"Hey Jarvis" -- and it sees your screen, talks back, and takes control.
Warning
This project is in very early development and is highly unstable. Expect breaking changes, missing features, and rough edges. It is not ready for production use. Contributions and feedback are welcome!
Jarvis is a voice-activated AI assistant for macOS that can see your screen, hear you speak, talk back, and control your computer -- all hands-free. Think Iron Man's Jarvis, but for your Mac.
- Say "Jarvis" -- wake word detection activates listening (powered by Picovoice Porcupine)
- Speak your request -- local speech-to-text converts your voice (OpenAI Whisper)
- Jarvis sees your screen -- smart screen capture grabs the relevant window
- AI thinks & acts -- Claude analyzes the screenshot, plans actions, and executes them
- Jarvis talks back -- natural voice response via ElevenLabs streaming TTS
All of this happens in seconds with a native macOS overlay showing you what Jarvis is doing.
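The five steps above can be read as a single pass through a pipeline. The sketch below is only an illustration of that flow, not the project's actual code: the `Pipeline` class, its field names, and the stand-in lambdas are all hypothetical, and the real stages (Porcupine, Whisper, Claude, ElevenLabs) are replaced with stubs so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical wiring of the five stages; every callable is a stand-in."""

    wait_for_wake_word: Callable[[], None]       # blocks until "Jarvis" is heard
    transcribe: Callable[[], str]                # speech-to-text
    capture_screen: Callable[[], bytes]          # screenshot of the relevant window
    think_and_act: Callable[[str, bytes], str]   # model call plus any actions
    speak: Callable[[str], None]                 # text-to-speech
    log: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        """Run one wake -> listen -> see -> think/act -> speak cycle."""
        self.wait_for_wake_word()
        self.log.append("listening")
        request = self.transcribe()
        screenshot = self.capture_screen()
        self.log.append("thinking")
        reply = self.think_and_act(request, screenshot)
        self.log.append("speaking")
        self.speak(reply)
        return reply


# Stub stages so the sketch is runnable without any API keys or hardware.
spoken: List[str] = []
demo = Pipeline(
    wait_for_wake_word=lambda: None,
    transcribe=lambda: "what's on my screen?",
    capture_screen=lambda: b"fake-png-bytes",
    think_and_act=lambda text, image: f"You asked: {text} ({len(image)} bytes of screenshot)",
    speak=spoken.append,
)
reply = demo.run_once()
print(reply)
```

Passing each stage in as a callable keeps the loop itself independent of any one vendor, which makes it easy to swap a stage for a stub when testing.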
- Voice Activation -- Custom "Jarvis" wake word, no button pressing needed
- Screen Vision -- Captures and understands what's on your screen
- Computer Control -- Mouse clicks, keyboard input, app navigation via macOS Accessibility API
- Natural Voice -- Streams responses with ElevenLabs for low-latency, natural speech
- Native Overlay -- Swift-based transparent overlay with status indicators (Listening / Thinking / Acting)
- Conversation Memory -- Maintains context across your conversation
- ESC Kill Switch -- Instantly stops all actions with a single keypress
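One common way to build a kill switch like the one listed above is a shared stop flag that the action loop checks before every step. The sketch below assumes that design; it is not taken from the Jarvis source, and `stop_requested`, `run_actions`, and `on_escape_pressed` are hypothetical names. The ESC keypress is simulated by calling the handler directly.

```python
import threading
from typing import Callable, List

# Shared flag: set once by the key handler, read by the action loop.
stop_requested = threading.Event()


def run_actions(actions: List[Callable[[], None]]) -> int:
    """Run actions in order, checking the stop flag before each one.

    Returns the number of actions that actually ran.
    """
    completed = 0
    for action in actions:
        if stop_requested.is_set():
            break  # abandon the rest of the plan
        action()
        completed += 1
    return completed


def on_escape_pressed() -> None:
    """Hypothetical handler; a real app would call this from its key listener."""
    stop_requested.set()


performed: List[str] = []


def type_then_escape() -> None:
    """Simulate the user pressing ESC while the second action is running."""
    performed.append("type text")
    on_escape_pressed()


plan = [
    lambda: performed.append("move mouse"),
    type_then_escape,
    lambda: performed.append("click"),  # never runs: the flag is already set
]
ran = run_actions(plan)
print(ran, performed)
```

With this approach an action that is already in progress finishes, and the loop stops before the next one, so individual actions should be kept short.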
Jarvis (macOS App) -- Python backend + Swift overlay
├── Voice In : Picovoice wake word -> Whisper STT (local)
├── Voice Out : ElevenLabs streaming WebSocket TTS
├── Vision : macOS CGWindowList (smart capture)
├── Actions : macOS Accessibility API + CGEvent
├── Overlay UI : Swift transparent NSWindow
├── Brain : Claude AI (streaming, tool-use, vision)
└── Comms : Python <-> Swift via WebSocket
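The Comms line says the Python backend and the Swift overlay talk over a WebSocket, but the message format is not documented here. As a rough illustration only, the sketch below assumes small JSON status messages carrying the three overlay states named earlier (Listening / Thinking / Acting). The `type`, `state`, and `detail` fields and both function names are assumptions, not the project's actual protocol.

```python
import json
from enum import Enum
from typing import Tuple


class OverlayState(str, Enum):
    """The three overlay states named in the feature list."""

    LISTENING = "listening"
    THINKING = "thinking"
    ACTING = "acting"


def encode_status(state: OverlayState, detail: str = "") -> str:
    """Serialize a status update to the JSON text sent over the socket."""
    return json.dumps({"type": "status", "state": state.value, "detail": detail})


def decode_status(raw: str) -> Tuple[OverlayState, str]:
    """Parse a status message, rejecting unknown message types and states."""
    message = json.loads(raw)
    if message.get("type") != "status":
        raise ValueError(f"unexpected message type: {message.get('type')!r}")
    # OverlayState(...) raises ValueError for a state outside the enum.
    return OverlayState(message["state"]), message.get("detail", "")


wire = encode_status(OverlayState.THINKING, "analyzing screenshot")
state, detail = decode_status(wire)
print(wire)
```

Validating the state on the receiving side means a typo in a state name fails loudly instead of leaving the overlay showing a stale status.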
| Component | Technology | Purpose |
|---|---|---|
| Core | Python 3.11+ | Main engine, AI integration |
| Overlay UI | Swift / AppKit | Native macOS transparent window |
| Wake Word | Picovoice Porcupine | Local, fast wake word detection |
| STT | OpenAI Whisper (local) | Offline speech-to-text |
| TTS | ElevenLabs (streaming) | Natural voice, low latency |
| Screen Capture | macOS CGWindowList + mss | Smart window-aware capture |
| Computer Control | macOS Accessibility + CGEvent | Native mouse/keyboard control |
| AI Brain | Anthropic Claude | Vision + tool-use + streaming |
- macOS (Apple Silicon or Intel)
- Python 3.11+
- Xcode (for the Swift overlay)
- API keys for:
  - Anthropic (Claude API)
  - ElevenLabs (Text-to-Speech)
  - Picovoice (Wake word detection)
# Clone the repository
git clone https://github.com/your-username/ai-watcher.git
cd ai-watcher
# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -e .

Create a .env file in the project root:
ANTHROPIC_API_KEY=your_anthropic_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
PICOVOICE_ACCESS_KEY=your_picovoice_access_key

cd overlay/JarvisOverlay
swift build
cd ../..

Jarvis needs the following macOS permissions (you'll be prompted on first run):
- Microphone -- for voice input
- Accessibility -- for computer control (mouse/keyboard)
- Screen Recording -- for screen capture
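Before the first run, it can also help to confirm that the three keys from the .env file are filled in. How Jarvis itself loads .env is not stated in this README, so the sketch below is a standalone check with its own minimal `KEY=VALUE` parser. `parse_env_text` and `missing_keys` are hypothetical helpers, and the parser deliberately ignores quoting and `export` prefixes.

```python
from typing import Dict, List, Mapping

# The three variable names listed in the .env example.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "ELEVENLABS_API_KEY", "PICOVOICE_ACCESS_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip()
    return values


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


# Example input: one key set, one left empty, one commented out.
sample = "ANTHROPIC_API_KEY=abc\nELEVENLABS_API_KEY=\n# PICOVOICE_ACCESS_KEY=later\n"
missing = missing_keys(parse_env_text(sample))
print(missing)
```

To check a real file, read it with `pathlib.Path(".env").read_text()` and pass the result to `parse_env_text`.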
# Activate your virtual environment
source .venv/bin/activate
# Run Jarvis
jarvis

Once running:
- Say "Jarvis" to activate
- Speak your request (e.g., "What's on my screen?", "Open Safari", "Click the submit button")
- Press ESC at any time to immediately stop Jarvis
ai-watcher/
├── jarvis/
│ ├── main.py # Entry point & event loop
│ ├── brain/ # AI integration (Claude)
│ ├── voice/ # Wake word, STT, TTS
│ ├── vision/ # Screen capture
│ ├── actions/ # Mouse/keyboard control
│ └── core/ # Shared utilities
├── overlay/
│ └── JarvisOverlay/ # Swift native overlay UI
├── assets/
│ ├── screenshots/ # Project screenshots
│ └── sounds/ # Sound effects
└── pyproject.toml # Python project config
This project is actively under development and highly unstable. Here's where things stand:
- Wake word detection ("Jarvis")
- Speech-to-text (Whisper local)
- ElevenLabs streaming TTS
- Claude AI integration (streaming + tool-use)
- Conversation memory
- Main event loop (wake -> listen -> think -> act -> speak)
- Screen capture (focused window + full screen)
- Mouse/keyboard control
- Swift overlay UI with status pill
- Smart crop around mouse cursor
- Cursor highlight trail during actions
- Sound effects (wake chime, action clicks)
- Error recovery & self-correction
- Settings UI
- Windows/Linux support
- Unstable WebSocket connections to ElevenLabs on mute/unmute
- Overlay may flicker on certain macOS versions
- Wake word detection sensitivity varies with ambient noise
- This is alpha software -- expect crashes
MIT
Built with coffee, Claude, and the dream of a real Jarvis.


