Jarvis - Voice-Activated AI Desktop Assistant

"Hey Jarvis" -- and it sees your screen, talks back, and takes control.

Warning

This project is in early development and is unstable. Expect breaking changes, missing features, and rough edges. It is not ready for production use. Contributions and feedback are welcome!

Jarvis is a voice-activated AI assistant for macOS that can see your screen, hear you speak, talk back, and control your computer -- all hands-free. Think Iron Man's Jarvis, but for your Mac.

How It Works

1. Say "Jarvis" -- wake word detection activates listening (powered by Picovoice Porcupine)
2. Speak your request -- local speech-to-text converts your voice (OpenAI Whisper)
3. Jarvis sees your screen -- smart screen capture grabs the relevant window
4. AI thinks & acts -- Claude analyzes the screenshot, plans actions, and executes them
5. Jarvis talks back -- natural voice response via ElevenLabs streaming TTS

All of this happens in seconds, with a native macOS overlay showing what Jarvis is doing.
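
The flow above can be sketched as a single request handler. This is a minimal illustration, not the project's actual code: the `Pipeline` class, its field names, and the `IDLE` state are assumptions made for the example, and each stage is passed in as a plain callable so the loop can run without Porcupine, Whisper, Claude, or ElevenLabs installed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    IDLE = "idle"            # assumed resting state, not named in this README
    LISTENING = "listening"
    THINKING = "thinking"
    ACTING = "acting"


@dataclass
class Pipeline:
    """Hypothetical wiring of the five steps; every stage is injected."""

    transcribe: Callable[[bytes], str]                   # stands in for local Whisper
    capture_screen: Callable[[], bytes]                  # stands in for window capture
    plan: Callable[[str, bytes], tuple[list[str], str]]  # stands in for Claude
    act: Callable[[str], None]                           # stands in for mouse/keyboard control
    speak: Callable[[str], None]                         # stands in for streaming TTS
    on_status: Callable[[Status], None]                  # stands in for the overlay update

    def handle_wake(self, audio: bytes) -> str:
        """Process one spoken request after the wake word has fired."""
        self.on_status(Status.LISTENING)
        request = self.transcribe(audio)

        self.on_status(Status.THINKING)
        screenshot = self.capture_screen()
        actions, reply = self.plan(request, screenshot)

        self.on_status(Status.ACTING)
        for action in actions:
            self.act(action)

        self.speak(reply)
        self.on_status(Status.IDLE)
        return reply
```

Injecting the stages keeps the ordering logic testable on its own; a real build would pass in wrappers around the services listed above.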

Features

Voice Activation -- Custom "Jarvis" wake word, no button pressing needed
Screen Vision -- Captures and understands what's on your screen
Computer Control -- Mouse clicks, keyboard input, app navigation via macOS Accessibility API
Natural Voice -- Streams responses with ElevenLabs for low-latency, natural speech
Native Overlay -- Swift-based transparent overlay with status indicators (Listening / Thinking / Acting)
Conversation Memory -- Maintains context across your conversation
ESC Kill Switch -- Instantly stops all actions with a single keypress
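
One common way to build a kill switch like the one listed above is a shared stop flag that the action loop checks before every step. The sketch below assumes that design; `KillSwitch` and `run_actions` are hypothetical names, and how Jarvis actually detects the ESC keypress is not shown here. A key listener would simply call `trigger()`.

```python
import threading
from typing import Callable, Iterable


class KillSwitch:
    """Cooperative stop flag, safe to set from another thread."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def trigger(self) -> None:
        """Request that all pending actions be abandoned."""
        self._stop.set()

    def reset(self) -> None:
        """Clear the flag before handling the next request."""
        self._stop.clear()

    @property
    def triggered(self) -> bool:
        return self._stop.is_set()


def run_actions(actions: Iterable[Callable[[], None]], kill: KillSwitch) -> int:
    """Run actions in order and return how many ran before a stop was requested."""
    completed = 0
    for action in actions:
        if kill.triggered:
            break  # skip everything that has not started yet
        action()
        completed += 1
    return completed
```

A flag checked between steps cannot interrupt an action that is already in progress, so the actions themselves need to be short for the stop to feel instant.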

Architecture

Jarvis (macOS App) -- Python backend + Swift overlay
├── Voice In      : Picovoice wake word -> Whisper STT (local)
├── Voice Out     : ElevenLabs streaming WebSocket TTS
├── Vision        : macOS CGWindowList (smart capture)
├── Actions       : macOS Accessibility API + CGEvent
├── Overlay UI    : Swift transparent NSWindow
├── Brain         : Claude AI (streaming, tool-use, vision)
└── Comms         : Python <-> Swift via WebSocket
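
The Python-to-Swift link carries overlay state over a WebSocket. The message format is not documented in this README, so the sketch below assumes a small JSON payload; the field names (`type`, `state`, `detail`) and the `idle` state are illustrative only. It covers encoding and validation, not the socket itself.

```python
import json
from typing import Any

# "listening", "thinking", and "acting" come from the overlay description;
# "idle" is an assumed resting state.
VALID_STATES = {"idle", "listening", "thinking", "acting"}


def encode_status(state: str, detail: str = "") -> str:
    """Serialize an overlay status update as a JSON text frame."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown overlay state: {state!r}")
    return json.dumps({"type": "status", "state": state, "detail": detail})


def decode_status(raw: str) -> dict[str, Any]:
    """Parse and validate a status frame, as a receiver would."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("status frame must be a JSON object")
    if message.get("type") != "status" or message.get("state") not in VALID_STATES:
        raise ValueError("not a valid status message")
    return message
```

For example, `encode_status("thinking", "Reading the Safari window")` produces a frame that the Swift side could decode with `JSONDecoder` into a matching struct.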

Tech Stack

Component	Technology	Purpose
Core	Python 3.11+	Main engine, AI integration
Overlay UI	Swift / AppKit	Native macOS transparent window
Wake Word	Picovoice Porcupine	Local, fast wake word detection
STT	OpenAI Whisper (local)	Offline speech-to-text
TTS	ElevenLabs (streaming)	Natural voice, low latency
Screen Capture	macOS CGWindowList + mss	Smart window-aware capture
Computer Control	macOS Accessibility + CGEvent	Native mouse/keyboard control
AI Brain	Anthropic Claude	Vision + tool-use + streaming

Prerequisites

macOS (Apple Silicon or Intel)
Python 3.11+
Xcode (for the Swift overlay)
API keys for:
- Anthropic (Claude API)
- ElevenLabs (Text-to-Speech)
- Picovoice (Wake word detection)

Installation

# Clone the repository
git clone https://github.com/your-username/ai-watcher.git
cd ai-watcher

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e .

Environment Variables

Create a .env file in the project root:

ANTHROPIC_API_KEY=your_anthropic_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
PICOVOICE_ACCESS_KEY=your_picovoice_access_key
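
To catch a missing key before startup, a short check like the following can help. It is a standard-library sketch, not the loader Jarvis actually uses (which is not shown here); `parse_env_file` and `missing_keys` are hypothetical helpers that handle only simple `KEY=value` lines.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "ELEVENLABS_API_KEY", "PICOVOICE_ACCESS_KEY")


def parse_env_file(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required keys that are absent or empty in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]
```

Calling `missing_keys(parse_env_file(open(".env").read()))` from the project root returns an empty list when all three keys are set.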

Build the Overlay (Swift)

cd overlay/JarvisOverlay
swift build
cd ../..

macOS Permissions

Jarvis needs the following macOS permissions (you'll be prompted on first run):

Microphone -- for voice input
Accessibility -- for computer control (mouse/keyboard)
Screen Recording -- for screen capture

Usage

# Activate your virtual environment
source .venv/bin/activate

# Run Jarvis
jarvis

Once running:

1. Say "Jarvis" to activate
2. Speak your request (e.g., "What's on my screen?", "Open Safari", "Click the submit button")
3. Press ESC at any time to immediately stop Jarvis

Project Structure

ai-watcher/
├── jarvis/
│   ├── main.py          # Entry point & event loop
│   ├── brain/           # AI integration (Claude)
│   ├── voice/           # Wake word, STT, TTS
│   ├── vision/          # Screen capture
│   ├── actions/         # Mouse/keyboard control
│   └── core/            # Shared utilities
├── overlay/
│   └── JarvisOverlay/   # Swift native overlay UI
├── assets/
│   ├── screenshots/     # Project screenshots
│   └── sounds/          # Sound effects
└── pyproject.toml       # Python project config

Current Status

This project is actively under development and highly unstable. Here's where things stand:

Known Issues

Unstable WebSocket connections to ElevenLabs on mute/unmute
Overlay may flicker on certain macOS versions
Wake word detection sensitivity varies with ambient noise
This is alpha software -- expect crashes

License

MIT

Built with coffee, Claude, and the dream of a real Jarvis.
