abdushsk/jarvis

Jarvis - Voice-Activated AI Desktop Assistant

"Hey Jarvis" -- and it sees your screen, talks back, and takes control.

Warning

This project is in very early development and is highly unstable. Expect breaking changes, missing features, and rough edges. It is not ready for production use. Contributions and feedback are welcome!


[Image: Say Jarvis to speak]

Jarvis is a voice-activated AI assistant for macOS that can see your screen, hear you speak, talk back, and control your computer -- all hands-free. Think Iron Man's Jarvis, but for your Mac.

[Screenshot: Jarvis conversation]

[Screenshot: Jarvis in action]


How It Works

  1. Say "Jarvis" -- wake word detection activates listening (powered by Picovoice Porcupine)
  2. Speak your request -- local speech-to-text converts your voice (OpenAI Whisper)
  3. Jarvis sees your screen -- smart screen capture grabs the relevant window
  4. AI thinks & acts -- Claude analyzes the screenshot, plans actions, and executes them
  5. Jarvis talks back -- natural voice response via ElevenLabs streaming TTS

All of this happens in seconds with a native macOS overlay showing you what Jarvis is doing.
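
The five steps above can be read as a small state machine. The following is a minimal sketch of that loop, not the project's actual code: `Stage`, `Turn`, and `run_turn` are hypothetical names, the `SPEAKING` stage is an assumption added for illustration, and the real Porcupine, Whisper, Claude, and ElevenLabs calls are replaced by plain callables passed in as parameters.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    """Hypothetical stage names, loosely mirroring the overlay indicators."""
    LISTENING = "listening"
    THINKING = "thinking"
    ACTING = "acting"
    SPEAKING = "speaking"  # assumption: not one of the documented indicators


@dataclass
class Turn:
    """Record of one listen -> think -> act -> speak cycle."""
    stages: List[Stage] = field(default_factory=list)
    reply: str = ""


def run_turn(
    transcribe: Callable[[], str],              # stand-in for local speech-to-text
    capture: Callable[[], bytes],               # stand-in for screen capture
    think: Callable[[str, bytes], List[str]],   # stand-in for the model call; returns planned actions
    act: Callable[[str], None],                 # stand-in for mouse/keyboard control
    speak: Callable[[str], None],               # stand-in for text-to-speech
) -> Turn:
    """Run one turn, assuming the wake word has already fired."""
    turn = Turn()

    turn.stages.append(Stage.LISTENING)
    request = transcribe()

    turn.stages.append(Stage.THINKING)
    screenshot = capture()
    actions = think(request, screenshot)

    turn.stages.append(Stage.ACTING)
    for action in actions:
        act(action)

    turn.stages.append(Stage.SPEAKING)
    turn.reply = f"Done: {len(actions)} action(s) for '{request}'"
    speak(turn.reply)
    return turn
```

Passing the stages in as callables keeps the loop testable without a microphone, a screen, or API keys; each stub can be swapped for the real component independently.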

Features

  • Voice Activation -- Custom "Jarvis" wake word, no button pressing needed
  • Screen Vision -- Captures and understands what's on your screen
  • Computer Control -- Mouse clicks, keyboard input, app navigation via macOS Accessibility API
  • Natural Voice -- Streams responses with ElevenLabs for low-latency, natural speech
  • Native Overlay -- Swift-based transparent overlay with status indicators (Listening / Thinking / Acting)
  • Conversation Memory -- Maintains context across your conversation
  • ESC Kill Switch -- Instantly stops all actions with a single keypress
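
A kill switch like the one listed above is commonly built around a shared cancellation flag that is checked before every action. The following is a minimal sketch of that pattern, assuming a `threading.Event` as the flag; `KillSwitch` and `run_actions` are hypothetical names, and how the ESC keypress is actually captured is left out.

```python
import threading
from typing import Callable, Iterable


class KillSwitch:
    """Hypothetical wrapper around a thread-safe stop flag."""

    def __init__(self) -> None:
        self._stop = threading.Event()

    def trigger(self) -> None:
        # Would be called from the key listener when ESC is pressed.
        self._stop.set()

    def reset(self) -> None:
        # Clear the flag so the next request can run.
        self._stop.clear()

    @property
    def triggered(self) -> bool:
        return self._stop.is_set()


def run_actions(actions: Iterable[Callable[[], None]], kill: KillSwitch) -> int:
    """Run actions in order, checking the flag before each one.

    Returns the number of actions that actually ran.
    """
    completed = 0
    for action in actions:
        if kill.triggered:
            break
        action()
        completed += 1
    return completed
```

With this design an action that is already in flight finishes before the flag is seen, so keeping individual actions short (a single click or keystroke) is what makes the stop feel immediate.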

Architecture

Jarvis (macOS App) -- Python backend + Swift overlay
├── Voice In      : Picovoice wake word -> Whisper STT (local)
├── Voice Out     : ElevenLabs streaming WebSocket TTS
├── Vision        : macOS CGWindowList (smart capture)
├── Actions       : macOS Accessibility API + CGEvent
├── Overlay UI    : Swift transparent NSWindow
├── Brain         : Claude AI (streaming, tool-use, vision)
└── Comms         : Python <-> Swift via WebSocket
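
The Python-to-Swift link on the last line needs an agreed message format. This README does not document one, so the following is only a sketch that assumes small JSON text frames; the `type`, `status`, and `detail` field names and both function names are hypothetical.

```python
import json
from typing import Dict

# Status values taken from the overlay indicators named under Features.
VALID_STATUSES = {"listening", "thinking", "acting"}


def encode_status(status: str, detail: str = "") -> str:
    """Serialize a status update as a JSON text frame for the overlay."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return json.dumps({"type": "status", "status": status, "detail": detail})


def decode_status(frame: str) -> Dict[str, str]:
    """Parse and validate a frame; the Swift side would do the equivalent."""
    message = json.loads(frame)
    if message.get("type") != "status" or message.get("status") not in VALID_STATUSES:
        raise ValueError(f"malformed frame: {frame!r}")
    return message
```

Validating on both ends means a typo in a status name fails loudly in Python instead of leaving the overlay stuck on a stale indicator.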

Tech Stack

| Component        | Technology                    | Purpose                         |
|------------------|-------------------------------|---------------------------------|
| Core             | Python 3.11+                  | Main engine, AI integration     |
| Overlay UI       | Swift / AppKit                | Native macOS transparent window |
| Wake Word        | Picovoice Porcupine           | Local, fast wake word detection |
| STT              | OpenAI Whisper (local)        | Offline speech-to-text          |
| TTS              | ElevenLabs (streaming)        | Natural voice, low latency      |
| Screen Capture   | macOS CGWindowList + mss      | Smart window-aware capture      |
| Computer Control | macOS Accessibility + CGEvent | Native mouse/keyboard control   |
| AI Brain         | Anthropic Claude              | Vision + tool-use + streaming   |

Prerequisites

  • macOS (Apple Silicon or Intel)
  • Python 3.11+
  • Xcode (for the Swift overlay)
  • API keys for Anthropic (Claude), ElevenLabs, and Picovoice

Installation

# Clone the repository
git clone https://github.com/abdushsk/jarvis.git
cd jarvis

# Create a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e .

Environment Variables

Create a .env file in the project root:

ANTHROPIC_API_KEY=your_anthropic_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
PICOVOICE_ACCESS_KEY=your_picovoice_access_key
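
A missing key may only surface when the first API call fails, so it can help to check all three at startup. The following is a minimal sketch using only the standard library; `parse_env` and `missing_keys` are hypothetical helpers, not part of Jarvis, and the parser handles only simple `KEY=value` lines.

```python
from typing import Dict, List

# The three variable names listed above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "ELEVENLABS_API_KEY", "PICOVOICE_ACCESS_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

A startup check could then read the file with `pathlib.Path(".env").read_text()`, pass the result through `parse_env` and `missing_keys`, and exit with a clear message if the returned list is not empty.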

Build the Overlay (Swift)

cd overlay/JarvisOverlay
swift build
cd ../..

macOS Permissions

Jarvis needs the following macOS permissions (you'll be prompted on first run):

  • Microphone -- for voice input
  • Accessibility -- for computer control (mouse/keyboard)
  • Screen Recording -- for screen capture

Usage

# Activate your virtual environment
source .venv/bin/activate

# Run Jarvis
jarvis

Once running:

  1. Say "Jarvis" to activate
  2. Speak your request (e.g., "What's on my screen?", "Open Safari", "Click the submit button")
  3. Press ESC at any time to immediately stop Jarvis

Project Structure

ai-watcher/
├── jarvis/
│   ├── main.py          # Entry point & event loop
│   ├── brain/           # AI integration (Claude)
│   ├── voice/           # Wake word, STT, TTS
│   ├── vision/          # Screen capture
│   ├── actions/         # Mouse/keyboard control
│   └── core/            # Shared utilities
├── overlay/
│   └── JarvisOverlay/   # Swift native overlay UI
├── assets/
│   ├── screenshots/     # Project screenshots
│   └── sounds/          # Sound effects
└── pyproject.toml       # Python project config

Current Status

This project is actively under development and highly unstable. Here's where things stand:

  • Wake word detection ("Jarvis")
  • Speech-to-text (Whisper local)
  • ElevenLabs streaming TTS
  • Claude AI integration (streaming + tool-use)
  • Conversation memory
  • Main event loop (wake -> listen -> think -> act -> speak)
  • Screen capture (focused window + full screen)
  • Mouse/keyboard control
  • Swift overlay UI with status pill
  • Smart crop around mouse cursor
  • Cursor highlight trail during actions
  • Sound effects (wake chime, action clicks)
  • Error recovery & self-correction
  • Settings UI
  • Windows/Linux support
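
One item above, the smart crop around the mouse cursor, comes down to clamping a fixed-size rectangle to the screen bounds. The following is a sketch of that arithmetic, assuming pixel coordinates with the origin at the top-left; `crop_box` and its parameters are hypothetical and not taken from the Jarvis code.

```python
from typing import Tuple


def crop_box(
    cursor: Tuple[int, int],
    screen: Tuple[int, int],
    size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a crop centered on the cursor.

    Near an edge the box is shifted back onto the screen rather than
    shrunk; it only shrinks when the requested size exceeds the screen.
    """
    (cx, cy), (sw, sh) = cursor, screen
    w, h = min(size[0], sw), min(size[1], sh)
    left = min(max(cx - w // 2, 0), sw - w)
    top = min(max(cy - h // 2, 0), sh - h)
    return left, top, left + w, top + h
```

Shifting instead of shrinking keeps every crop the same size, which gives the vision model a consistent image resolution regardless of where the cursor sits.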

Known Issues

  • Unstable WebSocket connections to ElevenLabs on mute/unmute
  • Overlay may flicker on certain macOS versions
  • Wake word detection sensitivity varies with ambient noise
  • This is alpha software -- expect crashes

License

MIT


Built with coffee, Claude, and the dream of a real Jarvis.
