Captures your screen → Analyzes with Gemma 4 → Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.
Features · Gemma 4 Deep Dive · Quick Start · Architecture · Agent Platform · MCP · API
Microsoft's Recall showed that people want screen-aware AI. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative: every screenshot analyzed, every insight generated, and every search result is computed locally using Gemma 4's multimodal capabilities.
It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.
- Smart Capture: Content-change detection, not a fixed timer. Captures when your screen actually changes.
- Gemma 4 Vision Analysis: Every screenshot is analyzed for app detection, activity categorization, mood, scene description, and spatial layout regions.
- Hybrid Search: Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by meaning, not just keywords.
- Chat with Memory: Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" and get the actual message.
- Voice Memos: Hold `Ctrl+Shift+V` and Gemma 4's native audio encoder transcribes what you say. A screenshot is captured alongside.
- Meeting Transcription: Auto-detects Zoom/Teams/Meet, records audio, transcribes, and generates structured summaries.
- Analytics Dashboard: Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
- Day Rewind: Timelapse playback of your entire day with play/pause/scrub/speed controls.
- Three Analysis Modes: Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
- Per-App pHash Cache: 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. ~40% fewer inference calls.
- Chat-First GPU Priority: Chat cancels in-flight analysis instantly. GPU freed in <1s.
- Auto-Pause Heavy Apps: When games, video editors, or 3D software are detected, capture pauses automatically.
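- 100% Local: All captured data stays on your machine. No telemetry, ever. The only network activity is the one-time model download and any integrations you explicitly enable (Notion, webhooks).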
- Sensitive Data Filter: Auto-redacts credit cards, SSNs, API keys, and passwords before storage.
- Encryption at Rest: AES encryption for screenshots (Fernet + OS keyring).
- Dashboard PIN Lock: Session-based auth with configurable auto-lock timeout.
- Incognito Mode: One-click pause. Nothing recorded.
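To make the sensitive data filter concrete, here is a minimal sketch of regex-based redaction. It is not ScreenMind's actual filter: the patterns, the placeholder strings, and the names `redact` and `luhn_valid` are assumptions for illustration. It covers only SSN-shaped strings and card numbers, and uses the Luhn checksum to avoid redacting arbitrary long digit runs.

```python
import re

# Hypothetical patterns for illustration; the real rules may differ.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSN-shaped strings and Luhn-valid card numbers with placeholders."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-CARD]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(_card, text)
```

Running the filter on OCR text before it reaches the database means a redacted value is never written to disk in the first place.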
Integrations & Extensibility
| Integration | Description |
|---|---|
| Agent Platform | Build automations in Markdown (English) or Python. Drop a file, get an agent. |
| MCP Server | Expose screen history to Claude Desktop, Cursor, VS Code |
| Obsidian | Auto-sync daily summaries to your vault |
| Notion | Push summaries to a Notion database |
| Webhooks | Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry) |
| Smart Notifications | Distraction alerts, break reminders |
| Auto-Bookmark | Keyword triggers (git push, deploy) auto-flag important moments |
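The webhooks row mentions HMAC signing. The sketch below shows how a sender can sign a JSON payload and how a receiver can verify it, using only the standard library. The choice of SHA-256, the canonical JSON serialization, and the function names are assumptions; the README does not say which hash or header ScreenMind uses.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 signature)."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The receiver should verify against the raw request bytes, not a re-serialized copy, because any change in key order or whitespace changes the signature.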
| Hotkey | Action |
|---|---|
| `Ctrl+Shift+B` | Instant bookmarked capture |
| `Ctrl+Shift+P` | Toggle pause/resume |
| `Ctrl+Shift+V` | Hold to record voice memo |
All hotkeys customizable from Settings.
Gemma 4 E2B is not a bolt-on; it's architecturally load-bearing. ScreenMind uses all three modalities (vision, audio, and reasoning):
Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:
- App name, activity category, summary, detailed context
- Mood classification, confidence score
- Rich scene description (every visible element inventoried)
- Layout regions (sidebar, chat area, toolbar boundaries)
Three modes:
- Accurate: single call with thinking (~76s). Best layout detection.
- Balanced: thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
- Fast: no-thinking prefill trick (~12s). Layout via OCR clustering instead.
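Because the model returns structured JSON, the caller has to parse it defensively. The sketch below shows one way to do that; the field names (`app`, `category`, `summary`, `mood`, `confidence`, `layout_regions`) are hypothetical and simply mirror the bullet list above, not a documented schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Analysis:
    # Hypothetical field names; they mirror the README bullets, not a real schema.
    app: str
    category: str
    summary: str
    mood: str = "neutral"
    confidence: float = 0.0
    layout_regions: list[dict] = field(default_factory=list)


def parse_analysis(raw: str) -> Analysis:
    """Parse a model reply, tolerating extra text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    # Clamp confidence into [0, 1] in case the model overshoots.
    confidence = min(1.0, max(0.0, float(data.get("confidence", 0.0))))
    return Analysis(
        app=str(data.get("app", "unknown")),
        category=str(data.get("category", "other")),
        summary=str(data.get("summary", "")),
        mood=str(data.get("mood", "neutral")),
        confidence=confidence,
        layout_regions=list(data.get("layout_regions", [])),
    )
```

Defaults for missing keys keep one malformed reply from stalling the analysis queue.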
Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:
- Voice memo transcription (hold hotkey → speak → release)
- Meeting transcription (15s chunks, map-reduce summarization for long meetings)
No Whisper dependency. One model handles everything.
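The 15-second chunking and map-reduce summarization mentioned above can be sketched as follows. This is an illustration under assumptions: audio is treated as a plain sequence of samples, `summarize` is a stand-in for a call to the model, and `group_size` is a made-up parameter.

```python
from typing import Callable, Iterator, Sequence

CHUNK_SECONDS = 15  # chunk length stated in the README


def chunk_audio(samples: Sequence[float], sample_rate: int,
                seconds: int = CHUNK_SECONDS) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-length slices; the last one may be shorter."""
    step = sample_rate * seconds
    for start in range(0, len(samples), step):
        yield samples[start:start + step]


def map_reduce_summary(transcripts: list[str],
                       summarize: Callable[[str], str],
                       group_size: int = 4) -> str:
    """Map: summarize groups of chunk transcripts. Reduce: summarize the summaries."""
    if not transcripts:
        return ""
    partials = [
        summarize(" ".join(transcripts[i:i + group_size]))
        for i in range(0, len(transcripts), group_size)
    ]
    return partials[0] if len(partials) == 1 else summarize(" ".join(partials))
```

Summarizing in two stages keeps each prompt small enough for the context window, however long the meeting runs.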
- Daily summaries with deep reasoning (`think=True`)
- Chat answers grounded in actual screen data (text-first RAG with vision fallback)
- Agent execution: Gemma processes Markdown agent prompts with injected screen data
| Constraint | Why It Rules Out Alternatives |
|---|---|
| Must run continuously in background | Rules out 12B+ models (too heavy) |
| Must understand screenshots natively | Rules out text-only models |
| Must stay 100% local for privacy | Rules out cloud APIs |
| Must handle audio natively | Rules out models without audio encoder |
| Must be fast enough for 30s cycle | E2B processes in 12-76s depending on mode |
Gemma 4 E2B is the only model that checks all five boxes.
Requirements: Python 3.10+ · GPU recommended (4GB+ VRAM) · ~5GB disk for model
git clone https://github.com/ayushh0110/ScreenMind.git
cd ScreenMind
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # macOS/Linux
pip install -r requirements.txt
python main.py
Then open http://127.0.0.1:7777 in your browser.
On first run, ScreenMind will:
- Auto-download Gemma 4 E2B GGUF model (~5GB, one time)
- Start `llama-server` in the background
- Show the welcome screen to set up an optional PIN
- Create `~/.screenmind/` for data storage
Optional: Configure via .env
cp .env.example .env
# Edit capture interval, blocked apps, hotkeys, etc.
Or configure everything from the Settings tab in the dashboard.
┌─────────────────────────────────────────────────────────────────────┐
│                             ScreenMind                              │
│                                                                     │
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────────────┐  │
│  │  Capture   │───▶│ Async Queue  │───▶│     Analysis Worker     │  │
│  │  Worker    │    │  (max: 100)  │    │                         │  │
│  │            │    └──────────────┘    │  ┌───────────────────┐  │  │
│  │ • Screen   │                        │  │   Per-App pHash   │  │  │
│  │ • Window   │                        │  │   Cache (3-tier)  │  │  │
│  │ • Dedup    │                        │  └───────────────────┘  │  │
│  │ • A11y     │                        │            │            │  │
│  │ • Privacy  │                        │            ▼            │  │
│  └────────────┘                        │  ┌───────────────────┐  │  │
│                                        │  │      EasyOCR      │  │  │
│  ┌────────────┐                        │  │   (text extract)  │  │  │
│  │   Audio    │                        │  └───────────────────┘  │  │
│  │   Worker   │                        │            │            │  │
│  │            │                        │            ▼            │  │
│  │ • Meeting  │                        │  ┌───────────────────┐  │  │
│  │   detect   │                        │  │    Gemma 4 E2B    │  │  │
│  │ • Record   │                        │  │  (via llama.cpp)  │  │  │
│  │ • Transcr. │                        │  │   Vision + Audio  │  │  │
│  └────────────┘                        │  └───────────────────┘  │  │
│                                        │            │            │  │
│  ┌────────────┐                        │            ▼            │  │
│  │   Agent    │                        │  ┌───────────────────┐  │  │
│  │ Scheduler  │                        │  │  Layout Analyzer  │  │  │
│  │            │                        │  │   (spatial OCR)   │  │  │
│  │ • .md AI   │                        │  └───────────────────┘  │  │
│  │ • .py code │                        │            │            │  │
│  └────────────┘                        │            ▼            │  │
│                                        │  ┌───────────────────┐  │  │
│                                        │  │    MiniLM-L6-v2   │  │  │
│                                        │  │    (embeddings)   │  │  │
│                                        │  └───────────────────┘  │  │
│                                        └─────────────────────────┘  │
│                                                     │               │
│                                                     ▼               │
│                                           ┌───────────────────┐     │
│                                           │   SQLite (WAL)    │     │
│                                           │   + FTS5 index    │     │
│                                           └─────────┬─────────┘     │
│                                                     │               │
│       ┌─────────────────────────────────────────────┘               │
│       │                                                             │
│       ▼                                                             │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                      FastAPI REST Server                      │  │
│  │     /timeline · /search · /chat · /stats · /agents · /mcp     │  │
│  │                                                               │  │
│  │  ┌─────────────────────────────────────────────────────────┐  │  │
│  │  │             Web Dashboard (Vanilla JS SPA)              │  │  │
│  │  │ Timeline · Chat · Search · Analytics · Agents · Settings│  │  │
│  │  └─────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────┘
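In the diagram, a bounded async queue (max: 100) decouples the capture worker from the analysis worker. The sketch below shows that producer/consumer shape with `asyncio.Queue`. Dropping frames when the queue is full and the `None` sentinel are assumptions for illustration, not documented ScreenMind behavior.

```python
import asyncio


async def capture_worker(queue: asyncio.Queue, frames: list[str]) -> int:
    """Enqueue frames without blocking; drop a frame when the queue is full."""
    dropped = 0
    for frame in frames:
        try:
            queue.put_nowait(frame)
        except asyncio.QueueFull:
            dropped += 1  # assumed policy: skip rather than stall capture
        await asyncio.sleep(0)  # yield control to the consumer
    await queue.put(None)  # sentinel: no more frames
    return dropped


async def analysis_worker(queue: asyncio.Queue) -> list[str]:
    """Drain the queue until the sentinel arrives."""
    results = []
    while (frame := await queue.get()) is not None:
        results.append(f"analyzed:{frame}")  # stand-in for OCR + model call
    return results


async def run_pipeline(frames: list[str], maxsize: int = 100):
    """Run producer and consumer concurrently and return (dropped, results)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    dropped, results = await asyncio.gather(
        capture_worker(queue, frames), analysis_worker(queue)
    )
    return dropped, results
```

A bounded queue caps memory use when analysis (12-76s per frame) falls behind capture.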
Screenshot → EasyOCR (text) → Gemma 4 E2B (understanding) → MiniLM (embeddings) → SQLite + FTS5
                    ↓
          OCR text fed as context
      (Gemma sees image + reads text)
Four components work in concert, with Gemma 4 as the brain:
- EasyOCR: extracts raw screen text
- Gemma 4 E2B: understands what you're doing (vision + reasoning)
- MiniLM-L6-v2: generates semantic vectors for natural language search
- FTS5: indexes text for instant keyword search
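Hybrid search needs some way to merge the semantic ranking with the keyword ranking. The README does not say how ScreenMind fuses them, so the sketch below uses reciprocal rank fusion (RRF), a common technique swapped in for illustration. The activity IDs and the constant `k = 60` are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Merge ranked ID lists; items ranked high in several lists rise to the top."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), then by ID for a stable tie-break.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = [7, 3, 9]  # hypothetical IDs ranked by embedding similarity
keyword_hits = [3, 5, 7]   # hypothetical IDs ranked by an FTS5 MATCH query
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)  # [3, 7, 5, 9]
```

RRF works on ranks alone, so cosine similarities and FTS5 scores never have to be put on a common scale.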
ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.
| Mode | File Type | For | Example |
|---|---|---|---|
| AI Agent | `.md` | Everyone | Write a prompt in English → Gemma runs it on your data |
| Python Plugin | `.py` | Developers | Full code with SDK access, state persistence, LLM calls |
---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---
Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.
Drop this file in `~/.screenmind/agents/` and it runs automatically.
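The front matter in the agent file above could be read with a small parser like the one below. This is a sketch that assumes flat `key: value` lines between two `---` markers. The functions `parse_agent` and `split_list` are hypothetical, and ScreenMind's own loader may well use a full YAML parser.

```python
def parse_agent(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown agent into (front-matter dict, prompt body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("agent file must start with a '---' line")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("front matter is missing its closing '---'") from None
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:]).strip()


def split_list(value: str) -> list[str]:
    """Turn 'timeline, apps, mood' into ['timeline', 'apps', 'mood']."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

With the example agent, `meta["schedule"]` would be `"every 6h"` and `split_list(meta["data"])` would name the three selectors to inject.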
from screenmind_sdk import ScreenMindSDK
sdk = ScreenMindSDK("my-tracker")
# Get today's activities filtered by app
activities = sdk.get_activities(app="Chrome", limit=20)
# Persistent state across runs
last_count = sdk.load_state("url_count", 0)
urls = sdk.get_urls_visited()
sdk.save_state("url_count", len(urls))
# Ask Gemma (GPU-safe: waits for idle)
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
print(insight)
Markdown agents declare what data they need:
| Selector | Injects |
|---|---|
| `timeline` | Recent activities with timestamps, apps, summaries |
| `apps` | App usage counts + category breakdown |
| `urls` | URLs visited (extracted from browser address bars) |
| `meetings` | Meeting summaries and durations |
| `mood` | Mood/sentiment from screen analysis |
Data injection auto-scales to your model's context window.
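One way such auto-scaling could work is to give each selector an equal share of the context budget and keep only the newest items that fit. The sketch below is a guess at the idea, not the actual implementation. The four-characters-per-token estimate and the name `fit_to_budget` are assumptions.

```python
def fit_to_budget(sections: dict[str, list[str]], max_tokens: int,
                  chars_per_token: int = 4) -> dict[str, list[str]]:
    """Keep the newest items of each section within an equal share of the budget."""
    if not sections:
        return {}
    share = (max_tokens * chars_per_token) // len(sections)  # characters per section
    fitted = {}
    for name, items in sections.items():
        kept, used = [], 0
        for item in reversed(items):  # newest items are assumed to be last
            if used + len(item) > share:
                break
            kept.append(item)
            used += len(item)
        fitted[name] = list(reversed(kept))  # restore chronological order
    return fitted
```

A larger `max_tokens` lets more history through, which is the scaling behavior the sentence above describes.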
- `daily-journal.md`: First-person journal entry from your day
- `focus-report.md`: Focus score, deep work hours, distractions
- `meeting-actions.md`: Extract action items from meeting transcripts
- `code-changelog.md`: Summarize coding activity (commits, files, repos)
ScreenMind exposes your screen history to any MCP-compatible AI tool:
python mcp_server.py # stdio transport
Claude Desktop config (~/.claude/claude_desktop_config.json):
{
"mcpServers": {
"screenmind": {
"command": "python",
"args": ["C:/path/to/screenmind/mcp_server.py"]
}
}
}
| Tool | Description |
|---|---|
| `search_screen` | Semantic + keyword search across all history |
| `get_recent_activity` | Last N activities with full details |
| `get_activity_by_time` | Activities for a specific date/time range |
| `get_daily_summary` | AI-generated daily summary |
| `capture_now` | Trigger instant screenshot |
| `get_stats` | Usage statistics |
| `search_audio` | Search meeting transcripts |
| `get_screenshot` | Retrieve screenshot path by activity ID |
Full Swagger docs at http://127.0.0.1:7777/docs
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/status` | System health, worker stats |
| `GET` | `/api/timeline?date=2026-05-21` | Activities for a date |
| `GET` | `/api/search?q=debugging auth` | Hybrid semantic + keyword search |
| `POST` | `/api/chat` | Conversational AI with screen memory (SSE stream) |
| `GET` | `/api/stats?range=day` | Analytics (categories, apps, meetings) |
| `GET` | `/api/rewind?date=2026-05-21` | Timelapse frames |
| `POST` | `/api/summary/generate` | Generate AI daily summary |
| `GET` | `/api/agents` | List all agents |
| `POST` | `/api/agents/{name}/run` | Trigger agent execution |
| `POST` | `/api/capture/pause` | Pause capture |
| `POST` | `/api/incognito/toggle` | Toggle incognito mode |
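The GET endpoints above can be called from a script with the standard library alone. The sketch below builds URLs with proper query encoding (note the space in `debugging auth`) and decodes the response. It assumes the endpoints return JSON and that the dashboard PIN lock is not blocking the request. `build_url` and `get_json` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:7777"  # default address given in this README


def build_url(path: str, **params: str) -> str:
    """Join the base URL, an endpoint path, and URL-encoded query parameters."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


def get_json(path: str, **params: str):
    """GET an endpoint and decode the body, assuming it returns JSON."""
    with urlopen(build_url(path, **params), timeout=10) as response:
        return json.load(response)


search_url = build_url("/api/search", q="debugging auth")
print(search_url)  # http://127.0.0.1:7777/api/search?q=debugging+auth
```

With ScreenMind running, `get_json("/api/timeline", date="2026-05-21")` would fetch that day's activities.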
All settings configurable via .env, environment variables, or the Settings dashboard (persists to settings.json).
| Variable | Default | Description |
|---|---|---|
| `CAPTURE_INTERVAL` | `40` | Seconds between captures |
| `ANALYSIS_MODE` | `merged` | `merged` (accurate, ~76s) or `fast` (~12s) |
| `PERFORMANCE_MODE` | `balanced` | GPU layers: `minimal` / `balanced` / `maximum` |
| `BLOCKED_APPS` | (empty) | Comma-separated apps to never capture |
| `MEETING_TRANSCRIPTION` | `false` | Auto-transcribe when meeting apps detected |
| `RETENTION_DAYS` | `7` | Auto-delete data older than N days (0 = forever) |
| `ENCRYPTION_ENABLED` | `false` | Encrypt screenshots at rest |
| `SENSITIVE_FILTER_ENABLED` | `true` | Redact credit cards, SSNs, API keys |
See `.env.example` for the full list.
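To show how the variables and defaults in the table map onto typed settings, here is a standard-library sketch. The project's `config.py` is described as using Pydantic, so this is a simplified stand-in rather than the real loader. The `Settings` fields and `load_settings` are illustrative names, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings from environment variables."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    capture_interval: int
    analysis_mode: str
    blocked_apps: tuple[str, ...]
    retention_days: int
    encryption_enabled: bool


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the table's defaults."""
    env = os.environ if env is None else env
    blocked = env.get("BLOCKED_APPS", "")
    return Settings(
        capture_interval=int(env.get("CAPTURE_INTERVAL", "40")),
        analysis_mode=env.get("ANALYSIS_MODE", "merged"),
        blocked_apps=tuple(a.strip() for a in blocked.split(",") if a.strip()),
        retention_days=int(env.get("RETENTION_DAYS", "7")),
        encryption_enabled=_as_bool(env.get("ENCRYPTION_ENABLED", "false")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to test without touching the real environment.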
| Layer | Technology | Why |
|---|---|---|
| Vision + Audio AI | Gemma 4 E2B (via llama.cpp) | Only model with vision + audio + reasoning that runs locally on 4GB VRAM |
| Inference Server | llama-server (llama.cpp) | Direct GGUF inference, OpenAI-compatible API, 8-12% faster than Ollama |
| OCR | EasyOCR | Extracts screen text fed to Gemma as context |
| Embeddings | all-MiniLM-L6-v2 | 80MB, runs on CPU, 384-dim vectors for semantic search |
| Backend | FastAPI + Uvicorn | Async-first, auto-generated API docs |
| Database | SQLite (WAL) + FTS5 | Zero-config, concurrent reads, full-text search |
| Capture | mss + ctypes/UI Automation | Native screen capture + accessibility text extraction |
| Frontend | Vanilla JS + CSS | No build step, instant load, dark glassmorphism UI |
| Platform | Windows / macOS / Linux | Abstraction layer with OS-specific adapters |
screenmind/
├── main.py                  # Entry point: starts all services
├── config.py                # Pydantic settings (env + runtime overrides)
├── requirements.txt         # Python dependencies
├── mcp_server.py            # MCP server for Claude/Cursor/VS Code
├── screenmind_sdk.py        # SDK for Python plugin agents
│
├── capture/                 # Screenshot capture layer
│   ├── screen.py            # mss-based capture + encryption
│   ├── window.py            # Active window detection
│   ├── dedup.py             # Perceptual hash deduplication
│   ├── hotkey.py            # Global hotkeys (bookmark, pause, voice)
│   └── voice_recorder.py    # Mic recording for voice memos
│
├── engine/                  # AI & intelligence layer
│   ├── analyzer.py          # Gemma 4 vision analysis (dual mode)
│   ├── llm_client.py        # llama-server client (chat, vision, audio)
│   ├── model_manager.py     # Server lifecycle, model download/switch
│   ├── embedder.py          # MiniLM semantic embeddings
│   ├── ocr.py               # EasyOCR text extraction
│   ├── layout_analyzer.py   # Spatial OCR organization
│   ├── dev_context.py       # Git repo/branch/diff detection
│   ├── a11y_extractor.py    # Accessibility API text extraction
│   └── agent_runner.py      # Agent scheduling & execution
│
├── workers/                 # Background processing
│   ├── capture_worker.py    # Smart capture loop + privacy filtering
│   ├── analysis_worker.py   # OCR → Gemma → Layout → Embed → Store
│   └── audio_worker.py      # Meeting detection & transcription
│
├── storage/                 # Data persistence
│   ├── database.py          # SQLite + FTS5 + migrations
│   └── models.py            # Pydantic data models
│
├── privacy/                 # Privacy & security
│   ├── encryption.py        # Fernet AES encryption at rest
│   └── data_filter.py       # Sensitive data redaction
│
├── platform_support/        # Cross-platform abstraction
│   ├── windows.py           # Win32 + UI Automation
│   ├── macos.py             # AppKit + AXUIElement
│   └── linux.py             # xdotool + AT-SPI
│
├── integrations/            # External connections
│   ├── obsidian.py          # Vault markdown export
│   ├── notion.py            # Notion API export
│   ├── webhooks.py          # HTTP webhooks (HMAC, retry)
│   └── smart_notify.py      # Distraction/break notifications
│
├── api/                     # REST API + dashboard
│   ├── server.py            # FastAPI app + auth middleware
│   ├── dependencies.py      # Shared state for routes
│   ├── routes/              # 16 route modules
│   └── static/              # Web dashboard (HTML + CSS + JS)
│
├── default_agents/          # 4 built-in agents
│   ├── daily-journal.md
│   ├── focus-report.md
│   ├── meeting-actions.md
│   └── code-changelog.md
│
└── docs/
    └── BUILD_YOUR_OWN_AGENT.md
| Scenario | Behavior |
|---|---|
| llama-server not running | Auto-starts on launch. Captures continue; analysis retried with backoff. |
| Model not downloaded | Auto-downloads GGUF on first start via HuggingFace. |
| GPU out of memory | Detects OOM, retries with delay, re-queues on persistent failure. |
| Duplicate frames | pHash dedup skips identical screenshots (threshold: 8 hamming distance). |
| Stale queue items | Captures >3 min old auto-skipped. Backfilled during idle. |
| App in blocklist | Silently skips; no screenshot saved. |
| Meeting app closed | Process-alive check + silence detection + 5-min hard timeout. |
| Chat during analysis | Cancels in-flight inference, frees GPU in <1s, re-queues analysis. |
| Crash recovery | Stale meetings cleaned on startup. Unanalyzed entries backfilled. |
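The duplicate-frame row relies on comparing perceptual hashes by Hamming distance. The sketch below illustrates the comparison with an average hash, a simpler relative of pHash (which uses a DCT), computed over a small grayscale grid in pure Python. Treating a distance of 8 or less as a duplicate is an assumption about what "threshold: 8" means.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(a: int, b: int, threshold: int = 8) -> bool:
    # Assumption: distances up to and including the threshold count as duplicates.
    return hamming_distance(a, b) <= threshold
```

In practice the screenshot would first be downscaled to something like 8x8 grayscale, so a blinking cursor or a clock tick flips only a few bits and the frame is skipped.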
The web dashboard at http://127.0.0.1:7777 features:
- Timeline: Browse activities by date with thumbnails, AI summaries, category badges
- Chat: Conversational AI with screen memory. Ask anything about your history.
- Search: Semantic + keyword hybrid search with OCR highlighting on screenshots
- Analytics: Category charts, top apps, hourly heatmap, meeting stats
- Rewind: Timelapse player with play/pause/scrub/speed controls
- Memos: Voice memo list with audio player
- Agents: Create, edit, run, and monitor agents
- Settings: 8 organized sections (Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage)
Dark glassmorphism UI. No build step. Instant load.
Contributions welcome! Here are some high-impact areas:
- macOS/Linux testing: platform adapters exist but need real hardware testing
- Docker container: one-command setup
- Community agent registry: share agents between users
- Browser extension: richer URL/tab context
- Export formats: Markdown, CSV, JSON
If you find ScreenMind useful, please consider:
- Star this repo: it helps others discover the project
- Fork it: build your own agents and features
- Report issues: help us improve
- Share it: tell others about privacy-first AI
MIT License. See LICENSE for details.
Built with Gemma 4 E2B · 100% Local · Zero Cloud Dependencies
Vision + Audio + Reasoning: all three modalities, one model, your machine.
Made with ❤️ by ayushh0110