Skip to content

ayushh0110/ScreenMind

Repository files navigation


ScreenMind



Captures your screen β†’ Analyzes with Gemma 4 β†’ Builds a searchable AI memory
100% local. 100% private. Zero cloud dependencies.


Python 3.10+ Gemma 4 E2B llama.cpp License MIT MCP Ready


Features Β· Gemma 4 Deep Dive Β· Quick Start Β· Architecture Β· Agent Platform Β· MCP Β· API



Microsoft showed the world wants screen-aware AI with Recall. But Recall stores data in plaintext, sends telemetry, and was met with massive privacy backlash. ScreenMind is the open-source, privacy-first alternative β€” every screenshot analyzed, every insight generated, every search result β€” all computed locally using Gemma 4's multimodal capabilities.

It's not just a screen recorder. It's an AI memory you can talk to, search through, and build automations on top of.


✨ Features

🧠 Core Intelligence

  • πŸ“Έ Smart Capture β€” Content-change detection, not a fixed timer. Captures when your screen actually changes.
  • πŸ”¬ Gemma 4 Vision Analysis β€” Every screenshot analyzed: app detection, activity categorization, mood, scene description, spatial layout regions.
  • πŸ” Hybrid Search β€” Semantic embeddings (MiniLM) + FTS5 keyword search. Find anything by meaning, not just keywords.
  • πŸ’¬ Chat with Memory β€” Conversational RAG with follow-up support. Ask "what did Ishaa say on Discord?" β†’ get the actual message.
  • πŸŽ™οΈ Voice Memos β€” Hold Ctrl+Shift+V β†’ Gemma 4's native audio encoder transcribes. Screenshot captured alongside.
  • 🎀 Meeting Transcription β€” Auto-detects Zoom/Teams/Meet, records audio, transcribes, generates structured summaries.
  • πŸ“Š Analytics Dashboard β€” Category breakdown, top apps, hourly heatmap, meeting stats, focus metrics.
  • βͺ Day Rewind β€” Timelapse playback of your entire day with play/pause/scrub/speed controls.

⚑ Performance

  • Three Analysis Modes β€” Accurate (~76s, deep thinking + layout), Balanced (~40s, thinking), or Fast (~12s, no thinking). You choose.
  • Per-App pHash Cache β€” 3-tier caching with app-aware staleness. Communication apps refresh faster than IDEs. ~40% fewer inference calls.
  • Chat-First GPU Priority β€” Chat cancels in-flight analysis instantly. GPU freed in <1s.
  • Auto-Pause Heavy Apps β€” Games, video editors, 3D software detected β†’ capture pauses automatically.

πŸ”’ Privacy & Security

  • 100% Local β€” All data stays on your machine. Zero network calls. No telemetry. Ever.
  • Sensitive Data Filter β€” Auto-redacts credit cards, SSNs, API keys, passwords before storage.
  • Encryption at Rest β€” AES encryption for screenshots (Fernet + OS keyring).
  • Dashboard PIN Lock β€” Session-based auth with configurable auto-lock timeout.
  • Incognito Mode β€” One-click pause. Nothing recorded.
πŸ”Œ Integrations & Extensibility
Integration Description
πŸ€– Agent Platform Build automations in Markdown (English) or Python. Drop a file, get an agent.
πŸ”Œ MCP Server Expose screen history to Claude Desktop, Cursor, VS Code
πŸ““ Obsidian Auto-sync daily summaries to your vault
πŸ“‹ Notion Push summaries to a Notion database
πŸͺ Webhooks Fire events to Slack, Discord, IFTTT (HMAC signed, auto-retry)
πŸ”” Smart Notifications Distraction alerts, break reminders
⭐ Auto-Bookmark Keyword triggers (git push, deploy) auto-flag important moments

⌨️ System-Wide Hotkeys

Hotkey Action
Ctrl+Shift+B πŸ“Έ Instant bookmarked capture
Ctrl+Shift+P ⏸ Toggle pause/resume
Ctrl+Shift+V 🎀 Hold to record voice memo

All hotkeys customizable from Settings.


🧠 How Gemma 4 Is Used

Gemma 4 E2B is not a bolt-on β€” it's architecturally load-bearing. ScreenMind uses all three modalities:

1. Vision β€” Screenshot Analysis

Every screenshot is sent to Gemma 4 with OCR context. It returns structured JSON:

  • App name, activity category, summary, detailed context
  • Mood classification, confidence score
  • Rich scene description (every visible element inventoried)
  • Layout regions (sidebar, chat area, toolbar boundaries)

Three modes:

  • Accurate β€” single call with thinking (~76s). Best layout detection.
  • Balanced β€” thinking enabled, analysis-only (~40s). Richer descriptions than Fast.
  • Fast β€” no-thinking prefill trick (~12s). Layout via OCR clustering instead.

2. Audio β€” Voice Memos & Meeting Transcription

Gemma 4 E2B has a native audio encoder. ScreenMind uses it for:

  • Voice memo transcription (hold hotkey β†’ speak β†’ release)
  • Meeting transcription (15s chunks, map-reduce summarization for long meetings)

No Whisper dependency. One model handles everything.

3. Reasoning β€” Summaries, Chat, Agents

  • Daily summaries with deep reasoning (think=True)
  • Chat answers grounded in actual screen data (text-first RAG with vision fallback)
  • Agent execution β€” Gemma processes markdown agent prompts with injected screen data

Why E2B Specifically?

Constraint Why It Rules Out Alternatives
Must run continuously in background Rules out 12B+ models (too heavy)
Must understand screenshots natively Rules out text-only models
Must stay 100% local for privacy Rules out cloud APIs
Must handle audio natively Rules out models without audio encoder
Must be fast enough for 30s cycle E2B processes in 12-76s depending on mode

Gemma 4 E2B is the only model that checks all five boxes.


πŸš€ Quick Start

Requirements: Python 3.10+ Β· GPU recommended (4GB+ VRAM) Β· ~5GB disk for model

1️⃣ Clone & Install

git clone https://github.com/ayushh0110/ScreenMind.git
cd screenmind

python -m venv venv
venv\Scripts\activate        # Windows
# source venv/bin/activate   # macOS/Linux

pip install -r requirements.txt

2️⃣ Run

python main.py

3️⃣ Open β†’ http://127.0.0.1:7777

On first run, ScreenMind will:

  • Auto-download Gemma 4 E2B GGUF model (~5GB, one time)
  • Start llama-server in background
  • Show the welcome screen to set up an optional PIN
  • Create ~/.screenmind/ for data storage
βš™οΈ Optional: Configure via .env
cp .env.example .env
# Edit capture interval, blocked apps, hotkeys, etc.

Or configure everything from the Settings tab in the dashboard.


πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          ScreenMind                                  β”‚
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚  Capture   │───▢│  Async Queue │───▢│    Analysis Worker      β”‚ β”‚
β”‚  β”‚  Worker    β”‚    β”‚  (max: 100)  β”‚    β”‚                         β”‚ β”‚
β”‚  β”‚            β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚  β”‚ β€’ Screen   β”‚                        β”‚  β”‚  Per-App pHash    β”‚  β”‚ β”‚
β”‚  β”‚ β€’ Window   β”‚                        β”‚  β”‚  Cache (3-tier)   β”‚  β”‚ β”‚
β”‚  β”‚ β€’ Dedup    β”‚                        β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚  β”‚ β€’ A11y     β”‚                        β”‚           β”‚             β”‚ β”‚
β”‚  β”‚ β€’ Privacy  β”‚                        β”‚           β–Ό             β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚                                        β”‚  β”‚   EasyOCR         β”‚  β”‚ β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚  β”‚   (text extract)  β”‚  β”‚ β”‚
β”‚  β”‚   Audio    β”‚                        β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚  β”‚   Worker   β”‚                        β”‚           β”‚             β”‚ β”‚
β”‚  β”‚            β”‚                        β”‚           β–Ό             β”‚ β”‚
β”‚  β”‚ β€’ Meeting  β”‚                        β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚  β”‚   detect   β”‚                        β”‚  β”‚   Gemma 4 E2B     β”‚  β”‚ β”‚
β”‚  β”‚ β€’ Record   β”‚                        β”‚  β”‚   (via llama.cpp) β”‚  β”‚ β”‚
β”‚  β”‚ β€’ Transcr. β”‚                        β”‚  β”‚   Vision + Audio  β”‚  β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚                                        β”‚           β”‚             β”‚ β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”‚           β–Ό             β”‚ β”‚
β”‚  β”‚   Agent    β”‚                        β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚  β”‚  Scheduler β”‚                        β”‚  β”‚  Layout Analyzer  β”‚  β”‚ β”‚
β”‚  β”‚            β”‚                        β”‚  β”‚  (spatial OCR)    β”‚  β”‚ β”‚
β”‚  β”‚ β€’ .md AI   β”‚                        β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚  β”‚ β€’ .py code β”‚                        β”‚           β”‚             β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β”‚           β–Ό             β”‚ β”‚
β”‚                                        β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚ β”‚
β”‚                                        β”‚  β”‚  MiniLM-L6-v2     β”‚  β”‚ β”‚
β”‚                                        β”‚  β”‚  (embeddings)     β”‚  β”‚ β”‚
β”‚                                        β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ β”‚
β”‚                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                    β”‚               β”‚
β”‚                                                    β–Ό               β”‚
β”‚                                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚                                        β”‚   SQLite (WAL)    β”‚       β”‚
β”‚                                        β”‚   + FTS5 index    β”‚       β”‚
β”‚                                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                  β”‚                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                 β”‚
β”‚  β”‚                                                                 β”‚
β”‚  β–Ό                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚  β”‚                    FastAPI REST Server                         β”‚ β”‚
β”‚  β”‚  /timeline Β· /search Β· /chat Β· /stats Β· /agents Β· /mcp       β”‚ β”‚
β”‚  β”‚                                                               β”‚ β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚ β”‚
β”‚  β”‚  β”‚           Web Dashboard (Vanilla JS SPA)               β”‚   β”‚ β”‚
β”‚  β”‚  β”‚  Timeline Β· Chat Β· Search Β· Analytics Β· Agents Β· Settings β”‚ β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Multi-Model AI Pipeline

Screenshot β†’ EasyOCR (text) β†’ Gemma 4 E2B (understanding) β†’ MiniLM (embeddings) β†’ SQLite + FTS5
                                     ↑
                              OCR text fed as context
                              (Gemma sees image + reads text)

Four AI models working in concert, with Gemma 4 as the brain:

  1. EasyOCR β€” extracts raw screen text
  2. Gemma 4 E2B β€” understands what you're doing (vision + reasoning)
  3. MiniLM-L6-v2 β€” generates semantic vectors for natural language search
  4. FTS5 β€” indexes text for instant keyword search

πŸ€– Agent Platform

ScreenMind includes a full agent/plugin system. Build any automation on top of your screen data.

Two Modes

Mode File Type For Example
πŸ€– AI Agent .md Everyone Write a prompt in English β†’ Gemma runs it on your data
🐍 Python Plugin .py Developers Full code with SDK access, state persistence, LLM calls

Markdown Agent Example

---
name: Daily Focus Report
schedule: every 6h
data: timeline, apps, mood
output: local, obsidian
---

Analyze my screen activity and generate a focus report:
- How many hours of deep work vs shallow work?
- What were my main distractions?
- Give me a focus score out of 10.

Drop this file in ~/.screenmind/agents/ β€” it runs automatically.

Python Plugin SDK

from screenmind_sdk import ScreenMindSDK

sdk = ScreenMindSDK("my-tracker")

# Get today's activities filtered by app
activities = sdk.get_activities(app="Chrome", limit=20)

# Persistent state across runs
last_count = sdk.load_state("url_count", 0)
urls = sdk.get_urls_visited()
sdk.save_state("url_count", len(urls))

# Ask Gemma (GPU-safe β€” waits for idle)
insight = sdk.ask_gemma(f"Summarize these URLs: {urls}")
print(insight)

Data Selectors (Frontmatter)

Markdown agents declare what data they need:

Selector Injects
timeline Recent activities with timestamps, apps, summaries
apps App usage counts + category breakdown
urls URLs visited (extracted from browser address bars)
meetings Meeting summaries and durations
mood Mood/sentiment from screen analysis

Data injection auto-scales to your model's context window.

4 Agents Ship Built-In

  • daily-journal.md β€” First-person journal entry from your day
  • focus-report.md β€” Focus score, deep work hours, distractions
  • meeting-actions.md β€” Extract action items from meeting transcripts
  • code-changelog.md β€” Summarize coding activity (commits, files, repos)

πŸ”Œ MCP Server (Claude / Cursor / VS Code)

ScreenMind exposes your screen history to any MCP-compatible AI tool:

python mcp_server.py  # stdio transport

Claude Desktop config (~/.claude/claude_desktop_config.json):

{
  "mcpServers": {
    "screenmind": {
      "command": "python",
      "args": ["C:/path/to/screenmind/mcp_server.py"]
    }
  }
}

Tools Available

Tool Description
search_screen Semantic + keyword search across all history
get_recent_activity Last N activities with full details
get_activity_by_time Activities for a specific date/time range
get_daily_summary AI-generated daily summary
capture_now Trigger instant screenshot
get_stats Usage statistics
search_audio Search meeting transcripts
get_screenshot Retrieve screenshot path by activity ID

πŸ“‘ API Reference

Full Swagger docs at http://127.0.0.1:7777/docs

Key Endpoints

Method Endpoint Description
GET /api/status System health, worker stats
GET /api/timeline?date=2026-05-21 Activities for a date
GET /api/search?q=debugging auth Hybrid semantic + keyword search
POST /api/chat Conversational AI with screen memory (SSE stream)
GET /api/stats?range=day Analytics (categories, apps, meetings)
GET /api/rewind?date=2026-05-21 Timelapse frames
POST /api/summary/generate Generate AI daily summary
GET /api/agents List all agents
POST /api/agents/{name}/run Trigger agent execution
POST /api/capture/pause Pause capture
POST /api/incognito/toggle Toggle incognito mode

βš™οΈ Configuration


All settings configurable via .env, environment variables, or the Settings dashboard (persists to settings.json).

Variable Default Description
CAPTURE_INTERVAL 40 Seconds between captures
ANALYSIS_MODE merged merged (accurate, ~76s) or fast (~12s)
PERFORMANCE_MODE balanced GPU layers: minimal / balanced / maximum
BLOCKED_APPS (empty) Comma-separated apps to never capture
MEETING_TRANSCRIPTION false Auto-transcribe when meeting apps detected
RETENTION_DAYS 7 Auto-delete data older than N days (0 = forever)
ENCRYPTION_ENABLED false Encrypt screenshots at rest
SENSITIVE_FILTER_ENABLED true Redact credit cards, SSNs, API keys

See .env.example for the full list.


πŸ”§ Tech Stack

Layer Technology Why
Vision + Audio AI Gemma 4 E2B (via llama.cpp) Only model with vision + audio + reasoning that runs locally on 4GB VRAM
Inference Server llama-server (llama.cpp) Direct GGUF inference, OpenAI-compatible API, 8-12% faster than Ollama
OCR EasyOCR Extracts screen text fed to Gemma as context
Embeddings all-MiniLM-L6-v2 80MB, runs on CPU, 384-dim vectors for semantic search
Backend FastAPI + Uvicorn Async-first, auto-generated API docs
Database SQLite (WAL) + FTS5 Zero-config, concurrent reads, full-text search
Capture mss + ctypes/UI Automation Native screen capture + accessibility text extraction
Frontend Vanilla JS + CSS No build step, instant load, dark glassmorphism UI
Platform Windows / macOS / Linux Abstraction layer with OS-specific adapters

πŸ“ Project Structure


screenmind/
β”œβ”€β”€ main.py                    # Entry point β€” starts all services
β”œβ”€β”€ config.py                  # Pydantic settings (env + runtime overrides)
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ mcp_server.py              # MCP server for Claude/Cursor/VS Code
β”œβ”€β”€ screenmind_sdk.py          # SDK for Python plugin agents
β”‚
β”œβ”€β”€ capture/                   # Screenshot capture layer
β”‚   β”œβ”€β”€ screen.py              # mss-based capture + encryption
β”‚   β”œβ”€β”€ window.py              # Active window detection
β”‚   β”œβ”€β”€ dedup.py               # Perceptual hash deduplication
β”‚   β”œβ”€β”€ hotkey.py              # Global hotkeys (bookmark, pause, voice)
β”‚   └── voice_recorder.py      # Mic recording for voice memos
β”‚
β”œβ”€β”€ engine/                    # AI & intelligence layer
β”‚   β”œβ”€β”€ analyzer.py            # Gemma 4 vision analysis (dual mode)
β”‚   β”œβ”€β”€ llm_client.py          # llama-server client (chat, vision, audio)
β”‚   β”œβ”€β”€ model_manager.py       # Server lifecycle, model download/switch
β”‚   β”œβ”€β”€ embedder.py            # MiniLM semantic embeddings
β”‚   β”œβ”€β”€ ocr.py                 # EasyOCR text extraction
β”‚   β”œβ”€β”€ layout_analyzer.py     # Spatial OCR organization
β”‚   β”œβ”€β”€ dev_context.py         # Git repo/branch/diff detection
β”‚   β”œβ”€β”€ a11y_extractor.py      # Accessibility API text extraction
β”‚   └── agent_runner.py        # Agent scheduling & execution
β”‚
β”œβ”€β”€ workers/                   # Background processing
β”‚   β”œβ”€β”€ capture_worker.py      # Smart capture loop + privacy filtering
β”‚   β”œβ”€β”€ analysis_worker.py     # OCR β†’ Gemma β†’ Layout β†’ Embed β†’ Store
β”‚   └── audio_worker.py        # Meeting detection & transcription
β”‚
β”œβ”€β”€ storage/                   # Data persistence
β”‚   β”œβ”€β”€ database.py            # SQLite + FTS5 + migrations
β”‚   └── models.py              # Pydantic data models
β”‚
β”œβ”€β”€ privacy/                   # Privacy & security
β”‚   β”œβ”€β”€ encryption.py          # Fernet AES encryption at rest
β”‚   └── data_filter.py         # Sensitive data redaction
β”‚
β”œβ”€β”€ platform_support/          # Cross-platform abstraction
β”‚   β”œβ”€β”€ windows.py             # Win32 + UI Automation
β”‚   β”œβ”€β”€ macos.py               # AppKit + AXUIElement
β”‚   └── linux.py               # xdotool + AT-SPI
β”‚
β”œβ”€β”€ integrations/              # External connections
β”‚   β”œβ”€β”€ obsidian.py            # Vault markdown export
β”‚   β”œβ”€β”€ notion.py              # Notion API export
β”‚   β”œβ”€β”€ webhooks.py            # HTTP webhooks (HMAC, retry)
β”‚   └── smart_notify.py        # Distraction/break notifications
β”‚
β”œβ”€β”€ api/                       # REST API + dashboard
β”‚   β”œβ”€β”€ server.py              # FastAPI app + auth middleware
β”‚   β”œβ”€β”€ dependencies.py        # Shared state for routes
β”‚   β”œβ”€β”€ routes/                # 16 route modules
β”‚   └── static/                # Web dashboard (HTML + CSS + JS)
β”‚
β”œβ”€β”€ default_agents/            # 4 built-in agents
β”‚   β”œβ”€β”€ daily-journal.md
β”‚   β”œβ”€β”€ focus-report.md
β”‚   β”œβ”€β”€ meeting-actions.md
β”‚   └── code-changelog.md
β”‚
└── docs/
    └── BUILD_YOUR_OWN_AGENT.md

πŸ›‘οΈ Error Handling & Resilience

Scenario Behavior
llama-server not running Auto-starts on launch. Captures continue; analysis retried with backoff.
Model not downloaded Auto-downloads GGUF on first start via HuggingFace.
GPU out of memory Detects OOM, retries with delay, re-queues on persistent failure.
Duplicate frames pHash dedup skips identical screenshots (threshold: 8 hamming distance).
Stale queue items Captures >3 min old auto-skipped. Backfilled during idle.
App in blocklist Silently skips β€” no screenshot saved.
Meeting app closed Process-alive check + silence detection + 5-min hard timeout.
Chat during analysis Cancels in-flight inference, frees GPU in <1s, re-queues analysis.
Crash recovery Stale meetings cleaned on startup. Unanalyzed entries backfilled.

🎨 Dashboard

The web dashboard at http://127.0.0.1:7777 features:

  • Timeline β€” Browse activities by date with thumbnails, AI summaries, category badges
  • Chat β€” Conversational AI with screen memory. Ask anything about your history.
  • Search β€” Semantic + keyword hybrid search with OCR highlighting on screenshots
  • Analytics β€” Category charts, top apps, hourly heatmap, meeting stats
  • Rewind β€” Timelapse player with play/pause/scrub/speed controls
  • Memos β€” Voice memo list with audio player
  • Agents β€” Create, edit, run, and monitor agents
  • Settings β€” 8 organized sections: Shortcuts, Capture, AI, Audio, Privacy, Automation, Integrations, Storage

Dark glassmorphism UI. No build step. Instant load.


🀝 Contributing

Contributions welcome! Here are some high-impact areas:

  • 🍎 macOS/Linux testing β€” platform adapters exist, need real hardware testing
  • 🐳 Docker container β€” one-command setup
  • 🧩 Community agent registry β€” share agents between users
  • 🌐 Browser extension β€” richer URL/tab context
  • πŸ“€ Export formats β€” Markdown, CSV, JSON

⭐ Show Your Support

If you find ScreenMind useful, please consider:

  • ⭐ Star this repo β€” it helps others discover the project
  • 🍴 Fork it β€” build your own agents and features
  • πŸ› Report issues β€” help us improve
  • πŸ“£ Share it β€” tell others about privacy-first AI

Stars Forks



πŸ“ License

MIT License β€” see LICENSE for details.



Built with 🧠 Gemma 4 E2B Β· πŸ”’ 100% Local Β· πŸš€ Zero Cloud Dependencies

Vision + Audio + Reasoning β€” all three modalities, one model, your machine.


Made with ❀️ by ayushh0110

About

🧠 AI-powered screen memory β€” captures, analyzes, and lets you search/chat your screen history. Powered by Gemma 4 E2B. 100% local, 100% private.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors