____ _
/ ___| ___ _ __ ___ __ _ _ __ (_)_ __ __ _
\___ \ / __| '__/ _ \/ _` | '_ \| | '_ \ / _` |
___) | (__| | | __/ (_| | | | | | | | | (_| |
|____/ \___|_| \___|\__,_|_| |_|_|_| |_|\__, |
|___/
_ _
/ \ __ _ ___ _ __ ___(_)_ __ __ _
/ _ \ / _` |/ _ \ '_ \ / __| | '_ \ / _` |
/ ___ \ (_| | __/ | | | (__| | | | | (_| |
/_/ \_\__, |\___|_| |_|\___|_|_| |_|\__, |
|___/ |___/
An open-source AI agent that SEES your screen and CONTROLS your computer using natural language.
Just tell it what to do. It sees what you see, plans the steps, and executes them.
🚀 Quick Start · 📖 Documentation · 🗺️ Roadmap · 🤝 Contributing
┌──────────────────────────────────────────────────────┐
│ │
│ 👤 "Open Chrome, go to YouTube, and search for │
│ 'lo-fi beats to study to'" │
│ │
│ 🤖 🔍 Taking screenshot... │
│ 📊 Analyzing screen... Found: Desktop │
│ 🖱️ Clicking Chrome icon at (120, 890) │
│ ⏳ Waiting for Chrome to open... │
│ 📊 Analyzing screen... Found: Chrome address bar │
│ ⌨️ Typing: youtube.com │
│ ⏳ Pressing Enter... │
│ 📊 Analyzing screen... Found: YouTube loaded │
│ 🖱️ Clicking search bar at (640, 180) │
│ ⌨️ Typing: lo-fi beats to study to │
│ ⏳ Pressing Enter... │
│ ✅ Done! YouTube is now searching for your query. │
│ │
└──────────────────────────────────────────────────────┘
⚠️ BETA NOTICE: This project is in active development. The agent can interact with your desktop — use with caution and always review planned actions before confirming.
| Feature | Description |
|---|---|
| 👁️ Sees Your Screen | Uses GPT-4o Vision or local Ollama models to understand what's on your display |
| 🗣️ Natural Language | Just tell it what you want in plain English — no scripts, no coordinates |
| 🖱️ Full Desktop Control | Mouse clicks, typing, hotkeys, scrolling, dragging — it does it all |
| 🧠 Multi-Step Planning | Breaks complex tasks into atomic steps and executes them sequentially |
| 🔁 Loop Detection | Remembers what it did and avoids getting stuck in infinite loops |
| 🛡️ Safety First | Asks for confirmation before dangerous actions (deleting files, sending emails) |
| 🏠 Local Model Support | Run 100% offline with Ollama — no data leaves your machine |
| 🎨 Beautiful CLI | Rich terminal UI with colors, progress bars, and step visualization |
┌─────────────────────────┐
│ 🗣️ Natural Language │
│ User Command │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ 📋 Planner (LLM) │
│ Break command into │
│ atomic actions │
└───────────┬─────────────┘
│
┌───────────▼─────────────┐
│ 🔁 Main Agent Loop │
│ │
│ ┌──────┐ ┌──────────┐ │
│ │ 📷 │ │ 🧠 Vision │ │
│ │Screen│──▶│ AI (LLM) │ │
│ │Capture│ │ │ │
│ └──────┘ └─────┬─────┘ │
│ │ │
│ ┌────▼────┐ │
│ │ 🎯 Plan │ │
│ │ Action │ │
│ └────┬────┘ │
│ │ │
│ ┌────▼────┐ │
│ │ 🖱️⌨️ │ │
│ │ Execute │ │
│ └────┬────┘ │
│ │ │
│ ┌─────────▼─────┐ │
│ │ 🧠 Memory │ │
│ │ Track & Learn │ │
│ └───────────────┘ │
└───────────────────────────┘
│
┌───────────▼─────────────┐
│ 💻 Your Desktop │
│ (any application) │
└─────────────────────────┘
- Python 3.9+
- Tesseract OCR (install guide)
- OpenAI API key (or Ollama running locally)
# Clone the repo
git clone https://github.com/bayuuuu18/screen-agent.git
cd screen-agent
# Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure
cp .env.example .env
# Edit .env with your API key# Interactive mode (recommended)
python screen_agent.py
# Single command mode
python screen_agent.py --command "open Notepad and type Hello World"
# Use local Ollama model
python screen_agent.py --provider ollama --model llavapip install -e .
screen-agent # launch interactive modeThe agent takes a screenshot using mss (fast, cross-platform) and converts it to base64 for the vision API.
GPT-4o Vision (or a local model) analyzes the screenshot, identifies UI elements, reads text, and understands the current state.
Based on the user's command and the current screen state, the AI generates a sequence of atomic actions (click, type, scroll, etc.).
Actions are executed via pyautogui — moving the mouse, clicking, typing text, pressing hotkeys.
Short-term memory tracks completed actions, detects loops, and maintains context across multi-step tasks.
$ python screen_agent.py
Screen Agent v0.1.0 — AI Desktop Copilot
🤖 What would you like me to do?
> open calculator and compute 42 * 17
📋 Plan: 4 steps
[1/4] 🔍 Analyzing screen...
[1/4] 🖱️ Clicking Start menu (20, 1050)
[2/4] ⌨️ Typing: calculator
[3/4] 🖱️ Clicking Calculator app (180, 420)
[4/4] ⌨️ Typing: 42*17=
✅ Task complete! Result: 714from agent.core import ScreenAgent
agent = ScreenAgent(provider="openai") # or "ollama"
result = agent.execute("Take a screenshot and tell me what's on screen")
print(result)
# Multi-step task
result = agent.execute("Open Firefox, go to github.com, and search for 'screen-agent'")from agent.screen import ScreenCapture
from agent.controller import MouseKeyboard
screen = ScreenCapture()
mk = MouseKeyboard()
# Screenshot
img = screen.capture_full()
img.save("my_screenshot.png")
# Type and click
mk.click(500, 300)
mk.type_text("Hello, World!")
mk.hotkey("ctrl", "a")
mk.hotkey("ctrl", "c")| Capability | Screen Agent | AutoGPT | AgentGPT | Open Interpreter |
|---|---|---|---|---|
| See the screen | ✅ | ❌ | ❌ | ❌ |
| Control mouse/keyboard | ✅ | ❌ | ❌ | ✅ (terminal) |
| Natural language input | ✅ | ✅ | ✅ | ✅ |
| Multi-step planning | ✅ | ✅ | ✅ | ✅ |
| GUI app interaction | ✅ | ❌ | ❌ | ❌ |
| Local model support | ✅ | ✅ | ❌ | ✅ |
| Visual understanding | ✅ | ❌ | ❌ | ❌ |
| Safety confirmations | ✅ | Partial | ❌ | ✅ |
| Loop detection | ✅ | ✅ | ❌ | ❌ |
| Open source | ✅ | ✅ | ✅ | ✅ |
Edit .env to customize:
# API Keys
OPENAI_API_KEY=sk-your-key-here
# Ollama (for local models)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava
# Provider: "openai" or "ollama"
AI_PROVIDER=openai
VISION_MODEL=gpt-4o
# Screen capture
SCREENSHOT_QUALITY=80 # JPEG quality (1-100)
SCREENSHOT_SCALE=0.5 # Scale factor for API (smaller = faster)
# Safety
CONFIRM_DANGEROUS_ACTIONS=true
MAX_STEPS_PER_TASK=20
# Mouse
MOUSE_SPEED=0.5 # Seconds for mouse movement
CLICK_PAUSE=0.1 # Pause after click- Screenshot capture & analysis
- Mouse & keyboard control
- Multi-step task planning
- GPT-4o Vision support
- Ollama local model support
- Safety confirmations
- Loop detection & memory
- 🔜 Voice input (speech-to-text)
- 🔜 Element highlighting overlay
- 🔜 Undo/rollback actions
- 🔜 Task recording & replay
- 🔜 Web UI dashboard
- 🔜 Plugin system
- 🔜 macOS & Linux optimizations
- 🔜 Action confidence scoring
- 🔜 Multi-monitor support
- 🔜 Screen recording to create training data
We love contributions! See CONTRIBUTING.md for guidelines.
Good first issues:
- 🐛 Bug fixes
- 📝 Documentation improvements
- ✨ New action types (drag & drop, gesture support)
- 🧪 Test coverage
MIT License — see LICENSE for details.
- OpenAI for GPT-4o Vision
- Ollama for local LLM support
- PyAutoGUI for desktop automation
- Rich for the beautiful CLI
- mss for fast screenshots
Built with ❤️ by bayuuuu18
If you find this useful, please ⭐ the repo!