🖥️ Screen Agent — Your AI Desktop Copilot

  ____                            _                 
 / ___|  ___ _ __ ___  __ _ _ __ (_)_ __   __ _    
 \___ \ / __| '__/ _ \/ _` | '_ \| | '_ \ / _` |   
  ___) | (__| | |  __/ (_| | | | | | | | | (_| |   
 |____/ \___|_|  \___|\__,_|_| |_|_|_| |_|\__, |   
                                           |___/    
    _                           _           
   / \   __ _  ___ _ __   ___(_)_ __   __ _ 
  / _ \ / _` |/ _ \ '_ \ / __| | '_ \ / _` |
 / ___ \ (_| |  __/ | | | (__| | | | | (_| |
/_/   \_\__, |\___|_| |_|\___|_|_| |_|\__, |
        |___/                           |___/

🖥️ Screen Agent — Your AI Desktop Copilot

An open-source AI agent that SEES your screen and CONTROLS your computer using natural language.

Just tell it what to do. It sees what you see, plans the steps, and executes them.

🚀 Quick Start · 📖 Documentation · 🗺️ Roadmap · 🤝 Contributing

🎬 Demo

┌──────────────────────────────────────────────────────┐
│                                                      │
│  👤 "Open Chrome, go to YouTube, and search for      │
│      'lo-fi beats to study to'"                      │
│                                                      │
│  🤖 🔍 Taking screenshot...                          │
│     📊 Analyzing screen... Found: Desktop             │
│     🖱️  Clicking Chrome icon at (120, 890)           │
│     ⏳ Waiting for Chrome to open...                  │
│     📊 Analyzing screen... Found: Chrome address bar  │
│     ⌨️  Typing: youtube.com                          │
│     ⏳ Pressing Enter...                              │
│     📊 Analyzing screen... Found: YouTube loaded      │
│     🖱️  Clicking search bar at (640, 180)            │
│     ⌨️  Typing: lo-fi beats to study to              │
│     ⏳ Pressing Enter...                              │
│     ✅ Done! YouTube is now searching for your query. │
│                                                      │
└──────────────────────────────────────────────────────┘

⚠️ BETA NOTICE: This project is in active development. The agent can interact with your desktop — use with caution and always review planned actions before confirming.

✨ Why Screen Agent?

Feature	Description
👁️ Sees Your Screen	Uses GPT-4o Vision or local Ollama models to understand what's on your display
🗣️ Natural Language	Just tell it what you want in plain English — no scripts, no coordinates
🖱️ Full Desktop Control	Mouse clicks, typing, hotkeys, scrolling, dragging — it does it all
🧠 Multi-Step Planning	Breaks complex tasks into atomic steps and executes them sequentially
🔁 Loop Detection	Remembers what it did and avoids getting stuck in infinite loops
🛡️ Safety First	Asks for confirmation before dangerous actions (deleting files, sending emails)
🏠 Local Model Support	Run 100% offline with Ollama — no data leaves your machine
🎨 Beautiful CLI	Rich terminal UI with colors, progress bars, and step visualization

🏗️ Architecture

                    ┌─────────────────────────┐
                    │    🗣️ Natural Language    │
                    │      User Command        │
                    └───────────┬─────────────┘
                                │
                                ▼
                    ┌─────────────────────────┐
                    │    📋 Planner (LLM)      │
                    │  Break command into      │
                    │  atomic actions          │
                    └───────────┬─────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │    🔁 Main Agent Loop     │
                    │                          │
                    │  ┌──────┐   ┌──────────┐ │
                    │  │ 📷   │   │ 🧠 Vision │ │
                    │  │Screen│──▶│  AI (LLM) │ │
                    │  │Capture│  │           │ │
                    │  └──────┘   └─────┬─────┘ │
                    │                   │       │
                    │              ┌────▼────┐  │
                    │              │ 🎯 Plan  │  │
                    │              │  Action  │  │
                    │              └────┬────┘  │
                    │                   │       │
                    │              ┌────▼────┐  │
                    │              │ 🖱️⌨️     │  │
                    │              │ Execute  │  │
                    │              └────┬────┘  │
                    │                   │       │
                    │         ┌─────────▼─────┐ │
                    │         │ 🧠 Memory      │ │
                    │         │ Track & Learn  │ │
                    │         └───────────────┘ │
                    └───────────────────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │     💻 Your Desktop      │
                    │   (any application)      │
                    └─────────────────────────┘

🚀 Quick Start

Prerequisites

Python 3.9+
Tesseract OCR (install guide)
OpenAI API key (or Ollama running locally)

Installation

# Clone the repo
git clone https://github.com/bayuuuu18/screen-agent.git
cd screen-agent

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with your API key

Run

# Interactive mode (recommended)
python screen_agent.py

# Single command mode
python screen_agent.py --command "open Notepad and type Hello World"

# Use local Ollama model
python screen_agent.py --provider ollama --model llava

Or install as a package

pip install -e .
screen-agent  # launch interactive mode

📖 How It Works

1. Capture — See the Screen

The agent takes a screenshot using mss (fast, cross-platform) and converts it to base64 for the vision API.

2. Analyze — Understand What's There

GPT-4o Vision (or a local model) analyzes the screenshot, identifies UI elements, reads text, and understands the current state.

3. Plan — Decide What to Do

Based on the user's command and the current screen state, the AI generates a sequence of atomic actions (click, type, scroll, etc.).

4. Execute — Do It

Actions are executed via pyautogui — moving the mouse, clicking, typing text, pressing hotkeys.

5. Remember — Learn From What Happened

Short-term memory tracks completed actions, detects loops, and maintains context across multi-step tasks.

💬 Usage Examples

Interactive Mode

$ python screen_agent.py

  Screen Agent v0.1.0 — AI Desktop Copilot
  
  🤖 What would you like me to do?
  > open calculator and compute 42 * 17
  
  📋 Plan: 4 steps
  [1/4] 🔍 Analyzing screen...
  [1/4] 🖱️  Clicking Start menu (20, 1050)
  [2/4] ⌨️  Typing: calculator
  [3/4] 🖱️  Clicking Calculator app (180, 420)
  [4/4] ⌨️  Typing: 42*17=
  
  ✅ Task complete! Result: 714

Python API

from agent.core import ScreenAgent

agent = ScreenAgent(provider="openai")  # or "ollama"
result = agent.execute("Take a screenshot and tell me what's on screen")
print(result)

# Multi-step task
result = agent.execute("Open Firefox, go to github.com, and search for 'screen-agent'")

Programmatic Actions

from agent.screen import ScreenCapture
from agent.controller import MouseKeyboard

screen = ScreenCapture()
mk = MouseKeyboard()

# Screenshot
img = screen.capture_full()
img.save("my_screenshot.png")

# Type and click
mk.click(500, 300)
mk.type_text("Hello, World!")
mk.hotkey("ctrl", "a")
mk.hotkey("ctrl", "c")

🆚 Comparison

Capability	Screen Agent	AutoGPT	AgentGPT	Open Interpreter
See the screen	✅	❌	❌	❌
Control mouse/keyboard	✅	❌	❌	✅ (terminal)
Natural language input	✅	✅	✅	✅
Multi-step planning	✅	✅	✅	✅
GUI app interaction	✅	❌	❌	❌
Local model support	✅	✅	❌	✅
Visual understanding	✅	❌	❌	❌
Safety confirmations	✅	Partial	❌	✅
Loop detection	✅	✅	❌	❌
Open source	✅	✅	✅	✅

⚙️ Configuration

Edit .env to customize:

# API Keys
OPENAI_API_KEY=sk-your-key-here

# Ollama (for local models)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava

# Provider: "openai" or "ollama"
AI_PROVIDER=openai
VISION_MODEL=gpt-4o

# Screen capture
SCREENSHOT_QUALITY=80        # JPEG quality (1-100)
SCREENSHOT_SCALE=0.5         # Scale factor for API (smaller = faster)

# Safety
CONFIRM_DANGEROUS_ACTIONS=true
MAX_STEPS_PER_TASK=20

# Mouse
MOUSE_SPEED=0.5              # Seconds for mouse movement
CLICK_PAUSE=0.1              # Pause after click

🗺️ Roadmap

🤝 Contributing

We love contributions! See CONTRIBUTING.md for guidelines.

Good first issues:

🐛 Bug fixes
📝 Documentation improvements
✨ New action types (drag & drop, gesture support)
🧪 Test coverage

⭐ Star History

📜 License

MIT License — see LICENSE for details.

🙏 Acknowledgments

OpenAI for GPT-4o Vision
Ollama for local LLM support
PyAutoGUI for desktop automation
Rich for the beautiful CLI
mss for fast screenshots

Built with ❤️ by bayuuuu18

If you find this useful, please ⭐ the repo!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🖥️ Screen Agent — Your AI Desktop Copilot

🎬 Demo

✨ Why Screen Agent?

🏗️ Architecture

🚀 Quick Start

Prerequisites

Installation

Run

Or install as a package

📖 How It Works

1. Capture — See the Screen

2. Analyze — Understand What's There

3. Plan — Decide What to Do

4. Execute — Do It

5. Remember — Learn From What Happened

💬 Usage Examples

Interactive Mode

Python API

Programmatic Actions

🆚 Comparison

⚙️ Configuration

🗺️ Roadmap

🤝 Contributing

⭐ Star History

📜 License

🙏 Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
agent		agent
examples		examples
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
screen_agent.py		screen_agent.py

Folders and files

Latest commit

History

Repository files navigation

🖥️ Screen Agent — Your AI Desktop Copilot

🎬 Demo

✨ Why Screen Agent?

🏗️ Architecture

🚀 Quick Start

Prerequisites

Installation

Run

Or install as a package

📖 How It Works

1. Capture — See the Screen

2. Analyze — Understand What's There

3. Plan — Decide What to Do

4. Execute — Do It

5. Remember — Learn From What Happened

💬 Usage Examples

Interactive Mode

Python API

Programmatic Actions

🆚 Comparison

⚙️ Configuration

🗺️ Roadmap

🤝 Contributing

⭐ Star History

📜 License

🙏 Acknowledgments

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages