Skip to content

bayuuuu18/screen-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

  ____                            _                 
 / ___|  ___ _ __ ___  __ _ _ __ (_)_ __   __ _    
 \___ \ / __| '__/ _ \/ _` | '_ \| | '_ \ / _` |   
  ___) | (__| | |  __/ (_| | | | | | | | | (_| |   
 |____/ \___|_|  \___|\__,_|_| |_|_|_| |_|\__, |   
                                           |___/    
    _                           _           
   / \   __ _  ___ _ __   ___(_)_ __   __ _ 
  / _ \ / _` |/ _ \ '_ \ / __| | '_ \ / _` |
 / ___ \ (_| |  __/ | | | (__| | | | | (_| |
/_/   \_\__, |\___|_| |_|\___|_|_| |_|\__, |
        |___/                           |___/

🖥️ Screen Agent — Your AI Desktop Copilot

Stars Forks Issues License Python PRs Welcome

An open-source AI agent that SEES your screen and CONTROLS your computer using natural language.

Just tell it what to do. It sees what you see, plans the steps, and executes them.

🚀 Quick Start · 📖 Documentation · 🗺️ Roadmap · 🤝 Contributing


🎬 Demo

┌──────────────────────────────────────────────────────┐
│                                                      │
│  👤 "Open Chrome, go to YouTube, and search for      │
│      'lo-fi beats to study to'"                      │
│                                                      │
│  🤖 🔍 Taking screenshot...                          │
│     📊 Analyzing screen... Found: Desktop             │
│     🖱️  Clicking Chrome icon at (120, 890)           │
│     ⏳ Waiting for Chrome to open...                  │
│     📊 Analyzing screen... Found: Chrome address bar  │
│     ⌨️  Typing: youtube.com                          │
│     ⏳ Pressing Enter...                              │
│     📊 Analyzing screen... Found: YouTube loaded      │
│     🖱️  Clicking search bar at (640, 180)            │
│     ⌨️  Typing: lo-fi beats to study to              │
│     ⏳ Pressing Enter...                              │
│     ✅ Done! YouTube is now searching for your query. │
│                                                      │
└──────────────────────────────────────────────────────┘

⚠️ BETA NOTICE: This project is in active development. The agent can interact with your desktop — use with caution and always review planned actions before confirming.


✨ Why Screen Agent?

Feature Description
👁️ Sees Your Screen Uses GPT-4o Vision or local Ollama models to understand what's on your display
🗣️ Natural Language Just tell it what you want in plain English — no scripts, no coordinates
🖱️ Full Desktop Control Mouse clicks, typing, hotkeys, scrolling, dragging — it does it all
🧠 Multi-Step Planning Breaks complex tasks into atomic steps and executes them sequentially
🔁 Loop Detection Remembers what it did and avoids getting stuck in infinite loops
🛡️ Safety First Asks for confirmation before dangerous actions (deleting files, sending emails)
🏠 Local Model Support Run 100% offline with Ollama — no data leaves your machine
🎨 Beautiful CLI Rich terminal UI with colors, progress bars, and step visualization

🏗️ Architecture

                    ┌─────────────────────────┐
                    │    🗣️ Natural Language    │
                    │      User Command        │
                    └───────────┬─────────────┘
                                │
                                ▼
                    ┌─────────────────────────┐
                    │    📋 Planner (LLM)      │
                    │  Break command into      │
                    │  atomic actions          │
                    └───────────┬─────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │    🔁 Main Agent Loop     │
                    │                          │
                    │  ┌──────┐   ┌──────────┐ │
                    │  │ 📷   │   │ 🧠 Vision │ │
                    │  │Screen│──▶│  AI (LLM) │ │
                    │  │Capture│  │           │ │
                    │  └──────┘   └─────┬─────┘ │
                    │                   │       │
                    │              ┌────▼────┐  │
                    │              │ 🎯 Plan  │  │
                    │              │  Action  │  │
                    │              └────┬────┘  │
                    │                   │       │
                    │              ┌────▼────┐  │
                    │              │ 🖱️⌨️     │  │
                    │              │ Execute  │  │
                    │              └────┬────┘  │
                    │                   │       │
                    │         ┌─────────▼─────┐ │
                    │         │ 🧠 Memory      │ │
                    │         │ Track & Learn  │ │
                    │         └───────────────┘ │
                    └───────────────────────────┘
                                │
                    ┌───────────▼─────────────┐
                    │     💻 Your Desktop      │
                    │   (any application)      │
                    └─────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • Tesseract OCR (install guide)
  • OpenAI API key (or Ollama running locally)

Installation

# Clone the repo
git clone https://github.com/bayuuuu18/screen-agent.git
cd screen-agent

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure
cp .env.example .env
# Edit .env with your API key

Run

# Interactive mode (recommended)
python screen_agent.py

# Single command mode
python screen_agent.py --command "open Notepad and type Hello World"

# Use local Ollama model
python screen_agent.py --provider ollama --model llava

Or install as a package

pip install -e .
screen-agent  # launch interactive mode

📖 How It Works

1. Capture — See the Screen

The agent takes a screenshot using mss (fast, cross-platform) and converts it to base64 for the vision API.

2. Analyze — Understand What's There

GPT-4o Vision (or a local model) analyzes the screenshot, identifies UI elements, reads text, and understands the current state.

3. Plan — Decide What to Do

Based on the user's command and the current screen state, the AI generates a sequence of atomic actions (click, type, scroll, etc.).

4. Execute — Do It

Actions are executed via pyautogui — moving the mouse, clicking, typing text, pressing hotkeys.

5. Remember — Learn From What Happened

Short-term memory tracks completed actions, detects loops, and maintains context across multi-step tasks.


💬 Usage Examples

Interactive Mode

$ python screen_agent.py

  Screen Agent v0.1.0 — AI Desktop Copilot
  
  🤖 What would you like me to do?
  > open calculator and compute 42 * 17
  
  📋 Plan: 4 steps
  [1/4] 🔍 Analyzing screen...
  [1/4] 🖱️  Clicking Start menu (20, 1050)
  [2/4] ⌨️  Typing: calculator
  [3/4] 🖱️  Clicking Calculator app (180, 420)
  [4/4] ⌨️  Typing: 42*17=
  
  ✅ Task complete! Result: 714

Python API

from agent.core import ScreenAgent

agent = ScreenAgent(provider="openai")  # or "ollama"
result = agent.execute("Take a screenshot and tell me what's on screen")
print(result)

# Multi-step task
result = agent.execute("Open Firefox, go to github.com, and search for 'screen-agent'")

Programmatic Actions

from agent.screen import ScreenCapture
from agent.controller import MouseKeyboard

screen = ScreenCapture()
mk = MouseKeyboard()

# Screenshot
img = screen.capture_full()
img.save("my_screenshot.png")

# Type and click
mk.click(500, 300)
mk.type_text("Hello, World!")
mk.hotkey("ctrl", "a")
mk.hotkey("ctrl", "c")

🆚 Comparison

Capability Screen Agent AutoGPT AgentGPT Open Interpreter
See the screen
Control mouse/keyboard ✅ (terminal)
Natural language input
Multi-step planning
GUI app interaction
Local model support
Visual understanding
Safety confirmations Partial
Loop detection
Open source

⚙️ Configuration

Edit .env to customize:

# API Keys
OPENAI_API_KEY=sk-your-key-here

# Ollama (for local models)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llava

# Provider: "openai" or "ollama"
AI_PROVIDER=openai
VISION_MODEL=gpt-4o

# Screen capture
SCREENSHOT_QUALITY=80        # JPEG quality (1-100)
SCREENSHOT_SCALE=0.5         # Scale factor for API (smaller = faster)

# Safety
CONFIRM_DANGEROUS_ACTIONS=true
MAX_STEPS_PER_TASK=20

# Mouse
MOUSE_SPEED=0.5              # Seconds for mouse movement
CLICK_PAUSE=0.1              # Pause after click

🗺️ Roadmap

  • Screenshot capture & analysis
  • Mouse & keyboard control
  • Multi-step task planning
  • GPT-4o Vision support
  • Ollama local model support
  • Safety confirmations
  • Loop detection & memory
  • 🔜 Voice input (speech-to-text)
  • 🔜 Element highlighting overlay
  • 🔜 Undo/rollback actions
  • 🔜 Task recording & replay
  • 🔜 Web UI dashboard
  • 🔜 Plugin system
  • 🔜 macOS & Linux optimizations
  • 🔜 Action confidence scoring
  • 🔜 Multi-monitor support
  • 🔜 Screen recording to create training data

🤝 Contributing

We love contributions! See CONTRIBUTING.md for guidelines.

Good first issues:

  • 🐛 Bug fixes
  • 📝 Documentation improvements
  • ✨ New action types (drag & drop, gesture support)
  • 🧪 Test coverage

⭐ Star History

Star History Chart


📜 License

MIT License — see LICENSE for details.


🙏 Acknowledgments

  • OpenAI for GPT-4o Vision
  • Ollama for local LLM support
  • PyAutoGUI for desktop automation
  • Rich for the beautiful CLI
  • mss for fast screenshots

Built with ❤️ by bayuuuu18

If you find this useful, please ⭐ the repo!

About

AI desktop agent that sees your screen and controls your computer with natural language | GPT-4o Vision | Ollama

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages