airtasystems/browser-bot

Browser Bot

Purpose: This repository is for learning and demonstrating browser automation with Playwright. The main application shows a full pipeline (tiered fetchers, auth capture, POST/UI flows). The learning/ directory holds small, numbered scripts that each teach one idea—launching a browser, contexts, pools, CDP, containers, and related topics—building from a minimal baseline toward patterns used in real automation.

Use automation only on systems and targets you are authorized to test.

Main application

Tiered fetching via Playwright (CDP → Pool → Cluster → Human). Normal runs are ephemeral: each run starts a fresh browser (launch() + new_context()) and uses no repo-level persistent profile. Auth is loaded from sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers). Legacy storage_state.json is still supported.
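As a rough illustration of the ephemeral pattern described above, the sketch below pairs a path helper with a throwaway launch. The function names `auth_path_for` and `fetch_html_ephemeral` are hypothetical and not taken from the repository, and deriving `{domain}` from the URL hostname is an assumption; only `launch()` and `new_context()` come from the text.

```python
from pathlib import Path
from urllib.parse import urlparse


def auth_path_for(url: str, sites_root: Path = Path("sites")) -> Path:
    """Map a URL to sites/{domain}/auth.json (layout as described in this README).

    Assumption: {domain} is the URL's hostname, unchanged.
    """
    domain = urlparse(url).hostname
    if not domain:
        raise ValueError(f"URL has no hostname: {url!r}")
    return sites_root / domain / "auth.json"


def fetch_html_ephemeral(url: str) -> str:
    """Launch a throwaway browser and context, fetch one page, then tear down."""
    # Imported lazily so the path helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # A fresh context has isolated cookies and storage; nothing is persisted.
            context = browser.new_context()
            page = context.new_page()
            page.goto(url, wait_until="domcontentloaded")
            return page.content()
        finally:
            browser.close()
```

Because the browser is closed in a `finally` block, no profile directory or session state outlives the call, which is the property the ephemeral mode relies on.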

Output uses Rich for colored, structured terminal output (status badges, metrics table). First <p> text is extracted from the live DOM after JS/React render.
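One way the first-paragraph extraction could look is sketched below. It assumes `page` is a Playwright sync `Page` (or any object with the same `locator` interface) that has already finished rendering; `first_paragraph_text` is a hypothetical name, not the repository's function.

```python
from typing import Any, Optional


def first_paragraph_text(page: Any) -> Optional[str]:
    """Return the text of the first <p> in the live DOM, or None if there is none."""
    paragraphs = page.locator("p")
    # Checking count() first avoids waiting for an element that will never appear.
    if paragraphs.count() == 0:
        return None
    return paragraphs.first.inner_text().strip()
```

Reading from the live DOM rather than the raw response body means text injected by client-side rendering is included.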

POST / UI submission – HTTP POST and browser-driven form flows use the same fetcher pipeline as GET. POST payloads are configured under posts/ (see below); run via menu.py or python post.py.

Learning directory (learning/)

Scripts are numbered 101–110 and are meant to be read and run in order. Each file’s docstring explains the concept (e.g. isolated context vs persistent profile, resource blocking, concurrency, connecting over CDP). They are standalone examples, not imports for the main package.

Some lessons use a local ./browser_profile directory when demonstrating persistent Chromium profiles; that folder is gitignored and is only created if you run those scripts. The main app uses ephemeral launches and, for login, sites/{domain}/.login_profile when persistent login is enabled—not a shared browser_profile/ at the repo root.

Auth

Save login state per site (e.g. example.com → sites/example.com/):

  1. Run python menu.py
  2. Choose "Add login", enter login URL
  3. Browser opens; log in manually, press Enter when done
  4. Auth is saved to sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers)
  5. Requests to that domain use the saved auth (cookies, storage, derived headers such as Authorization: Bearer where configured)
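The last step can be pictured with a small sketch that turns saved auth into request headers. The exact schema of auth.json is not documented here, so the shape used below (a `cookies` list of name/value objects, a `localStorage` mapping, a `headers` mapping) and the `access_token` key are assumptions, and `build_request_headers` is a hypothetical helper.

```python
from typing import Any, Dict


def build_request_headers(auth: Dict[str, Any], token_key: str = "access_token") -> Dict[str, str]:
    """Derive HTTP headers from a saved auth dict (assumed shape, see lead-in)."""
    headers = dict(auth.get("headers", {}))

    # Derive a bearer token from localStorage unless a header was saved explicitly.
    token = auth.get("localStorage", {}).get(token_key)
    if token and "Authorization" not in headers:
        headers["Authorization"] = f"Bearer {token}"

    # Fold saved cookies into a single Cookie header.
    cookies = auth.get("cookies", [])
    if cookies and "Cookie" not in headers:
        headers["Cookie"] = "; ".join(f"{c['name']}={c['value']}" for c in cookies)

    return headers
```

Explicitly saved headers take precedence over derived ones, so a captured Authorization value is never overwritten.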

Tiers (cascade order)

With FETCH_METHOD = "auto" (default), tiers are tried in order. Set FETCH_METHOD to a specific method to use only that tier.

| Tier | Method | Source | Use case |
| --- | --- | --- | --- |
| 1 | CDP | Playwright connect | Remote browser, behind firewall |
| 2 | Pool | Page pool + queue | Full speed, shared context |
| 3 | Cluster | Multi-context | Max power, parallelism |
| 4 | Human | New context per request | Stealth, no shared state |
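To make the "page pool + queue" idea concrete, here is a minimal asyncio sketch in which a fixed set of pages is handed out through a queue. `PagePool` and `fetch_all` are hypothetical names, the pages can be any objects, and this is not the repository's pool implementation.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, Awaitable, Callable, Iterable, List


class PagePool:
    """Hand out a fixed set of reusable pages through an asyncio queue."""

    def __init__(self, pages: Iterable[Any]) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        for page in pages:
            self._queue.put_nowait(page)

    @asynccontextmanager
    async def acquire(self):
        page = await self._queue.get()  # waits while every page is busy
        try:
            yield page
        finally:
            self._queue.put_nowait(page)  # always return the page to the pool


async def fetch_all(
    urls: Iterable[str],
    pool: PagePool,
    fetch: Callable[[Any, str], Awaitable[str]],
) -> List[str]:
    """Fetch every URL, with at most pool-size fetches running at once."""

    async def one(url: str) -> str:
        async with pool.acquire() as page:
            return await fetch(page, url)

    # gather() returns results in the same order as the input URLs.
    return list(await asyncio.gather(*(one(u) for u in urls)))
```

The pool should be created inside the running event loop (for example within the coroutine passed to `asyncio.run`), and the pool size then caps concurrency without any extra semaphore.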

Fall-through (why each tier fails → next)

| Tier | Falls through when |
| --- | --- |
| CDP | CDP_ENDPOINTS empty (skipped) or exception (can't connect to remote browser, goto timeout) |
| Pool | Pool not set up (0 URLs) or exception (goto timeout, browser crash) |
| Cluster | Cluster not set up (0 URLs) or exception (goto timeout, browser crash) |
| Human | Exception (browser launch, goto timeout). Last tier – if it fails, fetch fails |
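The fall-through rules above amount to a try-in-order loop. The sketch below models each tier as a plain callable, with `None` standing for a tier that is not configured; `fetch_with_cascade` and `AllTiersFailed` are hypothetical names, and the real fetchers in browser_bot/fetchers/ may be structured differently.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Fetcher = Callable[[str], str]


class AllTiersFailed(RuntimeError):
    """Raised when the last tier has also failed."""


def fetch_with_cascade(
    url: str, tiers: Sequence[Tuple[str, Optional[Fetcher]]]
) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, result) from the first success."""
    errors: List[str] = []
    for name, fetcher in tiers:
        if fetcher is None:
            # Tier not set up (e.g. no CDP endpoints): skip without an attempt.
            errors.append(f"{name}: skipped (not configured)")
            continue
        try:
            return name, fetcher(url)
        except Exception as exc:  # any failure falls through to the next tier
            errors.append(f"{name}: {exc!r}")
    raise AllTiersFailed("; ".join(errors))
```

Collecting the per-tier errors keeps the final failure message informative when every tier falls through.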

Structure

browser-bot/
├── main.py                 # GET scraper, tier cascade
├── post.py                 # POST runner
├── menu.py                 # Interactive menu (auth, run, POST, sites)
├── urls.json               # URLs to fetch (JSON array)
├── posts/                  # POST / prompt bundles (preferred location)
│   ├── posts.json          # POST configs: url, data/json, headers (list)
│   ├── posts_single.json   # UI single-mode prompts (FRIA-style or legacy)
│   └── posts_multi.json    # UI multi-shot batches
├── learning/               # Teaching scripts (101–110), one concept each
├── sites/                  # Per-domain config (gitignored template; see .gitignore)
│   └── {domain}/
│       ├── auth.json       # Auth (cookies, localStorage, etc.)
│       └── {component}/    # e.g. chat, submissions
│           └── config.yaml # Component config (urls, posts, submission)
├── requirements.txt
├── browser_bot/
│   ├── config.py           # Configuration and post-bundle loading
│   ├── poster.py           # Modular POST logic (auth, form, JSON)
│   ├── metrics.py          # Performance tracking
│   ├── submit/             # UI submission (single / multi)
│   ├── browser/
│   │   ├── launcher.py     # Browser/context launch
│   │   └── routes.py       # Resource blocking
│   └── fetchers/
│       ├── base.py         # FetchResult, BaseFetcher
│       ├── cdp.py          # Tier 1
│       ├── pool.py         # Tier 2
│       ├── cluster.py      # Tier 3
│       └── human.py        # Tier 4

Root-level posts.json is still supported as a fallback for POST configs if posts/posts.json is missing.
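The fallback described above could be implemented along these lines. `load_posts` is a hypothetical helper, and returning an empty list when neither file exists is an assumption; the actual loading logic lives in browser_bot/config.py.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_posts(root: Path) -> List[Dict[str, Any]]:
    """Load POST configs, preferring posts/posts.json over a root-level posts.json."""
    for candidate in (root / "posts" / "posts.json", root / "posts.json"):
        if candidate.is_file():
            data = json.loads(candidate.read_text(encoding="utf-8"))
            if not isinstance(data, list):
                raise ValueError(f"{candidate} must contain a JSON array")
            return data
    # Assumption: no file at either location means there is nothing to post.
    return []
```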

Setup

pip install -r requirements.txt
playwright install chromium

Optional: set GEMINI_API_KEY in the environment (or a local *.env file that is gitignored) for features that call the Gemini API—never commit API keys.

Run

# Interactive menu: prompts for site/component on launch and keeps them for the session
python menu.py

# Or run directly (uses root urls.json; POSTs from posts/ or legacy posts.json)
python main.py    # GET scraper
python post.py    # POST requests

On launch, menu.py asks for site (numbered list) and component (select existing or enter new). Use option 7 to change site/component during the session.

POST

Prefer posts/posts.json: a JSON array of objects with url, optional data (form), json (JSON body), and headers. The same fetcher cascade as GET applies, so auth and FETCH_METHOD matter.

Component configs – Use the menu to create configs under sites/{domain}/{component}/ (e.g. config.yaml). UI submission modes can load prompts from posts/posts_single.json or posts/posts_multi.json (see browser_bot/config.py for supported shapes).

[
  {"url": "https://example.com/api/submit", "json": {"title": "Hello", "body": "World"}},
  {"url": "https://example.com/form", "data": {"field": "value"}}
]
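Entries like the ones above can be checked and normalized before they are sent. The sketch below is one possible approach: `to_request_kwargs` is a hypothetical helper, and rejecting entries that carry both `data` and `json` is an assumption rather than documented behaviour.

```python
from typing import Any, Dict


def to_request_kwargs(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Validate one POST config entry and return keyword arguments for a request."""
    if not entry.get("url"):
        raise ValueError("POST entry needs a 'url'")
    # Assumption: a request body is either form data or JSON, never both.
    if "data" in entry and "json" in entry:
        raise ValueError("use either 'data' (form) or 'json' (JSON body), not both")

    kwargs: Dict[str, Any] = {
        "url": entry["url"],
        "headers": dict(entry.get("headers", {})),
    }
    for key in ("data", "json"):
        if key in entry:
            kwargs[key] = entry[key]
    return kwargs
```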

Config

Edit browser_bot/config.py:

  • URLS – Loaded from urls.json (JSON array of URLs to fetch)
  • FETCH_METHOD – "auto" (try tiers in order), or "cdp" | "pool" | "cluster" | "human" to use only that method
  • CHROMIUM_EXECUTABLE_PATH – System chromium-browser path (default: /usr/bin/chromium-browser). Set to None for Playwright's bundled Chromium.
  • CDP_ENDPOINTS – Remote browser URLs for CDP tier (e.g. ["http://localhost:9222"])
  • POOL_SIZE, CONTEXT_COUNT, PAGES_PER_CONTEXT – Concurrency tuning
  • POSTS – Loaded from posts/posts.json or legacy root posts.json

Privacy and local data

Keep sites/, *.env, and any persistent profile directories out of version control (see .gitignore). They can contain cookies, tokens, and browsing-related state.
