Purpose: This repository is for learning and demonstrating browser automation with Playwright. The main application shows a full pipeline (tiered fetchers, auth capture, POST/UI flows). The learning/ directory holds small, numbered scripts that each teach one idea—launching a browser, contexts, pools, CDP, containers, and related topics—building from a minimal baseline toward patterns used in real automation.
Use automation only on systems and targets you are authorized to test.
Tiered fetching via Playwright (CDP → Pool → Cluster → Human). Normal runs are clean: ephemeral browser (launch() + new_context()), no repo-level persistent profile. Auth is loaded from sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers). Legacy storage_state.json is still supported.
Output uses Rich for colored, structured terminal output (status badges, metrics table). First <p> text is extracted from the live DOM after JS/React render.
POST / UI submission – HTTP POST and browser-driven form flows use the same fetcher pipeline as GET. POST payloads are configured under posts/ (see below); run via menu.py or python post.py.
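As a rough illustration of the auth loading described above, the sketch below maps a URL to sites/{domain}/auth.json and turns the file's contents into keyword arguments for Playwright's browser.new_context(). The helper names (auth_file_for, context_kwargs, load_context_kwargs) are hypothetical, and the assumed file shape (top-level "cookies" and "headers" keys) is a guess at the format, not the repository's confirmed schema.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def auth_file_for(url: str, sites_root: Path = Path("sites")) -> Path:
    """Map a URL to sites/{domain}/auth.json (hypothetical helper)."""
    domain = urlparse(url).hostname or ""
    return sites_root / domain / "auth.json"


def context_kwargs(auth: dict) -> dict:
    """Translate an auth dict into keyword arguments for browser.new_context().

    Assumes "cookies" is a list of Playwright-style cookie dicts and
    "headers" is a plain name -> value mapping.
    """
    kwargs = {}
    if auth.get("cookies"):
        # storage_state accepts a dict with "cookies" and "origins" keys.
        kwargs["storage_state"] = {"cookies": auth["cookies"], "origins": []}
    if auth.get("headers"):
        kwargs["extra_http_headers"] = dict(auth["headers"])
    return kwargs


def load_context_kwargs(url: str, sites_root: Path = Path("sites")) -> dict:
    """Return new_context() kwargs for a URL, or {} when no auth is saved."""
    path = auth_file_for(url, sites_root)
    if not path.is_file():
        return {}
    return context_kwargs(json.loads(path.read_text(encoding="utf-8")))
```

With a launched browser, this could be used as browser.new_context(**load_context_kwargs(url)), so a site without saved auth still gets a clean, ephemeral context.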
Scripts are numbered 101–110 and are meant to be read and run in order. Each file’s docstring explains the concept (e.g. isolated context vs persistent profile, resource blocking, concurrency, connecting over CDP). They are standalone examples, not imports for the main package.
Some lessons use a local ./browser_profile directory when demonstrating persistent Chromium profiles; that folder is gitignored and is only created if you run those scripts. The main app uses ephemeral launches and, for login, sites/{domain}/.login_profile when persistent login is enabled—not a shared browser_profile/ at the repo root.
Save login state per site (e.g. example.com → sites/example.com/):
- Run python menu.py
- Choose "Add login" and enter the login URL
- Browser opens; log in manually, then press Enter when done
- Auth is saved to sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers)
- Requests to that domain use the saved auth (cookies, storage, and derived headers such as Authorization: Bearer where configured)
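The manual login flow above could be approximated with Playwright's sync API as follows. This is a minimal sketch, not the menu's actual implementation: the function names are hypothetical, and it writes Playwright's own storage state (cookies and localStorage only), so it resembles the legacy storage_state.json format rather than the fuller auth.json. Placing that file inside sites/{domain}/ is also an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def site_dir_for(login_url: str, sites_root: Path = Path("sites")) -> Path:
    """Return the per-site directory for a login URL (hypothetical helper)."""
    return sites_root / (urlparse(login_url).hostname or "unknown")


def capture_login(login_url: str, sites_root: Path = Path("sites")) -> Path:
    """Open a visible browser, wait for a manual login, then save the state."""
    # Imported lazily so the path helper works without Playwright installed.
    from playwright.sync_api import sync_playwright

    out_dir = site_dir_for(login_url, sites_root)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "storage_state.json"  # assumed location

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible window for login
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Log in in the browser window, then press Enter here... ")
        # Saves cookies and localStorage; sessionStorage is not included.
        context.storage_state(path=str(out_file))
        browser.close()
    return out_file
```

Capturing sessionStorage or request headers would need extra steps (for example page.evaluate or request listeners), which this sketch leaves out.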
With FETCH_METHOD = "auto" (default), tiers are tried in order. Set FETCH_METHOD to a specific method to use only that tier.
| Tier | Method | Source | Use case |
|---|---|---|---|
| 1 | CDP | Playwright connect | Remote browser, behind firewall |
| 2 | Pool | Page pool + queue | Full speed, shared context |
| 3 | Cluster | Multi-context | Max power, parallelism |
| 4 | Human | New context per request | Stealth, no shared state |
| Tier | Falls through when |
|---|---|
| CDP | CDP_ENDPOINTS empty (skipped) or exception (can't connect to remote browser, goto timeout) |
| Pool | Pool not set up (0 URLs) or exception (goto timeout, browser crash) |
| Cluster | Cluster not set up (0 URLs) or exception (goto timeout, browser crash) |
| Human | Exception (browser launch, goto timeout). Last tier – if it fails, fetch fails |
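The fall-through rules in the two tables can be sketched as a simple loop. The code below is an illustration under assumptions, not the logic in main.py: fetchers is a hypothetical mapping from tier name to a callable that takes a URL, a missing entry stands in for a tier that is not set up, and any exception moves on to the next tier.

```python
from typing import Callable, Dict, List, Tuple

TIER_ORDER = ["cdp", "pool", "cluster", "human"]


def tiers_for(method: str) -> List[str]:
    """Resolve a FETCH_METHOD value to the list of tiers to try."""
    if method == "auto":
        return list(TIER_ORDER)
    if method in TIER_ORDER:
        return [method]
    raise ValueError(f"unknown FETCH_METHOD: {method!r}")


def fetch_with_cascade(
    url: str,
    fetchers: Dict[str, Callable[[str], str]],
    method: str = "auto",
) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, result) from the first success."""
    errors = []
    for name in tiers_for(method):
        fetcher = fetchers.get(name)
        if fetcher is None:
            # Tier not set up (e.g. no CDP endpoints): skip it.
            errors.append(f"{name}: not configured (skipped)")
            continue
        try:
            return name, fetcher(url)
        except Exception as exc:  # connection failure, timeout, crash, ...
            errors.append(f"{name}: {exc}")
    # Nothing succeeded, including the last tier tried.
    raise RuntimeError(f"all tiers failed for {url}: " + "; ".join(errors))
```

With method set to a single tier name, the loop has one iteration, so a failure in that tier fails the fetch immediately.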
browser-bot/
├── main.py # GET scraper, tier cascade
├── post.py # POST runner
├── menu.py # Interactive menu (auth, run, POST, sites)
├── urls.json # URLs to fetch (JSON array)
├── posts/ # POST / prompt bundles (preferred location)
│   ├── posts.json # POST configs: url, data/json, headers (list)
│   ├── posts_single.json # UI single-mode prompts (FRIA-style or legacy)
│   └── posts_multi.json # UI multi-shot batches
├── learning/ # Teaching scripts (101–110), one concept each
├── sites/ # Per-domain config (gitignored template; see .gitignore)
│   └── {domain}/
│       ├── auth.json # Auth (cookies, localStorage, etc.)
│       └── {component}/ # e.g. chat, submissions
│           └── config.yaml # Component config (urls, posts, submission)
├── requirements.txt
├── browser_bot/
│   ├── config.py # Configuration and post-bundle loading
│   ├── poster.py # Modular POST logic (auth, form, JSON)
│   ├── metrics.py # Performance tracking
│   ├── submit/ # UI submission (single / multi)
│   ├── browser/
│   │   ├── launcher.py # Browser/context launch
│   │   └── routes.py # Resource blocking
│   └── fetchers/
│       ├── base.py # FetchResult, BaseFetcher
│       ├── cdp.py # Tier 1
│       ├── pool.py # Tier 2
│       ├── cluster.py # Tier 3
│       └── human.py # Tier 4
Root-level posts.json is still supported as a fallback for POST configs if posts/posts.json is missing.
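The preferred-then-legacy lookup for POST configs might look like the sketch below. load_posts is a hypothetical name, and the validation (a JSON array whose entries each carry a url) is inferred from the description of posts.json rather than taken from browser_bot/config.py.

```python
import json
from pathlib import Path


def load_posts(repo_root: Path = Path(".")) -> list:
    """Load POST configs from posts/posts.json, else from root posts.json."""
    candidates = [
        repo_root / "posts" / "posts.json",  # preferred location
        repo_root / "posts.json",            # legacy fallback
    ]
    for path in candidates:
        if not path.is_file():
            continue
        posts = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(posts, list):
            raise ValueError(f"{path} must contain a JSON array")
        for index, entry in enumerate(posts):
            if not isinstance(entry, dict) or "url" not in entry:
                raise ValueError(f"{path}: entry {index} is missing 'url'")
        return posts
    return []  # neither file exists
```

Returning an empty list when neither file exists is a design choice made for this sketch; raising an error instead would be equally reasonable.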
pip install -r requirements.txt
playwright install chromium-browser

Optional: set GEMINI_API_KEY in the environment (or in a local *.env file that is gitignored) for features that call the Gemini API. Never commit API keys.
# Interactive menu: prompts for site/component on launch and uses them for the rest of the session
python menu.py
# Or run directly (uses root urls.json; POSTs from posts/ or legacy posts.json)
python main.py # GET scraper
python post.py # POST requests

On launch, menu.py asks for a site (numbered list) and a component (select an existing one or enter a new one). Use option 7 to change the site or component during the session.
Prefer posts/posts.json: a JSON array of objects with url, optional data (form), json (JSON body), and headers. The same fetcher cascade as GET applies, so auth and FETCH_METHOD matter.
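To make the entry shape concrete, here is one way a posts.json entry could be translated into arguments for Playwright's APIRequestContext.post(), where data carries a JSON-serializable body and form carries URL-encoded fields. The helper name is hypothetical, rejecting entries that set both data and json is an assumption, and the project's own poster.py may build requests differently.

```python
def playwright_post_kwargs(entry: dict) -> dict:
    """Build keyword arguments for APIRequestContext.post() from one entry."""
    if "data" in entry and "json" in entry:
        # Assumption: an entry carries a form body or a JSON body, not both.
        raise ValueError("use either 'data' or 'json', not both")
    kwargs = {}
    if "json" in entry:
        # Playwright serializes a dict passed as `data` to a JSON body.
        kwargs["data"] = entry["json"]
    if "data" in entry:
        # `form` is sent as application/x-www-form-urlencoded.
        kwargs["form"] = entry["data"]
    if entry.get("headers"):
        kwargs["headers"] = dict(entry["headers"])
    return kwargs
```

Given a browser context, a call could then look like context.request.post(entry["url"], **playwright_post_kwargs(entry)), which shares the context's cookies with the request.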
Component configs – Use the menu to create configs under sites/{domain}/{component}/ (e.g. config.yaml). UI submission modes can load prompts from posts/posts_single.json or posts/posts_multi.json (see browser_bot/config.py for supported shapes).
[
{"url": "https://example.com/api/submit", "json": {"title": "Hello", "body": "World"}},
{"url": "https://example.com/form", "data": {"field": "value"}}
]

Edit browser_bot/config.py:
- URLS – Loaded from urls.json (JSON array of URLs to fetch)
- FETCH_METHOD – "auto" (try tiers in order), or "cdp" | "pool" | "cluster" | "human" to use only that method
- CHROMIUM_EXECUTABLE_PATH – System chromium-browser path (default: /usr/bin/chromium-browser). Set to None for Playwright's bundled Chromium.
- CDP_ENDPOINTS – Remote browser URLs for CDP tier (e.g. ["http://localhost:9222"])
- POOL_SIZE, CONTEXT_COUNT, PAGES_PER_CONTEXT – Concurrency tuning
- POSTS – Loaded from posts/posts.json or legacy root posts.json
Keep sites/, *.env, and any persistent profile directories out of version control (see .gitignore). They can contain cookies, tokens, and browsing-related state.