Browser Bot

Purpose: This repository is for learning and demonstrating browser automation with Playwright. The main application shows a full pipeline (tiered fetchers, auth capture, POST/UI flows). The learning/ directory holds small, numbered scripts that each teach one idea—launching a browser, contexts, pools, CDP, containers, and related topics—building from a minimal baseline toward patterns used in real automation.

Use automation only on systems and targets you are authorized to test.

Main application

Tiered fetching via Playwright (CDP → Pool → Cluster → Human). Normal runs start clean: each run launches an ephemeral browser (launch() + new_context()) and uses no repo-level persistent profile. Auth is loaded from sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers). Legacy storage_state.json files are still supported.

Output uses Rich for colored, structured terminal output (status badges, metrics table). First <p> text is extracted from the live DOM after JS/React render.
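The ephemeral flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the repository's actual loader: it assumes auth.json stores cookies in Playwright's cookie format and localStorage as a mapping of origin to key/value pairs, and the function names to_storage_state and fetch_first_paragraph are hypothetical. sessionStorage is left out because Playwright's storage_state does not carry it.

```python
import json
from pathlib import Path


def to_storage_state(auth: dict) -> dict:
    """Convert the assumed auth.json shape into a Playwright storage_state dict."""
    origins = [
        {
            "origin": origin,
            "localStorage": [{"name": k, "value": v} for k, v in items.items()],
        }
        for origin, items in auth.get("localStorage", {}).items()
    ]
    return {"cookies": auth.get("cookies", []), "origins": origins}


def fetch_first_paragraph(url: str, auth_file: Path) -> str:
    """Launch an ephemeral browser, apply saved auth, return the first <p> text."""
    # Imported lazily so to_storage_state stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    auth = json.loads(auth_file.read_text()) if auth_file.exists() else {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            storage_state=to_storage_state(auth),
            extra_http_headers=auth.get("headers") or None,
        )
        try:
            page = context.new_page()
            # Wait for network activity to settle so client-side rendering can finish.
            page.goto(url, wait_until="networkidle")
            return page.locator("p").first.inner_text()
        finally:
            context.close()
            browser.close()
```

Nothing is written to disk here: the context and browser are closed at the end of every call, which is what keeps the run free of a persistent profile.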

POST / UI submission – HTTP POST and browser-driven form flows use the same fetcher pipeline as GET. POST payloads are configured under posts/ (see below); run via menu.py or python post.py.

Learning directory (`learning/`)

Scripts are numbered 101–110 and are meant to be read and run in order. Each file’s docstring explains the concept (e.g. isolated context vs persistent profile, resource blocking, concurrency, connecting over CDP). They are standalone examples, not imports for the main package.

Some lessons use a local ./browser_profile directory when demonstrating persistent Chromium profiles; that folder is gitignored and is only created if you run those scripts. The main app uses ephemeral launches and, for login, sites/{domain}/.login_profile when persistent login is enabled—not a shared browser_profile/ at the repo root.

Auth

Save login state per site (e.g. example.com → sites/example.com/):

1. Run python menu.py
2. Choose "Add login" and enter the login URL
3. A browser opens; log in manually, then press Enter when done
4. Auth is saved to sites/{domain}/auth.json (cookies, localStorage, sessionStorage, headers)
5. Requests to that domain use the saved auth (cookies, storage, derived headers such as Authorization: Bearer where configured)
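As a rough illustration of the per-site lookup and header derivation, the sketch below maps a URL to its auth file and builds request headers. The helper names, the token_key parameter, and the assumed auth.json shape (localStorage keyed by origin) are hypothetical; the real logic lives in the repository and may differ.

```python
from pathlib import Path
from urllib.parse import urlsplit


def auth_path_for(url: str, sites_root: Path = Path("sites")) -> Path:
    """Map a URL to sites/{domain}/auth.json using the URL's hostname."""
    host = urlsplit(url).hostname
    if not host:
        raise ValueError(f"URL has no host: {url!r}")
    return sites_root / host / "auth.json"


def derived_headers(auth: dict, origin: str, token_key: str = "access_token") -> dict:
    """Merge saved headers with a Bearer header derived from a stored token.

    token_key is an assumed name; sites store tokens under different keys.
    """
    headers = dict(auth.get("headers", {}))
    token = auth.get("localStorage", {}).get(origin, {}).get(token_key)
    # Only derive the header when a token exists and none was saved explicitly.
    if token and "Authorization" not in headers:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```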

Tiers (cascade order)

With FETCH_METHOD = "auto" (default), tiers are tried in order. Set FETCH_METHOD to a specific method to use only that tier.

| Tier | Method | Source | Use case |
| --- | --- | --- | --- |
| 1 | CDP | Playwright connect | Remote browser, behind firewall |
| 2 | Pool | Page pool + queue | Full speed, shared context |
| 3 | Cluster | Multi-context | Max power, parallelism |
| 4 | Human | New context per request | Stealth, no shared state |
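To make the "page pool + queue" idea concrete, here is a minimal asyncio sketch. It is not the code in browser_bot/fetchers/pool.py; the PagePool class is hypothetical, and plain strings stand in for Playwright pages so the example runs without a browser.

```python
import asyncio


class PagePool:
    """Hand out a fixed set of reusable page objects through a queue."""

    def __init__(self, pages):
        self._queue = asyncio.Queue()
        for page in pages:
            self._queue.put_nowait(page)

    async def run(self, job):
        page = await self._queue.get()  # waits while every page is busy
        try:
            return await job(page)
        finally:
            self._queue.put_nowait(page)  # hand the page back for the next job


async def demo():
    pool = PagePool(["page-1", "page-2"])  # stand-ins for real pages

    async def fetch(url):
        async def job(page):
            await asyncio.sleep(0)  # placeholder for page.goto(url)
            return (url, page)

        return await pool.run(job)

    urls = [f"https://example.com/{i}" for i in range(5)]
    return await asyncio.gather(*(fetch(u) for u in urls))


if __name__ == "__main__":
    for url, page in asyncio.run(demo()):
        print(url, "fetched on", page)
```

The queue caps concurrency at the number of pages in the pool, and each page is reused rather than recreated per URL.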

Fall-through (why each tier fails → next)

| Tier | Falls through when |
| --- | --- |
| CDP | `CDP_ENDPOINTS` empty (skipped) or exception (can't connect to remote browser, goto timeout) |
| Pool | Pool not set up (0 URLs) or exception (goto timeout, browser crash) |
| Cluster | Cluster not set up (0 URLs) or exception (goto timeout, browser crash) |
| Human | Exception (browser launch, goto timeout). Last tier: if it fails, the fetch fails |
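The fall-through rules above amount to a try-in-order loop. The sketch below shows one way to express that, assuming each tier is a plain callable and an unconfigured tier is represented by None; the function name and the tuple-based tier list are illustrative choices, not the structure used in main.py.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], str]


def fetch_with_cascade(
    url: str,
    tiers: Sequence[Tuple[str, Optional[Fetcher]]],
    method: str = "auto",
) -> Tuple[str, str]:
    """Try tiers in order; return (tier_name, result) from the first that succeeds."""
    errors = []
    for name, fetcher in tiers:
        if method != "auto" and name != method:
            continue  # a specific method was requested, so ignore other tiers
        if fetcher is None:
            errors.append((name, "skipped: not configured"))
            continue
        try:
            return name, fetcher(url)
        except Exception as exc:  # any failure falls through to the next tier
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all tiers failed for {url}: {errors}")
```

Collecting the per-tier errors instead of discarding them makes it possible to report why each tier was skipped or failed when the last one also fails.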

Structure

browser-bot/
├── main.py                 # GET scraper, tier cascade
├── post.py                 # POST runner
├── menu.py                 # Interactive menu (auth, run, POST, sites)
├── urls.json               # URLs to fetch (JSON array)
├── posts/                  # POST / prompt bundles (preferred location)
│   ├── posts.json          # POST configs: url, data/json, headers (list)
│   ├── posts_single.json   # UI single-mode prompts (FRIA-style or legacy)
│   └── posts_multi.json    # UI multi-shot batches
├── learning/               # Teaching scripts (101–110), one concept each
├── sites/                  # Per-domain config (gitignored template; see .gitignore)
│   └── {domain}/
│       ├── auth.json       # Auth (cookies, localStorage, etc.)
│       └── {component}/    # e.g. chat, submissions
│           └── config.yaml # Component config (urls, posts, submission)
├── requirements.txt
├── browser_bot/
│   ├── config.py           # Configuration and post-bundle loading
│   ├── poster.py           # Modular POST logic (auth, form, JSON)
│   ├── metrics.py          # Performance tracking
│   ├── submit/             # UI submission (single / multi)
│   ├── browser/
│   │   ├── launcher.py     # Browser/context launch
│   │   └── routes.py       # Resource blocking
│   └── fetchers/
│       ├── base.py         # FetchResult, BaseFetcher
│       ├── cdp.py          # Tier 1
│       ├── pool.py         # Tier 2
│       ├── cluster.py      # Tier 3
│       └── human.py        # Tier 4

Root-level posts.json is still supported as a fallback for POST configs if posts/posts.json is missing.
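The fallback order can be sketched as below. This is an assumption about how such a lookup might be written; the actual loading happens in browser_bot/config.py, and load_posts is a hypothetical name.

```python
import json
from pathlib import Path


def load_posts(root: Path) -> list:
    """Load POST configs, preferring posts/posts.json over the root posts.json."""
    for candidate in (root / "posts" / "posts.json", root / "posts.json"):
        if candidate.is_file():
            return json.loads(candidate.read_text(encoding="utf-8"))
    return []  # neither file exists
```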

Setup

pip install -r requirements.txt
playwright install chromium

Optional: set GEMINI_API_KEY in the environment (or a local *.env file that is gitignored) for features that call the Gemini API—never commit API keys.

Run

# Interactive menu: prompts for site/component on launch and keeps that selection for the session
python menu.py

# Or run directly (uses root urls.json; POSTs from posts/ or legacy posts.json)
python main.py    # GET scraper
python post.py    # POST requests

On launch, menu.py asks for site (numbered list) and component (select existing or enter new). Use option 7 to change site/component during the session.

POST

Prefer posts/posts.json: a JSON array of objects with url, optional data (form), json (JSON body), and headers. The same fetcher cascade as GET applies, so auth and FETCH_METHOD matter.

Component configs – Use the menu to create configs under sites/{domain}/{component}/ (e.g. config.yaml). UI submission modes can load prompts from posts/posts_single.json or posts/posts_multi.json (see browser_bot/config.py for supported shapes).

Example posts/posts.json:

[
  {"url": "https://example.com/api/submit", "json": {"title": "Hello", "body": "World"}},
  {"url": "https://example.com/form", "data": {"field": "value"}}
]
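One way to turn a posts/posts.json entry into request arguments is sketched below. The helper name is hypothetical, and treating data and json as mutually exclusive is an assumption made for this sketch rather than a documented rule of poster.py.

```python
def to_request_kwargs(entry: dict) -> dict:
    """Validate one POST entry and return keyword arguments for a request."""
    if not entry.get("url"):
        raise ValueError("POST entry needs a 'url'")
    if "data" in entry and "json" in entry:
        # Assumption: an entry carries either a form body or a JSON body.
        raise ValueError("use either 'data' (form) or 'json', not both")
    kwargs = {"url": entry["url"], "headers": dict(entry.get("headers", {}))}
    if "json" in entry:
        kwargs["json"] = entry["json"]
    elif "data" in entry:
        kwargs["data"] = entry["data"]
    return kwargs
```

Validating entries up front surfaces a malformed config before any browser is launched.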

Config

Edit browser_bot/config.py:

URLS – Loaded from urls.json (JSON array of URLs to fetch)
FETCH_METHOD – "auto" (try tiers in order), or "cdp" | "pool" | "cluster" | "human" to use only that method
CHROMIUM_EXECUTABLE_PATH – System chromium-browser path (default: /usr/bin/chromium-browser). Set to None for Playwright's bundled Chromium.
CDP_ENDPOINTS – Remote browser URLs for CDP tier (e.g. ["http://localhost:9222"])
POOL_SIZE, CONTEXT_COUNT, PAGES_PER_CONTEXT – Concurrency tuning
POSTS – Loaded from posts/posts.json or legacy root posts.json
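How FETCH_METHOD and CDP_ENDPOINTS interact can be illustrated with a small helper. The tier names and the skip rule for an empty CDP_ENDPOINTS come from this README; the function tiers_for is hypothetical and is not part of browser_bot/config.py.

```python
TIER_ORDER = ("cdp", "pool", "cluster", "human")


def tiers_for(fetch_method: str, cdp_endpoints: list) -> list:
    """Return the tier names a run would attempt, in order."""
    if fetch_method == "auto":
        # The CDP tier is skipped when no remote endpoints are configured.
        return [t for t in TIER_ORDER if t != "cdp" or cdp_endpoints]
    if fetch_method not in TIER_ORDER:
        raise ValueError(f"unknown FETCH_METHOD: {fetch_method!r}")
    return [fetch_method]
```

For example, tiers_for("auto", []) yields pool, cluster, human, while tiers_for("human", []) yields only human.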

Privacy and local data

Keep sites/, *.env, and any persistent profile directories out of version control (see .gitignore). They can contain cookies, tokens, and browsing-related state.
