Browser engine — single persistent browser for both proxy and non-proxy modes
Previously, proxy mode spawned a fresh Chrome process for every request and tore it down
immediately after, making concurrent proxy crawls extremely expensive. The engine now runs
one persistent browser regardless of whether a proxy is configured.
A local auth-injecting relay (_start_proxy_relay) is started once at browser initialisation
and the browser is launched with --proxy-server=http://127.0.0.1:<relay_port> baked in.
Each request opens an isolated tab (via new_tab=True) and closes it when done — identical
to non-proxy mode. Proxy credentials are injected at the TCP level by the relay and never
touch the browser.
Impact: one Chrome process per spider instead of one per request; dramatically lower memory
and startup overhead on proxy-enabled crawls.
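As an illustration of what an auth-injecting relay does to each outgoing request, the sketch below adds a Proxy-Authorization header to a raw request head before it is forwarded upstream. This is a minimal sketch, not the project's _start_proxy_relay code: the function name inject_proxy_auth is hypothetical, and it assumes the upstream proxy accepts HTTP Basic credentials.

```python
import base64


def inject_proxy_auth(request_head: bytes, username: str, password: str) -> bytes:
    """Insert a Proxy-Authorization header into a raw HTTP request head.

    Hypothetical helper for illustration. `request_head` must contain the
    blank line (CRLF CRLF) that terminates the headers.
    """
    # Assumption: the upstream proxy uses HTTP Basic authentication.
    token = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    auth_line = f"Proxy-Authorization: Basic {token}\r\n".encode("ascii")

    head, sep, rest = request_head.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("incomplete request head: missing blank line")

    # Original headers, then the injected header, then the terminator.
    return head + b"\r\n" + auth_line + b"\r\n" + rest


raw = b"CONNECT example.com:443 HTTP/1.1\r\nHost: example.com:443\r\n\r\n"
patched = inject_proxy_auth(raw, "user", "pass")
```

Because the header is added inside the relay, the browser only ever sees an unauthenticated proxy on 127.0.0.1.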
Browser engine — splash screen loaded once at startup, not per request
The project logo / chrome://welcome splash was previously loaded in every request tab as a
warm-up step before navigating to the real target. It is now loaded once on browser.main_tab
immediately after the browser starts (_start()), warming up the renderer, stealth patches,
and (when proxied) the relay tunnel — before any spider request arrives. Request tabs navigate
directly to the target URL with no splash overhead.
Browser engine — early return on non-2xx responses
_do_fetch now reads the HTTP status code before waiting for page content. Responses in the
2xx range receive the full _wait_for_content() + settle delay as before. Non-2xx responses
(4xx, 5xx) skip the content wait and return immediately with whatever the browser has already
rendered, avoiding up to 10 seconds of unnecessary polling on error pages.
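The control flow described above can be sketched roughly as follows. This is not the actual _do_fetch body: the function name and its three callables (read_status, wait_for_content, read_html) are hypothetical stand-ins for whatever the engine uses internally.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


async def fetch_with_early_return(
    read_status: Callable[[], Awaitable[int]],
    wait_for_content: Callable[[], Awaitable[None]],
    read_html: Callable[[], Awaitable[str]],
    settle_delay: float = 0.0,
) -> Tuple[int, str]:
    """Wait for content only on 2xx; return error pages immediately."""
    status = await read_status()
    if 200 <= status < 300:
        # Successful response: full content wait plus settle delay.
        await wait_for_content()
        await asyncio.sleep(settle_delay)
    # Non-2xx: fall through and return whatever is already rendered.
    return status, await read_html()
```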
Added
_wait_for_status(page, timeout=8.0) utility
The Navigation Timing API (performance.getEntriesByType('navigation')[0].responseStatus)
is written asynchronously by Chrome and can return 0 immediately after page.wait(),
especially through a proxy or after redirects. The new helper polls every 250 ms until a
non-zero status is available, then returns it. If the entry never populates within 8 seconds
(rare SPA edge case) it falls back to 200 — the safest assumption when the page loaded but
left no timing entry. _JS_STATUS default changed from ?? 200 to ?? 0 to expose the
"not ready" state to the poller rather than masking it.
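The polling behaviour can be sketched as below. This is an approximation under stated assumptions, not the shipped helper: the real _wait_for_status takes a page object, whereas this version takes a hypothetical read_status callable so it can run on its own, and the interval and fallback parameters are added for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_status(
    read_status: Callable[[], Awaitable[int]],
    timeout: float = 8.0,
    interval: float = 0.25,
    fallback: int = 200,
) -> int:
    """Poll until read_status() yields a non-zero status or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await read_status()
        if status:
            # A non-zero value means the timing entry has been populated.
            return status
        if loop.time() + interval > deadline:
            # The entry never appeared; assume the page loaded normally.
            return fallback
        await asyncio.sleep(interval)
```

Returning 0 for "not ready" is what lets the loop tell a missing entry apart from a genuine 200.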
Fixed
Browser engine — ConnectionResetError / BrokenPipeError log noise on Windows
On Windows with Python 3.13+, closing a Chrome tab or stopping the browser triggers
_ProactorBasePipeTransport._call_connection_lost(), which raises
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the
remote host. This is harmless (the connection is already gone), but asyncio logged it as
an unhandled exception on every tab close. The loop exception handler now suppresses
ConnectionResetError, BrokenPipeError, and raw OSError with winerror == 10054
(the unwrapped variant seen on some Python 3.14 builds).
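A handler of this kind can be sketched with the standard asyncio API (loop.set_exception_handler and loop.default_exception_handler). The function names below are hypothetical and the project's own handler may differ in detail.

```python
import asyncio
from typing import Any, Dict, Optional


def is_benign_disconnect(exc: Optional[BaseException]) -> bool:
    """True for the disconnect errors raised when a pipe is already closed."""
    if isinstance(exc, (ConnectionResetError, BrokenPipeError)):
        return True
    # Unwrapped variant: a plain OSError carrying WinError 10054.
    # getattr keeps this safe on platforms where winerror is not defined.
    return isinstance(exc, OSError) and getattr(exc, "winerror", None) == 10054


def quiet_exception_handler(loop: Any, context: Dict[str, Any]) -> None:
    """Drop benign disconnect errors; defer everything else to asyncio."""
    if is_benign_disconnect(context.get("exception")):
        return
    loop.default_exception_handler(context)


def install_quiet_handler(loop: asyncio.AbstractEventLoop) -> None:
    loop.set_exception_handler(quiet_exception_handler)
```

Anything that is not one of the listed disconnect errors still reaches the default handler, so genuine failures remain visible in the log.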
Browser engine — relay and tab-semaphore torn down correctly on browser restart
_reset_browser() now closes the proxy relay server and clears _relay_server / _relay_port
before spinning up a new event loop, so the restarted browser gets a fresh relay rather than
pointing at a dead port.
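The teardown order can be illustrated with a small self-contained sketch. The RelayState class and its methods are hypothetical. Only the attribute names _relay_server and _relay_port come from the notes above, and the real _reset_browser() does more than this.

```python
import asyncio
from typing import Optional


class RelayState:
    """Hypothetical holder for the relay server and its port."""

    def __init__(self) -> None:
        self._relay_server: Optional[asyncio.AbstractServer] = None
        self._relay_port: Optional[int] = None

    async def _handle(self, reader, writer) -> None:
        # Placeholder connection handler; a real relay would forward traffic.
        writer.close()

    async def start_relay(self) -> None:
        # Port 0 asks the OS for any free port on the loopback interface.
        server = await asyncio.start_server(self._handle, "127.0.0.1", 0)
        self._relay_server = server
        self._relay_port = server.sockets[0].getsockname()[1]

    async def close_relay(self) -> None:
        # Clear the references first so no caller can reuse a dead port.
        server, self._relay_server, self._relay_port = self._relay_server, None, None
        if server is not None:
            server.close()
            await server.wait_closed()
```

Clearing both attributes is what forces the next start-up to create a new relay instead of reusing the stale port.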