v0.6.7

fawadss1 released this 10 Jun 11:45
· 36 commits to master since this release

Changed

  • Browser engine — single persistent browser for both proxy and non-proxy modes
    Previously, proxy mode spawned a fresh Chrome process for every request and tore it down
    immediately after, making concurrent proxy crawls extremely expensive. The engine now runs
    one persistent browser regardless of whether a proxy is configured.
    A local auth-injecting relay (_start_proxy_relay) is started once at browser initialisation
    and the browser is launched with --proxy-server=http://127.0.0.1:<relay_port> baked in.
    Each request opens an isolated tab (via new_tab=True) and closes it when done — identical
    to non-proxy mode. Proxy credentials are injected at the TCP level by the relay and never
    touch the browser.
    Impact: one Chrome process per spider instead of one per request; dramatically lower memory
    and startup overhead on proxy-enabled crawls.
  • Browser engine — splash screen loaded once at startup, not per request
    The project logo / chrome://welcome splash was previously loaded in every request tab as a
    warm-up step before navigating to the real target. It is now loaded once on browser.main_tab
    immediately after the browser starts (_start()), warming up the renderer, stealth patches,
    and (when proxied) the relay tunnel — before any spider request arrives. Request tabs navigate
    directly to the target URL with no splash overhead.
  • Browser engine — early return on non-2xx responses
    _do_fetch now reads the HTTP status code before waiting for page content. Responses in the
    2xx range receive the full _wait_for_content() + settle delay as before. Non-2xx responses
    (4xx, 5xx) skip the content wait and return immediately with whatever the browser has already
    rendered, avoiding up to 10 seconds of unnecessary polling on error pages.
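
The early-return behaviour described in the last item can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual _do_fetch: the function name fetch_after_navigation, the page.html() method, and the read_status / wait_for_content callables are hypothetical stand-ins for the engine's internals.

```python
import asyncio


async def fetch_after_navigation(page, read_status, wait_for_content, settle_delay=0.5):
    """Return (status, html), skipping the content wait for non-2xx responses.

    `page.html()`, `read_status` and `wait_for_content` are hypothetical
    stand-ins; the real engine's names and signatures may differ.
    """
    status = await read_status(page)
    if 200 <= status < 300:
        # Successful response: wait for real content, then let the page settle.
        await wait_for_content(page)
        await asyncio.sleep(settle_delay)
    # Error responses fall through and return whatever is already rendered.
    return status, await page.html()
```

Reading the status first means the slow path is only paid for when there is likely to be real content to wait for.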

Added

  • _wait_for_status(page, timeout=8.0) utility
    The Navigation Timing API (performance.getEntriesByType('navigation')[0].responseStatus)
    is written asynchronously by Chrome and can return 0 immediately after page.wait(),
    especially through a proxy or after redirects. The new helper polls every 250 ms until a
    non-zero status is available, then returns it. If the entry never populates within the
    timeout (8 seconds by default; a rare SPA edge case), it falls back to 200, the safest
    assumption when the page loaded but left no timing entry. The _JS_STATUS default changed
    from ?? 200 to ?? 0 to expose the "not ready" state to the poller rather than masking it.

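The polling approach behind _wait_for_status can be sketched as below. This is an illustrative approximation, not the shipped helper: it assumes the page object exposes an async evaluate(expression) method, the JS_STATUS string is an assumed form of the expression (the optional chaining is an addition), and FakePage is a hypothetical stand-in so the sketch runs without a browser.

```python
import asyncio
import time

# Assumed form of the status expression, with the "?? 0" default: 0 means "not ready yet".
JS_STATUS = "performance.getEntriesByType('navigation')[0]?.responseStatus ?? 0"


async def wait_for_status(page, timeout=8.0, interval=0.25, fallback=200):
    """Poll until a non-zero HTTP status is available, else return `fallback`."""
    deadline = time.monotonic() + timeout
    while True:
        status = await page.evaluate(JS_STATUS)  # assumed async evaluate() method
        if status:
            return int(status)
        if time.monotonic() >= deadline:
            # The page loaded but never produced a timing entry.
            return fallback
        await asyncio.sleep(interval)


class FakePage:
    """Hypothetical stand-in whose status becomes available after a few polls."""

    def __init__(self, statuses):
        self._statuses = list(statuses)

    async def evaluate(self, expression):
        # Return the next queued value; repeat the last one once the queue drains.
        if len(self._statuses) > 1:
            return self._statuses.pop(0)
        return self._statuses[0]


if __name__ == "__main__":
    # Two "not ready" reads, then the real status.
    print(asyncio.run(wait_for_status(FakePage([0, 0, 404]), interval=0.01)))  # 404
```

Returning 0 from the JavaScript side is what lets the loop distinguish "not written yet" from a genuine 200.
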
Fixed

  • Browser engine — ConnectionResetError / BrokenPipeError log noise on Windows
    On Windows with Python 3.13+, closing a Chrome tab or stopping the browser triggers
    _ProactorBasePipeTransport._call_connection_lost(), which raises
    ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the
    remote host. This is harmless, since the connection is already gone, but asyncio logged it
    as an unhandled exception on every tab close. The loop exception handler now suppresses
    ConnectionResetError, BrokenPipeError, and raw OSError with winerror == 10054
    (the unwrapped variant seen on some Python 3.14 builds).
  • Browser engine — relay and tab-semaphore torn down correctly on browser restart
    _reset_browser() now closes the proxy relay server and clears _relay_server /
    _relay_port before spinning up a new event loop, so the restarted browser gets a fresh
    relay rather than pointing at a dead port.
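
The Windows log-noise fix relies on a custom asyncio loop exception handler. A minimal sketch of that idea is shown below; the names is_benign_disconnect, quiet_exception_handler, and install are hypothetical, and the project's real handler may filter differently. Only loop.set_exception_handler and loop.default_exception_handler are standard asyncio APIs.

```python
import asyncio

WSAECONNRESET = 10054  # "connection forcibly closed by the remote host"


def is_benign_disconnect(exc):
    """True for the harmless errors raised when a pipe or socket is already gone."""
    if isinstance(exc, (ConnectionResetError, BrokenPipeError)):
        return True
    # Raw OSError variant; `winerror` only exists on Windows, hence getattr.
    return isinstance(exc, OSError) and getattr(exc, "winerror", None) == WSAECONNRESET


def quiet_exception_handler(loop, context):
    """Drop benign disconnect errors; defer everything else to asyncio's default."""
    if is_benign_disconnect(context.get("exception")):
        return
    loop.default_exception_handler(context)


def install(loop):
    """Attach the filtering handler to an event loop."""
    loop.set_exception_handler(quiet_exception_handler)
```

Delegating to the default handler keeps genuine errors visible in the log, so only the known-harmless cases are silenced.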