Skip to content

refactor: replace scraper with lol_html for HTML charset prescanning#398

Merged
barjin merged 4 commits intomasterfrom
copilot/make-scraper-optional
Mar 4, 2026
Merged

refactor: replace scraper with lol_html for HTML charset prescanning#398
barjin merged 4 commits intomasterfrom
copilot/make-scraper-optional

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 4, 2026

  • Identify scraper usage (only in impit/src/response_parsing/mod.rs)
  • Replace scraper with lol_html = "2.7.2" in impit/Cargo.toml
  • Rewrite prescan_bytestream in response_parsing/mod.rs to use lol_html
  • Build and test verified passing
  • Regenerate impit-node/yarn.lock to fix CI (updated native packages from v0.10.0 → v0.10.1)
Original prompt

This section details on the original issue you should resolve

<issue_title>Make scraper dependency optional or replace with lightweight alternative</issue_title>
<issue_description>I'm using impit as a library for browser fingerprinted HTTP requests. Impit making scraper crate a required dependency causes two problems:

First problem: namespace collision. Scraper is a common crate name. When impit pulls it in as a transitive dependency, any workspace that also has a crate named scraper gets ambiguous resolution, for example I had to specify exact version of my own crate in order for this command to run: cargo run -p scraper@0.1.0 --bin scraper --, otherwise I was getting an error:

error: There are multiple `scraper` packages in your project, and the specification `scraper` is ambiguous.

Since scraper is a generic name, this is likely to affect other consumers too.

Second problem: using scraper as dependency is too heavyweight for what it does. The only usage of scraper crate I found is in response_parsing/mod.rs — prescan_bytestream() parses the first 1024 bytes into a full html5ever DOM tree just to read and attributes. This pulls in html5ever, selectors, ego-tree, cssparser, etc. for a two-selector lookup.

Additionally, this encoding detection isn't even wired into the response pipeline — Impit::send() returns a raw reqwest::Response. The decode() function is only exposed as an opt-in utility via impit::utils, so consumers who don't call it still pay the compile cost.

Suggested fixes (either would work):

  • Make scraper optional behind a feature flag (e.g. encoding-detection) so consumers who don't use impit::utils::decode don't pull it in.
  • Replace "scraper" with lol_html, a streaming HTML parser that can extract charset from tags without building any node trees. Different name avoids the namespace issue, and it's far lighter for this use case.</issue_description>

<agent_instructions>Try replacing scraper with lol_html (or similar lightweight html parser), as described in the issue. Do not add the additional feature flag. Make minimal changes to achieve this goal. Use conventional commits for the commit messages and PR titles. Be as concise as possible, do not add excessive comments.</agent_instructions>

Comments on the Issue (you are @copilot in this section)


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: barjin <61918049+barjin@users.noreply.github.com>
Copilot AI changed the title [WIP] Make scraper dependency optional or replace with alternative refactor: replace scraper with lol_html for HTML charset prescanning Mar 4, 2026
@github-actions github-actions bot added this to the 135th sprint - Tooling team milestone Mar 4, 2026
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 4, 2026
@barjin
Copy link
Copy Markdown
Member

barjin commented Mar 4, 2026

@copilot regenerate the yarn.lock lockfile so the CI passes

Co-authored-by: barjin <61918049+barjin@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 4, 2026

@copilot regenerate the yarn.lock lockfile so the CI passes

Done in eb547e8 — updated impit-node/yarn.lock to bump native packages from v0.10.0 → v0.10.1, matching the current package.json version. yarn install --immutable now passes locally.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR removes the scraper dependency from the impit Rust crate by rewriting the HTML charset prescanning logic to use lol_html, addressing both the “heavyweight dependency” concern and the common-crate-name workspace collision described in #396. It also updates the Node wrapper lockfile to keep CI green after native package updates.

Changes:

  • Replaced DOM-based charset extraction (scraper) with streaming parsing via lol_html in prescan_bytestream.
  • Swapped scraper = "0.25.0" for lol_html = "2.7.2" in impit dependencies (and updated Cargo.lock accordingly).
  • Regenerated impit-node/yarn.lock and normalized impit-node/package.json optionalDependencies entries for native package versions.

Reviewed changes

Copilot reviewed 3 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
impit/src/response_parsing/mod.rs Reimplements HTML prescan encoding detection using lol_html element handlers instead of scraper selectors/DOM.
impit/Cargo.toml Replaces scraper dependency with lol_html.
Cargo.lock Reflects dependency graph changes from swapping scraperlol_html.
impit-node/yarn.lock Updates native package resolutions (v0.10.0 → v0.10.1) to match regenerated lockfile state.
impit-node/package.json Reorders/normalizes optionalDependencies entries (no functional changes observed).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@barjin barjin marked this pull request as ready for review March 4, 2026 13:04
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 4, 2026
@barjin barjin merged commit 96a8a89 into master Mar 4, 2026
54 checks passed
@barjin barjin deleted the copilot/make-scraper-optional branch March 4, 2026 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make scraper dependency optional or replace with lightweight alternative

4 participants