Optimize get_text with a native selectolax text() fast path (~7% faster parse) #145
Merged
Parse timings are only comparable within one interpreter build, so print the Python version/implementation/platform (and WebSearcher version) at the top of every benchmark and profile run.
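A minimal sketch of such a run header, using only the standard library. The function names and the idea of looking the version up by distribution name are assumptions for illustration; this is not the code in scripts/bench_parse.py.

```python
from importlib import metadata
import platform


def package_version(dist_name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Hypothetical fallback text; the real script may handle this differently.
        return "not installed"


def environment_header(dist_name="WebSearcher"):
    """Build one line describing the interpreter build and package version."""
    return (
        f"Python {platform.python_version()} "
        f"({platform.python_implementation()}) on {platform.platform()}; "
        f"{dist_name} {package_version(dist_name)}"
    )


if __name__ == "__main__":
    # Printed once, before any timings, so runs can be matched to a build.
    print(environment_header())
```

Printing the header before the first timing keeps it attached to the numbers even when only the tail of a log is copied into a plan document.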
The pure-Python get_text fragment walker was the largest optimizable cost in a fresh benchmark of the post-selectolax parse pipeline (~18% cumulative, 824k fragment visits/870 parses). Delegate to lexbor's C text() when it is provably byte-identical: the subtree has no script/style/template (native includes their text; the walker skips it) AND either separator=='' or strip is False (so native's kept empty fragments are invisible). Every other call keeps the walker. Verified byte-identical over the full fixture corpus (315k nodes, 0 mismatches) and the snapshot suite stays green without updates (336 passed, 87 snapshots). Back-to-back A/B on Python 3.13: corpus 3872 -> 3590 ms (-7.3%), median 39.9 -> 36.7 ms/SERP, well above the ~0.5% noise floor. Also record the interpreter version/platform at the top of every bench_parse run, since parse timings are only comparable within one Python build.
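The separator/strip condition can be illustrated with a simplified model that treats a subtree as a flat list of text fragments. Both functions below are assumptions inferred from the description above (native keeps empty fragments, the walker drops them when stripping); they are not the project's code and do not call selectolax.

```python
def walker_join(fragments, separator="", strip=False):
    """Model of the Python walker: with strip=True, strip and drop empties."""
    if strip:
        parts = [f.strip() for f in fragments]
        parts = [p for p in parts if p]
    else:
        parts = list(fragments)
    return separator.join(parts)


def native_join(fragments, separator="", strip=False):
    """Model of native text(): strips when asked, but keeps empty fragments."""
    parts = [f.strip() for f in fragments] if strip else list(fragments)
    return separator.join(parts)


def joins_are_equivalent(separator, strip):
    """The separator/strip half of the fast-path condition."""
    return separator == "" or not strip


if __name__ == "__main__":
    fragments = ["Hello", "  ", "world "]
    for sep, strip in [("", False), ("", True), (" ", False), (" ", True)]:
        same = walker_join(fragments, sep, strip) == native_join(fragments, sep, strip)
        print(repr(sep), strip, "eligible:", joins_are_equivalent(sep, strip), "same:", same)
```

In this model the only combination that differs is a visible separator with strip=True: the whitespace-only fragment becomes an empty string that native still joins, which produces a doubled separator.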
Pull request overview
This PR optimizes the Selectolax-based parsing pipeline by adding a correctness-preserving fast path in get_text that delegates to Selectolax’s native Node.text() when it is provably byte-identical to the existing Python fragment walker, improving parse_serp latency (~7–8% in the provided benchmark).
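As a quick arithmetic check, the range quoted here can be recomputed from the A/B figures given in the commit message (corpus 3872 -> 3590 ms, median 39.9 -> 36.7 ms/SERP). The helper name is hypothetical.

```python
def percent_faster(before, after):
    """Relative reduction in time, as a percentage of the 'before' value."""
    return (before - after) / before * 100.0


corpus_gain = percent_faster(3872, 3590)   # whole-corpus wall time, ms
median_gain = percent_faster(39.9, 36.7)   # median ms per SERP
noise_floor = 0.5                          # percent, as stated in the PR

print(f"corpus: {corpus_gain:.1f}%  median: {median_gain:.1f}%")
# prints: corpus: 7.3%  median: 8.0%
```

Both figures are more than ten times the stated noise floor, which is why the improvement is treated as real rather than run-to-run variation.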
Changes:
- Add a fast path in `WebSearcher._slx.get_text()` to use native `node.text(...)` when subtree/tag conditions guarantee equivalence to the Python walker.
- Enhance `scripts/bench_parse.py` to print interpreter/platform and WebSearcher version for benchmark comparability.
- Add plan documentation (035 done, 036 proposed) and record the optimization in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| WebSearcher/_slx.py | Adds native text() delegation fast path in get_text under equivalence conditions. |
| scripts/bench_parse.py | Prints Python/platform + package version at benchmark start to contextualize timings. |
| docs/plans/035-get-text-native-fastpath.md | Documents the benchmark, equivalence argument, and measured performance win. |
| docs/plans/036-component-signals-and-extractor-hotpath.md | Proposes follow-up performance work focusing on _ComponentSignals and extractor profiling. |
| CHANGELOG.md | Notes the get_text fast-path optimization and benchmark result under Unreleased. |
…e only as descendants
Addresses a PR review note: the Python walker skips those tags only when they are descendants, not when the root node is itself one, which is why the fast path needs both the node.tag guard and the css_first descendant probe.
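The two checks can be sketched on a toy tree. Everything here (the `Node` class and all function names) is hypothetical and stands in for selectolax nodes; text is modeled as "own text, then children", which is simpler than a real DOM. Treating a script/style/template root as ineligible is a conservative reading of the note, not a claim about where the real outputs differ.

```python
from dataclasses import dataclass, field

SKIPPED_TAGS = {"script", "style", "template"}


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def native_text(node):
    """Model of native text(): concatenates every text in the subtree."""
    return node.text + "".join(native_text(c) for c in node.children)


def walker_text(node):
    """Model of the walker: skips skipped-tag descendants, never the root."""
    return node.text + "".join(
        walker_text(c) for c in node.children if c.tag not in SKIPPED_TAGS
    )


def has_skipped_descendant(node):
    """Stand-in for the descendant probe (css_first in the real code)."""
    return any(
        c.tag in SKIPPED_TAGS or has_skipped_descendant(c) for c in node.children
    )


def fast_path_eligible(node):
    # Two checks: a guard on the root's own tag, plus the descendant probe.
    return node.tag not in SKIPPED_TAGS and not has_skipped_descendant(node)
```

A subtree that passes both checks yields the same string from both models, so delegating is safe; a subtree with a script descendant does not, so it must stay on the walker.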
Summary
A fresh benchmark of the parse pipeline (the first since the selectolax native rewrite in plan 026 and the parser additions in plans 027–034) found that the pure-Python `get_text` fragment walker had become the single largest optimizable cost (`make_soup`'s lexbor parse is bigger but structural). This PR adds a byte-identical native-`text()` fast path that recovers ~7% of `parse_serp` latency.
The change
`get_text` delegates to lexbor's C `text()` when it is provably equivalent to the Python fragment walker:
- the subtree has no `script`/`style`/`template` (native includes their text; the walker skips it), and
- `separator == ""` (an empty fragment adds nothing to a `""`-join, so native's kept empties are invisible) or `strip is False` (both keep empties identically).
Every other call keeps the walker, notably the 38 `get_text(x, " ", strip=True)` sites (drop-empties with a visible separator), which is the one case where native and the walker diverge.
Also:
- `scripts/bench_parse.py` now records the interpreter version/platform at the top of every run, since parse timings are only comparable within one Python build.
Correctness (byte-identical)
- Verified byte-identical over the full fixture corpus (315k nodes, 0 mismatches) across the (separator, strip) combinations `("", False)`, `("", True)`, `(" ", False)`, `("<|>", False)`; 95.2% of nodes are fast-path-eligible.
- `uv run pytest`: 336 passed, 4 skipped, 87 snapshots unchanged (no snapshot updates).
- `ruff check` / `ruff format --check` clean.
Result (back-to-back A/B, same machine, Python 3.13.12)
Corpus 3872 → 3590 ms (-7.3%), median 39.9 → 36.7 ms/SERP: far above the ~0.5% noise floor. Post-change profile: `_iter_text_fragments` self-time 5.8 → 2.2 s (cum 9.6 → 3.5 s), fragment visits 824k → 276k; the displaced work moved into lexbor's C `text()`.
Docs / plans
- `docs/plans/035-get-text-native-fastpath.md`: records the benchmark, the fast-path correctness argument, and the A/B result.
- `docs/plans/036-component-signals-and-extractor-hotpath.md`: scopes the next lever (`_ComponentSignals`, now ~13% of parse time) plus an extractor hot-path review.
- `CHANGELOG.md`: entry under `[Unreleased]`.
Notes
The repo's pinned `.python-version` (3.14.0rc2) currently can't import the package (`pydantic 2.13.4` vs the 3.14 RC `typing._eval_type` signature); all numbers were captured on Python 3.13.12. Flagged for a separate env/deps fix.
https://claude.ai/code/session_01XH4Tpn5aVFaEq814NoBTrC
Generated by Claude Code