Merged
28 commits
0bcf21c
Merge branch 'master' into dev
gitronald Feb 23, 2026
7236545
version [prerelease]: 0.6.10a0
gitronald Feb 23, 2026
34f561c
build(deps): bump tornado from 6.5.4 to 6.5.5
dependabot[bot] Mar 12, 2026
081bcf9
remove unused functions and imports from utils
gitronald Mar 15, 2026
6d6c2de
replace os with pathlib.Path
gitronald Mar 15, 2026
1f55e1f
add ruff formatting, linting, and pre-commit hooks
gitronald Mar 15, 2026
ee6bf2c
apply ruff formatting and lint fixes
gitronald Mar 15, 2026
487209f
replace pandas with polars in scripts
gitronald Mar 15, 2026
6ee78ce
accept Path in utils and save methods, remove str() wrapping
gitronald Mar 15, 2026
353784e
consolidate webutils into utils
gitronald Mar 15, 2026
f92114f
add test coverage reporting
gitronald Mar 15, 2026
f717dd1
add unit tests for utils, locations, models, and feature extractor
gitronald Mar 15, 2026
d59e9cb
Merge branch 'dependabot/uv/tornado-6.5.5' into dev
gitronald Mar 15, 2026
3394b12
parse_serp always returns dict with results and features keys
gitronald Mar 15, 2026
25ddd60
remove BaseResult construction from ads parser
gitronald Mar 15, 2026
40bbe12
convert SERPFeatures from dataclass to pydantic
gitronald Mar 15, 2026
b203653
convert DetailsItem from dataclass to pydantic
gitronald Mar 15, 2026
4cc5091
add ResponseOutput model for search method returns
gitronald Mar 15, 2026
a9a96fe
add ParsedSERP model for parsed output
gitronald Mar 15, 2026
9d6deda
consolidate details field to typed dicts, remove DetailsItem/DetailsList
gitronald Mar 15, 2026
d76cbe7
normalize local_results sub_type for "results for" headers
gitronald Mar 15, 2026
3406b2b
update snapshots for local_results sub_type normalization
gitronald Mar 15, 2026
c20ca90
update demo scripts for ParsedSERP attribute access
gitronald Mar 16, 2026
0eb5c79
auto-detect chrome version by defaulting version_main to None
gitronald Mar 16, 2026
a1c5c7a
add demo-searches script entry point
gitronald Mar 16, 2026
f461061
update readme for v0.7.0 changes
gitronald Mar 16, 2026
a4e7287
fix SearchConfig.method type annotation to SearchMethod only
gitronald Mar 16, 2026
0cb6d14
version [minor]: 0.7.0
gitronald Mar 16, 2026
4 changes: 3 additions & 1 deletion .gitignore
@@ -11,5 +11,7 @@ scripts/ads-no-subtype
*.egg-info
*__pycache__

# Ignore test cache
# Ignore caches
.pytest_cache
.ruff_cache
.coverage
7 changes: 7 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,7 @@
repos:
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.15.5
hooks:
- id: ruff-format
- id: ruff
args: [--fix]
80 changes: 45 additions & 35 deletions README.md
@@ -26,34 +26,20 @@ and position-based specifications.
---
## Recent Updates

### 0.6.9

- Fixed bugs in component parsers (class comparison, assignment operator, set literal)
- Fixed `return` in `finally` block in requests searcher
- Added captcha detection to feature extractor
- Added captcha handling and jittered delay to demo searches
- Dropped pandas from core dependencies
- Cleaned up legacy typing imports
- Removed poetry.toml

### 0.6.8

- Migrated from Poetry to uv for dependency management
- Added Python 3.12-3.14 test matrix in GitHub Actions
- Added `flights` classifier and `standard-4` layout
- Added local service ad parser
- Extracted bottom ads before main column
- Fixed `return` in `finally` block warning in selenium searcher

### 0.6.7

- Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
- Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
- Added `sub_type` to perspectives parser from header text
- Added CI test workflow on push to dev branch
- Added compressed test fixtures with `condense_fixtures.py` script
- Updated dependency lower bounds for security patches (protobuf, orjson)
- Updated GitHub Actions to checkout v6 and setup-python v6
### 0.7.0 (dev)

- **Breaking:** `details` field is now always `dict | None` with a self-describing `type` key (e.g. `{"type": "menu", "items": [...]}`)
- **Breaking:** `parse_serp()` now always returns a dict with `results` and `features` keys; the `extract_features` parameter has been removed
- Standardized all models on Pydantic BaseModel (removed dataclasses)
- Added `ResponseOutput` and `ParsedSERP` typed models
- Removed `DetailsItem`, `DetailsList` classes
- Normalized `local_results` sub_type for location-specific headers
- Replaced `os` with `pathlib.Path` throughout
- Consolidated `webutils.py` into `utils.py`
- Added ruff formatting, linting, and pre-commit hooks
- Added test coverage reporting (69%)
- Added unit tests for utils, locations, models, and feature extractor
- Replaced pandas with polars in demo scripts
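
The new self-describing `details` contract can be sketched in plain Python; the field names other than `details` and `type` are illustrative stand-ins, not from the library:

```python
# Illustrative only: mimics the 0.7.0 `details` contract, where a result's
# details are either None or a dict carrying a self-describing "type" key.
result = {
    "title": "Example result",
    "details": {"type": "menu", "items": ["Overview", "Reviews", "Photos"]},
}

details = result["details"]
if details is not None and details["type"] == "menu":
    # Dispatch on the "type" key instead of isinstance checks
    # against the removed DetailsItem/DetailsList classes.
    labels = details["items"]
else:
    labels = []
```

Consumers that previously branched on class identity can now branch on the `type` key alone.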

---
## Getting Started
@@ -132,7 +118,7 @@ Example search and parse pipeline (via requests):
import WebSearcher as ws
se = ws.SearchEngine() # 1. Initialize collector
se.search('immigration news') # 2. Conduct a search
se.parse_results() # 3. Parse search results
se.parse_serp() # 3. Parse search results
se.save_serp(append_to='serps.json') # 4. Save HTML and metadata
se.save_results(append_to='results.json') # 5. Save parsed results

@@ -164,14 +150,14 @@ se.search('immigration news')

#### 3. Parse Search Results

The example below is primarily for parsing search results as you collect HTML.
See `ws.parse_serp(html)` for parsing existing HTML data.

```python
se.parse_results()
se.parse_serp()

# Show first result
se.results[0]
se.parsed.results[0]
{'section': 'main',
'cmpt_rank': 0,
'sub_rank': 0,
@@ -288,10 +274,34 @@ To release a new version:
---
## Update Log

`0.7.0`
- Standardize data models on Pydantic, typed details field, remove DetailsItem/DetailsList

`0.6.9`
- Fixed bugs in component parsers (class comparison, assignment operator, set literal)
- Fixed `return` in `finally` block in requests searcher
- Added captcha detection to feature extractor
- Added captcha handling and jittered delay to demo searches
- Dropped pandas from core dependencies
- Cleaned up legacy typing imports
- Removed poetry.toml

`0.6.8`
- Migrated from Poetry to uv for dependency management
- Added Python 3.12-3.14 test matrix in GitHub Actions
- Added `flights` classifier and `standard-4` layout
- Added local service ad parser
- Extracted bottom ads before main column
- Fixed `return` in `finally` block warning in selenium searcher

`0.6.7`
- Add `get_text_by_selectors()` utility, CI test workflow, compressed test fixtures
- Add `perspectives`, `recent_posts`, `latest_from` classifiers and `sub_type` for perspectives
- Update dependency bounds for security patches, GitHub Actions to v6
- Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
- Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
- Added `sub_type` to perspectives parser from header text
- Added CI test workflow on push to dev branch
- Added compressed test fixtures with `condense_fixtures.py` script
- Updated dependency lower bounds for security patches (protobuf, orjson)
- Updated GitHub Actions to checkout v6 and setup-python v6

`0.6.6`
- Update packages with dependabot alerts (brotli, urllib3)
26 changes: 20 additions & 6 deletions WebSearcher/__init__.py
@@ -1,8 +1,22 @@
__version__ = "0.6.9"
from .searchers import SearchEngine
from .parsers import parse_serp
from .feature_extractor import FeatureExtractor
__version__ = "0.7.0"

from .classifiers import ClassifyFooter, ClassifyMain
from .extractors import Extractor
from .feature_extractor import FeatureExtractor
from .locations import download_locations
from .classifiers import ClassifyMain, ClassifyFooter
from .webutils import load_html, make_soup, load_soup
from .parsers import parse_serp
from .searchers import SearchEngine
from .utils import load_html, load_soup, make_soup

__all__ = [
"ClassifyFooter",
"ClassifyMain",
"Extractor",
"FeatureExtractor",
"download_locations",
"parse_serp",
"SearchEngine",
"load_html",
"load_soup",
"make_soup",
]
11 changes: 9 additions & 2 deletions WebSearcher/classifiers/__init__.py
@@ -1,4 +1,11 @@
from .header_text import ClassifyHeaderText
from .footer import ClassifyFooter
from .header_components import ClassifyHeaderComponent
from .header_text import ClassifyHeaderText
from .main import ClassifyMain
from .footer import ClassifyFooter

__all__ = [
"ClassifyFooter",
"ClassifyHeaderComponent",
"ClassifyHeaderText",
"ClassifyMain",
]
38 changes: 21 additions & 17 deletions WebSearcher/classifiers/footer.py
@@ -1,14 +1,15 @@
import bs4
from .. import webutils

from .. import utils
from .main import ClassifyMain

class ClassifyFooter:

class ClassifyFooter:
@staticmethod
def classify(cmpt: bs4.element.Tag) -> str:
layout_conditions = [
('id' in cmpt.attrs and cmpt.attrs['id'] in {'bres', 'brs'}),
('class' in cmpt.attrs and cmpt.attrs['class'] == ['MjjYud']),
("id" in cmpt.attrs and cmpt.attrs["id"] in {"bres", "brs"}),
("class" in cmpt.attrs and cmpt.attrs["class"] == ["MjjYud"]),
]

# Ordered list of classifiers to try
@@ -26,37 +27,40 @@ def classify(cmpt: bs4.element.Tag) -> str:
# Default unknown, exit on first successful classification
cmpt_type = "unknown"
for classifier in classifier_list:
if cmpt_type != "unknown": break
if cmpt_type != "unknown":
break
cmpt_type = classifier(cmpt)

# Fall back to main classifier
if cmpt_type == 'unknown':
if cmpt_type == "unknown":
cmpt_type = ClassifyMain.classify(cmpt)

return cmpt_type

@staticmethod
def discover_more(cmpt):
conditions = [
cmpt.find("g-scrolling-carousel"),
]
return 'discover_more' if all(conditions) else "unknown"
return "discover_more" if all(conditions) else "unknown"

@staticmethod
def omitted_notice(cmpt):
conditions = [
cmpt.find("p", {"id":"ofr"}),
(webutils.get_text(cmpt, "h2") == "Notices about Filtered Results"),
cmpt.find("p", {"id": "ofr"}),
(utils.get_text(cmpt, "h2") == "Notices about Filtered Results"),
]
return "omitted_notice" if any(conditions) else "unknown"

@staticmethod
def searches_related(cmpt):
known_labels = {'Related',
'Related searches',
'People also search for',
'Related to this search',
'Searches related to'}
h3 = cmpt.find('h3')
known_labels = {
"Related",
"Related searches",
"People also search for",
"Related to this search",
"Searches related to",
}
h3 = cmpt.find("h3")
h3_matches = [h3.text.strip().startswith(text) for text in known_labels] if h3 else []
return 'searches_related' if any(h3_matches) else 'unknown'
return "searches_related" if any(h3_matches) else "unknown"
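
The classify flow above (try classifiers in order, stop at the first non-"unknown" label, then fall back to the main classifier) can be sketched in isolation; the toy classifier and dict-based component below are illustrative stand-ins, not the library's bs4-based code:

```python
def classify(cmpt, classifiers, fallback):
    # Default unknown; exit on first successful classification
    cmpt_type = "unknown"
    for classifier in classifiers:
        if cmpt_type != "unknown":
            break
        cmpt_type = classifier(cmpt)
    # Fall back (as ClassifyFooter falls back to ClassifyMain)
    if cmpt_type == "unknown":
        cmpt_type = fallback(cmpt)
    return cmpt_type

# Toy stand-in for a header-text classifier like searches_related:
def searches_related(c):
    return "searches_related" if c.get("h3", "").startswith("Related") else "unknown"

label = classify({"h3": "Related searches"}, [searches_related], lambda c: "general")
```

The early `break` means later classifiers never run once one has matched, so ordering in `classifier_list` is significant.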
7 changes: 4 additions & 3 deletions WebSearcher/classifiers/header_components.py
@@ -1,15 +1,16 @@
from .. import webutils
import bs4

from .. import utils


class ClassifyHeaderComponent:
"""Classify a component from the header section based on its bs4.element.Tag"""

@staticmethod
def classify(cmpt: bs4.element.Tag) -> str:
"""Classify the component type based on header text"""

cmpt_type = "unknown"
if webutils.check_dict_value(cmpt.attrs, "id", ["taw", "topstuff"]):
if utils.check_dict_value(cmpt.attrs, "id", ["taw", "topstuff"]):
cmpt_type = "notice"
return cmpt_type
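
The `check_dict_value` helper used above is not shown in this diff; a plausible sketch of its semantics (an assumption, not the library's implementation) is:

```python
def check_dict_value(d: dict, key, values) -> bool:
    # Assumed semantics: True if `key` is present in `d` and its value
    # is one of `values`; False otherwise.
    return key in d and d[key] in values

# e.g. the header classifier's notice check:
is_notice = check_dict_value({"id": "taw"}, "id", ["taw", "topstuff"])
```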