Description
I am using 'strategy="all"', but it ignores URLs that are not in the same domain. 'strategy="same-domain"' doesn't work as expected either: for example, https://ca.wikipedia.org/wiki/Portada has links to http://es.wikipedia..., and using "same-domain" doesn't enqueue them. I think the problem is in "context.extract_links()", because I do not see these links in its output, and at that point I haven't specified a strategy yet.
I can share more code if needed, but in the "default_handler" I just log some information and then call await context.enqueue_links(strategy="all") or await context.enqueue_links(strategy="same-domain"). My crawler is configured as follows:
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
    max_requests_per_crawl=10000,
    playwright_crawler_specific_kwargs={
        "browser_type": "chromium",
        "headless": True,
    },
    max_session_rotations=10,
    retry_on_blocked=True,
    configure_logging=True,
    use_session_pool=True,
    max_request_retries=5,
    request_handler_timeout=timedelta(seconds=120),
)
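To make the behavior I expect concrete, here is a rough standalone sketch of how I understand the strategies should filter links. It is not the library's implementation: should_enqueue and naive_registered_domain are hypothetical helpers, the target URLs are illustrative stand-ins, and the "last two labels" rule is a simplification (a real check would need the Public Suffix List).

```python
from urllib.parse import urlsplit


def naive_registered_domain(hostname: str) -> str:
    # Assumption: the registered domain is the last two labels of the hostname.
    # This is wrong for suffixes such as "co.uk"; it is only meant for illustration.
    return ".".join(hostname.lower().split(".")[-2:])


def should_enqueue(source_url: str, target_url: str, strategy: str) -> bool:
    """Hypothetical filter showing what each strategy is expected to allow."""
    src, dst = urlsplit(source_url), urlsplit(target_url)
    if strategy == "all":
        # Every link is followed, regardless of where it points.
        return True
    if strategy == "same-hostname":
        # Only links whose hostname matches exactly.
        return src.hostname == dst.hostname
    if strategy == "same-domain":
        # Subdomains of the same registered domain are allowed.
        return naive_registered_domain(src.hostname or "") == naive_registered_domain(dst.hostname or "")
    if strategy == "same-origin":
        # Scheme, hostname and port must all match.
        return (src.scheme, src.hostname, src.port) == (dst.scheme, dst.hostname, dst.port)
    raise ValueError(f"Unknown strategy: {strategy}")


if __name__ == "__main__":
    page = "https://ca.wikipedia.org/wiki/Portada"
    # Illustrative stand-in for a link to another Wikipedia subdomain.
    link = "http://es.wikipedia.org/"
    for strategy in ("all", "same-domain", "same-hostname", "same-origin"):
        print(strategy, should_enqueue(page, link, strategy))
```

Under this reading, a link from ca.wikipedia.org to another wikipedia.org subdomain should be enqueued with both "all" and "same-domain", and only rejected with "same-hostname" or "same-origin".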
On another note, do you have another channel, such as Slack, for reporting bugs and interacting more easily (and privately)? We are going to be heavy users of this project and we really like it, but we feel it is not very stable yet, and more direct communication would let us help each other out. Of course, since this is an OSS project, I understand if you'd rather not have one :)