context.enqueue_links strategy parameter not working as expected for AdaptivePlaywrightCrawlingContext #1504

@ericvg97

I am using 'strategy="all"' and it ignores URLs that are not in the same domain. 'strategy="same-domain"' doesn't work as expected either: for example, https://ca.wikipedia.org/wiki/Portada has links to http://es.wikipedia... and "same-domain" doesn't enqueue them. I think the problem is in "context.extract_links()", because I don't see these links in its output either, and at that point I haven't specified a strategy yet.

I can share more code if needed, but in the "default_handler" I just log stuff and call await context.enqueue_links(strategy="all") or await context.enqueue_links(strategy="same-domain"). My crawler is this one:

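        # Imports the snippet needs; the crawlee import path assumes a recent Crawlee release.
        from datetime import timedelta

        from crawlee.crawlers import AdaptivePlaywrightCrawler

        # Adaptive crawler: BeautifulSoup for static pages, Playwright (headless Chromium) otherwise.
        crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
            max_requests_per_crawl=10000,
            playwright_crawler_specific_kwargs={
                "browser_type": "chromium",
                "headless": True,
            },
            max_session_rotations=10,
            retry_on_blocked=True,
            configure_logging=True,
            use_session_pool=True,
            max_request_retries=5,
            request_handler_timeout=timedelta(seconds=120),
        )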
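
To make the expectation in the report concrete, here is a minimal sketch of how the strategies would be expected to classify a cross-subdomain link. It is not Crawlee's implementation: `link_matches` and `registered_domain` are hypothetical helpers, the hosts are placeholders that only mirror the shape of the Wikipedia case (two subdomains of one site), the "same-hostname" strategy is included only for contrast, and the "last two labels" rule is a simplification that a real implementation would replace with a Public Suffix List lookup.

```python
from urllib.parse import urlparse

# Placeholder URLs (assumption): two different subdomains of one registered domain.
ORIGIN = "https://ca.example.org/wiki/Portada"
TARGET = "http://es.example.org/wiki/Portada"


def registered_domain(hostname: str) -> str:
    """Naive stand-in: keep only the last two labels of the hostname.

    Illustration only; this is wrong for suffixes such as co.uk, where a
    Public Suffix List lookup is needed.
    """
    return ".".join(hostname.lower().split(".")[-2:])


def link_matches(strategy: str, origin_url: str, target_url: str) -> bool:
    """Return True if target_url would be expected to pass the given strategy."""
    origin_host = urlparse(origin_url).hostname or ""
    target_host = urlparse(target_url).hostname or ""

    if strategy == "all":
        # No filtering at all: every link is expected to be enqueued.
        return True
    if strategy == "same-hostname":
        # Exact host match, so a different subdomain is rejected.
        return origin_host == target_host
    if strategy == "same-domain":
        # Subdomains of the same registered domain are accepted.
        return registered_domain(origin_host) == registered_domain(target_host)
    raise ValueError(f"Unknown strategy: {strategy!r}")


if __name__ == "__main__":
    for name in ("all", "same-domain", "same-hostname"):
        print(name, link_matches(name, ORIGIN, TARGET))
```

Under these assumptions, a link from one subdomain to another passes both "all" and "same-domain", which is the behaviour the report expects and does not observe.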

On another note, do you have another channel, like Slack, for reporting bugs and interacting more easily (and privately)? We are going to be heavy users of this project and we really like it, but we feel it is not very stable yet, and more direct communication would let us help each other out. Of course, as this is an OSS project, I understand if you'd rather not have one :)

Labels: bug (Something isn't working), t-tooling (Issues with this label are in the ownership of the tooling team)
