
RobotsURLFilter #24

@jnioche

Description


The filtering of URLs based on the robots.txt directives is currently done within the Fetcher for each incoming URL. It would be more efficient to also filter the outlinks, so that URLs which can't be fetched anyway never get added to the queues (or any other form of persistence).

We should provide a RobotsURLFilter for this, which would fetch the robots.txt file for a given URL and store the parsed rules in a cache. The extra cost of pulling the robots.txt files would be outweighed by the benefit of not adding unnecessary URLs to the queues.
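A minimal sketch of the idea, using Python's stdlib `urllib.robotparser` for illustration (the actual filter would be implemented against the project's own URLFilter interface; the class name, default agent string, and the stubbed `_fetch_robots` method here are all assumptions for the example):

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsURLFilter:
    """Filters outlinks against per-host robots.txt rules, cached by host."""

    def __init__(self, agent="mycrawler"):
        self.agent = agent
        self.cache = {}  # host -> parsed RobotFileParser rules

    def _fetch_robots(self, host):
        # A real filter would HTTP-fetch http://<host>/robots.txt here;
        # stubbed with a fixed ruleset so the example is self-contained.
        return "User-agent: *\nDisallow: /private/\n"

    def _rules_for(self, url):
        host = urlsplit(url).netloc
        if host not in self.cache:
            rules = RobotFileParser()
            rules.parse(self._fetch_robots(host).splitlines())
            self.cache[host] = rules  # one fetch + parse per host
        return self.cache[host]

    def filter(self, outlinks):
        # Keep only the outlinks that robots.txt allows us to fetch.
        return [u for u in outlinks
                if self._rules_for(u).can_fetch(self.agent, u)]
```

Usage: `RobotsURLFilter().filter(["http://example.com/a", "http://example.com/private/x"])` drops the `/private/` URL before it ever reaches the queues, and subsequent URLs on the same host hit the cache rather than triggering another robots.txt fetch.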
