Add OR operator for filter logic in DelegatorProtocol #1098

Closed
@jnioche

Description

The DelegatorProtocol can route URLs to different protocol implementations based on their metadata. This is particularly helpful when used with the Selenium protocol, e.g. to make sure sitemaps are fetched by another implementation (e.g. OKHTTP) and are not rendered, since rendering makes them unparsable.

For the record, this is configured as follows:

  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

More than one filter can be defined, but in the current version (<2.9) all filters must match for an implementation to be selected.

We will add the possibility to define an OR operator, so that only one condition is required for a match.

With the OR operator, a configuration such as the following becomes possible:

  # use the normal protocol for sitemaps
  protocol.delegator.config:
   - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
     filters:
       isSitemap: "true"
       noSelenium:
       robots.txt:
   - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"

The okhttp implementation will be selected if any of the three conditions matches, i.e. the URL's metadata contains the key isSitemap with the value "true", OR the key noSelenium (regardless of value), OR the key robots.txt (regardless of value).

The robots.txt key will be added as part of the same commit so that filtering rules can handle robots.txt files (for which using Selenium is overkill).
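The difference between the existing all-filters-must-match behaviour and the proposed OR behaviour can be sketched in Java. This is a minimal illustration only, not the actual DelegatorProtocol code: the class FilterMatchSketch, the Operator enum and the method names are hypothetical, metadata is simplified to one value per key, and treating an entry without filters as a catch-all is an assumption based on the last entry in the configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the filter matching described in this issue.
// None of these names come from StormCrawler itself.
final class FilterMatchSketch {

    enum Operator { AND, OR }

    // A single filter matches if the key is present and, when an expected
    // value is given, the metadata value equals it. A null or empty expected
    // value means "key present, regardless of value".
    static boolean matchesOne(String key, String expected, Map<String, String> metadata) {
        if (!metadata.containsKey(key)) {
            return false;
        }
        if (expected == null || expected.isEmpty()) {
            return true;
        }
        return expected.equals(metadata.get(key));
    }

    // AND: every filter must match. OR: one matching filter is enough.
    static boolean matches(Map<String, String> filters, Operator op, Map<String, String> metadata) {
        if (filters.isEmpty()) {
            return true; // assumption: no filters means the entry is a catch-all
        }
        for (Map.Entry<String, String> f : filters.entrySet()) {
            boolean hit = matchesOne(f.getKey(), f.getValue(), metadata);
            if (op == Operator.OR && hit) {
                return true;
            }
            if (op == Operator.AND && !hit) {
                return false;
            }
        }
        // AND reaches this point only if all filters matched, OR only if none did.
        return op == Operator.AND;
    }

    public static void main(String[] args) {
        // Mirrors the three filters of the okhttp entry above.
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("isSitemap", "true");
        filters.put("noSelenium", null);
        filters.put("robots.txt", null);

        // A URL whose metadata only carries the noSelenium key.
        Map<String, String> metadata = Map.of("noSelenium", "yes");

        System.out.println("AND: " + matches(filters, Operator.AND, metadata));
        System.out.println("OR:  " + matches(filters, Operator.OR, metadata));
    }
}
```

With AND semantics the URL in main is not routed to okhttp because isSitemap and robots.txt are missing; with OR semantics the presence of noSelenium alone is enough.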
