Description
The DelegatorProtocol can route URLs to different protocol implementations based on their metadata. This is particularly helpful when used with the SeleniumProtocol, e.g. to make sure sitemaps are fetched by another implementation (e.g. OKHTTP) and so avoid rendering, which makes them unparsable.
For the record, this is currently configured as follows:
# use the normal protocol for sitemaps
protocol.delegator.config:
  - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
    filters:
      isSitemap: "true"
  - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
More than one filter can be defined, but in the current version (<2.9) all filters must match for an implementation to be selected.
We will add the possibility to define an OR operator, so that a single matching condition is enough for a match.
With this change:
# use the normal protocol for sitemaps
protocol.delegator.config:
  - className: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
    filters:
      isSitemap: "true"
      noSelenium:
      robots.txt:
  - className: "com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol"
The okhttp implementation will be selected if any of the three conditions matches, i.e. the URL has the key/value pair isSitemap: "true" in its metadata, OR it has a key noSelenium (regardless of value), OR it has a key robots.txt (regardless of value).
The robots.txt key will be added as part of the commit so that filtering rules can handle robots.txt (for which using Selenium is overkill).
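To make the difference between the two matching modes concrete, here is a minimal Java sketch. It is not the StormCrawler code: the class FilterMatchSketch, the Operator enum and the matches method are hypothetical names, metadata is simplified to one value per key, and a filter with no value is modelled as a null entry meaning "the key only has to be present". Treating an entry without filters as a catch-all is also an assumption, made to mirror the last entry in the configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of AND vs OR filter matching; not the StormCrawler implementation.
final class FilterMatchSketch {

    /** How the filters of one configuration entry are combined (hypothetical). */
    enum Operator { AND, OR }

    /**
     * @param filters  key -> expected value, or null when only the key's presence matters
     * @param metadata URL metadata, simplified to a single value per key
     */
    static boolean matches(Map<String, String> filters, Map<String, String> metadata, Operator op) {
        // Assumption: an entry without filters is a catch-all.
        if (filters.isEmpty()) {
            return true;
        }
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            String expected = filter.getValue();
            boolean hit = expected == null
                    ? metadata.containsKey(filter.getKey())
                    : expected.equals(metadata.get(filter.getKey()));
            if (op == Operator.OR && hit) {
                return true;   // one matching condition is enough
            }
            if (op == Operator.AND && !hit) {
                return false;  // every condition must match
            }
        }
        // AND: no filter failed; OR: no filter matched.
        return op == Operator.AND;
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("isSitemap", "true");
        filters.put("noSelenium", null);
        filters.put("robots.txt", null);

        // A URL that only carries the noSelenium key.
        Map<String, String> metadata = Map.of("noSelenium", "whatever");

        System.out.println(matches(filters, metadata, Operator.AND)); // false
        System.out.println(matches(filters, metadata, Operator.OR));  // true
    }
}
```

With the AND behaviour, a URL carrying only noSelenium falls through to the next entry, whereas with OR the same URL is routed to the first implementation.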