Conversation
This commit adds an interface for robots caches, a thread-safe in-memory cache implementation, and a basic URLFilter that applies robots rules.
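A cache of this kind might look roughly like the following sketch: an interface keyed by hostname, with an in-memory implementation backed by a `ConcurrentHashMap` for thread safety. All names here (`RobotsCache`, `RobotsRules`, `InMemoryRobotsCache`) are illustrative, not necessarily the PR's actual API, and the rules class is a deliberately crude placeholder for parsed robots.txt directives.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache interface; the real PR's API may differ.
interface RobotsCache {
    void put(String host, RobotsRules rules);
    Optional<RobotsRules> get(String host);
}

// Crude stand-in for parsed robots.txt rules: a list of disallowed path prefixes.
class RobotsRules {
    private final List<String> disallowedPrefixes;

    RobotsRules(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = disallowedPrefixes;
    }

    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}

// Thread-safe in-memory implementation: ConcurrentHashMap handles
// concurrent reads and writes without external locking.
class InMemoryRobotsCache implements RobotsCache {
    private final Map<String, RobotsRules> cache = new ConcurrentHashMap<>();

    public void put(String host, RobotsRules rules) {
        cache.put(host, rules);
    }

    public Optional<RobotsRules> get(String host) {
        return Optional.ofNullable(cache.get(host));
    }
}
```

Returning `Optional` makes the miss case explicit for callers, which matters later in the thread: a miss and a "disallowed" result must be distinguishable.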
A few notes on the new PR:
Hi Jake. I don't really like the changes to the ParserBolt. Either the FetcherBolt stores the robots info via the metadata (the value can be an array containing a single string), since the info is already available there, or the URL filter itself fetches the robots file; either way, this should definitely not be done by the ParserBolt.
I'm not a fan of the implementation either, but there's a reason for fetching the robots file in the ParserBolt rather than within the filter. Other solutions I entertained are:
(2) is still an option, but the only fundamental difference is that the ParserBolt doesn't make an HTTP request for robots.txt.
It would make sense to pass the source URL as metadata anyway (great for debugging), and that could indeed be used to determine whether or not to do the sitemap checks. I am planning to add a utility class that would help generate metadata for outlinks given the metadata for the source page. This would of course be configurable and could be used to add the 'origin' metadata.
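A sketch of what such a utility could look like, under the stated assumptions (a configurable set of keys to transfer, plus an 'origin' key carrying the source URL). The class and method names are hypothetical; the eventual utility may be shaped quite differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical outlink-metadata utility: copies a configurable set of keys
// from the source page's metadata and records the source URL under 'origin'.
class OutlinkMetadataBuilder {
    private final Set<String> keysToTransfer;

    OutlinkMetadataBuilder(Set<String> keysToTransfer) {
        this.keysToTransfer = keysToTransfer;
    }

    Map<String, String> forOutlink(String sourceUrl, Map<String, String> sourceMetadata) {
        Map<String, String> out = new HashMap<>();
        // transfer only the configured keys that are actually present
        for (String key : keysToTransfer) {
            String value = sourceMetadata.get(key);
            if (value != null) {
                out.put(key, value);
            }
        }
        // the source URL is useful for debugging and for same-host checks
        out.put("origin", sourceUrl);
        return out;
    }
}
```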
I presume that 'direct cache query' means hitting the same cache instance used by the Fetcher, which in the case of the memory-based one would work if it is located in the same JVM. If so, why not do that when the hostname is the same as the source's as well? It should definitely be there.
The assumption is that many or most outlinks will be internal, so we would guarantee that the robots rules would be available for outlinks with the same hostname as the source URL; for external outlinks, the robots rules might be available for filtering, depending on a number of factors. To guarantee that robots rules for the hostname are in the JVM-local cache, we need to either fetch or re-parse the robots.txt file within the same process as the filters. For external outlinks, the rules might be present in the cache; if they are, then this is a secondary benefit. Keeping the filter straightforward (give it a cache, feed it URLs to filter) leaves the cache-filling strategy outside of the filter, which is where I think it should be.
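The contract described here (give the filter a cache, feed it URLs; a cache miss never blocks a URL) could be sketched as below. To keep the example self-contained, the cache is modelled as a map from hostname to an "is this path allowed?" predicate; the class and method names are illustrative, not the PR's.

```java
import java.net.URI;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative robots URL filter: applies cached rules when present,
// and lets URLs through on a cache miss (filling the cache is someone
// else's job, per the design discussed above).
class RobotsUrlFilter {
    private final Map<String, Predicate<String>> rulesByHost;

    RobotsUrlFilter(Map<String, Predicate<String>> rulesByHost) {
        this.rulesByHost = rulesByHost;
    }

    /**
     * Returns the URL unchanged if it is allowed, or if no rules are
     * cached for its host (a miss never blocks); null if disallowed.
     */
    String filter(String url) {
        try {
            URI uri = URI.create(url);
            String host = uri.getHost();
            if (host == null) {
                return null; // no usable host: drop the URL
            }
            Predicate<String> allowed = rulesByHost.get(host);
            if (allowed == null) {
                return url; // cache miss: pass through unfiltered
            }
            return allowed.test(uri.getPath()) ? url : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: drop
        }
    }
}
```

The null-on-rejection convention mirrors how URL filters commonly signal "discard this URL" in crawler pipelines.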
I agree with your definition of what the filter does (i.e. guaranteed for the same hostname). Most of the time the fetcher and parser will be running in the same JVM; if not, the performance of the crawler would certainly suffer a lot. I'd be happy with a simpler solution where we expect this to be the case. If we want a 100% guarantee then indeed we need to check the cache and, if the rules are not found there, refetch and parse the robots.txt. What I think we disagree on is where that should be done. I don't think it should be done in the parser at all: its purpose is to be generic, and moreover people could have various implementations of parsing bolts (Gui wrote one, for instance, to deal with HTML exclusively). We'd want them to be able to reuse the parsing filters as-is without having to write any bespoke code as a prerequisite for these filters to work. If we need to fetch and parse the robots rules then it should be done within the filter itself. Again, there would be something wrong in the users' setup if the fetcher and parsers were not running in the same JVM, so I think it is reasonable to just reuse the cache filled by the Fetcher.
This PR adds a new mechanism for robots.txt caching and filtering.
Among the changes are: