It would be good to reuse the parsers from crawler-commons to handle sitemaps. It is probably simpler to have a separate parser from the Tika-based one and to trigger the sitemap parser based on the presence of an arbitrary metadata key such as 'parse.sitemap'=true.
The triaging of the tuples based on that metadata could be done in a meta parser wrapping both the Tika and sitemap parsers, but it would probably be simpler to do it in the topology class. Alternatively, the sitemap parser could wrap the Tika parser.
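A minimal sketch of that triage, assuming a simple key/value metadata map and the 'parse.sitemap' key mentioned above (the class and method names here are hypothetical, not an existing API):

```java
import java.util.Map;

// Sketch of the triage described above: a wrapping bolt (or the topology
// itself) inspects the 'parse.sitemap' metadata key and routes the tuple
// either to the sitemap parser or to the default Tika-based parser.
public class SitemapTriage {
    static final String SITEMAP_KEY = "parse.sitemap";

    /** Returns "sitemap" when the flag is set to true, "tika" otherwise. */
    static String route(Map<String, String> metadata) {
        String flag = metadata.get(SITEMAP_KEY);
        return "true".equalsIgnoreCase(flag) ? "sitemap" : "tika";
    }
}
```

In a real topology this decision would map to emitting on different streams (or to different downstream bolts) rather than returning a string.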
The output of this parser would not be the same as the Tika-based one: it would not generate any textual content or call the parse filters, but would instead consist of URL, Metadata pairs. Similarly to what I suggested for #37, this bolt could generate a 'status' stream that would be consumed by a persistence bolt. It would also use the default stream to pass on URLs that are not sitemaps.
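To illustrate the shape of that output, here is a stdlib-only stand-in for what crawler-commons' sitemap parsing would provide: it pulls the <loc> entries out of a urlset, which are the URLs the bolt would pair with metadata and emit on the 'status' stream. This is only a sketch; the real crawler-commons parser also handles sitemap indexes, gzipped sitemaps, RSS/Atom feeds, etc.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for crawler-commons' sitemap parsing: extract the
// <loc> URLs from a sitemap urlset. Each URL would then be emitted as a
// (URL, Metadata) pair on the 'status' stream for a persistence bolt.
public class SitemapUrls {
    static List<String> extractLocs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```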
Question: what happens if we send tuples down a stream with nothing consuming them?