It would be good to reuse the parsers from crawler-commons to handle sitemaps. It is probably simpler to have a separate parser from the Tika-based one and to trigger the sitemap parser based on the presence of an arbitrary metadata key such as 'parse.sitemap'=true.
The triaging of the tuples based on that metadata could be done in a meta parser wrapping both the Tika and sitemap parsers, but it would probably be simpler to do it in the topology class. Alternatively, the sitemap parser could wrap the Tika parser.
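A minimal sketch of that triage, assuming a simple key/value metadata map and the 'parse.sitemap' key mentioned above (the class and method names here are hypothetical, not an existing API):

```java
import java.util.Map;

// Sketch of the triage described above: a wrapping bolt (or the topology
// itself) inspects the 'parse.sitemap' metadata key and routes the tuple
// either to the sitemap parser or to the default Tika-based parser.
public class SitemapTriage {
    static final String SITEMAP_KEY = "parse.sitemap";

    /** Returns "sitemap" when the flag is set to true, "tika" otherwise. */
    static String route(Map<String, String> metadata) {
        String flag = metadata.get(SITEMAP_KEY);
        return "true".equalsIgnoreCase(flag) ? "sitemap" : "tika";
    }
}
```

In a real topology this decision would map to emitting on different streams (or to different downstream bolts) rather than returning a string.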
The output of this parser would not be the same as the Tika-based one: it would not generate any textual content or call the parse filters, but would instead consist of URL, Metadata pairs. Similarly to what I suggested for #37, this bolt could generate a 'status' stream that would be consumed by a persistence bolt. It would also use the default stream to pass on URLs that are not sitemaps.
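To illustrate the shape of that output, here is a stdlib-only stand-in for what crawler-commons' sitemap parsing would provide: it pulls the <loc> entries out of a urlset, which are the URLs the bolt would pair with metadata and emit on the 'status' stream. This is only a sketch; the real crawler-commons parser also handles sitemap indexes, gzipped sitemaps, RSS/Atom feeds, etc.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for crawler-commons' sitemap parsing: extract the
// <loc> URLs from a sitemap urlset. Each URL would then be emitted as a
// (URL, Metadata) pair on the 'status' stream for a persistence bolt.
public class SitemapUrls {
    static List<String> extractLocs(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```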
Question: what happens if we send tuples down a stream with nothing consuming them?