Conversation
I think we need to have a stream and a bolt for failing tuples. The fail bolt implementation would be up to the user. Unfortunately, failing a tuple causes many spouts (including Storm's Kafka spout) to replay the tuple; since the sitemap parser will always fail to collect outlinks from the same source, this means the tuple will be replayed ad infinitum (I think), clogging up the topology.
I'd assumed that the handling of the fails should be done within the spouts. The 'external' resources would extend the default components where available, e.g. Kafka's spout, to allow keeping track of how many times a tuple has already failed, for instance by adding a custom key/value in the metadata. But there could be a separate stream as you suggested. The whole thing would get a bit more complex (I know, I like to keep the library and the assumptions that go with its use as lean as possible). Let's keep that in mind and tackle it in a separate discussion.
Shall we assume for now that:
a) the errors will be reproduced every time we run the parser on the same content == always ack?
or
b) there could be something in the fetching step that varies from one run to the other (e.g. a server returning partial content), which would justify giving the same URL another try?
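The spout-side bookkeeping mentioned above (a custom key/value in the metadata counting how many times a tuple has already failed) could look roughly like this. This is only a sketch: the key name `fetch.retries` and the limit of 3 are assumptions, not anything the library defines.

```java
import java.util.Map;

// Illustrative sketch: a spout extension could track fail counts per URL
// by storing a counter in the tuple's metadata. Key name and limit are
// hypothetical, not part of the library.
public class RetryTracker {
    static final int MAX_RETRIES = 3;               // assumed cut-off
    static final String RETRY_KEY = "fetch.retries"; // assumed metadata key

    /** Increments the retry counter and reports whether the URL should be replayed. */
    public static boolean shouldReplay(Map<String, String> metadata) {
        int retries = Integer.parseInt(metadata.getOrDefault(RETRY_KEY, "0"));
        retries++;
        metadata.put(RETRY_KEY, Integer.toString(retries));
        return retries <= MAX_RETRIES;
    }
}
```

On the fourth failure the counter exceeds the limit and the spout would give up instead of replaying, avoiding the infinite-replay problem described above.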
> I'd assumed that the handling of the fails should be done within the spouts.
I think that the semantics of failing a tuple, in the context of Storm, indicate that something went wrong mechanically in the topology, suggesting that it should be retried. In this case, nothing broke--the bolt just wasn't able to parse the content as a sitemap.
> there could be something in the Fetching step that could vary from one run to the other (e.g. server returning partial content) which would justify giving the same URL another try?
Yeah that's a good point, and it's a situation where replaying the tuple might be warranted. Another related case is where the content was truncated because of the http.content.limit option, which would look almost identical. But in the former case, replaying the tuple might change the outcome, whereas in the latter, the content would still be truncated and the tuple would fail again.
The interpretation of what should be failed is not just about 'going wrong mechanically in the topology'. If a JVM crashes, the execute() method will never get a chance to call fail(). Storm will detect that it crashed, call fail() on the spout, and let you handle it. There are cases where we do want to use fail(), for instance during the fetch of a URL.
Chapter 4 of Storm Applied, section 4.4.2, contains some useful guidelines:
> For known errors that cannot be safely retried (like a POST to REST API) or something that does not make sense to be retried (like a ParseException while handling JSON or XML), you shouldn't fail the tuple. When you have one of these non-retriable errors, instead of failing the tuple you will need to ack the tuple (without emitting a new one), as you don't want to engage the replay mechanism for it. We do recommend some sort of logging or reporting here to know that there was an error in your topology.
This applies to our case here. Let's just ack() the tuple but log the error.
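A hedged sketch of the guideline just quoted: classify exceptions so that deterministic parse errors lead to an ack (plus logging) while transient I/O errors engage the replay mechanism. The class and method names are illustrative only, not anything in the codebase.

```java
import java.io.IOException;

// Illustrative policy following the Storm Applied guideline: fail (replay)
// only errors that might succeed on a retry; ack and log the rest.
public class FailurePolicy {
    /** Returns true if the tuple should be failed (replayed), false if it should be acked. */
    public static boolean shouldFail(Exception e) {
        if (e instanceof IOException) {
            // transient: the next fetch attempt may succeed
            return true;
        }
        // parse/format errors are deterministic: the same content will
        // always throw the same exception, so ack the tuple and log instead
        return false;
    }
}
```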
A few initial comments. First, it looks like the bolt emits the outlinks one at a time. This behavior is different from the Tika-based parser bolt, which emits the outlinks in bulk. My preference would be to treat the SitemapParserBolt as a drop-in replacement for the Tika-based bolt, so that upstream and downstream behavior is unmodified. Second, it occurs to me that we need some type of outlink class. The metadata tuple field is associated with the resource being processed, but the outlinks themselves have associated metadata. This is true not just for sitemap-sourced outlinks, but also for those gathered from RSS feeds and the like. Third, I think we can easily handle filtering by decoupling it from parsing. In fact, I already do this in my own topology--there's a simple URLFilterBolt that applies the filters to the outlinks, and sits downstream from parsing.
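The outlink class suggested above could be as simple as a URL paired with its own metadata map, so that a sitemap's per-URL attributes (lastmod, changefreq, etc.) travel with the link rather than with the source document. This is only an illustrative sketch, not an existing class in the codebase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical outlink type: each discovered URL carries its own metadata,
// independent of the metadata of the document it was found in.
public class Outlink {
    private final String targetURL;
    private final Map<String, String> metadata = new HashMap<>();

    public Outlink(String targetURL) {
        this.targetURL = targetURL;
    }

    public String getTargetURL() {
        return targetURL;
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }
}
```

This would work for RSS-sourced outlinks just as well as sitemap-sourced ones.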
The SitemapParser (isn't that a better name for it, without the uppercased M?) could also detect whether a document is a robots.txt based on the URL, in which case passing a robots.txt as a seed could in turn discover sitemap URLs.
Thanks Jake!
They would not replace each other but would be used side by side, with the SitemapParser sitting in front of the normal Parser: for a given URL, if no metadata marks it as a sitemap, it gets parsed by the normal parser as usual. The behaviour is different because they do not use the same streams.
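The routing just described could be sketched as a small decision on the metadata: tuples marked as sitemaps go to the SiteMapParserBolt, everything else continues on the default stream to the normal parser. The key name `isSitemap` and the stream names are placeholders, not necessarily what the bolt actually checks.

```java
import java.util.Map;

// Illustrative routing helper: pick a stream based on a (hypothetical)
// metadata key marking the URL as a sitemap.
public class SitemapRouter {
    public static String pickStream(Map<String, String> metadata) {
        if ("true".equalsIgnoreCase(metadata.get("isSitemap"))) {
            return "sitemap"; // handled by the SiteMapParserBolt
        }
        return "default"; // handled by the normal (Tika-based) parser
    }
}
```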
I had initially written a class like that and thought it easier to rely on the Values class. It probably depends on what contract we set for the status stream, which would be used in other places; see #37. A Tuple passed on the status stream could have the following fields
It could also be added as part of whatever bolt updates the status of the URLs. I tried to design the URL filtering mechanism in such a way that it can be used easily in various places. The benefit of doing it as early as possible is that it reduces the traffic downstream. Also, doing small tasks in discrete bolts puts an additional strain on the resources: you need at least 2 threads and 2 internal queues per Executor. Probably not a big issue, mind you...
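The decoupled filtering idea could be sketched like this, with filters as plain predicates so the same list can run inside a parser bolt (early, reducing downstream traffic) or in a dedicated URLFilterBolt. The real URLFilter interface in the project may well differ; this is a minimal stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of a reusable URL filtering mechanism. Filters here are
// plain predicates; a URL survives only if every filter accepts it.
public class URLFilters {
    private final List<Predicate<String>> filters = new ArrayList<>();

    public void addFilter(Predicate<String> f) {
        filters.add(f);
    }

    /** Returns the URL unchanged if all filters accept it, null otherwise. */
    public String filter(String url) {
        for (Predicate<String> f : filters) {
            if (!f.test(url)) {
                return null;
            }
        }
        return url;
    }
}
```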
Have changed the behaviour so that it now acks even if there is an exception.
Yeah one more question...are we sure that we want to send the sitemap-collected outlinks to the status stream? My understanding was that the status stream would be used for HTTP status codes and maybe other metadata (session/cookie, Last-Modified, cache keys, etc), which we won't have for the sitemap outlinks. My original thought was that the sitemap parsing would be sort of a rivulet, rather than a stream--if the sitemap key was present in the metadata, the topology DAG would branch to the sitemap parser, and then rejoin the default stream at whichever bolt sits directly downstream from the Tika parser. So it might look something like this:
The SiteMapProcessor is no different: it discovers URLs which, BTW, will have metadata (last updated, refresh frequency, etc.). We want to pass that to the underlying storage. This is functionally not quite the same as the 'normal' parser, which gives us outlinks with metadata but also text and metadata from the parse. Here is what I had suggested in #38
Another reason why we want to separate the output of the sitemaps is that by themselves their content is not really interesting, e.g. we don't need the text and won't index it. Note to self: we need to also pass the URL and status of the sitemap itself to the status stream, and not just its outlinks.
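The note-to-self above could translate into a status update that carries both the sitemap's own URL and the discovered outlinks. A minimal sketch, where the status labels FETCHED and DISCOVERED are illustrative rather than the project's actual constants:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative builder for a status-stream update: one entry for the
// sitemap itself plus one per discovered outlink. Each entry is a
// (url, status) pair; real code would use proper tuple/status types.
public class StatusUpdates {
    public static List<String[]> build(String sitemapURL, List<String> outlinks) {
        List<String[]> updates = new ArrayList<>();
        updates.add(new String[] { sitemapURL, "FETCHED" });
        for (String out : outlinks) {
            updates.add(new String[] { out, "DISCOVERED" });
        }
        return updates;
    }
}
```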
Maybe 'update' would be a better name than 'status'.
- …failing + handle redirections
- …ption message into metadata
- … tuples from status stream
- … + adds a status field when sending to the status stream
- …ologgy (even if the demo spout doesn't currently generate a relevant URL and metadata)
Have rebased with changes from master (status stream). Now generates an update about the status of the main URL. @DigitalPebble/committers-crawler any objections to committing this?
Can't be merged yet
Care to elaborate a bit? ;-)
There are conflicts preventing this from being merged
I see. I probably did something wrong with the rebase. Will have a look on Wednesday
Have merged the problematic file. @GuiForget could you please have another look? |
another variable with existential issues ;-) fixed
Looks like this could simply be if (!Boolean.valueOf(isBoolean))
definitely more elegant. fixed
thanks for the comments and discussion @DigitalPebble/committers-crawler |
See #38
This adds a SiteMapParserBolt and a test class. It also reorganizes the test resources and makes the ParsingTester more generic.
As discussed, this Bolt uses a non-default stream 'status' to output the newly discovered URLs and the default stream for documents that are not marked as being sitemaps.
Could you please review this? Thanks!