Conversation
I think we need to have a stream and a bolt for failing tuples. The fail bolt implementation would be up to the user. Unfortunately, failing a tuple causes many spouts (including Storm's Kafka spout) to replay the tuple; since the sitemap parser will always fail to collect outlinks from the same source, this means the tuple will be replayed ad infinitum (I think), clogging up the topology.
I'd assumed that the handling of the fails should be done within the spouts. The 'external' resources would extend the default components where available, e.g. Kafka's spout, to allow keeping track of how many times a tuple has already failed, for instance by adding a custom key/value in the metadata. But there could be a separate stream as you suggested. The whole thing would get a bit more complex (I know, I like to keep the library and the assumptions that go with its use as lean as possible). Let's keep that in mind and tackle it in a separate discussion.
Shall we assume for now that:
a) the errors will be reproduced every time we run the parser on the same content == always ack?
or
b) there could be something in the fetching step that varies from one run to the other (e.g. a server returning partial content), which would justify giving the same URL another try?
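The spout-side bookkeeping mentioned above (a custom key/value in the metadata counting how many times a tuple has already failed) could look roughly like this. This is only a sketch: the key name `fetch.retries` and the limit of 3 are assumptions, not anything the library defines.

```java
import java.util.Map;

// Illustrative sketch: a spout extension could track fail counts per URL
// by storing a counter in the tuple's metadata. Key name and limit are
// hypothetical, not part of the library.
public class RetryTracker {
    static final int MAX_RETRIES = 3;               // assumed cut-off
    static final String RETRY_KEY = "fetch.retries"; // assumed metadata key

    /** Increments the retry counter and reports whether the URL should be replayed. */
    public static boolean shouldReplay(Map<String, String> metadata) {
        int retries = Integer.parseInt(metadata.getOrDefault(RETRY_KEY, "0"));
        retries++;
        metadata.put(RETRY_KEY, Integer.toString(retries));
        return retries <= MAX_RETRIES;
    }
}
```

On the fourth failure the counter exceeds the limit and the spout would give up instead of replaying, avoiding the infinite-replay problem described above.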
> I'd assumed that the handling of the fails should be done within the spouts.
I think that the semantics of failing a tuple, in the context of Storm, indicate that something went wrong mechanically in the topology, suggesting that it should be retried. In this case, nothing broke--the bolt just wasn't able to parse the content as a sitemap.
> there could be something in the Fetching step that could vary from one run to the other (e.g. server returning partial content) which would justify giving the same URL another try?
Yeah that's a good point, and it's a situation where replaying the tuple might be warranted. Another related case is where the content was truncated because of the http.content.limit option, which would look almost identical. But in the former case, replaying the tuple might change the outcome, whereas in the latter, the content would still be truncated and the tuple would fail again.
The interpretation of what should be failed is not just about 'going wrong mechanically in the topology'. If a JVM crashes, the execute() method will never get a chance to call fail(). Storm will detect that it crashed, call fail() on the spout, and let you handle it. There are cases where we do want to use fail(), for instance during the fetch of a URL.
Chapter 4 of Storm Applied, section 4.4.2, contains some useful guidelines:
> For known errors that cannot be safely retried (like a POST to REST API) or something that does not make sense to be retried (like a ParseException while handling JSON or XML), you shouldn't fail the tuple. When you have one of these non-retriable errors, instead of failing the tuple you will need to ack the tuple (without emitting a new one), as you don't want to engage the replay mechanism for it. We do recommend some sort of logging or reporting here to know that there was an error in your topology.
This applies to our case here. Let's just ack() the tuple but log the error.
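A hedged sketch of the guideline just quoted: classify exceptions so that deterministic parse errors lead to an ack (plus logging) while transient I/O errors engage the replay mechanism. The class and method names are illustrative only, not anything in the codebase.

```java
import java.io.IOException;

// Illustrative policy following the Storm Applied guideline: fail (replay)
// only errors that might succeed on a retry; ack and log the rest.
public class FailurePolicy {
    /** Returns true if the tuple should be failed (replayed), false if it should be acked. */
    public static boolean shouldFail(Exception e) {
        if (e instanceof IOException) {
            // transient: the next fetch attempt may succeed
            return true;
        }
        // parse/format errors are deterministic: the same content will
        // always throw the same exception, so ack the tuple and log instead
        return false;
    }
}
```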
A few initial comments. First, it looks like the bolt emits the outlinks one at a time. This behavior is different from the Tika-based parser bolt, which emits the outlinks in bulk. My preference would be to treat the SitemapParserBolt as a drop-in replacement for the Tika-based bolt, so that upstream and downstream behavior is unmodified. Second, it occurs to me that we need some type of outlink class. The metadata tuple field is associated with the resource being processed, but the outlinks themselves have associated metadata. This is true not just for sitemap-sourced outlinks, but also for those gathered from RSS feeds and the like. Third, I think we can easily handle filtering by decoupling it from parsing. In fact, I already do this in my own topology--there's a simple URLFilterBolt that applies the filters to the outlinks, and sits downstream from parsing.
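The outlink class suggested above could be as simple as a URL paired with its own metadata map, so that a sitemap's per-URL attributes (lastmod, changefreq, etc.) travel with the link rather than with the source document. This is only an illustrative sketch, not an existing class in the codebase.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical outlink type: each discovered URL carries its own metadata,
// independent of the metadata of the document it was found in.
public class Outlink {
    private final String targetURL;
    private final Map<String, String> metadata = new HashMap<>();

    public Outlink(String targetURL) {
        this.targetURL = targetURL;
    }

    public String getTargetURL() {
        return targetURL;
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }
}
```

This would work for RSS-sourced outlinks just as well as sitemap-sourced ones.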
The SitemapParser (isn't that a better name for it, without the uppercased M?) could also detect whether a document is a robots.txt based on the URL, in which case passing a robots.txt as a seed could in turn discover sitemap URLs.
Thanks Jake!
They would not replace each other but would be used side by side, with the SitemapParser sitting in front of the normal Parser: for a given URL, if no metadata marks it as a sitemap, it gets parsed by the normal parser as usual. The behaviour is different because they do not use the same streams.
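The routing just described could be sketched as a small decision on the metadata: tuples marked as sitemaps go to the SiteMapParserBolt, everything else continues on the default stream to the normal parser. The key name `isSitemap` and the stream names are placeholders, not necessarily what the bolt actually checks.

```java
import java.util.Map;

// Illustrative routing helper: pick a stream based on a (hypothetical)
// metadata key marking the URL as a sitemap.
public class SitemapRouter {
    public static String pickStream(Map<String, String> metadata) {
        if ("true".equalsIgnoreCase(metadata.get("isSitemap"))) {
            return "sitemap"; // handled by the SiteMapParserBolt
        }
        return "default"; // handled by the normal (Tika-based) parser
    }
}
```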
I had initially written a class like that and thought it easier to rely on the Values class. It probably depends on what contract we set for the status stream, which would be used in other places; see #37. A Tuple passed on the status stream could have the following fields
It could also be added as part of whatever bolt updates the status of the URLs. I tried to design the URL filtering mechanism in such a way that it can be used easily in various places. The benefit of doing it as early as possible is that it reduces the traffic downstream. Also, doing small tasks in discrete bolts puts an additional strain on the resources: you need at least 2 threads and 2 internal queues per Executor. Probably not a big issue, mind you...
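The decoupled filtering idea could be sketched like this, with filters as plain predicates so the same list can run inside a parser bolt (early, reducing downstream traffic) or in a dedicated URLFilterBolt. The real URLFilter interface in the project may well differ; this is a minimal stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of a reusable URL filtering mechanism. Filters here are
// plain predicates; a URL survives only if every filter accepts it.
public class URLFilters {
    private final List<Predicate<String>> filters = new ArrayList<>();

    public void addFilter(Predicate<String> f) {
        filters.add(f);
    }

    /** Returns the URL unchanged if all filters accept it, null otherwise. */
    public String filter(String url) {
        for (Predicate<String> f : filters) {
            if (!f.test(url)) {
                return null;
            }
        }
        return url;
    }
}
```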
Have changed the behaviour so that it now acks even if there is an exception.
Yeah one more question...are we sure that we want to send the sitemap-collected outlinks to the status stream? My understanding was that the status stream would be used for HTTP status codes and maybe other metadata (session/cookie, Last-Modified, cache keys, etc), which we won't have for the sitemap outlinks. My original thought was that the sitemap parsing would be sort of a rivulet, rather than a stream--if the sitemap key was present in the metadata, the topology DAG would branch to the sitemap parser, and then rejoin the default stream at whichever bolt sits directly downstream from the Tika parser. So it might look something like this:
The SiteMapProcessor is no different: it discovers URLs which, BTW, will have metadata (last updated, refresh frequency, etc.). We want to pass that to the underlying storage. This is functionally not quite the same as the 'normal' parser, which gives us outlinks with metadata but also text and metadata from the parse. Here is what I had suggested in #38
Another reason why we want to separate the output of the sitemaps is that by themselves their content is not really interesting, e.g. we don't need the text and won't index it. Note to self: we need to also pass the URL and status of the sitemap itself to the status stream, and not just its outlinks.
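The note-to-self above could translate into a status update that carries both the sitemap's own URL and the discovered outlinks. A minimal sketch, where the status labels FETCHED and DISCOVERED are illustrative rather than the project's actual constants:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative builder for a status-stream update: one entry for the
// sitemap itself plus one per discovered outlink. Each entry is a
// (url, status) pair; real code would use proper tuple/status types.
public class StatusUpdates {
    public static List<String[]> build(String sitemapURL, List<String> outlinks) {
        List<String[]> updates = new ArrayList<>();
        updates.add(new String[] { sitemapURL, "FETCHED" });
        for (String out : outlinks) {
            updates.add(new String[] { out, "DISCOVERED" });
        }
        return updates;
    }
}
```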
Maybe 'update' would be a better name than 'status'.
- …failing + handle redirections
- …ption message into metadata
- … tuples from status stream
- … + adds a status field when sending to the status stream
- …ologgy (even if the demo spout doesn't currently generate a relevant URL and metadata)
Have rebased with changes from master (status stream). Now generates an update about the status of the main URL. @DigitalPebble/committers-crawler any objections to committing this?
Can't be merged yet
Care to elaborate a bit? ;-)
There are conflicts preventing this from being merged
I see. I probably did something wrong with the rebase. Will have a look on Wednesday
Have merged the problematic file. @GuiForget could you please have another look? |
another variable with existential issues ;-) fixed
Looks like this could simply be if (!Boolean.valueOf(isBoolean))
definitely more elegant. fixed
thanks for the comments and discussion @DigitalPebble/committers-crawler |
See #38
This adds a SiteMapParserBolt and a test class. It also reorganizes the test resources and makes the ParsingTester more generic.
As discussed, this Bolt uses a non-default stream 'status' to output the newly discovered URLs and the default stream for documents that are not marked as being sitemaps.
Could you please review this? Thanks!