Prevent timeout events from being emitted when item is already fetched #170
Conversation
Fixes simplecrawler#136. It was possible for timeout events to be emitted after a page had already emitted a 404 event. The likely reason is that when a 404 is handled, the HTTP response socket is closed but the request socket isn't. I'm not too familiar with the Node http API, but I'm guessing this is why the timeout event for that request can still be emitted.
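For readers less familiar with the Node http API, here is a rough sketch of the suspected mechanism. It uses plain `http` rather than simplecrawler, and `example.com/missing` is a placeholder URL; whether the timeout actually fires this way depends on agent and keep-alive behavior, so treat it as an illustration of the hypothesis, not a verified reproduction.

```js
var http = require("http");

// Request a page that will 404. The response is handled below, but the
// request's own timeout timer is never cancelled.
var clientRequest = http.get("http://example.com/missing", function(response) {
    response.resume(); // drain the body
    response.on("end", function() {
        console.log("response handled, status:", response.statusCode);
        // The response socket is done here, but because the request was
        // never aborted, its timeout callback can (per the hypothesis
        // above) still fire afterwards.
    });
});

clientRequest.setTimeout(1000, function() {
    console.log("timeout fired after the response was already handled");
    clientRequest.abort();
});
```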
I'm wondering if there's room for some more generic logic that can be applied to all events, to verify that they should be emitted at certain stages. In the meantime this looks good. Is it possible to reproduce it in a test, or wasn't that doable?
Sure, let me try and add a test to reproduce it!
This was a difficult one to reproduce in a test, actually. Using a local URL, I'm having a hard time getting it to fail properly. The example below uses a URL that triggers the problem I see when I test this in production, though. Can you make any sense of why this URL would trigger the behavior, but not a local one? I've tried setting a … @cgiffard, let me know if you can reproduce the problem using the provided test case.

```js
it("should not emit both 404 and timeout events", function(done) {
    this.timeout("7s");

    var localCrawler = new Crawler("www.jackan.com", "/sida/0%29;");
    var didTimeout = false;

    localCrawler.timeout = 1000;

    localCrawler.on("fetchtimeout", function(queueItem) {
        console.log("fetchtimeout", queueItem.url);
        didTimeout = true;
    });

    // Once the 404 fires, wait a few timeout periods to make sure no
    // fetchtimeout event sneaks in afterwards.
    localCrawler.on("fetch404", function() {
        setTimeout(function() {
            didTimeout.should.equal(false);
            done();
        }, localCrawler.timeout * 4);
    });

    localCrawler.start();
});
```
@cgiffard do you think we can merge this and release a new version as a stopgap until we find the root of the problem?
Prevent timeout events from being emitted when item is already fetched
@cgiffard thanks for taking the time to push the new release 🌟 👍
Hello, after this commit I found another problem here:

```js
clientRequest.setTimeout(crawler.timeout, function() {
    // Guard added by this PR: bail out if the item was already fetched.
    if (queueItem.fetched) {
        return;
    }

    if (crawler.running && !queueItem.fetched) {
        crawler._openRequests--;
    }

    queueItem.fetched = true;
    queueItem.status = "timeout";
    crawler.emit("fetchtimeout", queueItem, crawler.timeout);

    clientRequest._crawlerHandled = true;
    clientRequest.abort();
});
```

In my test case, when several redirects happen one after another, the socket has already died but clientRequest is still alive. When the timeout fires, queueItem.fetched is already true, so the early return means we never reach clientRequest.abort(), and the crawl stalls forever because _openRequests == maxConcurrency. Changing the guard like this fixes it for me:

```js
clientRequest.setTimeout(crawler.timeout, function() {
    // Only bail out if this particular request was already handled.
    if (clientRequest._crawlerHandled) {
        return;
    }

    if (crawler.running && !queueItem.fetched) {
        crawler._openRequests--;
    }

    // Emit fetchtimeout only if the item hasn't been fetched yet...
    if (!queueItem.fetched) {
        queueItem.fetched = true;
        queueItem.status = "timeout";
        crawler.emit("fetchtimeout", queueItem, crawler.timeout);
    }

    // ...but always abort the request so it can't linger.
    clientRequest._crawlerHandled = true;
    clientRequest.abort();
});
```

Please tell me if I can provide more useful information.
Hmm, I see. Do you know anything more about the conditions that cause this problem? I had a lot of trouble creating a reproducible scenario for the first issue here. Anyway, if you open a PR we could still merge it: these changes look good to me and make sense (although @cgiffard would know a lot more about how the code behaves).
I ran it against http://www.swannetoscar.com with maxConcurrency = 5. I will open a PR later, thanks!
Fixes #136. It was possible for timeout events to be emitted after a page had already emitted a 404 event. The likely reason is that when a 404 is handled, the HTTP response socket is closed but the request socket isn't. I'm not too familiar with the Node http API, but I'm guessing this is why the timeout event for that request can still be emitted.

Just like @fantapop thought, this was likely introduced in 40a71c1, and this change makes simplecrawler behave the way it did before that commit.

Perhaps what's really needed here is for the request socket to be closed together with the response socket (once again, I haven't worked much with the Node http module directly), but this fixes the problem until further notice anyway.
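A minimal sketch of that idea, again using plain Node `http` with a placeholder URL rather than simplecrawler's actual fetch code: abort the request as soon as the response has been fully consumed, which destroys the underlying socket and with it any pending timeout.

```js
var http = require("http");

var clientRequest = http.get("http://example.com/", function(response) {
    response.resume(); // drain the body
    response.on("end", function() {
        // Tear down the request socket together with the response socket;
        // abort() destroys the socket, so its timeout can no longer fire.
        clientRequest.abort();
    });
});

clientRequest.setTimeout(1000, function() {
    // With the abort() above, this handler should not run once a
    // response has been fully handled.
    clientRequest.abort();
});
```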