
Prevent timeout events from being emitted when item is already fetched #170

Merged
merged 1 commit into simplecrawler:master on Oct 2, 2015

Conversation

fredrikekelund
Collaborator

Fixes #136. It was possible for timeout events to be emitted after a page had already emitted a 404 event. The likely reason is that when 404s are handled, the HTTP response socket is closed, but the request socket isn't. I'm not too familiar with the Node http API, but I'm guessing that's why the timeout event for that request can still be emitted.

Just like @fantapop thought, this was likely introduced in 40a71c1, and this change makes simplecrawler behave the way it did before that commit.

Perhaps what's really needed here is for the request socket to be closed together with the response socket (once again, I haven't worked much with the Node http module directly), but this fixes the problem for now.
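
In essence, the change makes the request's timeout handler bail out when the queue item has already been fetched. Simplified, it looks something like this (excerpt only; the surrounding handler is the one quoted further down in this thread):

clientRequest.setTimeout(crawler.timeout, function() {
    // If the item was already handled (e.g. a 404 response came back),
    // don't emit a late fetchtimeout for the lingering request socket.
    if (queueItem.fetched) {
        return;
    }

    // ...the existing timeout handling continues unchanged from here
});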

@cgiffard
Member

I'm wondering if there's room for some more generic logic that can be applied to all events, to verify that they should be emitted at certain stages. In the meantime this looks good.

Is it possible to reproduce it in a test — or wasn't that doable?
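
Something like a small guard helper might do it; just a rough sketch, and _emitIfNotFetched is a made-up name rather than anything in the codebase:

// Rough sketch only: a hypothetical helper that fetch-related events could
// be routed through, so stale events for already-handled items get dropped.
Crawler.prototype._emitIfNotFetched = function(queueItem) {
    if (queueItem.fetched) {
        return false;
    }

    // Forward everything after the queue item straight to emit()
    var args = Array.prototype.slice.call(arguments, 1);
    this.emit.apply(this, args);
    return true;
};

// Usage would then be along the lines of:
// crawler._emitIfNotFetched(queueItem, "fetchtimeout", queueItem, crawler.timeout);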

@fredrikekelund
Collaborator Author

Sure, let me try and add a test to reproduce it!

@fredrikekelund
Collaborator Author

This was actually a difficult one to reproduce in a test. Using a local URL, I'm having a hard time getting it to fail properly.

The example below uses a URL that triggers the problem when I test against production, though. Can you make any sense of why that URL would trigger this behavior, but a local /410 wouldn't?

I've tried setting a Connection: close header on local responses, but that didn't help. The server in question responds with HTTP 1.0 (in contrast to node responding with HTTP 1.1), but I don't know if that makes a difference or not.

@cgiffard let me know if you can reproduce the problem using the provided test case.

it("should not emit both 404 and timeout events", function(done) {

    this.timeout("7s");

    var localCrawler = new Crawler("www.jackan.com", "/sida/0%29;");
    var didTimeout = false;
    localCrawler.timeout = 1000;

    localCrawler.on("fetchtimeout", function(queueItem) {
        console.log("fetchtimeout", queueItem.url);
        didTimeout = true;
    });

    localCrawler.on("fetch404", function() {
        setTimeout(function() {
            didTimeout.should.equal(false);
            done();
        }, localCrawler.timeout * 4);
    });

    localCrawler.start();
});
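
If it makes any difference, a raw socket server along these lines would be one way to serve an HTTP/1.0 404 with Connection: close locally (just a sketch of the idea; I haven't managed to turn anything local into a failing test so far):

var net = require("net");

// Sketch only: answer every request with an HTTP/1.0 404 and close the
// connection immediately, to mimic the production server's behavior.
// The port number is arbitrary.
var server = net.createServer(function(socket) {
    socket.once("data", function() {
        socket.end(
            "HTTP/1.0 404 Not Found\r\n" +
            "Connection: close\r\n" +
            "Content-Length: 0\r\n" +
            "\r\n"
        );
    });
});

server.listen(27910);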

@fredrikekelund
Collaborator Author

@cgiffard do you think we can merge this and release a new version as a stopgap until we find the root of the problem?

cgiffard added a commit that referenced this pull request on Oct 2, 2015
Prevent timeout events from being emitted when item is already fetched
cgiffard merged commit c3ad5c8 into simplecrawler:master on Oct 2, 2015
@fredrikekelund
Collaborator Author

@cgiffard thanks for taking the time to push the new release 🌟 👍

@gorbatenkov
Contributor

Hello, after this commit I found another problem here:

clientRequest.setTimeout(crawler.timeout, function() {
    if (queueItem.fetched) {
        return;
    }

    if (crawler.running && !queueItem.fetched) {
        crawler._openRequests--;
    }

    queueItem.fetched = true;
    queueItem.status = "timeout";
    crawler.emit("fetchtimeout", queueItem, crawler.timeout);
    clientRequest._crawlerHandled = true;
    clientRequest.abort();
});

In my test case, when several redirects happen one after another, the socket has already died but clientRequest is still alive. When the timeout event fires, queueItem.fetched is already true, so we never reach clientRequest.abort(), and the crawl stalls forever because _openRequests == maxConcurrency.
I've modified the code like this, and everything works fine on several projects:

clientRequest.setTimeout(crawler.timeout, function() {
    if (clientRequest._crawlerHandled) {
        return;
    }

    if (crawler.running && !queueItem.fetched) {
        crawler._openRequests--;
    }

    if (!queueItem.fetched) {
        queueItem.fetched = true;
        queueItem.status = "timeout";
        crawler.emit("fetchtimeout", queueItem, crawler.timeout);
    }

    clientRequest._crawlerHandled = true;
    clientRequest.abort();
});
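
For context, here is a simplified sketch of the kind of setup I mean: a server that answers every request with a redirect to the next path and closes the socket straight away (the port and paths are made up, and this is only an illustration, not my actual project):

var http = require("http");

// Sketch only: every request gets redirected to a new path and the
// connection is closed, so the crawler keeps following redirects on
// sockets that are already gone.
var hops = 0;
var server = http.createServer(function(req, res) {
    hops++;
    res.writeHead(301, {
        "Location": "/hop/" + hops,
        "Connection": "close"
    });
    res.end();
});

server.listen(27911);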

Please tell me if I can provide more useful information.

@fredrikekelund
Collaborator Author

Hmm, I see. Do you know anything more about the conditions that cause this problem? I had a lot of trouble creating a reproducible scenario for the first issue here.

Anyway, if you open a PR we could still merge it. These changes look good to me and make sense (although @cgiffard would know a lot more about how the code behaves).

Successfully merging this pull request may close these issues.

events thrown for notfound and then for timeout