
Stop overriding default port configurations #40

Merged
merged 1 commit into cgiffard:master

2 participants

@distracteddev

I was getting internal client errors because the Queue was being populated with port:80 queueItems even though the protocol was listed as https. This caused a call to https.get({port: 80}), which throws an OpenSSL error:

[Error: 140735148347776:error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol:../deps/openssl/openssl/ssl/s23_clnt.c:766:
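
For context, the failure is reproducible outside the crawler. A minimal sketch (example.com is just a stand-in for any host that serves plain HTTP on port 80):

var https = require("https");

// Forcing port 80 makes the TLS client talk to a plaintext HTTP endpoint,
// so the handshake fails with "unknown protocol".
https.get({ host: "example.com", port: 80, path: "/" }, function (res) {
  console.log("unexpected success:", res.statusCode);
}).on("error", function (err) {
  console.error(err); // e.g. SSL23_GET_SERVER_HELLO:unknown protocol
});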
@cgiffard
Owner

Sorry I've been slow off the mark with this - I've been bedridden with a terrible flu. This looks good to me! Sorry it was causing you trouble.

@cgiffard cgiffard merged commit 845faca into cgiffard:master
@cgiffard
Owner

I'll push to npm soonish, but I'm trying to resolve an error/semantic problem (#39) before I do so. :)

@distracteddev

No worries and thanks for the great crawler :) Autodiscovery saved me a ton of time.

One issue I'm still having, however, is that when I try to discover 50+ links, the app ends up in a state where

if (crawler._openRequests >= crawler.maxConcurrency) 

always evaluates to true. The open requests never time out (even after setting the timeout option to something small like 5 or 10 seconds), and the app is left spinning, waiting on requests that never seem to end.

I've tried to fix it for hours to no avail. As such, I was thinking of porting the Queue to use Async's built-in queue implementation to handle making and receiving requests. Any thoughts on this?
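
For reference, a rough sketch of what routing fetches through Async's queue could look like (the concurrency value, worker body, and URL are placeholders, not the crawler's actual internals):

var async = require("async");
var http = require("http");

var maxConcurrency = 5; // placeholder, mirroring crawler.maxConcurrency

// Each worker fetches one queue item and must call done() to free its slot.
var fetchQueue = async.queue(function (queueItem, done) {
  http.get(queueItem.url, function (res) {
    res.on("data", function () {}); // drain the response body
    res.on("end", done);
  }).on("error", done);
}, maxConcurrency);

fetchQueue.drain = function () {
  console.log("all queued requests have completed");
};

fetchQueue.push({ url: "http://example.com/" });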

@cgiffard
Owner

(BTW I totally forgot to tell you, but 0.2.7 is on npm now. :))

Regarding that issue, perhaps I need to be tighter with the timeouts. I'm pretty sure the problem does not relate to the queue, and I'd like it to retain a relatively general interface so that a queue can be implemented, for example, in Redis, and shared between multiple machines running the same crawl between them.
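
A rough sketch of the kind of pluggable queue backend that interface would allow (the method names are illustrative, not the crawler's actual API; assumes a node_redis client):

// Illustrative only: a Redis-backed queue that several crawler processes could share.
function RedisQueue(client, key) {
  this.client = client; // node_redis client (assumption)
  this.key = key;       // Redis list key shared by every machine in the crawl
}

RedisQueue.prototype.add = function (queueItem, callback) {
  this.client.rpush(this.key, JSON.stringify(queueItem), callback);
};

RedisQueue.prototype.oldestUnfetchedItem = function (callback) {
  this.client.lpop(this.key, function (err, raw) {
    callback(err, raw ? JSON.parse(raw) : null);
  });
};

// Usage (hypothetical): new RedisQueue(require("redis").createClient(), "crawl:queue")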

I'll have a look into whether there's a simple way to fix this problem for you - that's most definitely 100% a bug. :)

@distracteddev

I looked into it more and I think the Queue is functioning properly; it's just that some requests were taking 150+ seconds to respond. I also realized you aren't using the timeout anywhere, so I added:

// Around Line 700 of crawler.js
// Abort the request if its socket sits idle for longer than crawler.timeout (ms)
clientRequest.setTimeout(crawler.timeout, function () {
  console.log('TIMEOUT REACHED', queueItem.url);
  clientRequest.abort();
});

and that fixed my problem right up once I set the timeout to 10 seconds.
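
For anyone configuring this, a quick sketch of setting a short timeout (assuming the npm package name simplecrawler, its plain constructor, and that the timeout option is expressed in milliseconds):

var Crawler = require("simplecrawler");

var crawler = new Crawler("example.com");
crawler.timeout = 10 * 1000; // assumed to be milliseconds: give up after ten seconds

crawler.start();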

@cgiffard
Owner

Sure, the missing timer is the bug. :)

I've got a bit of stuff on right now but I'll see if I can get to this later today. Thanks!

@cgiffard
Owner

I've made issue #41 to track the timer bug described here. :)

@SaltwaterC SaltwaterC referenced this pull request in SaltwaterC/http-request
Closed

Strange SSL error when trying to get this URL #19

Commits on May 31, 2013
  1. @distracteddev

    Fix bug where https requests were being sent with port:80 option which would end up in an OpenSSL error.

    distracteddev authored
Showing with 5 additions and 0 deletions.
  1. +5 −0 lib/crawler.js
@@ -654,6 +654,11 @@ Crawler.prototype.fetchQueueItem = function(queueItem) {
"User-Agent": crawler.userAgent
}
};
+
+ // If port is one of the HTTP/HTTPS defaults, delete the option to avoid conflicts
+ if (requestOptions.port === 80 || requestOptions.port === 443) {
+     delete requestOptions.port;
+ }
// Add cookie header from cookie jar if we're configured to
// send/accept cookies