Fix 0.8 .domain compatibility #6

Closed
pettom wants to merge 1 commit

2 participants: pettom, Christopher Giffard

pettom commented:

Simple rename of domain to hostname.
Rename all occurrences of domain to hostname in order to keep the semantics.
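
For reference, a minimal sketch of what the rename means for calling code, based only on the renamed identifiers in this diff (the `require` path and export shape are assumptions, not verified against the published package):

```javascript
// Hypothetical usage sketch: property and parameter names taken from this diff;
// the module export shape is assumed.
var Crawler = require("simplecrawler"); // assumption: the constructor is the module export

// The constructor's first argument is now the hostname (previously the domain):
var crawler = new Crawler("example.com", "/", 80, 250);

// Renamed configuration flags from this patch:
crawler.scanSubhostnames = true;                  // was crawler.scanSubdomains
crawler.ignoreWWWhostname = true;                 // was crawler.ignoreWWWDomain
crawler.hostnameWhitelist = ["cdn.example.com"];  // was crawler.domainWhitelist

// queue.add's second parameter is now named hostname:
crawler.queue.add("http", "example.com", 80, "/index.html");

crawler.start();
```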

Christopher Giffard (Owner) commented:

Looks cool, but I'm not keen on 'subhostname' et al.

I'll have a think about what to do to address the issue - and then I'll probably pull and then patch your request.

Thanks for helping out!
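
One possible direction for such a follow-up patch (purely a hypothetical sketch, not necessarily what was merged): rename only the colliding `domain` property and keep the existing 'domain'-flavoured option names, so identifiers like `subhostname` never appear:

```javascript
// Hypothetical alternative sketch (an assumption about a possible follow-up,
// not the maintainer's confirmed approach): rename the stored property only,
// leave the option names untouched.
var Crawler = function(host, initialPath, initialPort, interval) {
    this.host = host || "";           // previously this.domain

    // Option names stay as they were, so no "subhostname" identifiers:
    this.filterByDomain  = true;
    this.scanSubdomains  = false;
    this.ignoreWWWDomain = true;
    this.stripWWWDomain  = false;
    this.domainWhitelist = [];
    // ... remainder of the constructor unchanged ...
};
```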

Christopher Giffard (cgiffard) closed this pull request.
Commits on Aug 22, 2012
  1. replace domain to hostname (authored by pettom)
Showing with 85 additions and 85 deletions.
  1. +11 −11 README.markdown
  2. +1 −1  cache-backend-fs.js
  3. +67 −67 index.js
  4. +6 −6 queue.js
22 README.markdown
@@ -70,7 +70,7 @@ myCrawler.on("fetchcomplete",function(queueItem, responseBuffer, response) {
```
Then, when you're satisfied you're ready to go, start the crawler! It'll run through its queue finding linked
-resources on the domain to download, until it can't find any more.
+resources on the hostname to download, until it can't find any more.
```javascript
myCrawler.start();
@@ -112,7 +112,7 @@ If this is annoying, and you'd really like to retain error pages by default, let
Here's a complete list of what you can stuff with at this stage:
-* `crawler.domain` - The domain to scan. By default, simplecrawler will restrict all requests to this domain.
+* `crawler.hostname` - The hostname to scan. By default, simplecrawler will restrict all requests to this hostname.
* `crawler.initialPath` - The initial path with which the crawler will formulate its first request. Does not restrict subsequent requests.
* `crawler.initialPort` - The initial port with which the crawler will formulate its first request. Does not restrict subsequent requests.
* `crawler.initialProtocol` - The initial protocol with which the crawler will formulate its first request. Does not restrict subsequent requests.
@@ -121,21 +121,21 @@ Here's a complete list of what you can stuff with at this stage:
* `crawler.timeout` - The maximum time the crawler will wait for headers before aborting the request.
* `crawler.userAgent` - The user agent the crawler will report. Defaults to `Node/SimpleCrawler 0.1 (http://www.github.com/cgiffard/node-simplecrawler)`.
* `crawler.queue` - The queue in use by the crawler (Must implement the `FetchQueue` interface)
-* `crawler.filterByDomain` - Specifies whether the crawler will restrict queued requests to a given domain/domains.
-* `crawler.scanSubdomains` - Enables scanning subdomains (other than www) as well as the specified domain. Defaults to false.
-* `crawler.ignoreWWWDomain` - Treats the `www` domain the same as the originally specified domain. Defaults to true.
-* `crawler.stripWWWDomain` - Or go even further and strip WWW subdomain from requests altogether!
+* `crawler.filterByhostname` - Specifies whether the crawler will restrict queued requests to a given hostname/hostnames.
+* `crawler.scanSubhostnames` - Enables scanning subhostnames (other than www) as well as the specified hostname. Defaults to false.
+* `crawler.ignoreWWWhostname` - Treats the `www` hostname the same as the originally specified hostname. Defaults to true.
+* `crawler.stripWWWhostname` - Or go even further and strip WWW subhostname from requests altogether!
* `crawler.discoverResources` - Use simplecrawler's internal resource discovery function. Defaults to true. (switch it off if you'd prefer to discover and queue resources yourself!)
* `crawler.cache` - Specify a cache architecture to use when crawling. Must implement `SimpleCache` interface.
* `crawler.useProxy` - The crawler should use an HTTP proxy to make its requests.
* `crawler.proxyHostname` - The hostname of the proxy to use for requests.
* `crawler.proxyPort` - The port of the proxy to use for requests.
-* `crawler.domainWhitelist` - An array of domains the crawler is permitted to crawl from. If other settings are more permissive, they will override this setting.
+* `crawler.hostnameWhitelist` - An array of hostnames the crawler is permitted to crawl from. If other settings are more permissive, they will override this setting.
* `crawler.supportedMimeTypes` - An array of RegEx objects used to determine supported MIME types (types of data simplecrawler will scan for links.) If you're not using simplecrawler's resource discovery function, this won't have any effect.
* `crawler.allowedProtocols` - An array of RegEx objects used to determine whether a URL protocol is supported. This is to deal with nonstandard protocol handlers that regular HTTP is sometimes given, like `feed:`. It does not provide support for non-http protocols (and why would it!?)
* `crawler.maxResourceSize` - The maximum resource size, in bytes, which will be downloaded. Defaults to 16MB.
* `crawler.downloadUnsupported` - Simplecrawler will download files it can't parse. Defaults to true, but if you'd rather save the RAM and GC lag, switch it off.
-* `crawler.needsAuth` - Flag to specify if the domain you are hitting requires basic authentication
+* `crawler.needsAuth` - Flag to specify if the hostname you are hitting requires basic authentication
* `crawler.authUser` - Username provided for needsAuth flag
* `crawler.authPass` - Password provided for needsAuth flag
@@ -160,7 +160,7 @@ It's not just an array though.
You could always just `.push` a new resource onto the queue, but you'd need to have it all in the correct format, and validate the URL yourself, and oh wouldn't that be a pain. Instead, use the `queue.add` function provided for your convenience:
```javascript
-crawler.queue.add(protocol,domain,port,path);
+crawler.queue.add(protocol,hostname,port,path);
```
That's it! It's basically just a URL, but comma separated (that's how you can remember the order.)
@@ -171,9 +171,9 @@ Because when working with simplecrawler, you'll constantly be handed queue items
* `url` - The complete, canonical URL of the resource.
* `protocol` - The protocol of the resource (http, https)
-* `domain` - The full domain of the resource
+* `hostname` - The full hostname of the resource
* `port` - The port of the resource
-* `path` - The bit of the URL after the domain - includes the querystring.
+* `path` - The bit of the URL after the hostname - includes the querystring.
* `fetched` - Has the request for this item been completed? You can monitor this as requests are processed.
* `status` - The internal status of the item, always a string. This can be one of:
* `queued` - The resource is in the queue to be fetched, but nothing's happened to it yet.
2  cache-backend-fs.js
@@ -129,7 +129,7 @@ backend.prototype.setItem = function(queueObject,data,callback) {
callback = callback instanceof Function ? callback : function(){};
var backend = this;
- var pathStack = [queueObject.protocol, queueObject.domain, queueObject.port];
+ var pathStack = [queueObject.protocol, queueObject.hostname, queueObject.port];
pathStack = pathStack.concat(sanitisePath(queueObject.path,queueObject).split(/\/+/g));
var cacheItemExists = false;
134 index.js
@@ -11,10 +11,10 @@ var http = require("http"),
https = require("https");
// Crawler Constructor
-var Crawler = function(domain,initialPath,initialPort,interval) {
+var Crawler = function(hostname,initialPath,initialPort,interval) {
// SETTINGS TO STUFF WITH (not here! Do it when you create a `new Crawler()`)
- // Domain to crawl
- this.domain = domain || "";
+ // hostname to crawl
+ this.hostname = hostname || "";
// Gotta start crawling *somewhere*
this.initialPath = initialPath || "/";
@@ -37,18 +37,18 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
// Queue for requests - FetchQueue gives us stats and other sugar (but it's basically just an array)
this.queue = new FetchQueue();
- // Do we filter by domain?
+ // Do we filter by hostname?
// Unless you want to be crawling the entire internet, I would recommend leaving this on!
- this.filterByDomain = true;
+ this.filterByhostname = true;
- // Do we scan subdomains?
- this.scanSubdomains = false;
+ // Do we scan subhostnames?
+ this.scanSubhostnames = false;
- // Treat WWW subdomain the same as the main domain (and don't count it as a separate subdomain)
- this.ignoreWWWDomain = true;
+ // Treat WWW subhostname the same as the main hostname (and don't count it as a separate subhostname)
+ this.ignoreWWWhostname = true;
- // Or go even further and strip WWW subdomain from domains altogether!
- this.stripWWWDomain = false;
+ // Or go even further and strip WWW subhostname from hostnames altogether!
+ this.stripWWWhostname = false;
// Use simplecrawler's internal resource discovery function (switch it off if you'd prefer to discover and queue resources yourself!)
this.discoverResources = true;
@@ -66,9 +66,9 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
this.authUser = "";
this.authPass = "";
- // Domain Whitelist
- // We allow domains to be whitelisted, so cross-domain requests can be made.
- this.domainWhitelist = [];
+ // hostname Whitelist
+ // We allow hostnames to be whitelisted, so cross-hostname requests can be made.
+ this.hostnameWhitelist = [];
// Supported Protocols
this.allowedProtocols = [
@@ -98,16 +98,16 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
this.fetchConditions = [];
// Initialise our queue by pushing the initial request data into it...
- this.queue.add(this.initialProtocol,this.domain,this.initialPort,this.initialPath);
+ this.queue.add(this.initialProtocol,this.hostname,this.initialPort,this.initialPath);
- // Takes a URL, and extracts the protocol, domain, port, and resource
+ // Takes a URL, and extracts the protocol, hostname, port, and resource
function processURL(URL,URLContext) {
- var split, protocol = "http", domain = crawler.domain, port = 80, path = "/";
+ var split, protocol = "http", hostname = crawler.hostname, port = 80, path = "/";
var hostData = "", pathStack, relativePathStack, invalidPath = false;
if (URLContext) {
port = URLContext.port;
- domain = URLContext.domain;
+ hostname = URLContext.hostname;
protocol = URLContext.protocol;
path = URLContext.path;
}
@@ -115,15 +115,15 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
// Trim URL
URL = URL.replace(/^\s+/,"").replace(/\s+$/,"");
- // Check whether we're global, domain-absolute or relative
+ // Check whether we're global, hostname-absolute or relative
if (URL.match(/^http(s)?:\/\//i)) {
- // We're global. Try and extract domain and port
+ // We're global. Try and extract hostname and port
split = URL.replace(/^http(s)?:\/\//i,"").split(/\//g);
- hostData = split[0] && split[0].length ? split[0] : domain;
+ hostData = split[0] && split[0].length ? split[0] : hostname;
if (hostData.split(":").length > 0) {
hostData = hostData.split(":");
- domain = hostData[0];
+ hostname = hostData[0];
port = hostData.pop();
port = isNaN(port) ? 80 : port;
}
@@ -187,20 +187,20 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
path = "/" + pathStack.join("/");
}
- // Strip the www subdomain out if required
- if (crawler.stripWWWDomain) {
- domain = domain.replace(/^www\./ig,"");
+ // Strip the www subhostname out if required
+ if (crawler.stripWWWhostname) {
+ hostname = hostname.replace(/^www\./ig,"");
}
// Replace problem entities...
path = path.replace(/&amp;/ig,"&");
- // Ensure domain is always lower-case
- domain = domain.toLowerCase();
+ // Ensure hostname is always lower-case
+ hostname = hostname.toLowerCase();
return {
"protocol": protocol,
- "domain": domain,
+ "hostname": hostname,
"port": port,
"path": path
};
@@ -286,62 +286,62 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
return resources;
}
- // Checks to see whether domain is valid for crawling.
- function domainValid(domain) {
- function domainInWhitelist(domain) {
+ // Checks to see whether hostname is valid for crawling.
+ function hostnameValid(hostname) {
+ function hostnameInWhitelist(hostname) {
// If there's no whitelist, or the whitelist is of zero length, just return false.
- if (!crawler.domainWhitelist || !crawler.domainWhitelist.length) return false;
+ if (!crawler.hostnameWhitelist || !crawler.hostnameWhitelist.length) return false;
// Otherwise, scan through it.
- return !!crawler.domainWhitelist.reduce(function(prev,cur,index,array) {
- // If we already located the relevant domain in the whitelist...
+ return !!crawler.hostnameWhitelist.reduce(function(prev,cur,index,array) {
+ // If we already located the relevant hostname in the whitelist...
if (prev) return prev;
- // If the domain is just equal, return true.
- if (domain === cur) return true;
- // If we're ignoring WWW subdomains, and both domains, less www. are the same, return true.
- if (crawler.ignoreWWWDomain && domain.replace(/^www\./i,"") === cur.replace(/^www\./i,"")) return true;
+ // If the hostname is just equal, return true.
+ if (hostname === cur) return true;
+ // If we're ignoring WWW subhostnames, and both hostnames, less www. are the same, return true.
+ if (crawler.ignoreWWWhostname && hostname.replace(/^www\./i,"") === cur.replace(/^www\./i,"")) return true;
// Otherwise, sorry. No dice.
return false;
},false);
}
- // Checks if the first domain is a subdomain of the second
- function isSubdomainOf(subdomain,domain) {
- domainParts = domain.split(/\./g);
- subdomainParts = subdomain.split(/\./g);
+ // Checks if the first hostname is a subhostname of the second
+ function isSubhostnameOf(subhostname,hostname) {
+ hostnameParts = hostname.split(/\./g);
+ subhostnameParts = subhostname.split(/\./g);
- // If we're ignoring www, remove it from both (if www is the first domain component...)
- if (crawler.ignoreWWWDomain) {
- if (domainParts[0].match(/^www\./i)) domainParts = domainParts.slice(1);
- if (subdomainParts[0].match(/^www\./i)) domainParts = domainParts.slice(1);
+ // If we're ignoring www, remove it from both (if www is the first hostname component...)
+ if (crawler.ignoreWWWhostname) {
+ if (hostnameParts[0].match(/^www\./i)) hostnameParts = hostnameParts.slice(1);
+ if (subhostnameParts[0].match(/^www\./i)) hostnameParts = hostnameParts.slice(1);
}
- // Can't have a subdomain that's shorter than its parent.
- if (subdomain.length < domain.length) return false;
+ // Can't have a subhostname that's shorter than its parent.
+ if (subhostname.length < hostname.length) return false;
- // Loop through subdomain backwards, from TLD to least significant domain, break on first error.
- var index = subdomainParts.length - 1;
- while (index >= 0 && index >= subdomainParts.length - domainParts.length) {
- if (subdomainParts[index] !== domainParts[index]) return false;
+ // Loop through subhostname backwards, from TLD to least significant hostname, break on first error.
+ var index = subhostnameParts.length - 1;
+ while (index >= 0 && index >= subhostnameParts.length - hostnameParts.length) {
+ if (subhostnameParts[index] !== hostnameParts[index]) return false;
index --;
}
return true;
}
- // If we're not filtering by domain, just return true.
- return (!crawler.filterByDomain ||
- // Or if the domain is just the right one, return true.
- (domain === crawler.domain) ||
- // Or if we're ignoring WWW subdomains, and both domains, less www. are the same, return true.
- (crawler.ignoreWWWDomain && crawler.domain.replace(/^www\./i,"") === domain.replace(/^www\./i,"")) ||
- // Or if the domain in question exists in the domain whitelist, return true.
- domainInWhitelist(domain) ||
- // Or if we're scanning subdomains, and this domain is a subdomain of the crawler's set domain, return true.
- (crawler.scanSubdomains && isSubdomainOf(domain,crawler.domain)));
+ // If we're not filtering by hostname, just return true.
+ return (!crawler.filterByhostname ||
+ // Or if the hostname is just the right one, return true.
+ (hostname === crawler.hostname) ||
+ // Or if we're ignoring WWW subhostnames, and both hostnames, less www. are the same, return true.
+ (crawler.ignoreWWWhostname && crawler.hostname.replace(/^www\./i,"") === hostname.replace(/^www\./i,"")) ||
+ // Or if the hostname in question exists in the hostname whitelist, return true.
+ hostnameInWhitelist(hostname) ||
+ // Or if we're scanning subhostnames, and this hostname is a subhostname of the crawler's set hostname, return true.
+ (crawler.scanSubhostnames && isSubhostnameOf(hostname,crawler.hostname)));
}
// Make available externally to this scope
- crawler.isDomainValid = domainValid;
+ crawler.ishostnameValid = hostnameValid;
// Externally accessible function for auditing the number of open requests...
crawler.openRequests = function() {
@@ -374,12 +374,12 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
return false;
}
- // Check the domain is valid before adding it to the queue
- if (domainValid(parsedURL.domain)) {
+ // Check the hostname is valid before adding it to the queue
+ if (hostnameValid(parsedURL.hostname)) {
try {
crawler.queue.add(
parsedURL.protocol,
- parsedURL.domain,
+ parsedURL.hostname,
parsedURL.port,
parsedURL.path,
function queueAddCallback(error,newQueueItem) {
@@ -415,7 +415,7 @@ var Crawler = function(domain,initialPath,initialPort,interval) {
client = (queueItem.protocol === "https" ? https : http);
// Extract request options from queue;
- var requestHost = queueItem.domain,
+ var requestHost = queueItem.hostname,
requestPort = queueItem.port,
requestPath = queueItem.path;
12 queue.js
@@ -21,7 +21,7 @@ var FetchQueue = function(){
};
FetchQueue.prototype = [];
-FetchQueue.prototype.add = function(protocol,domain,port,path,callback) {
+FetchQueue.prototype.add = function(protocol,hostname,port,path,callback) {
callback = callback && callback instanceof Function ? callback : function(){};
var self = this;
@@ -32,9 +32,9 @@ FetchQueue.prototype.add = function(protocol,domain,port,path,callback) {
return callback(new Error("Port must be numeric!"));
}
- var url = protocol + "://" + domain + (port !== 80 ? ":" + port : "") + path;
+ var url = protocol + "://" + hostname + (port !== 80 ? ":" + port : "") + path;
- this.exists(protocol,domain,port,path,
+ this.exists(protocol,hostname,port,path,
function(err,exists) {
if (err) return callback(err);
@@ -42,7 +42,7 @@ FetchQueue.prototype.add = function(protocol,domain,port,path,callback) {
var queueItem = {
"url": url,
"protocol": protocol,
- "domain": domain,
+ "hostname": hostname,
"port": port,
"path": path,
"fetched": false,
@@ -59,10 +59,10 @@ FetchQueue.prototype.add = function(protocol,domain,port,path,callback) {
};
// Check if an item already exists in the queue...
-FetchQueue.prototype.exists = function(protocol,domain,port,path,callback) {
+FetchQueue.prototype.exists = function(protocol,hostname,port,path,callback) {
callback = callback && callback instanceof Function ? callback : function(){};
- var url = (protocol + "://" + domain + (port !== 80 ? ":" + port : "") + path).toLowerCase();
+ var url = (protocol + "://" + hostname + (port !== 80 ? ":" + port : "") + path).toLowerCase();
if (!!this.scanIndex[url]) {
callback(null,1);