Updated documentation. Added quickcrawl methods. More doco fixes to come.
Complete TomDoc and test coverage to come.
commit b14b1ddcf64de519ba21e1f39d136d747ce6ae01 1 parent 098d64e
@cgiffard authored
145 README.markdown
@@ -1,6 +1,20 @@
# Simple web-crawler for node.js [![Build Status](https://travis-ci.org/cgiffard/node-simplecrawler.png?branch=master)](https://travis-ci.org/cgiffard/node-simplecrawler)
-Simplecrawler is designed to provide the most basic possible API for crawling websites, while being as flexible and robust as possible. I wrote simplecrawler to archive, analyse, and search some very large websites. It has happily chewed through 50,000 pages and written tens of gigabytes to disk without issue.
+Simplecrawler is designed to provide the most basic possible API for crawling
+websites, while being as flexible and robust as possible. I wrote simplecrawler
+to archive, analyse, and search some very large websites. It has happily chewed
+through 50,000 pages and written tens of gigabytes to disk without issue.
+
+#### Example (simple mode)
+
+```javascript
+var Crawler = require("simplecrawler");
+
+Crawler.crawl("http://example.com/")
+ .on("fetchcomplete",function(queueItem){
+ console.log("Completed fetching resource:",queueItem.url);
+ });
+```
### What does simplecrawler do?
@@ -19,13 +33,51 @@ npm install simplecrawler
### Getting Started
-Creating a new crawler is very simple. First you'll need to include it:
+There are two ways of instantiating a new crawler - a simple but less flexible
+method inspired by [anemone](http://anemone.rubyforge.org), and the traditional
+method which provides a little more room to configure crawl parameters.
+
+Regardless of whether you use the simple or traditional method of instantiation,
+you'll need to require simplecrawler:
+
+```javascript
+var Crawler = require("simplecrawler");
+```
+
+#### Simple Mode
+
+Simple mode generates a new crawler for you, preconfigures it based on a URL you
+provide, and returns the crawler so you can carry out further configuration and
+attach event handlers.
+
+Simply call `Crawler.crawl` with a URL as the first parameter, and optionally
+two functions that will be added as event listeners for `fetchcomplete` and
+`fetcherror` respectively.
```javascript
-var Crawler = require("simplecrawler").Crawler;
+Crawler.crawl("http://example.com/", function(queueItem){
+ console.log("Completed fetching resource:",queueItem.url);
+});
```
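
If you supply both callbacks, the second is attached as a `fetcherror`
listener. A minimal sketch (assuming the `fetcherror` event passes the queue
item and the HTTP response, as emitted in lib/crawler.js):

```javascript
Crawler.crawl("http://example.com/",
	function(queueItem) {
		console.log("Completed fetching resource:", queueItem.url);
	},
	function(queueItem, response) {
		console.log("Failed to fetch resource:", queueItem.url,
			"(status " + response.statusCode + ")");
	});
```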
-Then create your crawler:
+Alternatively, if you decide to omit these functions, you can use the returned
+crawler object to add the event listeners yourself and tweak configuration
+options:
+
+```javascript
+var crawler = Crawler.crawl("http://example.com/");
+
+crawler.interval = 500;
+
+crawler.on("fetchcomplete",function(queueItem){
+ console.log("Completed fetching resource:",queueItem.url);
+});
+```
+
+#### Advanced Mode
+
+The alternative method of creating a crawler is to call the `simplecrawler`
+constructor yourself and initiate the crawl manually.
```javascript
var myCrawler = new Crawler("www.example.com");
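
// A minimal sketch of the manual kick-off this section describes; start() is
// defined on Crawler.prototype in lib/crawler.js further down this commit.
myCrawler.start();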
@@ -143,15 +195,53 @@ Here's a complete list of what you can stuff with at this stage:
* `crawler.authUser` - Username provided for needsAuth flag
* `crawler.authPass` - Password provided for needsAuth flag
+#### Excluding certain resources from downloading
+
+Simplecrawler has a mechanism you can use to prevent certain resources from being
+fetched, based on the URL, called *Fetch Conditions*. A fetch condition is just a
+function which, when given a parsed URL object, returns a true or false value
+indicating whether a given resource should be downloaded.
+
+You may add as many fetch conditions as you like, and remove them at runtime.
+Simplecrawler will evaluate every single condition against every queued URL, and
+should just one of them return a falsy value (this includes null and undefined,
+so remember to always return a value!) then the resource in question will not be
+fetched.
+
+##### Adding a fetch condition
+
+This example fetch condition prevents URLs ending in `.pdf` from downloading.
+Adding a fetch condition assigns it an ID, which the `addFetchCondition` function
+returns. You can use this ID to remove the condition later.
+
+```javascript
+var conditionID = myCrawler.addFetchCondition(function(parsedURL) {
+ return !parsedURL.path.match(/\.pdf$/i);
+});
+```
+
+##### Removing a fetch condition
+
+If you stored the ID of the fetch condition you added earlier, you can remove it
+from the crawler:
+
+```javascript
+myCrawler.removeFetchCondition(conditionID);
+```
+
### The Simplecrawler Queue
-Simplecrawler has a queue like any other web crawler. It can be directly accessed at `crawler.queue` (assuming you called your Crawler() object `crawler`.) It provides array access, so you can get to queue items just with array notation and an index.
+Simplecrawler has a queue like any other web crawler. It can be directly accessed
+at `crawler.queue` (assuming you called your Crawler() object `crawler`.) It
+provides array access, so you can get to queue items just with array notation
+and an index.
```javascript
crawler.queue[5];
```
-For compatibility with different backing stores, it now provides an alternate interface which the crawler core makes use of:
+For compatibility with different backing stores, it now provides an alternate
+interface which the crawler core makes use of:
```javascript
crawler.queue.get(5);
@@ -161,17 +251,23 @@ It's not just an array though.
#### Adding to the queue
-You could always just `.push` a new resource onto the queue, but you'd need to have it all in the correct format, and validate the URL yourself, and oh wouldn't that be a pain. Instead, use the `queue.add` function provided for your convenience:
+You could always just `.push` a new resource onto the queue, but you'd need to
+have it all in the correct format, and validate the URL yourself, and oh wouldn't
+that be a pain. Instead, use the `queue.add` function provided for your
+convenience:
```javascript
crawler.queue.add(protocol,domain,port,path);
```
-That's it! It's basically just a URL, but comma separated (that's how you can remember the order.)
+That's it! It's basically just a URL, but comma separated (that's how you can
+remember the order.)
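
For instance, queueing a (hypothetical) resource by hand, using the parameter
order above, might look like this:

```javascript
crawler.queue.add("http", "example.com", 80, "/archive/2012/index.html");
```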
#### Queue items
-Because when working with simplecrawler, you'll constantly be handed queue items, it helps to know what's inside them. These are the properties every queue item is expected to have:
+Because when working with simplecrawler, you'll constantly be handed queue items,
+it helps to know what's inside them. These are the properties every queue item
+is expected to have:
* `url` - The complete, canonical URL of the resource.
* `protocol` - The protocol of the resource (http, https)
@@ -206,11 +302,15 @@ queueItem.stateData.contentLength;
queueItem.status === "queued";
```
-As you can see, you can get a lot of meta-information out about each request. The upside is, the queue actually has some convenient functions for getting simple aggregate data about the queue...
+As you can see, you can get a lot of meta-information about each request. The
+upside is, the queue actually has some convenient functions for getting simple
+aggregate data about the queue...
#### Queue Statistics and Reporting
-First of all, the queue can provide some basic statistics about the network performance of your crawl (so far.) This is done live, so don't check it thirty times a second. You can test the following properties:
+First of all, the queue can provide some basic statistics about the network
+performance of your crawl (so far.) This is done live, so don't check it thirty
+times a second. You can test the following properties:
* `requestTime`
* `requestLatency`
@@ -218,7 +318,9 @@ First of all, the queue can provide some basic statistics about the network perf
* `contentLength`
* `actualDataSize`
-And you can get the maximum, minimum, and average values for each with the `crawler.queue.max`, `crawler.queue.min`, and `crawler.queue.avg` functions respectively. Like so:
+And you can get the maximum, minimum, and average values for each with the
+`crawler.queue.max`, `crawler.queue.min`, and `crawler.queue.avg` functions
+respectively. Like so:
```javascript
console.log("The maximum request latency was %dms.",crawler.queue.max("requestLatency"));
@@ -226,9 +328,13 @@ console.log("The minimum download time was %dms.",crawler.queue.min("downloadTim
console.log("The average resource size received is %d bytes.",crawler.queue.avg("actualDataSize"));
```
-You'll probably often need to determine how many items in the queue have a given status at any one time, and/or retreive them. That's easy with `crawler.queue.countWithStatus` and `crawler.queue.getWithStatus`.
+You'll probably often need to determine how many items in the queue have a given
+status at any one time, and/or retrieve them. That's easy with
+`crawler.queue.countWithStatus` and `crawler.queue.getWithStatus`.
-`crawler.queue.getwithStatus` returns the number of queued items with a given status, while `crawler.queue.getWithStatus` returns an array of the queue items themselves.
+`crawler.queue.countWithStatus` returns the number of queued items with a given
+status, while `crawler.queue.getWithStatus` returns an array of the queue items
+themselves.
```javascript
var redirectCount = crawler.queue.countWithStatus("redirected");
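
// ...and a sketch of the companion call described above, which returns the
// matching queue items themselves (hypothetical variable name):
var redirectedItems = crawler.queue.getWithStatus("redirected");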
@@ -247,9 +353,16 @@ Then there's some even simpler convenience functions:
#### Saving and reloading the queue (freeze/defrost)
-You'll probably want to be able to save your progress and reload it later, if your application fails or you need to abort the crawl for some reason. (Perhaps you just want to finish off for the night and pick it up tomorrow!) The `crawler.queue.freeze` and `crawler.queue.defrost` functions perform this task.
+You'll probably want to be able to save your progress and reload it later, if
+your application fails or you need to abort the crawl for some reason. (Perhaps
+you just want to finish off for the night and pick it up tomorrow!) The
+`crawler.queue.freeze` and `crawler.queue.defrost` functions perform this task.
-**A word of warning though** - they are not CPU friendly or set up to be asynchronous, as they rely on JSON.parse and JSON.stringify. Use them only when you need to save the queue - don't call them every request or your application's performance will be incredibly poor - they block like *crazy*. That said, using them when your crawler commences and stops is perfectly reasonable.
+**A word of warning though** - they are not CPU friendly or set up to be
+asynchronous, as they rely on JSON.parse and JSON.stringify. Use them only when
+you need to save the queue - don't call them on every request or your application's
+performance will be incredibly poor - they block like *crazy*. That said, using
+them when your crawler commences and stops is perfectly reasonable.
```javascript
// Freeze queue
12 example/quickcrawl-example.js
@@ -0,0 +1,12 @@
+// Example demonstrating the simple (but less flexible) way of initiating
+// a crawler.
+
+var Crawler = require("../lib");
+
+Crawler.crawl("http://deewr.gov.au/")
+ .on("fetchstart",function(queueItem){
+ console.log("Starting request for:",queueItem.url);
+ })
+ .on("fetchcomplete",function(queueItem){
+ console.log("Completed fetching resource:",queueItem.url);
+ });
92 lib/crawler.js
@@ -34,7 +34,7 @@ var Crawler = function(host,initialPath,initialPort,interval) {
// (as long as concurrency is under cap)
// One request will be spooled per tick, up to the concurrency threshold.
this.interval = interval || 250;
-
+
// Maximum request concurrency. Be sensible. Five ties in with node's
// default maxSockets value.
this.maxConcurrency = 5;
@@ -50,7 +50,7 @@ var Crawler = function(host,initialPath,initialPort,interval) {
// Queue for requests - FetchQueue gives us stats and other sugar
// (but it's basically just an array)
this.queue = new FetchQueue();
-
+
// Do we filter by domain?
// Unless you want to be crawling the entire internet, I would
// recommend leaving this on!
@@ -66,10 +66,6 @@ var Crawler = function(host,initialPath,initialPort,interval) {
// Or go even further and strip WWW subdomain from domains altogether!
this.stripWWWDomain = false;
- // Use simplecrawler's internal resource discovery function (switch it off
- // if you'd prefer to discover and queue resources yourself!)
- this.discoverResources = true;
-
// Internal cachestore
this.cache = null;
@@ -109,20 +105,14 @@ var Crawler = function(host,initialPath,initialPort,interval) {
this.downloadUnsupported = true;
// STATE (AND OTHER) VARIABLES NOT TO STUFF WITH
- var crawler = this;
- var openRequests = 0;
+ this.openRequests = 0;
this.fetchConditions = [];
-
- // Externally accessible function for auditing the number of open requests...
- crawler.openRequests = function() {
- return openRequests;
- };
-
};
Crawler.prototype = new EventEmitter();
Crawler.prototype.start = function() {
+ var crawler = this;
// only if we haven't already got stuff in our queue...
if (!this.queue.length) {
@@ -138,7 +128,10 @@ Crawler.prototype.start = function() {
});
}
- this.crawlIntervalID = setInterval(this.crawl,this.interval);
+ this.crawlIntervalID = setInterval(function() {
+ crawler.crawl.call(crawler);
+ },this.interval);
+
this.emit("crawlstart");
this.running = true;
@@ -252,7 +245,7 @@ Crawler.prototype.discoverResources = function(resourceData,queueItem) {
cleanAndQueue(
resourceText.match(regex)));
},[]);
-}
+};
// Checks to see whether domain is valid for crawling.
Crawler.prototype.domainValid = function(host) {
@@ -309,22 +302,24 @@ Crawler.prototype.domainValid = function(host) {
domainInWhitelist(host) ||
// Or if we're scanning subdomains, and this domain is a subdomain of the crawler's set domain, return true.
(crawler.scanSubdomains && isSubdomainOf(host,crawler.host)));
-}
+};
// Input some text/html and this function will delegate resource discovery, check link validity
// and queue up resources for downloading!
-function queueLinkedItems(resourceData,queueItem) {
- var resources = this.discoverResources(resourceData,queueItem);
+Crawler.prototype.queueLinkedItems = function(resourceData,queueItem) {
+ var resources = this.discoverResources(resourceData,queueItem),
+ crawler = this;
// Emit discovered resources. ie: might be useful in building a graph of
// page relationships.
this.emit("discoverycomplete",queueItem,resources);
- resources.forEach(function(url){ queueURL(url,queueItem); });
-}
+ resources.forEach(function(url){ crawler.queueURL(url,queueItem); });
+};
// Clean and queue a single URL...
-function queueURL(url,queueItem) {
+Crawler.prototype.queueURL = function(url,queueItem) {
+ var crawler = this;
var parsedURL = typeof(url) === "object" ? url : crawler.processURL(url,queueItem);
// URL Parser decided this URL was junky. Next please!
@@ -344,7 +339,7 @@ function queueURL(url,queueItem) {
}
// Check the domain is valid before adding it to the queue
- if (domainValid(parsedURL.host)) {
+ if (crawler.domainValid(parsedURL.host)) {
try {
crawler.queue.add(
parsedURL.protocol,
@@ -366,18 +361,28 @@ function queueURL(url,queueItem) {
crawler.emit("queueerror",error,parsedURL);
}
}
-}
+};
// Fetch a queue item
-function fetchQueueItem(queueItem) {
- openRequests ++;
+Crawler.prototype.fetchQueueItem = function(queueItem) {
+ var crawler = this;
+ crawler.openRequests ++;
// Emit fetchstart event
crawler.emit("fetchstart",queueItem);
// Variable declarations
- var fetchData = false, requestOptions, clientRequest, timeCommenced, timeHeadersReceived, timeDataReceived, parsedURL;
- var responseBuffer, responseLength, responseLengthReceived, contentType;
+ var fetchData = false,
+ requestOptions,
+ clientRequest,
+ timeCommenced,
+ timeHeadersReceived,
+ timeDataReceived,
+ parsedURL,
+ responseBuffer,
+ responseLength,
+ responseLengthReceived,
+ contentType;
// Mark as spooled
queueItem.status = "spooled";
@@ -462,11 +467,11 @@ function fetchQueueItem(queueItem) {
// We only process the item if it's of a valid mimetype
// and only if the crawler is set to discover its own resources
- if (mimeTypeSupported(contentType) && crawler.discoverResources) {
- queueLinkedItems(responseBuffer,queueItem);
+ if (crawler.mimeTypeSupported(contentType) && crawler.discoverResources) {
+ crawler.queueLinkedItems(responseBuffer,queueItem);
}
- openRequests --;
+ crawler.openRequests --;
}
}
@@ -554,9 +559,9 @@ function fetchQueueItem(queueItem) {
crawler.emit("fetchredirect",queueItem,parsedURL,response);
// Clean URL, add to queue...
- queueURL(parsedURL,queueItem);
+ crawler.queueURL(parsedURL,queueItem);
- openRequests --;
+ crawler.openRequests --;
// Ignore this request, but record that we had a 404
} else if (response.statusCode === 404) {
@@ -566,7 +571,7 @@ function fetchQueueItem(queueItem) {
// Emit 404 event
crawler.emit("fetch404",queueItem,response);
- openRequests --;
+ crawler.openRequests --;
// And oh dear. Handle this one as well. (other 400s, 500s, etc)
} else {
@@ -576,12 +581,12 @@ function fetchQueueItem(queueItem) {
// Emit 5xx / 4xx event
crawler.emit("fetcherror",queueItem,response);
- openRequests --;
+ crawler.openRequests --;
}
});
clientRequest.on("error",function(errorData) {
- openRequests --;
+ crawler.openRequests --;
// Emit 5xx / 4xx event
crawler.emit("fetchclienterror",queueItem,errorData);
@@ -589,16 +594,18 @@ function fetchQueueItem(queueItem) {
queueItem.stateData.code = 599;
queueItem.status = "failed";
});
-}
+};
// Crawl init
-this.crawl = function() {
- if (openRequests > crawler.maxConcurrency) return;
+Crawler.prototype.crawl = function() {
+ var crawler = this;
+
+ if (crawler.openRequests > crawler.maxConcurrency) return;
crawler.queue.oldestUnfetchedItem(function(err,queueItem) {
if (queueItem) {
- fetchQueueItem(queueItem);
- } else if (openRequests === 0) {
+ crawler.fetchQueueItem(queueItem);
+ } else if (crawler.openRequests === 0) {
crawler.queue.complete(function(err,completeCount) {
if (completeCount === crawler.queue.length) {
crawler.emit("complete");
@@ -633,5 +640,4 @@ Crawler.prototype.removeFetchCondition = function(index) {
}
};
-module.exports = Crawler;
-module.exports.Crawler = Crawler;
+module.exports = Crawler;
12 lib/index.js
@@ -1,9 +1,13 @@
// SimpleCrawler
// Export interfaces
-module.exports =
- module.exports.crawler =
- require("./crawler.js");
+module.exports = require("./crawler.js");
+
+// Aliasing for compatibility with legacy code.
+module.exports.Crawler = module.exports;
module.exports.queue = require("./queue.js");
-module.exports.cache = require("./cache.js");
+module.exports.cache = require("./cache.js");
+
+// Convenience function for small, fast crawls
+module.exports.crawl = require("./quickcrawl.js");
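
A quick sketch of what the aliasing above gives consumers (assuming the package
is installed under its published name, `simplecrawler`): both require forms now
resolve to the same constructor.

```javascript
var Crawler = require("simplecrawler");
var LegacyCrawler = require("simplecrawler").Crawler;

console.log(Crawler === LegacyCrawler); // true
```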
73 lib/quickcrawl.js
@@ -0,0 +1,73 @@
+var Crawler = require("./crawler.js"),
+ URI = require("URIjs");
+
+
+/*
+ Public: Convenience function for really quick, simple crawls. It generates
+ a new crawler, parses the URL provided, and sets up the new crawler with
+ the host and path information extracted from the URL. It returns the crawler
+ object, so you can set up event handlers, and waits until `process.nextTick`
+ before kicking off the crawl.
+
+ url - URL to begin crawl from.
+ successCallback - Optional function called once an item is completely
+ downloaded. Functionally identical to a fetchcomplete
+ event listener.
+ failCallback - Optional function to be called if an item fails to
+ download. Functionally identical to a fetcherror
+ event listener.
+
+ Examples
+
+ Crawler.crawl(
+ "http://example.com:3000/start",
+ function(queueItem,data) {
+ console.log("I got a new item!");
+ }
+ );
+
+ Crawler
+ .crawl("http://www.example.com/")
+ .on("fetchstart",function(queueItem) {
+ console.log("Beginning fetch for",queueItem.url);
+ });
+
+ Returns the new Crawler object which has now been constructed.
+
+*/
+module.exports = function crawl(url,successCallback,failCallback) {
+
+ // Parse the URL first
+ url = URI(url);
+
+ // If either the protocol, path, or hostname are unset, we can't really
+ // do much. Die with error.
+ if (!url.protocol())
+ throw new Error("Can't crawl with unspecified protocol.");
+
+ if (!url.hostname())
+ throw new Error("Can't crawl with unspecified hostname.");
+
+ if (!url.path())
+ throw new Error("Can't crawl with unspecified path.");
+
+ var tmpCrawler =
+ new Crawler(
+ url.hostname(),
+ url.path(),
+ url.port() || 80);
+
+ // Attach callbacks if they were provided
+ if (successCallback) tmpCrawler.on("fetchcomplete",successCallback);
+ if (failCallback) tmpCrawler.on("fetcherror",failCallback);
+
+ // Start the crawler on the next runloop
+ // This enables initial configuration options and event handlers to take
+ // effect before the first resource is queued.
+ process.nextTick(function() {
+ tmpCrawler.start();
+ });
+
+ // Return crawler
+ return tmpCrawler;
+};
6 test/resourcevalidity.js
@@ -1,2 +1,6 @@
// Tests whether a given resource is considered 'valid' for crawling under
-// a number of different conditions.
+// a number of different conditions.
+
+describe("Resource validity checker",function() {
+
+});
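
A sketch of the kind of spec this suite could grow into (assuming mocha's BDD
interface, node's built-in assert module, and the domainValid method on
Crawler.prototype):

```javascript
var assert = require("assert"),
    Crawler = require("../lib");

describe("Resource validity checker", function() {

    it("should treat the crawler's own host as valid for crawling", function() {
        var crawler = new Crawler("example.com");

        // domainValid is defined on Crawler.prototype in lib/crawler.js
        assert(crawler.domainValid("example.com"));
    });
});
```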