
Big update, see README

sylvinus committed Sep 8, 2012
1 parent da5598f commit fb41724c200abde0dfc03090d14813d694cfbe2f
Showing with 1,394 additions and 803 deletions.
  1. +1 −0 .gitignore
  2. +108 −33 README.md
  3. +227 −151 lib/crawler.js
  4. +17 −20 package.json
  5. +10 −0 test/mockserver.js
  6. +0 −554 test/selector.js
  7. +0 −26 test/simple.js
  8. +16 −2 test/testrunner.js
  9. +39 −0 test/units/errors.js
  10. +29 −0 test/units/forceutf8.js
  11. +64 −0 test/units/simple.js
  12. +33 −17 test/{index.html → units/sizzle.html}
  13. +848 −0 test/units/sizzle.js
  14. +2 −0 vendor/jquery-1.8.1.min.js
.gitignore
@@ -1 +1,2 @@
node_modules
+.DS_Store
README.md
@@ -1,58 +1,58 @@
-[![build status](https://secure.travis-ci.org/joshfire/node-crawler.png)](http://travis-ci.org/joshfire/node-crawler)
node-crawler
------------
-How to install
+node-crawler aims to be the best crawling/scraping package for Node.
- $ npm install crawler
+It features:
+ * A clean, simple API
+ * Server-side DOM & automatic jQuery insertion
+ * Configurable pool size and retries
+ * Priority of requests
+ * forceUTF8 mode that lets node-crawler handle charset detection and conversion for you
+ * A local cache
-How to test
+The case for creating this package was made at ParisJS #2 in 2010 ([lightning talk slides](http://www.slideshare.net/sylvinus/web-crawling-with-nodejs))
- $ node test/simple.js
- $ node test/testrunner.js
-
+Help & forks welcome!
-Why / What ?
-------------
-
-For now just check my [lightning talk slides](http://www.slideshare.net/sylvinus/web-crawling-with-nodejs)
+How to install
+--------------
-Help & Forks welcomed! This is just starting for now.
+ $ npm install crawler
-Rough todolist :
-
- * Make Sizzle tests pass (jsdom bug? https://github.com/tmpvar/jsdom/issues#issue/81)
- * More crawling tests
- * Document the API
- * Get feedback on featureset for a 1.0 release (option for autofollowing links?)
- * Check how we can support other mimetypes than HTML
- * Add+test timeout parameter
- * Option to wait for callback to finish before freeing the pool resource (via another callback like next())
- * Events on queue empty / full
-API
----
+Crash course
+------------
- var Crawler = require("node-crawler").Crawler;
+ var Crawler = require("crawler").Crawler;
var c = new Crawler({
"maxConnections":10,
+
+ // This will be called for each crawled page
"callback":function(error,result,$) {
+
+ // $ is a jQuery instance scoped to the server-side DOM of the page
$("#content a:link").each(function(a) {
c.queue(a.href);
- })
+ });
}
});
- // Queue a list of URLs, with default callback
- c.queue(["http://jamendo.com/","http://tedxparis.com", ...]);
+ // Queue just one URL, with default callback
+ c.queue("http://joshfire.com");
+
+ // Queue a list of URLs
+ c.queue(["http://jamendo.com/","http://tedxparis.com"]);
- // Queue URLs with custom callbacks
+ // Queue URLs with custom callbacks & parameters
c.queue([{
- "uri":"http://parisjs.org/register",
- "method":"POST",
- "callback":function(error,result,$) {
- $("div:contains(Thank you)").after(" very much");
+ "uri":"http://parishackers.org/",
+ "jQuery":false,
+
+ // The global callback won't be called
+ "callback":function(error,result) {
+ console.log("Grabbed",result.body.length,"bytes");
}
}]);
@@ -61,10 +61,85 @@ API
"html":"<p>This is a <strong>test</strong></p>"
}]);
+
+Options reference
+-----------------
+
+You can pass these options to the Crawler() constructor if you want them to be global, or as
+items in the queue() calls if you want them to be specific to that item (overriding the global options).
+
+This options list is a strict superset of mikeal's request options and will be directly passed to
+the request() method.
+
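+For example, a minimal sketch of a global option being overridden per item (the URLs are placeholders):
+
+    var Crawler = require("crawler").Crawler;
+    var c = new Crawler({
+        "timeout":60000,    // global default for every queued item
+        "retries":3,
+        "callback":function(error,result,$) { /* ... */ }
+    });
+
+    // This item overrides the global timeout
+    c.queue([{
+        "uri":"http://example.com/slow-page",
+        "timeout":120000
+    }]);
+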
+Basic request options:
+
+ * uri: String, the URL you want to crawl
+ * timeout: Number, in milliseconds (Default 60000)
+ * method, xxx: All of mikeal's request options are accepted
+
+Callbacks:
+
+ * callback(error, result, $): Called after each completed request
+ * onDrain(): Called when there are no more queued requests
+
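+For instance, a small sketch using onDrain to detect the end of a crawl (the URL is a placeholder):
+
+    var Crawler = require("crawler").Crawler;
+    var c = new Crawler({
+        "callback":function(error,result,$) { /* handle each page */ },
+        "onDrain":function() {
+            console.log("Queue is empty, crawl finished.");
+        }
+    });
+    c.queue("http://example.com/");
+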
+Pool options:
+
+ * maxConnections: Number, Size of the worker pool (Default 10)
+ * priorityRange: Number, Range of acceptable priorities starting from 0 (Default 10)
+ * priority: Number, Priority of this request (Default 5)
+
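+An illustrative sketch (assuming lower numbers are served first; the URLs are placeholders):
+
+    var Crawler = require("crawler").Crawler;
+    var c = new Crawler({
+        "maxConnections":2,    // at most 2 parallel requests
+        "priorityRange":10,    // accept priorities 0 to 9
+        "callback":function(error,result,$) { /* ... */ }
+    });
+
+    // Assumption: priority 0 is served before the default priority of 5
+    c.queue([{ "uri":"http://example.com/urgent", "priority":0 }]);
+    c.queue(["http://example.com/a","http://example.com/b"]);
+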
+Retry options:
+
+ * retries: Number of retries if the request fails (Default 3)
+ * retryTimeout: Number of milliseconds to wait before retrying (Default 10000)
+
+Server-side DOM options:
+
+ * jQuery: Boolean, if true creates a server-side DOM and adds jQuery (Default true)
+ * jQueryUrl: String, path to the jQuery file you want to insert (Defaults to bundled jquery-1.8.1.min.js)
+
+Charset encoding:
+
+ * forceUTF8: Boolean, if true will try to detect the page charset and convert it to UTF-8 if necessary. Never worry about encoding anymore! (Default false)
+
+Cache:
+
+ * cache: Boolean, if true stores requests in memory (Default false)
+ * skipDuplicates: Boolean, if true skips URIs that were already crawled, without even calling callback() (Default false)
+
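+For example, a sketch of a "polite" recrawl setup combining these options (values are illustrative):
+
+    var Crawler = require("crawler").Crawler;
+    var c = new Crawler({
+        "forceUTF8":true,       // detect & convert charsets automatically
+        "cache":true,           // keep responses in memory
+        "skipDuplicates":true,  // never crawl the same URI twice
+        "callback":function(error,result,$) { /* ... */ }
+    });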
+
+
+How to test
+-----------
+
+ $ npm install && npm test
+
+Feel free to add more tests!
+
+
+Rough todolist
+--------------
+
+ * Make Sizzle tests pass (jsdom bug? https://github.com/tmpvar/jsdom/issues#issue/81)
+ * More crawling tests
+ * Document the API
+ * Get feedback on featureset for a 1.0 release (option for autofollowing links?)
+ * Check how we can support mimetypes other than HTML
+ * Option to wait for callback to finish before freeing the pool resource (via another callback like next())
+
ChangeLog
---------
+0.2.0
+ - Updated code & dependencies for node 0.6/0.8, cleaned package.json
+ - Added a forceUTF8 mode
+ - Added real unit tests & travis-ci
+ - Added some docs!
+ - Added Crawler.onDrain()
+ - Code refactor
+ - [BACKWARD-INCOMPATIBLE] Timeout parameters now in milliseconds (weren't documented)
+
0.1.0
- Updated dependencies, notably node 0.4.x
- Fixes jQuery being redownloaded at each page + include it in the tree