Merge branch 'master' of github.com:cgiffard/node-simplecrawler

commit 091e38a11fcf8c36da9afa937c92202ad37fe017 2 parents 42c5742 + 07eb35c
@cgiffard authored
Showing with 8 additions and 6 deletions.
  1. +4 −3 README.markdown
  2. +4 −3 lib/crawler.js
7 README.markdown
@@ -219,6 +219,7 @@ var conditionID = myCrawler.addFetchCondition(function(parsedURL) {
return !parsedURL.path.match(/\.pdf$/i);
});
```
+NOTE: simplecrawler uses slightly different terminology to URIjs. `parsedURL.path` includes the query string too. If you want the path without the query string, use `parsedURL.uriPath`.
##### Removing a fetch condition
@@ -257,7 +258,7 @@ that be a pain. Instead, use the `queue.add` function provided for your
convenience:
```javascript
-crawler.queue.add(protocol,domain,port,path);
+crawler.queue.add(protocol,hostname,port,path);
```
That's it! It's basically just a URL, but comma separated (that's how you can
@@ -271,7 +272,7 @@ is expected to have:
* `url` - The complete, canonical URL of the resource.
* `protocol` - The protocol of the resource (http, https)
-* `domain` - The full domain of the resource
+* `host` - The full domain/hostname of the resource
* `port` - The port of the resource
* `path` - The bit of the URL after the domain - includes the querystring.
* `fetched` - Has the request for this item been completed? You can monitor this as requests are processed.
@@ -394,4 +395,4 @@ I'd like to extend sincere thanks to:
* [Mike Iannacone](https://github.com/mikeiannacone) for correcting a keyword naming collision with node 0.8's EventEmitter.
* [Greg Molnar](https://github.com/gregmolnar) for [adding a querystring-free path parameter to parsed URL objects.](https://github.com/cgiffard/node-simplecrawler/pull/31)
-And everybody else who has helped out in some way! :)
+And everybody else who has helped out in some way! :)
7 lib/crawler.js
@@ -257,9 +257,10 @@ Crawler.prototype.processURL = function(URL,context) {
// simplecrawler uses slightly different terminology to URIjs. Sorry!
return {
"protocol": newURL.protocol() || "http",
- "host": newURL.hostname(),
- "port": newURL.port() || 80,
- "path": newURL.resource()
+ "host": newURL.hostname(),
+ "port": newURL.port() || 80,
+ "path": newURL.resource(),
+ "uriPath": newURL.path()
};
};
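
The practical difference between `path` and `uriPath` in the object returned above shows up in fetch conditions like the PDF filter from the README. The following is a minimal sketch, not the library's code: `parseLikeSimplecrawler` is a hypothetical helper that mimics the returned shape using Node's built-in `URL` class instead of URIjs, so its behaviour on edge cases may differ from `processURL`. The example URL is likewise made up for illustration.

```javascript
// Hypothetical stand-in for the object processURL returns. It uses Node's
// built-in URL class rather than URIjs, so this is an approximation only.
function parseLikeSimplecrawler(rawURL) {
    var u = new URL(rawURL);
    return {
        "protocol": u.protocol.replace(/:$/, "") || "http",
        "host": u.hostname,
        "port": Number(u.port) || 80,
        "path": u.pathname + u.search, // includes the query string
        "uriPath": u.pathname          // path only, query string stripped
    };
}

var parsed = parseLikeSimplecrawler("http://example.com/docs/report.pdf?download=1");

// Matching on `path` does not recognise the PDF, because the string ends
// with the query string rather than ".pdf".
var fetchByPath = !parsed.path.match(/\.pdf$/i);

// Matching on `uriPath` sees only "/docs/report.pdf", so the PDF is excluded.
var fetchByUriPath = !parsed.uriPath.match(/\.pdf$/i);

console.log(parsed.path, fetchByPath);
console.log(parsed.uriPath, fetchByUriPath);
```

A fetch condition meant to skip PDFs regardless of query parameters would therefore likely want to test `parsedURL.uriPath` rather than `parsedURL.path`.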