node-crawler

A Node.js web crawler that crawls the pages of a single domain.

Usage

git clone git@github.com:gamebak/node-crawler.git
npm install
node index.js

Url configuration

// in index.js, change this URL to the site you want to crawl
let url = 'http://wiprodigital.com/';
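Because the crawler stays on a single domain, every discovered link has to be checked against the configured url. A minimal sketch of such a check, assuming the built-in WHATWG URL class is used; the helper name isInternalLink is hypothetical and is not taken from index.js:

```javascript
// Hypothetical helper, not part of index.js: decides whether a link
// belongs to the same domain as the configured start URL.
function isInternalLink(link, baseUrl) {
  try {
    // Relative links such as "/about" are resolved against the base URL.
    const resolved = new URL(link, baseUrl);
    return resolved.hostname === new URL(baseUrl).hostname;
  } catch (err) {
    // Links that cannot be parsed are treated as external and skipped.
    return false;
  }
}

const url = 'http://wiprodigital.com/';
console.log(isInternalLink('/who-we-are', url));
console.log(isInternalLink('https://www.google.com/', url));
```

Comparing hostnames only means http and https versions of the same site count as the same domain, while subdomains do not; whether that matches the behaviour of index.js is an assumption.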

I took the liberty of adding a crawl limit. On a very large website such as Google, a crawler without a limit would keep going indefinitely.

// this value is set in the constructor()
this.limitCrawling = 10;
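The effect of the limit can be sketched as a breadth-first loop that stops once limitCrawling pages have been visited. Only the limitCrawling property comes from this README; the class name, the crawl method, and the getLinks callback (a stand-in for downloading a page and extracting its links) are assumptions made for illustration, and the loop is kept synchronous for brevity:

```javascript
// Hypothetical sketch: only `limitCrawling` is taken from the README.
class Crawler {
  constructor(getLinks) {
    this.limitCrawling = 10;    // maximum number of pages to visit
    this.getLinks = getLinks;   // (url) => array of links found on that page
    this.visited = new Set();   // pages already crawled
  }

  crawl(startUrl) {
    const queue = [startUrl];
    // Stop when there is nothing left to visit or the limit is reached.
    while (queue.length > 0 && this.visited.size < this.limitCrawling) {
      const current = queue.shift();
      if (this.visited.has(current)) continue;
      this.visited.add(current);
      for (const link of this.getLinks(current)) {
        if (!this.visited.has(link)) queue.push(link);
      }
    }
    return [...this.visited];
  }
}

// A fake site where every page links to two new pages, so it never ends.
const crawler = new Crawler((page) => [page + 'a/', page + 'b/']);
const pages = crawler.crawl('http://example.com/');
console.log(pages.length);
```

Without the size check in the loop condition, the fake site above would be crawled forever, which is the situation the limit guards against.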

REST API
database to store previous sessions and to keep track of when each page was last crawled
hash tables for efficient indexing
publisher/subscriber with a queue to distribute workload to multiple servers
parallel crawling: multiple concurrent downloads in the same thread, plus multiple worker processes via the cluster module to scale across CPU cores
reduce indexed links length by removing main domain from internal links
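Two of the items above, hash-table indexing and shorter stored links, can be sketched together. This is an illustration of the idea, not code from index.js: the function name toIndexKey is hypothetical, and a built-in Set stands in for the hash table.

```javascript
// Hypothetical helper: shortens internal links to path + query string,
// and leaves external links as full URLs.
function toIndexKey(link, baseUrl) {
  const resolved = new URL(link, baseUrl);
  const base = new URL(baseUrl);
  if (resolved.host !== base.host) {
    return resolved.href; // external link: keep the full URL
  }
  // Internal link: drop protocol and domain (the #fragment is dropped too).
  return resolved.pathname + resolved.search;
}

const base = 'http://wiprodigital.com/';
const index = new Set(); // hash-based lookup, duplicates are ignored

for (const link of [
  'http://wiprodigital.com/who-we-are',
  '/who-we-are',
  'http://wiprodigital.com/what-we-do?page=2',
]) {
  index.add(toIndexKey(link, base));
}

console.log([...index]);
```

The first two links collapse into the same key, so the index holds two entries instead of three, and each stored key is shorter than the original URL.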
