A collection of awesome web scrapers and crawlers.
- Apache Nutch - Highly extensible, highly scalable Web crawler. Pluggable parsing, protocols, storage and indexing.
- WebSPHINX - Website-Specific Processors for HTML INformation eXtraction.
- Open Search Server - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
- crawler4j - Open-source web crawler for Java that provides a simple interface for crawling the Web. Using it, you can set up a multi-threaded web crawler in a few minutes.
- HTTrack - Free and open-source website copier that downloads a site to a local directory, rebuilding its relative link structure for offline browsing.
- ccrawler - Built in C# 3.5. Includes a simple web content categorizer extension that can classify web pages by their content.
- ebot - Open-source web crawler built on top of a NoSQL database (Apache CouchDB, Riak), an AMQP broker (RabbitMQ), Webmachine, and MochiWeb.
- scrapy - A fast, high-level web crawling and scraping framework for Python.
- Goutte - A simple PHP web scraper.
- DiDOM - Simple and fast HTML parser.
- simple_html_dom - Just a Simple HTML DOM library fork.
- PHPCrawl - A framework for crawling/spidering websites, written in PHP.
- PhantomJS - Scriptable headless WebKit.
- node-crawler - Web crawler/spider for Node.js + server-side jQuery.
- node-simplecrawler - Flexible, event-driven crawler for Node.js.
- spider - Programmable spidering of web sites with node.js and jQuery.
- slimerjs - A PhantomJS-like tool running Gecko.
- casperjs - Navigation scripting & testing utility for PhantomJS and SlimerJS.
- zombie - Insanely fast, full-stack, headless browser testing using node.js.
- nightmare - A high-level wrapper for PhantomJS that lets you automate browser tasks.
- wombat - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
- gocrawl - Polite, slim and concurrent web crawler.
- fetchbot - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
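At their core, all of the tools above automate the same loop: fetch a page, parse its HTML, and extract links or data to follow or store. As a minimal illustration (not based on any specific library above), here is a link-extraction sketch using only the Python standard library; the page content is hardcoded, where a real crawler would fetch it over HTTP and respect robots.txt:

```python
# Minimal link-extraction sketch using only the Python standard library.
# This is the core step every crawler/scraper in this list automates.
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical example page; a real crawler would fetch this over HTTP.
html = '<a href="/about">About</a> <a href="https://example.org/docs">Docs</a>'
parser = LinkExtractor("https://example.com/")
parser.feed(html)
print(parser.links)
# → ['https://example.com/about', 'https://example.org/docs']
```

The frameworks listed here add the pieces this sketch omits: HTTP fetching, politeness (robots.txt, crawl delays), concurrency, and storage.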
Please read the Contribution Guidelines before submitting your suggestion.