fs-hao/awesome-crawlers

Awesome-crawlers

A collection of awesome web crawlers, spiders, and resources in different languages.

Update date: 2023-08-25

Contents

All

name stars language update date description
Scrapy 48228 Python 2023-08-25 A fast high-level screen scraping and web crawling framework.
you-get 48019 Python 2023-08-25 Dumb downloader that scrapes the web.
colly 20557 Go 2023-08-25 Fast and Elegant Scraping Framework for Gophers.
pyspider 15994 Python 2023-08-25 A powerful spider system.
newspaper 13039 Python 2023-08-24 News, full-text, and article metadata extraction in Python 3
Webmagic 10927 Java 2023-08-24 A scalable crawler framework.
Goutte 9239 PHP 2023-08-24 A screen scraping and web crawling library for PHP.
portia 8975 Python 2023-08-24 Visual scraping for Scrapy.
crawlee 8949 TypeScript 2023-08-25 A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
spider-flow 8329 Java 2023-08-25 A visual spider framework; it's so good that you don't need to write any code to crawl a website.
pholcus 7443 Go 2023-08-24 A distributed, high concurrency and powerful web crawler.
node-crawler 6491 JavaScript 2023-08-24 Node-crawler has a clean, simple API.
Nokogiri 6052 C 2023-08-24 A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
x-ray 5760 JavaScript 2023-08-19 Web scraper with pagination and crawler support.
ferret 5451 Go 2023-08-23 Declarative web scraping.
headless-chrome-crawler 5403 JavaScript 2023-08-24 Headless Chrome crawls with jQuery support
Scrapy-Redis 5333 Python 2023-08-25 Redis-based components for Scrapy.
MechanicalSoup 4429 Python 2023-08-24 A Python library for automating interaction with websites.
Crawler4j 4417 Java 2023-08-24 Simple and lightweight web crawler.
mechanize 4310 Ruby 2023-08-21 Automated web interaction & crawling.
node-osmosis 4083 JavaScript 2023-08-23 HTML/XML parser and web scraper for Node.js.
scrape-it 3929 JavaScript 2023-08-19 A Node.js scraper for humans.
Hakrawler 3894 Go 2023-08-24 Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
dom-crawler 3828 PHP 2023-08-23 The DomCrawler component eases DOM navigation for HTML and XML documents.
DotnetSpider 3718 C# 2023-08-24 A cross-platform, lightweight spider developed in C#.
scraperjs 3685 JavaScript 2023-08-15 A complete and versatile web scraper.
RoboBrowser 3679 Python 2023-08-23 A simple, Pythonic library for browsing the web without a standalone web browser.
distribute_crawler 3228 Python 2023-08-23 Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
Hawk 3080 C# 2023-08-25 Advanced Crawler and ETL tool written in C#/WPF.
WebCollector 2993 Java 2023-08-17 Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
anthelion 2846 Java 2023-08-07 A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
dht 2628 Go 2023-08-24 BitTorrent DHT Protocol && DHT Spider.
httrack 2615 C 2023-08-24 Copy websites to your computer.
QueryList 2579 PHP 2023-08-07 The progressive PHP crawler framework.
Heritrix3 2527 Java 2023-08-25 Extensible, web-scale, archival-quality web crawler project.
Gecco 2460 Java 2023-08-18 An easy-to-use, lightweight web crawler.
spatie/crawler 2372 PHP 2023-08-24 An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.
simplecrawler 2137 JavaScript 2023-08-24 Event driven web crawler.
Abot 2120 C# 2023-08-24 C# web crawler built for speed and flexibility.
Gain 2025 Python 2023-08-11 Web crawling framework based on asyncio for everyone.
gocrawl 2017 Go 2023-08-23 Polite, slim and concurrent web crawler.
SeimiCrawler 1927 Java 2023-08-24 An agile, distributed crawler framework.
Scrapely 1831 HTML 2023-08-20 A pure-python HTML screen-scraping library.
go_spider 1814 Go 2023-08-18 An awesome Go concurrent crawler (spider) framework.
PSpider 1783 Python 2023-08-24 A simple spider frame in Python3.
aspider 1693 Python 2023-08-14 An async web scraping micro-framework based on asyncio.
upton 1613 HTML 2023-08-20 A batteries-included framework for easy web-scraping. Just add CSS (or do more).
scrape 1497 Go 2023-08-22 A simple, higher level interface for Go web scraping.
open-source-search-engine 1478 C++ 2023-08-23 A distributed open source search engine and spider/crawler written in C/C++.
cola 1474 Python 2023-07-29 A distributed crawling framework.
rvest 1415 R 2023-08-23 Simple web scraping for R.
php-spider 1303 PHP 2023-08-24 A configurable and extensible PHP web spider.
wombat 1292 Ruby 2023-08-18 Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
web-scraper-chrome-extension 1241 JavaScript 2023-08-22 Web data extraction tool implemented as a Chrome extension.
scrapy-cluster 1125 Python 2023-08-23 Uses Redis and Kafka to create a distributed on-demand scraping cluster.
django-dynamic-scraper 1124 Python 2023-08-25 Creating Scrapy scrapers via the Django admin interface.
sukhoi 878 Python 2023-07-01 Minimalist and powerful Web Crawler.
StormCrawler 815 HTML 2023-08-22 An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
creeper 775 Go 2023-08-23 The Next Generation Crawler Framework (Go).
fetchbot 775 Go 2023-08-24 A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Spidr 760 Ruby 2023-07-06 Spider a site, multiple domains, certain links or infinitely.
Dataflow kit 610 Go 2023-08-24 Extract structured data from web pages. Web site scraping.
webster 462 JavaScript 2023-08-13 A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page.
laravel-goutte 452 PHP 2023-08-10 Laravel 5 Facade for Goutte.
ACHE Crawler 412 Java 2023-08-17 An easy to use web crawler for domain-specific search.
PHPScraper 404 PHP 2023-08-23 PHPScraper is a scraper & crawler built for simplicity.
Spark-Crawler 403 Java 2023-08-25 Evolving Apache Nutch to run on Spark.
ants-go 365 Go 2023-03-13 An open source, distributed, RESTful crawler engine in Golang.
supercrawler 353 JavaScript 2023-08-09 Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
MSpider 344 Python 2023-05-31 A simple, easy spider using gevent and JS rendering.
ebot 325 Erlang 2023-07-05 A scalable, distributed and highly configurable web crawler.
spidy 312 Python 2023-08-03 The simple, easy to use command line web crawler.
spider 265 Rust 2023-08-22 The fastest web crawler and indexer.
pspider 264 PHP 2023-05-22 Parallel web crawler written in PHP.
js-crawler 246 TypeScript 2023-08-06 Web crawler for Node.js; both HTTP and HTTPS are supported.
Cobweb 228 JavaScript 2023-01-02 Web crawler with very flexible crawling options, standalone or using sidekiq.
Infinity Crawler 221 C# 2023-08-15 A simple but powerful web crawler library in C#.
webBee 186 Java 2023-07-27 A DFS web spider.
crawley 177 Python 2023-08-10 Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
CoCrawler 170 Python 2023-08-03 A versatile web crawler built using modern tools and concurrency.
Squidwarc 156 JavaScript 2023-05-13 High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
brownant 156 Python 2023-05-05 A lightweight web data extracting framework.
crawler 146 Scala 2023-07-15 Scala DSL for web crawling.
RubyRetriever 140 Ruby 2023-08-06 RubyRetriever is a Web Crawler, Scraper & File Harvester.
scrala 112 Scala 2023-03-15 Scala crawler (spider) framework, inspired by Scrapy.
Demiurge 111 Python 2023-06-02 PyQuery-based scraping micro-framework.
web-scraper 100 Perl 2023-05-27 Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
crawlzone/crawlzone 73 PHP 2023-08-04 Crawlzone is a fast asynchronous internet crawling framework for PHP.
ferrit 61 Scala 2022-11-02 Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
SkyScraper 56 C# 2023-03-05 An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
Apache Nutch -- -- -- Highly extensible, highly scalable web crawler for production environment.
JSoup -- -- -- Scrapes, parses, manipulates and cleans HTML.
Open Search Server -- -- -- A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Spiderman -- -- -- A scalable, extensible, multi-threaded web crawler.
feedparser -- -- -- Universal feed parser.

C

name stars update date description
Nokogiri 6052 2023-08-24 A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
httrack 2615 2023-08-24 Copy websites to your computer.

C#

name stars update date description
DotnetSpider 3718 2023-08-24 A cross-platform, lightweight spider developed in C#.
Hawk 3080 2023-08-25 Advanced Crawler and ETL tool written in C#/WPF.
Abot 2120 2023-08-24 C# web crawler built for speed and flexibility.
Infinity Crawler 221 2023-08-15 A simple but powerful web crawler library in C#.
SkyScraper 56 2023-03-05 An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

C++

name stars update date description
open-source-search-engine 1478 2023-08-23 A distributed open source search engine and spider/crawler written in C/C++.

Erlang

name stars update date description
ebot 325 2023-07-05 A scalable, distributed and highly configurable web crawler.

Go

name stars update date description
colly 20557 2023-08-25 Fast and Elegant Scraping Framework for Gophers.
pholcus 7443 2023-08-24 A distributed, high concurrency and powerful web crawler.
ferret 5451 2023-08-23 Declarative web scraping.
Hakrawler 3894 2023-08-24 Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
dht 2628 2023-08-24 BitTorrent DHT Protocol && DHT Spider.
gocrawl 2017 2023-08-23 Polite, slim and concurrent web crawler.
go_spider 1814 2023-08-18 An awesome Go concurrent crawler (spider) framework.
scrape 1497 2023-08-22 A simple, higher level interface for Go web scraping.
creeper 775 2023-08-23 The Next Generation Crawler Framework (Go).
fetchbot 775 2023-08-24 A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Dataflow kit 610 2023-08-24 Extract structured data from web pages. Web site scraping.
ants-go 365 2023-03-13 An open source, distributed, RESTful crawler engine in Golang.

HTML

name stars update date description
Scrapely 1831 2023-08-20 A pure-python HTML screen-scraping library.
upton 1613 2023-08-20 A batteries-included framework for easy web-scraping. Just add CSS (or do more).
StormCrawler 815 2023-08-22 An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm

Java

name stars update date description
Webmagic 10927 2023-08-24 A scalable crawler framework.
spider-flow 8329 2023-08-25 A visual spider framework; it's so good that you don't need to write any code to crawl a website.
Crawler4j 4417 2023-08-24 Simple and lightweight web crawler.
WebCollector 2993 2023-08-17 Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
anthelion 2846 2023-08-07 A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
Heritrix3 2527 2023-08-25 Extensible, web-scale, archival-quality web crawler project.
Gecco 2460 2023-08-18 An easy-to-use, lightweight web crawler.
SeimiCrawler 1927 2023-08-24 An agile, distributed crawler framework.
ACHE Crawler 412 2023-08-17 An easy to use web crawler for domain-specific search.
Spark-Crawler 403 2023-08-25 Evolving Apache Nutch to run on Spark.
webBee 186 2023-07-27 A DFS web spider.

JavaScript

name stars update date description
node-crawler 6491 2023-08-24 Node-crawler has a clean, simple API.
x-ray 5760 2023-08-19 Web scraper with pagination and crawler support.
headless-chrome-crawler 5403 2023-08-24 Headless Chrome crawls with jQuery support
node-osmosis 4083 2023-08-23 HTML/XML parser and web scraper for Node.js.
scrape-it 3929 2023-08-19 A Node.js scraper for humans.
scraperjs 3685 2023-08-15 A complete and versatile web scraper.
simplecrawler 2137 2023-08-24 Event driven web crawler.
web-scraper-chrome-extension 1241 2023-08-22 Web data extraction tool implemented as a Chrome extension.
webster 462 2023-08-13 A reliable web crawling framework which can scrape AJAX- and JS-rendered content in a web page.
supercrawler 353 2023-08-09 Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Cobweb 228 2023-01-02 Web crawler with very flexible crawling options, standalone or using sidekiq.
Squidwarc 156 2023-05-13 High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

PHP

name stars update date description
Goutte 9239 2023-08-24 A screen scraping and web crawling library for PHP.
dom-crawler 3828 2023-08-23 The DomCrawler component eases DOM navigation for HTML and XML documents.
QueryList 2579 2023-08-07 The progressive PHP crawler framework.
spatie/crawler 2372 2023-08-24 An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.
php-spider 1303 2023-08-24 A configurable and extensible PHP web spider.
laravel-goutte 452 2023-08-10 Laravel 5 Facade for Goutte.
PHPScraper 404 2023-08-23 PHPScraper is a scraper & crawler built for simplicity.
pspider 264 2023-05-22 Parallel web crawler written in PHP.
crawlzone/crawlzone 73 2023-08-04 Crawlzone is a fast asynchronous internet crawling framework for PHP.

Perl

name stars update date description
web-scraper 100 2023-05-27 Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Python

name stars update date description
Scrapy 48228 2023-08-25 A fast high-level screen scraping and web crawling framework.
you-get 48019 2023-08-25 Dumb downloader that scrapes the web.
pyspider 15994 2023-08-25 A powerful spider system.
newspaper 13039 2023-08-24 News, full-text, and article metadata extraction in Python 3
portia 8975 2023-08-24 Visual scraping for Scrapy.
Scrapy-Redis 5333 2023-08-25 Redis-based components for Scrapy.
MechanicalSoup 4429 2023-08-24 A Python library for automating interaction with websites.
RoboBrowser 3679 2023-08-23 A simple, Pythonic library for browsing the web without a standalone web browser.
distribute_crawler 3228 2023-08-23 Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
Gain 2025 2023-08-11 Web crawling framework based on asyncio for everyone.
PSpider 1783 2023-08-24 A simple spider frame in Python3.
aspider 1693 2023-08-14 An async web scraping micro-framework based on asyncio.
cola 1474 2023-07-29 A distributed crawling framework.
scrapy-cluster 1125 2023-08-23 Uses Redis and Kafka to create a distributed on-demand scraping cluster.
django-dynamic-scraper 1124 2023-08-25 Creating Scrapy scrapers via the Django admin interface.
sukhoi 878 2023-07-01 Minimalist and powerful Web Crawler.
MSpider 344 2023-05-31 A simple, easy spider using gevent and JS rendering.
spidy 312 2023-08-03 The simple, easy to use command line web crawler.
crawley 177 2023-08-10 Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
CoCrawler 170 2023-08-03 A versatile web crawler built using modern tools and concurrency.
brownant 156 2023-05-05 A lightweight web data extracting framework.
Demiurge 111 2023-06-02 PyQuery-based scraping micro-framework.

R

name stars update date description
rvest 1415 2023-08-23 Simple web scraping for R.

Ruby

name stars update date description
mechanize 4310 2023-08-21 Automated web interaction & crawling.
wombat 1292 2023-08-18 Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Spidr 760 2023-07-06 Spider a site, multiple domains, certain links or infinitely.
RubyRetriever 140 2023-08-06 RubyRetriever is a Web Crawler, Scraper & File Harvester.

Rust

name stars update date description
spider 265 2023-08-22 The fastest web crawler and indexer.

Scala

name stars update date description
crawler 146 2023-07-15 Scala DSL for web crawling.
scrala 112 2023-03-15 Scala crawler (spider) framework, inspired by Scrapy.
ferrit 61 2022-11-02 Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

TypeScript

name stars update date description
crawlee 8949 2023-08-25 A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
js-crawler 246 2023-08-06 Web crawler for Node.js; both HTTP and HTTPS are supported.
