Awesome-crawlers

A collection of awesome web crawlers, spiders, and resources in different languages.

Update date: 2023-08-25

All

name	stars	language	update date	description
Scrapy	48228	Python	2023-08-25	A fast high-level screen scraping and web crawling framework.
you-get	48019	Python	2023-08-25	Dumb downloader that scrapes the web.
colly	20557	Go	2023-08-25	Fast and Elegant Scraping Framework for Gophers.
pyspider	15994	Python	2023-08-25	A powerful spider system.
newspaper	13039	Python	2023-08-24	News, full-text, and article metadata extraction in Python 3
Webmagic	10927	Java	2023-08-24	A scalable crawler framework.
Goutte	9239	PHP	2023-08-24	A screen scraping and web crawling library for PHP.
portia	8975	Python	2023-08-24	Visual scraping for Scrapy.
crawlee	8949	TypeScript	2023-08-25	A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
spider-flow	8329	Java	2023-08-25	A visual spider framework; it's so good that you don't need to write any code to crawl a website.
pholcus	7443	Go	2023-08-24	A distributed, high concurrency and powerful web crawler.
node-crawler	6491	JavaScript	2023-08-24	Node-crawler has a clean, simple API.
Nokogiri	6052	C	2023-08-24	A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
x-ray	5760	JavaScript	2023-08-19	Web scraper with pagination and crawler support.
ferret	5451	Go	2023-08-23	Declarative web scraping.
headless-chrome-crawler	5403	JavaScript	2023-08-24	Headless Chrome crawls with jQuery support
Scrapy-Redis	5333	Python	2023-08-25	Redis-based components for Scrapy.
MechanicalSoup	4429	Python	2023-08-24	A Python library for automating interaction with websites.
Crawler4j	4417	Java	2023-08-24	Simple and lightweight web crawler.
mechanize	4310	Ruby	2023-08-21	Automated web interaction & crawling.
node-osmosis	4083	JavaScript	2023-08-23	HTML/XML parser and web scraper for Node.js.
scrape-it	3929	JavaScript	2023-08-19	A Node.js scraper for humans.
Hakrawler	3894	Go	2023-08-24	Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
dom-crawler	3828	PHP	2023-08-23	The DomCrawler component eases DOM navigation for HTML and XML documents.
DotnetSpider	3718	C#	2023-08-24	A cross-platform, lightweight spider developed in C#.
scraperjs	3685	JavaScript	2023-08-15	A complete and versatile web scraper.
RoboBrowser	3679	Python	2023-08-23	A simple, Pythonic library for browsing the web without a standalone web browser.
distribute_crawler	3228	Python	2023-08-23	Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
Hawk	3080	C#	2023-08-25	Advanced Crawler and ETL tool written in C#/WPF.
WebCollector	2993	Java	2023-08-17	Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
anthelion	2846	Java	2023-08-07	A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
dht	2628	Go	2023-08-24	BitTorrent DHT Protocol && DHT Spider.
httrack	2615	C	2023-08-24	Copy websites to your computer.
QueryList	2579	PHP	2023-08-07	The progressive PHP crawler framework.
Heritrix3	2527	Java	2023-08-25	Extensible, web-scale, archival-quality web crawler project.
Gecco	2460	Java	2023-08-18	An easy-to-use, lightweight web crawler.
spatie/crawler	2372	PHP	2023-08-24	An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.
simplecrawler	2137	JavaScript	2023-08-24	Event driven web crawler.
Abot	2120	C#	2023-08-24	C# web crawler built for speed and flexibility.
Gain	2025	Python	2023-08-11	Web crawling framework based on asyncio for everyone.
gocrawl	2017	Go	2023-08-23	Polite, slim and concurrent web crawler.
SeimiCrawler	1927	Java	2023-08-24	An agile, distributed crawler framework.
Scrapely	1831	HTML	2023-08-20	A pure-python HTML screen-scraping library.
go_spider	1814	Go	2023-08-18	An awesome Go concurrent crawler (spider) framework.
PSpider	1783	Python	2023-08-24	A simple spider frame in Python3.
aspider	1693	Python	2023-08-14	An async web scraping micro-framework based on asyncio.
upton	1613	HTML	2023-08-20	A batteries-included framework for easy web-scraping. Just add CSS (or do more).
scrape	1497	Go	2023-08-22	A simple, higher level interface for Go web scraping.
open-source-search-engine	1478	C++	2023-08-23	A distributed open source search engine and spider/crawler written in C/C++.
cola	1474	Python	2023-07-29	A distributed crawling framework.
rvest	1415	R	2023-08-23	Simple web scraping for R.
php-spider	1303	PHP	2023-08-24	A configurable and extensible PHP web spider.
wombat	1292	Ruby	2023-08-18	Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
web-scraper-chrome-extension	1241	JavaScript	2023-08-22	Web data extraction tool implemented as chrome extension.
scrapy-cluster	1125	Python	2023-08-23	Uses Redis and Kafka to create a distributed on demand scraping cluster.
django-dynamic-scraper	1124	Python	2023-08-25	Creating Scrapy scrapers via the Django admin interface.
sukhoi	878	Python	2023-07-01	Minimalist and powerful Web Crawler.
StormCrawler	815	HTML	2023-08-22	An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
creeper	775	Go	2023-08-23	The Next Generation Crawler Framework (Go).
fetchbot	775	Go	2023-08-24	A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Spidr	760	Ruby	2023-07-06	Spider a site, multiple domains, certain links or infinitely.
Dataflow kit	610	Go	2023-08-24	Extract structured data from web pages. Website scraping.
webster	462	JavaScript	2023-08-13	A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
laravel-goutte	452	PHP	2023-08-10	Laravel 5 Facade for Goutte.
ACHE Crawler	412	Java	2023-08-17	An easy to use web crawler for domain-specific search.
PHPScraper	404	PHP	2023-08-23	PHPScraper is a scraper & crawler built for simplicity.
Spark-Crawler	403	Java	2023-08-25	Evolving Apache Nutch to run on Spark.
ants-go	365	Go	2023-03-13	An open source, distributed, RESTful crawler engine in Golang.
supercrawler	353	JavaScript	2023-08-09	Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
MSpider	344	Python	2023-05-31	A simple, easy spider using gevent and JS rendering.
ebot	325	Erlang	2023-07-05	A scalable, distributed and highly configurable web crawler.
spidy	312	Python	2023-08-03	The simple, easy to use command line web crawler.
spider	265	Rust	2023-08-22	The fastest web crawler and indexer.
pspider	264	PHP	2023-05-22	Parallel web crawler written in PHP.
js-crawler	246	TypeScript	2023-08-06	Web crawler for Node.js; both HTTP and HTTPS are supported.
Cobweb	228	JavaScript	2023-01-02	Web crawler with very flexible crawling options, standalone or using sidekiq.
Infinity Crawler	221	C#	2023-08-15	A simple but powerful web crawler library in C#.
webBee	186	Java	2023-07-27	A DFS web spider.
crawley	177	Python	2023-08-10	Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
CoCrawler	170	Python	2023-08-03	A versatile web crawler built using modern tools and concurrency.
Squidwarc	156	JavaScript	2023-05-13	High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
brownant	156	Python	2023-05-05	A lightweight web data extracting framework.
crawler	146	Scala	2023-07-15	Scala DSL for web crawling.
RubyRetriever	140	Ruby	2023-08-06	RubyRetriever is a Web Crawler, Scraper & File Harvester.
scrala	112	Scala	2023-03-15	Scala crawler (spider) framework, inspired by Scrapy.
Demiurge	111	Python	2023-06-02	PyQuery-based scraping micro-framework.
web-scraper	100	Perl	2023-05-27	Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.
crawlzone/crawlzone	73	PHP	2023-08-04	Crawlzone is a fast asynchronous internet crawling framework for PHP.
ferrit	61	Scala	2022-11-02	Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
SkyScraper	56	C#	2023-03-05	An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
Apache Nutch	--	--	--	Highly extensible, highly scalable web crawler for production environment.
JSoup	--	--	--	Scrapes, parses, manipulates and cleans HTML.
Open Search Server	--	--	--	A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
Spiderman	--	--	--	A scalable, extensible, multi-threaded web crawler.
feedparser	--	--	--	Universal feed parser.

C

name	stars	update date	description
Nokogiri	6052	2023-08-24	A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
httrack	2615	2023-08-24	Copy websites to your computer.

C#

name	stars	update date	description
DotnetSpider	3718	2023-08-24	A cross-platform, lightweight spider developed in C#.
Hawk	3080	2023-08-25	Advanced Crawler and ETL tool written in C#/WPF.
Abot	2120	2023-08-24	C# web crawler built for speed and flexibility.
Infinity Crawler	221	2023-08-15	A simple but powerful web crawler library in C#.
SkyScraper	56	2023-03-05	An asynchronous web scraper / web crawler using async / await and Reactive Extensions.

C++

name	stars	update date	description
open-source-search-engine	1478	2023-08-23	A distributed open source search engine and spider/crawler written in C/C++.

Erlang

name	stars	update date	description
ebot	325	2023-07-05	A scalable, distributed and highly configurable web crawler.

Go

name	stars	update date	description
colly	20557	2023-08-25	Fast and Elegant Scraping Framework for Gophers.
pholcus	7443	2023-08-24	A distributed, high concurrency and powerful web crawler.
ferret	5451	2023-08-23	Declarative web scraping.
Hakrawler	3894	2023-08-24	Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
dht	2628	2023-08-24	BitTorrent DHT Protocol && DHT Spider.
gocrawl	2017	2023-08-23	Polite, slim and concurrent web crawler.
go_spider	1814	2023-08-18	An awesome Go concurrent crawler (spider) framework.
scrape	1497	2023-08-22	A simple, higher level interface for Go web scraping.
creeper	775	2023-08-23	The Next Generation Crawler Framework (Go).
fetchbot	775	2023-08-24	A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
Dataflow kit	610	2023-08-24	Extract structured data from web pages. Website scraping.
ants-go	365	2023-03-13	An open source, distributed, RESTful crawler engine in Golang.

HTML

name	stars	update date	description
Scrapely	1831	2023-08-20	A pure-python HTML screen-scraping library.
upton	1613	2023-08-20	A batteries-included framework for easy web-scraping. Just add CSS (or do more).
StormCrawler	815	2023-08-22	An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm

Java

name	stars	update date	description
Webmagic	10927	2023-08-24	A scalable crawler framework.
spider-flow	8329	2023-08-25	A visual spider framework; it's so good that you don't need to write any code to crawl a website.
Crawler4j	4417	2023-08-24	Simple and lightweight web crawler.
WebCollector	2993	2023-08-17	Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
anthelion	2846	2023-08-07	A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
Heritrix3	2527	2023-08-25	Extensible, web-scale, archival-quality web crawler project.
Gecco	2460	2023-08-18	An easy-to-use, lightweight web crawler.
SeimiCrawler	1927	2023-08-24	An agile, distributed crawler framework.
ACHE Crawler	412	2023-08-17	An easy to use web crawler for domain-specific search.
Spark-Crawler	403	2023-08-25	Evolving Apache Nutch to run on Spark.
webBee	186	2023-07-27	A DFS web spider.

JavaScript

name	stars	update date	description
node-crawler	6491	2023-08-24	Node-crawler has a clean, simple API.
x-ray	5760	2023-08-19	Web scraper with pagination and crawler support.
headless-chrome-crawler	5403	2023-08-24	Headless Chrome crawls with jQuery support
node-osmosis	4083	2023-08-23	HTML/XML parser and web scraper for Node.js.
scrape-it	3929	2023-08-19	A Node.js scraper for humans.
scraperjs	3685	2023-08-15	A complete and versatile web scraper.
simplecrawler	2137	2023-08-24	Event driven web crawler.
web-scraper-chrome-extension	1241	2023-08-22	Web data extraction tool implemented as chrome extension.
webster	462	2023-08-13	A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
supercrawler	353	2023-08-09	Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
Cobweb	228	2023-01-02	Web crawler with very flexible crawling options, standalone or using sidekiq.
Squidwarc	156	2023-05-13	High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

PHP

name	stars	update date	description
Goutte	9239	2023-08-24	A screen scraping and web crawling library for PHP.
dom-crawler	3828	2023-08-23	The DomCrawler component eases DOM navigation for HTML and XML documents.
QueryList	2579	2023-08-07	The progressive PHP crawler framework.
spatie/crawler	2372	2023-08-24	An easy to use, powerful crawler implemented in PHP. Can execute JavaScript.
php-spider	1303	2023-08-24	A configurable and extensible PHP web spider.
laravel-goutte	452	2023-08-10	Laravel 5 Facade for Goutte.
PHPScraper	404	2023-08-23	PHPScraper is a scraper & crawler built for simplicity.
pspider	264	2023-05-22	Parallel web crawler written in PHP.
crawlzone/crawlzone	73	2023-08-04	Crawlzone is a fast asynchronous internet crawling framework for PHP.

Perl

name	stars	update date	description
web-scraper	100	2023-05-27	Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

Python

name	stars	update date	description
Scrapy	48228	2023-08-25	A fast high-level screen scraping and web crawling framework.
you-get	48019	2023-08-25	Dumb downloader that scrapes the web.
pyspider	15994	2023-08-25	A powerful spider system.
newspaper	13039	2023-08-24	News, full-text, and article metadata extraction in Python 3
portia	8975	2023-08-24	Visual scraping for Scrapy.
Scrapy-Redis	5333	2023-08-25	Redis-based components for Scrapy.
MechanicalSoup	4429	2023-08-24	A Python library for automating interaction with websites.
RoboBrowser	3679	2023-08-23	A simple, Pythonic library for browsing the web without a standalone web browser.
distribute_crawler	3228	2023-08-23	Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider.
Gain	2025	2023-08-11	Web crawling framework based on asyncio for everyone.
PSpider	1783	2023-08-24	A simple spider frame in Python3.
aspider	1693	2023-08-14	An async web scraping micro-framework based on asyncio.
cola	1474	2023-07-29	A distributed crawling framework.
scrapy-cluster	1125	2023-08-23	Uses Redis and Kafka to create a distributed on demand scraping cluster.
django-dynamic-scraper	1124	2023-08-25	Creating Scrapy scrapers via the Django admin interface.
sukhoi	878	2023-07-01	Minimalist and powerful Web Crawler.
MSpider	344	2023-05-31	A simple, easy spider using gevent and JS rendering.
spidy	312	2023-08-03	The simple, easy to use command line web crawler.
crawley	177	2023-08-10	Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
CoCrawler	170	2023-08-03	A versatile web crawler built using modern tools and concurrency.
brownant	156	2023-05-05	A lightweight web data extracting framework.
Demiurge	111	2023-06-02	PyQuery-based scraping micro-framework.

R

name	stars	update date	description
rvest	1415	2023-08-23	Simple web scraping for R.

Ruby

name	stars	update date	description
mechanize	4310	2023-08-21	Automated web interaction & crawling.
wombat	1292	2023-08-18	Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
Spidr	760	2023-07-06	Spider a site, multiple domains, certain links or infinitely.
RubyRetriever	140	2023-08-06	RubyRetriever is a Web Crawler, Scraper & File Harvester.

Rust

name	stars	update date	description
spider	265	2023-08-22	The fastest web crawler and indexer.

Scala

name	stars	update date	description
crawler	146	2023-07-15	Scala DSL for web crawling.
scrala	112	2023-03-15	Scala crawler (spider) framework, inspired by Scrapy.
ferrit	61	2022-11-02	Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

TypeScript

name	stars	update date	description
crawlee	8949	2023-08-25	A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.
js-crawler	246	2023-08-06	Web crawler for Node.js; both HTTP and HTTPS are supported.
