A collection of awesome web crawlers, spiders, and related resources in different languages.
Update date: 2023-08-25
name | stars | language | update date | description |
---|---|---|---|---|
Scrapy | 48228 | Python | 2023-08-25 | A fast high-level screen scraping and web crawling framework. |
you-get | 48019 | Python | 2023-08-25 | Dumb downloader that scrapes the web. |
colly | 20557 | Go | 2023-08-25 | Fast and Elegant Scraping Framework for Gophers. |
pyspider | 15994 | Python | 2023-08-25 | A powerful spider system. |
newspaper | 13039 | Python | 2023-08-24 | News, full-text, and article metadata extraction in Python 3 |
Webmagic | 10927 | Java | 2023-08-24 | A scalable crawler framework. |
Goutte | 9239 | PHP | 2023-08-24 | A screen scraping and web crawling library for PHP. |
portia | 8975 | Python | 2023-08-24 | Visual scraping for Scrapy. |
crawlee | 8949 | TypeScript | 2023-08-25 | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. |
spider-flow | 8329 | Java | 2023-08-25 | A visual spider framework; it's so good that you don't need to write any code to crawl a website. |
pholcus | 7443 | Go | 2023-08-24 | A distributed, high concurrency and powerful web crawler. |
node-crawler | 6491 | JavaScript | 2023-08-24 | Node-crawler has a clean, simple API. |
Nokogiri | 6052 | C | 2023-08-24 | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. |
x-ray | 5760 | JavaScript | 2023-08-19 | Web scraper with pagination and crawler support. |
ferret | 5451 | Go | 2023-08-23 | Declarative web scraping. |
headless-chrome-crawler | 5403 | JavaScript | 2023-08-24 | Headless Chrome crawls with jQuery support |
Scrapy-Redis | 5333 | Python | 2023-08-25 | Redis-based components for Scrapy. |
MechanicalSoup | 4429 | Python | 2023-08-24 | A Python library for automating interaction with websites. |
Crawler4j | 4417 | Java | 2023-08-24 | Simple and lightweight web crawler. |
mechanize | 4310 | Ruby | 2023-08-21 | Automated web interaction & crawling. |
node-osmosis | 4083 | JavaScript | 2023-08-23 | HTML/XML parser and web scraper for Node.js. |
scrape-it | 3929 | JavaScript | 2023-08-19 | A Node.js scraper for humans. |
Hakrawler | 3894 | Go | 2023-08-24 | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
dom-crawler | 3828 | PHP | 2023-08-23 | The DomCrawler component eases DOM navigation for HTML and XML documents. |
DotnetSpider | 3718 | C# | 2023-08-24 | A cross-platform, lightweight spider developed in C#. |
scraperjs | 3685 | JavaScript | 2023-08-15 | A complete and versatile web scraper. |
RoboBrowser | 3679 | Python | 2023-08-23 | A simple, Pythonic library for browsing the web without a standalone web browser. |
distribute_crawler | 3228 | Python | 2023-08-23 | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. |
Hawk | 3080 | C# | 2023-08-25 | Advanced Crawler and ETL tool written in C#/WPF. |
WebCollector | 2993 | Java | 2023-08-17 | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes. |
anthelion | 2846 | Java | 2023-08-07 | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. |
dht | 2628 | Go | 2023-08-24 | BitTorrent DHT Protocol && DHT Spider. |
httrack | 2615 | C | 2023-08-24 | Copy websites to your computer. |
QueryList | 2579 | PHP | 2023-08-07 | The progressive PHP crawler framework. |
Heritrix3 | 2527 | Java | 2023-08-25 | Extensible, web-scale, archival-quality web crawler project. |
Gecco | 2460 | Java | 2023-08-18 | An easy-to-use, lightweight web crawler. |
spatie/crawler | 2372 | PHP | 2023-08-24 | An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript. |
simplecrawler | 2137 | JavaScript | 2023-08-24 | Event driven web crawler. |
Abot | 2120 | C# | 2023-08-24 | C# web crawler built for speed and flexibility. |
Gain | 2025 | Python | 2023-08-11 | Web crawling framework based on asyncio for everyone. |
gocrawl | 2017 | Go | 2023-08-23 | Polite, slim and concurrent web crawler. |
SeimiCrawler | 1927 | Java | 2023-08-24 | An agile, distributed crawler framework. |
Scrapely | 1831 | HTML | 2023-08-20 | A pure-python HTML screen-scraping library. |
go_spider | 1814 | Go | 2023-08-18 | An awesome concurrent crawler (spider) framework for Go. |
PSpider | 1783 | Python | 2023-08-24 | A simple spider frame in Python3. |
aspider | 1693 | Python | 2023-08-14 | An async web scraping micro-framework based on asyncio. |
upton | 1613 | HTML | 2023-08-20 | A batteries-included framework for easy web scraping. Just add CSS (or do more). |
scrape | 1497 | Go | 2023-08-22 | A simple, higher level interface for Go web scraping. |
open-source-search-engine | 1478 | C++ | 2023-08-23 | A distributed open source search engine and spider/crawler written in C/C++. |
cola | 1474 | Python | 2023-07-29 | A distributed crawling framework. |
rvest | 1415 | R | 2023-08-23 | Simple web scraping for R. |
php-spider | 1303 | PHP | 2023-08-24 | A configurable and extensible PHP web spider. |
wombat | 1292 | Ruby | 2023-08-18 | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. |
web-scraper-chrome-extension | 1241 | JavaScript | 2023-08-22 | Web data extraction tool implemented as a Chrome extension. |
scrapy-cluster | 1125 | Python | 2023-08-23 | Uses Redis and Kafka to create a distributed on demand scraping cluster. |
django-dynamic-scraper | 1124 | Python | 2023-08-25 | Creating Scrapy scrapers via the Django admin interface. |
sukhoi | 878 | Python | 2023-07-01 | Minimalist and powerful Web Crawler. |
StormCrawler | 815 | HTML | 2023-08-22 | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
creeper | 775 | Go | 2023-08-23 | The Next Generation Crawler Framework (Go). |
fetchbot | 775 | Go | 2023-08-24 | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. |
Spidr | 760 | Ruby | 2023-07-06 | Spider a site, multiple domains, certain links or infinitely. |
Dataflow kit | 610 | Go | 2023-08-24 | Extract structured data from web pages. Web site scraping. |
webster | 462 | JavaScript | 2023-08-13 | A reliable web crawling framework which can scrape ajax and js rendered content in a web page. |
laravel-goutte | 452 | PHP | 2023-08-10 | Laravel 5 Facade for Goutte. |
ACHE Crawler | 412 | Java | 2023-08-17 | An easy to use web crawler for domain-specific search. |
PHPScraper | 404 | PHP | 2023-08-23 | PHPScraper is a scraper & crawler built for simplicity. |
Spark-Crawler | 403 | Java | 2023-08-25 | Evolving Apache Nutch to run on Spark. |
ants-go | 365 | Go | 2023-03-13 | An open source, distributed, RESTful crawler engine in Go. |
supercrawler | 353 | JavaScript | 2023-08-09 | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. |
MSpider | 344 | Python | 2023-05-31 | A simple, easy spider using gevent and JS rendering. |
ebot | 325 | Erlang | 2023-07-05 | A scalable, distributed and highly configurable web crawler. |
spidy | 312 | Python | 2023-08-03 | The simple, easy to use command line web crawler. |
spider | 265 | Rust | 2023-08-22 | The fastest web crawler and indexer. |
pspider | 264 | PHP | 2023-05-22 | Parallel web crawler written in PHP. |
js-crawler | 246 | TypeScript | 2023-08-06 | Web crawler for Node.js; both HTTP and HTTPS are supported. |
Cobweb | 228 | JavaScript | 2023-01-02 | Web crawler with very flexible crawling options, standalone or using sidekiq. |
Infinity Crawler | 221 | C# | 2023-08-15 | A simple but powerful web crawler library in C#. |
webBee | 186 | Java | 2023-07-27 | A DFS web spider. |
crawley | 177 | Python | 2023-08-10 | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. |
CoCrawler | 170 | Python | 2023-08-03 | A versatile web crawler built using modern tools and concurrency. |
Squidwarc | 156 | JavaScript | 2023-05-13 | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
brownant | 156 | Python | 2023-05-05 | A lightweight web data extracting framework. |
crawler | 146 | Scala | 2023-07-15 | Scala DSL for web crawling. |
RubyRetriever | 140 | Ruby | 2023-08-06 | RubyRetriever is a Web Crawler, Scraper & File Harvester. |
scrala | 112 | Scala | 2023-03-15 | Scala crawler (spider) framework, inspired by Scrapy. |
Demiurge | 111 | Python | 2023-06-02 | PyQuery-based scraping micro-framework. |
web-scraper | 100 | Perl | 2023-05-27 | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions. |
crawlzone/crawlzone | 73 | PHP | 2023-08-04 | Crawlzone is a fast asynchronous internet crawling framework for PHP. |
ferrit | 61 | Scala | 2022-11-02 | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |
SkyScraper | 56 | C# | 2023-03-05 | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. |
Apache Nutch | -- | -- | -- | Highly extensible, highly scalable web crawler for production environment. |
JSoup | -- | -- | -- | Scrapes, parses, manipulates and cleans HTML. |
Open Search Server | -- | -- | -- | A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything. |
Spiderman | -- | -- | -- | A scalable, extensible, multi-threaded web crawler. |
feedparser | -- | -- | -- | Universal feed parser. |
name | stars | update date | description |
---|---|---|---|
Nokogiri | 6052 | 2023-08-24 | A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support. |
httrack | 2615 | 2023-08-24 | Copy websites to your computer. |
name | stars | update date | description |
---|---|---|---|
DotnetSpider | 3718 | 2023-08-24 | A cross-platform, lightweight spider developed in C#. |
Hawk | 3080 | 2023-08-25 | Advanced Crawler and ETL tool written in C#/WPF. |
Abot | 2120 | 2023-08-24 | C# web crawler built for speed and flexibility. |
Infinity Crawler | 221 | 2023-08-15 | A simple but powerful web crawler library in C#. |
SkyScraper | 56 | 2023-03-05 | An asynchronous web scraper / web crawler using async / await and Reactive Extensions. |
name | stars | update date | description |
---|---|---|---|
open-source-search-engine | 1478 | 2023-08-23 | A distributed open source search engine and spider/crawler written in C/C++. |
name | stars | update date | description |
---|---|---|---|
ebot | 325 | 2023-07-05 | A scalable, distributed and highly configurable web crawler. |
name | stars | update date | description |
---|---|---|---|
colly | 20557 | 2023-08-25 | Fast and Elegant Scraping Framework for Gophers. |
pholcus | 7443 | 2023-08-24 | A distributed, high concurrency and powerful web crawler. |
ferret | 5451 | 2023-08-23 | Declarative web scraping. |
Hakrawler | 3894 | 2023-08-24 | Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application |
dht | 2628 | 2023-08-24 | BitTorrent DHT Protocol && DHT Spider. |
gocrawl | 2017 | 2023-08-23 | Polite, slim and concurrent web crawler. |
go_spider | 1814 | 2023-08-18 | An awesome concurrent crawler (spider) framework for Go. |
scrape | 1497 | 2023-08-22 | A simple, higher level interface for Go web scraping. |
creeper | 775 | 2023-08-23 | The Next Generation Crawler Framework (Go). |
fetchbot | 775 | 2023-08-24 | A simple and flexible web crawler that follows the robots.txt policies and crawl delays. |
Dataflow kit | 610 | 2023-08-24 | Extract structured data from web pages. Web site scraping. |
ants-go | 365 | 2023-03-13 | An open source, distributed, RESTful crawler engine in Go. |
name | stars | update date | description |
---|---|---|---|
Scrapely | 1831 | 2023-08-20 | A pure-python HTML screen-scraping library. |
upton | 1613 | 2023-08-20 | A batteries-included framework for easy web scraping. Just add CSS (or do more). |
StormCrawler | 815 | 2023-08-22 | An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm |
name | stars | update date | description |
---|---|---|---|
Webmagic | 10927 | 2023-08-24 | A scalable crawler framework. |
spider-flow | 8329 | 2023-08-25 | A visual spider framework; it's so good that you don't need to write any code to crawl a website. |
Crawler4j | 4417 | 2023-08-24 | Simple and lightweight web crawler. |
WebCollector | 2993 | 2023-08-17 | Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes. |
anthelion | 2846 | 2023-08-07 | A plugin for Apache Nutch to crawl semantic annotations within HTML pages. |
Heritrix3 | 2527 | 2023-08-25 | Extensible, web-scale, archival-quality web crawler project. |
Gecco | 2460 | 2023-08-18 | An easy-to-use, lightweight web crawler. |
SeimiCrawler | 1927 | 2023-08-24 | An agile, distributed crawler framework. |
ACHE Crawler | 412 | 2023-08-17 | An easy to use web crawler for domain-specific search. |
Spark-Crawler | 403 | 2023-08-25 | Evolving Apache Nutch to run on Spark. |
webBee | 186 | 2023-07-27 | A DFS web spider. |
name | stars | update date | description |
---|---|---|---|
node-crawler | 6491 | 2023-08-24 | Node-crawler has a clean, simple API. |
x-ray | 5760 | 2023-08-19 | Web scraper with pagination and crawler support. |
headless-chrome-crawler | 5403 | 2023-08-24 | Headless Chrome crawls with jQuery support |
node-osmosis | 4083 | 2023-08-23 | HTML/XML parser and web scraper for Node.js. |
scrape-it | 3929 | 2023-08-19 | A Node.js scraper for humans. |
scraperjs | 3685 | 2023-08-15 | A complete and versatile web scraper. |
simplecrawler | 2137 | 2023-08-24 | Event driven web crawler. |
web-scraper-chrome-extension | 1241 | 2023-08-22 | Web data extraction tool implemented as a Chrome extension. |
webster | 462 | 2023-08-13 | A reliable web crawling framework which can scrape ajax and js rendered content in a web page. |
supercrawler | 353 | 2023-08-09 | Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. |
Cobweb | 228 | 2023-01-02 | Web crawler with very flexible crawling options, standalone or using sidekiq. |
Squidwarc | 156 | 2023-05-13 | High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head |
name | stars | update date | description |
---|---|---|---|
Goutte | 9239 | 2023-08-24 | A screen scraping and web crawling library for PHP. |
dom-crawler | 3828 | 2023-08-23 | The DomCrawler component eases DOM navigation for HTML and XML documents. |
QueryList | 2579 | 2023-08-07 | The progressive PHP crawler framework. |
spatie/crawler | 2372 | 2023-08-24 | An easy-to-use, powerful crawler implemented in PHP. Can execute JavaScript. |
php-spider | 1303 | 2023-08-24 | A configurable and extensible PHP web spider. |
laravel-goutte | 452 | 2023-08-10 | Laravel 5 Facade for Goutte. |
PHPScraper | 404 | 2023-08-23 | PHPScraper is a scraper & crawler built for simplicity. |
pspider | 264 | 2023-05-22 | Parallel web crawler written in PHP. |
crawlzone/crawlzone | 73 | 2023-08-04 | Crawlzone is a fast asynchronous internet crawling framework for PHP. |
name | stars | update date | description |
---|---|---|---|
web-scraper | 100 | 2023-05-27 | Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions. |
name | stars | update date | description |
---|---|---|---|
Scrapy | 48228 | 2023-08-25 | A fast high-level screen scraping and web crawling framework. |
you-get | 48019 | 2023-08-25 | Dumb downloader that scrapes the web. |
pyspider | 15994 | 2023-08-25 | A powerful spider system. |
newspaper | 13039 | 2023-08-24 | News, full-text, and article metadata extraction in Python 3 |
portia | 8975 | 2023-08-24 | Visual scraping for Scrapy. |
Scrapy-Redis | 5333 | 2023-08-25 | Redis-based components for Scrapy. |
MechanicalSoup | 4429 | 2023-08-24 | A Python library for automating interaction with websites. |
RoboBrowser | 3679 | 2023-08-23 | A simple, Pythonic library for browsing the web without a standalone web browser. |
distribute_crawler | 3228 | 2023-08-23 | Uses Scrapy, Redis, MongoDB, and Graphite to create a distributed spider. |
Gain | 2025 | 2023-08-11 | Web crawling framework based on asyncio for everyone. |
PSpider | 1783 | 2023-08-24 | A simple spider frame in Python3. |
aspider | 1693 | 2023-08-14 | An async web scraping micro-framework based on asyncio. |
cola | 1474 | 2023-07-29 | A distributed crawling framework. |
scrapy-cluster | 1125 | 2023-08-23 | Uses Redis and Kafka to create a distributed on demand scraping cluster. |
django-dynamic-scraper | 1124 | 2023-08-25 | Creating Scrapy scrapers via the Django admin interface. |
sukhoi | 878 | 2023-07-01 | Minimalist and powerful Web Crawler. |
MSpider | 344 | 2023-05-31 | A simple, easy spider using gevent and JS rendering. |
spidy | 312 | 2023-08-03 | The simple, easy to use command line web crawler. |
crawley | 177 | 2023-08-10 | Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. |
CoCrawler | 170 | 2023-08-03 | A versatile web crawler built using modern tools and concurrency. |
brownant | 156 | 2023-05-05 | A lightweight web data extracting framework. |
Demiurge | 111 | 2023-06-02 | PyQuery-based scraping micro-framework. |
name | stars | update date | description |
---|---|---|---|
rvest | 1415 | 2023-08-23 | Simple web scraping for R. |
name | stars | update date | description |
---|---|---|---|
mechanize | 4310 | 2023-08-21 | Automated web interaction & crawling. |
wombat | 1292 | 2023-08-18 | Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages. |
Spidr | 760 | 2023-07-06 | Spider a site, multiple domains, certain links or infinitely. |
RubyRetriever | 140 | 2023-08-06 | RubyRetriever is a Web Crawler, Scraper & File Harvester. |
name | stars | update date | description |
---|---|---|---|
spider | 265 | 2023-08-22 | The fastest web crawler and indexer. |
name | stars | update date | description |
---|---|---|---|
crawler | 146 | 2023-07-15 | Scala DSL for web crawling. |
scrala | 112 | 2023-03-15 | Scala crawler (spider) framework, inspired by Scrapy. |
ferrit | 61 | 2022-11-02 | Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra. |
name | stars | update date | description |
---|---|---|---|
crawlee | 8949 | 2023-08-25 | A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. |
js-crawler | 246 | 2023-08-06 | Web crawler for Node.js; both HTTP and HTTPS are supported. |