
Web crawler

Simple library for crawling websites by following links, with minimal dependencies.

Czech documentation


Install via Composer:

composer require baraja-core/webcrawler

and use it in your project. :)

How to use

The crawler can run without any dependencies.

With the default settings, create an instance and call the crawl() method:

$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawl('');

The $result variable will contain an entity of type CrawledResult.
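The core idea the library implements, crawling by following links, can be sketched without any dependencies: fetch a page, extract its href targets, and visit every not-yet-seen URL. The sketch below is an illustration only, not the library's actual implementation; extractLinks and crawlPages are hypothetical helper names, and pages are passed in as pre-fetched HTML instead of being downloaded.

```php
<?php
// Extract href targets from an HTML string.
// Simplified illustration: a real crawler should use a proper HTML parser.
function extractLinks(string $html): array
{
    preg_match_all('/href="([^"]+)"/i', $html, $matches);

    return $matches[1];
}

// Breadth-first "follow the links" loop over pre-fetched pages.
// $pages maps URL => HTML body; a real crawler would download each page.
function crawlPages(array $pages, string $startUrl): array
{
    $queue = [$startUrl];
    $visited = [];
    while ($queue !== []) {
        $url = array_shift($queue);
        if (isset($visited[$url]) || !isset($pages[$url])) {
            continue; // already seen, or outside the known page set
        }
        $visited[$url] = true;
        foreach (extractLinks($pages[$url]) as $link) {
            $queue[] = $link;
        }
    }

    return array_keys($visited);
}
```

The real library adds on top of this loop the HTTP downloading, domain filtering, and limits described in the configuration section below.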

Advanced checking of multiple URLs

In a real-world case you need to download multiple URLs within a single domain and check whether some specific URLs work.

Simple example:

$crawler = new \Baraja\WebCrawler\Crawler;

$result = $crawler->crawlList(
    '', // Starting (main) URL
    [ // Additional URLs
        '/robots.txt', // Relative links are also allowed
    ]
);

Notice: The robots.txt file and the sitemap will be downloaded automatically if they exist.


In the constructor of the Crawler service you can define your project-specific configuration.

For example:

$crawler = new \Baraja\WebCrawler\Crawler(
    new \Baraja\WebCrawler\Config([
        // key => value
    ])
);

No value is required. Pass the options as a key-value array.

Configuration options:

| Option | Default value | Possible values |
|---|---|---|
| followExternalLinks | false | Bool: Follow links to external domains, or stay only within the given domain? |
| sleepBetweenRequests | 1 | Int: Sleep between requests, in seconds. |
| maxHttpRequests | 1000000 | Int: Crawler budget limit (maximum number of HTTP requests). |
| maxCrawlTimeInSeconds | 30 | Int: Stop crawling when this time limit is exceeded. |
| allowedUrls | ['.+'] | String[]: List of regular expressions describing allowed URL formats. |
| forbiddenUrls | [''] | String[]: List of regular expressions describing banned URL formats. |
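Putting the table together, a full configuration might look like the fragment below. The values shown are illustrative (they repeat the defaults); every key is optional and falls back to its default when omitted.

```php
<?php

use Baraja\WebCrawler\Config;
use Baraja\WebCrawler\Crawler;

$crawler = new Crawler(
    new Config([
        'followExternalLinks' => false,  // stay within the starting domain
        'sleepBetweenRequests' => 1,     // seconds to wait between requests
        'maxHttpRequests' => 1000000,    // crawler budget limit
        'maxCrawlTimeInSeconds' => 30,   // stop crawling after this time
        'allowedUrls' => ['.+'],         // regex list of allowed URL formats
        'forbiddenUrls' => [''],         // regex list of banned URL formats
    ])
);
```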