Latest commit b6b41ed on Jul 25, 2019 by Yauheni Yurkevich: fix google robots.txt http test

robots-txt-parser

RobotsTxtParser: a PHP class for parsing all directives in robots.txt files.

RobotsTxtValidator: a PHP class for checking whether a URL is allowed or disallowed by robots.txt rules.

Try the online demo of RobotsTxtParser on live domains.

Parsing follows the Google and Yandex robots.txt specifications.

Latest improvements:

  1. Parses the Clean-param directive according to the clean-param syntax.
  2. Strips comments (everything from the '#' character to the end of the line is ignored).
  3. Improved parsing of Host: it is a cross-section directive, so it is attached to the user-agent '*'. If there are multiple Host values, search engines take the first one.
  4. Removed unused methods from the class, refactored the code, and corrected the visibility of class properties.
  5. Added more test cases, including test cases for all of the new functionality.
  6. Added the RobotsTxtValidator class to check whether a URL is allowed to be crawled.
  7. Since version 2.0, RobotsTxtParser is significantly faster.
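
Items 2 and 3 can be illustrated with a small standalone sketch. It is not the library's code: `stripComment` and `firstHost` are hypothetical helpers written only to show the two rules (drop everything after '#', keep the first Host value).

```php
<?php
// Hypothetical helper (not part of the library): drop everything from '#'
// to the end of the line, then trim surrounding whitespace.
function stripComment(string $line): string
{
    $pos = strpos($line, '#');
    return trim($pos === false ? $line : substr($line, 0, $pos));
}

// Hypothetical helper: return the first Host value in the file, or null.
function firstHost(string $robotsTxt): ?string
{
    foreach (preg_split('/\R/', $robotsTxt) as $rawLine) {
        $line = stripComment($rawLine);
        if (preg_match('/^host\s*:\s*(\S+)/i', $line, $matches)) {
            return $matches[1]; // first Host wins, later ones are ignored
        }
    }
    return null;
}

$sample = "User-agent: * # all bots\nHost: example.com\nHost: example2.com\n";

var_dump(stripComment('Disallow: /ajax # no ajax')); // string(15) "Disallow: /ajax"
var_dump(firstHost($sample));                        // string(11) "example.com"
```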

Supported Directives:

  • DIRECTIVE_ALLOW = 'allow';
  • DIRECTIVE_DISALLOW = 'disallow';
  • DIRECTIVE_HOST = 'host';
  • DIRECTIVE_SITEMAP = 'sitemap';
  • DIRECTIVE_USERAGENT = 'user-agent';
  • DIRECTIVE_CRAWL_DELAY = 'crawl-delay';
  • DIRECTIVE_CLEAN_PARAM = 'clean-param';
  • DIRECTIVE_NOINDEX = 'noindex';

Installation

Install the latest version with

$ composer require bopoda/robots-txt-parser

Run tests

Run the PHPUnit tests with:

$ php vendor/bin/phpunit

Usage example

You can start the parser by getting the content of a robots.txt file from a website:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
var_dump($parser->getRules());
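
Note that `file_get_contents` returns `false` when the request fails, so it may be worth guarding the fetch before handing the result to the parser. A minimal sketch, assuming an empty string is an acceptable stand-in for a missing file; `fetchRobotsTxt` is a hypothetical helper, not part of the library:

```php
<?php
// Hypothetical helper: return the body of the file at $location,
// or an empty string when it cannot be read.
function fetchRobotsTxt(string $location): string
{
    // '@' suppresses the warning PHP emits on failure; the false return
    // value is checked explicitly instead.
    $content = @file_get_contents($location);
    return $content === false ? '' : $content;
}

// Intended use with the parser from this README (left as a comment so the
// sketch runs without the library installed):
// $parser = new RobotsTxtParser(fetchRobotsTxt('http://example.com/robots.txt'));
```

How an empty result should be treated (for instance as "no rules") is left to the caller.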

Alternatively, pass the contents of the file directly as input (e.g. when the content is already cached):

$parser = new RobotsTxtParser("
	User-Agent: *
	Disallow: /ajax
	Disallow: /search
	Clean-param: param1 /path/file.php

	User-agent: Yahoo
	Disallow: /

	Host: example.com
	Host: example2.com
");
var_dump($parser->getRules());

This will output:

array(2) {
  ["*"]=>
  array(3) {
    ["disallow"]=>
    array(2) {
      [0]=>
      string(5) "/ajax"
      [1]=>
      string(7) "/search"
    }
    ["clean-param"]=>
    array(1) {
      [0]=>
      string(21) "param1 /path/file.php"
    }
    ["host"]=>
    string(11) "example.com"
  }
  ["yahoo"]=>
  array(1) {
    ["disallow"]=>
    array(1) {
      [0]=>
      string(1) "/"
    }
  }
}
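
The returned array is keyed by lowercased user agent and then by directive name, with Host stored under '*'. A short sketch of reading it, using a hand-written copy of the output above so that it runs without the library; `groupFor` is a hypothetical helper, and exact-name matching with a fallback to '*' is a simplifying assumption:

```php
<?php
// Rules array copied by hand from the example output above.
$rules = [
    '*' => [
        'disallow'    => ['/ajax', '/search'],
        'clean-param' => ['param1 /path/file.php'],
        'host'        => 'example.com',
    ],
    'yahoo' => [
        'disallow' => ['/'],
    ],
];

// Hypothetical helper: pick the group for a user agent by exact
// (case-insensitive) name, falling back to the '*' group.
function groupFor(array $rules, string $userAgent): array
{
    $agent = strtolower($userAgent);
    return $rules[$agent] ?? $rules['*'] ?? [];
}

var_dump(groupFor($rules, 'Yahoo')['disallow']);        // only "/"
var_dump(groupFor($rules, 'MyAwesomeBot')['disallow']); // "/ajax" and "/search"
var_dump($rules['*']['host']);                          // "example.com"
```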

To check whether a URL is allowed, use the RobotsTxtValidator class:

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$validator = new RobotsTxtValidator($parser->getRules());

$url = '/';
$userAgent = 'MyAwesomeBot';

if ($validator->isUrlAllow($url, $userAgent)) {
    // Crawl the site URL and do nice stuff
}
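
This README does not describe how the validator matches rules internally. To give an idea of what such a check involves, here is a simplified standalone sketch of the longest-match rule from Google's specification: the longest matching path prefix decides, and Allow wins a tie. It handles plain prefixes only (no `*` or `$` wildcards), and `isPathAllowed` is a hypothetical function, not the library's API.

```php
<?php
// Hypothetical function: decide whether $path is allowed for one rule group
// shaped like ['allow' => [...], 'disallow' => [...]].
function isPathAllowed(array $group, string $path): bool
{
    $bestLength = -1;
    $allowed = true; // no matching rule means the path is allowed

    // 'allow' is checked first so that it keeps precedence on equal length.
    foreach (['allow' => true, 'disallow' => false] as $directive => $verdict) {
        foreach ($group[$directive] ?? [] as $prefix) {
            // An empty value matches nothing; skip non-matching prefixes.
            if ($prefix === '' || strpos($path, $prefix) !== 0) {
                continue;
            }
            $length = strlen($prefix);
            if ($length > $bestLength || ($length === $bestLength && $verdict)) {
                $bestLength = $length;
                $allowed = $verdict;
            }
        }
    }

    return $allowed;
}

$group = [
    'disallow' => ['/ajax', '/search'],
    'allow'    => ['/search/public'],
];

var_dump(isPathAllowed($group, '/'));                   // bool(true)
var_dump(isPathAllowed($group, '/ajax/load'));          // bool(false)
var_dump(isPathAllowed($group, '/search/public/page')); // bool(true)
```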

Contribution

Feel free to open a PR in this repository. Please follow the PSR coding style.

See the list of contributors who participated in this project.

Final Notes:

Please use version 2.0 or later, which follows the same rules but is significantly faster.
