A PHP class that crawls a given URL and recursively collects data from it. The final representation is a JSON object.

bjoern-hempel/php-web-crawler

Attention: This package is outdated and is no longer maintained. Use PHP Web Crawler instead.

WebCrawler

This PHP class recursively crawls a given web page (or a local HTML file) and collects data from it. Define the URL (or the HTML file) and a set of XPath expressions that map to the fields of the output data object. The result is a PHP array, which can easily be converted to JSON for further processing.

0. Introduction

0.1 Installation

user$ git clone git@github.com:bjoern-hempel/php-web-crawler.git .

0.2 Requirements

TODO...

1. Execute the examples

user$ php examples/simple.php 
{
    "version": "1.0.0",
    "title": "Test Title",
    "paragraph": "Test Paragraph"
}

2. How to use

2.1 Basic usage: simple.php (simple HTML page)

basic.html

<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Title</h1>
        <p>Test Paragraph</p>
    </body>
</html>

simple.php

<?php

include dirname(__FILE__).'/../autoload.php';

use Ixno\WebCrawler\Output\Field;
use Ixno\WebCrawler\Value\Text;
use Ixno\WebCrawler\Value\XpathTextnode;
use Ixno\WebCrawler\Source\File;

$file = dirname(__FILE__).'/html/basic.html';

$html = new File(
    $file,
    new Field('version', new Text('1.0.0')),
    new Field('title', new XpathTextnode('//h1')),
    new Field('paragraph', new XpathTextnode('//p'))
);

// Encode the parsed array as pretty-printed JSON and print it.
$data = json_encode($html->parse(), JSON_PRETTY_PRINT);

echo $data, "\n";

It returns:

{
    "version": "1.0.0",
    "title": "Test Title",
    "paragraph": "Test Paragraph"
}
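To make the mapping between XPath expressions and output fields more concrete, here is a minimal sketch that produces the same three fields using only PHP's built-in DOM extension. It is not the library's implementation: the helper `firstText` is a hypothetical name introduced for illustration, and the assumption that a field takes the text of the first matching node is ours.

```php
<?php

// Hypothetical helper (not part of the library): text of the first node
// matching the XPath expression, or null when nothing matches.
function firstText(DOMXPath $xpath, string $expression): ?string
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Same markup as basic.html.
$markup = <<<'HTML'
<html>
    <head>
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Title</h1>
        <p>Test Paragraph</p>
    </body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// One static value and two values extracted via XPath.
$result = [
    'version'   => '1.0.0',
    'title'     => firstText($xpath, '//h1'),
    'paragraph' => firstText($xpath, '//p'),
];

echo json_encode($result, JSON_PRETTY_PRINT), "\n";
```

The array is built first and encoded last, which mirrors the approach of the example above: the parsed result stays a plain PHP array until JSON is actually needed.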

2.2 More examples

2.3 Complex examples

TODO...

2.4 Converter

TODO...

2.5 Filters

TODO...

3. Running the tests

user$ phpunit tests/Basic.php 
PHPUnit 7.0.2 by Sebastian Bergmann and contributors.

..                                                                  2 / 2 (100%)

Time: 126 ms, Memory: 8.00MB

OK (2 tests, 16 assertions)

Using Composer's autoload (manual installation)

With Composer's autoloader you can use these classes without including the source files yourself.

Add the namespace mapping to the autoload section of your composer.json:

"autoload": {
    "psr-0": {
        ...
        "Ixno\\WebCrawler\\":"vendor/ixno/webcrawler/",
        ...
    }
},

Add this project to your vendor directory:

user$ cd /path/to/root/of/project
user$ mkdir -p vendor/ixno/webcrawler && cd vendor/ixno/webcrawler
user$ git clone git@github.com:bjoern-hempel/php-web-crawler.git . && cd ../../..

Run Composer to regenerate the autoload mappings:

user$ composer.phar dumpautoload -o

Check the result:

user$ grep -r Ixno vendor/composer/.

You will see something like the following lines:

vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\Converter\\Converter' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/Converter.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\Converter\\DateParser' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Converter/DateParser.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\CrawlRule' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\Crawler' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\Page' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\PageGroup' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
vendor/composer/./autoload_classmap.php:    'Ixno\\WebCrawler\\PageList' => $vendorDir . '/ixno/webcrawler/lib/Ixno/WebCrawler/Crawler.php',
...

Now you can simply use all classes without including the source files.
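The optimized class map shown above is essentially an array from class name to file path. The following sketch illustrates how such a lookup can work, using `spl_autoload_register`. It is not Composer's actual implementation, and the `Demo\WebCrawler\DemoPage` class and its temporary file are hypothetical, created only so that the example is self-contained.

```php
<?php

// Write a hypothetical demo class to a temporary file.
$file = sys_get_temp_dir() . '/DemoPage_' . uniqid() . '.php';
file_put_contents($file, "<?php\nnamespace Demo\\WebCrawler;\nclass DemoPage {}\n");

// Class map in the same shape as vendor/composer/autoload_classmap.php:
// fully qualified class name => file that defines it.
$classMap = [
    'Demo\\WebCrawler\\DemoPage' => $file,
];

// Require the mapped file the first time an unknown class is used.
spl_autoload_register(function (string $class) use ($classMap): void {
    if (isset($classMap[$class])) {
        require $classMap[$class];
    }
});

// No explicit include is needed; the autoloader resolves the class.
$page = new \Demo\WebCrawler\DemoPage();
echo get_class($page), "\n";

// Clean up the temporary file.
unlink($file);
```

Because the lookup is keyed by class name rather than derived from the file path, several classes can live in one file, as the `Crawler.php` entries in the output above show.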

A. Authors

B. License

This project is licensed under the MIT License. See the LICENSE file for details.

C. Closing words

Have fun! :)
