Skip to content
A flexible spider in PHP
PHP
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
examples
src
tests
.gitignore
.travis.yml
LICENSE
README.md
composer.json
composer.lock
phpunit.xml

README.md

Spider Build Status

A flexible spider in PHP.

Concepts

A spider contains many processors called pipes, you can pass as many tasks as you like to the spider, each task go through these pipes and get processed.

Installation

composer require ddliu/spider

Requirements

  • PHP5.3+
  • curl(RequestPipe)

Dependencies

See composer.json.

Usage

use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        })
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();

Find more examples in examples folder.

Spider

The Spider class.

Options

  • limit: maxmum tasks to run

Methods

  • pipe($pipe): add a pipe
  • addTask($task): add a task
  • run(): run the spider
  • report(): write report to log

Task

A task contains the data array and some helper functions.

The Task class implements ArrayAccess interface, so you can access data like array.

Methods

  • fork($task): add a sub task to the spider
  • ignore(): ignore the task

Pipes

Pipes define how each task being processed.

A pipe can be a function:

function($spider, $task) {}

Or extends the BasePipe:

use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}

Useful Pipes

NormalizeUrlPipe

Normalize $task['url'].

new NormalizeUrlPipe()

RequestPipe

Start an HTTP request with $task['url'] and save the result in $task['content'].

new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));

FileCachePipe

Cache a pipe (e.g. RequestPipe).

$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);

RetryPipe

Retry on failure.

$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);

DomCrawlerPipe

Create a DomCrawler from $task['content']. Access it with $task['$dom'] in following pipes.

ReportPipe

Report every 10 minutes.

new ReportPipe(array(
    'seconds' => 600
))

Logging

$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before start:

use Monolog\Handler\StreamHandler;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));

TODO/Ideas

  • Real world examples.
  • Running tasks concurrently.(With pthread?)

Alternate

Use golang version for better performance!

You can’t perform that action at this time.