forked from spatie/crawler

Crawl all links found on a website based on spatie/crawler

dmelearn/crawler

 
 

Crawl links on a website

This package provides a class to crawl links on a website.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

This fork has been modified to return an array that includes the URL, the response, and the parent URL (the web page that contains the hyperlink to the current URL) once the current URL has been crawled. It also has an option that allows an external URL to be crawled each time it is encountered, if it is linked to multiple times.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an object that implements the \Spatie\Crawler\CrawlObserver-interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url                      $url
 * @param \Psr\Http\Message\ResponseInterface|null $response
 * @param \Spatie\Crawler\Url|string               $parentUrl
 */
public function hasBeenCrawled(Url $url, $response, $parentUrl);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
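
As a rough illustration of the interface above, an observer could collect one row per crawled URL and print a summary when the crawl ends. This is only a sketch: the class name LoggingObserver and its $results property are hypothetical, and it assumes that the package is autoloaded through Composer and that Url objects can be cast to a string.

```php
<?php

use Spatie\Crawler\CrawlObserver;
use Spatie\Crawler\Url;

// Hypothetical observer, not part of the package.
class LoggingObserver implements CrawlObserver
{
    /** @var array[] Rows of [url, status code or null, parent url]. */
    public $results = [];

    public function willCrawl(Url $url)
    {
        // Nothing to do before the request in this sketch.
    }

    public function hasBeenCrawled(Url $url, $response, $parentUrl)
    {
        // The docblock allows $response to be null, so guard before reading it.
        $status = $response !== null ? $response->getStatusCode() : null;

        // Assumption: Url implements __toString(); $parentUrl may be a Url or a string.
        $this->results[] = [(string) $url, $status, (string) $parentUrl];
    }

    public function finishedCrawling()
    {
        foreach ($this->results as list($url, $status, $parent)) {
            $label = $status === null ? 'no response' : $status;
            echo sprintf("%s [%s] linked from %s\n", $url, $label, $parent);
        }
    }
}
```

An instance of such a class would then be passed to setCrawlObserver before calling startCrawling.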

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile-function. That function expects an object that implements the Spatie\Crawler\CrawlProfile-interface:

/**
 * Determine if the given url should be crawled.
 *
 * @param \Spatie\Crawler\Url $url
 *
 * @return bool
 */
public function shouldCrawl(Url $url);
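
For example, a profile that restricts the crawl to a single host might look like the sketch below. The class name SameHostProfile is hypothetical, and the code assumes that the package is autoloaded through Composer and that a Url object can be cast to a string. The host is extracted with PHP's built-in parse_url rather than with any method of the Url class.

```php
<?php

use Spatie\Crawler\CrawlProfile;
use Spatie\Crawler\Url;

// Hypothetical profile, not part of the package: only crawl URLs on one host.
class SameHostProfile implements CrawlProfile
{
    /** @var string Lower-cased host name that is allowed to be crawled. */
    private $host;

    public function __construct($host)
    {
        $this->host = strtolower($host);
    }

    public function shouldCrawl(Url $url)
    {
        // Assumption: Url implements __toString() and yields the full URL.
        $host = parse_url((string) $url, PHP_URL_HOST);

        // parse_url() returns null (or false for malformed input) when no host is present.
        return is_string($host) && strtolower($host) === $this->host;
    }
}
```

An instance such as new SameHostProfile('example.com') would be passed to setCrawlProfile before the crawl is started.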

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security related issues, please email freek@spatie.be instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.
