This repository has been archived by the owner on Feb 16, 2018. It is now read-only.

A web scraping pattern using heuristics with the Symfony DOMCrawler


danrichards/aicrawler


AiCrawler

Leverage AI design patterns by using heuristics with the Symfony DOMCrawler.

Please crawl on over to the docs, which are also available as a GitBook.


Quickstart

The AiCrawler package is responsible for making boolean assertions about a node in the HTML DOM. It comes with a straightforward data point trait that records the results of your heuristics (rules) for a given "item" or context.

Install with Composer

$ composer require dan/aicrawler dev-master

Trivial example

$crawler = new AiCrawler('<html>...</html>');

$node = $crawler->filter('div[id="content-start"]');
$args = ['words' => 15];

// Does the content have at least 15 words?
$assertion = Heuristics::words($node, $args); // true / false
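To make the idea of a boolean assertion on a node concrete, here is a minimal sketch of what a word-count rule could look like using only PHP's built-in DOM extension. It is not the AiCrawler implementation: the function name `hasAtLeastWords` and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for a "words" heuristic; NOT the library's own code.
function hasAtLeastWords(DOMNode $node, int $min): bool
{
    // Split the node's text (descendants included) on whitespace runs,
    // dropping empty pieces so leading/trailing spaces do not count.
    $words = preg_split('/\s+/', trim($node->textContent), -1, PREG_SPLIT_NO_EMPTY);

    return count($words) >= $min;
}

// Sample markup, invented for this sketch.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="content-start">One two three four five</div></body></html>');

$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="content-start"]')->item(0);

var_dump(hasAtLeastWords($node, 5));  // bool(true)
var_dump(hasAtLeastWords($node, 15)); // bool(false)
```

The rule returns only true or false, which is what lets heuristics of this kind be combined and counted.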

A more expressive example

$crawler = new AiCrawler("<html>...</html>");

$args = [
    'elements' => [
        "elements" => "/p/ /blockquote/ /(u|o)l/ /h[1-6]/",
        "regex" => true,
        'words' => [
            'words' => 15,
            'descendants' => true,
            'words2' => [
                'words' => "/(cod(ing|ed|e)|program|language|php)/",
                'regex' => true,
                'descendants' => true
            ]
        ]
    ],
    'matches' => 3
];


/**
 * Are at least 3 of this div's children p, blockquote, ul, ol, or h1-h6
 * elements that contain at least 15 words (including text from the child's
 * descendants) AND words such as coding, coded, code, program, language, or
 * php (including text from the child's descendants)?
 */
$crawler->filter("div")->each(function($node) use ($args) {
    if (Heuristics::children($node, $args)) {
        $node->setDataPoint("example", "words", 1);
    }
});
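The nested rule above can be read as "count the children that pass every test, then compare the count to a threshold". The sketch below spells out that logic with plain ext-dom calls. It is an illustration under assumptions, not how `Heuristics::children` is implemented: the function `childrenMatch`, its parameters, and the sample markup are all hypothetical.

```php
<?php
// Hypothetical re-statement of the nested rule; NOT the library's own code.
function childrenMatch(
    DOMElement $parent,
    string $tagPattern,
    int $minWords,
    string $keywordPattern,
    int $minMatches
): bool {
    $matches = 0;

    foreach ($parent->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text nodes and comments
        }

        $text = $child->textContent; // includes text from descendants
        $wordCount = count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));

        // A child counts only if the tag, word count, and keyword tests all pass.
        if (preg_match($tagPattern, $child->nodeName) === 1
            && $wordCount >= $minWords
            && preg_match($keywordPattern, $text) === 1
        ) {
            $matches++;
        }
    }

    return $matches >= $minMatches;
}

// Sample markup, invented for this sketch.
$doc = new DOMDocument();
$doc->loadHTML(
    '<html><body><div id="post">'
    . '<p>I like coding in PHP</p>'
    . '<h2>A programming language</h2>'
    . '<ul><li>nothing relevant here</li></ul>'
    . '<p>short</p>'
    . '</div></body></html>'
);
$div = (new DOMXPath($doc))->query('//div[@id="post"]')->item(0);

$tags = '/^(p|blockquote|[uo]l|h[1-6])$/';
$keywords = '/(cod(ing|ed|e)|program|language|php)/i';

var_dump(childrenMatch($div, $tags, 3, $keywords, 2)); // bool(true)
var_dump(childrenMatch($div, $tags, 3, $keywords, 3)); // bool(false)
```

Only the first paragraph and the heading pass all three tests, so the rule holds for a threshold of 2 but not 3.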

Sound interesting? Read on about the Heuristics class or go right to a similar example with complete notes.

Version 0.0.1

  • A Heuristics class with some cool rules to get you started.
  • A Scorable trait is on our AiCrawler class so there is a pattern for data points.
  • An Extra trait is on our AiCrawler class so there is a pattern for storing extra data.
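As a rough picture of the data point pattern, the sketch below stores values keyed by item and data point name, mirroring the `setDataPoint("example", "words", 1)` call shown earlier. Everything else is an assumption: the trait name `RecordsDataPoints`, the `getDataPoint` and `total` methods, and the `ExampleNode` class are hypothetical and may not match the real Scorable trait.

```php
<?php
// Hypothetical sketch of a data point trait; the real Scorable trait may differ.
trait RecordsDataPoints
{
    /** @var array<string, array<string, int|float>> values keyed by item, then data point */
    private array $dataPoints = [];

    public function setDataPoint(string $item, string $name, $value): void
    {
        $this->dataPoints[$item][$name] = $value;
    }

    public function getDataPoint(string $item, string $name, $default = null)
    {
        return $this->dataPoints[$item][$name] ?? $default;
    }

    // Sum every data point recorded for one item (0 if none were recorded).
    public function total(string $item)
    {
        return array_sum($this->dataPoints[$item] ?? []);
    }
}

// Stand-in class for this sketch only.
final class ExampleNode
{
    use RecordsDataPoints;
}

$node = new ExampleNode();
$node->setDataPoint('example', 'words', 1);
$node->setDataPoint('example', 'links', 2);

var_dump($node->total('example')); // int(3)
```

Keeping the storage in a trait means any node class can record heuristic results without inheriting from a common base class.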

Todo

Contributing

  1. Fork this project on GitHub.
  2. Existing unit tests must pass.
  3. Contributions must be unit tested.
  4. New heuristics should be portable (have few or no dependencies).
  5. New heuristics should have helpful doc blocks.
  6. Submit a pull request.
  7. See the guide on extending Heuristics for special heuristics.

Documentation

  • Follow PSR-2.
  • Add PHPDoc blocks for all classes, methods, and functions.
  • Omit the @return tag if the method does not return anything.
  • Add a blank line before @param, @return or @throws.

Please report any issues here.

License

AiCrawler is free software distributed under the terms of the MIT license.
