RefineDom

RefineDom - Simple and fast Html parser refined.

Installation
Quick start
Creating new document
Search for elements
Verify if element exists
Supported selectors
Output
Creating a new element
Getting parent element
Getting sibling elements
Getting the child elements
Getting owner document
Working with element attributes
Comparing elements
Adding a child element
Replacing an element
Removing element
Working with cache
Comparison with other parsers

Installation

At the time of writing there is no public Packagist package. Therefor a custom vcs repository has to be defined in your composer.json:

...
"repositories": [
    {
        "type": "vcs",
        "url": "git@github.com:GameplayJDK/RefineDom.git"
    }
],
...

After that you can install RefineDom using the following command:

composer require gameplayjdk/refinedom

In the future you may be able to install without defining a custom vcs repository.

Quick start

use RefineDom\Document;

$document = new Document('news.html', true);

$posts = $document->find('.post');

foreach ($posts as $post)
{
    echo($post->text(), "\n");
}

Creating new document

RefineDom currently allows to load html in three ways:

With constructor

// the first parameter is a string with html
$document = new Document($html);

// file path
$document = new Document('page.html', true);

// or DOMDocument
$document = new Document($doc);

The second parameter specifies if you need to load file. Default is false.

With separate methods

$document = new Document();

$document->load($html);

$document->loadFile('page.html');

$document->loadDocument($doc);

The load method is also available for loading Xml but requires $isHtml to be set to false using either the third constructor argument or $document->setIsHtml(false);.

It then accept additional options:

$document->load($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

Search for elements

RefineDom allows CSS selectors or XPath expressions for search. You need to pass the expression as the first parameter, and specify its type in the second one (default type is Query::TYPE_CSS):

With method `find()`:

use RefineDom\Document;
use RefineDom\Query;
    
...

// with CSS selector
$posts = $document->find('.post');

// or XPath
$posts = $document->find("//div[contains(@class, 'post')]", Query::TYPE_XPATH);

With magic method `__invoke()`:

$posts = $document('.post');

With method `xPath()`:

$posts = $document->xPath("//*[contains(concat(' ', normalize-space(@class), ' '), ' post ')]");

You can search inside an element:

echo $document->find('.post')[0]->find('h2')[0]->text();

If the elements that match a given expression are found, then the methods return an array of instances of RefineDom\Element, otherwise an empty array. You could also get an array of DOMElement objects. To get this, pass false as the third parameter.

To avoid the array square brackets (these: []), use the findIndex and xPathIndex methods:

$posts = $document->findIndex('header', 1)->xPathIndex("//h1")->text();

Its first argument defaults to zero and the other arguments are the ones available for normal search methods.

Verify if element exists

To verify if an element exist use the has method:

if ($document->has('.post'))
{
    // code
}

If you need to check if element exist and then get it:

if ($document->has('.post'))
{
    $elements = $document->find('.post');
    // code
}

but it would be faster like this:

if (count($elements = $document->find('.post')) > 0)
{
    // code
}

because in the first case it makes two requests.

Note that expressions are cached thougth. See the working with cache section for details.

This check also works with findIndex (and also xPathIndex):

if (($element = $document->findIndex('.post', 3)) !== null)
{
    // code
}

which retun null when the element at index does not exist.

Supported selectors

RefineDom supports search by:

tag
class, ID, name and value of an attribute
pseudo-classes:
- first-, last-, nth-child
- empty and not-empty
- contains
- has

// all links
$document->find('a');

// any element with id = "foo" and a "bar" class
$document->find('#foo.bar');

// any element with attribute "name"
$document->find('[name]');
// which is the same as
$document->find('*[name]');

// input field with the name "foo"
$document->find('input[name=foo]');
$document->find('input[name=\'bar\']');
$document->find('input[name="baz"]');

// any element that has an attribute starting with "data-" and the value "foo"
$document->find('*[^data-=foo]');

// all links starting with https
$document->find('a[href^=https]');

// all images with the extension "png" assuming their src attribute ends 'png'
$document->find('img[src$=png]');

// all links containing the string "example.com"
$document->find('a[href*=example.com]');

// text of the links with "foo" class
$document->find('a.foo::text');

// address and title of all the links with "bar" class
$document->find('a.bar::attr(href|title)');

Output

Getting Html

With method `html()`:

$post = $document->findIndex('.post');

echo $post->html();

Casting to string:

$html = (string) $posts[0];

Formatting Html output:

$html = $document->format()->html();

An element does not have the format() method, so if you need to output formatted Html of the element, then first you have to convert it to a document like this:

$html = $element->toDocument()->format()->html();

Adittionally you can supply additional options to xml as well as to html:

$html = $document->format()->xml(LIBXML_NOEMPTYTAG);

Unformatted Html output:

To output unformatted html or xml give a boolean argument to format (or setFormat):

$unformatted = $document->format(false)->html();

Inner Html

$innerHtml = $element->innerHtml();

Document does not have the method innerHtml(), therefore, if you need to get inner Html of a document, convert it into an element first:

$innerHtml = $document->toElement()->innerHtml();

Getting content

$posts = $document->find('.post');

echo $posts[0]->text();

Creating a new element

Creating an instance of the class

use RefineDom\Element;

$element = new Element('span', 'Hello');
    
// outputs "<span>Hello</span>"
echo $element->html();

First parameter is the name of the element, the second one is its text value (optional), the third one is an array of element attributes (also optional).

An example of creating an element with attributes:

$attributes = ['name' => 'description', 'placeholder' => 'Enter description of item'];

$element = new Element('textarea', 'Text', $attributes);

An element can also be created from an instance of the class DOMElement:

use RefineDom\Element;
use DOMElement;

$domElement = new DOMElement('span', 'Hello');

$element = new Element($domElement);

Using the method `createElement` of a document

$document = new Document($html);

$element = $document->createElement('span', 'Hello');

Getting parent element

$document = new Document($html);

$input = $document->findIndex('input[name=email]');

var_dump($input->parent());

Getting sibling elements

$document = new Document($html);

$item = $document->findIndex('ul.menu > li', 1);

var_dump($item->previousSibling());

var_dump($item->nextSibling());

Getting the child elements

$html = '
<ul>
    <li>Foo</li>
    <li>Bar</li>
    <li>Baz</li>
</ul>
';

$document = new Document($html);
$list = $document->first('ul');

// string(3) "Baz"
var_dump($item->child(2)->text());

// string(3) "Foo"
var_dump($item->firstChild()->text());

// string(3) "Baz"
var_dump($item->lastChild()->text());

// array(3) { ... }
var_dump($item->children());

Getting owner document

$document = new Document($html);

$element = $document->findIndex('input[name=email]', 0);

$otherDocument = $element->getDocument();

// bool(true)
var_dump($document->is($otherDocument));

Working with element attributes

Getting the tag name

$name = $element->tag;

Creating/updating an attribute

With method `setAttribute`:

$element->setAttribute('name', 'username');

With method `attr`:

$element->attr('name', 'username');

With magic method `__set`:

$element->name = 'username';

Getting value of an attribute

With method `getAttribute`:

$username = $element->getAttribute('value');

With method `attr`:

$username = $element->attr('value');

With magic method `__get`:

$username = $element->name;

Returns null if attribute is not found.

Verify if attribute exists

With method `hasAttribute`:

if ($element->hasAttribute('name'))
{
    // code
}

With magic method `__isset`:

if (isset($element->name))
{
    // code
}

Removing attribute:

With method `removeAttribute`:

$element->removeAttribute('name');

With magic method `__unset`:

unset($element->name);

Comparing elements

$element = new Element('span', 'hello');
$otherElement = new Element('span', 'hello');

// bool(true)
var_dump($element->is($element));

// bool(false)
var_dump($element->is($otherElement));

Appending child elements

$list = new Element('ul');

$item = new Element('li', 'Item 1');

$list->appendChild($item);

$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($items);

Adding a child element

$list = new Element('ul');

$item = new Element('li', 'Item 1');
$items = [
    new Element('li', 'Item 2'),
    new Element('li', 'Item 3'),
];

$list->appendChild($item);
$list->appendChild($items);

Replacing an element

$element = new Element('span', 'hello');

$document->find('.post')[0]->replace($element);

Removing element

$document->findIndex('.post')->remove();

Working with cache

Cache is an associative array of XPath expressions, that were converted from CSS selectors. The CSS selector is the key.

Getting from cache

use RefineDom\Query;

...

$xPath = Query::compile('h2');
$compiled = Query::getCompiled();

// array('h2' => '//h2')
var_dump($compiled);

Installing a cache

Using a predefined cache can help to improve the speed as there is no need to recompile a selector.

Query::setCompiled(['h2' => '//h2']);

Comparison with other parsers

This comparison refers to DiDom, not RefineDom. Numbers should not be off that much thought.

Comparison with other parsers

Name		Name	Last commit message	Last commit date
Latest commit History 201 Commits
src/RefineDom		src/RefineDom
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
composer.json		composer.json

License

GameplayJDK/RefineDom

Folders and files

Latest commit

History

Repository files navigation

RefineDom

Contents

Installation

Quick start

Creating new document

With constructor

With separate methods

Search for elements

With method find():

With magic method __invoke():

With method xPath():

Verify if element exists

Supported selectors

Output

Getting Html

With method html():

Casting to string:

Formatting Html output:

Unformatted Html output:

Inner Html

Getting content

Creating a new element

Creating an instance of the class

Using the method createElement of a document

Getting parent element

Getting sibling elements

Getting the child elements

Getting owner document

Working with element attributes

Getting the tag name

Creating/updating an attribute

With method setAttribute:

With method attr:

With magic method __set:

Getting value of an attribute

With method getAttribute:

With method attr:

With magic method __get:

Verify if attribute exists

With method hasAttribute:

With magic method __isset:

Removing attribute:

With method removeAttribute:

With magic method __unset:

Comparing elements

Appending child elements

Adding a child element

Replacing an element

Removing element

Working with cache

Getting from cache

Installing a cache

Comparison with other parsers

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

With method `find()`:

With magic method `__invoke()`:

With method `xPath()`:

With method `html()`:

Using the method `createElement` of a document

With method `setAttribute`:

With method `attr`:

With magic method `__set`:

With method `getAttribute`:

With method `attr`:

With magic method `__get`:

With method `hasAttribute`:

With magic method `__isset`:

With method `removeAttribute`:

With magic method `__unset`: