HTML5 parser for Common Lisp
Common Lisp
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
cxml
tests
.gitignore
.travis.yml
COPYING
COPYING.LIB
LICENSES
README.md
cl-html5-parser.asd
constants.lisp
entities.lisp
html5-parser-class.lisp
html5-parser.lisp
inputstream.lisp
packages.lisp
simple-tree.lisp
tokenizer.lisp
tree-help.lisp
xmls.lisp

README.md

cl-html5-parser: HTML5 parser for Common Lisp

Abstract

cl-html5-parser is a HTML5 parser for Common Lisp with the following features:

  • It is a port of the Python library html5lib.
  • It passes all relevant tests from html5lib.
  • It is not tied to a specific DOM implementation.

Requirements

  • SBCL or ECL.
  • CL-PPCRE and FLEXI-STREAMS.

Might work with CLISP, ABCL and Clozure CL, but many of the tests don't pass there.

Usage

Parsing

Parsing functions are in the package HTML5-PARSER.

parse-html5 source &key encoding strictp dom
    => document, errors

Parse an HTML document from source. Source can be a string, a pathname or a stream. When parsing from a stream encoding detection is not supported, encoding must be supplied via the encoding keyword parameter.

When strictp is true, parsing stops on first error.

Returns two values. The primary value is the document node. The secondary value is a list of errors found during parsing. The format of this list is subject to change.

The type of document depends on the dom parameter. By default it's an instance of cl-html5-parser's own DOM implementation. See the DOM paragraph below for more information.

parse-html5-fragment source &key container encoding strictp dom
    => document-fragment, errors

Parses a fragment of HTML. Container sets the context, defaults to "div". Returns a document-fragment node. For the other parameters see PARSE-HTML5.

Example

(html5-parser:parse-html5-fragment "Parse <i>some</i> HTML" :dom :xmls)
==> ("Parse " ("i" NIL "some") " HTML")

The DOM

Parsing HTML5 is not possible without a DOM. cl-html5-parser defines a minimal DOM implementation for this task. Functions for traversing documents are exported by the HTML5-PARSER package.

Alternativly the parser can be instructed to to convert the document into other DOM implemenations using the dom parameter. The convertion is done by simply calling the generic function transform-html5-dom. Support for other DOM implementations can be added by defining new methods for this generic function. The dom parameter is either a symbol or a list where the car is a symbol and the rest is key arguments. Below is the currently supported target types.

:XMLS or (:XMLS &key namespace comments)

Converts a node into a simple XMLS-like list structure. If node is a document fragment a list of XMLS nodes a returned. In all other cases a single XMLS node is returned.

If namespace argument is true, tag names are conses of name and namespace URI.

By default comments are stripped. If comments argument is true, comments are returned as (:COMMENT NIL "comment text"). This extension of XMLS format.

:CXML

Convert to Closure XML Parser DOM implementation. In order to use this you must load/depend on the the system cl-html5-parser-cxml.

License

This library is available under the GNU Lesser General Public License v3.0.