lua-gumbo

Lua C API and LuaJIT FFI bindings for the Gumbo HTML5 parsing library, including a small set of core DOM APIs implemented in pure Lua.

Requirements

Installation

Using LuaRocks

To install the latest lua-gumbo release via LuaRocks, first ensure the requirements listed above are installed, then use the command:

luarocks install gumbo

Using GNU Make

By default, the Makefile will consult pkg-config for the appropriate Lua variables. Usually the following commands will be sufficient:

make
make check
[sudo] make install

The following pkg-config names are searched in order and the first one to be found is used (yes, these all exist in the wild):

lua lua52 lua5.2 lua-5.2 lua51 lua5.1 lua-5.1 luajit

If, for example, your system has both lua.pc and luajit.pc installed then lua.pc will be used by default. You can override this default behavior by specifying the LUA_PC variable. To build for LuaJIT, in this case, use:

make LUA_PC=luajit
make check LUA_PC=luajit
[sudo] make install LUA_PC=luajit

If your Lua installation doesn't include a pkg-config file, running make will simply complain and exit. In this case, the three relevant variables (LUA_CFLAGS, LUA_LMOD_DIR and LUA_CMOD_DIR) must be specified manually, for example:

make LUA_CFLAGS=-I/usr/include/lua5.2
make check
make install LUA_LMOD_DIR=/usr/share/lua/5.2 LUA_CMOD_DIR=/usr/lib/lua/5.2

Note: for convenience, variable overrides can be stored persistently in a file named local.mk. For example, instead of adding LUA_PC=luajit to every command, as shown above, it can just be added once to local.mk.

Usage

The gumbo module provides two functions:

parse

local document = gumbo.parse(html, tabStop)

Parameters:

  1. html: A string of UTF-8 encoded HTML.
  2. tabStop: The number of columns to count for tab characters when computing source positions (optional; defaults to 8).

Returns:

Either a Document node on success, or nil and an error message on failure.

parseFile

local document = gumbo.parseFile(pathOrFile, tabStop)

Parameters:

  1. pathOrFile: Either a file handle or filename string that refers to a file containing UTF-8 encoded HTML.
  2. tabStop: As above.

Returns:

As above.
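
The two accepted forms of pathOrFile can be sketched as follows. This is a minimal illustration, not part of the library: the temporary file and its contents are placeholders, and because this README does not say whether parseFile closes a handle it is given, the sketch only closes the handle if it is still open.

```lua
local gumbo = require "gumbo"

-- Write a small placeholder file so the sketch is self-contained.
local path = os.tmpname()
local out = assert(io.open(path, "w"))
out:write('<!DOCTYPE html><title>Test</title><p id="msg">Hi</p>')
out:close()

-- Form 1: pass a filename string.
local fromPath = assert(gumbo.parseFile(path))

-- Form 2: pass an already-open file handle.
local handle = assert(io.open(path, "rb"))
local fromHandle = assert(gumbo.parseFile(handle))

-- Assumption: ownership of the handle is not documented here,
-- so close it only if it is still open.
if io.type(handle) == "file" then
    handle:close()
end

os.remove(path)

-- Both documents were parsed from the same bytes.
print(fromPath:getElementById("msg").childNodes[1].data)
print(fromHandle:getElementById("msg").childNodes[1].data)
```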

Example

The following is a simple demonstration of how to find an element by ID and then print the contents of its first child text node.

local gumbo = require "gumbo"
local document = gumbo.parse('<div id="foo">Hello World</div>')
local foo = document:getElementById("foo")
local text = foo.childNodes[1].data
print(text)

Note: this example omits error handling for the sake of simplicity. Production code should wrap each step with assert() or some other, application-specific error handling.
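
One way to add the error handling mentioned above is sketched below. The helper name firstChildText is hypothetical, and the sketch assumes that getElementById returns nil when no element matches (as the DOM specification's null would suggest) and that reading data from a node without that property yields nil.

```lua
local gumbo = require "gumbo"

-- Hypothetical helper: returns the text of the first child of the element
-- with the given ID, or nil plus a message describing which step failed.
local function firstChildText(html, id)
    local document, err = gumbo.parse(html)
    if not document then
        return nil, "parse failed: " .. tostring(err)
    end

    -- Assumption: a missing ID yields nil rather than raising an error.
    local element = document:getElementById(id)
    if not element then
        return nil, "no element with id '" .. id .. "'"
    end

    -- Assumption: nodes without a data property yield nil when it is read.
    local child = element.childNodes[1]
    if not child or child.data == nil then
        return nil, "first child of '" .. id .. "' has no data"
    end
    return child.data
end

local html = '<div id="foo">Hello World</div>'
print(firstChildText(html, "foo"))
print(firstChildText(html, "bar"))
```

Returning nil and a message mirrors the convention that parse itself uses, so callers can still wrap the helper in assert() when a failure should be fatal.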

DOM API

The parse and parseFile functions both return a Document node, containing a tree of descendant nodes. The structure and API of this tree mostly conforms to the DOM Level 4 Core specification, with the following (intentional) exceptions:

  • DOMString types are encoded as UTF-8 instead of UTF-16.
  • Lists begin at index 1 instead of 0.
  • readonly is not fully enforced.

The following sections list the supported properties and methods, grouped by the DOM interface in which they are specified. No lua-gumbo-specific documentation currently exists, but since the tree implements a standard API, cross-checking these lists against the MDN DOM reference should suffice for now.

Note: When referring to external DOM documentation, don't forget to translate JavaScript examples to use Lua object:method() call syntax.
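
As a rough illustration of that translation and of the 1-based list indexing noted above, the sketch below shows the JavaScript form in comments next to the Lua equivalent. The markup is a placeholder, and the loop assumes that indexing childNodes past its last entry yields nil.

```lua
local gumbo = require "gumbo"
local document = assert(gumbo.parse('<ul id="list"><li>one</li><li>two</li></ul>'))

-- JavaScript: document.getElementById("list")
-- Lua: method calls use a colon; property reads keep the dot.
local list = assert(document:getElementById("list"))

-- JavaScript: list.childNodes[0]
-- Lua: lists start at index 1.
local first = list.childNodes[1]
print(first.childNodes[1].data)

-- Walk the children by index until one is missing
-- (assumes an out-of-range index yields nil).
local items = {}
local i = 1
while list.childNodes[i] do
    local li = list.childNodes[i]
    items[#items + 1] = li.childNodes[1].data
    i = i + 1
end
print(table.concat(items, ", "))
```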

Document

Implements Node and ParentNode.

Element

Implements Node, ParentNode, ChildNode and NonDocumentTypeChildNode.

Text

Implements Node, ChildNode and NonDocumentTypeChildNode.

Comment

Implements Node, ChildNode and NonDocumentTypeChildNode.

DocumentType

Implements Node and ChildNode.

  • name
  • publicId
  • systemId

Node

ParentNode

ChildNode

Attr

Not Implemented

The following methods from the CharacterData interface are intentionally omitted:

  • substringData()
  • appendData()
  • insertData()
  • deleteData()
  • replaceData()

The specification for these methods has numerous flaws, assumes UTF-16 encoding and 0-based offsets, and is unnecessarily complex for the small amount of utility it provides. A better alternative is to manipulate the data property directly.
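
As a sketch of what manipulating data directly can look like, the following uses Lua's standard string functions in place of each omitted method. A plain table stands in for a Text or Comment node so the snippet runs without the library; on a real node only the data field would be involved.

```lua
-- Stand-in for a Text node: only the `data` field matters for this sketch.
local node = { data = "Hello World" }

-- In place of appendData("!"):
node.data = node.data .. "!"

-- In place of insertData: insert "," after the first 5 bytes.
node.data = node.data:sub(1, 5) .. "," .. node.data:sub(6)

-- In place of deleteData: drop the final byte.
node.data = node.data:sub(1, -2)

-- In place of replaceData: swap the first "World" for "Lua".
-- The parentheses discard gsub's second return value (the match count).
node.data = (node.data:gsub("World", "Lua", 1))

-- In place of substringData: read the first 5 bytes.
local word = node.data:sub(1, 5)

print(node.data)
print(word)
```

Note that Lua's string functions count bytes, not characters, so offsets into data containing multi-byte UTF-8 sequences need to be chosen with care.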

Testing

  • make check: Runs all unit tests.
  • make check-html5lib: Runs just the html5lib tree-construction tests.
  • make check-install: Runs make check within a temporary, isolated installation, to ensure all modules are installed correctly.
  • make coverage.txt: Generates a test coverage report with luacov.

License

Copyright (c) 2013-2014, Craig Barnes.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.