Uses pyquery based DSL to turn HTML into a dictionary
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


Warning this is a work in progress. I wouldn't use this for anything that will kill someone if it fails.

This project defines a really simple JSON based DSL for converting (X)HTML into dictionaries. Here is an example:

from pyquery import PyQuery
from pyqueryextract import extract
d = PyQuery('<a href="">Google!</a>')
spec = {
    'hardcoded': 'text',
    'txt':  {'$query': 'a'},
    'link': {'$query': 'a', '$op': ['attr', 'href']},

data = extract(d, spec)

This would produce the following dictionary:

{'link': '', 'txt': 'Google!', 'hardcoded': 'text'}

This Equivelent to the following Python code:

data = {}
data['hardcoded'] = 'text'
data['link'] = d("a").attr("href")
data['txt'] = d("a").text()
print data

While the python code is much more concise, pyqueryextract specs can be stored wherever JSON can be stored.


At the moment there is only operation defined which is $query. The value it is set to is what will be searched for in the document. If $op is also defined, it will call the function named in index zero. For instance, attr. The remaining items in the $op list will be applied to the function name.