edeposit_autoparser.py

This script eases the creation of new parsers.

Configuration file

The script expects a configuration file with patterns, specified with the -c parameter. Pattern files use YAML as the serialization format.

The pattern file should contain one or more pattern definitions. Here is an example of a test pattern file:

html: simple_xml.xml
first:
    data: i wan't this
    required: true
    notfoundmsg: Can't find variable '$name'.
second:
    data: and this
---
html: simple_xml2.xml
first:
    data: something wanted
    required: true
    notfoundmsg: Can't find variable '$name'.
second:
    data: another wanted thing

As you can see, this file contains two examples separated by ---. Each section of the file has to contain an html key pointing to either a file or a URL resource.

After the html key, there may be any number of variables. Each variable has to contain a data key, which defines the text that will be matched in the file the html key points to.
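The structural rules above can be checked with a few lines of Python. This is only a sketch: validate_example is a hypothetical helper, not part of the autoparser, and it assumes the YAML section has already been loaded into a plain dict.

```python
def validate_example(example):
    """Check one parsed section of the pattern file (a plain dict).

    Hypothetical helper for illustration; the autoparser may validate
    its input differently.
    """
    if "html" not in example:
        raise ValueError("Each example needs an 'html' key.")

    for name, spec in example.items():
        if name == "html":
            continue
        # every variable must be a mapping with a `data` key
        if not isinstance(spec, dict) or "data" not in spec:
            raise ValueError("Variable '%s' needs a 'data' key." % name)

    # names of the variables defined in this example
    return sorted(name for name in example if name != "html")


# First section of the test pattern file, written out as a dict.
example = {
    "html": "simple_xml.xml",
    "first": {"data": "i wan't this", "required": True},
    "second": {"data": "and this"},
}
print(validate_example(example))
```

A section without the html key, or a variable without data, raises ValueError in this sketch.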

Optionally, you can also specify required and notfoundmsg. If a variable is required and the generated parser fails to find it in the data, a UserWarning exception is raised with notfoundmsg as its message. As the example shows, you can use $name as a placeholder for the variable name (first, for example).
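The $name placeholder behaves like a template variable. A minimal sketch of the substitution, assuming it works the way Python's string.Template does (the autoparser's actual mechanism is not shown here, and render_notfoundmsg is a hypothetical name):

```python
from string import Template


def render_notfoundmsg(notfoundmsg, name):
    # safe_substitute leaves any unknown $placeholder untouched
    # instead of raising KeyError
    return Template(notfoundmsg).safe_substitute(name=name)


msg = render_notfoundmsg("Can't find variable '$name'.", "first")
print(msg)  # prints: Can't find variable 'first'.
```

The result matches the message used in the generated get_first() parser in the live example.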

There is also a special keyword, tagname, which can be used to further specify the correct element in case more than one element matches.

How it works

Autoparser first reads all examples and locates the elements whose content matches the pattern defined in the data key. Whitespace at the beginning and end of both the pattern and the element's content is ignored.
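The whitespace rule amounts to comparing stripped strings, the same comparison the generated is_equal_tag() helper performs in the live example. A small sketch, where content_matches is a hypothetical name:

```python
def content_matches(pattern, content):
    # only leading and trailing whitespace is ignored;
    # whitespace inside the text still has to match exactly
    return pattern.strip() == content.strip()


print(content_matches("and this", "\n    and this\n  "))
print(content_matches("and this", "and  this"))
```

The first call matches because only the surrounding whitespace differs; the second does not, because the difference is inside the text.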

Once the autoparser has collected all matching elements, it generates DOM paths to each element.

After that, the elimination process begins. In this step, autoparser throws away all paths that don't work for all corresponding variables in all examples.
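Conceptually, the elimination is a set intersection per variable. The sketch below is an illustration only: it represents paths as plain strings and assumes every example defines the same variables, whereas the autoparser works with its own path objects. The path strings and the eliminate_paths name are made up.

```python
def eliminate_paths(examples):
    """Keep, per variable, only the paths found in every example.

    `examples` is a list of dicts mapping variable name -> set of
    candidate paths. Plain-string paths are an assumption made for
    illustration.
    """
    surviving = {}
    for candidates in examples:
        for variable, paths in candidates.items():
            if variable not in surviving:
                surviving[variable] = set(paths)
            else:
                # drop every path that is missing in this example
                surviving[variable] &= set(paths)
    return surviving


# Hypothetical candidate paths for two examples.
example_one = {
    "first": {"root > xax > container[-1]", "container[0]"},
    "second": {"container#mycontent"},
}
example_two = {
    "first": {"root > xax > container[-1]"},
    "second": {"container#mycontent", "container[1]"},
}

result = eliminate_paths([example_one, example_two])
print(result)
```

Only the paths present in both examples survive for each variable; a variable left with an empty set would have no path that works everywhere.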

When this is done, the paths with the best priority are selected and .generate_parsers is called.

The result of this call is a string printed to the output. This string contains all necessary parsers for each variable and also a unit test.

You can then build the parser you need much more easily, because you now have working DOM pickers and all that remains is to clean the data.

Live example:

$ ./edeposit_autoparser.py -c autoparser/autoparser_data/example_data.yaml 
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# Interpreter version: python 2.7
#
# HTML parser generated by Autoparser
# (https://github.com/edeposit/edeposit.amqp.harvester)
#
import os
import os.path

import httpkie
import dhtmlparser


# Utilities
def _get_source(link):
    """
    Return source of the `link` whether it is filename or url.

    Args:
        link (str): Filename or URL.

    Returns:
        str: Content.

    Raises:
        UserWarning: When the `link` couldn't be resolved.
    """
    if link.startswith("http://") or link.startswith("https://"):
        down = httpkie.Downloader()
        return down.download(link)

    if os.path.exists(link):
        with open(link) as f:
            return f.read()

    raise UserWarning("html: '%s' is neither URL or data!" % link)


def _get_encoding(dom, default="utf-8"):
    """
    Try to look for meta tag in given `dom`.

    Args:
        dom (obj): pyDHTMLParser dom of HTML elements.
        default (default "utf-8"): What to use if encoding is not found in
                                   `dom`.

    Returns:
        str/default: Given encoding or `default` parameter if not found.
    """
    encoding = dom.find("meta", {"http-equiv": "Content-Type"})

    if not encoding:
        return default

    encoding = encoding[0].params.get("content", None)

    if not encoding:
        return default

    return encoding.lower().split("=")[-1]


def handle_encodnig(html):
    """
    Look for encoding in given `html`. Try to convert `html` to utf-8.

    Args:
        html (str): HTML code as string.

    Returns:
        str: HTML code encoded in UTF.
    """
    encoding = _get_encoding(
        dhtmlparser.parseString(
            html.split("</head>")[0]
        )
    )

    if encoding == "utf-8":
        return html

    return html.decode(encoding).encode("utf-8")


def is_equal_tag(element, tag_name, params, content):
    """
    Check whether the `element` object matches the rest of the parameters.

    All checks are performed only if proper attribute is set in the HTMLElement.

    Args:
        element (obj): HTMLElement instance.
        tag_name (str): Tag name.
        params (dict): Parameters of the tag.
        content (str): Content of the tag.

    Returns:
        bool: True if everything matches, False otherwise.
    """
    if tag_name and tag_name != element.getTagName():
        return False

    if params and not element.containsParamSubset(params):
        return False

    if content is not None and content.strip() != element.getContent().strip():
        return False

    return True


def has_neigh(tag_name, params=None, content=None, left=True):
    """
    This function generates functions, which match all tags with neighbours
    defined by parameters.

    Args:
        tag_name (str): Tag has to have neighbour with this tagname.
        params (dict): Tag has to have neighbour with these parameters.
        content (str): Tag has to have neighbour with this content.
        left (bool, default True): Tag has to have neighbour on the left, or
                                   right (set to ``False``).

    Returns:
        function: Closure that returns True for every matching tag.

    Note:
        This function can be used as parameter for ``.find()`` method in
        HTMLElement.
    """
    def has_neigh_closure(element):
        if not element.parent \
           or not (element.isTag() and not element.isEndTag()):
            return False

        # filter only visible tags/neighbours
        childs = element.parent.childs
        childs = filter(
            lambda x: (x.isTag() and not x.isEndTag()) \
                      or x.getContent().strip() or x is element,
            childs
        )
        if len(childs) <= 1:
            return False

        ioe = childs.index(element)
        if left and ioe > 0:
            return is_equal_tag(childs[ioe - 1], tag_name, params, content)

        if not left and ioe + 1 < len(childs):
            return is_equal_tag(childs[ioe + 1], tag_name, params, content)

        return False

    return has_neigh_closure


# Generated parsers
def get_second(dom):
    el = dom.find(
        'container',
        {'id': 'mycontent'},
        fn=has_neigh(None, None, 'something something', left=False)
    )

    # pick element from list
    el = el[0] if el else None

    return el


def get_first(dom):
    el = dom.wfind('root').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: root\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    el = el.wfind('xax').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: xax\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    el = el.wfind('container').childs

    if not el:
        raise UserWarning(
            "Can't find variable 'first'.\n" +
            'Tag name: container\n' +
            'El:' + str(el) + '\n' +
            'Dom:' + str(dom)
        )

    el = el[-1]

    return el


# Unittest
def test_parsers():
    # Test parsers against autoparser/autoparser_data/simple_xml.xml
    html = handle_encodnig(
        _get_source('autoparser/autoparser_data/simple_xml.xml')
    )
    dom = dhtmlparser.parseString(html)
    dhtmlparser.makeDoubleLinked(dom)

    second = get_second(dom)
    assert second.getContent().strip() == 'and this'

    first = get_first(dom)
    assert first.getContent().strip() == "i wan't this"

    # Test parsers against autoparser/autoparser_data/simple_xml2.xml
    html = handle_encodnig(
        _get_source('autoparser/autoparser_data/simple_xml2.xml')
    )
    dom = dhtmlparser.parseString(html)
    dhtmlparser.makeDoubleLinked(dom)

    second = get_second(dom)
    assert second.getContent().strip() == 'another wanted thing'

    first = get_first(dom)
    assert first.getContent().strip() == 'something wanted'


# Run tests of the parser
if __name__ == '__main__':
    test_parsers()

API

harvester.edeposit_autoparser