
Adapting parserator to handle an entire document #16

@AbeHandler

I am currently using Parserator to parse short strings, like this:

s of 1/10 of an hour. The maximum amount to be paid under this contract is $20,000.00. No amount of work is guaranteed under this agreement; payments wil

and this

General Liability insurance will be purchased and maintained with limits of $1,000,000 per occurrence an

I extract these strings using a loose regular expression ".{75}\$[0-9]+.{75}" on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the 100 or so characters around a dollar value are enough to categorize it. But in some cases I need input from other parts of the document to do the tagging (e.g., earlier in the document it might mention that the document is a lease).
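As a rough sketch of that extraction step (the pattern here is an assumed variant of the one above, with the dollar sign escaped so it matches literally and the context windows made optional so amounts near a document boundary are still caught):

```python
import re

# Loose pattern: up to 75 characters of context on either side of a
# literal dollar amount. "\$" matches a dollar sign; an unescaped "$"
# would be an end-of-line anchor and match nothing useful here.
PATTERN = re.compile(r".{0,75}\$[0-9][0-9,.]*.{0,75}")

def extract_candidates(document):
    """Return each dollar-amount snippet with its surrounding context."""
    return PATTERN.findall(document)

doc = ("General Liability insurance will be purchased and maintained "
       "with limits of $1,000,000 per occurrence and $2,000,000 aggregate.")
snippets = extract_candidates(doc)
```

Note that `findall` returns non-overlapping matches, so two amounts fewer than 75 characters apart may land in the same snippet.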

@fgregg has pointed me here to show how you could do this with crfsuite directly: http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb. But I am wondering if it might be possible with the parserator wrapper. All uninteresting tokens would be tagged with a single catch-all label, and the interesting ones would be tagged with their proper values.
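The labeling scheme being proposed might look something like the following sketch. The label names ("Null" for the catch-all, "DollarAmount" for the values of interest) are invented for illustration, not labels parserator defines:

```python
def label_tokens(tokens):
    """Pair every token with a label: dollar amounts get a real label,
    everything else gets an assumed catch-all "Null" label."""
    pairs = []
    for tok in tokens:
        if tok.startswith("$"):
            pairs.append((tok, "DollarAmount"))  # hypothetical label name
        else:
            pairs.append((tok, "Null"))          # catch-all for the rest
    return pairs

pairs = label_tokens("limits of $1,000,000 per occurrence".split())
```

In a whole-document setting, almost every token would carry the catch-all label, which is where hand-annotation gets painful.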

I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest obstacle to using parserator in this way is annotating documents with hundreds and hundreds of tokens. It seems like you would want a small document-annotation GUI to generate the XML used to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?
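For concreteness, a GUI like that might emit parserator-style training XML along these lines, with one element per token whose tag is the token's label. The collection and sequence tag names ("ContractCollection", "ContractString") are made up here; the real names depend on how a parserator module is configured:

```python
import xml.etree.ElementTree as ET

def to_training_xml(labeled_sequences):
    """Serialize [(token, label), ...] sequences as training XML.
    Tag names are illustrative, not parserator's actual schema."""
    root = ET.Element("ContractCollection")
    for pairs in labeled_sequences:
        seq = ET.SubElement(root, "ContractString")
        for token, label in pairs:
            el = ET.SubElement(seq, label)
            el.text = token
            el.tail = " "  # preserve the whitespace between tokens
    return ET.tostring(root, encoding="unicode")

xml = to_training_xml([[("limits", "Null"), ("$1,000,000", "DollarAmount")]])
```

An annotation GUI would only need to let a user sweep a label over token spans and then dump this structure to disk.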
