
Adapting parserator to handle an entire document #16

@AbeHandler

I am currently using Parserator to parse short strings, like this:

s of 1/10 of an hour. The maximum amount to be paid under this contract is $20,000.00. No amount of work is guaranteed under this agreement; payments wil

and this

General Liability insurance will be purchased and maintained with limits of $1,000,000 per occurrence an

I extract these strings using a loose regular expression ".{75}\$[0-9]+.{75}" on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the 100 or so characters around a dollar value are enough to categorize it. But in some cases I need input from other parts of the document to do the tagging (e.g., earlier in the document it might mention that the document is a lease).
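As a rough sketch of that extraction step (the pattern here is an assumed variant of the one above, with the dollar sign escaped so it matches literally and the context windows made optional so amounts near a document boundary are still caught):

```python
import re

# Loose pattern: up to 75 characters of context on either side of a
# literal dollar amount. "\$" matches a dollar sign; an unescaped "$"
# would be an end-of-line anchor and match nothing useful here.
PATTERN = re.compile(r".{0,75}\$[0-9][0-9,.]*.{0,75}")

def extract_candidates(document):
    """Return each dollar-amount snippet with its surrounding context."""
    return PATTERN.findall(document)

doc = ("General Liability insurance will be purchased and maintained "
       "with limits of $1,000,000 per occurrence and $2,000,000 aggregate.")
snippets = extract_candidates(doc)
```

Note that `findall` returns non-overlapping matches, so two amounts fewer than 75 characters apart may land in the same snippet.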

@fgregg has pointed me here to show how you could do this with crfsuite directly: http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb. But I am wondering if it might be possible with the parserator wrapper. All uninteresting tokens would be tagged with a single catch-all label, and the interesting ones would be tagged with their proper values.
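The labeling scheme being proposed might look something like the following sketch. The label names ("Null" for the catch-all, "DollarAmount" for the values of interest) are invented for illustration, not labels parserator defines:

```python
def label_tokens(tokens):
    """Pair every token with a label: dollar amounts get a real label,
    everything else gets an assumed catch-all "Null" label."""
    pairs = []
    for tok in tokens:
        if tok.startswith("$"):
            pairs.append((tok, "DollarAmount"))  # hypothetical label name
        else:
            pairs.append((tok, "Null"))          # catch-all for the rest
    return pairs

pairs = label_tokens("limits of $1,000,000 per occurrence".split())
```

In a whole-document setting, almost every token would carry the catch-all label, which is where hand-annotation gets painful.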

I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest obstacle to using parserator in this way is annotating documents with hundreds and hundreds of tokens. It seems like you would want a small document-annotation GUI to generate the XML used to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?
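For concreteness, a GUI like that might emit parserator-style training XML along these lines, with one element per token whose tag is the token's label. The collection and sequence tag names ("ContractCollection", "ContractString") are made up here; the real names depend on how a parserator module is configured:

```python
import xml.etree.ElementTree as ET

def to_training_xml(labeled_sequences):
    """Serialize [(token, label), ...] sequences as training XML.
    Tag names are illustrative, not parserator's actual schema."""
    root = ET.Element("ContractCollection")
    for pairs in labeled_sequences:
        seq = ET.SubElement(root, "ContractString")
        for token, label in pairs:
            el = ET.SubElement(seq, label)
            el.text = token
            el.tail = " "  # preserve the whitespace between tokens
    return ET.tostring(root, encoding="unicode")

xml = to_training_xml([[("limits", "Null"), ("$1,000,000", "DollarAmount")]])
```

An annotation GUI would only need to let a user sweep a label over token spans and then dump this structure to disk.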
