# DocParser Class
This example gives an overview of DocParser functionality. See the [reference docs](https://devincornell.github.io/doctable/ref/doctable.DocParser.html) for more detail. The class includes only classmethods and staticmethods, so it is meant to be inherited from, or called directly on the class as in the examples below, rather than instantiated.

The DocParser class currently facilitates conversion from spacy doc objects to one of two object types:
1. **Token lists**: The `.tokenize_doc()` method produces sequences of token strings used for input into algorithms like word2vec, topic modeling, co-occurrence analyses, etc. These require no doctable-specific functionality to manipulate. To accomplish this, it also draws on `.parse_tok()` to specify rules for converting spacy token objects to strings and `.use_tok()` to decide whether or not to include a token.
2. **Parsetrees**: The `.get_parsetrees()` method produces objects used for analyzing grammatical structure. Each parsetree contains token information along with the grammatical relationships observed in the original sentence. By default, it relies on the `.parse_tok()` method to convert token objects to string representations. DocParser provides a built-in `ParseTree` class for working with parsetrees, and they can also be converted to nested dictionaries.

In [1]:
import sys
sys.path.append('..')
import doctable as dt
from spacy import displacy
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
exstr = 'James will paint the house for $20 (twenty dollars). He is a rule-breaker'
doc = nlp(exstr)
doc

James will paint the house for $20 (twenty dollars). He is a rule-breaker

## Preprocessing
Before parsing a document with spacy, the first step is usually preprocessing, which means removing artifacts from the raw text. The `.preprocess()` method has features for replacing urls, replacing xml tags, and replacing digits.

In [3]:
advstr = 'James said he will paint the house red for $20 (twenty dollars). He is such a <i>rule-breaker</i>: http://rulebreaking.com'
dt.DocParser.preprocess(advstr, replace_url='URL', replace_xml='', replace_digits='DIG')

'James said he will paint the house red for $DIG (twenty dollars). He is such a rule-breaker: URL'

After preprocessing, you would normally pass the cleaned string to the spacy parser.
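
As a rough illustration of what this kind of preprocessing involves, the sketch below reproduces the three replacements using only the standard `re` module. The function name `preprocess_sketch` and its regular expressions are assumptions made for this example. They are not doctable's actual implementation, which may handle more cases.

```python
import re

def preprocess_sketch(text, replace_url=None, replace_xml=None, replace_digits=None):
    """Hypothetical stand-in for a preprocessing step (not doctable's code)."""
    # Replace urls first so that digits inside a url are not rewritten separately.
    if replace_url is not None:
        text = re.sub(r'https?://\S+', replace_url, text)
    # Replace anything that looks like an xml/html tag, e.g. <i> or </i>.
    if replace_xml is not None:
        text = re.sub(r'<[^>]+>', replace_xml, text)
    # Replace each run of digits with a placeholder string.
    if replace_digits is not None:
        text = re.sub(r'\d+', replace_digits, text)
    return text

advstr = ('James said he will paint the house red for $20 (twenty dollars). '
          'He is such a <i>rule-breaker</i>: http://rulebreaking.com')
cleaned = preprocess_sketch(advstr, replace_url='URL', replace_xml='', replace_digits='DIG')
print(cleaned)
```

Passing `None` for an option leaves that kind of artifact untouched, so each replacement can be switched on independently.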

## Tokenization
Many text analysis applications begin with converting raw text into sequences of tokens. The `.tokenize_doc()`, `.use_tok()`, and `.parse_tok()` methods are convenient tools to assist with this task.

In [4]:
# basic doc tokenizer works like this
print(dt.DocParser.tokenize_doc(doc))

['James', 'will', 'paint', 'the', 'house', 'for', '$', '20', '(', 'twenty', 'dollars', ')', '.', 'he', 'is', 'a', 'rule', '-', 'breaker']


In [5]:
# and there are a number of options when using this method
print(dt.DocParser.tokenize_doc(doc, split_sents=True, merge_ents=True, merge_noun_chunks=True))

[['James', 'will', 'paint', 'the house', 'for', '$', '20', '(', 'twenty dollars', ')', '.'], ['he', 'is', 'a rule-breaker']]


In [6]:
# you can override how tokens are filtered (use_tok_func) and how tok objects are converted to str (parse_tok_func)
use_tok = lambda tok: not tok.like_num
parse_tok = lambda tok: tok.lower_
print(dt.DocParser.tokenize_doc(doc, use_tok_func=use_tok, parse_tok_func=parse_tok))

['james', 'will', 'paint', 'the house', 'for', '$', '(', 'twenty dollars', ')', '.', 'he', 'is', 'a rule-breaker']


In [7]:
# or use DocParser .use_tok() and .parse_tok() methods for additional features.
# this filters out stopwords and converts all number quantities to __NUM__
use_tok = lambda tok: dt.DocParser.use_tok(tok, filter_stop=True)
parse_tok = lambda tok: dt.DocParser.parse_tok(tok, format_ents=True, replace_num='__NUM__')
print(dt.DocParser.tokenize_doc(doc, use_tok_func=use_tok, parse_tok_func=parse_tok))

['James', 'paint', 'the house', '$', '__NUM__', '(', 'Twenty Dollars', ')', '.', 'a rule-breaker']
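
The filter-then-convert pattern behind `use_tok_func` and `parse_tok_func` can be sketched without spacy. In the example below, `FakeTok` and `tokenize_sketch` are hypothetical names introduced only for illustration. `FakeTok` mimics the few spacy token attributes used above, and the function shows one plausible way the two callbacks could be combined, not how doctable implements it.

```python
from dataclasses import dataclass

@dataclass
class FakeTok:
    """Hypothetical stand-in for a spacy token with only the fields used here."""
    text: str
    like_num: bool = False
    is_stop: bool = False

def tokenize_sketch(toks, use_tok_func=None, parse_tok_func=None):
    """Keep tokens accepted by use_tok_func, then convert each with parse_tok_func."""
    use = use_tok_func if use_tok_func is not None else (lambda tok: True)
    parse = parse_tok_func if parse_tok_func is not None else (lambda tok: tok.text)
    return [parse(tok) for tok in toks if use(tok)]

toks = [
    FakeTok('James'),
    FakeTok('will', is_stop=True),
    FakeTok('paint'),
    FakeTok('for', is_stop=True),
    FakeTok('$'),
    FakeTok('20', like_num=True),
]

# Drop number-like tokens and lowercase everything else.
no_nums = tokenize_sketch(
    toks,
    use_tok_func=lambda tok: not tok.like_num,
    parse_tok_func=lambda tok: tok.text.lower(),
)
print(no_nums)  # ['james', 'will', 'paint', 'for', '$']

# Drop stopwords and keep the original casing.
no_stops = tokenize_sketch(toks, use_tok_func=lambda tok: not tok.is_stop)
print(no_stops)  # ['James', 'paint', '$', '20']
```

Keeping the filter and the conversion as separate callbacks means either one can be swapped without touching the other.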


## Parsetree Extraction
In cases where you want to keep information about grammatical structure in your parsed document, use the `.get_parsetrees()` method.

In [8]:
# extracts a parsetree for each sentence in the document
print(dt.DocParser.get_parsetrees(doc))

[ParseTree(['James', 'will', 'paint', 'the house', 'for', '$', '20', '(', 'twenty dollars', ')', '.']), ParseTree(['he', 'is', 'a rule-breaker'])]


In [19]:
# by default this works like .tokenize_doc() except that it doesn't remove toks
# it includes a lot of other information as well
# it will include .pos and .ent if they were available in spacy parsing
s1, s2 = dt.DocParser.get_parsetrees(doc)
print([t.tok for t in s1])
print([t.dep for t in s1])
print([t.tag for t in s1])
print([t.pos for t in s1])
print([t.ent for t in s1])

['James', 'will', 'paint', 'the house', 'for', '$', '20', '(', 'twenty dollars', ')', '.']
['nsubj', 'aux', 'ROOT', 'dobj', 'prep', 'nmod', 'pobj', 'punct', 'appos', 'punct', 'punct']
['NNP', 'MD', 'VB', 'NN', 'IN', '$', 'CD', '-LRB-', 'NNS', '-RRB-', '.']
['PROPN', 'AUX', 'VERB', 'NOUN', 'ADP', 'SYM', 'NUM', 'PUNCT', 'NOUN', 'PUNCT', 'PUNCT']
['ORG', '', '', '', '', '', 'MONEY', '', 'MONEY', '', '']


In [12]:
# much like .tokenize_doc(), you can specify the parse_token functionality 
#     that will be applied to the .tok property
parse_tok = lambda tok: tok.text.upper()
s1, s2 = dt.DocParser.get_parsetrees(doc, parse_tok_func=parse_tok)
print(s1.toks)

['JAMES', 'WILL', 'PAINT', 'THE HOUSE', 'FOR', '$', '20', '(', 'TWENTY DOLLARS', ')', '.']


In [20]:
# to attach additional info to parsetree tokens, use info_func_map
infomap = {'is_stop':lambda tok: tok.is_stop, 'like_num': lambda tok: tok.like_num}
s1, s2 = dt.DocParser.get_parsetrees(doc, info_func_map=infomap)
print(s1.toks)
print([t.info['like_num'] for t in s1])

['James', 'will', 'paint', 'the house', 'for', '$', '20', '(', 'twenty dollars', ')', '.']
[False, False, False, False, False, False, True, False, False, False, False]
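
Conceptually, a mapping like `infomap` pairs each key with a function that is applied to every token, and the results are stored under the same key. The sketch below shows that idea with plain Python objects. `build_info` is a hypothetical helper and the `SimpleNamespace` objects stand in for spacy tokens, so this is an assumption about the pattern rather than doctable's internals.

```python
from types import SimpleNamespace

def build_info(tok, info_func_map):
    """Apply each named function to the token and collect results by name."""
    return {name: func(tok) for name, func in info_func_map.items()}

# Stand-ins for spacy tokens, carrying only the attributes used below.
toks = [
    SimpleNamespace(text='for', is_stop=True, like_num=False),
    SimpleNamespace(text='20', is_stop=False, like_num=True),
]

infomap = {'is_stop': lambda tok: tok.is_stop, 'like_num': lambda tok: tok.like_num}
infos = [build_info(tok, infomap) for tok in toks]
print(infos)
# [{'is_stop': True, 'like_num': False}, {'is_stop': False, 'like_num': True}]
```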


In [28]:
# we can convert to dict to see tree structure
s1, s2 = dt.DocParser.get_parsetrees(doc)
s1.asdict()

{'i': 2,
 'tok': 'paint',
 'tag': 'VB',
 'dep': 'ROOT',
 'info': {},
 'childs': [{'i': 0,
   'tok': 'James',
   'tag': 'NNP',
   'dep': 'nsubj',
   'info': {},
   'childs': [],
   'pos': 'PROPN',
   'ent': 'ORG'},
  {'i': 1,
   'tok': 'will',
   'tag': 'MD',
   'dep': 'aux',
   'info': {},
   'childs': [],
   'pos': 'AUX',
   'ent': ''},
  {'i': 3,
   'tok': 'the house',
   'tag': 'NN',
   'dep': 'dobj',
   'info': {},
   'childs': [],
   'pos': 'NOUN',
   'ent': ''},
  {'i': 4,
   'tok': 'for',
   'tag': 'IN',
   'dep': 'prep',
   'info': {},
   'childs': [{'i': 6,
     'tok': '20',
     'tag': 'CD',
     'dep': 'pobj',
     'info': {},
     'childs': [{'i': 5,
       'tok': '$',
       'tag': '$',
       'dep': 'nmod',
       'info': {},
       'childs': [],
       'pos': 'SYM',
       'ent': ''},
      {'i': 7,
       'tok': '(',
       'tag': '-LRB-',
       'dep': 'punct',
       'info': {},
       'childs': [],
       'pos': 'PUNCT',
       'ent': ''},
      {'i': 8,
       'tok'

### Working with ParseTree objects
ParseTree objects can be traversed either iteratively (as shown earlier) or recursively. The `.bubble_accum()` and `.bubble_reduce()` methods are convenient ways to apply recursive functions to parsetrees. The `.root` property gives access to the root node, which is useful for writing your own recursive functions over the tree.

In [31]:
# .bubble_accum() concatenates the lists returned for each node in the tree
def get_ents(pnode):
    if pnode.ent != '':
        return [pnode]
    else:
        return []
s1, s2 = dt.DocParser.get_parsetrees(doc)
s1.bubble_accum(get_ents)

[ParseNode(James), ParseNode(20), ParseNode(twenty dollars)]

In [34]:
# .bubble_reduce() will aggregate data as it performs a DFS
def f(pn,ct):
    return ct + 1
s1, s2 = dt.DocParser.get_parsetrees(doc)
s1.bubble_reduce(f, 0)

11
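
To make the two traversal styles concrete, the following sketch applies the same ideas to plain nested dictionaries shaped like the `.asdict()` output (with a `'childs'` list on every node). The names `make_node`, `accum_sketch`, and `reduce_sketch` are hypothetical, and the parent-before-children visiting order is an assumption; doctable's own methods may differ in detail.

```python
def make_node(i, tok, ent='', childs=None):
    """Build a minimal node dict with the same keys used by the tree printed above."""
    return {'i': i, 'tok': tok, 'ent': ent, 'childs': childs or []}

def accum_sketch(node, func):
    """Concatenate the lists returned by func for this node and all descendants."""
    results = list(func(node))
    for child in node['childs']:
        results.extend(accum_sketch(child, func))
    return results

def reduce_sketch(node, func, value):
    """Thread a running value through every node, depth first."""
    value = func(node, value)
    for child in node['childs']:
        value = reduce_sketch(child, func, value)
    return value

# A reduced version of the first sentence's tree.
tree = make_node(2, 'paint', childs=[
    make_node(0, 'James', ent='ORG'),
    make_node(1, 'will'),
    make_node(3, 'the house'),
    make_node(4, 'for', childs=[
        make_node(6, '20', ent='MONEY', childs=[make_node(5, '$')]),
    ]),
])

ents = accum_sketch(tree, lambda n: [n['tok']] if n['ent'] != '' else [])
count = reduce_sketch(tree, lambda n, ct: ct + 1, 0)
print(ents)   # ['James', '20']
print(count)  # 7
```

Accumulation suits cases where each node contributes zero or more items, while reduction suits a single running value such as a count or a maximum depth.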

In [35]:
# or write your own recursive functions using s1.root to 
#     access the root node of the tree
def printnodes(pnode):
    print(pnode.tok, pnode.dep, pnode.pos)
    for child in pnode:
        printnodes(child)
s1, s2 = dt.DocParser.get_parsetrees(doc)
printnodes(s1.root)

paint ROOT VERB
James nsubj PROPN
will aux AUX
the house dobj NOUN
for prep ADP
20 pobj NUM
$ nmod SYM
( punct PUNCT
twenty dollars appos NOUN
) punct PUNCT
. punct PUNCT
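
Note that a depth-first printout like this follows the tree structure rather than the original word order. If sentence order is needed from a nested dictionary, one option is to flatten the tree and sort on the `'i'` index. The sketch below assumes nodes shaped like the `.asdict()` output, and `flatten` is a hypothetical helper, not part of doctable.

```python
def flatten(node):
    """Return this node followed by all of its descendants, depth first."""
    nodes = [node]
    for child in node['childs']:
        nodes.extend(flatten(child))
    return nodes

# A small tree using the same 'i', 'tok', and 'childs' keys as the dict output.
tree = {'i': 2, 'tok': 'paint', 'childs': [
    {'i': 0, 'tok': 'James', 'childs': []},
    {'i': 1, 'tok': 'will', 'childs': []},
    {'i': 3, 'tok': 'the house', 'childs': []},
]}

dfs_order = [n['tok'] for n in flatten(tree)]
sentence_order = [n['tok'] for n in sorted(flatten(tree), key=lambda n: n['i'])]
print(dfs_order)       # ['paint', 'James', 'will', 'the house']
print(sentence_order)  # ['James', 'will', 'paint', 'the house']
```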
