# Parser Pipeline
These examples show how to use the `DocParser` class to build a parallelized parsing pipeline. The central piece is the `.distribute_parse()` method, which spreads parsing tasks across multiple CPU processes. In a typical workflow you define your own preprocessing and tokenization functions, often by simply wrapping the `DocParser` functions of interest and overloading their parameters.

In [1]:
import sys
sys.path.append('..')
import doctable as dt
from spacy import displacy
import urllib.request

import spacy
nlp = spacy.load('en')
from spacy.matcher import Matcher

In [2]:
base = 'https://raw.githubusercontent.com/devincornell/intro_to_text_analysis/master/duke_workshop/nss/'
urls = (
    base+'trump_nss.txt',
    base+'obama_nss.txt',
)
texts = [urllib.request.urlopen(url).read().decode('utf-8') for url in urls]
print('trump:', texts[0][:80])
print('obama:', texts[1][:80])
texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.', 
               'But why is the hat blue?\n\nAre you colorblind? See the answer here: http://google.com']

trump: An America that is safe, prosperous, and free at home is an America with the str
obama: Today, the United States is stronger and better positioned to seize the opportun


## Distributed Parsing
Use the `.distribute_parse()` method to process many documents in parallel. The same tokenization settings used elsewhere can be passed through it, as the following cells show.

By default, this method uses the built-in DocParser methods for preprocessing and tokenization. These can be overloaded or replaced with your own functions when you need more specialized behavior.
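To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python. It is not doctable's implementation: `run_pipeline`, `process_one`, the regex-based `parse`, and the `map_func` parameter are all hypothetical names chosen for illustration, and no spaCy model is involved.

```python
import re
from functools import partial

def preprocess(text):
    # Stand-in preprocessing: collapse runs of whitespace into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

def parse(text):
    # Stand-in parser: lowercase word and punctuation tokens (no spaCy here).
    return re.findall(r"\w+|[^\w\s]", text.lower())

def process_one(text, preprocessfunc, parsefunc):
    # The unit of work applied to every document: preprocess, then parse.
    return parsefunc(preprocessfunc(text))

def run_pipeline(texts, preprocessfunc=preprocess, parsefunc=parse, map_func=map):
    # map_func is the only part that knows about parallelism. The built-in
    # map runs serially; a process pool's map could be swapped in instead.
    worker = partial(process_one, preprocessfunc=preprocessfunc, parsefunc=parsefunc)
    return list(map_func(worker, texts))

docs = ['The hat is red. And so are you.', 'But why is the hat blue?']
parsed = run_pipeline(docs)
print(parsed)
```

Because the worker is built from module-level functions with `functools.partial`, it should be picklable, so passing the `map` method of a `multiprocessing.Pool` as `map_func` is one plausible way to spread the work across processes.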

In [3]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True)
print(parsed)

parsing 2 docs
processing chunks of size 1 with 2 processes.
returned 2 parsed docs or paragraphs
[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [4]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True, paragraph_sep='\n\n')
print(parsed)

parsing 2 docs
split into 4 paragraphs
processing chunks of size 1 with 4 processes.
returned 4 parsed docs or paragraphs
[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?']]]
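The nested result above suggests that paragraphs are parsed as independent units and then grouped back under their source document. A rough sketch of that bookkeeping, under the assumption that it works by flattening and regrouping (the helper names `split_paragraphs` and `regroup` are hypothetical, and a whitespace split stands in for real parsing):

```python
def split_paragraphs(texts, paragraph_sep='\n\n'):
    # Flatten documents into (doc_index, paragraph) pairs so that each
    # paragraph can be handled as an independent unit of work.
    return [(i, par) for i, text in enumerate(texts)
            for par in text.split(paragraph_sep)]

def regroup(pairs, results, num_docs):
    # Rebuild one list per document, preserving paragraph order.
    grouped = [[] for _ in range(num_docs)]
    for (i, _), res in zip(pairs, results):
        grouped[i].append(res)
    return grouped

texts = ['First par.\n\nSecond par.', 'Only one here.', 'A\n\nB\n\nC']
pairs = split_paragraphs(texts)

# Stand-in for parsing: lowercase and split on whitespace.
results = [par.lower().split() for _, par in pairs]

nested = regroup(pairs, results, len(texts))
print(nested)
```

Keeping the document index next to each paragraph means the parsing step itself never needs to know which document a paragraph came from.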


In [5]:
# provide a custom preprocess function.
def preprocess(text):
    return text.replace('And','wutwutwut') 
parsed = dt.DocParser.distribute_parse(texts_small, nlp, preprocessfunc=preprocess)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'wutwutwut', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [6]:
# now provide a totally custom parser function. It can return anything.
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc)
print(parsed)

[19, 18]


In [7]:
# works with paragraphs too
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, paragraph_sep='\n\n')
print(parsed)

[[10, 8], [7]]


## Distribute Parse With DocParser Methods
For most applications, you can simply reuse the DocParser methods with different parameter values. Some examples follow. For full details, refer to the documentation of those methods.

In [8]:
# example overloading the remove_url parameter of the built-in DocParser.preprocess() method
def preprocess(text): return dt.DocParser.preprocess(text, remove_url=True)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, preprocessfunc=preprocess)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':']]
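A one-line wrapper like the one above only fixes a keyword argument, so `functools.partial` from the standard library is a possible alternative. The sketch below uses a hypothetical stand-in `preprocess` function rather than `DocParser.preprocess`, whose full signature is not shown here.

```python
import re
from functools import partial

def preprocess(text, remove_url=False):
    # Hypothetical stand-in for a configurable preprocessing function.
    if remove_url:
        text = re.sub(r'https?://\S+', '', text)
    return text.strip()

# Same effect as: def preprocess_no_url(text): return preprocess(text, remove_url=True)
preprocess_no_url = partial(preprocess, remove_url=True)

cleaned = preprocess_no_url('See the answer here: http://google.com')
print(cleaned)
```

A `partial` built from a module-level function can be pickled, which matters when the function has to be sent to worker processes.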


In the following example, we overload several of the DocParser methods used for parsing documents. This way we can still use DocParser features while building a customized pipeline.

In [11]:
def use_token_overload(tok):
    # decide to include token or not
    # this example excludes stopwords and punctuation
    return dt.DocParser.use_tok(tok, filter_stop=True, filter_punct=True)

def parse_token_overload(tok):
    # decide how to parse a spacy token
    # this example lemmatizes each word
    return dt.DocParser.parse_tok(tok, lemmatize=True)

def parsefunc(doc):
    # decide how to parse a whole document
    # this example overloads two of the supplied functions and also splits the output by sentence
    return dt.DocParser.tokenize_doc(doc, use_tok_func=use_token_overload, parse_tok_func=parse_token_overload, split_sents=True)

def preprocess(text):
    # decide how to preprocess the raw text
    # this example removes urls
    return dt.DocParser.preprocess(text, remove_url=True)

parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess)
print(parsed)

[[['hat', 'red'], [], ['say'], []], [['hat', 'blue'], [], ['colorblind'], ['answer']]]
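The filter-then-transform hook design used above can be illustrated without spaCy. In this sketch, `Tok` is a hypothetical stand-in for a spaCy token with hand-assigned flags, and `use_tok`, `parse_tok`, and `tokenize` are simplified imitations of the pattern, not the DocParser methods themselves.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: surface text, lemma, and two flags.
Tok = namedtuple('Tok', 'text lemma is_stop is_punct')

def use_tok(tok, filter_stop=False, filter_punct=False):
    # Return False for tokens that should be dropped.
    if filter_stop and tok.is_stop:
        return False
    if filter_punct and tok.is_punct:
        return False
    return True

def parse_tok(tok, lemmatize=False):
    # Convert a kept token to its output string.
    return (tok.lemma if lemmatize else tok.text).lower()

def tokenize(toks, use_tok_func=use_tok, parse_tok_func=parse_tok):
    # Filter first, then transform, so parse_tok_func only sees kept tokens.
    return [parse_tok_func(t) for t in toks if use_tok_func(t)]

# Flags and lemmas are assigned by hand for this example.
sent = [Tok('Whatever', 'whatever', True, False), Tok(',', ',', False, True),
        Tok('they', 'they', True, False), Tok('said', 'say', False, False),
        Tok('.', '.', False, True)]

default = tokenize(sent)
custom = tokenize(
    sent,
    use_tok_func=lambda t: use_tok(t, filter_stop=True, filter_punct=True),
    parse_tok_func=lambda t: parse_tok(t, lemmatize=True),
)
print(default)
print(custom)
```

Separating the "keep this token?" decision from the "how to render it" decision lets each be customized independently. The lambdas are fine for this serial sketch, but named module-level functions, as in the cell above, are the safer choice when the hooks must be sent to worker processes.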


## DocParser With Parsetrees
One implication of being able to overload the parser function is that we can also use the DocParser parsetree functionality. This builds directly on the other examples shown here and in the parsetree examples.

In [17]:
# in this example we create the most basic parsetree
def preproc(text): return dt.DocParser.preprocess(text, remove_url=True)
def parsetree(doc): return dt.DocParser.get_parsetrees(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsetree, preprocessfunc=preproc)
print(parsed)

[[ParseTree(5), ParseTree(6), ParseTree(5), ParseTree(3)], [ParseTree(8), ParseTree(1), ParseTree(3), ParseTree(5)]]
