# Parser Pipeline
These examples show how to use the `DocParser` class to build an integrated parsing pipeline. The central method is `.distribute_parse()`, which parallelizes parsing tasks across multiple CPU cores. The typical workflow is to define your own tokenization functions, often by simply overriding parameters of the relevant `DocParser` functions.
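
As a rough illustration of "overriding parameters", the sketch below binds keyword arguments on a tokenizer with `functools.partial`. The `tokenize` function and its `lowercase` and `keep_punct` parameters are hypothetical stand-ins written for this example; they are not part of doctable, and the real `DocParser` functions may take different arguments.

```python
import re
from functools import partial

def tokenize(text, lowercase=True, keep_punct=True):
    """Hypothetical stand-in tokenizer (NOT a doctable function)."""
    # Split into runs of word characters and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if not keep_punct:
        # Keep only tokens that start with a word character.
        tokens = [t for t in tokens if re.match(r"\w", t)]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens

# Bind the parameter once, then reuse the specialized function everywhere.
tokenize_no_punct = partial(tokenize, keep_punct=False)

docs = ['The hat is red.', 'But why is the hat blue?']
parsed = [tokenize_no_punct(d) for d in docs]
print(parsed)
```

Binding parameters up front gives a single-argument function, which is a convenient shape to hand to any routine that applies one function to every document.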

In [18]:
import sys
sys.path.append('..')
import doctable as dt
from spacy import displacy
import urllib.request

import spacy
nlp = spacy.load('en')
from spacy.matcher import Matcher

In [2]:
base = 'https://raw.githubusercontent.com/devincornell/intro_to_text_analysis/master/duke_workshop/nss/'
urls = (
    base+'trump_nss.txt',
    base+'obama_nss.txt',
)
texts = [urllib.request.urlopen(url).read().decode('utf-8') for url in urls]
print('trump:', texts[0][:80])
print('obama:', texts[1][:80])
texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.', 'But why is the hat blue?\n\nAre you colorblind?']

trump: An America that is safe, prosperous, and free at home is an America with the str
obama: Today, the United States is stronger and better positioned to seize the opportun


## Distributed Parsing
Use the `.distribute_parse()` method to process many documents in parallel. It accepts the same tokenization settings as the individual `DocParser` functions, as the examples here show.

By default, this method uses the built-in `DocParser` methods for tokenization and related steps. You can override any of them to customize the behavior.

In [7]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True)
print(parsed)

parsing 2 docs
processing chunks of size 1 with 2 processes.
returned 2 parsed docs or paragraphs
[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?']]
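
The log above reports "chunks of size 1 with 2 processes". One plausible way to arrive at such a chunk size is to divide the documents evenly among the worker processes. The sketch below shows that idea in plain Python; `make_chunks` is a hypothetical helper, and the ceiling-division rule is an assumption that happens to match the logged numbers, not a description of doctable's internals.

```python
import math

def make_chunks(items, n_procs):
    """Hypothetical helper: split items into at most n_procs contiguous chunks."""
    if not items:
        return []
    # Assumed rule: each process gets ceil(len(items) / n_procs) items.
    chunk_size = math.ceil(len(items) / n_procs)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

docs = ['doc one', 'doc two']
chunks_two = make_chunks(docs, 2)  # two chunks of size 1
chunks_one = make_chunks(docs, 1)  # one chunk of size 2
print(chunks_two)
print(chunks_one)
```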


In [8]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True, paragraph_sep='\n\n')
print(parsed)

parsing 2 docs
split into 4 paragraphs
processing chunks of size 1 with 4 processes.
returned 4 parsed docs or paragraphs
[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?']]]
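
With `paragraph_sep` set, the result is nested: one list per document, each holding one entry per paragraph. A minimal sketch of that split-then-regroup pattern follows. The helpers `split_paragraphs` and `regroup` are hypothetical names for this example, and a whitespace word count stands in for the spaCy parse; this is an assumption about the general approach, not doctable's actual code.

```python
def split_paragraphs(texts, paragraph_sep='\n\n'):
    """Flatten documents into paragraphs, remembering each paragraph's document."""
    flat, doc_ids = [], []
    for i, text in enumerate(texts):
        for par in text.split(paragraph_sep):
            flat.append(par)
            doc_ids.append(i)
    return flat, doc_ids

def regroup(results, doc_ids, n_docs):
    """Rebuild one list of per-paragraph results for each original document."""
    nested = [[] for _ in range(n_docs)]
    for res, i in zip(results, doc_ids):
        nested[i].append(res)
    return nested

texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.',
               'But why is the hat blue?\n\nAre you colorblind?']
flat, doc_ids = split_paragraphs(texts_small)
# Stand-in for parsing: count whitespace-separated words in each paragraph.
results = [len(par.split()) for par in flat]
nested = regroup(results, doc_ids, len(texts_small))
print(nested)
```

Flattening first lets every paragraph be processed as an independent unit of work, while the recorded document indices make it possible to restore the per-document grouping afterwards.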


In [14]:
# Provide a custom preprocess function.
def preprocess(text):
    return text.replace('And', 'wutwutwut')
parsed = dt.DocParser.distribute_parse(texts_small, nlp, preprocessfunc=preprocess)
print(parsed)

processing chunks of size 1 with 2 processes.
returned 2 parsed docs or paragraphs
[['the', 'hat', 'is', 'red', '.', 'wutwutwut', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?']]


In [16]:
# Now provide a fully custom parse function. It can return anything.
def parsefunc(doc):
    return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, n_cores=1)
print(parsed)

processing chunks of size 2 with 1 processes.
returned 2 parsed docs or paragraphs
[19, 12]


In [17]:
# Custom parse functions work with paragraphs too.
def parsefunc(doc):
    return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, paragraph_sep='\n\n')
print(parsed)

processing chunks of size 1 with 4 processes.
returned 4 parsed docs or paragraphs
[[10, 8], [7]]
