# Distributed Parsing
These examples show how to use the `DocParser` class to build a parallelized parsing pipeline. The central piece is the `.distribute_parse()` method, which parallelizes parsing tasks across multiple CPUs. The typical workflow is to define your own tokenization functions, often simply by wrapping the relevant `DocParser` methods with different parameter values.

In [1]:
from spacy import displacy
import spacy
nlp = spacy.load('en')
from spacy.matcher import Matcher
from pprint import pprint
import sys
sys.path.append('..')
import doctable as dt

In [12]:
texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.', 
               'But why is the hat blue?\n\nAre you colorblind? See the answer here: http://google.com']

## `.distribute_parse()`
This method distributes parsing jobs across multiple CPUs. It can also split documents into paragraphs so that more of the structure of the original texts is preserved, and it accepts custom preprocessing and parsing functions that are passed to each process.

In [7]:
# this is the straightforward mode where it tokenizes each doc in parallel
# using default document preprocessing and parsing
parsed = dt.DocParser.distribute_parse(texts_small, nlp)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [9]:
# use paragraph_sep to maintain paragraph information
parsed = dt.DocParser.distribute_parse(texts_small, nlp, paragraph_sep='\n\n')
print(parsed)

[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?'], ['are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]]


In [10]:
# this shows an example that customizes every element of the process
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def use_token_overload(tok): return dt.DocParser.use_tok(tok, filter_stop=False, filter_punct=False)
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.tokenize_doc(doc, use_tok_func=use_token_overload, parse_tok_func=parse_token_overload, split_sents=True)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess, verbose=False)
pprint(parsed)

[[['the', 'hat', 'is', 'red', '.'],
  ['and', 'so', 'are', 'you', '.'],
  ['whatever', ',', 'they', 'said', '.'],
  ['whatever', 'indeed', '.']],
 [['but', 'why', 'is', 'the', 'hat', 'blue', '?'],
  ['are'],
  ['you', 'colorblind', '?'],
  ['see', 'the', 'answer', 'here', ':']]]


### With parsetrees
Because you can pass a custom parser function, you can also produce parsetrees instead of token lists.

In [16]:
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.get_parsetrees(doc, parse_tok_func=parse_token_overload)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess, verbose=False)
pprint(parsed)

[[ParseTree(['the', 'hat', 'is', 'red', '.']),
  ParseTree(['and', 'so', 'are', 'you', '.', '']),
  ParseTree(['whatever', ',', 'they', 'said', '.']),
  ParseTree(['whatever', 'indeed', '.'])],
 [ParseTree(['but', 'why', 'is', 'the', 'hat', 'blue', '?', '']),
  ParseTree(['are']),
  ParseTree(['you', 'colorblind', '?']),
  ParseTree(['see', 'the', 'answer', 'here', ':'])]]


## Distributed Parsing
Use the `.distribute_parse()` method to process many documents in parallel. The same tokenization settings shown earlier can be passed here as well.

By default, this method uses the built-in `DocParser` methods for preprocessing and tokenization. Any of these can be replaced with your own functions when you need different behavior.

In [5]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True)
print(parsed)

parsing 2 docs
processing chunks of size 1 with 2 processes.
returned 2 parsed docs or paragraphs
[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [6]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True, paragraph_sep='\n\n')
print(parsed)

parsing 2 docs
processing chunks of size 1 with 4 processes.
returned 4 parsed docs or paragraphs
[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?'], ['are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]]
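
The verbose output above reports 2 documents going in and 4 parsed paragraphs coming back, which is consistent with flattening the documents into paragraphs, parsing those, and regrouping the results per document. The sketch below illustrates that idea under this assumption; it is not taken from doctable, and `split_paragraphs` and `regroup` are hypothetical helper names.

```python
def split_paragraphs(texts, paragraph_sep="\n\n"):
    # Flatten documents into paragraphs, remembering which document
    # each paragraph came from.
    flat, owners = [], []
    for doc_index, text in enumerate(texts):
        for paragraph in text.split(paragraph_sep):
            flat.append(paragraph)
            owners.append(doc_index)
    return flat, owners


def regroup(parsed_flat, owners, n_docs):
    # Rebuild one list of parsed paragraphs per original document.
    grouped = [[] for _ in range(n_docs)]
    for doc_index, parsed in zip(owners, parsed_flat):
        grouped[doc_index].append(parsed)
    return grouped


texts = ["The hat is red.\n\nWhatever indeed.", "Why blue?\n\nSee here."]
flat, owners = split_paragraphs(texts)

# Stand-in for the parallel step: any order-preserving map over `flat`
# (for example multiprocessing.Pool.map) could be used here instead.
parsed_flat = [paragraph.lower().split() for paragraph in flat]

nested = regroup(parsed_flat, owners, len(texts))
print(nested)
```

Flattening first lets the work be balanced at the paragraph level, while the `owners` list keeps enough information to restore the nested document structure afterwards.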


In [7]:
# provide a custom preprocess function.
def preprocess(text):
    return text.replace('And','wutwutwut') 
parsed = dt.DocParser.distribute_parse(texts_small, nlp, preprocessfunc=preprocess)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'wutwutwut', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [8]:
# now provide a fully custom parser function; it can return anything
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc)
print(parsed)

[19, 18]


In [9]:
# same custom parser, with documents broken into paragraphs first
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, paragraph_sep='\n\n')
print(parsed)

Process ForkPoolWorker-18:
Process ForkPoolWorker-19:
Process ForkPoolWorker-20:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/utopia3/dc326/loca

## Distribute Parse With DocParser Methods
For most applications, you can simply reuse the `DocParser` methods, as the examples here show. For full details, refer to the documentation of those functions.

By overloading individual `DocParser` methods, as in the fully customized example in the `.distribute_parse()` section, you keep the built-in `DocParser` features while tailoring the pipeline to your needs.
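
The one-line wrapper functions used for overloading can also be written with `functools.partial`, which binds keyword arguments to an existing function. The block below is a self-contained sketch of that pattern; `use_tok`, `parse_tok`, and `tokenize` are simplified, hypothetical stand-ins that operate on plain strings, not the `DocParser` methods themselves, and the `lowercase` parameter and `STOPWORDS` list are invented for illustration.

```python
import string
from functools import partial

STOPWORDS = {"the", "is", "are", "and"}  # tiny illustrative list


def use_tok(tok, filter_stop=True, filter_punct=True):
    # Hypothetical stand-in: decide whether a token should be kept.
    if filter_stop and tok.lower() in STOPWORDS:
        return False
    if filter_punct and all(ch in string.punctuation for ch in tok):
        return False
    return True


def parse_tok(tok, lowercase=True):
    # Hypothetical stand-in: convert a kept token to its output form.
    return tok.lower() if lowercase else tok


def tokenize(tokens, use_tok_func=use_tok, parse_tok_func=parse_tok):
    # Apply the filter and the converter to every token.
    return [parse_tok_func(t) for t in tokens if use_tok_func(t)]


# Overload the defaults by binding different keyword arguments.
keep_everything = partial(use_tok, filter_stop=False, filter_punct=False)
keep_case = partial(parse_tok, lowercase=False)
custom_tokenize = partial(tokenize, use_tok_func=keep_everything,
                          parse_tok_func=keep_case)

tokens = ["The", "hat", "is", "red", "."]
default_result = tokenize(tokens)
custom_result = custom_tokenize(tokens)
print(default_result, custom_result)
```

A `partial` built from module-level functions can be pickled, so it can be handed to worker processes in the same way as a function defined with `def`.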

## DocParser With Parsetrees
Because the parser function can be overloaded, the `DocParser` parsetree functionality can be used here as well. This builds directly on the other examples here and in the parsetree examples.

In [None]:
# in this example we create the most basic parsetree
def preproc(text): return dt.DocParser.preprocess(text, replace_url='')
def parsetree(doc): return dt.DocParser.get_parsetrees(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsetree, preprocessfunc=preproc)
print(parsed)