# Distributed Parsing
These examples show how to use the DocParser class to build a parallelized parsing pipeline. Most of these features live in three methods:


* **`.distribute_parse()`**: A text-processing-specific method that distributes texts among processes, applies a supplied preprocessing function, runs spaCy's `nlp.pipe()`, and then converts each result to a non-spaCy object using a supplied parsing function. If the `dt_inst` keyword argument is provided, your parsing function can also insert its result into the DocTable directly from the worker process.

* **`.distribute_process()`**: A more general method that distributes processing of any list of elements and, if `dt_inst` is supplied, can insert results into the DocTable. This is ideal when you do not want to hold all texts in memory at once (as `.distribute_parse()` requires): you can instead supply a list of filenames that are read, preprocessed, postprocessed, and inserted into the database entirely within the distributed processes.

* **`.distribute_chunks()`**: A function that creates processes and allows you to provide a function which operates on a chunk of the provided elements.
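
To make the stages listed for `.distribute_parse()` concrete, here is a minimal standard-library sketch of the same shape: preprocess, run an NLP step, then parse to a plain Python object. Everything in it is an assumption for illustration: `fake_nlp` is a regex stand-in for a spaCy pipeline, `process_one` is a hypothetical name, and `ThreadPool` is used only so the snippet runs anywhere (it shares its API with `multiprocessing.Pool`). It is not how DocParser is implemented.

```python
import re
from multiprocessing.pool import ThreadPool

def preprocess(text):
    # hypothetical preprocessing step: remove URLs, trim whitespace
    return re.sub(r"https?://\S+", "", text).strip()

def fake_nlp(text):
    # stand-in for a spaCy pipeline: lowercase word and punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text.lower())

def parse(tokens):
    # hypothetical parse step: keep only alphanumeric tokens
    return [t for t in tokens if t.isalnum()]

def process_one(text):
    # the three stages applied to a single text
    return parse(fake_nlp(preprocess(text)))

texts = ["The hat is red.", "See the answer here: http://google.com"]

# distribute the texts among workers; results come back in input order
with ThreadPool(2) as pool:
    parsed = pool.map(process_one, texts)

print(parsed)
# [['the', 'hat', 'is', 'red'], ['see', 'the', 'answer', 'here']]
```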

In [1]:
from spacy import displacy
import spacy
nlp = spacy.load('en')
from spacy.matcher import Matcher
from pprint import pprint
import sys
sys.path.append('..')
import doctable as dt

In [2]:
texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.', 
               'But why is the hat blue?\n\nAre you colorblind? See the answer here: http://google.com']

## `.distribute_parse()`

In [3]:
# this is the straightforward mode where it tokenizes each doc in parallel
# using default document preprocessing and parsing
parsed = dt.DocParser.distribute_parse(texts_small, nlp)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [4]:
# use paragraph_sep to maintain paragraph information
parsed = dt.DocParser.distribute_parse(texts_small, nlp, paragraph_sep='\n\n')
print(parsed)

[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?'], ['are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]]
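
The effect of `paragraph_sep` on the shape of the result can be sketched without spaCy. The snippet below assumes a simple regex tokenizer as a stand-in and a hypothetical `parse_text` helper; it only illustrates why one extra level of nesting appears when a separator is given.

```python
import re

def tokenize(text):
    # stand-in tokenizer (not spaCy): lowercase words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def parse_text(text, paragraph_sep=None):
    # without a separator: one flat token list per text
    if paragraph_sep is None:
        return tokenize(text)
    # with a separator: one token list per paragraph
    return [tokenize(par) for par in text.split(paragraph_sep)]

text = "The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed."
flat = parse_text(text)
nested = parse_text(text, paragraph_sep="\n\n")

print(len(flat), [len(par) for par in nested])
# 18 [10, 8]
```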


In [5]:
# this example customizes every stage of the process
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def use_token_overload(tok): return dt.DocParser.use_tok(tok, filter_stop=False, filter_punct=False)
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.tokenize_doc(doc, use_tok_func=use_token_overload, parse_tok_func=parse_token_overload, split_sents=True)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess)
pprint(parsed)

[[['the', 'hat', 'is', 'red', '.'],
  ['and', 'so', 'are', 'you', '.'],
  ['whatever', ',', 'they', 'said', '.'],
  ['whatever', 'indeed', '.']],
 [['but', 'why', 'is', 'the', 'hat', 'blue', '?'],
  ['are'],
  ['you', 'colorblind', '?'],
  ['see', 'the', 'answer', 'here', ':']]]


### Parse and insert into database
The most useful part of these methods is that the parsing function can insert its results into a DocTable directly, rather than returning them.

In [6]:
def parse_insert(doc, db):
    toks = dt.DocParser.tokenize_doc(doc)
    db.insert({'tokens':toks})
db = dt.DocTable(schema=[('pickle','tokens')], fname='t12.db')
db.delete() # empty if it had some rows
dt.DocParser.distribute_parse(texts_small, nlp, dt_inst=db, parsefunc=parse_insert, n_cores=1)
print(db.select_df())

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "../doctable/docparser.py", line 416, in distribute_chunks
    parsed = [el for parsed_chunk in p.starmap(chunk_thread_func, chunks)
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 638, in get
    self.wait(timeout)
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/multiprocessing/pool.py", line 635, in wait
    self._event.wait(timeout)
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/utopia3/dc326/local/anaconda3/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/utopia3/dc3

TypeError: must be str, not list

### With parsetrees or arbitrary objects
The fact that you can pass custom parsers means you can also use parsetrees or any other custom document representation.

In [None]:
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.get_parsetrees(doc, parse_tok_func=parse_token_overload)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess)
pprint(parsed)

In [None]:
def pf(doc):
    return len(doc)
dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=pf)

In [None]:
def pf(doc):
    return len(doc)
dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=pf, paragraph_sep='\n\n')

## `.distribute_process()`
This method has fewer features than `.distribute_parse()` because you cannot control the outer `nlp.pipe()` call that wraps chunk processing. In exchange, it lets you concurrently process lists of any elements and, similarly, insert result objects into a DocTable directly. This is a good option when you are training large models or would like more control over your text processing.

In [None]:
# this shows the general case where you can parse any element
def multiply(num):
    return num*2
elements = list(range(10))
dt.DocParser.distribute_process(multiply, elements)

In [None]:
# you can also store into a db
def multiply_sp(num, db):
    db.insert({'num':num*2})
elements = list(range(10))
db = dt.DocTable(schema=[('integer','num')], fname='t1.db')
db.delete()
dt.DocParser.distribute_process(multiply_sp, elements, dt_inst=db)
db.select_df()

In [None]:
# you can also add additional ordered args to pass to the function
def multiply_sp(num, mult):
    return num*mult
elements = list(range(10))
dt.DocParser.distribute_process(multiply_sp, elements, 4)
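
One way to picture how additional ordered arguments could reach the worker function is to append them to every element before mapping. The `distribute` helper below is hypothetical and uses `ThreadPool` (same API as `multiprocessing.Pool`) purely so the sketch is self-contained; DocParser's own forwarding may differ.

```python
from multiprocessing.pool import ThreadPool

def distribute(func, elements, *args, n_workers=2):
    # hypothetical helper: call func(element, *args) for every element,
    # returning results in the same order as the input
    with ThreadPool(n_workers) as pool:
        return pool.starmap(func, [(el, *args) for el in elements])

def multiply(num, mult):
    return num * mult

results = distribute(multiply, list(range(10)), 4)
print(results)
# [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
```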

In [None]:
# and now additional args plus the database
def cust_tokenize(text, db, nlp):
    doc = nlp(text)
    db.insert({'tokens':[t.text for t in doc]})

db = dt.DocTable(schema=[('pickle','tokens')], fname='tcustok.db')
db.delete() # just in case had prev elements
dt.DocParser.distribute_process(cust_tokenize, texts_small, nlp, dt_inst=db)
db.select_df()

## `.distribute_chunks()`
This method provides the least functionality and the most flexibility; both of the other methods are built on it. The function you pass operates on a chunk of elements rather than on a single element. If you want to insert data into a DocTable, you must handle that manually. The function must return a list with one entry per element of the chunk it received.
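
The chunk contract described here (split the elements, run the function once per chunk, flatten the per-chunk lists) can be sketched as follows. All names are hypothetical, the chunk sizing rule is an assumption, and `ThreadPool` stands in for a process pool so the snippet runs as-is.

```python
import math
from multiprocessing.pool import ThreadPool

def make_chunks(elements, n_chunks):
    # split into at most n_chunks contiguous pieces of near-equal size
    size = max(1, math.ceil(len(elements) / n_chunks))
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def run_chunk(chunk_func, chunk):
    out = chunk_func(chunk)
    # enforce the contract: one result per element of the chunk
    if len(out) != len(chunk):
        raise ValueError("chunk function must return one result per element")
    return out

def distribute_chunks_sketch(chunk_func, elements, n_workers=4):
    chunks = make_chunks(elements, n_workers)
    with ThreadPool(n_workers) as pool:
        per_chunk = pool.starmap(run_chunk, [(chunk_func, c) for c in chunks])
    # flatten the per-chunk lists back into a single list, preserving order
    return [el for chunk_out in per_chunk for el in chunk_out]

def double_all(nums):
    return [n * 2 for n in nums]

doubled = distribute_chunks_sketch(double_all, list(range(100)))
print(doubled[:5], len(doubled))
# [0, 2, 4, 6, 8] 100
```

Checking the length inside `run_chunk` surfaces a contract violation in the worker that produced it, rather than as a silently misaligned result list.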

In [61]:
# the supplied function receives a whole chunk (a list) of elements;
# no extra positional args are passed because muli_multi accepts only the chunk
def muli_multi(nums):
    return [num*2 for num in nums]

nums = list(range(100))
dt.DocParser.distribute_chunks(muli_multi, nums)

In [57]:
# instead of passing dt_inst, each worker opens its own DocTable connection
# using the class, schema, and filename passed as additional args
def cust_tokenize(text, DB, schema, fname):
    db = DB(schema=schema, fname=fname, persistent_conn=False)
    print('inserting')
    db.insert({'tokens':text.split()})
    print('inserted')

schema=[('pickle','tokens')]
fname='tcustok.db'
dt.DocParser.distribute_process(cust_tokenize, texts_small, dt.DocTable, schema, fname)
db = dt.DocTable(schema=schema,fname=fname)
print(db)

inserting
inserting
inserted
inserted
<DocTable2::_documents_ ct: 36>


## Distributed Parsing
Use the `.distribute_parse()` method to process many documents in parallel. All of the same tokenization function settings can be passed, as we will show.

By default, this method will use the built-in DocParser methods for tokenization, etc. But these can be overloaded and overridden to provide more robust functionality.

In [None]:
parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True)
print(parsed)

parsed = dt.DocParser.distribute_parse(texts_small, nlp, verbose=True, paragraph_sep='\n\n')
print(parsed)

In [None]:
# provide a custom preprocess function.
def preprocess(text):
    return text.replace('And','wutwutwut') 
parsed = dt.DocParser.distribute_parse(texts_small, nlp, preprocessfunc=preprocess)
print(parsed)

In [None]:
# now provide a totally custom parser function. Can return literally anything.
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc)
print(parsed)

In [None]:
# with paragraph_sep, texts are split into paragraphs before parsing
def parsefunc(doc): return len(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, paragraph_sep='\n\n')
print(parsed)

## Distribute Parse With DocParser Methods
For most applications, you can simply re-use many of the DocParser methods. We will show some examples. To see their full documentation refer to the docs for those functions.

In the following example, we overload many parts of the DocParser methods for parsing documents. In this way, we can still use DocParser features but by customizing a pipeline.
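
The overloading pattern itself, where a tokenizing function accepts a token filter and a token converter as arguments, can be illustrated with plain Python. The functions below are stand-ins written for this sketch; they only mirror the `use_tok_func` and `parse_tok_func` keyword names used earlier and are not DocParser's implementations.

```python
import re

def default_use_tok(tok):
    # default filter in this sketch: drop punctuation-only tokens
    return tok.isalnum()

def default_parse_tok(tok):
    # default conversion in this sketch: lowercase the token text
    return tok.lower()

def tokenize(text, use_tok_func=default_use_tok, parse_tok_func=default_parse_tok):
    # split into words and punctuation, then filter and convert each token
    raw = re.findall(r"\w+|[^\w\s]", text)
    return [parse_tok_func(t) for t in raw if use_tok_func(t)]

def keep_everything(tok):
    # overload: keep punctuation as well
    return True

def keep_case(tok):
    # overload: leave the token text unchanged
    return tok

text = "Are you colorblind?"
default_toks = tokenize(text)
custom_toks = tokenize(text, use_tok_func=keep_everything, parse_tok_func=keep_case)

print(default_toks)  # ['are', 'you', 'colorblind']
print(custom_toks)   # ['Are', 'you', 'colorblind', '?']
```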

## DocParser With Parsetrees
One obvious implication of being able to overload the parser function is that we can also use the DocParser parsetree functionality. This is pretty straightforward and builds on the other examples used here and in the parsetree examples.

In [None]:
# in this example we create the most basic parsetree
def preproc(text): return dt.DocParser.preprocess(text, replace_url='')
def parsetree(doc): return dt.DocParser.get_parsetrees(doc)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsetree, preprocessfunc=preproc)
print(parsed)