# Distributed Parsing
These examples show how to use the `DocParser` class to build a parallelized parsing pipeline. Most of the functionality lives in three methods:


* **`.distribute_parse()`**: A text-specific method that distributes texts among processes, applies a supplied preprocessing function, runs spaCy's `nlp.pipe()`, and then converts each result to a non-spaCy object using a supplied parsing function. If the `dt_inst` keyword argument is given, your parsing function can additionally insert the parsed result into the DocTable directly from the worker process.

* **`.distribute_process()`**: A more general method that distributes processing of any list of elements and, if `dt_inst` is supplied, can insert results into the DocTable. This is ideal when you do not want to hold all texts in memory at once (as `.distribute_parse()` requires): you can instead supply a list of filenames that are read, preprocessed, postprocessed, and inserted into the database entirely within the worker processes.

* **`.distribute_chunks()`**: A lower-level method that creates the worker processes and applies a function you provide to each chunk of the supplied elements.
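
The three stages that `.distribute_parse()` is described as applying to each text (preprocess, run through the spaCy pipeline, convert to a non-spaCy object) can be pictured with the single-process sketch below. It is not doctable's implementation: `fake_pipe` is a hypothetical stand-in for `nlp.pipe()` that only splits on whitespace so the example runs without spaCy, and `parse_chunk` is an illustrative name.

```python
from typing import Callable, Iterable, Iterator, List


def fake_pipe(texts: Iterable[str]) -> Iterator[List[str]]:
    """Hypothetical stand-in for nlp.pipe(): yields one 'doc' per text."""
    for text in texts:
        yield text.split()


def parse_chunk(
    texts: List[str],
    preprocessfunc: Callable[[str], str],
    parsefunc: Callable[[List[str]], object],
    pipe: Callable[[Iterable[str]], Iterator[List[str]]] = fake_pipe,
) -> List[object]:
    """Apply the three stages, in order, to one chunk of texts."""
    cleaned = (preprocessfunc(t) for t in texts)  # 1. preprocess raw text
    docs = pipe(cleaned)                          # 2. batch through the pipeline
    return [parsefunc(doc) for doc in docs]       # 3. convert to plain objects


chunk = ["The hat is RED", "But why is the hat blue"]
result = parse_chunk(chunk, str.lower, lambda doc: len(doc))
print(result)  # [4, 6]
```

In a distributed setting, each worker process would run something like `parse_chunk` on its own share of the texts.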

In [1]:
from spacy import displacy
import spacy
nlp = spacy.load('en')
from spacy.matcher import Matcher
from pprint import pprint
import sys
sys.path.append('..')
import doctable as dt

In [2]:
texts_small = ['The hat is red. And so are you.\n\nWhatever, they said. Whatever indeed.', 
               'But why is the hat blue?\n\nAre you colorblind? See the answer here: http://google.com']

## `.distribute_parse()`

In [3]:
# this is the straightforward mode where it tokenizes each doc in parallel
# using default document preprocessing and parsing
parsed = dt.DocParser.distribute_parse(texts_small, nlp)
print(parsed)

[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.', 'whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.'], ['but', 'why', 'is', 'the', 'hat', 'blue', '?', 'are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]


In [4]:
# use paragraph_sep to maintain paragraph information
parsed = dt.DocParser.distribute_parse(texts_small, nlp, paragraph_sep='\n\n')
print(parsed)

[[['the', 'hat', 'is', 'red', '.', 'and', 'so', 'are', 'you', '.'], ['whatever', ',', 'they', 'said', '.', 'whatever', 'indeed', '.']], [['but', 'why', 'is', 'the', 'hat', 'blue', '?'], ['are', 'you', 'colorblind', '?', 'see', 'the', 'answer', 'here', ':', 'http://google.com']]]
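
One way paragraph structure could be preserved while still processing a flat list is to split each text on the separator, remember how many paragraphs each text produced, and regroup the results afterwards. The sketch below illustrates that idea only; it is an assumption about the approach, not doctable's code, and `split_paragraphs` and `regroup` are hypothetical helpers.

```python
from typing import List, Tuple


def split_paragraphs(texts: List[str], paragraph_sep: str) -> Tuple[List[str], List[int]]:
    """Flatten texts into paragraphs and remember how many each text had."""
    flat, counts = [], []
    for text in texts:
        pars = text.split(paragraph_sep)
        flat.extend(pars)
        counts.append(len(pars))
    return flat, counts


def regroup(results: List[object], counts: List[int]) -> List[List[object]]:
    """Rebuild the per-document nesting from a flat list of results."""
    nested, start = [], 0
    for n in counts:
        nested.append(results[start:start + n])
        start += n
    return nested


texts = ["one two.\n\nthree.", "four five six.\n\nseven eight."]
flat, counts = split_paragraphs(texts, "\n\n")
# Any per-paragraph function could run here; word counts keep the sketch small.
word_counts = [len(p.split()) for p in flat]
print(regroup(word_counts, counts))  # [[2, 1], [3, 2]]
```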


In [5]:
# this example customizes every step of the process
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def use_token_overload(tok): return dt.DocParser.use_tok(tok, filter_stop=False, filter_punct=False)
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.tokenize_doc(doc, use_tok_func=use_token_overload, parse_tok_func=parse_token_overload, split_sents=True)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess)
pprint(parsed)

[[['the', 'hat', 'is', 'red', '.'],
  ['and', 'so', 'are', 'you', '.'],
  ['whatever', ',', 'they', 'said', '.'],
  ['whatever', 'indeed', '.']],
 [['but', 'why', 'is', 'the', 'hat', 'blue', '?'],
  ['are'],
  ['you', 'colorblind', '?'],
  ['see', 'the', 'answer', 'here', ':']]]


### Parse and insert into database
The best part about all of these methods is that you can insert parsed results into a DocTable directly from the worker processes, rather than returning them.

In [6]:
def parse_insert(doc, db):
    toks = dt.DocParser.tokenize_doc(doc)
    db.insert({'tokens':toks})
db = dt.DocTable(schema=[('pickle','tokens')], fname='t12.db')
db.delete() # empty if it had some rows
dt.DocParser.distribute_parse(texts_small, nlp, dt_inst=db, parsefunc=parse_insert)
print(db.select_df())

                                              tokens
0  [the, hat, is, red, ., and, so, are, you, ., w...
1  [but, why, is, the, hat, blue, ?, are, you, co...
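
For intuition about what a `('pickle','tokens')` column could hold, the sketch below stores pickled token lists as BLOBs using only the standard-library `sqlite3` and `pickle` modules. How doctable actually serializes this column is not shown in this document, so treat the storage format, the table name `docs`, and the helper names as assumptions.

```python
import pickle
import sqlite3


def insert_tokens(conn: sqlite3.Connection, tokens: list) -> None:
    """Serialize a token list and store it as a BLOB in one row."""
    conn.execute("INSERT INTO docs (tokens) VALUES (?)", (pickle.dumps(tokens),))
    conn.commit()


def select_tokens(conn: sqlite3.Connection) -> list:
    """Read every row back in insertion order and deserialize it."""
    rows = conn.execute("SELECT tokens FROM docs ORDER BY id").fetchall()
    return [pickle.loads(blob) for (blob,) in rows]


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, tokens BLOB)")

for text in ["the hat is red .", "but why is the hat blue ?"]:
    insert_tokens(conn, text.split())

stored = select_tokens(conn)
print(stored[0])  # ['the', 'hat', 'is', 'red', '.']
```

With a file-backed database and several worker processes, each worker would typically open its own connection rather than share one across processes.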


### With parsetrees or arbitrary objects
The fact that you can pass custom parsers means you can also use parsetrees or any other custom document representation.

In [7]:
def preprocess(text): return dt.DocParser.preprocess(text, replace_url='')
def parse_token_overload(tok): return dt.DocParser.parse_tok(tok, lemmatize=False)
def parsefunc(doc): return dt.DocParser.get_parsetrees(doc, parse_tok_func=parse_token_overload)
parsed = dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=parsefunc, preprocessfunc=preprocess)
pprint(parsed)

[[ParseTree(['the', 'hat', 'is', 'red', '.']),
  ParseTree(['and', 'so', 'are', 'you', '.', '']),
  ParseTree(['whatever', ',', 'they', 'said', '.']),
  ParseTree(['whatever', 'indeed', '.'])],
 [ParseTree(['but', 'why', 'is', 'the', 'hat', 'blue', '?', '']),
  ParseTree(['are']),
  ParseTree(['you', 'colorblind', '?']),
  ParseTree(['see', 'the', 'answer', 'here', ':'])]]


In [8]:
def pf(doc):
    return len(doc)
dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=pf)

[19, 18]

In [9]:
def pf(doc):
    return len(doc)
dt.DocParser.distribute_parse(texts_small, nlp, parsefunc=pf, paragraph_sep='\n\n')

[[10, 8], [7, 10]]

## `.distribute_process()`
This method has fewer features than `.distribute_parse()` because you cannot control the outer `nlp.pipe()` call that wraps chunk processing. In exchange, it lets you concurrently process lists of arbitrary elements and, similarly, insert result objects into a DocTable directly. It is a good option when you are training large models or want more control over your text processing.

In [10]:
# this shows the general case where you can parse any element
def multiply(num):
    return num*2
elements = list(range(10))
dt.DocParser.distribute_process(multiply, elements)

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [11]:
# you can also store into a db
def multiply_sp(num, db):
    db.insert({'num':num*2})
elements = list(range(10))
db = dt.DocTable(schema=[('integer','num')], fname='t1.db')
db.delete()
dt.DocParser.distribute_process(multiply_sp, elements, dt_inst=db)
db.select_df()

   num
0    0
1    2
2    4
3    6
4    8
5   10
6   12
7   14
8   16
9   18


In [12]:
# you can also add additional ordered args to pass to the function
def multiply_sp(num, mult):
    return num*mult
elements = list(range(10))
dt.DocParser.distribute_process(multiply_sp, elements, 4)

[0, 4, 8, 12, 16, 20, 24, 28, 32, 36]
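
The cell above passes an extra positional argument that reaches the function after each element. A per-element function with extra arguments could be adapted to work on whole chunks along the lines of the sketch below; `process_chunk` is a hypothetical helper used for illustration, not part of doctable.

```python
from typing import Any, Callable, List


def process_chunk(chunk: List[Any], func: Callable[..., Any], *args: Any) -> List[Any]:
    """Call func(element, *args) for every element in the chunk."""
    return [func(el, *args) for el in chunk]


def multiply(num: int, mult: int) -> int:
    """Per-element function that takes one extra positional argument."""
    return num * mult


# The extra argument (4) is forwarded unchanged to every call of multiply.
print(process_chunk(list(range(5)), multiply, 4))  # [0, 4, 8, 12, 16]
```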

In [13]:
# and now additional args plus the database
def cust_tokenize(text, db, nlp):
    doc = nlp(text)
    db.insert({'tokens':[t.text for t in doc]})

db = dt.DocTable(schema=[('pickle','tokens')], fname='tcustok.db')
db.delete() # just in case had prev elements
dt.DocParser.distribute_process(cust_tokenize, texts_small, nlp, dt_inst=db)
db.select_df()

                                              tokens
0  [The, hat, is, red, ., And, so, are, you, ., \...
1  [But, why, is, the, hat, blue, ?, \n\n, Are, y...
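
The filename-based workflow mentioned for `.distribute_process()` (reading each text inside the worker so that not all texts sit in memory at once) could use a per-element function like the one sketched here. `tokenize_file` is a hypothetical name, and whitespace splitting stands in for a real spaCy pipeline so the example runs on the standard library alone.

```python
import os
import tempfile
from typing import List


def tokenize_file(fname: str) -> List[str]:
    """Read one file and return lowercase whitespace tokens.

    Whitespace splitting is a stand-in for a real spaCy pipeline.
    """
    with open(fname, encoding="utf-8") as f:
        text = f.read()
    return text.lower().split()


# Write two small files so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
filenames = []
for i, text in enumerate(["The hat is red.", "But why is the hat blue?"]):
    path = os.path.join(tmpdir, f"doc{i}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    filenames.append(path)

# Only the filenames are held up front; each text is loaded on demand.
parsed = [tokenize_file(fn) for fn in filenames]
print(parsed[0])  # ['the', 'hat', 'is', 'red.']
```

Following the call pattern of the cells above, such a function would presumably be passed as `dt.DocParser.distribute_process(tokenize_file, filenames)`; that call is not executed here.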


## `.distribute_chunks()`
This method provides the least built-in functionality and the most flexibility; both of the other methods are built on it. The function you pass operates on a chunk of elements rather than a single element. If you want to insert data into a DocTable, you must handle that manually. The provided function must return a list with the same length as the chunk it was given.
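
The chunk contract described above (a list in, a list of the same length out) can be illustrated with a minimal sketch built on the standard-library `multiprocessing.Pool`. This is an assumption about the general technique, not doctable's implementation; `make_chunks`, `scale_chunk`, and `run_chunks` are hypothetical names.

```python
from multiprocessing import Pool
from typing import Any, Callable, List


def make_chunks(elements: List[Any], n_chunks: int) -> List[List[Any]]:
    """Split elements into at most n_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(elements), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few elements
            chunks.append(elements[start:end])
        start = end
    return chunks


def scale_chunk(nums: List[float]) -> List[float]:
    """Chunk function: returns exactly one result per input element."""
    return [num * 2 for num in nums]


def run_chunks(func: Callable[[List[Any]], List[Any]],
               elements: List[Any], workers: int) -> List[Any]:
    """Map func over chunks in worker processes and flatten in order."""
    chunks = make_chunks(elements, workers)
    with Pool(workers) as pool:
        chunk_results = pool.map(func, chunks)  # preserves chunk order
    return [res for chunk_res in chunk_results for res in chunk_res]


if __name__ == "__main__":
    print(run_chunks(scale_chunk, list(range(10)), workers=3))
    # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Because every chunk result has the same length as its chunk, flattening the chunk results in order lines each output up with its input element.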

In [14]:
# the supplied function receives a whole chunk (list) of elements at once
def muli_multi(nums):
    return [num*1.275 for num in nums]

nums = list(range(1000))
%time res = dt.DocParser.distribute_chunks(muli_multi, nums, workers=1)
res[:3]

CPU times: user 4.23 ms, sys: 7.57 ms, total: 11.8 ms
Wall time: 32.7 ms


[0.0, 1.275, 2.55]

In [15]:
# here the database instance and nlp are passed to the worker as extra arguments
def parse_texts_chunk(texts, db, nlp):
    for doc in nlp.pipe(texts):
        toks = dt.DocParser.tokenize_doc(doc)
        db.insert({'tokens':toks})

nums = list(range(1000))
db = dt.DocTable(schema=[('pickle','tokens')], fname='parse_text_chunk.db')
db.delete()
%time res = dt.DocParser.distribute_chunks(parse_texts_chunk, texts_small, db, nlp, workers=2)
res[:3]
db.select_df()

CPU times: user 1.44 ms, sys: 13.6 ms, total: 15.1 ms
Wall time: 64.6 ms


                                              tokens
0  [the, hat, is, red, ., and, so, are, you, ., w...
1  [but, why, is, the, hat, blue, ?, are, you, co...
