# Parser Pipeline Basics
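Most text-parsing workflows follow the same pattern: preprocess the raw text, run it through a parser, and tokenize the result. This notebook shows how to express that pattern as a `ParsePipeline`.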

In [1]:
# add the parent directory to the path so the local doctable package can be imported
import sys
sys.path.append('..')
import doctable

In [2]:
# download data for examples
from util import download_nss # for downloading nss text documents from my github repo
nss = download_nss(years=[1987])[1987].split('\n\n') # download nss and split paragraphs
nss[:3] # list of paragraph strings

['I. An American Perspective ',
 'In the early days of this Administration we laid the foundation for a more constructive and positive American role in world affairs by clarifying the essential elements of U.S. foreign and defense policy. ',
 "Over the intervening years, we have looked objectively at our policies and performance on the world scene to ensure they reflect the dynamics of a complex and ever-changing world . Where course adjustments have been required, I have directed changes. But we have not veered and will not veer from the broad aims that guide America's leadership role in today's world: "]

## Build a ParsePipeline
A pipeline is simply a list of functions (called components) that are applied, in order, to each element of your data. The output of one component becomes the input of the next.

Pipeline components from doctable are functions that lie in `doctable.parse`, and can be accessed using `doctable.parse.<function_name>`. You can also use the `doctable.component` function as a shortcut to access those functions.

In [3]:
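import spacy # for example text processing
# note: spaCy 3+ no longer accepts the 'en' shortcut; use e.g. spacy.load('en_core_web_sm') there
nlp = spacy.load('en') # nlp is a callable that parses text with spaCy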

from doctable import component # shortcut to functions in doctable.parse.<functions>

pipeline = doctable.ParsePipeline([
    component('preprocess', replace_xml=''), # preprocess to remove xml tags (doctable.parse.preprocess)
    nlp, # spacy nlp parser object
    component('tokenize', split_sents=False), # add tokenizer component (doctable.parse.tokenize)
])
pipeline.components

[<function doctable.pipeline.component.<locals>.<lambda>(x)>,
 <spacy.lang.en.English at 0x7f614d07aba8>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>]

In [4]:
pipeline.parse(nss[0])

[I., An, American, Perspective]
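To make the idea concrete, here is a minimal plain-Python sketch of what applying a list of components in sequence amounts to. `run_pipeline` is a hypothetical helper written for illustration, not part of doctable, and the real `ParsePipeline.parse` may do more than this.

```python
from functools import reduce

def run_pipeline(components, text):
    """Feed `text` through each component in order (illustrative only)."""
    return reduce(lambda value, func: func(value), components, text)

# two toy components standing in for the preprocess and tokenize steps
components = [str.strip, str.split]

tokens = run_pipeline(components, "  I. An American Perspective ")
print(tokens)  # ['I.', 'An', 'American', 'Perspective']
```

Any callable that takes one argument and returns one value fits this pattern, which is why a spaCy `nlp` object can sit in the same list as ordinary functions.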

## More Components For Pipeline
Because some of the component functions in `doctable.parse` take other functions as arguments, components can be nested. Consider the `tokenize` function, which takes an argument `keep_tok_func` for deciding whether to keep a spaCy token in the final output, and an argument `parse_tok_func` for converting a spaCy token object into a string. The doctable functions `keep_tok` and `parse_tok` have some useful settings that we'll set in the next pipeline example. We'll also use the doctable function `merge_tok_spans` to combine multi-word entities into single tokens.

In [5]:
from doctable import component # shortcut to functions in doctable.parse.<functions>

pipeline = doctable.ParsePipeline([
    component('preprocess', replace_xml=''), # preprocess to remove xml tags (doctable.parse.preprocess)
    nlp, # spacy nlp parser object
    component('merge_tok_spans', merge_ents=True, merge_noun_chunks=False),
    component('tokenize', **{
        'split_sents': False,
        'keep_tok_func': component('keep_tok', **{
            'keep_whitespace': False, # don't keep whitespace
            'keep_punct': True, # keep punctuation and stopwords
            'keep_stop': True,
        }),
        'parse_tok_func': component('parse_tok', **{
            'format_ents': True,
            'lemmatize': False,
            'num_replacement': 'NUM',
            'ent_convert': lambda e: e.text.upper(), # function to convert named entities to uppercase
        })
    })
])
pipeline.components

[<function doctable.pipeline.component.<locals>.<lambda>(x)>,
 <spacy.lang.en.English at 0x7f614d07aba8>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>]

In [6]:
print(nss[2])
print()
print(pipeline.parse(nss[2]))

Over the intervening years, we have looked objectively at our policies and performance on the world scene to ensure they reflect the dynamics of a complex and ever-changing world . Where course adjustments have been required, I have directed changes. But we have not veered and will not veer from the broad aims that guide America's leadership role in today's world: 

['over', 'THE INTERVENING YEARS', ',', 'we', 'have', 'looked', 'objectively', 'at', 'our', 'policies', 'and', 'performance', 'on', 'the', 'world', 'scene', 'to', 'ensure', 'they', 'reflect', 'the', 'dynamics', 'of', 'a', 'complex', 'and', 'ever', '-', 'changing', 'world', '.', 'where', 'course', 'adjustments', 'have', 'been', 'required', ',', 'i', 'have', 'directed', 'changes', '.', 'but', 'we', 'have', 'not', 'veered', 'and', 'will', 'not', 'veer', 'from', 'the', 'broad', 'aims', 'that', 'guide', 'AMERICA', "'s", 'leadership', 'role', 'in', 'TODAY', "'s", 'world', ':']
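
The nesting above works because each component is just a function with some of its arguments already bound. The sketch below mimics that pattern with `functools.partial`. The functions `keep_tok`, `parse_tok`, and `tokenize` defined here are simplified stand-ins that operate on plain strings; they are written for illustration only and do not reproduce doctable's functions or signatures.

```python
from functools import partial

def keep_tok(tok, keep_punct=True):
    """Decide whether a token is kept (stand-in, not doctable's keep_tok)."""
    return keep_punct or tok.isalnum()

def parse_tok(tok, lowercase=True):
    """Convert a token to its output string (stand-in, not doctable's parse_tok)."""
    return tok.lower() if lowercase else tok

def tokenize(text, keep_tok_func=keep_tok, parse_tok_func=parse_tok):
    """Split on whitespace, filter with keep_tok_func, format with parse_tok_func."""
    return [parse_tok_func(t) for t in text.split() if keep_tok_func(t)]

# bind settings on the inner functions first, then pass them to the outer one
tokenizer = partial(
    tokenize,
    keep_tok_func=partial(keep_tok, keep_punct=False),
    parse_tok_func=partial(parse_tok, lowercase=True),
)

result = tokenizer("Where course adjustments , I have directed changes .")
print(result)  # ['where', 'course', 'adjustments', 'i', 'have', 'directed', 'changes']
```

Binding the settings up front leaves a one-argument callable, which is the shape a pipeline component needs.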


### Parsing Using Multiprocessing
The `ParsePipeline` class can also parse multiple texts in parallel using `doctable.Distribute`. Call the `.parsemany()` method and set `workers` to the number of worker processes. In the runs below, parsing 100 paragraphs took about 2.4 s of wall time with one worker and about 0.56 s with five.

In [7]:
%time parsed = pipeline.parsemany(nss[:100], workers=1)
parsed[0]

CPU times: user 2.41 s, sys: 2.49 ms, total: 2.41 s
Wall time: 2.41 s


['I. AN AMERICAN', 'perspective']

In [8]:
# NOTE: under certain conditions this fails - it fails in the first example here, but not the last
%time parsed = pipeline.parsemany(nss[:100], workers=5)
parsed[0]

CPU times: user 6.71 ms, sys: 33.4 ms, total: 40.1 ms
Wall time: 556 ms


['I. AN AMERICAN', 'perspective']
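
For readers curious about the general technique, the following is a rough sketch of parallel parsing built on the standard library's `multiprocessing.Pool`. It is not how `parsemany()` or `doctable.Distribute` is implemented; `parse_one` and `parse_many` are hypothetical names, and the toy parser stands in for a full pipeline.

```python
from multiprocessing import Pool

def parse_one(text):
    """Toy single-document parser: lowercase and split on whitespace."""
    return text.lower().split()

def parse_many(texts, workers=1):
    """Parse each text, optionally spreading the work over `workers` processes."""
    if workers <= 1:
        # serial path: no process start-up cost
        return [parse_one(t) for t in texts]
    with Pool(processes=workers) as pool:
        # Pool.map returns results in the same order as the inputs
        return pool.map(parse_one, texts)

serial = parse_many(["An American Perspective", "Course adjustments"], workers=1)
print(serial)  # [['an', 'american', 'perspective'], ['course', 'adjustments']]
```

With `workers` greater than 1, the standard pickling used by `Pool` requires `parse_one` to be defined at module level, and on platforms that spawn new processes the call should sit under an `if __name__ == "__main__":` guard. Starting processes has a fixed cost, so the parallel path tends to pay off only when each document takes a meaningful amount of time to parse.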

## Custom Functions in Pipeline
Because a pipeline is just a list of function components, it's easy to add components after creating a pipeline or simply build one entirely from your own functions, as in the example below.

In [9]:
pipeline = doctable.ParsePipeline([
    lambda x: x.upper(),
    lambda x: x.split(),
])
print(pipeline.parse(nss[1]))

['IN', 'THE', 'EARLY', 'DAYS', 'OF', 'THIS', 'ADMINISTRATION', 'WE', 'LAID', 'THE', 'FOUNDATION', 'FOR', 'A', 'MORE', 'CONSTRUCTIVE', 'AND', 'POSITIVE', 'AMERICAN', 'ROLE', 'IN', 'WORLD', 'AFFAIRS', 'BY', 'CLARIFYING', 'THE', 'ESSENTIAL', 'ELEMENTS', 'OF', 'U.S.', 'FOREIGN', 'AND', 'DEFENSE', 'POLICY.']
