# Example 2: NSS Corpus in Paragraph Form
Here I'll show how to make a DocTable that stores National Security Strategy (NSS) documents, split each document into paragraphs, and parse those paragraphs in parallel.

For context, check out [Example 1](https://devincornell.github.io/doctable/examples/ex_nss.html). Here we reuse some of the code from that example through shortcuts defined in `util.py`, found in the repo's examples folder.

In [1]:
import sys
sys.path.append('..')
import util
import doctable
import spacy
from tqdm import tqdm

First we get the data as a list of dictionaries. We'll skip the details of downloading the NSS data, since it is only used here for illustration.

In [2]:
nss_data = util.download_all_nssdata()
print(nss_data[0].keys())
len(nss_data)

dict_keys(['party', 'president', 'text', 'year'])


17

## 1. Add Data to a DocTable
We now make a doctable and add our data to it. The schema matches the format of the data retrieved above, so we only need to call the `.insert` method on each document.

In [3]:
class NSSParsetrees(doctable.DocTable):
    tabname = 'nss_parsetrees'
    schema = (
        ('idcol', 'id'),
        ('integer', 'year'), 
        ('string', 'party'),
        ('string', 'president'),
        ('string', 'text'),
        ('pickle', 'parsetrees'),
    )
    def __init__(self, fname=':memory:', **kwargs):
        super().__init__(fname=fname, schema=self.schema, tabname=self.tabname, **kwargs)

In [4]:
# add docs to database
db = NSSParsetrees()
for nssdoc in nss_data:
    db.insert(nssdoc)
print(db)
db.select_df(limit=2)

<DocTable::nss_parsetrees ct: 17>


| | id | year | party | president | text | parsetrees |
|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | R | Reagan | I. An American Perspective \n\nIn the early da... | |
| 1 | 2 | 1988 | R | Reagan | Preface\n\nThis statement of America's Nationa... | |
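
To illustrate why matching the schema to the dictionary keys makes insertion a one-liner, here is a minimal sketch that uses the standard-library `sqlite3` module as a stand-in. It is not how doctable is implemented; the `records` list and the `insert_record` helper are hypothetical and exist only for this example.

```python
import sqlite3

# Hypothetical records shaped like the nss_data entries (text shortened).
records = [
    {'year': 1987, 'party': 'R', 'president': 'Reagan', 'text': 'I. An American Perspective'},
    {'year': 1988, 'party': 'R', 'president': 'Reagan', 'text': 'Preface'},
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE nss_parsetrees ('
    'id INTEGER PRIMARY KEY, year INTEGER, party TEXT, '
    'president TEXT, text TEXT, parsetrees BLOB)'
)

def insert_record(conn, record):
    # Column names come straight from the dict keys, and the values are bound
    # as named parameters, so no per-document reshaping is needed.
    cols = ', '.join(record)
    params = ', '.join(':' + key for key in record)
    conn.execute(f'INSERT INTO nss_parsetrees ({cols}) VALUES ({params})', record)

for rec in records:
    insert_record(conn, rec)

# The id column is filled in automatically; parsetrees stays empty for now.
rows = conn.execute(
    'SELECT id, year, president, parsetrees FROM nss_parsetrees'
).fetchall()
print(rows)
```

As in the table above, the `parsetrees` column is empty until the documents are parsed.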


## 2. Create a Parser Class Using a Pipeline
Now we create a small `NSSParser` class that keeps a `doctable.ParsePipeline` object for doing the actual text processing. As the `__init__` method shows, instantiating the class loads a spaCy model into memory and constructs the pipeline from the selected components. The class also wraps the pipeline's `.parsemany` method. Here we define, instantiate, and view the components of `NSSParser`.

In [5]:
class NSSParser:
    ''' Handles text parsing for NSS documents.'''
    def __init__(self):
        # 'en' is a spaCy 2 shortcut link; spaCy 3+ needs a full model name such as 'en_core_web_sm'
        nlp = spacy.load('en')
        
        # this determines all settings for tokenizing
        self.pipeline = doctable.ParsePipeline([
            nlp, # first run spacy parser
            doctable.component('merge_tok_spans', merge_ents=True),
            doctable.component('get_parsetrees', **{
                'parse_tok_func': doctable.component('parse_tok', **{
                    'format_ents': True,
                    'num_replacement': 'NUM',
                })
            })
        ])
    
    def parsemany(self, texts, workers=1):
        return self.pipeline.parsemany(texts, workers=workers)

parser = NSSParser() # creates a parser instance
parser.pipeline.components

[<spacy.lang.en.English at 0x7fdb4ca5cc50>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>]
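
Conceptually, a pipeline like this is a list of callables applied in order, each receiving the output of the previous one. The sketch below shows that idea in plain Python. It is an assumption about the general pattern, not doctable's actual implementation, and `run_pipeline` and the three stand-in components are hypothetical.

```python
from functools import reduce

def run_pipeline(components, text):
    # Feed the output of each component into the next one, in order.
    return reduce(lambda value, component: component(value), components, text)

# Hypothetical stand-in components: tokenize, lowercase, replace numbers.
components = [
    str.split,
    lambda toks: [t.lower() for t in toks],
    lambda toks: ['NUM' if t.isdigit() else t for t in toks],
]

result = run_pipeline(components, 'The 1987 Strategy')
print(result)
```

Keeping every setting inside the component list means the whole tokenization configuration lives in one place, which is the role the `ParsePipeline` constructor call plays above.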

In [10]:
for idx, year, text in tqdm(db.select(['id','year','text'])):
    parsed = parser.parsemany(text.split('\n\n'), workers=30) # parse paragraphs in parallel
    # flatten the per-paragraph results into one list of parsetrees per document
    parsetrees = [pt for par in parsed for pt in par]
    db.update({'parsetrees': parsetrees}, where=db['id']==idx)
db.select_df(limit=2)

100%|██████████| 17/17 [00:21<00:00,  1.27s/it]


| | id | year | party | president | text | parsetrees |
|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | R | Reagan | I. An American Perspective \n\nIn the early da... | [(ParseNode(I. An American), ParseNode(perspec... |
| 1 | 2 | 1988 | R | Reagan | Preface\n\nThis statement of America's Nationa... | [(ParseNode(preface)), (ParseNode(this), Parse... |
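
The loop above does two things per document: it splits the text into paragraphs on blank lines and parses them concurrently, then flattens the per-paragraph results into a single list. Here is a minimal sketch of that pattern. The `parse_paragraph` function is a hypothetical stand-in for the real parser, and a thread pool is used only to keep the example small; it is not a claim about how `parsemany` distributes work.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_paragraph(paragraph):
    # Hypothetical stand-in parser: one tuple of lowercased tokens per sentence.
    sentences = [s.strip() for s in paragraph.split('.') if s.strip()]
    return [tuple(s.lower().split()) for s in sentences]

text = 'Preface\n\nWe laid the foundation. We directed changes.'
paragraphs = text.split('\n\n')  # paragraphs are separated by blank lines

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() returns results in the same order as the input paragraphs
    parsed = list(pool.map(parse_paragraph, paragraphs))

# parsed is a list of paragraphs, each a list of sentences; flatten it
flat = [sent for par in parsed for sent in par]
print(flat)
```

Flattening discards paragraph boundaries, so each document ends up with one flat list of sentence-level results, matching the `parsetrees` column above.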


## 3. Work With Parsetrees
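The table above shows that every document in our sample now has its parsetrees stored alongside its text. Next we show how to extract simple subject-verb-object triplets from sentences in the NSS documents. For each verb we look up its first `nsubj` and `dobj` children; when a verb has no such child, `None` is printed in its place.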

In [50]:
def first_pos(tok, pos):
    ''' Return the text of the first child whose dependency label equals pos, or None.'''
    for child in tok.childs:
        if child.dep == pos:
            return child.text
    return None
        
for i,pt in enumerate(db.select_first('parsetrees')):
    print(' '.join([p.text for p in pt])) # print out sentence
    for tok in pt:
        if tok.pos == 'VERB':
            print(first_pos(tok, 'nsubj'), tok.text, first_pos(tok, 'dobj'))
            
    print('\n')
    if i == 4:
        break

I. An American perspective


in The Early Days of this administration we laid the foundation for a more constructive and positive American role in world affairs by clarifying the essential elements of U.S. foreign and defense policy .
we laid foundation
None clarifying elements


over The Intervening Years , we have looked objectively at our policies and performance on the world scene to ensure they reflect the dynamics of a complex and ever - changing world .
we looked None
None ensure None
they reflect dynamics
None changing None


where course adjustments have been required , i have directed changes .
None required None
i directed changes


but we have not veered and will not veer from the broad aims that guide America 's leadership role in Today 's world :
we veered None
None will None
None veer None
that guide role
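
To make the lookup logic easy to try without a database or spaCy model, here is a self-contained sketch that rebuilds the "we laid foundation" case by hand. The `Node` class is a hypothetical stand-in that only mimics the attributes used in the cell above (`text`, `pos`, `dep`, `childs`), and `first_child_with_dep` and `svo_triplets` are illustrative names, not part of doctable.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    # Hypothetical stand-in exposing only the attributes used in the cell above.
    text: str
    pos: str
    dep: str
    childs: List['Node'] = field(default_factory=list)

def first_child_with_dep(tok: Node, dep: str) -> Optional[str]:
    # Return the text of the first child with the given dependency label.
    for child in tok.childs:
        if child.dep == dep:
            return child.text
    return None

def svo_triplets(sentence: List[Node]) -> List[Tuple[Optional[str], str, Optional[str]]]:
    # One (subject, verb, object) tuple per verb; missing parts are None.
    return [
        (first_child_with_dep(tok, 'nsubj'), tok.text, first_child_with_dep(tok, 'dobj'))
        for tok in sentence
        if tok.pos == 'VERB'
    ]

# Hand-built fragment: "we laid foundation"
we = Node('we', 'PRON', 'nsubj')
foundation = Node('foundation', 'NOUN', 'dobj')
laid = Node('laid', 'VERB', 'ROOT', [we, foundation])

triplets = svo_triplets([we, laid, foundation])
print(triplets)
```

Returning tuples instead of printing makes it straightforward to drop incomplete triplets afterwards, for example by keeping only those where no element is `None`.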


