# doctable Demo: US National Security Strategy Documents
In this example, I'll show how to create a database for document + metadata storage using the `DocTable` class, and a parser class using `DocParser`. We will store the metadata you see below in addition to the formatted document text.

This notebook walks through a complete doctable workflow for parsing texts end-to-end, using the NSS documents for demonstration.


In [1]:
import sys
sys.path.append('..')
import doctable
import spacy
from tqdm import tqdm
import pandas as pd
import os
from pprint import pprint
import urllib.request # used for downloading nss docs

## Introduction to Dataset
This dataset is the plain text version of the US National Security Strategy documents. During the parsing process, all plain text files will be downloaded from my [github project hosting the nss docs](https://github.com/devincornell/nssdocs). I compiled the metadata you see below from [a page hosted by the historical dept of the secretary's office](https://history.defense.gov/Historical-Sources/National-Security-Strategy/). In short, each US President must release at least one NSS per term, with some (namely Clinton) producing more.

Here I've created the function `download_nss` to download the text data from my nssdocs github repository, and the python dictionary `nss_metadata` to store information about each document to be stored in the database.

In [2]:
def download_nss(year):
    ''' Simple helper function for downloading texts from my nssdocs repo.'''
    baseurl = 'https://raw.githubusercontent.com/devincornell/nssdocs/master/docs/{}.txt'
    url = baseurl.format(year)
    text = urllib.request.urlopen(url).read().decode('utf-8')
    return text

In [3]:
nss_metadata = {
            1987: {'party': 'R', 'president': 'Reagan'}, 
            1993: {'party': 'R', 'president': 'H.W. Bush'}, 
            2002: {'party': 'R', 'president': 'W. Bush'}, 
            2015: {'party': 'D', 'president': 'Obama'}, 
            1994: {'party': 'D', 'president': 'Clinton'}, 
            1990: {'party': 'R', 'president': 'H.W. Bush'}, 
            1991: {'party': 'R', 'president': 'H.W. Bush'}, 
            2006: {'party': 'R', 'president': 'W. Bush'}, 
            1997: {'party': 'D', 'president': 'Clinton'}, 
            1995: {'party': 'D', 'president': 'Clinton'}, 
            1988: {'party': 'R', 'president': 'Reagan'}, 
            2017: {'party': 'R', 'president': 'Trump'}, 
            1996: {'party': 'D', 'president': 'Clinton'}, 
            2010: {'party': 'D', 'president': 'Obama'}, 
            1999: {'party': 'D', 'president': 'Clinton'}, 
            1998: {'party': 'D', 'president': 'Clinton'}, 
            2000: {'party': 'D', 'president': 'Clinton'}
}

In [4]:
# downloader example: first 100 characters of 1993 NSS document
text = download_nss(1993)
text[:100]

'Preface \n\nAmerican Leadership for Peaceful Change \n\nOur great Nation stands at a crossroads in histo'

# 1. DocTable and Parsing for NSS Document Dataset
The `NSSDocs` class below inherits from `DocTable` and stores the schema and other static information about the database. This is the most common way to work with `DocTable`. You can see we keep two class member variables to store the database table name and the schema. See the [schema guide](examples/doctable_schema.html) for more schema examples.

We also override the `.update()` method, which wraps `DocTable.update()` so that the paragraph and token counts are calculated automatically whenever parsed text is written to a row.

In [5]:
class NSSDocs(doctable.DocTable):
    tabname = 'nssdocs'
    schema = (
        ('idcol', 'id'), # doctable shortcut for automatic index
        
        # some metadata about the documents
        ('integer', 'year', dict(unique=True, nullable=False)),
        ('string','president'),
        ('string','party'), ('check_constraint', 'party in ("R","D")'),
        
        # raw and parsed text data
        ('string', 'text'),
        ('pickle','parsed'), # nested tokens within each paragraph
        
        # metadata
        ('integer','num_paragraphs'),
        ('integer', 'num_tokens'),
        
        # indices for easy access
        ('index', 'ind_yr', ['year'], dict(unique=True)),
    )
    def __init__(self, fname=':memory:', **kwargs):
        super().__init__(fname=fname, schema=self.schema, tabname=self.tabname, **kwargs)
        
        
    def update(self, row, **kwargs):
        ''' Override update to automatically calculate num_paragraphs and num_tokens.
        '''
        if 'parsed' in row:
            row['num_paragraphs'] = len(row['parsed'])
            row['num_tokens'] = len([tok for par in row['parsed'] for tok in par])
        
        return super().update(row, **kwargs) # call the regular doctable update now.
    

We can then create a connection to a database by instantiating the class. Since the `fname` parameter was not provided, this doctable exists only in memory, using the special sqlite name ":memory:". Our other examples will use files, but instantiating in memory first is a good way to check that the schema is valid. We can check the sqlite table schema using the `.schemainfo` property. You can see that the 'pickle' datatype we chose above is represented as a BLOB column. This is because DocTable, using SQLAlchemy core, creates an interface on top of sqlite to handle the data conversion.

In [6]:
# printing the DocTable object itself shows how many entries there are
db = NSSDocs()
print(db)
pd.DataFrame(db.schemainfo)

<DocTable::nssdocs ct: 0>


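| | name | type | nullable | default | autoincrement | primary_key |
|---|---|---|---|---|---|---|
| 0 | id | INTEGER | False | | auto | 1 |
| 1 | year | INTEGER | False | | auto | 0 |
| 2 | president | VARCHAR | True | | auto | 0 |
| 3 | party | VARCHAR | True | | auto | 0 |
| 4 | text | VARCHAR | True | | auto | 0 |
| 5 | parsed | BLOB | True | | auto | 0 |
| 6 | num_paragraphs | INTEGER | True | | auto | 0 |
| 7 | num_tokens | INTEGER | True | | auto | 0 |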
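
To make the BLOB representation of the 'pickle' column more concrete, here is a minimal sketch that uses only the standard `sqlite3` and `pickle` modules. It does not use doctable at all; the `demo` table and its columns are made up for illustration, and it only approximates the kind of conversion described above.

```python
import pickle
import sqlite3

# In-memory database, analogous to the ':memory:' default used above.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo (id INTEGER PRIMARY KEY, parsed BLOB)')

# Nested token lists, shaped like the 'parsed' column (paragraphs of tokens).
parsed = [['preface'], ['American', 'leadership', 'for', 'peaceful', 'change']]

# Serialize the Python object to bytes on the way in ...
conn.execute('INSERT INTO demo (parsed) VALUES (?)', (pickle.dumps(parsed),))

# ... and deserialize the bytes back into a Python object on the way out.
(blob,) = conn.execute('SELECT parsed FROM demo').fetchone()
restored = pickle.loads(blob)

print(type(blob).__name__, restored == parsed)
```

The database itself only ever sees bytes, which is why the column type is reported as BLOB.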


Now let's download and store the text into the database. Each loop downloads a text document and inserts it into the doctable.

In [7]:
for year, docmeta in nss_metadata.items():
    text = download_nss(year)
    db.insert({
        'year':year, 
        'party': docmeta['party'], 
        'president': docmeta['president'],
        'text': text, 
    }, ifnotunique='replace')
    print(f'added parsed text to {year}: {db}')
db

added parsed text to 1987: <DocTable::nssdocs ct: 1>
added parsed text to 1993: <DocTable::nssdocs ct: 2>
added parsed text to 2002: <DocTable::nssdocs ct: 3>
added parsed text to 2015: <DocTable::nssdocs ct: 4>
added parsed text to 1994: <DocTable::nssdocs ct: 5>
added parsed text to 1990: <DocTable::nssdocs ct: 6>
added parsed text to 1991: <DocTable::nssdocs ct: 7>
added parsed text to 2006: <DocTable::nssdocs ct: 8>
added parsed text to 1997: <DocTable::nssdocs ct: 9>
added parsed text to 1995: <DocTable::nssdocs ct: 10>
added parsed text to 1988: <DocTable::nssdocs ct: 11>
added parsed text to 2017: <DocTable::nssdocs ct: 12>
added parsed text to 1996: <DocTable::nssdocs ct: 13>
added parsed text to 2010: <DocTable::nssdocs ct: 14>
added parsed text to 1999: <DocTable::nssdocs ct: 15>
added parsed text to 1998: <DocTable::nssdocs ct: 16>
added parsed text to 2000: <DocTable::nssdocs ct: 17>


<DocTable::nssdocs ct: 17>

In [8]:
# use select_df to show a couple rows of our database
db.select_df(limit=2)

| | id | year | president | party | text | parsed | num_paragraphs | num_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Reagan | R | I. An American Perspective \n\nIn the early da... | | | |
| 1 | 2 | 1993 | H.W. Bush | R | Preface \n\nAmerican Leadership for Peaceful C... | | | |


Notice that we've filled the text column, but now we need to convert the text data into tokens in python.


## Parse Text in the DocTable
Now that the text is in the doctable, we can parse it by reading from the table and storing the parsed text there as well.

#### Create a Parser Class Using a Pipeline
Now we create a small `NSSParser` class that keeps a `doctable.ParsePipeline` for doing the actual text processing. Instantiating the class will load a spaCy model into memory and construct the pipeline from the selected components.

In [9]:
class NSSParser(doctable.DocParser):
    ''' Handles text parsing for NSS documents.'''
    def __init__(self):
        nlp = spacy.load('en') # spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'
        
        # this determines all settings for tokenizing
        self.pipeline = doctable.ParsePipeline([
            nlp, # first run spacy parser
            doctable.component('merge_tok_spans', merge_ents=True),
            doctable.component('tokenize', **{
                'split_sents': False,
                'keep_tok_func': doctable.component('keep_tok'),
                'parse_tok_func': doctable.component('parse_tok', **{
                    'format_ents': True,
                    'num_replacement': 'NUM',
                })
            })
        ])
        
    def parse(self, text):
        return self.pipeline.parse(text)
    
    def parsemany(self, texts, workers=1):
        return self.pipeline.parsemany(texts, workers=workers)

parser = NSSParser() # creates a parser instance
parser.pipeline.components

[<spacy.lang.en.English at 0x7ff1a72e3048>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>,
 <function doctable.pipeline.component.<locals>.<lambda>(x)>]
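
As a rough mental model for the component list above, the sketch below assumes that a pipeline simply feeds the output of each component into the next one. `SimplePipeline`, `replace_numbers`, and the whitespace tokenizer are hypothetical stand-ins written for this illustration; they are not part of doctable or spaCy, and the real `ParsePipeline` may work differently.

```python
from functools import reduce

class SimplePipeline:
    '''Hypothetical stand-in: applies each component to the previous output.'''
    def __init__(self, components):
        self.components = list(components)

    def parse(self, text):
        # Thread the data through every component, in order.
        return reduce(lambda data, component: component(data), self.components, text)

    def parsemany(self, texts):
        # Sequential version; no parallelism in this sketch.
        return [self.parse(t) for t in texts]

def replace_numbers(tokens, replacement='NUM'):
    '''Swap purely numeric tokens for a placeholder string.'''
    return [replacement if tok.isdigit() else tok for tok in tokens]

pipeline = SimplePipeline([
    str.split,                                # stand-in for the spaCy step
    lambda toks: [t.lower() for t in toks],   # normalize case
    replace_numbers,                          # loosely mirrors num_replacement='NUM'
])

tokens = pipeline.parse('The strategy was released in 1993')
print(tokens)
```

Keeping each step as a plain callable makes it easy to swap or reorder steps without touching the class that owns the pipeline.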

In the next block we loop through rows in the doctable and, for each row, parse the text document and write the result back to that row with `.update()`. Because each paragraph is independent, we can use our `.parsemany` method (a simple wrapper over the pipeline) to parse them all in parallel. To parse in parallel, you must specify a value for `workers` greater than 1.

In [10]:
for idx, year, text in tqdm(db.select(['id','year','text'])):
    parsed = parser.parsemany(text.split('\n\n'), workers=30) # parse paragraphs in parallel
    db.update({'parsed': parsed}, where=db['id']==idx)
db

100%|██████████| 17/17 [00:14<00:00,  1.19it/s]


<DocTable::nssdocs ct: 17>

In [11]:
db.select_df(limit=3)

| | id | year | president | party | text | parsed | num_paragraphs | num_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Reagan | R | I. An American Perspective \n\nIn the early da... | [[I. An American, perspective], [in, The Early... | 265 | 25302 |
| 1 | 2 | 1993 | H.W. Bush | R | Preface \n\nAmerican Leadership for Peaceful C... | [[preface], [American, leadership, for, peacef... | 125 | 13082 |
| 2 | 3 | 2002 | W. Bush | R | The great struggles of the twentieth century b... | [[the, great, struggles, of, The Twentieth Cen... | 199 | 13883 |


You can see above that the `num_paragraphs` and `num_tokens` columns have been updated because of our custom `update` function in the `NSSDocs` class.
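
The counting logic can be checked in isolation from the database. The following is a small sketch of the same calculation on a toy row; `with_counts` is a hypothetical helper name used only here, and it returns a copy rather than modifying the row in place as the class method does.

```python
def with_counts(row):
    '''Return a copy of row with count columns added when 'parsed' is present.'''
    row = dict(row)  # copy so the caller's dict is not modified
    if 'parsed' in row:
        # one entry per paragraph
        row['num_paragraphs'] = len(row['parsed'])
        # total tokens across all paragraphs
        row['num_tokens'] = sum(len(par) for par in row['parsed'])
    return row

parsed = [['preface'], ['American', 'leadership', 'for', 'peaceful', 'change']]
updated = with_counts({'parsed': parsed})
untouched = with_counts({'year': 1993})  # no 'parsed' key, so no counts are added

print(updated['num_paragraphs'], updated['num_tokens'])
```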

# 2. NSS In Paragraph Form
Our second example will look very similar, but this time we will create a doctable where each paragraph is a separate row in the database. We will use the same `NSSParser` class defined earlier, but update the doctable columns and slightly change our insert and parsing code. I aim to demonstrate here a general formula for processing large amounts of text from a doctable for storage back into the doctable.

In [12]:
class NSSParagraphs(doctable.DocTable):
    tabname = 'nss_paragraphs'
    schema = (
        ('idcol', 'id'),
        ('integer', 'year', dict(nullable=False)),
        ('string','president'),
        ('string','party'), ('check_constraint', 'party in ("R","D")'),
        ('integer', 'parnum', dict(nullable=False)), # added number to indicate which paragraph within the document
        ('string', 'text'),
        ('pickle','parsed'),
        ('integer', 'num_tokens'),
        ('index', 'ind_yr_parnum', ['year', 'parnum'], dict(unique=True)),
    )
    def __init__(self, fname=':memory:', **kwargs):
        super().__init__(fname=fname, schema=self.schema, tabname=self.tabname, **kwargs)
        
    def update(self, row, **kwargs):
        ''' Override update to automatically calculate num_tokens.
        '''
        if 'parsed' in row:
            row['num_tokens'] = len(row['parsed'])
        return super().update(row, **kwargs) # call the regular doctable update now.
pdb = NSSParagraphs()
pdb

<DocTable::nss_paragraphs ct: 0>

In [13]:
for year, docmeta in tqdm(nss_metadata.items()):
    partexts = download_nss(year).split('\n\n')
    for i, text in enumerate(partexts):
        pdb.insert({
            'year':year, 
            'party': docmeta['party'], 
            'president': docmeta['president'],
            'parnum': i,
            'text': text, 
        }, ifnotunique='replace')
pdb

100%|██████████| 17/17 [00:04<00:00,  4.03it/s]


<DocTable::nss_paragraphs ct: 5200>
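
The unique index on `(year, parnum)` together with `ifnotunique='replace'` is what makes the insert loop safe to re-run. The sketch below illustrates the idea with plain `sqlite3`, under the assumption that the behavior resembles sqlite's `INSERT OR REPLACE`; I have not confirmed that doctable issues exactly this statement, and the `paragraphs` table and `insert_paragraphs` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''
    CREATE TABLE paragraphs (
        id INTEGER PRIMARY KEY,
        year INTEGER NOT NULL,
        parnum INTEGER NOT NULL,
        text TEXT
    )
''')
# Same idea as the ('index', 'ind_yr_parnum', ...) schema entry above.
conn.execute('CREATE UNIQUE INDEX ind_yr_parnum ON paragraphs (year, parnum)')

def insert_paragraphs(conn, year, document):
    '''Split a document on blank lines and store one row per paragraph.'''
    rows = [(year, i, par) for i, par in enumerate(document.split('\n\n'))]
    conn.executemany(
        'INSERT OR REPLACE INTO paragraphs (year, parnum, text) VALUES (?, ?, ?)',
        rows,
    )

doc = 'Preface\n\nAmerican Leadership for Peaceful Change'
insert_paragraphs(conn, 1993, doc)
insert_paragraphs(conn, 1993, doc)  # second run replaces rows instead of duplicating them

(count,) = conn.execute('SELECT COUNT(*) FROM paragraphs').fetchone()
print(count)
```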

In [14]:
pdb.head()

| | id | year | president | party | parnum | text | parsed | num_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Reagan | R | 0 | I. An American Perspective | | |
| 1 | 2 | 1987 | Reagan | R | 1 | In the early days of this Administration we la... | | |
| 2 | 3 | 1987 | Reagan | R | 2 | Over the intervening years, we have looked obj... | | |
| 3 | 4 | 1987 | Reagan | R | 3 | Commitment to the goals of world freedom, peac... | | |
| 4 | 5 | 1987 | Reagan | R | 4 | While the United States has been the leader of... | | |


## Parse in Chunks
A common approach to parsing involves processing chunks of texts in parallel. Using the parser shown in the previous example, I'll show how to use the `.select_chunks()` function to select chunks of rows at a time for parallel processing.

In [15]:
for rowchunk in tqdm(pdb.select_chunks(['id', 'text'], chunksize=1000)):
    parsed = parser.parsemany([r['text'] for r in rowchunk], workers=30)
    for row, p in zip(rowchunk, parsed):
        pdb.update({'parsed': p}, where=pdb['id']==row['id'])

7it [00:14,  2.00s/it]


In [16]:
pdb.head()

| | id | year | president | party | parnum | text | parsed | num_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1987 | Reagan | R | 0 | I. An American Perspective | [I. An American, perspective] | 2 |
| 1 | 2 | 1987 | Reagan | R | 1 | In the early days of this Administration we la... | [in, The Early Days, of, this, administration,... | 32 |
| 2 | 3 | 1987 | Reagan | R | 2 | Over the intervening years, we have looked obj... | [over, The Intervening Years, ,, we, have, loo... | 67 |
| 3 | 4 | 1987 | Reagan | R | 3 | Commitment to the goals of world freedom, peac... | [commitment, to, the, goals, of, world, freedo... | 167 |
| 4 | 5 | 1987 | Reagan | R | 4 | While the United States has been the leader of... | [while, The United States, has, been, the, lea... | 45 |


All paragraphs have now been parsed and written back to the database. Using `.select_chunks` prevents us from loading too many text documents into memory at once, and using the pipeline's `.parsemany` function (wrapped here by our custom `.parsemany` method) allows us to parse documents in parallel. The `chunksize` parameter can be adjusted according to available machine memory.
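
For readers who want to see the chunked pattern without doctable, here is a minimal sketch using only the standard library. It is an approximation under several assumptions: the `select_chunks` generator and `tokenize` function are hypothetical stand-ins, rows are paged by `id` rather than through whatever mechanism doctable uses internally, and a thread pool is used purely to keep the example short (for CPU-bound parsing a process pool would usually be the better choice).

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def select_chunks(conn, chunksize):
    '''Yield lists of (id, text) rows, at most chunksize rows per list.'''
    last_id = 0
    while True:
        chunk = conn.execute(
            'SELECT id, text FROM paragraphs WHERE id > ? ORDER BY id LIMIT ?',
            (last_id, chunksize),
        ).fetchall()
        if not chunk:
            return
        yield chunk
        last_id = chunk[-1][0]  # resume after the last id seen

def tokenize(text):
    '''Toy parser: lowercase and split on whitespace.'''
    return text.lower().split()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE paragraphs (id INTEGER PRIMARY KEY, text TEXT, num_tokens INTEGER)')
conn.executemany('INSERT INTO paragraphs (text) VALUES (?)',
                 [(f'Paragraph number {i}',) for i in range(5)])

chunk_sizes = []
with ThreadPoolExecutor(max_workers=2) as pool:
    for chunk in select_chunks(conn, chunksize=2):
        chunk_sizes.append(len(chunk))
        # parse every text in the chunk concurrently; results keep input order
        parsed = list(pool.map(tokenize, [text for _, text in chunk]))
        for (rowid, _), toks in zip(chunk, parsed):
            conn.execute('UPDATE paragraphs SET num_tokens = ? WHERE id = ?',
                         (len(toks), rowid))

token_counts = [r[0] for r in conn.execute('SELECT num_tokens FROM paragraphs ORDER BY id')]
print(chunk_sizes, token_counts)
```

Only one chunk of rows is held in memory at a time, so peak memory use scales with `chunksize` rather than with the size of the table.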