# Example 1: US National Security Strategy Document Corpus
In this example, I'll show how to create a database for document + metadata storage using the `DocTable` class, and a parser class using a `ParsePipeline`. We will store the metadata you see below with the raw text and parsed tokens in the same DocTable.

In [1]:
import sys
sys.path.append('..')
import doctable
import spacy
from tqdm import tqdm
import pandas as pd
import os
from pprint import pprint
import urllib.request # used for downloading nss docs

## Introduction to NSS Corpus
This dataset is the plain text version of the US National Security Strategy documents. During the parsing process, all plain text files will be downloaded from my [github project hosting the nss docs](https://github.com/devincornell/nssdocs). I compiled the metadata you see below from [a page hosted by the historical dept of the secretary's office](https://history.defense.gov/Historical-Sources/National-Security-Strategy/). In short, each US President must release at least one NSS per term, with some (namely Clinton) producing more.

Here I've created the function `download_nss` to download the text data from my nssdocs github repository, and the python dictionary `nss_metadata` to store information about each document to be stored in the database.

In [2]:
def download_nss(year):
    ''' Simple helper function for downloading texts from my nssdocs repo.'''
    baseurl = 'https://raw.githubusercontent.com/devincornell/nssdocs/master/docs/{}.txt'
    url = baseurl.format(year)
    text = urllib.request.urlopen(url).read().decode('utf-8')
    return text

In [3]:
nss_metadata = {
    1987: {'party': 'R', 'president': 'Reagan'}, 
    1993: {'party': 'R', 'president': 'H.W. Bush'}, 
    2002: {'party': 'R', 'president': 'W. Bush'}, 
    2015: {'party': 'D', 'president': 'Obama'}, 
    1994: {'party': 'D', 'president': 'Clinton'}, 
    1990: {'party': 'R', 'president': 'H.W. Bush'}, 
    1991: {'party': 'R', 'president': 'H.W. Bush'}, 
    2006: {'party': 'R', 'president': 'W. Bush'}, 
    1997: {'party': 'D', 'president': 'Clinton'}, 
    1995: {'party': 'D', 'president': 'Clinton'}, 
    1988: {'party': 'R', 'president': 'Reagan'}, 
    2017: {'party': 'R', 'president': 'Trump'}, 
    1996: {'party': 'D', 'president': 'Clinton'}, 
    2010: {'party': 'D', 'president': 'Obama'}, 
    1999: {'party': 'D', 'president': 'Clinton'}, 
    1998: {'party': 'D', 'president': 'Clinton'}, 
    2000: {'party': 'D', 'president': 'Clinton'}
}

In [4]:
# downloader example: first 100 characters of 1993 NSS document
text = download_nss(1993)
text[:100]

'Preface \n\nAmerican Leadership for Peaceful Change \n\nOur great Nation stands at a crossroads in histo'

## 1. Create a DocTable Schema
The `DocTable` class is often used by subclassing. Our `NSSDocs` class inherits from `DocTable` and will store connection and schema information. Because the default constructor checks for statically define member variables `tabname` and `schema` (as well as others), we can simply add them to the class definition. 

In this example, we create the 'id' column as a unique index, the 'year', 'president', and 'party' columns for storing the metadata we defined above in `nss_metadata`, and columns for raw and parse text. See the [schema guide](examples/doctable_schema.html) for examples of the full range of column types.

In [7]:
from dataclasses import dataclass
from typing import Any

@dataclass
class NSSDoc(doctable.DocTableSchema):
    id: int = doctable.IDCol()
    year: int = doctable.Col(nullable=False)
    president: str = doctable.Col()
    party: str = doctable.Col()
    text: str = doctable.Col()
    parsed: Any = doctable.Col()
        
    __constraints__ = [('check', 'party in ("R", "D")')]
    __indices__ = {'ind_yr': ('year', dict(unique=True))}

We can then create a connection to a database by instantiating the `NSSDocs` class. Since the `fname` parameter was not provided, this doctable exists only in memory using sqlite (uses special sqlite name ":memory:"). We will use this for these examples.

We can check the sqlite table schema using `.schema_table()`. You can see that the 'pickle' datatype we chose above is represented as a BLOB column. This is because DocTable, using SQLAlchemy core, creates an interface on top of sqlite to handle the data conversion. You can view the number of documents using `.count()` or by viewing the db instance as a string (in this case with print function).

In [29]:
# printing the DocTable object itself shows how many entries there are
db = doctable.DocTable(schema=NSSDoc, target=':memory:', verbose=True)
print(db.count())
print(db)
db.schema_table()

DocTable: SELECT count() AS count_1 
FROM _documents_
 LIMIT ? OFFSET ?
0
<DocTable (6 cols)::sqlite:///:memory::_documents_>


Unnamed: 0,name,type,nullable,default,autoincrement,primary_key
0,id,INTEGER,False,,auto,1
1,year,INTEGER,False,,auto,0
2,president,VARCHAR,True,,auto,0
3,party,VARCHAR,True,,auto,0
4,text,VARCHAR,True,,auto,0
5,parsed,BLOB,True,,auto,0


## 2. Insert Data Into the Table

Now let's download and store the text into the database. Each loop downloads a text document and inserts it into the doctable, and we use the `.insert()` method to insert a single row at a time. The row to be inserted is represented as a dictionary, and any missing column information is left as NULL. The `ifnotunique` argument is set to false because if we were to re-run this code, it needs to replace the existing document of the same year. Recall that in the schema we placed a unique constraint on the year column.

In [30]:
for year, docmeta in tqdm(nss_metadata.items()):
    text = download_nss(year)
    db.insert(NSSDoc(
        year=year, 
        party=docmeta['party'], 
        president=docmeta['president'], 
        text=text
    ), ifnotunique='replace')
db.head()

 12%|█▏        | 2/17 [00:00<00:02,  5.82it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 24%|██▎       | 4/17 [00:00<00:02,  6.29it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 35%|███▌      | 6/17 [00:00<00:01,  6.24it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 47%|████▋     | 8/17 [00:01<00:01,  5.93it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 59%|█████▉    | 10/17 [00:01<00:01,  5.87it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 71%|███████   | 12/17 [00:01<00:00,  6.02it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 76%|███████▋  | 13/17 [00:02<00:00,  6.10it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


 88%|████████▊ | 15/17 [00:02<00:00,  5.39it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)


100%|██████████| 17/17 [00:02<00:00,  5.84it/s]

DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: INSERT OR REPLACE INTO _documents_ (year, president, party, text) VALUES (?, ?, ?, ?)
DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?





Unnamed: 0,id,year,president,party,text,parsed
0,1,1987,Reagan,R,I. An American Perspective \n\nIn the early da...,
1,2,1993,H.W. Bush,R,Preface \n\nAmerican Leadership for Peaceful C...,
2,3,2002,W. Bush,R,The great struggles of the twentieth century b...,
3,4,2015,Obama,D,"Today, the United States is stronger and bette...",
4,5,1994,Clinton,D,Preface \n\nProtecting our nation's security -...,


## 3. Query Table Data
Now that we have inserted the NSS documents into the table, there are a few ways we can query the data. To select the first entry of the table use `.select_first()`. This method returns a simple `sqlalchemy.RowProxy` object which can be accessed like a dictionary or like a tuple.

In [31]:
row = db.select_first()
#print(row)
print(row['president'])

DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?
Reagan


To select more than one row, use the `.select()` method. If you'd only like to return the first few rows, you can use the `limit` argument.

In [32]:
rows = db.select(limit=2)
print(rows[0]['year'])
print(rows[1]['year'])

DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?
1987
1993


We can also select only a few columns.

In [33]:
db.select(['year', 'president'], limit=3)

DocTable: SELECT _documents_.year, _documents_.president 
FROM _documents_
 LIMIT ? OFFSET ?


[NSSDoc(id=EmptyValue, year=1987, president='Reagan', party=EmptyValue, text=EmptyValue, parsed=EmptyValue),
 NSSDoc(id=EmptyValue, year=1993, president='H.W. Bush', party=EmptyValue, text=EmptyValue, parsed=EmptyValue),
 NSSDoc(id=EmptyValue, year=2002, president='W. Bush', party=EmptyValue, text=EmptyValue, parsed=EmptyValue)]

For convenience, we can also use the `.select_df()` method to return directly as a pandas dataframe.

In [34]:
# use select_df to show a couple rows of our database
db.select_df(limit=2)

DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?


Unnamed: 0,id,year,president,party,text,parsed
0,1,1987,Reagan,R,I. An American Perspective \n\nIn the early da...,
1,2,1993,H.W. Bush,R,Preface \n\nAmerican Leadership for Peaceful C...,


## 4. Create a Parser for Tokenization
Now that the text is in the doctable, we can extract it using `.select()`, parse it, and store the parsed text back into the table using `.update()`.

Now we create a parser using `ParsePipeline` and a list of functions to apply to the text sequentially. The `Comp` function returns a [doctable parse function](ref/doctable.parse.html) with additional keyword arguments. For instance, the following two expressions would be the same.
```
doctable.component('keep_tok', keep_punct=True) # is equivalent to
lambda x: doctable.parse.parse_tok_func(x, keep_punct=True)
```
Note in this example that the 'tokenize' function takes two function arguments `keep_tok_func` and `parse_tok_func` which are also specified using the `.Comp()` function. The available pipeline components are listed in the [parse function documentation](ref/doctable.parse.html).

In [35]:
# first load a spacy model
nlp = spacy.load('en_core_web_sm')

# add pipeline components
parser = doctable.ParsePipeline([
    nlp, # first run spacy parser
    doctable.Comp('tokenize', **{
        'split_sents': False,
        'keep_tok_func': doctable.Comp('keep_tok'),
        'parse_tok_func': doctable.Comp('parse_tok'),
    })
])

parser.components

[<spacy.lang.en.English at 0x7fc33aa36a30>,
 functools.partial(<function tokenize at 0x7fc3524b8c10>, split_sents=False, keep_tok_func=functools.partial(<function keep_tok at 0x7fc3524b8d30>), parse_tok_func=functools.partial(<function parse_tok at 0x7fc3524b8ca0>))]

Now we loop through rows in the doctable and for each iteration parse the text and insert it back into the table using `.update()`. We use the `ParsePipeline` method `.parsemany()` to parse paragraphs from each document in parallel. This is much faster.

In [36]:
docs = db.select()
for doc in tqdm(docs):
    parsed = parser.parsemany(text.split('\n\n'), workers=30) # parse paragraphs in parallel
    doc.parsed = parsed
    print(doc.id)
    db.update_dataclass(doc, verbose=True)

  0%|          | 0/17 [00:00<?, ?it/s]

DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_


  6%|▌         | 1/17 [00:01<00:26,  1.63s/it]

1
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 12%|█▏        | 2/17 [00:03<00:24,  1.60s/it]

2
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 18%|█▊        | 3/17 [00:04<00:22,  1.60s/it]

3
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 24%|██▎       | 4/17 [00:06<00:21,  1.64s/it]

4
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 29%|██▉       | 5/17 [00:08<00:19,  1.60s/it]

5
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 35%|███▌      | 6/17 [00:09<00:17,  1.57s/it]

6
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 41%|████      | 7/17 [00:11<00:16,  1.65s/it]

7
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 47%|████▋     | 8/17 [00:13<00:15,  1.70s/it]

8
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 53%|█████▎    | 9/17 [00:15<00:13,  1.75s/it]

9
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 59%|█████▉    | 10/17 [00:16<00:12,  1.79s/it]

10
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 65%|██████▍   | 11/17 [00:18<00:10,  1.81s/it]

11
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 71%|███████   | 12/17 [00:20<00:09,  1.82s/it]

12
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 76%|███████▋  | 13/17 [00:22<00:07,  1.84s/it]

13
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 82%|████████▏ | 14/17 [00:24<00:05,  1.92s/it]

14
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 88%|████████▊ | 15/17 [00:26<00:03,  1.96s/it]

15
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


 94%|█████████▍| 16/17 [00:28<00:01,  1.97s/it]

16
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?


100%|██████████| 17/17 [00:30<00:00,  1.80s/it]

17
DocTable: UPDATE _documents_ SET id=?, year=?, president=?, party=?, text=?, parsed=? WHERE _documents_.id = ?





See the 'parsed' column in the dataframe below to view the paragraphs.

In [37]:
db.select_df(limit=3)

DocTable: SELECT _documents_.id, _documents_.year, _documents_.president, _documents_.party, _documents_.text, _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?


Unnamed: 0,id,year,president,party,text,parsed
0,1,1987,Reagan,R,I. An American Perspective \n\nIn the early da...,"[[as, we, enter, the, new, millennium, ,, we, ..."
1,2,1993,H.W. Bush,R,Preface \n\nAmerican Leadership for Peaceful C...,"[[as, we, enter, the, new, millennium, ,, we, ..."
2,3,2002,W. Bush,R,The great struggles of the twentieth century b...,"[[as, we, enter, the, new, millennium, ,, we, ..."


And here we show a few tokenized paragraphs.

In [38]:
paragraphs = db.select_first('parsed')
for par in paragraphs[:3]:
    print(par, '\n')

DocTable: SELECT _documents_.parsed 
FROM _documents_
 LIMIT ? OFFSET ?
['as', 'we', 'enter', 'the', 'new', 'millennium', ',', 'we', 'are', 'blessed', 'to', 'be', 'citizens', 'of', 'a', 'country', 'enjoying', 'record', 'prosperity', ',', 'with', 'no', 'deep', 'divisions', 'at', 'home', ',', 'no', 'overriding', 'external', 'threats', 'abroad', ',', 'and', 'history', "'s", 'most', 'powerful', 'military', 'ready', 'to', 'defend', 'our', 'interests', 'around', 'the', 'world', '.', 'Americans', 'of', 'earlier', 'eras', 'may', 'have', 'hoped', 'one', 'day', 'to', 'live', 'in', 'a', 'nation', 'that', 'could', 'claim', 'just', 'one', 'of', 'these', 'blessings', '.', 'probably', 'few', 'expected', 'to', 'experience', 'them', 'all', ';', 'fewer', 'still', 'all', 'at', 'once', '.'] 

['our', 'success', 'is', 'cause', 'for', 'pride', 'in', 'what', 'we', "'ve", 'done', ',', 'and', 'gratitude', 'for', 'what', 'we', 'have', 'inherited', '.', 'but', 'the', 'most', 'important', 'matter', 'is', 'what', 