# Parsetrees with `ParsePipeline`
Here I'll show you how to extract and store parsetrees in your doctable using spaCy + doctable. The motivation is that the parsetree information in raw spaCy `Doc` objects is very large, which makes those objects unsuitable for storage when working with large corpora. We solve this by converting each spaCy `Doc` object to a tree data structure built from Python lists and dictionaries.

To use this feature, add a `get_parsetrees` pipeline component after the spaCy parser. [Check the docs](ref/doctable.parse.html) to learn more about this function. You can see more examples of creating parse pipelines in our [overview examples](examples/parse_basics.html).
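
To make the idea of a tree built from plain lists and dictionaries concrete, here is a minimal sketch. It is not doctable's implementation: `FakeToken` and `token_to_dict` are hypothetical names, and the dependency structure is hand-written for illustration rather than produced by spaCy.

```python
import json
from dataclasses import dataclass, field
from typing import List

@dataclass
class FakeToken:
    """Hypothetical stand-in for a parsed token (not a spaCy class)."""
    text: str
    dep: str
    children: List["FakeToken"] = field(default_factory=list)

def token_to_dict(tok):
    """Recursively convert a token and its children to plain dicts and lists."""
    return {
        'text': tok.text,
        'dep': tok.dep,
        'children': [token_to_dict(c) for c in tok.children],
    }

# Hand-written structure for "Hello world." (for illustration only)
root = FakeToken('world', 'ROOT', [
    FakeToken('Hello', 'intj'),
    FakeToken('.', 'punct'),
])

tree = token_to_dict(root)

# The result contains only built-in types, so it serializes without custom code
serialized = json.dumps(tree)
```

Because the result holds only built-in types, it can be serialized and stored without keeping the original parser objects around.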

In [2]:
# add the parent directory to the path so the local doctable package can be imported
import sys
sys.path.append('..')
import doctable
from doctable import Comp

In [5]:
import spacy
nlp = spacy.load('en') # spaCy v2 shortcut; spaCy v3 needs a full model name such as 'en_core_web_sm'
parser = doctable.ParsePipeline([
    nlp, # the spacy parser
    Comp('get_parsetrees', **{
        'parse_tok_func': Comp('parse_tok', **{
            'num_replacement': 'NUM',
        })
    })
])
parser.components

[<spacy.lang.en.English at 0x7f248b027d30>,
 functools.partial(<function get_parsetrees at 0x7f2493cb4f28>, parse_tok_func=functools.partial(<function parse_tok at 0x7f2493cb4d08>, num_replacement='NUM'))]
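
The component list above shows that each `Comp(...)` entry ends up as a `functools.partial` with its keyword arguments already bound. The sketch below illustrates that general pattern with the standard library only, assuming components are applied in order with each output feeding the next. `split_sentences`, `tokenize`, and `run_pipeline` are hypothetical helpers, not part of doctable.

```python
from functools import partial

def split_sentences(text, sep='.'):
    """Hypothetical component: split text into non-empty sentence strings."""
    return [s.strip() for s in text.split(sep) if s.strip()]

def tokenize(sents, lowercase=True):
    """Hypothetical component: split each sentence into tokens."""
    return [[t.lower() if lowercase else t for t in s.split()] for s in sents]

def run_pipeline(components, doc):
    """Apply each component in order, passing the output to the next one."""
    for comp in components:
        doc = comp(doc)
    return doc

# Bind keyword arguments up front so every component takes a single argument
components = [
    partial(split_sentences, sep='.'),
    partial(tokenize, lowercase=True),
]

result = run_pipeline(components, 'Do. Or do not.')
# result == [['do'], ['or', 'do', 'not']]
```

Binding the options when the pipeline is built means the code that runs it only has to call each component with the document.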

We also define some example text documents, Star Wars themed.

In [9]:
docs = [
    'Hello world.',
    'Do. Or do not. There is no try.',
    'Help me, Obi-Wan Kenobi. You’re my only hope.',
    'I find your lack of faith disturbing.',
    'No. I am your father.',
    'It’s the ship that made the Kessel run in less than twelve parsecs. I’ve outrun Imperial starships.',
]

## Create and Store Parsetrees
First, we build a simple doctable for our example.

In [22]:
class PTreeTable(doctable.DocTable):
    schema = (
        ('idcol', 'id'),
        ('pickle', 'ptrees'), # store as raw python object
    )
    def __init__(self):
        super().__init__(schema=self.schema)
db = PTreeTable()
db

<DocTable::_documents_ ct: 0>

Now we parse each of the documents using the parser we made earlier. You can see that every parsed document is a list of `ParseTree` objects ([see docs](ref/doctable.parsetree.html)). These are special objects created to store parsetrees in a compact format.

In [23]:
for doc in docs:
    ptrees = parser.parse(doc)
    print(ptrees)

[ParseTree(['hello', 'world', '.'])]
[ParseTree(['do', '.']), ParseTree(['or', 'do', 'not', '.']), ParseTree(['there', 'is', 'no', 'try', '.'])]
[ParseTree(['help', 'me', ',', 'Obi', '-', 'Wan', 'Kenobi', '.']), ParseTree(['you', '’re', 'my', 'only', 'hope', '.'])]
[ParseTree(['i', 'find', 'your', 'lack', 'of', 'faith', 'disturbing', '.'])]
[ParseTree(['no', '.']), ParseTree(['i', 'am', 'your', 'father', '.'])]
[ParseTree(['it', '’s', 'the', 'ship', 'that', 'made', 'the', 'kessel', 'run', 'in', 'less', 'than', 'NUM', 'parsecs', '.']), ParseTree(['i', '’ve', 'outrun', 'Imperial', 'starships', '.'])]
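
In the output above, most tokens are lowercased, some capitalized names such as `Obi` and `Imperial` keep their case, and `twelve` becomes `NUM` because of the `num_replacement` setting. A rough sketch of token normalization that would behave this way follows. `normalize_token` and its `is_number` and `is_entity` flags are assumptions made for illustration; the rule about preserving entity case is inferred from the output and is not doctable's actual `parse_tok` signature or logic.

```python
def normalize_token(text, is_number=False, is_entity=False, num_replacement=None):
    """Hypothetical normalizer (not doctable's parse_tok).

    - numbers are swapped for num_replacement when one is given
    - tokens flagged as entities keep their original case
    - everything else is lowercased
    """
    if is_number and num_replacement is not None:
        return num_replacement
    if is_entity:
        return text
    return text.lower()

# Flags are set by hand here; in a real pipeline they would come from the parser
tokens = [
    normalize_token('Hello'),
    normalize_token('Obi', is_entity=True),
    normalize_token('twelve', is_number=True, num_replacement='NUM'),
]
# tokens == ['hello', 'Obi', 'NUM']
```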


Because `ParseTree` objects are plain Python objects, we can insert them directly into the DocTable through a pickle column.

In [27]:
for i,doc in enumerate(docs):
    ptrees = parser.parse(doc)
    db.insert({'id':i, 'ptrees':ptrees}, ifnotunique='replace')
db.select_df()

|   | id | ptrees |
|---|----|--------|
| 0 | 0 | [(ParseNode(hello), ParseNode(world), ParseNod... |
| 1 | 1 | [(ParseNode(do), ParseNode(.)), (ParseNode(or)... |
| 2 | 2 | [(ParseNode(help), ParseNode(me), ParseNode(,)... |
| 3 | 3 | [(ParseNode(i), ParseNode(find), ParseNode(you... |
| 4 | 4 | [(ParseNode(no), ParseNode(.)), (ParseNode(i),... |
| 5 | 5 | [(ParseNode(it), ParseNode(’s), ParseNode(the)... |
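
For intuition about what storing objects in a pickle column involves, here is a standard-library sketch that serializes a Python object to bytes, writes it to a binary column, and restores it. It assumes a pickle column amounts to this kind of round trip. The table layout and the nested-list stand-in for parsed output are hypothetical, and the sketch does not reproduce doctable's internals.

```python
import pickle
import sqlite3

# In-memory database with a binary column for the pickled object
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE documents (id INTEGER PRIMARY KEY, ptrees BLOB)')

# Hypothetical stand-in for parsed output: one token list per sentence
ptrees = [['do', '.'], ['or', 'do', 'not', '.']]

# Serialize to bytes on the way in; replace the row if the id already exists
conn.execute(
    'INSERT OR REPLACE INTO documents (id, ptrees) VALUES (?, ?)',
    (0, pickle.dumps(ptrees)),
)

# Deserialize on the way out
row = conn.execute('SELECT ptrees FROM documents WHERE id = ?', (0,)).fetchone()
restored = pickle.loads(row[0])
```

As with any use of `pickle`, only load data from a database you trust.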
