We're going to work through how to load a treebank into memory. The first thing we need to know is how to deal with the objects a treebank contains: trees. To structure the discussion, we'll use a motivating example: suppose I’m interested in finding all sentences with a definite determiner in the subject.

## Initial design

The first question we need to ask is: what are trees in the abstract?

An initial approximation is that a tree is something that is...
- ...empty (base case)
- ...a nonempty sequence of trees

In [11]:
from typing import List

class Tree:

    def __init__(self, children: List['Tree']=[]):
        self._children = children

Tree([Tree(), Tree()])

<__main__.Tree at 0x7f81e0d40760>
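
One caveat about the constructor signature: a default value of `[]` is created once, when the function is defined, and is shared by every call that omits `children`. The class above never mutates that list, so nothing goes wrong here. The following minimal sketch shows what could happen if a mutating method were added later, along with the usual `None`-default workaround. The class names and the `add_child` method are hypothetical and not part of these notes.

```python
from typing import List, Optional


class SharedDefaultNode:
    """Hypothetical class using a mutable default argument."""

    def __init__(self, children: List['SharedDefaultNode'] = []):
        self._children = children

    def add_child(self, child: 'SharedDefaultNode') -> None:
        # mutates whatever list was passed in, including the shared default
        self._children.append(child)


class FreshDefaultNode:
    """Hypothetical variant that builds a new list for every instance."""

    def __init__(self, children: Optional[List['FreshDefaultNode']] = None):
        self._children = [] if children is None else list(children)

    def add_child(self, child: 'FreshDefaultNode') -> None:
        self._children.append(child)


a, b = SharedDefaultNode(), SharedDefaultNode()
a.add_child(SharedDefaultNode())
shared_leak = len(b._children)   # b sees a's child through the shared default

c, d = FreshDefaultNode(), FreshDefaultNode()
c.add_child(FreshDefaultNode())
fresh_leak = len(d._children)    # d is unaffected

print(shared_leak, fresh_leak)
```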

One problem is that these abstract trees carry no data, so they aren’t very useful on their own. We can fix that by augmenting our definition.

A tree is something that is...
  - ...empty (base case)
  - ...a piece of data paired with a nonempty sequence of trees

In [12]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children

By convention, attributes whose names start with an underscore, such as `_data` and `_children`, are private and shouldn't be accessed from outside the class, so a common thing to do is to build read-only accessors using the `@property` decorator.

In [13]:
class Tree:
    
    def __init__(self, data, children: List['Tree']=[]):
        self._data = data
        self._children = children
  
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children

In [14]:
t = Tree('S', [Tree('NP', ['the', Tree('children')]), Tree('VP')])

t.children[0].data

'NP'
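
To see what "read-only" means in practice, here is a minimal sketch using a hypothetical `Labeled` class that keeps only the `data` accessor. Because no setter is defined, reading works but assignment raises an `AttributeError`.

```python
class Labeled:
    """Hypothetical stand-in for the Tree class, keeping only `data`."""

    def __init__(self, data):
        self._data = data

    @property
    def data(self):
        # no setter is defined, so `data` can be read but not assigned
        return self._data


node = Labeled('NP')
label = node.data

try:
    node.data = 'VP'
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(label, assignment_failed)
```

Note that `node._data = 'VP'` would still succeed: the leading underscore is a convention, not an access restriction.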

Our class doesn't currently enforce that the children be `Tree`s. To enforce this, we can build a private validator method into the initialization.

In [15]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

So now the following won't work.

In [16]:
try:
    Tree('S', ['NP', 'VP'])
except TypeError as e:
    print("TypeError:", e)

TypeError: all children must be trees


But these will.

In [17]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [18]:
tree1

<__main__.Tree at 0x7f81e0d481c0>
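
A note on the validator: `assert` statements are skipped when Python is run with the `-O` flag, so a check built on `assert` disappears in that mode. The sketch below is one possible alternative, not the approach used in these notes: it tests the condition directly and raises. The `CheckedTree` name is hypothetical.

```python
class CheckedTree:
    """Hypothetical variant of Tree whose validation does not use assert."""

    def __init__(self, data, children=None):
        self._data = data
        self._children = [] if children is None else list(children)
        self._validate()

    def _validate(self):
        # explicit check, so it still runs under `python -O`
        if not all(isinstance(c, CheckedTree) for c in self._children):
            raise TypeError('all children must be trees')


ok = CheckedTree('S', [CheckedTree('NP'), CheckedTree('VP')])

try:
    CheckedTree('S', ['NP', 'VP'])
    rejected = False
    message = ''
except TypeError as e:
    rejected = True
    message = str(e)

print(rejected, message)
```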

### Stringifying the tree

If we try to look at the tree, the result isn't very informative.

In [19]:
tree1

<__main__.Tree at 0x7f81e0d481c0>

This is because we need to tell Python how to display objects of our class. There are two obvious things to do: print the **yield** of the tree or print some representation of the tree itself. We implement both using two magic methods: `__str__` (what is shown when we call `print()`) and `__repr__` (what is shown when we evaluate the object in a cell).

We'll have `__str__` return the yield and `__repr__` return the tree representation.

In [20]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()

    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s
        
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

In [21]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

print(tree1)

a greyhound loves a greyhound


In [22]:
tree1

S
--NP
  --D
    --a
  --N
    --greyhound
--VP
  --V
    --loves
  --NP
    --D
      --a
    --N
      --greyhound
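
To see the division of labor between the two magic methods in isolation, here is a small sketch with a hypothetical `Word` class (not part of the `Tree` code). `str()` and `print()` use `__str__`, while `repr()`, the notebook's cell output, and containers displaying their elements use `__repr__`.

```python
class Word:
    """Hypothetical toy class with distinct __str__ and __repr__."""

    def __init__(self, form, tag):
        self.form = form
        self.tag = tag

    def __str__(self):
        # used by print() and str()
        return self.form

    def __repr__(self):
        # used by repr(), cell output, and containers showing their elements
        return f'Word({self.form!r}, {self.tag!r})'


w = Word('greyhound', 'N')
as_str = str(w)
as_repr = repr(w)
in_list = str([w])   # a list displays its elements with repr

print(as_str)
print(as_repr)
print(in_list)
```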

### Testing for containment

*Motivating example:* Suppose I’m interested in finding all
sentences with a definite determiner in a subject.

This means checking whether a particular subtree (corresponding to the subject) contains a particular element. Let's figure out how to compute containment, and then we'll get back to figuring out how to grab the relevant subtree.

*Question:* How do we implement containment tests?

*Answer:* Magic (instance) methods – in this case, `__contains__`

*Idea:* `__contains__` should take a piece of data and tell us whether it matches a piece of data somewhere in the tree.

In [23]:
'g' in ['g', 'b', 'q']

True
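
The `in` operator works on our own classes too: when a class defines `__contains__`, Python calls that method to evaluate `x in obj`. A minimal sketch with a hypothetical `Vocabulary` class (chosen only for illustration) shows the dispatch.

```python
class Vocabulary:
    """Hypothetical container that ignores case when testing membership."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def __contains__(self, word):
        # called for `word in vocab`; the result is interpreted as a boolean
        return word.lower() in self._words


vocab = Vocabulary(['the', 'a', 'greyhound'])
has_the = 'The' in vocab
has_loves = 'loves' in vocab

print(has_the, has_loves)
```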

In [24]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()

    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s
        
    def __contains__(self, data):
        raise NotImplementedError

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

*Question:* Right, but how do we implement containment tests?

*Answer:* Standard tree-search algorithms.

*Two kinds of algorithm:*

- depth-first search
- breadth-first search

In both kinds of search, we start at the top of the tree and work our way down. The question is the order in which we look at the nodes.

#### Depth-first search

*Three kinds:*

- Pre-order depth-first search
- In-order depth-first search
- Post-order depth-first search

##### Pre-order depth-first search

*Intuition:* Starting from the left-most child subtree and
moving right, look at the data at the root of that subtree and then do
pre-order depth-first search on that subtree. (The line is our traversal path and the dots are when we look at a piece of data in a node.)

![500px-Sorted_binary_tree_preorder.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Sorted_binary_tree_preorder.svg/500px-Sorted_binary_tree_preorder.svg.png)

In [25]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        return self.depth_first_search_pre(data)

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

    def depth_first_search_pre(self, data):
        if self._data == data:
            return True
        else:
            for c in self._children:
                found = c.depth_first_search_pre(data)
                
                if found:
                    return True
                
            return False

In [26]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [27]:
%%timeit
'a' in tree1

1.99 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [28]:
%%timeit
'loves' in tree1

3.75 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
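
The gap between the two timings is consistent with the visit order: pre-order search reaches the first `a` after four nodes but `loves` only after nine. The sketch below lists the order explicitly. It assumes a hypothetical nested-tuple encoding `(label, children)` of `tree1` rather than the `Tree` class, so that it can be run on its own.

```python
def preorder(node):
    """Yield labels in pre-order: the node itself, then each child subtree."""
    label, children = node
    yield label
    for child in children:
        yield from preorder(child)


# hypothetical tuple encoding of tree1: (label, list of child tuples)
np_ = ('NP', [('D', [('a', [])]), ('N', [('greyhound', [])])])
tree = ('S', [np_, ('VP', [('V', [('loves', [])]), np_])])

order = list(preorder(tree))
a_position = order.index('a') + 1
loves_position = order.index('loves') + 1

print(order)
print(a_position, loves_position)
```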


##### In-order depth-first search

*Intuition:* Do in-order depth-first search on the left-most
child subtree; look at the data at the root of that subtree; then do
in-order depth-first search on the remaining child subtrees.

![500px-Sorted_binary_tree_inorder.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Sorted_binary_tree_inorder.svg/500px-Sorted_binary_tree_inorder.svg.png)

In-order search only makes sense when each node has at most two children (otherwise a decision needs to be made about where to split the children into left and right children).

In [29]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        return self.depth_first_search_in(data)

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

    def depth_first_search_pre(self, data):
        if self._data == data:
            return True
        else:
            for c in self._children:
                found = c.depth_first_search_pre(data)
                
                if found:
                    return True
                
            return False

    def depth_first_search_in(self, data):
        self._validate_binary()

        if not len(self._children):
            return self.data == data
        elif self._children[0].depth_first_search_in(data):
            return True
        elif self.data == data:
            return True
        elif len(self._children) == 2:
            return self._children[1].depth_first_search_in(data)

        return False
        
    def _validate_binary(self):
        try:
            assert len(self._children) <= 2
        except AssertionError:
            errmsg = 'In-order depth-first search only defined'+\
                     ' for trees with at most two children'
            raise ValueError(errmsg)

In [30]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [31]:
'the' in tree1

False

In [32]:
%%timeit
'a' in tree1

3.67 µs ± 625 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [33]:
%%timeit
'loves' in tree1

21.4 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
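
For comparison with pre-order, the following sketch lists the order in which in-order search visits the nodes of `tree1`. It again assumes a hypothetical nested-tuple encoding `(label, children)` and, like `depth_first_search_in` above, treats an only child as the left child.

```python
def inorder(node):
    """Yield labels in-order for trees with at most two children per node."""
    label, children = node
    if len(children) > 2:
        raise ValueError('in-order traversal needs at most two children')
    if children:
        yield from inorder(children[0])   # left subtree first
    yield label                           # then the node itself
    if len(children) == 2:
        yield from inorder(children[1])   # then the right subtree


# hypothetical tuple encoding of tree1: (label, list of child tuples)
np_ = ('NP', [('D', [('a', [])]), ('N', [('greyhound', [])])])
tree = ('S', [np_, ('VP', [('V', [('loves', [])]), np_])])

order = list(inorder(tree))
a_position = order.index('a') + 1
loves_position = order.index('loves') + 1

print(order)
print(a_position, loves_position)
```

Here `a` is the first node visited and `loves` the seventh, with the root `S` sitting between the two halves of the tree.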


##### Post-order depth-first search

*Intuition:* Starting from the left-most child subtree and
moving right, do post-order depth-first search on that subtree, then look
at the data at the root of that subtree.

![500px-Sorted_binary_tree_postorder.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/9/9d/Sorted_binary_tree_postorder.svg/500px-Sorted_binary_tree_postorder.svg.png)

In [34]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        return self.depth_first_search_post(data)

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

    def depth_first_search_pre(self, data):
        if self._data == data:
            return True
        else:
            for c in self._children:
                found = c.depth_first_search_pre(data)
                
                if found:
                    return True
                
            return False

    def depth_first_search_in(self, data):
        self._validate_binary()

        if not len(self._children):
            return self.data == data
        elif self._children[0].depth_first_search_in(data):
            return True
        elif self.data == data:
            return True
        elif len(self._children) == 2:
            return self._children[1].depth_first_search_in(data)

        return False
        
    def _validate_binary(self):
        try:
            assert len(self._children) <= 2
        except AssertionError:
            errmsg = 'In-order depth-first search only defined'+\
                     ' for trees with at most two children'
            raise ValueError(errmsg)

    def depth_first_search_post(self, data):
        # search every child subtree first, from left to right...
        for c in self._children:
            if c.depth_first_search_post(data):
                return True

        # ...and only then look at the data at this node
        return self._data == data

In [35]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [36]:
'the' in tree1

False

In [37]:
'a' in tree1

True
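
In post-order, a node is looked at only after every one of its child subtrees has been searched in full, so the root always comes last. A small sketch, assuming a hypothetical nested-tuple encoding `(label, children)` of `tree1` instead of the `Tree` class, makes the order concrete and builds a containment test on top of it.

```python
def postorder(node):
    """Yield labels in post-order: all child subtrees, then the node itself."""
    label, children = node
    for child in children:
        yield from postorder(child)
    yield label


def contains_postorder(node, data):
    """Containment test that inspects nodes in post-order."""
    return any(label == data for label in postorder(node))


# hypothetical tuple encoding of tree1: (label, list of child tuples)
np_ = ('NP', [('D', [('a', [])]), ('N', [('greyhound', [])])])
tree = ('S', [np_, ('VP', [('V', [('loves', [])]), np_])])

order = list(postorder(tree))
found_loves = contains_postorder(tree, 'loves')
found_the = contains_postorder(tree, 'the')

print(order)
print(found_loves, found_the)
```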

#### Breadth-first search

*Intuition:* Look at the data in all nodes at depth $i$, then
do breadth-first search at depth $i+1$.

![500px-Sorted_binary_tree_breadth-first_traversal.svg.png](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Sorted_binary_tree_breadth-first_traversal.svg/500px-Sorted_binary_tree_breadth-first_traversal.svg.png)

In [38]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        return self.breadth_first_search(data)

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

    def depth_first_search_pre(self, data):
        if self._data == data:
            return True
        else:
            for c in self._children:
                found = c.depth_first_search_pre(data)
                
                if found:
                    return True
                
            return False

    def depth_first_search_in(self, data):
        self._validate_binary()

        if not len(self._children):
            return self.data == data
        elif self._children[0].depth_first_search_in(data):
            return True
        elif self.data == data:
            return True
        elif len(self._children) == 2:
            return self._children[1].depth_first_search_in(data)

        return False
        
    def _validate_binary(self):
        try:
            assert len(self._children) <= 2
        except AssertionError:
            errmsg = 'In-order depth-first search only defined'+\
                     ' for trees with at most two children'
            raise ValueError(errmsg)

    def depth_first_search_post(self, data):
        # search every child subtree first, from left to right...
        for c in self._children:
            if c.depth_first_search_post(data):
                return True

        # ...and only then look at the data at this node
        return self._data == data


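    def breadth_first_search(self, data):
        # search depth 0, then depth 1, and so on, stopping once
        # a depth is requested at which the tree has no nodes
        depth = 0
        while True:
            found, reached = self._iddfs(data, depth)

            if found:
                return True
            elif not reached:
                return False

            depth += 1


    def _iddfs(self, data, depth):
        # iterative deepening depth-first search; returns a pair
        # (found, reached): whether data occurs at exactly this depth
        # and whether the tree has any node at this depth
        if depth == 0:
            return self._data == data, True

        reached = False

        for c in self._children:
            found, child_reached = c._iddfs(data, depth-1)

            if found:
                return True, True

            reached = reached or child_reached

        return False, reached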

In [39]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [40]:
'the' in tree1

False

In [41]:
'a' in tree1

True
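
Breadth-first search is also commonly written with an explicit first-in, first-out queue rather than repeated depth-limited passes. The following is a sketch of that alternative, not the implementation used in these notes. It assumes a hypothetical nested-tuple encoding `(label, children)` of `tree1` and uses `collections.deque` from the standard library as the queue.

```python
from collections import deque


def breadth_first(node):
    """Yield labels level by level, left to right within each level."""
    queue = deque([node])
    while queue:
        label, children = queue.popleft()   # oldest node first
        yield label
        queue.extend(children)              # children wait behind their level


def contains_bfs(node, data):
    """Containment test that inspects nodes in breadth-first order."""
    return any(label == data for label in breadth_first(node))


# hypothetical tuple encoding of tree1: (label, list of child tuples)
np_ = ('NP', [('D', [('a', [])]), ('N', [('greyhound', [])])])
tree = ('S', [np_, ('VP', [('V', [('loves', [])]), np_])])

order = list(breadth_first(tree))
found_loves = contains_bfs(tree, 'loves')
found_the = contains_bfs(tree, 'the')

print(order)
print(found_loves, found_the)
```

Each node is visited exactly once this way, at the cost of holding up to a full level of the tree in the queue.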

In [42]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

#### Indexing subtrees

*Motivating example:* Suppose I’m interested in finding all
sentences with a definite determiner in a subject.

*Question:* How do we find particular subtrees?

*Answer:* When searching, return that subtree instead of a boolean.

*Question:* But where is that subtree with respect to other subtrees?

*Requirement:* A way of indexing trees

*Question:* How do we index trees?

*Possibility 1:* `int` representing when a particular search
algorithm visits a node.

In [43]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        return self.flattened[idx]
    
    def __len__(self):
        return len(self.flattened)

    @property
    def flattened(self):
        try:
            return self._flattened
        except AttributeError:
            # pre-order depth-first search
            self._flattened = [self] +\
                              [elem 
                               for c in self._children
                               for elem in c.flattened]
            return self._flattened

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

In [44]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [45]:
tree1[0]

S
--NP
  --D
    --a
  --N
    --greyhound
--VP
  --V
    --loves
  --NP
    --D
      --a
    --N
      --greyhound

In [46]:
tree1[1]

NP
--D
  --a
--N
  --greyhound

In [47]:
tree1[2]

D
--a

In [48]:
tree1[4]

N
--greyhound

In [49]:
for i in range(len(tree1)):
    print(i, tree1[i].data)

0 S
1 NP
2 D
3 a
4 N
5 greyhound
6 VP
7 V
8 loves
9 NP
10 D
11 a
12 N
13 greyhound


*Problem:* This indexation scheme makes it a bit hard to represent relations like parenthood or sisterhood in a tree.

*Possibility 2:* `tuple` representing the index path from the root to the node

In [50]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        idx = (idx,) if isinstance(idx, int) else idx
        
        try:
            assert all(isinstance(i, int) for i in idx)
            assert all(i >= 0 for i in idx)
        except AssertionError:
            errmsg = 'index must be a non-negative int or tuple of non-negative ints'
            raise IndexError(errmsg)
        
        if not idx:
            return self
        elif len(idx) == 1:
            return self._children[idx[0]]
        else:
            return self._children[idx[0]][idx[1:]]
            

    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)

In [51]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [52]:
tree1[tuple()]

S
--NP
  --D
    --a
  --N
    --greyhound
--VP
  --V
    --loves
  --NP
    --D
      --a
    --N
      --greyhound

In [53]:
tree1[0]

NP
--D
  --a
--N
  --greyhound

In [54]:
tree1[0,0]

D
--a

In [55]:
tree1[0,1]

N
--greyhound

In [56]:
tree1[0,1,0]

greyhound

In [57]:
tree1[1]

VP
--V
  --loves
--NP
  --D
    --a
  --N
    --greyhound

In [58]:
tree1[1,1]

NP
--D
  --a
--N
  --greyhound

In [59]:
tree1[1,1,0]

D
--a

In [60]:
tree1[1,1,0,0]

a

*Question:* This will get us from indices to trees, but how would we go from data to indices?

*Answer:* As with a `list`, we can implement an `index()` method. Unlike `list.index`, ours returns the index paths of every matching node rather than just the first.

In [61]:
class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        idx = (idx,) if isinstance(idx, int) else idx
        
        try:
            assert all(isinstance(i, int) for i in idx)
            assert all(i >= 0 for i in idx)
        except AssertionError:
            errmsg = 'index must be a non-negative int or tuple of non-negative ints'
            raise IndexError(errmsg)
        
        if not idx:
            return self
        elif len(idx) == 1:
            return self._children[idx[0]]
        else:
            return self._children[idx[0]][idx[1:]]
        
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)
            
    def index(self, data, index_path=tuple()):
        # the path to this node is a match if its data matches
        indices = [index_path] if self._data == data else []

        # collect matches from each child, extending the path
        # with that child's position
        indices += [j 
                    for i, c in enumerate(self._children) 
                    for j in c.index(data, index_path+(i,))]

        return indices

In [62]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('the')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [63]:
determiner_indices = tree1.index('D')

determiner_indices

[(0, 0), (1, 1, 0)]

In [64]:
tree1[determiner_indices[0]]

D
--a

In [65]:
tree1[determiner_indices[1]]

D
--the
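
Index paths get us most of the way to the motivating example. The sketch below is self-contained, so it assumes a hypothetical nested-tuple encoding `(label, children)` of the tree above, with stand-alone functions in place of the `Tree` methods. It also assumes that the subject is the first child of `S`, which holds for this toy tree but is not guaranteed in a real treebank.

```python
def index_paths(node, data, path=()):
    """Return the index paths of all nodes labeled `data`, in pre-order."""
    label, children = node
    paths = [path] if label == data else []
    for i, child in enumerate(children):
        paths += index_paths(child, data, path + (i,))
    return paths


def subtree_at(node, path):
    """Follow an index path down from `node`."""
    for i in path:
        node = node[1][i]
    return node


def contains(node, data):
    """Pre-order containment test on the tuple encoding."""
    label, children = node
    return label == data or any(contains(c, data) for c in children)


# hypothetical tuple encoding of the tree above: 'a' subject, 'the' object
tree = ('S', [('NP', [('D', [('a', [])]), ('N', [('greyhound', [])])]),
              ('VP', [('V', [('loves', [])]),
                      ('NP', [('D', [('the', [])]),
                              ('N', [('greyhound', [])])])])])

SUBJECT = (0,)   # assumption: the subject is the first child of S

det_paths = index_paths(tree, 'D')

# a determiner is in the subject if its path starts with the subject's path
definite_in_subject = any(
    p[:len(SUBJECT)] == SUBJECT and contains(subtree_at(tree, p), 'the')
    for p in det_paths
)

print(det_paths, definite_in_subject)
```

For this tree the answer is `False`: the only definite determiner sits inside the object, at path `(1, 1, 0)`.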

#### Matching tree patterns

*Motivating example:* Suppose I’m interested in finding all
sentences with a definite determiner in a subject.

*Question:* How would we search for tree patterns?

*Possibility:* use trees as the pattern to match against!

In [66]:
tree_pattern = Tree('S', 
                    [Tree('NP',
                          [Tree('D', 
                                [Tree('the')])]),
                     Tree('VP')])

tree_pattern

S
--NP
  --D
    --the
--VP

In [67]:
list(zip([0, 3, 6], [9, 1, 5, 8]))

[(0, 9), (3, 1), (6, 5)]

Note that `zip` stops as soon as the shorter of its inputs is exhausted. This is handy for pairing a tree's children with a pattern's children, since a pattern typically specifies only some of them.

*Exercise:* how would we implement the `find()` method?

In [68]:
from typing import Optional, List

class Tree:
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self._children[idx]
        elif len(idx) == 1:
            return self._children[idx[0]]
        elif idx:
            return self._children[idx[0]].__getitem__(idx[1:])
        else:
            return self
        
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)
            
    def index(self, data, index_path=tuple()):
        indices = [index_path] if self._data==data else []
        root_path = tuple() if index_path == -1 else index_path
        
        indices += [j 
                    for i, c in enumerate(self._children) 
                    for j in c.index(data, root_path+(i,))]

        return indices
            
    def find(self, pattern: 'Tree', 
             subtree_idx: tuple=tuple()) -> List[tuple]:
        '''The subtrees matching the pattern
        
        Parameters
        ----------
        pattern
            the tree pattern to match against
        subtree_idx
            the index of the subtree within the tree pattern to return
            defaults to the entire match
        '''
        
        #raise NotImplementedError
        
        match_indices = [i + subtree_idx
                         for i in self.index(pattern.data) 
                         if self[i].match(pattern)]
            
        return match_indices
   
    def match(self, pattern):
        if self._data != pattern.data:
            return False
        
        for child1, child2 in zip(self._children, pattern.children):
            if not child1.match(child2):
                return False
                
        return True

In [69]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree2 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('the')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree3 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('the')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree4 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('the')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('the')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [70]:
tree2.find(tree_pattern, (0,0))

[(0, 0)]

In [71]:
tree_pattern = Tree('VP', 
                    [Tree('V'),
                     Tree('NP', 
                          [Tree('D', 
                                [Tree('the')])])])

tree_pattern

VP
--V
--NP
  --D
    --the

In [72]:
tree1.find(tree_pattern, subtree_idx=(1,))

[]

In [73]:
tree2.find(tree_pattern, subtree_idx=(1,))

[]

In [74]:
tree3.find(tree_pattern, subtree_idx=(1,))

[(1, 1)]

In [75]:
tree4.find(tree_pattern, subtree_idx=(1,))

[(1, 1)]
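
One caveat with the `match` method above: because `zip` stops at the shorter of its two inputs, a pattern node with more children than the tree node it is compared against still counts as a match. The following is a minimal sketch of that behavior and of one possible stricter variant. `MiniTree` and `match_strict` are hypothetical names introduced here for illustration; they are not part of the `Tree` class above.

```python
from typing import List, Optional


class MiniTree:
    """A stripped-down, hypothetical stand-in for the Tree class."""

    def __init__(self, data: str,
                 children: Optional[List['MiniTree']] = None):
        self.data = data
        self.children = children if children is not None else []

    def match(self, pattern: 'MiniTree') -> bool:
        # prefix-style matching via zip, mirroring the method above
        if self.data != pattern.data:
            return False
        return all(c.match(p)
                   for c, p in zip(self.children, pattern.children))

    def match_strict(self, pattern: 'MiniTree') -> bool:
        # same idea, but reject patterns that ask for more children
        # than this node actually has
        if self.data != pattern.data:
            return False
        if len(pattern.children) > len(self.children):
            return False
        return all(c.match_strict(p)
                   for c, p in zip(self.children, pattern.children))


pattern = MiniTree('NP', [MiniTree('D', [MiniTree('the')])])

bare_np = MiniTree('NP')  # an NP with no children at all
full_np = MiniTree('NP', [MiniTree('D', [MiniTree('the')]),
                          MiniTree('N', [MiniTree('greyhound')])])

print(bare_np.match(pattern))         # True: zip pairs nothing, so nothing fails
print(bare_np.match_strict(pattern))  # False: the pattern wants a child
print(full_np.match_strict(pattern))  # True
```

For the four example trees above, both variants should give the same results, since every matched node has at least as many children as the corresponding pattern node.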

Now try matching `tree1`, `tree2`, `tree3`, and `tree4`, looking for direct objects with a definite determiner.

In [76]:
## Insert code here

This sort of treelet-based matching is somewhat weak as it stands. What if we wanted:

1. ...nodes to be allowed to have some value from a set? 
2. ...arbitrary distance between the nodes we are matching on?
3. ...arbitrary boolean conditions on node matches?

To handle this, we need both a *domain-specific language* (DSL) for specifying such queries and an *interpreter* for that language. We can use [SPARQL](https://en.wikipedia.org/wiki/SPARQL) for our DSL. To interpret SPARQL, we will use the existing interpreter in [`rdflib`](https://github.com/RDFLib/rdflib).

First, we need to install `rdflib`, which provides the SPARQL interpreter.

In [77]:
## Uncomment if you don't yet have these packages

!pip install rdflib
!pip install requests

# import sys
# !conda install --yes --prefix {sys.prefix} rdflib requests

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rdflib
  Downloading rdflib-6.3.2-py3-none-any.whl (528 kB)
Collecting isodate<0.7.0,>=0.6.0
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: isodate, rdflib
Successfully installed isodate-0.6.1 rdflib-6.3.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Then, we need to map our `Tree` objects into an in-memory format for which a SPARQL interpreter is already implemented. We will use the [Resource Description Framework](https://en.wikipedia.org/wiki/Resource_Description_Framework) (RDF) as implemented in `rdflib`.

In [78]:
from rdflib import Graph, URIRef

class Tree:
    
    RDF_TYPES = {}
    RDF_EDGES = {'is': URIRef('is-a'),
                 'parent': URIRef('is-the-parent-of'),
                 'child': URIRef('is-a-child-of'),
                 'sister': URIRef('is-a-sister-of')}
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self._children[idx]
        elif len(idx) == 1:
            return self._children[idx[0]]
        elif idx:
            return self._children[idx[0]].__getitem__(idx[1:])
        else:
            return self
        
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)
            
    def index(self, data, index_path=tuple()):
        indices = [index_path] if self._data==data else []
        root_path = tuple() if index_path == -1 else index_path
        
        indices += [j 
                    for i, c in enumerate(self._children) 
                    for j in c.index(data, root_path+(i,))]

        return indices
            
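    def to_rdf(self, graph=None, nodes=None, idx=tuple()):
        graph = Graph() if graph is None else graph
        # use a fresh node map per top-level call, not a shared mutable default
        nodes = {} if nodes is None else nodes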
        
        idxstr = '_'.join(str(i) for i in idx)
        nodes[idx] = URIRef(idxstr)
            
        if self._data not in Tree.RDF_TYPES:
            Tree.RDF_TYPES[self._data] = URIRef(self._data)

        typetriple = (nodes[idx], 
                      Tree.RDF_EDGES['is'],
                      Tree.RDF_TYPES[self.data])

        graph.add(typetriple)

        for i, child in enumerate(self._children):
            childidx = idx+(i,)
            child.to_rdf(graph, nodes, childidx)
                
            partriple = (nodes[idx], 
                         Tree.RDF_EDGES['parent'],
                         nodes[childidx])
            chitriple = (nodes[childidx], 
                         Tree.RDF_EDGES['child'],
                         nodes[idx])
            
            graph.add(partriple)
            graph.add(chitriple)
            
        for i, child1 in enumerate(self._children):
            for j, child2 in enumerate(self._children):
                child1idx = idx+(i,)
                child2idx = idx+(j,)
                sistriple = (nodes[child1idx], 
                             Tree.RDF_EDGES['sister'],
                             nodes[child2idx])
                
                graph.add(sistriple)
        
        self._rdf_nodes = nodes
        
        return graph
    
    @property
    def rdf(self):
        return self.to_rdf()
    
    def find(self, query):
        return [tuple([int(i) 
                       for i in str(res[0]).split('_')]) 
                for res in self.rdf.query(query)]
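
The `find` method above recovers a tuple index from each node URI by splitting on `_`. One edge case is worth noting: the root node has the empty tuple as its index, so its URI string is empty, and `int('')` raises a `ValueError` if a query ever selects the root. A minimal sketch of a conversion that handles this, with `index_to_uri` and `uri_to_index` as hypothetical helper names that are not part of the class above:

```python
from typing import Tuple


def index_to_uri(idx: Tuple[int, ...]) -> str:
    # same encoding as in to_rdf: (1, 1, 0) -> '1_1_0', () -> ''
    return '_'.join(str(i) for i in idx)


def uri_to_index(uri: str) -> Tuple[int, ...]:
    # ''.split('_') gives [''], which int() cannot parse,
    # so the empty string (the root) is treated separately
    if not uri:
        return tuple()
    return tuple(int(i) for i in uri.split('_'))


print(uri_to_index('1_1'))               # (1, 1)
print(uri_to_index(index_to_uri(())))    # ()
```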

In [79]:
tree1 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree2 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('the')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('a')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree3 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('a')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('the')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

tree4 = Tree('S', 
             [Tree('NP', 
                   [Tree('D', 
                         [Tree('the')]),
                    Tree('N', 
                         [Tree('greyhound')])]),
             Tree('VP', 
                   [Tree('V', 
                         [Tree('loves')]),
                    Tree('NP',
                         [Tree('D',
                               [Tree('the')]),
                          Tree('N',
                               [Tree('greyhound')])])])])

In [80]:
tree1.find('''SELECT ?node
              WHERE { ?node <is-a> <NP>.
                      ?node <is-the-parent-of>* ?child.
                      ?node <is-a-child-of>* ?parent.
                      ?parent <is-a> <S>.
                      ?child <is-a> <the>.
                      ?node <is-a-sister-of> ?sister.
                      ?sister <is-a> <VP>.
                    }''')

[]

In [81]:
tree2.find('''SELECT ?node
              WHERE { ?node <is-a> <NP>.
                      ?node <is-the-parent-of>* ?child.
                      ?child <is-a> <the>.
                      ?node <is-a-sister-of> ?sister.
                      ?sister <is-a> <VP>.
                    }''')

[(0,)]

In [82]:
tree2.find('''SELECT ?node
              WHERE { ?node <is-a> <NP>;
                            <is-the-parent-of>* ?child;
                            <is-a-sister-of> ?sister.
                      ?child <is-a> <the>.
                      ?sister <is-a> <VP>.
                    }''')

[(0,)]

In [83]:
tree3.find('''SELECT ?node
              WHERE { ?node <is-a> <NP>;
                            <is-the-parent-of>* ?child;
                            <is-a-sister-of> ?sister.
                      ?child <is-a> <the>.
                      ?sister <is-a> <VP>.
                    }''')

[]

In [84]:
tree4.find('''SELECT ?node
              WHERE { ?node <is-a> <NP>;
                            <is-the-parent-of>* ?child;
                            <is-a-sister-of> ?sister.
                      ?child <is-a> <the>.
                      ?sister <is-a> <V>.
                    }''')

[(1, 1)]

In [85]:
tree4.rdf

<Graph identifier=N84a3afd7c2b14d72af03209abe2bbda8 (<class 'rdflib.graph.Graph'>)>

### Building a corpus reader

Now that we can search over individual trees, let's see how to automatically load all trees from a corpus. We'll use the constituency-parsed [English Web TreeBank](https://catalog.ldc.upenn.edu/LDC2012T13) for this purpose. This corpus is separated into different genres, sources, and documents, with each `.tree` file containing one or more parse trees (one per line).

In [86]:
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1ygMIl1w6wz6A24oxkLwirunSKXb9EW12' -O 'LDC2012T13.tgz'

--2023-04-04 14:52:41--  https://drive.google.com/uc?export=download&id=1ygMIl1w6wz6A24oxkLwirunSKXb9EW12
Resolving drive.google.com (drive.google.com)... 64.233.187.100, 64.233.187.113, 64.233.187.139, ...
Connecting to drive.google.com (drive.google.com)|64.233.187.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-0o-9c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/03pdrrip7409rc5bsq4cf71e7c2hoh41/1680619950000/06256629009318567325/*/1ygMIl1w6wz6A24oxkLwirunSKXb9EW12?e=download&uuid=0a18c79a-61c9-4aef-a094-a7ff9e5d2ef1 [following]
--2023-04-04 14:53:09--  https://doc-0o-9c-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/03pdrrip7409rc5bsq4cf71e7c2hoh41/1680619950000/06256629009318567325/*/1ygMIl1w6wz6A24oxkLwirunSKXb9EW12?e=download&uuid=0a18c79a-61c9-4aef-a094-a7ff9e5d2ef1
Resolving doc-0o-9c-docs.googleusercontent.com (doc-0o-9c-docs.googleusercontent.com)... 74.125.203.132, 2404:

In [87]:
!tar -xzf LDC2012T13.tgz --to-command=cat 'eng_web_tbk/data/newsgroup/penntree/groups.google.com_8TRACKGROUPFORCOOLPEOPLE_3b43577fb9121c9f_ENG_20050320_090500.xml.tree'

( (S (S-IMP (NP-SBJ (-NONE- *PRO*)) (VP (VB Play) (NP (PRP$ your) (NML (NML (NNS CD's)) (, ,) (NML (CD 8) (HYPH -) (NNS tracks)) (, ,) (NML (NML (NN reel)) (PP (IN to) (NP (NNS reels)))) (, ,) (NML (NNS cassettes)) (, ,) (NML (NN vinyl) (CD 33) (SYM /) (NNS 45's)) (, ,) (CC and) (NML (NN shellac) (NNS 78's)))) (PP-MNR (IN through) (NP (DT this) (JJ little) (JJ integrated) (NN amp))))) (, ,) (S (NP-SBJ (PRP you)) (VP (MD 'll) (VP (VB get) (NP (DT a) (JJ big) (NN eye) (NN opener))))) (. !)) )
( (FRAG (ADJP (JJ complete) (PP (IN with) (NP (JJ original) (NNP Magnavox) (NNS tubes)))) (, -) (S (S (NP-SBJ-1 (DT all) (NNS tubes)) (VP (VBP have) (VP (VBN been) (VP (VBN tested) (NP-1 (-NONE- *)))))) (S (NP-SBJ (PRP they)) (VP (VBP are) (RB all) (ADJP-PRD (JJ good))))) (, -) (NP (NN stereo) (NN amp))) )


We will talk about how to actually parse these sorts of strings against a grammar later in the class, but for current purposes, we'll use [`pyparsing`](https://github.com/pyparsing/pyparsing) to define a grammar and parse these strings to a list of lists.

In [88]:
## Uncomment if you don't yet have this package
!pip install --upgrade pyparsing

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [89]:
import pyparsing

LPAR = pyparsing.Suppress('(')
RPAR = pyparsing.Suppress(')')
data = pyparsing.Regex(r'[^\(\)\s]+')

exp = pyparsing.Forward()
expList = pyparsing.Group(LPAR + data + exp[...] + RPAR)
exp <<= data | expList
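
To get a feel for what this grammar accepts, here is a rough hand-written equivalent using only the standard library. It is a sketch under the assumption that parentheses are balanced and that tokens never contain parentheses or whitespace. `parse_sexpr` is a hypothetical name, and unlike the `pyparsing` grammar it does not require each group to start with a label.

```python
import re
from typing import List, Union

Nested = Union[str, List['Nested']]

# a token is an opening paren, a closing paren, or a run of other
# non-whitespace characters (the same character class as `data` above)
TOKEN = re.compile(r'\(|\)|[^\(\)\s]+')


def parse_sexpr(s: str) -> Nested:
    tokens = TOKEN.findall(s)
    pos = 0

    def parse() -> Nested:
        nonlocal pos
        tok = tokens[pos]
        pos += 1

        if tok == '(':
            items = []
            # unbalanced input will run off the end and raise IndexError
            while tokens[pos] != ')':
                items.append(parse())
            pos += 1  # consume the closing paren
            return items

        return tok

    return parse()


parsed = parse_sexpr('(S (NP (DT the) (NN dog)) (VP (VBZ barks)))')
print(parsed)
# ['S', ['NP', ['DT', 'the'], ['NN', 'dog']], ['VP', ['VBZ', 'barks']]]
```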

In [90]:
import tarfile
from pprint import pprint

fname = "eng_web_tbk/data/newsgroup/penntree/groups.google.com_8TRACKGROUPFORCOOLPEOPLE_3b43577fb9121c9f_ENG_20050320_090500.xml.tree"

with tarfile.open("LDC2012T13.tgz") as corpus:
    with corpus.extractfile(fname) as treefile:
        treestr = treefile.readline().decode()[2:-2]
        treelist = exp.parseString(treestr)[0]
    
treelist

ParseResults(['S', ParseResults(['S-IMP', ParseResults(['NP-SBJ', ParseResults(['-NONE-', '*PRO*'], {})], {}), ParseResults(['VP', ParseResults(['VB', 'Play'], {}), ParseResults(['NP', ParseResults(['PRP$', 'your'], {}), ParseResults(['NML', ParseResults(['NML', ParseResults(['NNS', "CD's"], {})], {}), ParseResults([',', ','], {}), ParseResults(['NML', ParseResults(['CD', '8'], {}), ParseResults(['HYPH', '-'], {}), ParseResults(['NNS', 'tracks'], {})], {}), ParseResults([',', ','], {}), ParseResults(['NML', ParseResults(['NML', ParseResults(['NN', 'reel'], {})], {}), ParseResults(['PP', ParseResults(['IN', 'to'], {}), ParseResults(['NP', ParseResults(['NNS', 'reels'], {})], {})], {})], {}), ParseResults([',', ','], {}), ParseResults(['NML', ParseResults(['NNS', 'cassettes'], {})], {}), ParseResults([',', ','], {}), ParseResults(['NML', ParseResults(['NN', 'vinyl'], {}), ParseResults(['CD', '33'], {}), ParseResults(['SYM', '/'], {}), ParseResults(['NNS', "45's"], {})], {}), ParseResults

*Exercise:* given such a list of lists, how should we build a `Tree`?

In [91]:
class Tree:
    
    RDF_TYPES = {}
    RDF_EDGES = {'is': URIRef('is-a'),
                 'parent': URIRef('is-the-parent-of'),
                 'child': URIRef('is-a-child-of'),
                 'sister': URIRef('is-a-sister-of')}
    
    PARSER = exp
    
    def __init__(self, data, children=[]):
        self._data = data
        self._children = children
        
        self._validate()
  
    def __str__(self):
        if self._children:
            return ' '.join(c.__str__() for c in self._children)
        else:
            return str(self._data)
        
    def __repr__(self):
        return self.to_string(0)
     
    def to_string(self, depth):
        s = (depth - 1) * '  ' +\
            int(depth > 0) * '--' +\
            self._data + '\n'
        s += ''.join(c.to_string(depth+1)
                     for c in self._children)
        
        return s

    def __contains__(self, data):
        # pre-order depth-first search
        if self._data == data:
            return True
        else:
            for child in self._children:
                if data in child:
                    return True
                
            return False
        
    def __getitem__(self, idx):
        if isinstance(idx, int):
            return self._children[idx]
        elif len(idx) == 1:
            return self._children[idx[0]]
        elif idx:
            return self._children[idx[0]].__getitem__(idx[1:])
        else:
            return self
        
    @property
    def data(self):
        return self._data 
    
    @property
    def children(self):
        return self._children
        
    def _validate(self):
        try:
            assert all(isinstance(c, Tree)
                       for c in self._children)
        except AssertionError:
            msg = 'all children must be trees'
            raise TypeError(msg)
            
    def index(self, data, index_path=tuple()):
        indices = [index_path] if self._data==data else []
        root_path = tuple() if index_path == -1 else index_path
        
        indices += [j 
                    for i, c in enumerate(self._children) 
                    for j in c.index(data, root_path+(i,))]

        return indices
            
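    def to_rdf(self, graph=None, nodes=None, idx=tuple()):
        graph = Graph() if graph is None else graph
        # use a fresh node map per top-level call, not a shared mutable default
        nodes = {} if nodes is None else nodes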
        
        idxstr = '_'.join(str(i) for i in idx)
        nodes[idx] = URIRef(idxstr)
            
        if self._data not in Tree.RDF_TYPES:
            Tree.RDF_TYPES[self._data] = URIRef(self._data)

        typetriple = (nodes[idx], 
                      Tree.RDF_EDGES['is'],
                      Tree.RDF_TYPES[self.data])

        graph.add(typetriple)

        for i, child in enumerate(self._children):
            childidx = idx+(i,)
            child.to_rdf(graph, nodes, childidx)
                
            partriple = (nodes[idx], 
                         Tree.RDF_EDGES['parent'],
                         nodes[childidx])
            chitriple = (nodes[childidx], 
                         Tree.RDF_EDGES['child'],
                         nodes[idx])
            
            graph.add(partriple)
            graph.add(chitriple)
            
        for i, child1 in enumerate(self._children):
            for j, child2 in enumerate(self._children):
                child1idx = idx+(i,)
                child2idx = idx+(j,)
                sistriple = (nodes[child1idx], 
                             Tree.RDF_EDGES['sister'],
                             nodes[child2idx])
                
                graph.add(sistriple)
        
        self._rdf_nodes = nodes
        
        return graph
    
    @property
    def rdf(self):
        return self.to_rdf()
    
    def find(self, query):
        return [tuple([int(i) 
                       for i in str(res[0]).split('_')]) 
                for res in self.rdf.query(query)]
    
    @classmethod
    def from_string(cls, treestr):
        treelist = cls.PARSER.parseString(treestr[2:-2])[0]
        return cls.from_list(treelist)
    
    @classmethod
    def from_list(cls, treelist):
        if isinstance(treelist, str):
            return cls(treelist)
        elif isinstance(treelist[1], str):
            return cls(treelist[0], [cls(treelist[1])])
        else:
            return cls(treelist[0], [cls.from_list(l) for l in treelist[1:]])

We can now build a lightweight container for our trees.

In [92]:
import tarfile
from collections import defaultdict

class EnglishWebTreebank:
    
    def __init__(self, root='LDC2012T13.tgz'):
        
        def trees():
            with tarfile.open(root) as corpus:
                for fname in corpus.getnames():
                    if '.xml.tree' in fname:
                        with corpus.extractfile(fname) as treefile:
                            treestr = treefile.readline().decode()
                            yield fname, Tree.from_string(treestr)
                        
        self._trees = trees()
                        
    def items(self):
        for fn, tlist in self._trees:
            yield fn, tlist
        
ewt = EnglishWebTreebank()

next(ewt.items())

('eng_web_tbk/data/answers/penntree/20070404104007AAY1Chs_ans.xml.tree',
 S
 --SBARQ
   --WHADVP-9
     --WRB
       --where
   --SQ
     --MD
       --can
     --NP-SBJ
       --PRP
         --I
     --VP
       --VB
         --get
       --NP
         --NNS
           --morcillas
       --PP-LOC
         --IN
           --in
         --NP
           --NNP
             --tampa
           --NNP
             --bay
       --ADVP-LOC-9
         ---NONE-
           --*T*
 --,
   --,
 --S
   --S
     --NP-SBJ
       --PRP
         --I
     --VP
       --MD
         --will
       --VP
         --VB
           --like
         --NP
           --DT
             --the
           --JJ
             --argentinian
           --NN
             --type
   --,
     --,
   --CC
     --but
   --S
     --NP-SBJ-1
       --PRP
         --I
     --VP
       --MD
         --will
       --S
         --NP-SBJ-1
           ---NONE-
             --*PRO*
         --VP
           --TO
             --to
           -
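
Note that the reader above calls `readline()` once per file, so it yields only the first tree of each document, even though a `.tree` file may hold several. A sketch of how one might iterate over every line instead, demonstrated on a tiny in-memory archive so that it runs without the corpus. `iter_tree_lines` and the toy file name are assumptions made for illustration:

```python
import io
import tarfile


def iter_tree_lines(archive: tarfile.TarFile, suffix: str = '.xml.tree'):
    """Yield (file name, tree string) for every non-empty line."""
    for member in archive.getmembers():
        if member.isfile() and member.name.endswith(suffix):
            with archive.extractfile(member) as treefile:
                for raw in treefile:
                    line = raw.decode().strip()
                    if line:
                        yield member.name, line


# build a toy gzipped tar archive holding one file with two trees
payload = (b'( (S (NP (PRP I)) (VP (VBP run))) )\n'
           b'( (S (NP (PRP you)) (VP (VBP walk))) )\n')

buffer = io.BytesIO()

with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
    info = tarfile.TarInfo(name='toy/doc1.xml.tree')
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

buffer.seek(0)

with tarfile.open(fileobj=buffer, mode='r:gz') as archive:
    lines = list(iter_tree_lines(archive))

for fname, line in lines:
    print(fname, line)
```

Each yielded string could then be handed to `Tree.from_string`. Because the lines are stripped here, the `[2:-2]` slice in `from_string` would remove the outer `( ` and ` )` rather than `( ` and a closing parenthesis plus newline, which leaves the same inner expression.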

Now, we can run arbitrary queries across trees.

In [94]:
ewt = EnglishWebTreebank()

for _, tree in ewt.items():
    idx = tree.find('''SELECT ?node
                       WHERE { ?node <is-a> <NP-SBJ>;
                                     <is-the-parent-of>* ?child.
                               ?child <is-a> <the>.
                              }''')

    for i in idx:
        print(tree)
        print(tree[i])
        print()

where can I get morcillas in tampa bay *T* , I will like the argentinian type , but I will *PRO* to try anothers please ?
I

where can I get morcillas in tampa bay *T* , I will like the argentinian type , but I will *PRO* to try anothers please ?
I

What do you eat *T* in Miramar ?
you

Which of these do you like *T* : McDonald s , Burger King , Taco Bell , Wendy s ?
you

Do you think *0* there are any koreans in Miramar ?
you

What foods do you eat *T* in Miramar ?
you

Do you prefer ham , bacon or sausages with your breakfast ?
you

Which do you prefer *T* Crab or Shrimp ?
you

Can you recommend any restaurants in Buenos Aires ?
you

I have hundreds of VHS movies lying around ... what should I do *T* with them ?
I

I have hundreds of VHS movies lying around ... what should I do *T* with them ?
I

I have a Nacho Libre question .?
I

RP : Is it *EXP* wrong *PRO* to want *PRO* to remove a contact because they are really really stupid ... ?
they

How much does it cost *T* *PRO* to buy a 

KeyboardInterrupt: ignored

In [123]:
ewt = EnglishWebTreebank()

n_subj = 0
n_subj_prp = 0
n_obj_prp = 0
n_obj = 0 

for _, tree in ewt.items():
    idx_subj_prp = tree.find('''SELECT ?node
                                WHERE { ?node <is-a> <NP-SBJ>;
                                              <is-the-parent-of> ?child.
                                        ?child <is-a> <PRP>.
                                      }''')
    idx_subj = tree.find('''SELECT ?node
                                WHERE { ?node <is-a> <NP-SBJ>. }''')
    idx_obj_prp = tree.find('''SELECT ?node
                                WHERE { ?parent <is-the-parent-of> ?node.
                                        { ?parent <is-a> <VP> } UNION { ?parent <is-a> <PP> }
                                        ?node <is-the-parent-of> ?child;
                                              <is-a> <NP>.
                                        ?child <is-a> <PRP>.
                                      }''')
    idx_obj = tree.find('''SELECT ?node
                                WHERE { ?parent <is-the-parent-of> ?node.
                                        { ?parent <is-a> <VP> } UNION { ?parent <is-a> <PP> }
                                        ?node <is-a> <NP>.
                                      }''')

    n_subj += len(idx_subj)
    n_subj_prp += len(idx_subj_prp)
    n_obj_prp += len(idx_obj_prp)
    n_obj += len(idx_obj) 

In [130]:
import numpy as np
from scipy.stats import fisher_exact

fisher_exact([[n_subj - n_subj_prp, n_subj_prp], [n_obj-n_obj_prp, n_obj_prp]])

#np.array([[n_subj - n_subj_prp, n_subj_prp], [n_obj-n_obj_prp, n_obj_prp]])

SignificanceResult(statistic=0.12353492733239568, pvalue=5.709918835058294e-53)

In [129]:
n_subj_prp / n_subj, n_obj_prp / n_obj

(0.3748517200474496, 0.06896551724137931)
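
The `statistic` reported by `fisher_exact` is the sample odds ratio of the 2-by-2 table. As a sanity check, it can be recovered from the two proportions above, since the odds of a pronoun are p / (1 - p). A small sketch using the printed values (the variable names are chosen here for illustration):

```python
# proportions of pronominal subjects and objects, copied from the output above
p_subj_prp = 0.3748517200474496
p_obj_prp = 0.06896551724137931


def odds(p: float) -> float:
    # odds corresponding to a probability p
    return p / (1 - p)


# with rows ordered (subject, object) and columns (non-pronoun, pronoun),
# the table's odds ratio is the odds of a pronoun among objects
# divided by the odds of a pronoun among subjects
odds_ratio = odds(p_obj_prp) / odds(p_subj_prp)

print(round(odds_ratio, 4))  # roughly 0.1235, in line with the statistic above
```

Read this way, the odds of a pronoun are about eight times lower among the NPs counted as objects than among subjects in this sample.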