# Topic 5: Dependency Parsing

## Preliminaries 
Run this cell.

In [None]:
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections
from collections import defaultdict,Counter
from itertools import zip_longest
from IPython.display import display
from random import seed
import random
import math
import matplotlib.pylab as pylab
%matplotlib inline
params = {'legend.fontsize': 'large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'large',
         'axes.titlesize':'large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from pylab import rcParams
from operator import itemgetter, attrgetter, methodcaller
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns
import csv

### spaCy
In the next topic, Topic 6, we will be investigating how to make a fine-grained analysis of the content of Amazon reviews. In particular, we will be analysing what the reviewer says about specific aspects of the product, e.g. what is said about the *plot* of a *film*. In preparation for that, in this notebook, we will be learning about dependency trees.

Dependency trees allow us to see how the words in a sentence relate to one another grammatically. This contrasts with what we have been doing up until now when determining the sentiment of a review, where we viewed a document as an unordered bag of words.

Up to this point we have been using various NLP tools that form part of the NLTK. We now turn to an alternative, significantly more powerful NLP toolkit, one that provides state-of-the-art accuracy and state-of-the-art efficiency. 

We will be using something called [spaCy](https://spacy.io/). 

In this notebook we will be familiarising ourselves with many of the things that spaCy can do. 

Note that several of the examples that appear in this notebook have either been taken directly, or have been adapted from various spaCy tutorials found [here](https://spacy.io/docs/usage/tutorials).

To load spaCy, run the following cell.

In [None]:
import spacy
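# we'll be using the English version of spaCy. German, French, Spanish versions are also available.
# note: the 'en' shortcut works in spaCy 1.x/2.x; spaCy 3+ needs a named model instead,
# e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')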

### Review dataset
We need some data to work with. Let's set up a dataset of DVD reviews, `dvd_reviews`. 

To do this, run the following cell.

In [None]:
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader

# create a list containing the raw text of all of the available dvd reviews.
dvd_reviews = [review for review in AmazonReviewCorpusReader().category("dvd").raw()]
print("The dvd review dataset contains {} reviews".format(len(dvd_reviews)))

### Processing a review with spaCy
We now look at what spaCy produces when it analyses text. 

The following cell illustrates some (though by no means all) of the elements of the analysis. Once spaCy has analysed a text, much of the analysis is accessible through the tokens. Each token is an object that has a number of properties. See [here](https://spacy.io/docs/api/token) for a full list of a token's properties.

Note: in general, when a property name ends with an underscore character, e.g.  `orth_`, that property returns a string (unicode) representation of the value for that property. This is useful when displaying output in a human-readable way. With no underscore, e.g. `orth`, the property returns the index of the value within the spaCy vocabulary.
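
To make the distinction between the two views concrete, here is a minimal plain-Python sketch of a string table. It is not spaCy code: `ToyStringStore`, `add` and `lookup` are hypothetical names invented for this illustration, and the IDs are simple list positions (newer spaCy versions use hash values rather than indices). The point is only that the same token can be viewed either as an integer or as the string that integer stands for.

```python
# Toy stand-in for a vocabulary string table (an illustration, not spaCy's implementation).
class ToyStringStore:
    """Maps each distinct string to an integer ID and back again."""

    def __init__(self):
        self._ids = {}      # string -> integer ID
        self._strings = []  # integer ID -> string

    def add(self, text):
        # Reuse the existing ID if the string has been seen before.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, string_id):
        return self._strings[string_id]


store = ToyStringStore()
orth_ids = [store.add(word) for word in ["the", "plot", "was", "the", "plot"]]

print(orth_ids)                             # integer view, analogous to `orth`
print([store.lookup(i) for i in orth_ids])  # string view, analogous to `orth_`
```

Repeated words share an ID, which is why the integer view is convenient for counting and comparison while the string view is convenient for display.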

In the following cell, we see the following token properties being used:
- `orth_`: the token's orthography.
- `lemma_`: the uninflected form of the token.
- `shape_`: the token shape.
- `pos_`: the part of speech of the token.
- `is_stop`: is the token a stop word or not?
- `is_punct`: is the token punctuation or not?
- `is_space`: is the token whitespace? Note that the spaCy tokeniser treats any sequence of whitespace characters beyond a single space as a token.
- `like_num`: is the token a number?
- `is_oov`: is the token an out-of-vocabulary word?
- `prob`: the log probability of the token, where the probabilities are estimated from a three billion word corpus, with simple Good-Turing smoothing.

### Exercise
Run the following cell several times so that you can look at the output for a variety of sentences.
- Notice that parts of speech tags are upper case strings, e.g. `VERB`.
- Look at places where the lemma is different from the token.
- See if you can find sentences with out-of-vocabulary (oov) tokens.
- What does the `shape` property capture?
- See if you can figure out what this line is doing:
```
df.loc[:, 'number?':'whitespace?'] = (df.loc[:, 'number?':'whitespace?']
                                       .applymap(lambda x: 'yes' if x else 'no'))
```

In [None]:
# randomly choose a review
review = random.choice(dvd_reviews)
#run spaCy on the review
parsed_review = nlp(review) # in spaCy we call parsed_review a Doc

# get just the first sentence of the review
parsed_sentence = next(parsed_review.sents) # in spaCy we call parsed_sentence a Span (of a Doc)

token_attributes = [(token.orth_,
                     token.lemma_,
                     token.pos_,
                     token.like_num,
                     token.is_stop,
                     token.is_oov,
                     token.is_punct,
                     token.is_space,
                     token.shape_,
                     token.prob,
                    )
                    for token in parsed_sentence]

df = pd.DataFrame(token_attributes,
                   columns=['text',
                            'lemma',
                            "pos",
                            'number?',
                            'stop?',
                            'oov?',
                            'punctuation?',
                            'whitespace?',
                            'shape',
                            'log probability',
                           ])

df.loc[:, 'number?':'whitespace?'] = (df.loc[:, 'number?':'whitespace?']
                                       .applymap(lambda x: 'yes' if x else 'no'))

print('Analysis of the sentence:\n{}'.format(parsed_sentence.text))                                               
display(df)

### Properties of spaCy objects

Three classes of objects make up a spaCy analysis:
- A document.
- A span. 
 - this is a subsequence, or slice, of the parsed document and could be a sentence or phrase.
- A token.

Each of these has various properties.

In each of the following three code cells you will see code that uses `dir` to display the full set of such properties for each kind of object. 
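
The filtering idiom used in those cells works on any Python object, not just spaCy ones. The following sketch applies it to a small made-up class (`Review` and `public_names` are hypothetical names used only for this illustration) to show that names beginning with an underscore, which Python treats as private or special, are skipped.

```python
class Review:
    """Hypothetical class used only to demonstrate the idiom."""

    def __init__(self, text):
        self.text = text
        self._cache = None  # leading underscore: conventionally private

    def word_count(self):
        return len(self.text.split())


def public_names(obj):
    """Return the attribute and method names of obj that do not start with '_'."""
    return [name for name in dir(obj) if not name.startswith('_')]


names = public_names(Review("However, the plot was predictable."))
print(names)
```

`dir` returns names in alphabetical order, so the listing mixes data attributes and methods together.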

We begin with a document.

In [None]:
review = random.choice(dvd_reviews)
parsed_review = nlp(review) # in spaCy we call parsed_review a Doc
      
for prop in dir(parsed_review):
    if not prop.startswith('_'):
        print("\t",prop)


Next we look at the properties of spans. In this case, our span is a sentence from the review.

In [None]:
review = random.choice(dvd_reviews)
parsed_review = nlp(review) # in spaCy we call parsed_review a Doc
parsed_sentence = next(parsed_review.sents) # in spaCy we call parsed_sentence a Span (of a Doc)

for prop in dir(parsed_sentence):
    if not prop.startswith('_'):
        print("\t",prop)

Finally, we look at the properties of tokens.

In [None]:
review = random.choice(dvd_reviews)
parsed_review = nlp(review) # in spaCy we call parsed_review a Doc
parsed_sentence = next(parsed_review.sents) # in spaCy we call parsed_sentence a Span (of a Doc)

for prop in dir(parsed_sentence[0]):
    if not prop.startswith('_'):
        print("\t",prop)

### Dependency trees in spaCy
We are now ready to look at dependency trees.

Dependency trees are graphs that are used to describe the syntax of a sentence. They do this by specifying relationships between the tokens in the sentence. The vertices of the graph are the tokens and the edges of the graph capture grammatical relationships between tokens, e.g. that a noun is the subject of a verb. They are called dependency **trees** because the graph is a tree.

The following visualisation shows a dependency tree produced by spaCy for the sentence  
"*However, the plot was predictable.*"

![dependency tree example](./img/example_dependency_tree.png)
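
As a rough illustration of the graph view, the sketch below writes down one plausible analysis of the same sentence as a list of labelled, directed edges and checks that they form a tree. The edges are hand-written for this example rather than produced by spaCy, and `is_tree` is a hypothetical helper, so treat it as a sketch of the idea only.

```python
tokens = ["However", ",", "the", "plot", "was", "predictable", "."]

# Hand-written edges: (index of head, dependency label, index of dependent).
edges = [
    (4, "advmod", 0),
    (4, "punct", 1),
    (3, "det", 2),
    (4, "nsubj", 3),
    (4, "acomp", 5),
    (4, "punct", 6),
]


def is_tree(n_tokens, edges):
    """True if exactly one token has no head and every other token has a
    single head and reaches that root by following heads."""
    head_of = {}
    for head, _label, dependent in edges:
        if dependent in head_of:  # a token may have only one head
            return False
        head_of[dependent] = head
    roots = [i for i in range(n_tokens) if i not in head_of]
    if len(roots) != 1:
        return False
    root = roots[0]
    for start in range(n_tokens):
        seen = set()
        node = start
        while node != root:
            if node in seen or node not in head_of:  # cycle or dangling head
                return False
            seen.add(node)
            node = head_of[node]
    return True


for head, label, dependent in edges:
    print("{} --{}--> {}".format(tokens[head], label, tokens[dependent]))
print("Is a tree:", is_tree(len(tokens), edges))
```

In this analysis the verb "was" is the only token without an incoming edge, which makes it the root.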

### Exercise
To get a sense of what dependency trees produced by spaCy look like, take a look at a demo of spaCy's parser, which can be found [here](https://demos.explosion.ai/displacy).

In the box at the top (the one with the magnifying glass icon on its right), type in a sentence, run the parser, and examine the output. Try this for a few sentences.

Here are some things to look out for:
- Across the bottom of the tree, you will see each token with its part-of-speech shown below.
 - A full list of the parts of speech tag set can be found [here](https://spacy.io/docs/api/annotation#pos-tagging).
 - The tokens are shown in the order that they appear in the text.
 - Use `spacy.explain` to get a brief explanation of a symbol, e.g. try `spacy.explain("JJ")`.
- Above the tokens you see **directed, labelled edges**. 
 - Each edge specifies a dependency between two tokens in the sentence.
 - It is the edges that provide the syntactic analysis of the sentence.  
 - Each edge connects a **head** with one of its **dependents**. 
 - The edges are directed **from** the head **to** the dependent.
 - The edges are labelled by the name of the dependency relation. 
 - A full set of dependency relations can be found [here](https://spacy.io/docs/api/annotation#dependency-parsing).

### Working with dependency trees in spaCy
Tokens are associated with two properties that encode the dependency tree that spaCy has assigned to a sentence.
- `token.head`: this gives the token in the sentence that is the head of this token.
- `token.dep_`: this gives the label of the dependency relation that links `token.head` to `token`.

Note that when `token` is the root of the dependency tree `token.head == token`.
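
The following plain-Python sketch mimics these two properties with a made-up `ToyToken` class (not a spaCy class) and a hand-written analysis of "However, the plot was predictable." It shows how the root can be found by following `head` links, and how a token's dependents can be recovered from the heads alone. The helper names `find_root` and `children_of` are assumptions made for this illustration.

```python
class ToyToken:
    """Hypothetical stand-in for a parsed token."""

    def __init__(self, text, dep_):
        self.text = text
        self.dep_ = dep_
        self.head = self  # placeholder until the real head is assigned


# (text, dependency label, index of head); hand-written, not spaCy output.
spec = [("However", "advmod", 4), (",", "punct", 4), ("the", "det", 3),
        ("plot", "nsubj", 4), ("was", "ROOT", 4), ("predictable", "acomp", 4),
        (".", "punct", 4)]

sentence = [ToyToken(text, dep) for text, dep, _ in spec]
for token, (_text, _dep, head_index) in zip(sentence, spec):
    token.head = sentence[head_index]


def find_root(token):
    """Follow head links upwards until reaching the token that is its own head."""
    while token.head is not token:
        token = token.head
    return token


def children_of(token, sentence):
    """All tokens whose head is the given token (excluding the token itself)."""
    return [t for t in sentence if t.head is token and t is not token]


for token in sentence:
    print("{:<12}{:<8}{}".format(token.text, token.dep_, token.head.text))
print("Root:", find_root(sentence[2]).text)
print("Children of root:", [t.text for t in children_of(sentence[4], sentence)])
```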

### Exercise
Run the cell below and inspect the output. 
- Notice that dependency labels are lower case strings, e.g. `nsubj`.
- Notice the token that is at the root of the dependency tree has itself as its head.
- Type the same sentence into the [spaCy parser demo](https://demos.explosion.ai/displacy) and check that each line of output is compatible with the tree being displayed.

In [None]:
# randomly choose a review
review = random.choice(dvd_reviews)
#run spaCy on the review
parsed_review = nlp(review)
# get just the first sentence of the review
parsed_sentence = next(parsed_review.sents)

token_attributes = [(token.orth_,
                     token.pos_,
                     token.dep_,
                     token.head,
                    )
                    for token in parsed_sentence]

df = pd.DataFrame(token_attributes,
                   columns=['text',
                            "pos",
                            "dep",
                            "head",
                           ])
                                               
print('Analysis of the sentence:\n{}'.format(parsed_sentence.text))
display(df)

### Exercise
In the cell below, you will find code that shows the verb tokens in a review together with an indication of whether they appeared (at least once in the review) in an `nsubj` relation with another token.

Make a copy of this cell, and in the new cell, adapt the code so that the output also includes an additional column showing whether the verb token appeared in a sentence where it had both an `nsubj` relation with some other token and a `dobj` relation with yet another token. 

In [None]:
reviews = dvd_reviews[:10]
for review in reviews:
    parsed_review = nlp(review)
    print("Review:\n\n{}".format(review))
    all_verbs = set()
    verbs_with_nsubj = set()
    for token in parsed_review:
        if token.pos_ == 'VERB':
            all_verbs.add(token)
            for child in token.children:
                if child.dep_ == 'nsubj':
                    verbs_with_nsubj.add(token)
                    break
    # tabulate the verbs found in this review
    df = pd.DataFrame([(verb,verb in verbs_with_nsubj) for verb in all_verbs],
                      columns=["verb",'has nsubj?'])
    df.loc[:, 'has nsubj?':'has nsubj?'] = (df.loc[:, 'has nsubj?':'has nsubj?']
                                       .applymap(lambda x: 'yes' if x else ''))
    display(df)


In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/verbs_with_subj_and_obj

### Exercise
Now adapt the code you wrote for the last exercise so that it displays a table with one column for each verb that appeared with at least one (`nsubj`,`dobj`) pair. The column for a verb should contain all the (`nsubj`,`dobj`) pairs that occurred with that verb.

So a verb that occurred three times in the review in a situation where it had both an `nsubj` and a `dobj` would have entries in rows 0, 1 and 2, with each entry being the pair of tokens, i.e. the verb's `nsubj` and `dobj`. 

- Use a dictionary to store the (`nsubj`,`dobj`) pair details of each verb. 
- Store each verb's (`nsubj`,`dobj`) pairs in a list.
- Put all of the lists of (`nsubj`,`dobj`) pairs into a list of lists of pairs, called `all_pairs`
- Put the verbs in a list called `verbs` that is ordered in a way that aligns with the ordering in `all_pairs`.
- Put this into a Pandas dataframe
 - use `pd.DataFrame(list(zip_longest(*all_pairs)),columns = verbs).applymap(lambda x: '' if x is None else x)` 
 - see [unpack argument lists](https://docs.python.org/3/tutorial/controlflow.html#tut-unpacking-arguments) for an explanation of the `*`.
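
To see what `zip_longest(*all_pairs)` does before applying it to real parses, here is a small sketch on made-up data. The strings stand in for tokens, and `pairs_by_verb` is a hypothetical dictionary invented for this illustration. Each inner list becomes a column, and shorter columns are padded so that every row has one entry per verb.

```python
from itertools import zip_longest

# Made-up data: each verb maps to the (nsubj, dobj) pairs seen with it.
pairs_by_verb = {
    "bought": [("I", "film"), ("she", "boxset")],
    "loved": [("we", "plot")],
}

verbs = list(pairs_by_verb)                         # column order
all_pairs = [pairs_by_verb[verb] for verb in verbs] # aligned with `verbs`

# The * unpacks the list of lists so that each inner list is a separate argument.
rows = list(zip_longest(*all_pairs))
for row in rows:
    print(row)

# Passing fillvalue pads with an empty string instead of None.
padded = list(zip_longest(*all_pairs, fillvalue=""))
print(padded)
```

Because "loved" has only one pair, its second row entry is the padding value; this is what the `applymap` step in the hint above replaces with an empty string.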


In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/verbs_with_subj_obj_pairs

### Direct object relation
The direct object of a verb is the recipient of the action. So in "I bought Shrek", "Shrek" is the direct object of a buying action. So, for example, if we were to look for the direct objects of the verbs "want", "buy" and "love" we would find the words which are wanted, bought and loved. This relation is called `dobj`.
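
As a small illustration of the relation, the sketch below encodes a hand-written analysis of "I bought Shrek" as plain tuples and picks out the tokens that stand in the `dobj` relation to a verb with a given lemma. The tuple layout and the helper `direct_objects` are assumptions made for this example; a real analysis would come from spaCy.

```python
# (text, lemma, dependency label, index of head); hand-written, not spaCy output.
sentence = [
    ("I", "I", "nsubj", 1),
    ("bought", "buy", "ROOT", 1),
    ("Shrek", "Shrek", "dobj", 1),
]


def direct_objects(sentence, verb_lemma):
    """Texts of the tokens that are the dobj of a head whose lemma is verb_lemma."""
    return [text
            for text, _lemma, dep, head in sentence
            if dep == "dobj" and sentence[head][1] == verb_lemma]


print(direct_objects(sentence, "buy"))
print(direct_objects(sentence, "love"))
```

Matching on the lemma rather than the surface form means that "bought", "buys" and "buying" would all count as occurrences of "buy".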

### Exercise
In the blank cell below, write code that finds all of the reviews in the DVD review set that contain the verbs "love", "buy" or "want". For each of these verbs, collect all the words that lie in the `dobj` relation with them, and show the results in a table. There should be one column for each of the three verbs.
- To make the code general, do this: `target_verbs = ["love","buy","want"]`.
- Our three target words are verb lemmas, so check their equality using `.lemma_`.
- When you are debugging your code, don't run it on the whole dataset.
- You can store the direct objects using a dictionary of lists and convert to a dataframe in the same way that was recommended for the previous exercise.


In [None]:
# uncomment the next line and then run the cell to load a solution
#%load solutions/love_buy_want_objs