# NLP Core 2 Exercise 2: Sensible PP attachment

In this exercise, we will learn about **POS tagging** and **dependency parsing** and study the well-known **PP attachment problem**.

## Introduction and POS tagging

First, let's take a look at spaCy's Part-of-Speech (POS) tagging and dependency parsing abilities. Here's how we load a sentence into a spaCy document object and view its dependency parse:

In [1]:
import numpy as np

In [2]:
import spacy
from spacy import displacy
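# The 'en' shortcut was removed in spaCy 3; this assumes the small English model is installed.
nlp = spacy.load('en_core_web_sm')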
test_doc = nlp('I write code.')
displacy.render(test_doc, jupyter=True)

spaCy also tokenizes the sentence for you. You can view tokens and their POS tags as follows:

In [3]:
print([(token, token.pos_) for token in test_doc])

[(I, 'PRON'), (write, 'VERB'), (code, 'NOUN'), (., 'PUNCT')]


Now let's try applying this to a real dataset. NLTK includes an API for accessing many free open textual corpora, including the Project Gutenberg collection of public domain books. We'll load an array of the sentences of Jane Austen's 1811 novel *Sense and Sensibility* for our tests:

In [4]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
sentences = gutenberg.sents('austen-sense.txt')

[nltk_data] Downloading package gutenberg to /Users/Yohan/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


**Questions:**
  1. How many sentences are in the novel? How many unique tokens?
  2. What are the five most common verbs, counting inflections, in the novel? What are the five most common verbal lemmas (base forms of verbs)?



1.

In [5]:
count_sentences = len(sentences)
print("There are {} sentences in the novel.".format(count_sentences))

There are 4999 sentences in the novel.


In [6]:
all_words = [word for my_sentence in sentences for word in my_sentence]

In [7]:
count_unique_words = len(set(all_words))
count_unique_words

6828
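
Note that `len(set(all_words))` is case-sensitive and counts punctuation marks as tokens. As a minimal sketch of how the definition of "unique token" changes the answer, the following uses a hand-written toy list (`toy_words` is a stand-in, not data from the novel):

```python
# Toy token list standing in for all_words; not taken from the corpus.
toy_words = ["The", "family", "of", "Dashwood", ",", "the", "Family", "."]

# Case-sensitive count with punctuation included,
# which is what len(set(all_words)) measures.
unique_raw = len(set(toy_words))

# Case-insensitive count restricted to alphabetic tokens.
unique_folded = len({w.lower() for w in toy_words if w.isalpha()})

print(unique_raw, unique_folded)
```

Which definition is appropriate depends on the question being asked; the figure reported above uses the first one.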

In [8]:
my_doc = nlp(" ".join(all_words))
verb_tokens = [token for token in my_doc if token.pos_ == "VERB"]

2.

In [9]:
verb_freq = nltk.FreqDist([token.text.lower() for token in verb_tokens])
print(verb_freq.most_common(5))

[('was', 1861), ('be', 1304), ('had', 997), ('have', 819), ('is', 757)]


In [10]:
verb_lemmas = nltk.FreqDist([token.lemma_ for token in verb_tokens])
print(verb_lemmas.most_common(5))

[('be', 5434), ('have', 2075), ('do', 703), ('say', 609), ('could', 568)]
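
The lemma counts are larger than the counts for individual forms because lemmatization merges all inflections of a verb into one entry. A small sketch of that effect, using hand-written (form, lemma) pairs in place of `token.text.lower()` and `token.lemma_` (the pairs in `toy_verbs` are illustrative, not drawn from the corpus):

```python
from collections import Counter

# Hand-written (surface form, lemma) pairs; illustrative only.
toy_verbs = [
    ("was", "be"), ("is", "be"), ("be", "be"),
    ("had", "have"), ("have", "have"), ("was", "be"),
]

# Counting surface forms keeps "was", "is" and "be" apart.
form_counts = Counter(form for form, _ in toy_verbs)

# Counting lemmas merges them into a single "be" entry.
lemma_counts = Counter(lemma for _, lemma in toy_verbs)

print(form_counts.most_common(2))
print(lemma_counts.most_common(2))
```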


## Dependency parsing and PP attachment

As we saw above, spaCy also generates dependency parses that we can plot. These represent the grammatical relations that connect the different words and phrases in a sentence.

For the next task, we will consider how verbs and prepositional phrases can be related in sentences. (A *prepositional phrase* or *PP* is a phrase such as "in the house", "on the table", or "with my friend", which is headed by a preposition such as "in", "on", or "with".)

**Questions:**
  3. What is the difference between the prepositional phrases in the sentences in (A) and those in (B)? Plot their dependency parses with displacy.render and look for a difference in structure.

(A)
  * I eat an apple in my room.
  * We listen to music at the theater.
  * John visited Brazil with his friend.
  
(B)
  * I see a fly in my soup.
  * She knows the man at the store.
  * I photographed a man with a bowtie.

In [11]:
a_sentences = nlp('I eat an apple in my room. We listen to music at the theater. John visited Brazil with his friend.')
b_sentences = nlp('I see a fly in my soup. She knows the man at the store. I photographed a man with a bowtie.')
displacy.render(a_sentences, jupyter=True)

In [12]:
displacy.render(b_sentences, jupyter=True)

On the surface, the two groups differ in a few minor ways. The second sentence in (A) contains an extra ADP (*to*) compared with its counterpart in (B). The third sentence in (A) begins with a proper noun (*John*) where (B) has a pronoun, and its object is a second proper noun (*Brazil*) where (B) has a DET followed by a common noun. The sentences otherwise end the same way. The important difference is structural: in (A) the prepositional phrases are attached to the verb, while in (B) the prepositional phrase is attached to the verb's object.
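
The attachment difference can also be checked programmatically by looking at the head of each preposition. The sketch below does this on two hand-annotated toy parses rather than on real spaCy output: the `Tok` class, the `pp_attachment` helper, and the annotations are all assumptions made for illustration, with dependency labels chosen to mirror the `prep` and `dobj` labels used in this notebook.

```python
from typing import NamedTuple

class Tok(NamedTuple):
    text: str
    pos: str
    dep: str
    head: int  # index of the head token; the root points to itself

def pp_attachment(tokens):
    """Map each preposition to 'verb' or 'object', depending on its head."""
    result = {}
    for tok in tokens:
        if tok.dep != "prep":
            continue
        head = tokens[tok.head]
        if head.pos == "VERB":
            result[tok.text] = "verb"      # case (A)
        elif head.dep == "dobj":
            result[tok.text] = "object"    # case (B)
    return result

# (A) "I eat an apple in my room": "in" hangs off the verb "eat".
sent_a = [
    Tok("I", "PRON", "nsubj", 1), Tok("eat", "VERB", "ROOT", 1),
    Tok("an", "DET", "det", 3), Tok("apple", "NOUN", "dobj", 1),
    Tok("in", "ADP", "prep", 1), Tok("my", "DET", "poss", 6),
    Tok("room", "NOUN", "pobj", 4),
]

# (B) "I see a fly in my soup": "in" hangs off the object "fly".
sent_b = [
    Tok("I", "PRON", "nsubj", 1), Tok("see", "VERB", "ROOT", 1),
    Tok("a", "DET", "det", 3), Tok("fly", "NOUN", "dobj", 1),
    Tok("in", "ADP", "prep", 3), Tok("my", "DET", "poss", 6),
    Tok("soup", "NOUN", "pobj", 4),
]

print(pp_attachment(sent_a))
print(pp_attachment(sent_b))
```

With real spaCy tokens, the same test would read `token.head.pos_` and `token.head.dep_` instead of indexing into a list.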

As you can imagine, it is not simple for the parser to decide where the prepositional phrase should be attached -- this is the **PP attachment problem**. Let's evaluate spaCy's default behavior towards PP attachment on our *Sense and Sensibility* corpus:

**Questions:**
  4. Make an array of all (verb, preposition) tuples for prepositional phrases attached to the verb (like (A) above). Hint: for a spaCy token object *token*, you can get its children with *token.children* and each child's relation to it with *child.dep_*. What are the first five (verb, preposition) pairs in this case?
  5. Do the same where the prepositional phrase is attached to the verb's object (case (B)). What are the first five (verb, preposition) pairs in this case?

**Bonus:** Look at a few random sentences from the corpus that are parsed as (A) or (B). Do you agree with the given parse? Why or why not?

In [13]:
#4.
verb_prep = [(token, child) for token in verb_tokens for child in token.children if child.dep_ == 'prep']
verb_prep_array = np.array(verb_prep)

In [14]:
verb_prep_array[:5]

array([[settled, in],
       [was, at],
       [was, in],
       [lived, for],
       [lived, in]], dtype=object)

In [15]:
#5.
verb_obj = [(token, child) for token in verb_tokens for child in token.children if child.dep_ == 'dobj']
verb_array = np.array(verb_obj)
# Unpack each (verb, object) row instead of shadowing the built-in name `tuple`.
verb_obj_prep = [(verb, child) for verb, obj in verb_array for child in obj.children if child.dep_ == 'prep']
verb_obj_prep_array = np.array(verb_obj_prep)

In [16]:
verb_obj_prep_array[:5]

array([[engage, of],
       [produced, in],
       [received, of],
       [gave, of],
       [had, in]], dtype=object)
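
For the bonus question, one possible approach is to draw a few pairs at random and read the sentences they come from. The helper below is a hypothetical sketch (`sample_pairs` is not part of spaCy or NLTK), and it is demonstrated on plain string tuples so that it runs without a loaded model:

```python
import random

def sample_pairs(pairs, k=3, seed=0):
    """Return up to k randomly chosen items from pairs, reproducibly."""
    pairs = list(pairs)
    rng = random.Random(seed)  # fixed seed so the same sample comes back each run
    return rng.sample(pairs, min(k, len(pairs)))

# Plain strings stand in for spaCy tokens so the sketch runs on its own.
toy_pairs = [("settled", "in"), ("was", "at"), ("lived", "for"), ("gave", "of")]
picked = sample_pairs(toy_pairs, k=2)
print(picked)
```

With the notebook's own lists, a loop such as `for verb, prep in sample_pairs(verb_prep): print(verb, prep, '|', verb.sent)` should show each sampled pair next to its sentence, assuming `Token.sent` is available in the installed spaCy version.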