# NLP Syntactic Parsing exercise: Sensible PP attachment

In this exercise, we will learn about **POS tagging** and **dependency parsing** and study the well-known **PP attachment problem**.

## Introduction and POS tagging

First, let's take a look at spaCy's Part-of-Speech (POS) tagging and dependency parsing abilities. Here's how we load a sentence into a spaCy document object and view its dependency parse:

In [1]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [3]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
test_doc = nlp('I write code.')
# options={'compact': True} gives a more compact image
displacy.render(test_doc, jupyter=True, options={'compact': True})

spaCy also tokenizes the sentence for you. You can view tokens and their POS tags as follows:

In [4]:
print([(token, token.pos_) for token in test_doc])

[(I, 'PRON'), (write, 'VERB'), (code, 'NOUN'), (., 'PUNCT')]
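
The tag names above are abbreviations. As a small aside, `spacy.explain` returns a short description for a POS tag or dependency label. The dependency labels queried below (`prep`, `pobj`, `dobj`) are assumed to be the ones used by spaCy's English models; they become relevant later in this exercise.

```python
import spacy

# Look up human-readable descriptions for POS tags and dependency labels.
labels = ['PRON', 'VERB', 'NOUN', 'PUNCT', 'prep', 'pobj', 'dobj']
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f'{label:>6}: {description}')
```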


Now let's try applying this to a real dataset. NLTK includes an API for accessing many freely available text corpora, including a selection of public domain books from Project Gutenberg. We'll load the sentences of Jane Austen's 1811 novel *Sense and Sensibility* for our tests (each sentence is a list of token strings):

In [5]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
from nltk.corpus import gutenberg
sentences = gutenberg.sents('austen-sense.txt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Question 1
How many sentences are in the novel?



In [8]:
# your code here
len(sentences)

4999

There are 4,999 sentences in the novel.

## Question 2
Create a list of spaCy parsed documents from the sentences. *Hint:* each element of `sentences` is a list of tokens, so you need to join the tokens to reconstruct the original sentence. *Hint:* `sentences` is iterable.

In [28]:
# your code here
# Join each token list back into a string and parse it with spaCy
nlps = [nlp(' '.join(sentence)) for sentence in sentences]
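
Calling `nlp` once per sentence can be slow on a corpus of this size. One possible alternative, sketched here, is `nlp.pipe`, which processes texts in batches. To keep the sketch self-contained, a blank English pipeline (tokenizer only) and two made-up tokenized sentences stand in for the notebook's `nlp` and `sentences`; both `toy_nlp` and `toy_sentences` are hypothetical names.

```python
import spacy

# Assumption: a blank pipeline and toy data stand in for `nlp` and `sentences`.
toy_nlp = spacy.blank('en')
toy_sentences = [['I', 'write', 'code', '.'], ['She', 'reads', 'novels', '.']]

# Join each token list back into a string, then process the strings in batches.
texts = (' '.join(sentence) for sentence in toy_sentences)
toy_docs = list(toy_nlp.pipe(texts))

print(len(toy_docs))
print(toy_docs[0].text)
```

Note that joining with spaces only approximates the original text: punctuation ends up preceded by a space, which may occasionally affect tokenization or parsing.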


## Question 3
Create a flat list of tokens in all of the documents.  How many unique lowercase tokens are there?

In [38]:
# your code here
# Flatten the per-document tokens into a single list
all_tokens = [token for parsed_doc in nlps for token in parsed_doc]


In [70]:
import numpy as np
np.unique([token.lower_ for token in all_tokens]).shape

(6354,)

There are 6,354 unique lowercase tokens.
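
The same count can be obtained without NumPy by collecting the lowercase forms in a set. This is a minimal sketch on two made-up sentences, using a blank English pipeline as a stand-in for the parsed novel (`toy_nlp` and `toy_docs` are hypothetical names):

```python
import spacy

# Assumption: a blank pipeline and toy sentences stand in for the parsed corpus.
toy_nlp = spacy.blank('en')
toy_docs = [toy_nlp('The cat saw the Cat .'), toy_nlp('A cat sat .')]

# token.lower_ is the lowercase form of the token text; a set removes duplicates.
unique_lower = {token.lower_ for doc in toy_docs for token in doc}

print(len(unique_lower))
```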

## Question 4
What are the five most common lowercase verbs in the novel, counting different inflections separately?

In [110]:
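# Collect the surface form of every token tagged as a verb.
# Note: token.text keeps the original casing; token.lower_ would fold case
# so that e.g. 'Said' and 'said' are counted together.
all_verbs = []
for token in all_tokens:
  if token.pos_ == 'VERB':
    all_verbs.append(token.text)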

In [107]:
# your code here
from collections import Counter
counter = Counter(all_verbs)

In [111]:
counter.most_common(5)

[('said', 397), ('had', 246), ('know', 230), ('have', 224), ('think', 208)]

The five most common verb forms (counted by `token.text`, so casing is not folded) are: 'said', 'had', 'know', 'have', 'think'.
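
Since the question asks for lowercase verbs, it may be worth checking how much case folding matters. The sketch below contrasts counting by `token.text` with counting by `token.lower_` on a hand-tagged toy `Doc`; the sentence and its POS tags are made up for illustration and are not model output.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

# Assumption: a hand-tagged toy Doc stands in for the parsed novel.
vocab = spacy.blank('en').vocab
toy_doc = Doc(
    vocab,
    words=['Said', 'she', ',', 'said', 'he', '.'],
    pos=['VERB', 'PRON', 'PUNCT', 'VERB', 'PRON', 'PUNCT'],
)

# Counting by surface form keeps 'Said' and 'said' apart ...
by_text = Counter(t.text for t in toy_doc if t.pos_ == 'VERB')
# ... while counting by lowercase form merges them.
by_lower = Counter(t.lower_ for t in toy_doc if t.pos_ == 'VERB')

print(by_text)
print(by_lower)
```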

## Question 5
What are the five most common verbal lemmas (base forms of verbs)?

In [116]:
all_tokens[1].lemma_

'sense'

In [117]:
# your code here
all_verbs_lemma = []
for token in all_tokens:
  if token.pos_ == 'VERB':
    all_verbs_lemma.append(token.lemma_)

In [118]:
counter_lemma = Counter(all_verbs_lemma)

In [120]:
counter_lemma.most_common(5)

[('say', 608), ('have', 556), ('know', 385), ('see', 383), ('do', 355)]

The five most common verbal lemmas are: 'say', 'have', 'know', 'see', 'do'.

## Dependency parsing and PP attachment

As we saw above, spaCy also generates dependency parses that we can plot. These represent the grammatical relations that connect the different words and phrases in a sentence.

For the next task, we will consider how verbs and prepositional phrases can be related in sentences. (A *prepositional phrase* or *PP* is a phrase like "in the house", "on the table", or "with my friend", which is headed by a preposition like "in", "on", or "with".)

## Question 6
What is the difference between the prepositional phrases in the sentences in (A) and those in (B)? Plot their dependency parses with `displacy.render` and look for a difference in structure.

(A)
  * I eat an apple in my room.
  * We listen to music at the theater.
  * John visited Brazil with his friend.
  
(B)
  * I see a fly in my soup.
  * She knows the man at the store.
  * I photographed a man with a hat.

**Note:** Some of the sentences above may not be parsed correctly. Use your judgement and compare different parses to differentiate between the groups.

In [None]:
# your code here
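
Besides plotting, the attachment can be inspected programmatically by looking at the head of each preposition. The sketch below uses two hand-annotated toy parses, one per group; the heads and labels are assumptions written by hand (using the `prep`/`pobj`/`dobj` labels that spaCy's English models are assumed to produce), not output of `en_core_web_sm`. `pp_attachments` is a hypothetical helper.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank('en').vocab
POS = ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'NOUN']
DEPS = ['nsubj', 'ROOT', 'det', 'dobj', 'prep', 'poss', 'pobj']

# Hand-annotated parses; heads are absolute token indices.
# (A) "in" attaches to the verb "eat" (index 1).
doc_a = Doc(vocab, words=['I', 'eat', 'an', 'apple', 'in', 'my', 'room'],
            pos=POS, deps=DEPS, heads=[1, 1, 3, 1, 1, 6, 4])
# (B) "in" attaches to the noun "fly" (index 3).
doc_b = Doc(vocab, words=['I', 'see', 'a', 'fly', 'in', 'my', 'soup'],
            pos=POS, deps=DEPS, heads=[1, 1, 3, 1, 3, 6, 4])

def pp_attachments(doc):
    """Return (preposition, head text, head POS) for every 'prep' token."""
    return [(t.text, t.head.text, t.head.pos_) for t in doc if t.dep_ == 'prep']

print(pp_attachments(doc_a))
print(pp_attachments(doc_b))
```

On real model output, the same helper can be applied to `nlp('I eat an apple in my room.')` and the other sentences to see which group each parse falls into.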

As you can imagine, it is not simple for the parser to decide where a prepositional phrase should be attached. This is the **PP attachment problem**. Let's evaluate spaCy's default PP attachment behavior on our *Sense and Sensibility* corpus:

## Question 7
Create tuples (verb lemma, preposition lemma) for prepositional phrases attached to the verb (like (A) above). *Hint:* for a spaCy token object `token`, you can get its children with `token.children` and a child's relation to it with `child.dep_`. What are the five most common (verb lemma, preposition lemma) pairs in the novel?

In [None]:
# your code here
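
One possible approach, sketched on toy data rather than on the novel: iterate over verbs and collect children whose `dep_` is `prep`. The label name is an assumption about the English models' label scheme, the three `Doc` objects are hand-annotated stand-ins for parsed sentences, and `make_doc` and `verb_prep_pairs` are hypothetical helpers.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

vocab = spacy.blank('en').vocab

def make_doc(words, lemmas, pos, heads, deps):
    """Build a hand-annotated Doc (heads are absolute token indices)."""
    return Doc(vocab, words=words, lemmas=lemmas, pos=pos, heads=heads, deps=deps)

toy_docs = [
    # PP attached to the verb: "in" -> "eat"
    make_doc(['I', 'eat', 'an', 'apple', 'in', 'my', 'room'],
             ['I', 'eat', 'an', 'apple', 'in', 'my', 'room'],
             ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'NOUN'],
             [1, 1, 3, 1, 1, 6, 4],
             ['nsubj', 'ROOT', 'det', 'dobj', 'prep', 'poss', 'pobj']),
    # PP attached to the verb: "in" -> "eats"
    make_doc(['She', 'eats', 'lunch', 'in', 'the', 'garden'],
             ['she', 'eat', 'lunch', 'in', 'the', 'garden'],
             ['PRON', 'VERB', 'NOUN', 'ADP', 'DET', 'NOUN'],
             [1, 1, 1, 1, 5, 3],
             ['nsubj', 'ROOT', 'dobj', 'prep', 'det', 'pobj']),
    # PP attached to the object: "in" -> "fly" (should not be counted here)
    make_doc(['I', 'see', 'a', 'fly', 'in', 'my', 'soup'],
             ['I', 'see', 'a', 'fly', 'in', 'my', 'soup'],
             ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'PRON', 'NOUN'],
             [1, 1, 3, 1, 3, 6, 4],
             ['nsubj', 'ROOT', 'det', 'dobj', 'prep', 'poss', 'pobj']),
]

def verb_prep_pairs(docs):
    """Count (verb lemma, preposition lemma) pairs where the PP hangs off the verb."""
    pairs = Counter()
    for doc in docs:
        for token in doc:
            if token.pos_ != 'VERB':
                continue
            for child in token.children:
                if child.dep_ == 'prep':
                    pairs[(token.lemma_, child.lemma_)] += 1
    return pairs

print(verb_prep_pairs(toy_docs).most_common(5))
```

Applied to the notebook's `nlps` list, `verb_prep_pairs(nlps).most_common(5)` would give the corpus-level counts.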

## Question 8
Do the same where the prepositional phrase is attached to the verb's object (case (B)). What are the five most common (verb lemma, preposition lemma) pairs in this case? **Hint:** What should the dependency type of the verb's child be? What should the dependency type of that child's child be?

In [None]:
# your code here
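
A sketch of one way to approach this, again on toy data: walk from each verb to its direct object, then to any preposition attached to that object. The `dobj` and `prep` labels are assumptions about the English models' label scheme, the `Doc` objects are hand-annotated rather than model output, and `make_doc` and `object_prep_pairs` are hypothetical helpers.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

vocab = spacy.blank('en').vocab
POS = ['PRON', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN']
DEPS = ['nsubj', 'ROOT', 'det', 'dobj', 'prep', 'det', 'pobj']

def make_doc(words, lemmas, heads):
    """Build a hand-annotated Doc (heads are absolute token indices)."""
    return Doc(vocab, words=words, lemmas=lemmas, pos=POS, heads=heads, deps=DEPS)

toy_docs = [
    # PP attached to the object: "in" -> "fly"
    make_doc(['I', 'see', 'a', 'fly', 'in', 'the', 'soup'],
             ['I', 'see', 'a', 'fly', 'in', 'the', 'soup'],
             [1, 1, 3, 1, 3, 6, 4]),
    # PP attached to the object: "with" -> "man"
    make_doc(['I', 'photographed', 'a', 'man', 'with', 'a', 'hat'],
             ['I', 'photograph', 'a', 'man', 'with', 'a', 'hat'],
             [1, 1, 3, 1, 3, 6, 4]),
    # PP attached to the verb: "in" -> "eat" (should not be counted here)
    make_doc(['I', 'eat', 'an', 'apple', 'in', 'the', 'room'],
             ['I', 'eat', 'an', 'apple', 'in', 'the', 'room'],
             [1, 1, 3, 1, 1, 6, 4]),
]

def object_prep_pairs(docs):
    """Count (verb lemma, preposition lemma) pairs where the PP hangs off the object."""
    pairs = Counter()
    for doc in docs:
        for token in doc:
            if token.pos_ != 'VERB':
                continue
            for child in token.children:
                if child.dep_ != 'dobj':
                    continue
                for grandchild in child.children:
                    if grandchild.dep_ == 'prep':
                        pairs[(token.lemma_, grandchild.lemma_)] += 1
    return pairs

print(object_prep_pairs(toy_docs))
```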

## Bonus question
Look at a few random sentences from the corpus that are parsed as (A) or (B). Do you agree with the given parsing? Why or why not?