# Lesson notebook 8 - Parsing



#### NLTK Chart parser

First we'll look at a chart parser from NLTK.  This parser is not pretrained.  It will operate by following the production rules in the grammar we provide.


#### NLTK Shift Reduce parser

Next we'll run the NLTK shift reduce parser.  This parser is not pretrained either, so it depends entirely on the grammar we provide.  Because we supply a toy grammar and an ambiguous sentence, the parser ends up without a complete tree as output.

#### NLTK Probabilistic Chart parser

Third, we'll look at a probabilistic chart parser from NLTK.  This parser is not pretrained.  It will operate by following the production rules in the grammar we provide and assign a probability score to each parse of the sentence.


#### SpaCy language processing examples

Finally we'll use SpaCy, a pretrained open source language processing pipeline.  It provides a platform for processing text in a number of ways without having to perform any fine-tuning.

<a id = 'returnToTop'></a>

## Notebook Contents
  * 1. [NLTK Parsers](#nltk)
    * 1.1 [NLTK Setup](#nltkSetup)
    * 1.2 [Chart Parser](#chartParser)
    * 1.3 [Shift Reduce Parser](#srParser)
    * 1.4 [Probabilistic Chart Parser](#pchartParser)
  * 2. [SpaCy](#spacy)
    * 2.1 [SpaCy Setup](#spacySetup)
    * 2.2 [SpaCy Natural Language Processing Pipeline](#spacyPipeline)
    * 2.3 [Sentence Boundary Detection](#spacySentence)
    * 2.4 [Part of Speech Tagging](#spacyPOS)
    * 2.5 [Dependency Parsing](#spacyDep)
  * 3. [Class Exercise](#classExercise)
  * 4. [Answers](#answers)      





[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2023-summer-main/blob/master/materials/lesson_notebooks/lesson_8_Parsing.ipynb)

[Return to Top](#returnToTop)  
<a id = 'nltk'></a>


## 1. NLTK Parsers

NLTK (Natural Language Toolkit) is an older Python library that implements the pre-neural way of doing many of the language processing tasks that we discuss in this class.  It is a good way of exploring algorithms and non-neural implementations. The [NLTK book](https://www.nltk.org/book/) is referenced in the syllabus.


[Return to Top](#returnToTop)  
<a id = 'nltkSetup'></a>

### 1.1 NLTK Setup

Let's set up our environment to run the NLTK library.  NLTK was created before the advent of neural NLP, but it provides a great illustration of the pre-neural approaches and allows you to experiment with them.  These implementations do not require a GPU and can easily run on your laptop.

In [1]:
import pickle
import subprocess
import sys
import nltk
from nltk import Nonterminal, nonterminals, Production, CFG, PCFG

[Return to Top](#returnToTop)  
<a id = 'chartParser'></a>

### 1.2 NLTK Chart parser

Recall that a [chart parser](https://www.nltk.org/howto/parse.html) requires a set of production rules to work from.  These can be supplied as a context free grammar.  Here's an example of such a grammar that deals with the wonderfully ambiguous line "I shot an elephant in my pajamas".  The prepositional phrase "in my pajamas" can be attached to the verb "shot", meaning I was wearing the pajamas, or attached to the noun "elephant", meaning the elephant was wearing my pajamas. Both parses are equally valid grammatically speaking, even though the attachment to the verb "shot" is the more probable reading.

First we define our context free grammar.  A real full grammar for English would be significantly larger.

In [2]:
groucho_grammar = nltk.CFG.fromstring("""
 S -> NP VP
 PP -> P NP
 NP -> Det N | Det N PP | 'I'
 VP -> V NP | VP PP
 Det -> 'an' | 'my'
 N -> 'elephant' | 'pajamas'
 V -> 'shot'
 P -> 'in'
 """)

Now we can feed our grammar and sentence into the chart parser and generate some parses.

In [3]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):
     tree.pretty_print()

     S                                       
  ___|______________                          
 |                  VP                       
 |         _________|__________               
 |        VP                   PP            
 |    ____|___              ___|___           
 |   |        NP           |       NP        
 |   |     ___|_____       |    ___|_____     
 NP  V   Det        N      P  Det        N   
 |   |    |         |      |   |         |    
 I  shot  an     elephant  in  my     pajamas

     S                                   
  ___|__________                          
 |              VP                       
 |    __________|______                   
 |   |                 NP                
 |   |     ____________|___               
 |   |    |     |          PP            
 |   |    |     |       ___|___           
 |   |    |     |      |       NP        
 |   |    |     |      |    ___|_____     
 NP  V   Det    N      P  Det        N   
 |   |    |     |    

Note the parser includes trees for both prepositional attachment possibilities because both parses are equally valid given our grammar.
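To see where the two trees come from, here is a minimal sketch, independent of NLTK, that counts how many parses the grammar licenses by memoizing results for each symbol and span. It is not NLTK's chart algorithm, only the same dynamic-programming idea; the names `GRAMMAR` and `count_parses` are illustrative assumptions.

```python
from functools import lru_cache

# The groucho grammar as a plain dictionary: left-hand side -> list of right-hand sides.
# Any symbol that is not a key is treated as a terminal (a word).
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}


def count_parses(tokens, start="S"):
    """Count the distinct parse trees for `tokens` rooted at `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways `symbol` can cover tokens[i:j].
        if symbol not in GRAMMAR:  # terminal: must match exactly one word
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # Number of ways the symbols in `rhs` can jointly cover tokens[i:j].
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        # The first symbol covers tokens[i:k]; every remaining symbol needs at least one token.
        return sum(
            count_symbol(rhs[0], i, k) * count_sequence(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return count_symbol(start, 0, len(tokens))


print(count_parses("I shot an elephant in my pajamas".split()))  # prints 2
```

Removing the prepositional phrase leaves a single parse, which shows that the ambiguity comes entirely from where "in my pajamas" attaches.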

[Return to Top](#returnToTop)  
<a id = 'srParser'></a>

### 1.3 NLTK Shift Reduce Parser Example

Let's try NLTK's simple shift reduce parser.  This parser uses the grammar we provide and generates a constituency parse that corresponds to that grammar.  As such it can only work as well as the grammar we provide.  If you alter the input sentence to include words not in the grammar you will generate an exception.

In [4]:
#shift reduce parser example
from nltk.grammar import Nonterminal
from nltk.parse.api import ParserI
from nltk.tree import Tree

Now let's run the shift reduce parser.  The buffer is loaded with all of the words in our sentence.  In the trace, the stack is shown to the left of the `*` and the remaining buffer to the right.  On the left, before the square bracket, is a letter **S** or **R**.  **S** means the parser picked the shift command and moved a token from the buffer to the stack.  **R** means it chose the reduce command and replaced one or more items on top of the stack with a label based on the grammar.  The parser runs until the buffer is empty.

In [5]:
parser = nltk.parse.ShiftReduceParser(groucho_grammar, trace=2)
for p in parser.parse(sent):
    print(p)

Parsing 'I shot an elephant in my pajamas'
    [ * I shot an elephant in my pajamas]
  S [ 'I' * shot an elephant in my pajamas]
  R [ NP * shot an elephant in my pajamas]
  S [ NP 'shot' * an elephant in my pajamas]
  R [ NP V * an elephant in my pajamas]
  S [ NP V 'an' * elephant in my pajamas]
  R [ NP V Det * elephant in my pajamas]
  S [ NP V Det 'elephant' * in my pajamas]
  R [ NP V Det N * in my pajamas]
  R [ NP V NP * in my pajamas]
  R [ NP VP * in my pajamas]
  R [ S * in my pajamas]
  S [ S 'in' * my pajamas]
  R [ S P * my pajamas]
  S [ S P 'my' * pajamas]
  R [ S P Det * pajamas]
  S [ S P Det 'pajamas' * ]
  R [ S P Det N * ]
  R [ S P NP * ]
  R [ S PP * ]


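Note the shift reduce parser doesn't produce a single constituency parse with an S at the top of the tree.  It reduces to S as soon as it can, before the prepositional phrase has been read, so it finishes with both S and PP on the stack and returns no tree.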
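The behavior in the trace can be reproduced with a few lines of plain Python. This is a minimal sketch of a greedy shift reduce loop, assuming the parser always reduces with the first matching rule before it shifts; it is not NLTK's implementation, and `RULES` and `greedy_shift_reduce` are illustrative names.

```python
# Grammar rules in the same order as groucho_grammar: (left-hand side, right-hand side).
RULES = [
    ("S", ("NP", "VP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
    ("NP", ("I",)),
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),
    ("Det", ("an",)),
    ("Det", ("my",)),
    ("N", ("elephant",)),
    ("N", ("pajamas",)),
    ("V", ("shot",)),
    ("P", ("in",)),
]


def greedy_shift_reduce(tokens):
    """Reduce whenever a rule matches the top of the stack, otherwise shift."""
    stack, buffer, trace = [], list(tokens), []
    while True:
        reduced = True
        while reduced:
            reduced = False
            for lhs, rhs in RULES:
                # Compare the top len(rhs) stack items with the rule's right-hand side.
                if tuple(stack[-len(rhs):]) == rhs:
                    stack[-len(rhs):] = [lhs]
                    trace.append(("R", list(stack)))
                    reduced = True
                    break
        if not buffer:
            return stack, trace
        stack.append(buffer.pop(0))
        trace.append(("S", list(stack)))


stack, trace = greedy_shift_reduce("I shot an elephant in my pajamas".split())
for action, snapshot in trace:
    print(action, snapshot)
print("final stack:", stack)  # ['S', 'PP'] rather than a single S
```

Because the loop never backtracks, the early reduction to S cannot be undone once "in my pajamas" arrives. A parser that explores alternative action sequences, or a chart parser, avoids this dead end.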

[Return to Top](#returnToTop)  
<a id = 'pchartParser'></a>

### 1.4 NLTK Probabilistic Chart Parser

Here is a probabilistic chart parser where we define a grammar and associate a probability with each of the productions.  We can use this to generate a joint probability for each parse of the sentence.

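First, we define our grammar and associate probabilities with each production.  Note that the probabilities of all productions that share the left-hand side **VP** add up to one, and the same holds for every other left-hand side.  There is a very low probability associated with attaching a prepositional phrase (PP) to a verb phrase (VP), so this toy grammar scores the noun attachment higher than the verb attachment.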
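The sum-to-one condition can be checked mechanically. The following is a minimal sketch that works on a plain-text copy of the rules from the grammar cell below, using only the standard library; `PCFG_TEXT` and `probability_totals` are illustrative names, not part of NLTK.

```python
import math
import re
from collections import defaultdict

# Plain-text copy of the productions in toy_pcfg2.
PCFG_TEXT = """
S    -> NP VP         [1.0]
VP   -> V NP          [.59]
VP   -> V             [.40]
VP   -> VP PP         [.01]
NP   -> Det N         [.41]
NP   -> Name          [.28]
NP   -> NP PP         [.31]
PP   -> P NP          [1.0]
V    -> 'saw'         [.21]
V    -> 'shot'        [.51]
V    -> 'ran'         [.28]
N    -> 'boy'         [.11]
N    -> 'pajamas'     [.12]
N    -> 'table'       [.13]
N    -> 'telescope'   [.14]
N    -> 'elephant'    [.5]
Name -> 'Jack'        [.32]
Name -> 'Bob'         [.28]
Name -> 'I'           [.40]
P    -> 'in'          [.30]
P    -> 'with'        [.41]
P    -> 'under'       [.29]
Det  -> 'the'         [.41]
Det  -> 'an'          [.31]
Det  -> 'my'          [.28]
"""


def probability_totals(pcfg_text):
    """Sum the bracketed probabilities of all productions sharing a left-hand side."""
    totals = defaultdict(float)
    for line in pcfg_text.strip().splitlines():
        match = re.match(r"\s*(\w+)\s*->.*\[([\d.]+)\]", line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


for lhs, total in probability_totals(PCFG_TEXT).items():
    status = "ok" if math.isclose(total, 1.0) else "does not sum to 1"
    print(f"{lhs:5s} {total:.2f}  {status}")
```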

In [6]:
from nltk.parse import pchart

In [7]:
toy_pcfg2 = PCFG.fromstring("""
     S    -> NP VP         [1.0]
     VP   -> V NP          [.59]
     VP   -> V             [.40]
     VP   -> VP PP         [.01]
     NP   -> Det N         [.41]
     NP   -> Name          [.28]
     NP   -> NP PP         [.31]
     PP   -> P NP          [1.0]
     V    -> 'saw'         [.21]
     V    -> 'shot'        [.51]
     V    -> 'ran'         [.28]
     N    -> 'boy'         [.11]
     N    -> 'pajamas'     [.12]
     N    -> 'table'       [.13]
     N    -> 'telescope'   [.14]
     N    -> 'elephant'    [.5]
     Name -> 'Jack'        [.32]
     Name -> 'Bob'         [.28]
     Name -> 'I'           [.40]
     P    -> 'in'          [.30] 
     P    -> 'with'        [.41]
     P    -> 'under'       [.29]
     Det  -> 'the'         [.41]
     Det  -> 'an'          [.31]
     Det  -> 'my'          [.28]
     """)

In [8]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
grammar = toy_pcfg2
parser = pchart.InsideChartParser(grammar)
for t in parser.parse(sent):
    print(t)

(S
  (NP (Name I))
  (VP
    (V shot)
    (NP
      (NP (Det an) (N elephant))
      (PP (P in) (NP (Det my) (N pajamas)))))) (p=2.74386e-06)
(S
  (NP (Name I))
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas))))) (p=8.85116e-08)
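The probability printed with each tree is the product of the probabilities of every production used in that tree. As a worked check, this sketch multiplies the values from `toy_pcfg2` by hand; the variable names are illustrative.

```python
import math

# Productions that appear in both parses, with their probabilities from toy_pcfg2.
shared = [
    ("S -> NP VP", 1.0),
    ("NP -> Name", 0.28),
    ("Name -> 'I'", 0.40),
    ("VP -> V NP", 0.59),
    ("V -> 'shot'", 0.51),
    ("NP -> Det N", 0.41),    # "an elephant"
    ("Det -> 'an'", 0.31),
    ("N -> 'elephant'", 0.5),
    ("PP -> P NP", 1.0),
    ("P -> 'in'", 0.30),
    ("NP -> Det N", 0.41),    # "my pajamas"
    ("Det -> 'my'", 0.28),
    ("N -> 'pajamas'", 0.12),
]
shared_probability = math.prod(p for _, p in shared)

# The two parses differ in exactly one production.
noun_attachment = shared_probability * 0.31  # NP -> NP PP
verb_attachment = shared_probability * 0.01  # VP -> VP PP

print(f"noun attachment: {noun_attachment:.5e}")  # 2.74386e-06
print(f"verb attachment: {verb_attachment:.5e}")  # 8.85116e-08
```

The two parses share every production except one, so their ratio is simply .31 / .01 = 31 in favor of attaching the prepositional phrase to the noun.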


[Return to Top](#returnToTop)  
<a id = 'spacy'></a>

## 2. SpaCy for Language Processing

Let's set up our environment to run the current version of [SpaCy](https://spacy.io) and feed it a small snippet of text to see what it can do.  

SpaCy is an open source, industrial-strength NLP engine that can perform multiple functions out of the box. It is very fast and strikes a good balance between speed of processing and accuracy of predictions.  It comes with a number of different language models trained on the [OntoNotes5](https://catalog.ldc.upenn.edu/LDC2013T19) data set.  This means that it is already trained to do part of speech tagging and dependency parsing.  It can also be trained to do classification and a number of other tasks in the standard NLP stack.  It can be a handy way of analyzing some text for exploratory data analysis. Another use is annotating some text to create a labelled training set that you then use to train your own model independent of spaCy.

SpaCy uses a combination of techniques including embeddings and convolutional neural nets to generate the output we see. Newer versions (> 2.1) are able to interact with pre-trained transformers.



[Return to Top](#returnToTop)  
<a id = 'spacySetup'></a>

### 2.1 SpaCy Setup


In [9]:
!pip -q install -U spacy


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
en-core-web-sm 3.4.1 requires spacy<3.5.0,>=3.4.0, but you have spacy 3.5.0 which is incompatible.

In [10]:
!pip install -U spacy-lookups-data

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-lookups-data
  Downloading spacy_lookups_data-1.0.3-py2.py3-none-any.whl (98.5 MB)
Installing collected packages: spacy-lookups-data
Successfully installed spacy-lookups-data-1.0.3


In [11]:
import spacy
import pandas as pd

print(spacy.__version__)
print(pd.__version__)




3.5.0
1.3.5


#### Pre-trained Language Models for SpaCy

SpaCy has also been pre-trained on multiple languages.  When using it you need to select and load a specific language model.

Make sure you first download a language model then load it into SpaCy. We're selecting English via a small model which gives us access to a wide variety of functionality.  There are many other options and other languages. 

In [12]:
# load an English model
nlp = spacy.load("en_core_web_sm")



[Return to Top](#returnToTop)  
<a id = 'spacyPipeline'></a>

### 2.2 SpaCy Natural Language Processing Pipeline
When you invoke spaCy with some input text it generates a set of objects.  spaCy processes document-like objects. A document can be one sentence or many sentences.  You provide the text and spaCy runs the nlp function, which returns a Doc object.  That Doc object contains a sequence of Token objects, each of which is associated with a set of annotations.  Many of the examples below simply harvest the labels attached to each token in the Doc object after the document has been processed.

In [13]:
doc = nlp(u"This is a sentence that we want to process.")

print("The first word is: ") 
doc[0].text

The first word is: 


'This'

[Return to Top](#returnToTop)  
<a id = 'spacySentence'></a>

### 2.3 Sentence Boundary Detection

Let's demonstrate some of the capabilities built into the SpaCy language processing pipeline.  One problem we sometimes have to deal with is sentence boundary detection.  We want to process a sequence of words as a unit like a sentence.  We might then want to feed individual sentences into some SpaCy process.

Let's see if we can convert these five lines of text into the three sentences they contain.  We include the tricky 'U.S.' in our lines to see if the boundary detector can handle more complex cases.

In [14]:
#sentence detection
# Given an input block of text, identify where the sentences end.

about_text = ('Sentence boundary detection is actually'
              ' a pretty hard problem.  Great advances'
              ' have been made in the U.S. in the'
              ' past decade. New neural nets'
              ' like a CNN can help improve results on this classification task.')
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
#len(sentences)
 
#now print out the three sentences
for sentence in sentences:
    print (sentence)

Sentence boundary detection is actually a pretty hard problem.  
Great advances have been made in the U.S. in the past decade.
New neural nets like a CNN can help improve results on this classification task.
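For contrast, here is a minimal sketch of a naive rule-based splitter that breaks the same text after any period, question mark, or exclamation point followed by whitespace. It is a baseline for illustration only, not how SpaCy detects boundaries; the variable names are illustrative.

```python
import re

about_text = ('Sentence boundary detection is actually'
              ' a pretty hard problem.  Great advances'
              ' have been made in the U.S. in the'
              ' past decade. New neural nets'
              ' like a CNN can help improve results on this classification task.')

# Split wherever sentence-final punctuation is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", about_text)

for piece in naive_sentences:
    print(repr(piece))
print(len(naive_sentences), "pieces")  # 4 pieces: the rule wrongly splits after 'U.S.'
```

The rule produces four pieces instead of three because it cannot tell the abbreviation 'U.S.' from the end of a sentence, which is the kind of case a trained boundary detector is meant to handle.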


[Return to Top](#returnToTop)  
<a id = 'spacyPOS'></a>

### 2.4 Part of Speech Tagging
Part of speech tagging can also be very valuable.  Tagging words can allow you to quickly distinguish "things" from "actions" or "events." 
SpaCy has several different tags to display related to part of speech as shown below.  First, we'll just print out the tags.  Second we'll take the output and display it in a table using pandas.

In [15]:
#POS with unpretty print

for token in about_doc:
    print (token, token.tag_, token.pos_, spacy.explain(token.tag_))

Sentence NN NOUN noun, singular or mass
boundary JJ ADJ adjective (English), other noun-modifier (Chinese)
detection NN NOUN noun, singular or mass
is VBZ AUX verb, 3rd person singular present
actually RB ADV adverb
a DT DET determiner
pretty RB ADV adverb
hard JJ ADJ adjective (English), other noun-modifier (Chinese)
problem NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
  _SP SPACE whitespace
Great JJ ADJ adjective (English), other noun-modifier (Chinese)
advances NNS NOUN noun, plural
have VBP AUX verb, non-3rd person singular present
been VBN AUX verb, past participle
made VBN VERB verb, past participle
in IN ADP conjunction, subordinating or preposition
the DT DET determiner
U.S. NNP PROPN noun, proper singular
in IN ADP conjunction, subordinating or preposition
the DT DET determiner
past JJ ADJ adjective (English), other noun-modifier (Chinese)
decade NN NOUN noun, singular or mass
. . PUNCT punctuation mark, sentence closer
New JJ ADJ adjective (Engli

Let's process that input from the variable *about_doc* and show the results of POS tagging the three sentences it contains.  We'll take that output and display it in a table with columns for the word, its POS tag, a higher level syntactic description, and an explanation of the tag.

In [16]:
#POS
#capturing the output in a pandas dataframe makes it easier to view
dpos = pd.DataFrame()
dpos['text'] = [token.text for token in about_doc]
dpos['tag'] = [token.tag_ for token in about_doc]
dpos['pos'] = [token.pos_ for token in about_doc]
dpos['explain'] = [spacy.explain(token.tag_) for token in about_doc]

dpos


| | text | tag | pos | explain |
|---|---|---|---|---|
| 0 | Sentence | NN | NOUN | noun, singular or mass |
| 1 | boundary | JJ | ADJ | adjective (English), other noun-modifier (Chin... |
| 2 | detection | NN | NOUN | noun, singular or mass |
| 3 | is | VBZ | AUX | verb, 3rd person singular present |
| 4 | actually | RB | ADV | adverb |
| 5 | a | DT | DET | determiner |
| 6 | pretty | RB | ADV | adverb |
| 7 | hard | JJ | ADJ | adjective (English), other noun-modifier (Chin... |
| 8 | problem | NN | NOUN | noun, singular or mass |
| 9 | . | . | PUNCT | punctuation mark, sentence closer |


[Return to Top](#returnToTop)  
<a id = 'spacyDep'></a>

### 2.5 Dependency Parsing

Now let's test SpaCy's ability to generate dependency parse trees. Like T5, SpaCy has been pre-trained on a number of different tasks. SpaCy runs several analyses in a single call, so we can walk over the list of input tokens and simply call up the labels assigned to each token.

The raw output can be difficult for a human to read.  Sometimes the data can be used for training other models or for exploratory data analysis.

In [17]:
#dependency parsing
w266_text = 'Students are learning Natural Language Processing in the W266 class.'
w266_doc = nlp(w266_text)
for token in w266_doc:
    print (token.text, token.tag_, token.head.text, token.dep_)

Students NNS learning nsubj
are VBP learning aux
learning VBG learning ROOT
Natural NNP Language compound
Language NNP Processing compound
Processing NNP learning dobj
in IN learning prep
the DT class det
W266 JJ class amod
class NN in pobj
. . learning punct


Let's capture the output and put it into a pandas dataframe for easier consumption.

In [18]:
#if you capture the tags in a dataframe you can then perform additional 
#operations like counting and filtering and searching

df = pd.DataFrame()
df['text'] = [token.text for token in w266_doc]
df['lemma'] = [token.lemma_ for token in w266_doc]
df['is_punctuation'] = [token.is_punct for token in w266_doc]
df['is_space'] = [token.is_space for token in w266_doc]
df['shape'] = [token.shape_ for token in w266_doc]
df['part_of_speech'] = [token.pos_ for token in w266_doc]
df['pos_tag'] = [token.tag_ for token in w266_doc]
df['head'] = [token.head.text for token in w266_doc] 
df['dep'] = [token.dep_ for token in w266_doc]

df

| | text | lemma | is_punctuation | is_space | shape | part_of_speech | pos_tag | head | dep |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Students | student | False | False | Xxxxx | NOUN | NNS | learning | nsubj |
| 1 | are | be | False | False | xxx | AUX | VBP | learning | aux |
| 2 | learning | learn | False | False | xxxx | VERB | VBG | learning | ROOT |
| 3 | Natural | Natural | False | False | Xxxxx | PROPN | NNP | Language | compound |
| 4 | Language | Language | False | False | Xxxxx | PROPN | NNP | Processing | compound |
| 5 | Processing | Processing | False | False | Xxxxx | PROPN | NNP | learning | dobj |
| 6 | in | in | False | False | xx | ADP | IN | learning | prep |
| 7 | the | the | False | False | xxx | DET | DT | class | det |
| 8 | W266 | w266 | False | False | Xddd | ADJ | JJ | class | amod |
| 9 | class | class | False | False | xxxx | NOUN | NN | in | pobj |


The dependency tree is a list of arcs and labels.  These are shown in the final two columns.  The 'head' column indicates the word from which the incoming arc originates and the 'dep' column contains the label associated with that arc.
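Since each row carries a head and a label, the tree can be rebuilt from the rows alone. This is a minimal sketch using the (token, head, label) triples copied from the printed output above; `ARCS`, `build_tree`, and `print_tree` are illustrative names. It keys nodes by word text, which only works because every word in this sentence is unique. With a real Doc you would key on `token.i` and `token.head.i` instead.

```python
from collections import defaultdict

# (token, head, dependency label) triples copied from the dependency parse output.
ARCS = [
    ("Students", "learning", "nsubj"),
    ("are", "learning", "aux"),
    ("learning", "learning", "ROOT"),
    ("Natural", "Language", "compound"),
    ("Language", "Processing", "compound"),
    ("Processing", "learning", "dobj"),
    ("in", "learning", "prep"),
    ("the", "class", "det"),
    ("W266", "class", "amod"),
    ("class", "in", "pobj"),
    (".", "learning", "punct"),
]


def build_tree(arcs):
    """Return the root word and a mapping from each head to its (child, label) pairs."""
    children = defaultdict(list)
    root = None
    for token, head, label in arcs:
        if label == "ROOT":
            root = token  # the root is listed as its own head
        else:
            children[head].append((token, label))
    return root, children


def print_tree(node, children, label="ROOT", depth=0):
    """Print the tree with one level of indentation per arc from the root."""
    print("  " * depth + f"{node} ({label})")
    for child, child_label in children[node]:
        print_tree(child, children, child_label, depth + 1)


root, children = build_tree(ARCS)
print_tree(root, children)
```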


We can also visualize the dependency tree with the call to 'displacy.render' and then displaying the resulting HTML code.

In [19]:
from spacy import displacy
from IPython.display import HTML


In [20]:
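# jupyter=False asks displacy to return the markup as a string instead of displaying it itself
html = displacy.render(w266_doc, style="dep", jupyter=False)
#print(html)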

In [21]:
HTML(html)

We'll look at other SpaCy capabilities in later classes.

[Return to Top](#returnToTop)  
<a id = 'classExercise'></a>

## 3. Class Exercise

Try submitting sentences to the SpaCy dependency parser to see how well it does and where it begins to break down.

Here are a number of "garden path" sentences where words at the end modify the meaning of words toward the front and alter the correct parts of speech.  You can submit these sentences or come up with your own.

In [22]:
#w266_text = 'The complex houses married and single students and their families.'
w266_text = 'The blind man picked up the hammer and saw.'
#w266_text = 'The woman with the dog that had the parasol was brown.'
#w266_text = 'The old man the boat.'
#w266_text = 'Time flies like an arrow, fruit flies like a banana.'
#w266_text = 'Everyone must learn to parse long multi-clausal sentences because they teach us about the intricacies of grammar.'
w266_doc = nlp(w266_text)
for token in w266_doc:
    print (token.text, token.tag_, token.head.text, token.dep_)

The DT man det
blind JJ man amod
man NN picked nsubj
picked VBD picked ROOT
up RP picked prt
the DT hammer det
hammer NN picked dobj
and CC picked cc
saw VBD picked conj
. . picked punct


In [23]:
html = displacy.render(w266_doc, style="dep", jupyter=False)  # return the markup as a string
HTML(html)