# Assignment 5 - Parts of Speech

by Bryan Carr

10 October 2022

for University of San Diego - AAI 520 Natural Language Processing

Prof. Siamak Aram


In this assignment, we'll be using the short story *The Tale of Peter Rabbit* by Beatrix Potter (1902). This story is in the public domain, and the text file was obtained from Project Gutenberg.

In [None]:
# Import key libraries

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

import pandas as pd
import numpy as np

# Mount the Google Drive to use the data file
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### **1. Create a Doc object from the text file peterrabbit.txt**

In [None]:
# Read in the data; the with-block closes the file once it has been read
with open('/content/drive/My Drive/AAI 520 NLP/peterrabbit.txt') as f:
  text = f.read()

# Print out the data to check

text

"The Tale of Peter Rabbit, by Beatrix Potter (1902).\n\nOnce upon a time there were four little Rabbits, and their names\nwere--\n\n          Flopsy,\n       Mopsy,\n   Cotton-tail,\nand Peter.\n\nThey lived with their Mother in a sand-bank, underneath the root of a\nvery big fir-tree.\n\n'Now my dears,' said old Mrs. Rabbit one morning, 'you may go into\nthe fields or down the lane, but don't go into Mr. McGregor's garden:\nyour Father had an accident there; he was put in a pie by Mrs.\nMcGregor.'\n\n'Now run along, and don't get into mischief. I am going out.'\n\nThen old Mrs. Rabbit took a basket and her umbrella, and went through\nthe wood to the baker's. She bought a loaf of brown bread and five\ncurrant buns.\n\nFlopsy, Mopsy, and Cottontail, who were good little bunnies, went\ndown the lane to gather blackberries:\n\nBut Peter, who was very naughty, ran straight away to Mr. McGregor's\ngarden, and squeezed under the gate!\n\nFirst he ate some lettuces and some French beans; and 

In [None]:
# Create the Doc object
doc = nlp(text)

# Check the first 60 tokens in Doc
doc[:60]

The Tale of Peter Rabbit, by Beatrix Potter (1902).

Once upon a time there were four little Rabbits, and their names
were--

          Flopsy,
       Mopsy,
   Cotton-tail,
and Peter.

They lived with their Mother in a sand-bank, underneath the root of

### **2. For every token in the third sentence, print the Token, the POS Tag, the fine-grained TAG Tag, and the description of the fine-grained tag.**

Sentence starts: "They lived with their mother in a sand-bank..."

In [None]:
list_of_sentences = list(doc.sents)

print("Number of Sentences: " + str(len(list_of_sentences)))

list_of_sentences[2]

Number of Sentences: 55


They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.

'

In [None]:
doc_sentence3 = nlp(str(list_of_sentences[2]))

doc_sentence3

They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.

'

In [None]:
for token in doc_sentence3:
  print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

They       PRON     PRP    pronoun, personal
lived      VERB     VBD    verb, past tense
with       ADP      IN     conjunction, subordinating or preposition
their      PRON     PRP$   pronoun, possessive
Mother     PROPN    NNP    noun, proper singular
in         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner
sand       NOUN     NN     noun, singular or mass
-          PUNCT    HYPH   punctuation mark, hyphen
bank       NOUN     NN     noun, singular or mass
,          PUNCT    ,      punctuation mark, comma
underneath ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
root       NOUN     NN     noun, singular or mass
of         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner

          SPACE    _SP    whitespace
very       ADV      RB     adverb
big        ADJ      JJ     adjective (English), other noun-modifier (Chinese)
fir        NOUN     NN

The results are more or less as expected. Note that the sentence span also picks up a stray opening quotation mark (`'`) after the trailing whitespace; it belongs to the start of the next sentence.


### **3. Provide a frequency list of POS tags from the entire document**

In [None]:
pos_counts = doc.count_by(spacy.attrs.POS)

for k, v in sorted(pos_counts.items()):
  print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 54
85. ADP  : 122
86. ADV  : 67
87. AUX  : 49
89. CCONJ: 61
90. DET  : 90
92. NOUN : 166
93. NUM  : 8
94. PART : 29
95. PRON : 109
96. PROPN: 76
97. PUNCT: 173
98. SCONJ: 20
100. VERB : 135
103. SPACE: 99
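
The listing above is ordered by spaCy's internal POS ID. To rank the tags by how often they occur instead, one option is `collections.Counter`. This is a minimal sketch with the counts copied by hand from the output above into a name-keyed dictionary called `tag_counts` (a name introduced here for illustration); in the notebook the names would come from `doc.vocab[k].text`.

```python
from collections import Counter

# Counts copied from the output above, keyed by tag name rather than by POS ID
tag_counts = Counter({
    "ADJ": 54, "ADP": 122, "ADV": 67, "AUX": 49, "CCONJ": 61,
    "DET": 90, "NOUN": 166, "NUM": 8, "PART": 29, "PRON": 109,
    "PROPN": 76, "PUNCT": 173, "SCONJ": 20, "VERB": 135, "SPACE": 99,
})

# most_common() returns (tag, count) pairs sorted from most to least frequent
ranked = tag_counts.most_common()
for tag, count in ranked:
    print(f"{tag:{5}}: {count}")
```

Punctuation and nouns come out on top, with numerals last.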


### **4. CHALLENGE: What percentage of tokens are nouns?**

In [None]:
# First compute the total number of tokens
# I am using a loop over the values because sum(pos_counts) iterates over the dictionary's keys,
# so it adds up the POS IDs (nearly 1400) rather than the counts

total = 0
for k, v in pos_counts.items():
  total += v

total

1258
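
Calling `sum()` directly on a dictionary adds up its keys, which here are POS IDs rather than counts. A short check, with the dictionary rebuilt by hand from the frequency list above, shows the difference and a one-line alternative to the loop:

```python
# POS counts copied from the frequency list above, keyed by spaCy's POS ID
pos_counts = {84: 54, 85: 122, 86: 67, 87: 49, 89: 61, 90: 90, 92: 166,
              93: 8, 94: 29, 95: 109, 96: 76, 97: 173, 98: 20, 100: 135, 103: 99}

sum_of_keys = sum(pos_counts)             # iterating a dict yields its keys (the POS IDs)
sum_of_counts = sum(pos_counts.values())  # the token counts we actually want

print(sum_of_keys)    # 1389
print(sum_of_counts)  # 1258
```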

In [None]:
# Next calculate the percentage based on the count at entry 92, for Nouns

noun_percentage = pos_counts[92] / total

noun_percentage

0.13195548489666137

The result, 0.132 or 13.20%, is close to the 13.99% expected. Our text file and POS counts differ slightly from those in the assignment handout, with only 166 nouns vs 176 expected.
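
As a quick arithmetic check, the two percentages can be reproduced from the counts. The sketch below assumes the handout's 176 nouns were counted over the same 1258 tokens; the handout's total is not shown in this notebook, so that part is an assumption.

```python
# Figures taken from this notebook: 166 nouns out of 1258 tokens in total
noun_count = 166
total_tokens = 1258

# Assumption: the handout's 176 nouns were counted over the same 1258 tokens
expected_noun_count = 176

# The .2% format multiplies by 100 and keeps two decimal places
observed = f"{noun_count / total_tokens:.2%}"
expected = f"{expected_noun_count / total_tokens:.2%}"

print(observed)  # 13.20%
print(expected)  # 13.99%
```

Under that assumption the gap is fully explained by the ten missing nouns rather than by a different token total.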

### **5. Display the Dependency Parse for the third sentence**

We can show the dependency parse using the displaCy module. `displacy.render` creates a graphical representation, whereas `displacy.parse_deps` returns the raw dictionary behind it, including a `'words'` list (the tokens with their part of speech) and an `'arcs'` list (the arrows showing the relationships between tokens).

In [None]:
# Using .render to show graphically

displacy.render(doc_sentence3, style='dep', options={'distance':110}, jupyter=True)

In [None]:
# Using .parse_deps to show all the info used to build the visualization -- the Tokens and their relating Arrows

displacy.parse_deps(doc_sentence3)

{'words': [{'text': 'They', 'tag': 'PRON', 'lemma': None},
  {'text': 'lived', 'tag': 'VERB', 'lemma': None},
  {'text': 'with', 'tag': 'ADP', 'lemma': None},
  {'text': 'their', 'tag': 'PRON', 'lemma': None},
  {'text': 'Mother', 'tag': 'PROPN', 'lemma': None},
  {'text': 'in', 'tag': 'ADP', 'lemma': None},
  {'text': 'a', 'tag': 'DET', 'lemma': None},
  {'text': 'sand-', 'tag': 'NOUN', 'lemma': None},
  {'text': 'bank,', 'tag': 'NOUN', 'lemma': None},
  {'text': 'underneath', 'tag': 'ADP', 'lemma': None},
  {'text': 'the', 'tag': 'DET', 'lemma': None},
  {'text': 'root', 'tag': 'NOUN', 'lemma': None},
  {'text': 'of', 'tag': 'ADP', 'lemma': None},
  {'text': 'a', 'tag': 'DET', 'lemma': None},
  {'text': '\n', 'tag': 'SPACE', 'lemma': None},
  {'text': 'very', 'tag': 'ADV', 'lemma': None},
  {'text': 'big', 'tag': 'ADJ', 'lemma': None},
  {'text': 'fir-', 'tag': 'NOUN', 'lemma': None},
  {'text': 'tree.', 'tag': 'PUNCT', 'lemma': None},
  {'text': "\n\n'", 'tag': 'SPACE', 'lemma': Non
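
To read the arrows as head-to-child relations, the `'arcs'` list can be walked alongside `'words'`. The sketch below uses a small hand-written dictionary for a made-up sentence, not real model output, and assumes each arc carries `'start'`, `'end'`, `'label'` and `'dir'` keys as in spaCy v3's displaCy. The helper `arc_to_relation` is hypothetical.

```python
# Hand-written stand-in for displacy.parse_deps output (not real model output).
# Assumption: each arc has 'start', 'end', 'label' and 'dir' keys.
parsed = {
    "words": [
        {"text": "They", "tag": "PRON"},
        {"text": "lived", "tag": "VERB"},
        {"text": "underground", "tag": "ADV"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "advmod", "dir": "right"},
    ],
}

def arc_to_relation(parsed, arc):
    """Return (head_text, label, child_text) for one arc."""
    words = parsed["words"]
    # 'start' is the leftmost token of the arc; 'dir' names the end the
    # arrow points to, which is the dependent (child) token.
    if arc["dir"] == "left":
        child, head = arc["start"], arc["end"]
    else:
        head, child = arc["start"], arc["end"]
    return words[head]["text"], arc["label"], words[child]["text"]

relations = [arc_to_relation(parsed, arc) for arc in parsed["arcs"]]
for head, label, child in relations:
    print(f"{head} --{label}--> {child}")
```

With the notebook's data, `parsed` would be the result of `displacy.parse_deps(doc_sentence3)`.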

### **6. Show the first two named entities from Beatrix Potter's "The Tale of Peter Rabbit"**

We can use spaCy's `doc.ents` attribute to look at the named entities.


In [None]:
# Define a function to print out the NER, plus label, plus explanation
def print_ents(ner):
  print(ner.text + " -- " + ner.label_ + " -- " + str(spacy.explain(ner.label_)))

# Print first named entity
print_ents(doc.ents[0])

# Print second named entity
print_ents(doc.ents[1])

The Tale of Peter Rabbit -- WORK_OF_ART -- Titles of books, songs, etc.
Beatrix Potter -- PERSON -- People, including fictional


### **7. How many sentences are contained in *The Tale of Peter Rabbit*?**

As above, we already created a list of sentences, and can print its length.

In [None]:
print("Number of Sentences: " + str(len(list_of_sentences)))

Number of Sentences: 55


### **8. CHALLENGE: How many sentences contain named entities?**

We can build a loop that goes through every sentence in the list, converts it to a Doc, searches for entities, and increases a counter whenever at least one entity is found.

In [None]:
# Count the sentences that contain at least one named entity

sent_ner_count = 0

for sentence in list_of_sentences:
  entities = nlp(str(sentence)).ents
  if len(entities) > 0:
    sent_ner_count += 1

print("Number of sentences with entities: " + str(sent_ner_count))

Number of sentences with entities: 25


This is quite a bit lower than the expected result of 49. It seems this text file is not generating as many named entities.
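
Each sentence in `doc.sents` is a `Span` with its own `.ents`, so the count can also be taken without re-parsing every sentence. The following is a minimal sketch on a toy pipeline, assuming spaCy v3: a blank English model with a rule-based sentencizer and an entity ruler, run on a made-up three-sentence text. None of the toy setup comes from the notebook.

```python
import spacy

# Toy pipeline (an assumption for illustration): no trained model is needed
toy_nlp = spacy.blank("en")
toy_nlp.add_pipe("sentencizer")              # rule-based sentence splitting
ruler = toy_nlp.add_pipe("entity_ruler")     # rule-based entity matching
ruler.add_patterns([{"label": "PERSON", "pattern": "Peter"}])

# Made-up text: the first and third sentences mention an entity
toy_doc = toy_nlp("Peter ran away. He lost his shoes. Peter was frightened.")

# Span.ents holds the entities that fall inside each sentence
toy_count = sum(1 for sent in toy_doc.sents if sent.ents)

print("Number of sentences with entities: " + str(toy_count))
```

With the notebook's `doc`, the same expression would be `sum(1 for sent in doc.sents if sent.ents)`. The result may differ from the loop above, since re-parsing a sentence in isolation can change what the model predicts.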


### **9. CHALLENGE: Display the named entity visualization for list_of_sents[0] from the previous problem.**

We can do this easily with the help of displaCy, changing the style to 'ent' and dropping the `distance` option, which only applies to the dependency view.

In [None]:
displacy.render(nlp(str(list_of_sentences[0])), style='ent', jupyter=True)