___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# Parts of Speech Assessment

For this assessment we'll be using the short story [The Tale of Peter Rabbit](https://en.wikipedia.org/wiki/The_Tale_of_Peter_Rabbit) by Beatrix Potter (1902). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/14838.txt.utf-8).

In [46]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy import displacy

import warnings
warnings.filterwarnings('ignore')

**1. Create a Doc object from the file `peterrabbit.txt`**<br>
> HINT: Use `with open('../TextFiles/peterrabbit.txt') as f:`

In [23]:
with open('../TextFiles/peterrabbit.txt') as f:
    doc = nlp(f.read())
    
doc[:70]

The Tale of Peter Rabbit, by Beatrix Potter (1902).

Once upon a time there were four little Rabbits, and their names
were--

          Flopsy,
       Mopsy,
   Cotton-tail,
and Peter.

They lived with their Mother in a sand-bank, underneath the root of a
very big fir-tree.

'

**2. For every token in the third sentence, print the token text, the coarse POS tag, the fine-grained tag (TAG), and the description of the fine-grained tag.**

In [3]:
# Create a list of the sentences in the Doc object
sent_list = [sent for sent in doc.sents]

# Print token text, part-of-speech tag, fine-grained tag, and explanation
for token in sent_list[2]:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')
    

They       PRON     PRP    pronoun, personal
lived      VERB     VBD    verb, past tense
with       ADP      IN     conjunction, subordinating or preposition
their      ADJ      PRP$   pronoun, possessive
Mother     PROPN    NNP    noun, proper singular
in         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner
sand       NOUN     NN     noun, singular or mass
-          PUNCT    HYPH   punctuation mark, hyphen
bank       NOUN     NN     noun, singular or mass
,          PUNCT    ,      punctuation mark, comma
underneath ADP      IN     conjunction, subordinating or preposition
the        DET      DT     determiner
root       NOUN     NN     noun, singular or mass
of         ADP      IN     conjunction, subordinating or preposition
a          DET      DT     determiner

          SPACE           None
very       ADV      RB     adverb
big        ADJ      JJ     adjective
fir        NOUN     NN     noun, singular or mass
-          PUNCT   

In [4]:
# refactored version of the above code to print the tokens and tags from the 3rd sentence

for token in list(doc.sents)[2]:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

They         PRON   PRP    pronoun, personal
lived        VERB   VBD    verb, past tense
with         ADP    IN     conjunction, subordinating or preposition
their        ADJ    PRP$   pronoun, possessive
Mother       PROPN  NNP    noun, proper singular
in           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner
sand         NOUN   NN     noun, singular or mass
-            PUNCT  HYPH   punctuation mark, hyphen
bank         NOUN   NN     noun, singular or mass
,            PUNCT  ,      punctuation mark, comma
underneath   ADP    IN     conjunction, subordinating or preposition
the          DET    DT     determiner
root         NOUN   NN     noun, singular or mass
of           ADP    IN     conjunction, subordinating or preposition
a            DET    DT     determiner

            SPACE         None
very         ADV    RB     adverb
big          ADJ    JJ     adjective
fir          NOUN   NN     noun, singular or mass
-            PUNCT 
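
The nested braces in the f-strings above (for example `{token.text:{12}}`) set a minimum column width, which is what lines the output up in columns. The following is a minimal sketch of that padding behaviour using a few rows copied by hand from the output, so it runs without spaCy; the helper name `format_row` is made up for illustration.

```python
# Rows copied by hand from the output above: (text, coarse POS, fine-grained tag).
rows = [
    ("They", "PRON", "PRP"),
    ("lived", "VERB", "VBD"),
    ("underneath", "ADP", "IN"),
]

def format_row(text, pos, tag, text_width=12):
    # {text:{text_width}} pads the string with spaces up to text_width characters.
    # Strings are left-aligned by default and are never truncated.
    # rstrip() drops the padding after the last column.
    return f"{text:{text_width}} {pos:{6}} {tag:{6}}".rstrip()

for row in rows:
    print(format_row(*row))
```

Because the width is only a minimum, a token longer than the column would push the remaining columns to the right, so a width at least as long as the longest token keeps the table aligned.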

**3. Provide a frequency list of POS tags from the entire document**

In [5]:
# Map each POS attribute ID (an integer) to its frequency count in the doc
POS_counts = doc.count_by(spacy.attrs.POS)

# create a frequency list of the POS tags from the entire doc
for k, v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')


83. ADJ  : 83
84. ADP  : 127
85. ADV  : 75
88. CCONJ: 61
89. DET  : 90
91. NOUN : 176
92. NUM  : 8
93. PART : 36
94. PRON : 72
95. PROPN: 75
96. PUNCT: 174
99. VERB : 182
102. SPACE: 99


**4. CHALLENGE: What percentage of tokens are nouns?**<br>
HINT: the attribute ID for 'NOUN' is 91

In [6]:
tokens = len(doc)

for k, v in POS_counts.items():
    if k == 91:
        print(f'{round((v / tokens) * 100, 2)}% of tokens in the doc are nouns.')
    

13.99% of tokens in the doc are nouns.


In [7]:
# refactored code to find the percentage of noun tokens in the doc

percent = 100*POS_counts[91]/len(doc)

print(f'{POS_counts[91]} / {len(doc)} = {percent:{.4}}%')

176 / 1258 = 13.99%
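
The same arithmetic extends to every tag. As a rough sketch, the counts below are copied by hand from the frequency list in step 3 into a plain dictionary keyed by tag name (the names `pos_counts` and `shares` are illustrative, not part of the notebook), so the example runs without loading a model.

```python
# Counts copied from the frequency list printed in step 3.
pos_counts = {
    "ADJ": 83, "ADP": 127, "ADV": 75, "CCONJ": 61, "DET": 90,
    "NOUN": 176, "NUM": 8, "PART": 36, "PRON": 72, "PROPN": 75,
    "PUNCT": 174, "VERB": 182, "SPACE": 99,
}

# The per-tag counts should add up to the number of tokens, len(doc).
total = sum(pos_counts.values())

# Percentage share of each tag, printed largest first.
shares = {tag: 100 * count / total for tag, count in pos_counts.items()}
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:6}{share:6.2f}%")
```

Summing the counts is also a quick sanity check: the total should match the `len(doc)` value shown in the output above.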


**5. Display the Dependency Parse for the third sentence**

In [8]:
displacy.render(list(doc.sents)[2], style='dep', jupyter=True, options={'distance': 110})

**Use `options` to change the appearance and `displacy.serve` to run in a browser window**

In [9]:
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

#displacy.serve(sent_list[2], style='dep', options=options)

Click this link to view the dependency: http://127.0.0.1:5000
Interrupt the kernel to return to jupyter.

**Use spans to break up the long sentence and make the visualization more manageable**

In [10]:
span1 = sent_list[2][:11]
span2 = sent_list[2][11:24]

In [11]:
displacy.render(span1, style='dep', jupyter=True, options=options)

In [12]:
displacy.render(span2, style='dep', jupyter=True, options=options)

In [13]:
# print the coarse POS tag for each token as well as the dependency tag

for token in list(doc.sents)[2]:
    print(f'{token.text:{10}} {token.pos_:{7}} {token.dep_:{7}} {spacy.explain(token.dep_)}')

They       PRON    nsubj   nominal subject
lived      VERB    ROOT    None
with       ADP     prep    prepositional modifier
their      ADJ     poss    possession modifier
Mother     PROPN   pobj    object of preposition
in         ADP     prep    prepositional modifier
a          DET     det     determiner
sand       NOUN    compound None
-          PUNCT   punct   punctuation
bank       NOUN    pobj    object of preposition
,          PUNCT   punct   punctuation
underneath ADP     prep    prepositional modifier
the        DET     det     determiner
root       NOUN    pobj    object of preposition
of         ADP     prep    prepositional modifier
a          DET     det     determiner

          SPACE           None
very       ADV     advmod  adverbial modifier
big        ADJ     amod    adjectival modifier
fir        NOUN    compound None
-          PUNCT   punct   punctuation
tree       NOUN    pobj    object of preposition
.          PUNCT   punct   punctuation


         SPACE     

**6. Show the first two named entities from Beatrix Potter's *The Tale of Peter Rabbit***

In [14]:
for entity in doc.ents[:2]:
    print(entity.text+' - '+entity.label_+' - '+str(spacy.explain(entity.label_)))


The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.
Beatrix Potter - PERSON - People, including fictional


**7. How many sentences are contained in *The Tale of Peter Rabbit*?**

In [15]:
len(sent_list)

56

In [16]:
len([sent for sent in doc.sents])

56

**8. CHALLENGE: How many sentences contain named entities?**

In [17]:
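# Re-parse each sentence as its own Doc, then keep only those that contain named entities
list_of_sents = [nlp(sent.text) for sent in doc.sents]
list_of_ent_sents = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]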

len(list_of_ent_sents)

49
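
The cell above re-parses every sentence as a separate Doc. Another way to think about the question is in terms of token ranges: a sentence contains an entity when the entity's `(start, end)` range falls inside the sentence's range. The sketch below illustrates only that containment check in plain Python. The sentence boundaries and the last entity range are made up for illustration, and the helper name `contains_entity` is hypothetical. A count obtained this way on the original Doc may differ from 49, since parsing sentences in isolation can change which entities are predicted.

```python
# Hypothetical sentence token ranges (start, end), end-exclusive.
sentence_ranges = [(0, 14), (14, 30), (30, 55)]

# Entity token ranges (start, end). The first three match the values this
# notebook reports for the first sentence; the last one is made up.
entity_ranges = [(0, 5), (7, 9), (10, 11), (33, 34)]

def contains_entity(sentence, entities):
    """Return True if any entity range lies entirely inside the sentence range."""
    s_start, s_end = sentence
    return any(s_start <= e_start and e_end <= s_end for e_start, e_end in entities)

sentences_with_entities = [
    sentence for sentence in sentence_ranges
    if contains_entity(sentence, entity_ranges)
]
print(len(sentences_with_entities))
```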

**Write a function to display basic entity info:**

In [25]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
            print()
    else:
        print('No named entities found.')

In [36]:
show_ents(sent_list[0])

The Tale of Peter Rabbit - WORK_OF_ART - Titles of books, songs, etc.

Beatrix Potter - PERSON - People, including fictional

1902 - DATE - Absolute or relative dates or periods



In [47]:
for sent in doc.sents:
    displacy.render(nlp(sent.text), style='ent', jupyter=True)

**Also print the sentences without entities, as plain text**

In [48]:
for sent in doc.sents:
    docx = nlp(sent.text)
    if docx.ents:
        displacy.render(docx, style='ent', jupyter=True)
    else:
        print(docx.text)

I am going out.'




It was a blue jacket with brass buttons, quite new.




And rushed into the tool-shed, and jumped into a can.


Presently Peter sneezed--'Kertyschoo!'


He went back to his work.




She only shook her head at him.


THE END



In [44]:
for ent in sent_list[0].ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

The Tale of Peter Rabbit 0 5 0 24 WORK_OF_ART
Beatrix Potter 7 9 29 43 PERSON
1902 10 11 45 49 DATE
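
In the output above, `start` and `end` are token indices, while `start_char` and `end_char` are character offsets into the text. As a small sketch of what the character offsets mean, the snippet below slices the story's first line (typed in by hand from the `doc[:70]` output in step 1) with the offsets printed above; no spaCy objects are involved, and the variable names are illustrative.

```python
# First line of the story, as shown in the doc[:70] output in step 1.
text = "The Tale of Peter Rabbit, by Beatrix Potter (1902)."

# (start_char, end_char, label) triples copied from the output above.
entity_offsets = [(0, 24, "WORK_OF_ART"), (29, 43, "PERSON"), (45, 49, "DATE")]

# Slicing with end-exclusive offsets recovers each entity's text.
entities = [(text[start:end], label) for start, end, label in entity_offsets]
for ent_text, label in entities:
    print(f"{ent_text!r} -> {label}")
```

The offsets line up here because the title line sits at the very start of the file, so its character positions are the same in the full text and in this one-line string.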


**9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem**

In [29]:
colors = {"WORK_OF_ART": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["WORK_OF_ART", "PERSON", "DATE"], "colors": colors}

displacy.render(list_of_sents[0], style='ent', jupyter=True, options=options)

### Great Job!

# Play with visualizations in spaCy

In [None]:
doc2 = nlp(u'Have a fantastic 2020, Bayes!')

In [31]:
show_ents(doc2)

2020 - DATE - Absolute or relative dates or periods

Bayes - ORG - Companies, agencies, institutions, etc.



In [64]:
colors = {"DATE": "linear-gradient(indigo, white, violet)", "ORG": "radial-gradient(yellow, green)"}
options = {"ents": ["ORG", "DATE"], "colors": colors}

displacy.render(doc2, style='ent', jupyter=True, options=options)
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}
displacy.render(doc2, style='dep', jupyter=True, options=options)

In [59]:
colors = {"DATE": "linear-gradient(90deg, #aa9cfc, #fc9ce7)", "ORG": "radial-gradient(yellow, green)"}
options = {"ents": ["ORG", "DATE"], "colors": colors}

displacy.render(doc2, style='ent', jupyter=True, options=options)

In [60]:
options = {'distance': 110, 'compact': True, 'color': 'yellow', 'bg': '#09a3d5', 'font': 'Times'}

In [61]:
doc2 = nlp(u'Have a fantastic 2020, Bayes!')
doc2.user_data["title"] = "Happy 2020"

displacy.render(doc2, style='dep', jupyter=True, options=options)