In [1]:
import spacy
from spacy import displacy

# Load the small English pipeline used throughout this notebook
nlp = spacy.load('en_core_web_sm')

1. Create a Doc object from the file `peterrabit.txt`

In [2]:
with open('peterrabit.txt') as f:
    doc = nlp(f.read())

2. For every token in the third sentence, print the token text, the coarse POS tag, the fine-grained tag, and the description of the fine-grained tag.

In [3]:
for token in list(doc.sents)[2]:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {str(spacy.explain(token.tag_)):{10}}")

For        ADP        IN         conjunction, subordinating or preposition
other      ADJ        JJ         adjective (English), other noun-modifier (Chinese)
uses       NOUN       NNS        noun, plural
,          PUNCT      ,          punctuation mark, comma
see        VERB       VBP        verb, non-3rd person singular present
Peter      PROPN      NNP        noun, proper singular
Rabbit     PROPN      NNP        noun, proper singular
(          PUNCT      -LRB-      left round bracket
disambiguation NOUN       NN         noun, singular or mass
)          PUNCT      -RRB-      right round bracket
.          PUNCT      .          punctuation mark, sentence closer

          SPACE      _SP        whitespace
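
The fixed width of 10 in the f-string is narrower than `disambiguation`, so that row falls out of alignment. Below is a minimal sketch of one way to size each column from the data instead. It uses a few rows copied from the output above; the `rows` list and the `format_rows` helper are hypothetical names introduced only for illustration.

```python
# Rows copied from the output above: (text, POS, fine-grained tag).
rows = [
    ("For", "ADP", "IN"),
    ("uses", "NOUN", "NNS"),
    ("disambiguation", "NOUN", "NN"),
    (")", "PUNCT", "-RRB-"),
]


def format_rows(rows):
    """Pad every column to the width of its longest entry plus one space."""
    n_cols = len(rows[0])
    widths = [max(len(row[i]) for row in rows) + 1 for i in range(n_cols)]
    return [
        "".join(f"{cell:{width}}" for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]


lines = format_rows(rows)
for line in lines:
    print(line)
```

In the notebook itself, the tuples would presumably be built from `(token.text, token.pos_, token.tag_)` for each token in the sentence.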


3. Provide a frequency list of POS tags from the entire document

In [6]:
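# Map each coarse POS attribute ID to the number of tokens carrying it
POS_counts = doc.count_by(spacy.attrs.POS)
# Sort by attribute ID and look up the readable tag name in the vocab
for k, v in sorted(POS_counts.items()):
    print(f"id:{k} {doc.vocab[k].text} {v} counts")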

id:84 ADJ 90 counts
id:85 ADP 226 counts
id:86 ADV 41 counts
id:87 AUX 47 counts
id:89 CCONJ 72 counts
id:90 DET 145 counts
id:92 NOUN 279 counts
id:93 NUM 76 counts
id:94 PART 37 counts
id:95 PRON 56 counts
id:96 PROPN 419 counts
id:97 PUNCT 245 counts
id:98 SCONJ 13 counts
id:99 SYM 6 counts
id:100 VERB 159 counts
id:101 X 4 counts
id:103 SPACE 61 counts
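
The list above is ordered by attribute ID. To rank the tags by frequency instead, one option is to sort on the count. The sketch below uses the counts copied from the output above, keyed by tag name rather than ID for readability; the `pos_counts` and `ranked` names are illustrative and not part of the notebook.

```python
# Counts copied from the frequency list above, keyed by tag name.
pos_counts = {
    "ADJ": 90, "ADP": 226, "ADV": 41, "AUX": 47, "CCONJ": 72, "DET": 145,
    "NOUN": 279, "NUM": 76, "PART": 37, "PRON": 56, "PROPN": 419,
    "PUNCT": 245, "SCONJ": 13, "SYM": 6, "VERB": 159, "X": 4, "SPACE": 61,
}

# Sort (tag, count) pairs by count, largest first.
ranked = sorted(pos_counts.items(), key=lambda item: item[1], reverse=True)

# Show the five most frequent tags.
for tag, count in ranked[:5]:
    print(f"{tag:6} {count}")
```

With the real `POS_counts` dictionary, the same `sorted(..., key=..., reverse=True)` call should work unchanged, with `doc.vocab[k].text` supplying the tag name.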


4. CHALLENGE: what percentage of tokens are nouns?

HINT: the attribute ID for 'NOUN' is 92

In [7]:
len(doc)

1976

In [8]:
100 * POS_counts[92]/len(doc)

14.119433198380566
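
As a sanity check on the arithmetic, the 17 counts in the frequency list should add up to `len(doc)`, and the noun share follows from the NOUN count alone. The sketch below only re-does that calculation with the numbers printed above; the variable names are illustrative.

```python
# Counts copied from the POS frequency list, in the order they were printed.
counts = [90, 226, 41, 47, 72, 145, 279, 76, 37, 56, 419, 245, 13, 6, 159, 4, 61]

total = sum(counts)   # should match len(doc)
noun_count = 279      # the entry for attribute ID 92 (NOUN)
noun_share = 100 * noun_count / total

print(total)
print(f"{noun_share:.2f}%")
```

This counts only tokens tagged NOUN. Proper nouns (PROPN) are tallied separately in the list above and are not included in the figure.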

5. Display the Dependency Parse for the third sentence

In [9]:
displacy.render(list(doc.sents)[2],style='dep',jupyter=True)

6. Show the first two named entities from Beatrix Potter's The Tale of Peter Rabbit

In [12]:
for ent in doc.ents[:2]:
    print(f"{ent.text}****{ent.label_}###{spacy.explain(ent.label_)}")

Wikipedia****ORG###Companies, agencies, institutions, etc.
Peter Cottontail****PERSON###People, including fictional


7. How many sentences are contained in the doc?

In [13]:
len(list(doc.sents))

53

8. CHALLENGE: How many sentences contain entities?

In [16]:
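# Re-run the pipeline on each sentence so every sentence becomes its own Doc
list_of_sents = [nlp(sent.text) for sent in doc.sents]
# Keep only the sentence Docs that contain at least one entity
list_of_ners = [sent_doc for sent_doc in list_of_sents if sent_doc.ents]
len(list_of_ners)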

53

9. CHALLENGE: Display the named entity visualization for `list_of_sents[0]` from the previous problem

In [18]:
displacy.render(list_of_sents[0],style='ent',jupyter=True)