In [1]:
import spacy, glob, os

In [2]:
# each pipeline component to disable is its own list entry
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.remove_pipe('ner')
nlp.remove_pipe('parser')

('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7feab6211d00>)

In [5]:
def get_spacy_tags(text):
    doc=nlp(text)
    for word in doc:
        print(word.text, word.tag_)

get_spacy_tags("Time flies like an arrow")

Time NN
flies VBZ
like IN
an DT
arrow NN


In [6]:
get_spacy_tags("Fruit flies like a banana")

Fruit NN
flies VBZ
like IN
a DT
banana NN


In [7]:
def read_docs(inputDir, maxDocs=100):
    """ Read in movie documents (all ending in .txt) from an input folder
    and process with spacy """
    
    docs=[]
    for idx, filename in enumerate(glob.glob(os.path.join(inputDir, '*.txt'))):
        # stop before reading more than maxDocs documents
        if idx >= maxDocs:
            break
        with open(filename) as file:
            docs.append((filename, nlp(file.read())))
    return docs

In [8]:
# directory with 2000 movie summaries from Wikipedia
inputDir="../data/movie_summaries/"
docs=read_docs(inputDir, maxDocs=100)

Here are the tags of the 45-tag Penn Treebank tagset (the table omits one of them, `#`, the pound sign):

|tag|meaning|
|---|---|
|CC|Coordinating conjunction|
|CD|Cardinal number|
|DT|Determiner|
|EX|Existential there|
|FW|Foreign word|
|IN|Preposition or subordinating conjunction|
|JJ|Adjective|
|JJR|Adjective, comparative|
|JJS|Adjective, superlative|
|LS|List item marker|
|MD|Modal|
|NN|Noun, singular or mass|
|NNS|Noun, plural|
|NNP|Proper noun, singular|
|NNPS|Proper noun, plural|
|PDT|Predeterminer|
|POS|Possessive ending|
|PRP|Personal pronoun|
|PRP\$|Possessive pronoun|
|RB|Adverb|
|RBR|Adverb, comparative|
|RBS|Adverb, superlative|
|RP|Particle|
|SYM|Symbol|
|TO|to|
|UH|Interjection|
|VB|Verb, base form|
|VBD|Verb, past tense|
|VBG|Verb, gerund or present participle|
|VBN|Verb, past participle|
|VBP|Verb, non-3rd person singular present|
|VBZ|Verb, 3rd person singular present|
|WDT|Wh-determiner|
|WP|Wh-pronoun|
|WP\$|Possessive wh-pronoun|
|WRB|Wh-adverb|
|.|period|
|,|comma|
|:|colon|
|(|left separator|
|)|right separator|
|\$|dollar sign|
|\`\`|open double quotes|
|''|close double quotes|

Explore these tags below by searching the (automatically tagged) movie summary corpus for sentences that contain each one.
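Before searching for individual tags, it can help to see how often each one occurs. The following is a minimal sketch, not part of the original notebook: `tag_counts` is a hypothetical helper that takes any iterable of (text, tag) pairs, and the sample input is copied from the "Time flies like an arrow" output above.

```python
from collections import Counter

def tag_counts(tagged_tokens):
    """Count how often each tag occurs in an iterable of (text, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)

# Toy input copied from the "Time flies like an arrow" output above.
sample = [("Time", "NN"), ("flies", "VBZ"), ("like", "IN"), ("an", "DT"), ("arrow", "NN")]

# Print the tags from most to least frequent.
for tag, count in tag_counts(sample).most_common():
    print(tag, count)
```

To run it over the corpus, one option is to pass `((token.text, token.tag_) for _, doc in docs for token in doc)` as the argument.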

In [9]:
def find_examples(docs, tag, num_examples=10, window=5):
    """ Print up to num_examples tokens tagged with `tag`, each with
    `window` tokens of context on either side """
    count=0
    for _, doc in docs:
        # skip the first and last `window` tokens so every match has full context
        for idx in range(window, len(doc)-window):
            token=doc[idx]
            if token.tag_ == tag:
                left=' '.join(context.text for context in doc[idx-window:idx])
                right=' '.join(context.text for context in doc[idx+1:idx+window+1])
                # the ANSI escape codes print the matched token in red
                print(left, "\033[91m%s\033[0m" % token.text, right)
                # for windows users - you may want to use the following print statement
                # to highlight the middle token in each sentence using #s
                # print(left, "#%s#" % token.text, right)
                count+=1
                if count >= num_examples:
                    return

In [23]:
find_examples(docs, "DT", num_examples=10, window=5)

After Tre gets involved in a fight at school , his
his teacher calls Reva . The teacher informs Reva that although
adults alike . Frightened about the future of her child ,
sends him to live in the Crenshaw neighborhood of South Central
learn life lessons . On the night of Tre 's arrival
hears his father firing at a burglar . LAPD officers arrive
LAPD officers arrive more than an hour later , and eventually
later , and eventually decide the crime is unimportant because nothing
because nothing was taken and the burglar escaped completely unharmed .
burglar escaped completely unharmed . The police , particularly the African


What's the difference between the following?

* PRP and PRP\$

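PRP = personal pronoun (e.g. he, him, she); PRP\$ = possessive pronoun used before a noun (e.g. my, his, her, their). Standalone possessives such as mine and hers are tagged PRP.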

* NN and NNP

common noun (singular or mass) vs. proper noun (singular)

* JJ and JJR

general adjective vs comparative adjective

* VBZ and VB

3rd person singular present verb vs. base form of a verb
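Each pair above shares a prefix: for these four families, the first letters of the tag name the broad word class and the remaining letters pick out a subtype. A rough sketch of that idea follows; `coarse_class` and `PREFIX_TO_CLASS` are hypothetical names introduced here for illustration and are not part of spacy.

```python
# Hypothetical prefix table covering only the four tag families discussed above.
PREFIX_TO_CLASS = [
    ("PRP", "pronoun"),
    ("NN", "noun"),
    ("JJ", "adjective"),
    ("VB", "verb"),
]

def coarse_class(tag):
    """Map a fine-grained Penn Treebank tag to a broad class by its prefix."""
    for prefix, name in PREFIX_TO_CLASS:
        if tag.startswith(prefix):
            return name
    return "other"

# Both members of each pair fall into the same broad class.
for first, second in [("PRP", "PRP$"), ("NN", "NNP"), ("JJ", "JJR"), ("VBZ", "VB")]:
    print(first, second, coarse_class(first), coarse_class(second))
```

Tags outside these four families (for example `DT` or `IN`) fall through to `"other"` in this sketch.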

Q2: Use the `find_examples` function to help understand the usage of each part-of-speech tag, then work with a partner to manually tag the following four sentences.

1. "Open the pod bay doors, Hal"

VB, DT, JJ, JJ, NNS, NNP

In [12]:
get_spacy_tags("Open the pod bay doors, Hal")

Open VB
the DT
pod JJ
bay NN
doors NNS
, ,
Hal NNP


2. "Frankly, my dear, I don't give a damn"

RB, PRP$, NN, PRP, VB, RB, VB, DT, NN


In [17]:
get_spacy_tags("Frankly, my dear, I don't give a damn")

Frankly RB
, ,
my PRP$
dear NN
, ,
I PRP
do VBP
n't RB
give VB
a DT
damn NN


3. "May the Force be with you"

MD, DT, NN, VB, IN, PRP

In [19]:
get_spacy_tags("May the force be with you")

May MD
the DT
force NN
be VB
with IN
you PRP


4. "One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know"

JJ, NN, PRP, VBD, DT, NN, IN, PRP\$, NN. RB, PRP, VBD, IN, PRP\$, NN, PRP, VBP, RB, VB

In [24]:
get_spacy_tags("One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know")

One CD
morning NN
I PRP
shot VBD
an DT
elephant NN
in IN
my PRP$
pajamas NNS
. .
How WRB
he PRP
got VBD
in IN
my PRP$
pajamas NNS
, ,
I PRP
do VBP
n't RB
know VB


Q3: After tagging the sentences above by hand, run them through the spacy tagger; what's spacy's accuracy on these sentences?

In [21]:
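# Overall, spacy agreed with most of our hand tags;
# most of the disagreements were in the last sentence.
# Some issues where we disagreed:
# pajamas: we tagged it NN, but spacy tags it NNS (plural noun)
# One: we tagged it JJ since it modifies the noun "morning", but spacy tags it CD (cardinal number)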
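One way to put a number on Q3 is to compare the hand tags and the spacy tags position by position. The sketch below is an illustration rather than part of the original notebook: `tag_accuracy` is a hypothetical helper, and it treats the hand tags as the reference. The two lists are copied from sentence 1 above, with spacy's comma token dropped because the hand tags skip punctuation.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the two tag sequences agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        return 0.0
    matches = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return matches / len(gold_tags)

# Sentence 1, "Open the pod bay doors, Hal": hand tags vs. the spacy output,
# with the comma removed so both sequences have one tag per word.
hand_tags = ["VB", "DT", "JJ", "JJ", "NNS", "NNP"]
spacy_tags = ["VB", "DT", "JJ", "NN", "NNS", "NNP"]

# 5 of the 6 tags match; the only disagreement is on "bay".
print(tag_accuracy(hand_tags, spacy_tags))
```

The same call works for the other three sentences once punctuation tokens are removed from the spacy output in the same way.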
