## Parts of Speech

spaCy goes beyond splitting text into tokens: it also uses the surrounding context to work out the role each word plays. Part-of-speech (POS) tags come at two levels: coarse (noun, verb, adjective) and fine-grained (plural noun, past-tense verb, superlative adjective).
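As a quick way to see the two levels side by side, the sketch below looks up tag descriptions with spacy.explain, the same helper used later in this notebook. The tag pairs are picked by hand to mirror the examples above, and no model needs to be loaded for this lookup.

```python
import spacy

# Hand-picked (coarse, fine-grained) tag pairs mirroring the examples in the text.
pairs = [("NOUN", "NNS"), ("VERB", "VBD"), ("ADJ", "JJS")]

descriptions = {}
for coarse, fine in pairs:
    # spacy.explain returns a short human-readable description of a label.
    descriptions[coarse] = spacy.explain(coarse)
    descriptions[fine] = spacy.explain(fine)
    print(f"{coarse:<6} {descriptions[coarse]:<10} {fine:<4} {descriptions[fine]}")
```

Note that the coarse tag cannot always be read off the fine-grained tag alone: a VBZ token can be tagged VERB or AUX depending on its role in the sentence.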

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

# Columns: text, POS (coarse tag), TAG (fine-grained tag), DEP (syntactic dependency)
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {token.dep_:{10}}")

Apple      PROPN      NNP        nsubj     
is         AUX        VBZ        aux       
looking    VERB       VBG        ROOT      
at         ADP        IN         prep      
buying     VERB       VBG        pcomp     
U.K.       PROPN      NNP        dobj      
startup    NOUN       NN         advcl     
for        ADP        IN         prep      
$          SYM        $          quantmod  
1          NUM        CD         compound  
billion    NUM        CD         pobj      


In [3]:
# Compare the second token of two sentences: the present-tense auxiliary "am" vs. past-tense "read"

doc1 = nlp(u'I am reading books on NLP')
token = doc1[1]
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")

doc2 = nlp(u'I read a book on NLP')
token = doc2[1]
print(f"{token.text:{10}} {token.pos_:{10}} {token.tag_:{10}} {spacy.explain(token.tag_):{10}}")


am         AUX        VBP        verb, non-3rd person singular present
read       VERB       VBD        verb, past tense


In [4]:
## POS counts table
doc3 = nlp(u"The quick brown fox jumps over the lazy dog's back")
POS_counts = doc3.count_by(spacy.attrs.POS)
for k,v in sorted(POS_counts.items()):
    print(f"{k:{10}} {doc3.vocab[k].text:{5}} {v}")

        84 ADJ   3
        85 ADP   1
        90 DET   2
        92 NOUN  3
        94 PART  1
       100 VERB  1


In [5]:
## Tag counts table
doc3 = nlp(u"The quick brown fox jumps over the lazy dog's back")
Tag_counts = doc3.count_by(spacy.attrs.TAG)
for k,v in sorted(Tag_counts.items()):
    print(f"{k:{10}} {doc3.vocab[k].text:{5}} {v}")

        74 POS   1
1292078113972184607 IN    1
10554686591937588953 JJ    3
13927759927860985106 VBZ   1
15267657372422890137 DT    2
15308085513773655218 NN    3


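The POS IDs (84, 85, 90, etc.) are much shorter than most TAG IDs because the two kinds of ID come from different places. Coarse POS tags belong to spaCy's small built-in symbol table, so they get low integer IDs. Fine-grained tags are ordinary strings in the vocabulary's string store and are identified by 64-bit hash values. The fine-grained tag POS (possessive ending) shows up as 74 only because the string "POS" also happens to be the name of a built-in symbol.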
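One way to check where these IDs come from is to add the strings to a standalone StringStore, which needs no loaded model. This is a minimal sketch. The variable names are illustrative, and the IDs it prints should line up with the tables above, assuming the same spaCy hashing scheme.

```python
from spacy.strings import StringStore

store = StringStore()

# "NN" is not a built-in symbol, so its ID is a 64-bit hash of the string.
nn_id = store.add("NN")

# "POS" is also the name of a built-in symbol, so it keeps that symbol's small ID.
pos_id = store.add("POS")

# The store maps in both directions: string -> ID and ID -> string.
print(nn_id, store[nn_id])
print(pos_id, store[pos_id])
```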

In [6]:
# Visualise the dependency parse (with POS labels under each word)
from spacy import displacy

doc4 = nlp(u"The quick brown fox jumps over the lazy dog's back")
displacy.render(doc4, style='dep', jupyter=True) # dep - dependency tree


In [7]:
# distance - spacing between words, compact - use square arrows to save space,
# color - text colour (the dep visualiser takes 'color'; 'colors' applies only to style='ent')
options = {'distance': 110, 'compact': True, 'color': '#ff0000'}
displacy.render(doc4, style='dep', options=options, jupyter=True)

In [9]:
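# Render each sentence as its own dependency tree

doc5 = nlp(u"The quick brown fox jumps over the lazy dog's back. This is a short sentence.")
spans = list(doc5.sents)  # doc5.sents is a generator, so materialise it into a list of Span objects
displacy.serve(spans, style='dep', options=options)  # starts a local web server; interrupt the kernel to stop it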


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.
