## <mark> POS tagging

Part-of-speech tagging (or PoS tagging) is the process of assigning a part-of-speech label, such as noun, verb, or adjective, to each word in a sentence. For example, the tag “Noun” would be assigned to nouns.

The basic idea behind part-of-speech tagging is that each part of speech follows its own syntactic rules: verbs change form depending on tense, pronouns stand in for nouns, determiners like ‘a’ or ‘the’ come before nouns, and so on. Tagging every word in a text makes these roles explicit, which gives machine learning models more specific features than the raw words alone and supports tasks such as rewriting sentences or mining text for structured data.
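
To make the idea of "one tag per word" concrete, here is a minimal sketch of a lookup-based tagger. It is a toy: the `TOY_LEXICON` dictionary and the `toy_tag` function are hypothetical names invented for this illustration, and this is not how spaCy or NLTK tag text.

```python
# Toy lexicon mapping lowercase words to coarse POS tags (illustrative only).
TOY_LEXICON = {
    "the": "DET",
    "a": "DET",
    "hunter": "NOUN",
    "deer": "NOUN",
    "killed": "VERB",
}

def toy_tag(sentence):
    """Return (word, tag) pairs; unknown words get the placeholder tag 'X'."""
    return [(word, TOY_LEXICON.get(word.lower(), "X")) for word in sentence.split()]

tagged = toy_tag("the hunter killed a deer")
for word, tag in tagged:
    print(f"{word:10} {tag}")
```

A fixed lookup table cannot disambiguate a word like "thought", which can be a noun or a verb depending on context. That limitation is why practical taggers rely on models that look at the surrounding words.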

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [6]:
doc1 = nlp('''Then the Queen grew terribly jealous of Snow White and thought and thought how she could get rid of her, till at last she went to a hunter and engaged him for a large sum of money to take Snow White out into the forest and there kill her and bring back her heart.
But when the hunter had taken Snow White out into the forest and thought to kill her, she was so beautiful that his heart failed him, and he let her go, telling her she must not, for his sake and for her own, return to the King's palace. 
Then he killed a deer and took back the heart to the Queen, telling her that it was the heart of Snow White.
Snow White wandered on and on till she got through the forest and came to a mountain hut and knocked at the door, but she got no reply. She was so tired that she lifted up the latch and walked in, and there she saw three little beds and three little chairs and three little cupboards all ready for use. 
''')

for token in doc1[:5]:
    print('Word is  : ' , token.text)
    print('POS is   : ' , token.pos_  , '===', spacy.explain(token.pos_))
    print('Dep is   : ' , token.dep_, '===', spacy.explain(token.dep_))
    print('Tag is   : ' , token.tag_, '===', spacy.explain(token.tag_))
    print('-----------------------')

Word is  :  Then
POS is   :  ADV === adverb
Dep is   :  advmod === adverbial modifier
Tag is   :  RB === adverb
-----------------------
Word is  :  the
POS is   :  DET === determiner
Dep is   :  det === determiner
Tag is   :  DT === determiner
-----------------------
Word is  :  Queen
POS is   :  PROPN === proper noun
Dep is   :  nsubj === nominal subject
Tag is   :  NNP === noun, proper singular
-----------------------
Word is  :  grew
POS is   :  VERB === verb
Dep is   :  ROOT === root
Tag is   :  VBD === verb, past tense
-----------------------
Word is  :  terribly
POS is   :  ADV === adverb
Dep is   :  advmod === adverbial modifier
Tag is   :  RB === adverb
-----------------------


In [8]:
for token in doc1[10:15]:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

thought    VERB     VBD    verb, past tense
and        CCONJ    CC     conjunction, coordinating
thought    VERB     VBD    verb, past tense
how        SCONJ    WRB    wh-adverb
she        PRON     PRP    pronoun, personal
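
The `{token.text:{10}}` syntax in the cell above is a nested format specification: the inner braces supply the field width, so each column is padded to a fixed size. A small stand-alone sketch, using plain tuples copied from the output above in place of spaCy tokens:

```python
# (text, pos, tag) triples copied from the output above, standing in for spaCy tokens.
rows = [
    ("thought", "VERB", "VBD"),
    ("and", "CCONJ", "CC"),
    ("how", "SCONJ", "WRB"),
]

width = 10  # column width for the first field
# Strings are left-aligned by default; the trailing '|' makes the padding visible.
lines = [f"{text:{width}} {pos:{8}} {tag:{6}}|" for text, pos, tag in rows]
for line in lines:
    print(line)
```

Because the width is an ordinary expression, it can be computed from the data, for example from the longest token in the document.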


In [11]:
POS_counts = doc1.count_by(spacy.attrs.POS)

for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc1.vocab[k].text:{5}}: {v}')

84. ADJ  : 10
85. ADP  : 21
86. ADV  : 12
87. AUX  : 7
89. CCONJ: 18
90. DET  : 16
92. NOUN : 23
93. NUM  : 3
94. PART : 4
95. PRON : 24
96. PROPN: 13
97. PUNCT: 14
98. SCONJ: 7
100. VERB : 28
103. SPACE: 4
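
`count_by` returns a dictionary keyed by integer attribute IDs, which is why each key is looked up in `doc1.vocab` to recover its label. When only the labels matter, the same kind of tally can be built directly from the tag strings. A minimal sketch with `collections.Counter`, using the five coarse tags printed for `doc1[:5]` earlier instead of a live spaCy `Doc`:

```python
from collections import Counter

# Coarse POS tags of the first five tokens, copied from the earlier output.
pos_tags = ["ADV", "DET", "PROPN", "VERB", "ADV"]

pos_counts = Counter(pos_tags)
# most_common() yields (label, count) pairs, highest count first.
for label, count in pos_counts.most_common():
    print(f"{label:{5}}: {count}")
```

With a real document, the list could be replaced by a generator such as `token.pos_ for token in doc1`.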


In [12]:
TAG_counts = doc1.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc1.vocab[k].text:{4}}: {v}')

74. POS : 1
164681854541413346. RB  : 13
783433942507015291. NNS : 3
1292078113972184607. IN  : 23
1534113631682161808. VBG : 2
2593208677638477497. ,   : 9
3822385049556375858. VBN : 3
4062917326063685704. PRP$: 4
5595707737748328492. TO  : 2
6860118812490040284. RP  : 3
6893682062797376370. _SP : 4
8427216679587749980. CD  : 3
10554686591937588953. JJ  : 10
12646065887601541794. .   : 5
13656873538139661788. PRP : 20
14200088355797579614. VB  : 7
15267657372422890137. DT  : 16
15308085513773655218. NN  : 20
15794550382381185553. NNP : 13
16235386156175103506. MD  : 2
17109001835818727656. VBD : 21
17524233984504158541. WRB : 2
17571114184892886314. CC  : 18
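
Sorting by key orders the rows by spaCy's internal IDs (the large numbers are hash values of the tag strings), which says little about the text. Ordering by count is usually easier to read. A sketch using a few label/count pairs copied from the output above:

```python
# A subset of the fine-grained tag counts shown above, keyed by label.
tag_counts = {"RB": 13, "IN": 23, "PRP": 20, "DT": 16, "NN": 20, "VBD": 21, "CC": 18}

# Sort by count (descending), breaking ties alphabetically by label.
ranked = sorted(tag_counts.items(), key=lambda item: (-item[1], item[0]))
for label, count in ranked:
    print(f"{label:{4}}: {count}")
```

The same `key` function works on the `TAG_counts` dictionary itself, as long as each integer key is still resolved through `doc1.vocab[k].text` when printing.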


In [13]:
DEP_counts = doc1.count_by(spacy.attrs.DEP)

for k,v in sorted(DEP_counts.items()):
    print(f'{k}. {doc1.vocab[k].text:{4}}: {v}')

0.     : 9
399. advcl: 1
400. advmod: 2
402. amod: 7
403. appos: 2
405. aux : 1
406. auxpass: 2
407. cc  : 3
410. conj: 3
415. det : 8
416. dobj: 6
428. npadvmod: 2
429. nsubj: 4
430. nsubjpass: 3
438. pcomp: 1
439. pobj: 14
440. poss: 3
443. prep: 16
445. punct: 13
447. relcl: 2
450. xcomp: 1
7037928807040764755. compound: 9
8206900633647566924. ROOT: 4


In [14]:
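import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# The cells below need NLTK data packages (tokenizer models, the POS tagger,
# and the state_union corpus); fetch any that are missing with nltk.download().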

In [16]:
text = 'Himalaya is separating the plains of the Indian subcontinent from the Tibetan Plateau.'
tags = nltk.pos_tag(nltk.word_tokenize(text))

# NLTK returns Penn Treebank tags, which spacy.explain() can also describe.
for w, m in tags:
    print(f'word : ({w}), type : ({m}) , means :  ({spacy.explain(m)})')

word : (Himalaya), type : (NNP) , means :  (noun, proper singular)
word : (is), type : (VBZ) , means :  (verb, 3rd person singular present)
word : (separating), type : (VBG) , means :  (verb, gerund or present participle)
word : (the), type : (DT) , means :  (determiner)
word : (plains), type : (NNS) , means :  (noun, plural)
word : (of), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (the), type : (DT) , means :  (determiner)
word : (Indian), type : (JJ) , means :  (adjective (English), other noun-modifier (Chinese))
word : (subcontinent), type : (NN) , means :  (noun, singular or mass)
word : (from), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (the), type : (DT) , means :  (determiner)
word : (Tibetan), type : (NNP) , means :  (noun, proper singular)
word : (Plateau), type : (NNP) , means :  (noun, proper singular)
word : (.), type : (.) , means :  (punctuation mark, sentence closer)
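
Since `nltk.pos_tag` returns plain `(word, tag)` tuples, the result can be filtered with ordinary list comprehensions. A short sketch that pulls out nouns and verbs, with the pairs copied from the output above so that it runs without NLTK:

```python
# (word, tag) pairs copied from the nltk.pos_tag output above.
tags = [
    ("Himalaya", "NNP"), ("is", "VBZ"), ("separating", "VBG"),
    ("the", "DT"), ("plains", "NNS"), ("of", "IN"), ("the", "DT"),
    ("Indian", "JJ"), ("subcontinent", "NN"), ("from", "IN"),
    ("the", "DT"), ("Tibetan", "NNP"), ("Plateau", "NNP"), (".", "."),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith("NN")]
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tags if tag.startswith("VB")]

print("nouns:", nouns)
print("verbs:", verbs)
```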


In [18]:
text = '''
Mount Everest is Earth's highest mountain above sea level, 
located in the Mahalangur Himal sub-range of the Himalayas. 
The China–Nepal border runs across its summit point. 
Its elevation of 8,848.86 m was most recently established in 2020 by the Chinese and Nepali authorities.
'''

In [20]:
# Passing text to the constructor trains Punkt (unsupervised) on that text.
custom_sent_tokenizer = PunktSentenceTokenizer(text)
tokenized = custom_sent_tokenizer.tokenize(text)
tokenized[:2]

["\nMount Everest is Earth's highest mountain above sea level, \nlocated in the Mahalangur Himal sub-range of the Himalayas.",
 'The China–Nepal border runs across its summit point.']

In [24]:
for i in tokenized[:1]:
    for w , m in nltk.pos_tag(nltk.word_tokenize(i)):
        print(f'word : ({w}), type : ({m}) , means :  ({spacy.explain(m)})')

word : (Mount), type : (NNP) , means :  (noun, proper singular)
word : (Everest), type : (NNP) , means :  (noun, proper singular)
word : (is), type : (VBZ) , means :  (verb, 3rd person singular present)
word : (Earth), type : (NNP) , means :  (noun, proper singular)
word : ('s), type : (POS) , means :  (possessive ending)
word : (highest), type : (JJS) , means :  (adjective, superlative)
word : (mountain), type : (NN) , means :  (noun, singular or mass)
word : (above), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (sea), type : (NN) , means :  (noun, singular or mass)
word : (level), type : (NN) , means :  (noun, singular or mass)
word : (,), type : (,) , means :  (punctuation mark, comma)
word : (located), type : (VBN) , means :  (verb, past participle)
word : (in), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (the), type : (DT) , means :  (determiner)
word : (Mahalangur), type : (NNP) , means :  (noun, proper singular)
word : (

In [27]:
import re
# Two State of the Union addresses from the NLTK state_union corpus.
train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

In [29]:
train_text[:300]

"PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nFebruary 2, 2005\n\n\n9:10 P.M. EST \n\nTHE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: \n\nAs a new Congress gathers, all of us in the elected branches of governme"

In [32]:
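# Punkt is trained here on sample_text itself; pass train_text instead to
# train on the 2005 address and tokenize the 2006 one.
custom_sent_tokenizer = PunktSentenceTokenizer(sample_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
tokenized[:3]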

["PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION\n \nJanuary 31, 2006\n\nTHE PRESIDENT: Thank you all.",
 'Mr. Speaker, Vice President Cheney, members of Congress, members of the Supreme Court and diplomatic corps, distinguished guests, and fellow citizens: Today our nation lost a beloved, graceful, courageous woman who called America to its founding ideals and carried on a noble dream.',
 'Tonight we are comforted by the hope of a glad reunion with the husband who was taken so long ago, and we are grateful for the good life of Coretta Scott King.']
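
Note that the second sentence above keeps "Mr. Speaker" together even though "Mr." ends in a period. For contrast, here is a rough sketch of what a naive punctuation-based splitter does with similar input. The regular expression is a deliberately simplistic stand-in, not Punkt's algorithm, and the `text` string is a shortened adaptation of the speech written for this example:

```python
import re

# Shortened, adapted example text (not the full speech).
text = "Mr. Speaker, members of Congress: Today our nation lost a beloved woman. Tonight we are comforted."

# Naive rule: split on whitespace that follows '.', '!' or '?'.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule cuts the text after "Mr.", producing a one-word "sentence". Handling abbreviations like this is the main reason to prefer a trained tokenizer such as Punkt over a hand-written pattern.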

In [34]:
for i in tokenized[:1]:
    for w , m in nltk.pos_tag(nltk.word_tokenize(i)):
        print(f'word : ({w}), type : ({m}) , means :  ({spacy.explain(m)})')
    print('-----------------------------------------------')

word : (PRESIDENT), type : (NNP) , means :  (noun, proper singular)
word : (GEORGE), type : (NNP) , means :  (noun, proper singular)
word : (W.), type : (NNP) , means :  (noun, proper singular)
word : (BUSH), type : (NNP) , means :  (noun, proper singular)
word : ('S), type : (POS) , means :  (possessive ending)
word : (ADDRESS), type : (NNP) , means :  (noun, proper singular)
word : (BEFORE), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (A), type : (NNP) , means :  (noun, proper singular)
word : (JOINT), type : (NNP) , means :  (noun, proper singular)
word : (SESSION), type : (NNP) , means :  (noun, proper singular)
word : (OF), type : (IN) , means :  (conjunction, subordinating or preposition)
word : (THE), type : (NNP) , means :  (noun, proper singular)
word : (CONGRESS), type : (NNP) , means :  (noun, proper singular)
word : (ON), type : (NNP) , means :  (noun, proper singular)
word : (THE), type : (NNP) , means :  (noun, proper singular)
word : (STATE)