### Text Preprocessing in Python - II

<img src="https://user-images.githubusercontent.com/32620288/166104650-bca608ed-afc3-4c56-8bf2-eebf0b52b054.png" width="400" height="1">

*Divakar Kumar*

----

In [4]:
# import the necessary libraries
import nltk
import string
import re

### Part of Speech Tagging:
A part-of-speech (POS) tag describes the grammatical role a word plays in a sentence. The same word can take different roles, and therefore different meanings, depending on its context. Basic natural language processing models such as bag-of-words ignore word order and cannot capture these distinctions. POS tagging assigns each word a tag based on its context in the data, and the tags are then used by later steps such as extracting relationships between words.
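
To make the idea of context-dependent tags concrete, here is a minimal sketch that needs no tagger at all. The two sentences and their Penn Treebank tags are hand-assigned assumptions for illustration; a real tagger such as `nltk.pos_tag` may label them differently.

```python
from collections import defaultdict

# Hand-assigned Penn Treebank tags (an assumption for illustration;
# a real tagger may produce different labels).
tagged_sentences = [
    [("They", "PRP"), ("signed", "VBD"), ("the", "DT"), ("contract", "NN")],
    [("Muscles", "NNS"), ("contract", "VBP"), ("when", "WRB"), ("cold", "JJ")],
]

# Collect every tag observed for each lower-cased word.
tags_by_word = defaultdict(set)
for sentence in tagged_sentences:
    for word, tag in sentence:
        tags_by_word[word.lower()].add(tag)

# Words seen with more than one tag depend on context for their role.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'contract': ['NN', 'VBP']}
```

A bag-of-words model would count both occurrences of "contract" as the same feature, which is exactly the distinction POS tags preserve.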

In [5]:
text="The Russia-Ukraine war has now entered day 66 with heavy shelling by invading troops along the entire line of contract in the Dobass region, east Ukraine. Meanwhile, dozens of people were injured in the blast that took off during UN Secretary-General Antonio Guterres' visit to Kyiv. On the other hand, Ukraine has accused Russia of robbing the occupied cities, leading to massive food insecurity"
text

"The Russia-Ukraine war has now entered day 66 with heavy shelling by invading troops along the entire line of contract in the Dobass region, east Ukraine. Meanwhile, dozens of people were injured in the blast that took off during UN Secretary-General Antonio Guterres' visit to Kyiv. On the other hand, Ukraine has accused Russia of robbing the occupied cities, leading to massive food insecurity"

In [6]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag
  
# convert text into word_tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)
  
pos_tagging(text)

[('The', 'DT'),
 ('Russia-Ukraine', 'NNP'),
 ('war', 'NN'),
 ('has', 'VBZ'),
 ('now', 'RB'),
 ('entered', 'VBN'),
 ('day', 'NN'),
 ('66', 'CD'),
 ('with', 'IN'),
 ('heavy', 'JJ'),
 ('shelling', 'NN'),
 ('by', 'IN'),
 ('invading', 'VBG'),
 ('troops', 'NNS'),
 ('along', 'IN'),
 ('the', 'DT'),
 ('entire', 'JJ'),
 ('line', 'NN'),
 ('of', 'IN'),
 ('contract', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Dobass', 'NNP'),
 ('region', 'NN'),
 (',', ','),
 ('east', 'JJ'),
 ('Ukraine', 'NNP'),
 ('.', '.'),
 ('Meanwhile', 'RB'),
 (',', ','),
 ('dozens', 'NNS'),
 ('of', 'IN'),
 ('people', 'NNS'),
 ('were', 'VBD'),
 ('injured', 'VBN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('blast', 'NN'),
 ('that', 'WDT'),
 ('took', 'VBD'),
 ('off', 'RP'),
 ('during', 'IN'),
 ('UN', 'NNP'),
 ('Secretary-General', 'NNP'),
 ('Antonio', 'NNP'),
 ('Guterres', 'NNP'),
 ("'", 'POS'),
 ('visit', 'NN'),
 ('to', 'TO'),
 ('Kyiv', 'NNP'),
 ('.', '.'),
 ('On', 'IN'),
 ('the', 'DT'),
 ('other', 'JJ'),
 ('hand', 'NN'),
 (',', ','),
 ('Uk
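
Once the tokens are tagged, the result is a plain list of `(token, tag)` tuples, so ordinary Python is enough to work with it. The sketch below copies the first few pairs from the output above (rather than calling the tagger again) and shows two common follow-up steps: counting tags and keeping only nouns. The variable names are illustrative.

```python
from collections import Counter

# First few (token, tag) pairs copied from the tagger output above.
tagged = [
    ("The", "DT"), ("Russia-Ukraine", "NNP"), ("war", "NN"),
    ("has", "VBZ"), ("now", "RB"), ("entered", "VBN"),
    ("day", "NN"), ("66", "CD"), ("with", "IN"),
    ("heavy", "JJ"), ("shelling", "NN"),
]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Keep every noun: all Penn Treebank noun tags start with "NN".
nouns = [token for token, tag in tagged if tag.startswith("NN")]

print(tag_counts.most_common(1))  # [('NN', 3)]
print(nouns)  # ['Russia-Ukraine', 'war', 'day', 'shelling']
```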



<img src="https://user-images.githubusercontent.com/32620288/166105128-d714cb89-3f6c-46b4-8e1e-a770d47498b9.png" width="600" height="1">

In [7]:
# download the tagset 
nltk.download('tagsets')
  
# extract information about the tag
nltk.help.upenn_tagset('NN')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### Chunking:
Chunking is the process of extracting phrases from unstructured text and adding more structure to it. It is also known as shallow parsing. It is done on top of part-of-speech tagging: it groups words into "chunks", mainly noun phrases. In NLTK, the chunks are described with regular expressions written over POS tags.

In [8]:
sentence = 'Sri Lankas Opposition To Bring No-confidence Motion Against Govt Amid Economic Turbulence'
sentence

'Sri Lankas Opposition To Bring No-confidence Motion Against Govt Amid Economic Turbulence'

In [None]:
from nltk.tokenize import word_tokenize 
from nltk import pos_tag
  
# define chunking function with text and regular
# expression representing grammar as parameter
def chunking(text, grammar):
    word_tokens = word_tokenize(text)
  
    # label words with part of speech
    word_pos = pos_tag(word_tokens)
  
    # create a chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
  
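    # parse the POS-tagged tokens into a chunk tree
    tree = chunkParser.parse(word_pos)

    # print the full tree, then each chunk subtree
    for subtree in tree.subtrees():
        print(subtree)

    # open a window showing the tree (needs a graphical display)
    tree.draw()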
      
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)

(S
  Sri/NNP
  Lankas/NNP
  Opposition/NNP
  To/TO
  Bring/VB
  No-confidence/NNP
  Motion/NNP
  Against/NNP
  Govt/NNP
  Amid/NNP
  Economic/NNP
  (NP Turbulence/NN))
(NP Turbulence/NN)


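In the given example, the grammar is defined using a simple regular-expression rule over POS tags. The rule says that an NP (noun phrase) chunk should be formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN). Because the headline is written in title case, the tagger labels most of its words as proper nouns (NNP), which the NN pattern does not match, so only "Turbulence" ends up in a chunk.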
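
The rule described above can be imitated with the standard `re` module to show what "a regular expression over tags" means. This is a rough sketch of the idea, not NLTK's implementation: the helper `chunk_by_tag_pattern` and the hand-tagged sentence are assumptions made for illustration.

```python
import re

def chunk_by_tag_pattern(tagged, pattern):
    """Return the token groups whose tag sequence matches `pattern`."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([token for token, _ in tagged[start:end]])
    return chunks

# Hand-tagged sentence (tags are assumptions, not tagger output).
tagged = [("the", "DT"), ("heavy", "JJ"), ("shelling", "NN"),
          ("hit", "VBD"), ("a", "DT"), ("city", "NN")]

# Same shape as the rule above: optional DT, any number of JJ, then NN.
np_pattern = r"(?:<DT>)?(?:<JJ>)*<NN>"
print(chunk_by_tag_pattern(tagged, np_pattern))
# [['the', 'heavy', 'shelling'], ['a', 'city']]
```

Note that the pattern `<NN>` matches only the exact tag NN. To also capture plural and proper nouns in an NLTK grammar, a broader tag pattern such as `<NN.*>` is commonly used.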

###### Libraries like spaCy and TextBlob are often better suited for chunking.

![image](https://user-images.githubusercontent.com/32620288/166138721-0e6384ef-009a-4f8f-a3eb-51c0dc22a026.png)

### Named Entity Recognition:
Named Entity Recognition (NER) is used to extract information from unstructured text. It locates the entities mentioned in a text and classifies them into categories such as person, organization, event, or place. This tells us which real-world things the text is about and provides a starting point for finding the relationships between them.

In [1]:
text="Sri Lanka's leader of Opposition and Samagi Jana Balawegaya (SJB) leader Sajith Premadasa announced bringing a no-confidence motion against the incumbent govt."
text

"Sri Lanka's leader of Opposition and Samagi Jana Balawegaya (SJB) leader Sajith Premadasa announced bringing a no-confidence motion against the incumbent govt."

In [3]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
  
def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
  
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
  
    # tree of word entities
    print(ne_chunk(word_pos))
  
named_entity_recognition(text)

(S
  (PERSON Sri/NNP)
  (PERSON Lanka/NNP)
  's/POS
  leader/NN
  of/IN
  (ORGANIZATION Opposition/NNP)
  and/CC
  (PERSON Samagi/NNP Jana/NNP Balawegaya/NNP)
  (/(
  (ORGANIZATION SJB/NNP)
  )/)
  leader/NN
  (PERSON Sajith/NNP Premadasa/NNP)
  announced/VBD
  bringing/VBG
  a/DT
  no-confidence/JJ
  motion/NN
  against/IN
  the/DT
  incumbent/JJ
  govt/NN
  ./.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\divak\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
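
Printing the tree is useful for inspection, but downstream code usually wants a flat list of entities. One common approach is to work with IOB triples `(word, pos, iob)`, which NLTK can produce from a tree via `nltk.chunk.tree2conlltags`. The sketch below skips that call and uses triples hand-transcribed from the first part of the tree above; `group_entities` is a hypothetical helper, not part of NLTK.

```python
def group_entities(iob_triples):
    """Merge (word, pos, iob) triples into (label, text) entity pairs."""
    entities = []
    current_label, current_words = None, []
    for word, _pos, iob in iob_triples:
        starts_new = iob.startswith("B-") or (
            iob.startswith("I-") and iob[2:] != current_label
        )
        if starts_new:
            # A new entity begins: flush the one in progress, if any.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = iob[2:], [word]
        elif iob.startswith("I-"):
            # Continuation of the current entity.
            current_words.append(word)
        else:
            # "O": outside any entity, so close the current one.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities

# Hand-transcribed from the first part of the tree printed above.
triples = [
    ("Sri", "NNP", "B-PERSON"), ("Lanka", "NNP", "B-PERSON"),
    ("'s", "POS", "O"), ("leader", "NN", "O"), ("of", "IN", "O"),
    ("Opposition", "NNP", "B-ORGANIZATION"), ("and", "CC", "O"),
    ("Samagi", "NNP", "B-PERSON"), ("Jana", "NNP", "I-PERSON"),
    ("Balawegaya", "NNP", "I-PERSON"),
]
print(group_entities(triples))
```

The flat list also makes the chunker's mistakes easy to spot: here "Sri" and "Lanka" come out as two separate PERSON entities rather than one location.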
