### Text Preprocessing in Python - I


----

In [1]:
# import the necessary libraries
import nltk
import string
import re

### Text Lowercase:
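We lowercase the text so that differently cased forms of a word (e.g. "The" and "the") are counted as the same token, which reduces the size of the vocabulary of our text data.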

In [6]:
#Text
input_str = "The Additional Directorate General of Public Information (ADGPI) under the Ministry of Defence announced through its official Twitter handle and said,General Manoj Pande, PVSM, AVSM, VSM, ADC takes over as the 29th COAS of Indian Army from General MM Naravane.General Pande was previously the army's vice chief, a charge he took in February when he replaced Lieutenant General CP Mohanty. Gen Pande was the head of the Eastern Army Command before becoming Vice Chief of the Army and was responsible for guarding the Line of Actual Control (LAC) in the Sikkim and Arunachal Pradesh sectors."
input_str

"The Additional Directorate General of Public Information (ADGPI) under the Ministry of Defence announced through its official Twitter handle and said,General Manoj Pande, PVSM, AVSM, VSM, ADC takes over as the 29th COAS of Indian Army from General MM Naravane.General Pande was previously the army's vice chief, a charge he took in February when he replaced Lieutenant General CP Mohanty. Gen Pande was the head of the Eastern Army Command before becoming Vice Chief of the Army and was responsible for guarding the Line of Actual Control (LAC) in the Sikkim and Arunachal Pradesh sectors."

In [8]:
def text_lowercase(text):
    return text.lower()

text_lowercase(input_str)

"the additional directorate general of public information (adgpi) under the ministry of defence announced through its official twitter handle and said,general manoj pande, pvsm, avsm, vsm, adc takes over as the 29th coas of indian army from general mm naravane.general pande was previously the army's vice chief, a charge he took in february when he replaced lieutenant general cp mohanty. gen pande was the head of the eastern army command before becoming vice chief of the army and was responsible for guarding the line of actual control (lac) in the sikkim and arunachal pradesh sectors."
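For English text `str.lower()` is usually enough. As a side note, Python also provides `str.casefold()`, a more aggressive variant meant for caseless matching. The sketch below shows the difference on a made-up string (the helper name `text_casefold` and the sample are illustrative, not part of the original notebook).

```python
# str.casefold() is a more aggressive variant of str.lower(),
# intended for caseless comparison of text.
def text_casefold(text):
    return text.casefold()

sample = "Straße STRASSE"        # illustrative string, not from the dataset
lowered = sample.lower()         # the German sharp s is kept as-is
folded = text_casefold(sample)   # the sharp s is folded to "ss"

print(lowered)
print(folded)
```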

### Remove numbers:
We can either remove numbers or convert the numbers into their textual representations. 

We can use regular expressions to remove the numbers. 

In [7]:
input_str = "The women said 600 of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in 2014 by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics"
input_str

'The women said 600 of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in 2014 by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics'

In [9]:
# Remove numbers
def remove_numbers(text):
    result = re.sub(r'\d+', '', text)
    return result
 
remove_numbers(input_str)

'The women said  of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in  by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics'
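Note that the regex above deletes digits wherever they occur: it leaves a double space where "600" used to be, and a token such as "29th" would be reduced to "th". A possible variant, assuming you only want to drop standalone numbers and tidy up the leftover spaces, is sketched below (the function name `remove_standalone_numbers` and the sample sentence are illustrative).

```python
import re

def remove_standalone_numbers(text):
    # \b\d+\b only matches digit runs that form a whole token,
    # so "29th" is kept while "600" and "2014" are removed
    without_numbers = re.sub(r'\b\d+\b', '', text)
    # collapse the gaps left behind by the removed numbers
    return re.sub(r'\s+', ' ', without_numbers).strip()

sample = "The women said 600 of the soldiers are wounded; the 29th COAS took over in 2014."
result = remove_standalone_numbers(sample)
print(result)
```

Whether tokens like "29th" should survive depends on the task, so treat the word-boundary rule as one option rather than the correct one.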

We can also convert the numbers into words. This can be done by using the inflect library.

In [12]:
pip install inflect


In [13]:
input_str = "The women said 600 of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in 2014 by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics"
input_str

'The women said 600 of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in 2014 by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics'

In [14]:
# import the inflect library
import inflect
p = inflect.engine()
 
# convert number into words
def convert_number(text):
    # split string into list of words
    temp_str = text.split()
    # initialise empty list
    new_string = []
 
    for word in temp_str:
        # if the word consists only of digits, convert the
        # number to words and append it to the new_string list
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
 
        # append the word as it is
        else:
            new_string.append(word)
 
    # join the words of new_string to form a string
    temp_str = ' '.join(new_string)
    return temp_str
 

convert_number(input_str)

'The women said six hundred of the soldiers are wounded with some suffering from gangrene. They provided grisly videos and photos sent by their husbands of men with amputated limbs, bullet wounds and other injuries. They said people are eating porridge, old cheese and rudimentary bread.The Azov Regiment has its roots in the Azov Battalion which was formed in two thousand and fourteen by far-right activists at the start of the conflict in the east between Ukraine and Moscow-backed separatists and which has elicited criticism for its tactics'
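Because the function above splits on whitespace and checks `word.isdigit()`, a number with punctuation attached (for example "600," or "2014.") is left unconverted. One way around this is a regex substitution with a callback, sketched below. The names `convert_numbers`, `demo_words` and `demo_to_words` are hypothetical, and the tiny lookup table is only a stand-in so the sketch runs without extra packages. With inflect installed, `p.number_to_words` from the cell above could be passed instead.

```python
import re

def convert_numbers(text, to_words):
    # Replace every standalone run of digits with to_words(digits);
    # punctuation attached to the number (as in "5," or "600.") stays in place.
    return re.sub(r'\b\d+\b', lambda m: to_words(m.group()), text)

# Stand-in converter (hypothetical) so the example is self-contained.
demo_words = {"5": "five", "600": "six hundred"}

def demo_to_words(digits):
    # fall back to the original digits when the number is not in the table
    return demo_words.get(digits, digits)

result = convert_numbers("Only 5, not 600.", demo_to_words)
print(result)  # Only five, not six hundred.
```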

### Remove punctuation:
We remove punctuation so that we don’t end up with different forms of the same word. If we don’t remove the punctuation, then "been.", "been," and "been!" will be treated as separate words.

In [16]:
input_str= "Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"
input_str

"Hey, did you know that the summer break is coming? Amazing right !! It's only 5 more days !!"

In [17]:
# remove punctuation
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)
 
remove_punctuation(input_str)

'Hey did you know that the summer break is coming Amazing right  Its only 5 more days '
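Deleting punctuation outright glues together words that were separated only by a punctuation mark: "said,General" in the first sample text would become "saidGeneral". A possible alternative, sketched here under the assumption that a space is an acceptable substitute, maps each punctuation character to a space and then collapses the extra spaces (the name `replace_punctuation` is illustrative).

```python
import string

# map every punctuation character to a space instead of deleting it
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def replace_punctuation(text):
    spaced = text.translate(_PUNCT_TO_SPACE)
    # collapse the extra spaces introduced by the replacement
    return ' '.join(spaced.split())

result = replace_punctuation("handle and said,General Pande took over.It's official !!")
print(result)  # handle and said General Pande took over It s official
```

The trade-off is visible in the output: contractions such as "It's" are split into "It" and "s", whereas the deleting version produces "Its".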

### Remove whitespaces:
We can use the split and join functions to collapse every run of whitespace into a single space and to drop leading and trailing whitespace.

In [18]:
input_str= "Two Ukrainian women whose husbands are defending a besieged steel plant in the southern city of Mariupol are calling for any evacuation of civilians to also include soldiers, saying they fear the troops will be tortured and killed              if left behind and captured by Russian forces."
input_str

'Two Ukrainian women whose husbands are defending a besieged steel plant in the southern city of Mariupol are calling for any evacuation of civilians to also include soldiers, saying they fear the troops will be tortured and killed              if left behind and captured by Russian forces.'

In [19]:
# remove whitespace from text
def remove_whitespace(text):
    return " ".join(text.split())
 
remove_whitespace(input_str)

'Two Ukrainian women whose husbands are defending a besieged steel plant in the southern city of Mariupol are calling for any evacuation of civilians to also include soldiers, saying they fear the troops will be tortured and killed if left behind and captured by Russian forces.'
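Calling `split()` without arguments splits on any whitespace, so tabs and newlines are handled as well as spaces. The sketch below checks this on a made-up string and shows a regex formulation that should behave the same for ordinary spaces, tabs and newlines (the name `remove_whitespace_re` is illustrative).

```python
import re

def remove_whitespace(text):
    # split() with no arguments splits on runs of any whitespace
    return " ".join(text.split())

def remove_whitespace_re(text):
    # regex formulation: collapse whitespace runs, then trim both ends
    return re.sub(r'\s+', ' ', text).strip()

sample = "  tortured and killed \t\n   if left behind  "
cleaned = remove_whitespace(sample)
cleaned_re = remove_whitespace_re(sample)
print(repr(cleaned))
print(repr(cleaned_re))
```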

### Remove default stopwords:
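Stopwords are very common words (such as "the", "is" and "and") that carry little meaning on their own. For many tasks they can be removed without losing much of the meaning of a sentence. The NLTK library ships a list of English stopwords, and we can use it to remove stopwords from our text and return a list of word tokens. The stopword list and the tokenizer models have to be downloaded once, e.g. with `nltk.download('stopwords')` and `nltk.download('punkt')` (`punkt_tab` in recent NLTK releases).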


In [20]:
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
example_text

'This is a sample sentence and we are going to remove the stopwords from this.'

In [21]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
# remove stopwords function
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text
 
example_text = "This is a sample sentence and we are going to remove the stopwords from this."
remove_stopwords(example_text)

['This', 'sample', 'sentence', 'going', 'remove', 'stopwords', '.']
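The output above still contains 'This': the NLTK stopword list is lowercase and the membership test is case-sensitive, so only the lowercase "this" is filtered out. A minimal sketch of a case-insensitive filter follows. To keep it runnable without NLTK data it uses a small hand-written stopword set and a pre-tokenized list, both of which are assumptions for illustration. In the notebook you would pass `set(stopwords.words("english"))` and the result of `word_tokenize`.

```python
# Small stand-in list so the sketch runs without NLTK data (assumption);
# in the notebook this would be set(stopwords.words("english")).
STOP_WORDS = {"this", "is", "a", "and", "we", "are", "to", "the", "from"}

def remove_stopwords_ci(tokens, stop_words=STOP_WORDS):
    # compare the lowercased token, but keep the original casing in the output
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["This", "is", "a", "sample", "sentence", "and", "we", "are", "going",
          "to", "remove", "the", "stopwords", "from", "this", "."]
filtered = remove_stopwords_ci(tokens)
print(filtered)  # ['sample', 'sentence', 'going', 'remove', 'stopwords', '.']
```

Alternatively, lowercasing the whole text first (as in the Text Lowercase section) has the same effect on the filter.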

### Stemming:
Stemming is the process of reducing a word to its stem. The stem is the part of a word to which affixes (-ed, -ize, -s, de-, etc.) are attached. A stemmer obtains it by stripping affixes with simple rules, so the result of stemming may not be an actual word.

books      --->    book

looked     --->    look

denied     --->    deni

flies      --->    fli
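To make the idea of rule-based affix stripping concrete, here is a toy stemmer that reproduces the four examples above with a handful of suffix rules. It is a deliberately naive sketch, not the Porter algorithm, and the names `SUFFIX_RULES` and `toy_stem` are made up for this illustration.

```python
# Toy suffix-stripping rules, checked in order; NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # require at least two characters to remain so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:len(word) - len(suffix)] + replacement
    return word

stems = [toy_stem(w) for w in ["books", "looked", "denied", "flies", "seeking"]]
print(stems)  # ['book', 'look', 'deni', 'fli', 'seek']
```

Real stemmers apply many more rules and conditions. The point here is only that the output is produced by string manipulation, which is why non-words such as "deni" and "fli" appear.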

If the text is not yet tokenized, we first need to split it into tokens; the word tokens can then be reduced to their stems. Three stemming algorithms are commonly used: the Porter Stemmer, the Snowball Stemmer and the Lancaster Stemmer. The Porter Stemmer is the most widely used of the three.

In [22]:
text="Ukrainian forces fought Saturday to hold off a Russian advance in their country's south and east, where the Kremlin is seeking to capture the industrial Donbas region and Western military analysts said Moscow's offensive was going much slower than planned."
text

"Ukrainian forces fought Saturday to hold off a Russian advance in their country's south and east, where the Kremlin is seeking to capture the industrial Donbas region and Western military analysts said Moscow's offensive was going much slower than planned."

In [23]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
 
# stem words in the list of tokenized words
def stem_words(text):
    word_tokens = word_tokenize(text)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems
stem_words(text)

['ukrainian',
 'forc',
 'fought',
 'saturday',
 'to',
 'hold',
 'off',
 'a',
 'russian',
 'advanc',
 'in',
 'their',
 'countri',
 "'s",
 'south',
 'and',
 'east',
 ',',
 'where',
 'the',
 'kremlin',
 'is',
 'seek',
 'to',
 'captur',
 'the',
 'industri',
 'donba',
 'region',
 'and',
 'western',
 'militari',
 'analyst',
 'said',
 'moscow',
 "'s",
 'offens',
 'wa',
 'go',
 'much',
 'slower',
 'than',
 'plan',
 '.']

### Lemmatization:
Like stemming, lemmatization reduces a word to a base form. The difference is that lemmatization returns the dictionary form of the word (the lemma), so the result is a valid word of the language. In NLTK, we use the WordNetLemmatizer to get the lemmas of words. The lemmatizer needs the part of speech of each word as context and treats a word as a noun by default, so we pass the part of speech as a parameter. Here every token is lemmatized as a verb (pos='v').

In [24]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
# lemmatize string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # provide context i.e. part-of-speech
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas
 

lemmatize_word(text)

['Ukrainian',
 'force',
 'fight',
 'Saturday',
 'to',
 'hold',
 'off',
 'a',
 'Russian',
 'advance',
 'in',
 'their',
 'country',
 "'s",
 'south',
 'and',
 'east',
 ',',
 'where',
 'the',
 'Kremlin',
 'be',
 'seek',
 'to',
 'capture',
 'the',
 'industrial',
 'Donbas',
 'region',
 'and',
 'Western',
 'military',
 'analysts',
 'say',
 'Moscow',
 "'s",
 'offensive',
 'be',
 'go',
 'much',
 'slower',
 'than',
 'plan',
 '.']
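In the output above 'analysts' is still plural because every token was lemmatized as a verb. A common refinement is to tag each token with its part of speech first and pass the matching WordNet code to the lemmatizer. The helper below is a sketch of the mapping step only; the name `penn_to_wordnet` is illustrative, and it assumes Penn Treebank tags such as those returned by `nltk.pos_tag`.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the single-letter part-of-speech
    # codes accepted by WordNetLemmatizer.lemmatize(word, pos=...).
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # everything else is treated as a noun (the lemmatizer's default)

mapped = [penn_to_wordnet(t) for t in ["NNS", "VBD", "JJR", "RB", "DT"]]
print(mapped)  # ['n', 'v', 'a', 'r', 'n']
```

With NLTK's tagger data downloaded, this could be combined with the cell above as `[lemmatizer.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in nltk.pos_tag(word_tokens)]`.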