**There are two main parts to Natural Language Processing**:

- Data Preprocessing
- Algorithm Development

This is the first step, where we get started with NLP and cover the basic concepts and tools needed to work with text data. We focus mainly on the first and most crucial part of Natural Language Processing: text preprocessing.

The dataset used for this step can be found here: https://www.kaggle.com/datasets/datatattle/email-classification-nlp

In [1]:
import pandas as pd
df = pd.read_csv('SMS_test.csv',encoding='cp1252')
df.head()

Unnamed: 0,S. No.,Message_body,Label
0,1,"UpgrdCentre Orange customer, you may now claim...",Spam
1,2,"Loan for any purpose £500 - £75,000. Homeowner...",Spam
2,3,Congrats! Nokia 3650 video camera phone is you...,Spam
3,4,URGENT! Your Mobile number has been awarded wi...,Spam
4,5,Someone has contacted our dating service and e...,Spam


# Text Preprocessing For NLP

Data preprocessing is an essential step for any machine learning model: how well the raw data has been cleaned and preprocessed plays a major role in the model's performance. NLP is no exception, and its very first step is text preprocessing.

The preprocessing steps covered here are:

- Lower Casing
- Tokenization
- Punctuation Mark Removal
- Stop Word Removal
- Stemming
- Lemmatization

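### Text Pre-processing Using Lower Casing

Lower casing converts every character to lowercase so that variants such as “FREE” and “free” are treated as the same word.
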
In [2]:
sentence = df.iloc[0]['Message_body']
print("Before lower casing: "+ sentence)
sentence = sentence.lower()
print("After lower casing: "+ sentence)

Before lower casing: UpgrdCentre Orange customer, you may now claim your FREE CAMERA PHONE upgrade for your loyalty. Call now on 0207 153 9153. Offer ends 26th July. T&C's apply. Opt-out available
After lower casing: upgrdcentre orange customer, you may now claim your free camera phone upgrade for your loyalty. call now on 0207 153 9153. offer ends 26th july. t&c's apply. opt-out available
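
To see why lower casing matters, the small sketch below counts words before and after lower casing. The sample sentence is made up for illustration and is not taken from the dataset; splitting on whitespace is only a stand-in for proper tokenization.

```python
from collections import Counter

# Made-up sample text (an assumption, not a row from the dataset)
text = "Call now for your FREE phone. Free upgrade, free delivery."

# Count whitespace-separated words as they are, then after lower casing
raw_counts = Counter(text.split())
lower_counts = Counter(text.lower().split())

print("Count of 'free' before lower casing:", raw_counts["free"])
print("Count of 'free' after lower casing:", lower_counts["free"])
```

Before lower casing, “FREE”, “Free” and “free” are counted as three different words; afterwards they collapse into one.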


### Understand Tokenization In Text Pre-processing
 
**Tokenization** is the process of breaking up the paragraph into smaller units such as sentences or words. Each unit is then considered as an individual token.

The fundamental principle of Tokenization is to try to understand the meaning of the text by analyzing the smaller units or tokens that constitute the paragraph.

**NLTK** (the Natural Language Toolkit) is a Python library widely used for text preprocessing.

In [3]:
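import nltk
# Newer NLTK releases look for the 'punkt_tab' resource instead; if the
# tokenizers below raise a LookupError, run nltk.download('punkt_tab') too.
nltk.download('punkt')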

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### Sentence Tokenize

Now we take a paragraph as input and tokenize it into its constituent sentences. The result is a list, stored in the variable ‘sentences’, that contains each sentence of the paragraph. The length of the list gives the total number of sentences.

In [4]:
paragraph=df.iloc[14]['Message_body']
print("Paragraph: "+paragraph)
# Tokenize Sentences
sentences = nltk.sent_tokenize(paragraph.lower())
print("Paragraph breakdown into sentences : "+ str(sentences))
print ("No. of Sentences: "+str(len(sentences)))

Paragraph: Ur cash-balance is currently 500 pounds - to maximize ur cash-in now send CASH to 86688 only 150p/msg. CC: 08708800282 HG/Suite342/2Lands Row/W1J6HL
Paragraph breakdown into sentences : ['ur cash-balance is currently 500 pounds - to maximize ur cash-in now send cash to 86688 only 150p/msg.', 'cc: 08708800282 hg/suite342/2lands row/w1j6hl']
No. of Sentences: 2


#### Word Tokenize

Similarly, we can also tokenize the paragraph into words. The result is a list called ‘words’, containing each word of the paragraph. The length of the list gives us the total number of words present in our paragraph.

In [5]:
# Tokenize Words
paragraph=df.iloc[14]['Message_body']
words = nltk.word_tokenize(paragraph.lower())
print(words)
print("No. of Words: "+str(len(words)))

['ur', 'cash-balance', 'is', 'currently', '500', 'pounds', '-', 'to', 'maximize', 'ur', 'cash-in', 'now', 'send', 'cash', 'to', '86688', 'only', '150p/msg', '.', 'cc', ':', '08708800282', 'hg/suite342/2lands', 'row/w1j6hl']
No. of Words: 24
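
To get a feel for what a word tokenizer does, here is a rough regular-expression sketch. It is not NLTK's algorithm (`word_tokenize` is considerably more elaborate); the pattern and the `simple_tokenize` name are assumptions made for illustration, and they will differ from NLTK on other inputs such as contractions.

```python
import re

# Assumed pattern: a run of letters/digits, optionally joined by '-', '/' or "'",
# or else any single character that is neither whitespace nor a letter/digit.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-/'][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def simple_tokenize(text):
    """Split lower-cased text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

paragraph = ("Ur cash-balance is currently 500 pounds - to maximize ur cash-in "
             "now send CASH to 86688 only 150p/msg. CC: 08708800282 "
             "HG/Suite342/2Lands Row/W1J6HL")
tokens = simple_tokenize(paragraph)
print(tokens)
print("No. of Words:", len(tokens))
```

On this particular message the sketch produces the same 24 tokens as the NLTK output above, but that agreement should not be expected in general.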


**Note: The punctuation marks have also been treated as separate tokens.**

#### Punctuation Mark Removal

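This brings us to the next step: removing the punctuation marks from our list of words.

We do this by keeping only the purely alphanumeric tokens, i.e. excluding any token that contains a non-alphanumeric character. Note that this also drops tokens with internal punctuation, such as 'cash-balance' and '150p/msg'.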

In [6]:
new_words= [word for word in words if word.isalnum()]
print(new_words)

['ur', 'is', 'currently', '500', 'pounds', 'to', 'maximize', 'ur', 'now', 'send', 'cash', 'to', '86688', 'only', 'cc', '08708800282']
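
The `isalnum()` filter is strict: it removed not only '-', '.' and ':' but also words like 'cash-balance'. If you would rather keep such words, one possible alternative (a sketch, not the approach used in the rest of this notebook) is to drop only the tokens that contain no letter or digit at all. The token list is copied from the word tokenization output above.

```python
# Tokens copied from the word_tokenize output shown earlier
words = ['ur', 'cash-balance', 'is', 'currently', '500', 'pounds', '-', 'to',
         'maximize', 'ur', 'cash-in', 'now', 'send', 'cash', 'to', '86688',
         'only', '150p/msg', '.', 'cc', ':', '08708800282',
         'hg/suite342/2lands', 'row/w1j6hl']

# Strict filter: keep tokens made up only of letters and digits
alnum_only = [w for w in words if w.isalnum()]

# Lenient filter: keep tokens that contain at least one letter or digit
not_pure_punct = [w for w in words if any(ch.isalnum() for ch in w)]

print(len(words), len(alnum_only), len(not_pure_punct))
print([w for w in not_pure_punct if w not in alnum_only])
```

Which filter is appropriate depends on whether tokens such as 'cash-in' carry useful signal for your task.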


### Stop Word Removal

**Stop words** are words that occur very frequently in a language but add little meaning to a sentence. They are mostly common grammatical words, and every language has its own set. For example, some of the English stop words are “the”, “he”, “him”, “his”, “her”, “herself”, etc.

In [7]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

These are all the English stop words.

You can also get the stop words of other languages by simply changing the parameter. Have some fun and try passing “spanish” or “french” (lowercase, like “english”) as the parameter!

Since these stop words do not add much value to the overall meaning of the sentence, we can easily remove these words from our text data. This helps in **dimensionality reduction** by eliminating unnecessary information.

In [9]:
# Build the stop word set once instead of on every loop iteration
stop_words = set(stopwords.words("english"))
WordSet = []
for word in new_words:
    if word not in stop_words:
        WordSet.append(word)
print(WordSet)

['ur', 'currently', '500', 'pounds', 'maximize', 'ur', 'send', 'cash', '86688', 'cc', '08708800282']


In [10]:
print(len(WordSet))

11
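
To make the dimensionality reduction concrete, the sketch below compares the token count and the vocabulary size (number of distinct tokens) before and after stop word removal. To stay self-contained it does not load NLTK; the `stop_words` set is an assumption that contains only the four English stop words that actually occur in this message.

```python
# Tokens left after punctuation removal (copied from the output above)
new_words = ['ur', 'is', 'currently', '500', 'pounds', 'to', 'maximize', 'ur',
             'now', 'send', 'cash', 'to', '86688', 'only', 'cc', '08708800282']

# Assumption: just the stop words present in this message, not the full NLTK list
stop_words = {"is", "to", "now", "only"}

filtered = [w for w in new_words if w not in stop_words]

print("Tokens:", len(new_words), "->", len(filtered))
print("Vocabulary:", len(set(new_words)), "->", len(set(filtered)))
```

Here the token count drops from 16 to 11 and the vocabulary from 14 to 10 distinct tokens, so a bag-of-words representation of this message would need four fewer features.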


### Stemming

**Stemming** is the process of reducing a word to its root, or stem. Affixes are stripped off by rule, leaving behind only the stem, which is not necessarily a dictionary word.

For example: The words “connecting”, “connect”, “connection”, “connects” are all reduced to the root form “connect”. The words “studying”, “studies”, “study” are all reduced to “studi”.
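
The idea of rule-based suffix stripping can be illustrated with a deliberately naive sketch. This is not the Porter algorithm, which applies many more rules in several phases; the `SUFFIXES` tuple, the `naive_stem` function and the minimum stem length are all assumptions chosen to keep the example small.

```python
# Hypothetical, deliberately tiny rule set; the real Porter stemmer has many more rules
SUFFIXES = ("ing", "ion", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

family = ["connecting", "connect", "connection", "connects"]
stems = [naive_stem(w) for w in family]
print(stems)

# Blind suffix stripping can also produce non-words
print(naive_stem("analysis"))
```

All four words in `family` reduce to “connect”, while “analysis” becomes “analysi”, which shows how a stemmer can return strings that are not real words.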


In [11]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [12]:
paragraph="Linguistics is the scientific study of language. It encompasses the analysis of every aspect of language, as well as the methods for studying and modeling them. The traditional areas of linguistic analysis include phonetics, phonology, morphology, syntax, semantics, and pragmatics."
words = nltk.word_tokenize(paragraph.lower())
new_words= [word for word in words if word.isalnum()]
stop_words = set(stopwords.words("english"))  # build the set once, not per word
WordSet = []
for word in new_words:
    if word not in stop_words:
        WordSet.append(word)

print("Before Stemming: "+ str(WordSet))

WordSetStem = []
for word in WordSet:
   WordSetStem.append(ps.stem(word))
print("After Stemming: "+ str(WordSetStem))

Before Stemming: ['linguistics', 'scientific', 'study', 'language', 'encompasses', 'analysis', 'every', 'aspect', 'language', 'well', 'methods', 'studying', 'modeling', 'traditional', 'areas', 'linguistic', 'analysis', 'include', 'phonetics', 'phonology', 'morphology', 'syntax', 'semantics', 'pragmatics']
After Stemming: ['linguist', 'scientif', 'studi', 'languag', 'encompass', 'analysi', 'everi', 'aspect', 'languag', 'well', 'method', 'studi', 'model', 'tradit', 'area', 'linguist', 'analysi', 'includ', 'phonet', 'phonolog', 'morpholog', 'syntax', 'semant', 'pragmat']



**Note**: The word list obtained after performing stemming does not always contain words that are a part of the English vocabulary. In our example, words such as “scientif”, “studi”, and “everi” are not proper words, i.e. they do not make sense to us.

### Lemmatization
We have just seen how to reduce words to their root form using stemming.

However, stemming does not always produce words that are part of the language's vocabulary; it often yields strings that mean nothing to the reader. To overcome this drawback, we use lemmatization, which maps each word to its dictionary form (its lemma) by looking it up in a vocabulary, here WordNet.

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
lm = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [14]:
WordSetLem = []
for word in WordSet:
   WordSetLem.append(lm.lemmatize(word))
print(WordSetLem)

['linguistics', 'scientific', 'study', 'language', 'encompasses', 'analysis', 'every', 'aspect', 'language', 'well', 'method', 'studying', 'modeling', 'traditional', 'area', 'linguistic', 'analysis', 'include', 'phonetics', 'phonology', 'morphology', 'syntax', 'semantics', 'pragmatic']
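
In the output above, verb forms such as 'studying' are left unchanged. NLTK's `lemmatize` takes a part-of-speech argument `pos` that defaults to noun, so every word here was looked up as a noun. The toy sketch below only mimics that behaviour with a hypothetical lookup table (`TOY_LEMMAS` and `toy_lemmatize` are made-up names, and no WordNet data is used).

```python
# Hypothetical mini-dictionary keyed by (word, part of speech); not WordNet data
TOY_LEMMAS = {
    ("methods", "n"): "method",
    ("areas", "n"): "area",
    ("studying", "v"): "study",
    ("encompasses", "v"): "encompass",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if (word, pos) is known, else the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

as_noun = toy_lemmatize("studying")           # looked up as a noun: no entry
as_verb = toy_lemmatize("studying", pos="v")  # looked up as a verb: entry found
print(as_noun, as_verb)
```

With the real lemmatizer, the corresponding call is `lm.lemmatize(word, pos='v')`; in practice the part of speech for each word would come from a POS tagger.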
