# **Data Preprocessing**

Pre-processing is the process of cleaning and preparing text for classification. Online texts usually contain
a lot of noise and uninformative parts such as HTML tags, scripts, and advertisements. In addition, at the word level, many
words in a text have no impact on its general orientation.
Keeping those words increases the dimensionality of the problem and makes classification more difficult, since each
word in the text is treated as one dimension.

## **1. Text Cleaning / Normalization**

 - Text normalization reduces variations in word forms to a common form when the variations mean the same thing.
 - Before text data is used to train NLP models, it is pre-processed into a suitable form. Text normalization is often an essential step in text pre-processing: it simplifies the modelling process and can improve the model's performance.
 - Some common text normalization techniques are:

### **A. Removing White Spaces**

Text data often contains extra spaces between words, or before and after a sentence. To start with, we remove these extra spaces from each sentence using regular expressions.

In [57]:
# Collapse each run of whitespace into a single space using regex (a leading space remains)
import re
string = " My office is    located in Mumbai"
new_str = re.sub(r"\s+"," ", string)
new_str

' My office is located in Mumbai'

In [6]:
# Remove leading and trailing whitespace using strip
string = "   My office is located in Mumbai   "
string.strip()

'My office is located in Mumbai'
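
The two steps above can be combined into one helper. The following is a minimal sketch; the function name `normalize_whitespace` is an assumption made for illustration and is not part of any library.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace into one space, then trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("   My office is    located\tin Mumbai \n"))
# My office is located in Mumbai
```

Because `\s` also matches tabs and newlines, this flattens multi-line text into a single line, which may or may not be what a given task needs.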

### **B. Removing Punctuation and Special Characters**


For tasks such as classification, punctuation usually adds little value to the data. When punctuation is attached to a word, the word is treated as a different token: `London` and `London!!` would count as two distinct words.


In [12]:
import string
txt = "Hey All, I am going to London!!@@"
clean_txt = "".join([i for i in txt if i not in string.punctuation])
clean_txt

'Hey All I am going to London'
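
An alternative to the list comprehension is `str.translate` with a deletion table. This is a sketch of the same idea; the names `PUNCT_TABLE` and `remove_punctuation` are illustrative assumptions.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Delete ASCII punctuation characters from the text."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation("Hey All, I am going to London!!@@"))
# Hey All I am going to London
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes are left in place by either approach.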

### **C. Case Normalization**

Case normalization is essential so that models don't treat capitalized words (Hello, HELLO) as different from their lowercase counterparts (hello).


In [13]:
#Converting to lower case
txt = "Hey All, I am going to London for Bumble project!!@@"
txt.lower()

'hey all, i am going to london for bumble project!!@@'

In [14]:
#Converting to upper case
txt = "Hey All, I am going to London for Bumble project!!@@"
txt.upper()

'HEY ALL, I AM GOING TO LONDON FOR BUMBLE PROJECT!!@@'
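
To see why this matters for feature counts, the sketch below counts distinct tokens before and after lowercasing. The sample sentence is made up for illustration.

```python
from collections import Counter

tokens = "Hello hello HELLO world".split()

raw_counts = Counter(tokens)                           # case-sensitive counts
normalized_counts = Counter(t.lower() for t in tokens)  # counts after lowercasing

print(len(raw_counts))         # 4 distinct features
print(len(normalized_counts))  # 2 distinct features
print(normalized_counts["hello"])  # 3
```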

### **D. Removing Stopwords**

Stopwords are words such as I, he, she, and, but, was, were, being, and have, which add little meaning to the data. Removing them helps reduce the number of features in our data.


In [16]:
import nltk

In [17]:
from nltk.corpus import stopwords
nltk.download("stopwords")
stop = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
text = "Hey All I am going to London this summer"
text_new = " ".join([i for i in text.split() if i not in stop])
text_new

'Hey All I going London summer'
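
In the output above, `I` survives because the stop list is lowercase and the comparison is case-sensitive. One way to handle this is to compare the lowercased token while keeping the original casing in the output. The sketch below uses a small hand-picked stop list (an assumption standing in for NLTK's English list, which is much larger, so results with the real list may differ).

```python
# Small hand-picked stop list, used here only for illustration.
STOP_WORDS = {"i", "am", "to", "this"}

def remove_stopwords(text):
    """Drop tokens whose lowercased form is in the stop list."""
    return " ".join(tok for tok in text.split() if tok.lower() not in STOP_WORDS)

print(remove_stopwords("Hey All I am going to London this summer"))
# Hey All going London summer
```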

### **E. Lemmatization & Stemming**

**Stemming:** A technique that reduces a word to its root form by removing suffixes. The stemmed word might not be a dictionary word, i.e. it does not necessarily carry meaning on its own.

**Lemmatization:** Reduces a word to its dictionary form, called the lemma. NLTK's WordNet lemmatizer treats words as nouns by default. Lemmatization is more accurate than stemming because it relies on a vocabulary and the word's part of speech rather than on suffix rules alone, so it is more complex and takes more time. It is used where the meaning of the words needs to be retained.


In [20]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

In [21]:
porter = PorterStemmer()
print("Porter Stemmer Outputs")
print(porter.stem("programming"))
print(porter.stem("changing"))
print(porter.stem("troubling"))
print(porter.stem("mugged"))

Porter Stemmer Outputs
program
chang
troubl
mug


In [22]:
lancaster=LancasterStemmer()
print("Lancaster Stemmer Outputs")
print(lancaster.stem("programming"))
print(lancaster.stem("changing"))
print(lancaster.stem("troubling"))
print(lancaster.stem("mugged"))

Lancaster Stemmer Outputs
program
chang
troubl
mug


In [23]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [24]:
print(wordnet_lemmatizer.lemmatize("programming"))
print(wordnet_lemmatizer.lemmatize("changing"))
print(wordnet_lemmatizer.lemmatize("troubling"))
print(wordnet_lemmatizer.lemmatize("mugged"))

programming
changing
troubling
mugged


In the above output, no root form was produced for any word. This is because the words were given without context, and the lemmatizer treats every word as a noun unless told otherwise. You need to provide the context, that is, the part of speech (POS), through the `pos` parameter of `wordnet_lemmatizer.lemmatize`.

In [25]:
print(wordnet_lemmatizer.lemmatize("programming",pos='v'))
print(wordnet_lemmatizer.lemmatize("changing",pos='v'))
print(wordnet_lemmatizer.lemmatize("troubling",pos='v'))
print(wordnet_lemmatizer.lemmatize("London",pos='v'))

program
change
trouble
London
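
In practice the POS is rarely typed by hand. It usually comes from a tagger, and the tags then have to be mapped to the single-letter values that `lemmatize` accepts. The sketch below shows only that mapping step, assuming Penn Treebank style tags (the style `nltk.pos_tag` returns by default); `penn_to_wordnet` is a hypothetical helper, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's own default

for tag in ["VBG", "NNS", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter can then be passed as the `pos` argument for each token.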


#### **Stemming or Lemmatization?**

- Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.

- Stemming follows an algorithm with a fixed sequence of steps, which makes it faster. Lemmatization looks words up in the WordNet corpus to produce the lemma, which makes it slower than stemming. You also have to specify a part of speech to obtain the correct lemma.

So when should you use which? If speed is the priority, use stemming, since lemmatizers look words up in a corpus, which costs time and processing. If you are building an application in which the language itself matters, use lemmatization, as it uses a corpus to match root forms.


### **F. Spelling Correction**

In some applications such as Information Retrieval, it's useful to correct spelling errors. For example, 'infromation' is normalized to 'information'.


In [26]:
from textblob import TextBlob
text = "I am ging to London f0r an importnt projct"
textBlb = TextBlob(text)            # Making our first textblob
textCorrected = textBlb.correct()   # Correcting the text
textCorrected

TextBlob("I am going to London for an important project")
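
To illustrate the general idea of matching a misspelled word against known words, here is a minimal sketch using the standard library's `difflib`. This is not how TextBlob works internally; the vocabulary, the `correct_word` helper, and the cutoff value are assumptions chosen for the example.

```python
import difflib

# Tiny illustrative vocabulary; real systems use large word lists.
VOCABULARY = ["information", "important", "project", "going", "london"]

def correct_word(word, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("infromation"))  # information
print(correct_word("xyz"))          # xyz (no close match, left unchanged)
```

A higher cutoff makes the corrector more conservative, which reduces the risk of "correcting" rare but valid words.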

### **Clean Text Library for Text Cleaning**

In [27]:
!pip install clean-text



In [28]:
from cleantext import clean



In [37]:
clean("My email id is adi@gmail.com and the url for my website is https://quantiphi.com. My phone number is +91799331115 @@@. The parking in New York costs $50 per hour",
    fix_unicode=True,               # fix various unicode errors
    to_ascii=True,                  # transliterate to closest ASCII representation
    lower=True,                     # lowercase text
    no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
    no_urls=True,                  # replace all URLs with a special token
    no_emails=True,                # replace all email addresses with a special token
    no_phone_numbers=True,         # replace all phone numbers with a special token
    no_numbers=False,               # replace all numbers with a special token
    no_digits=True,                # replace all digits with a special token
    no_currency_symbols=True,      # replace all currency symbols with a special token
    no_punct=True,                 # remove punctuations
    replace_with_punct="",          # instead of removing punctuations you may replace them
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="",
    replace_with_currency_symbol="<CUR>",
    lang="en"                      
)

'my email id is <email> and the url for my website is <url> my phone number is +<phone> the parking in new york costs <cur> per hour'
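
The token replacement that `clean` performs can be approximated with plain regular expressions. The sketch below uses deliberately simple patterns; they are assumptions for illustration and not the rules clean-text itself applies.

```python
import re

# Simplified patterns for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def mask_entities(text):
    """Replace email addresses and URLs with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    return text

print(mask_entities("Mail adi@gmail.com or visit https://quantiphi.com today"))
# Mail <EMAIL> or visit <URL> today
```

One limitation of this sketch: a URL immediately followed by a period would swallow that period, since `\S+` matches any non-whitespace character.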

### **G. Tokenization**

 - Tokenization is usually one of the first steps in an NLP pipeline, and it affects everything downstream. A tokenizer breaks unstructured natural-language text into chunks that can be treated as discrete elements. The token occurrences in a document can be used directly as a vector representing that document.

 - Tokenization can separate sentences, words, characters, or subwords. When we split the text into sentences, we call it sentence tokenization. For words, we call it word tokenization.
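
As a sketch of how token occurrences become a vector, the example below builds a simple count vector per document over a shared vocabulary. The two sample documents and the `to_vector` helper are made up for illustration.

```python
from collections import Counter

docs = ["I work in Mumbai", "I work as an engineer in Mumbai Mumbai"]
tokenized = [doc.lower().split() for doc in docs]

# Fixed, sorted vocabulary so every document maps to a vector of the same length.
vocab = sorted({tok for doc in tokenized for tok in doc})

def to_vector(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

print(vocab)
for tokens in tokenized:
    print(to_vector(tokens))
```

Each position in a vector corresponds to one vocabulary word, which is why every extra distinct token adds a dimension.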



#### White Space Tokenization

In [40]:
sent = "I work as a Machine Learning Engineer"
sent.split()

['I', 'work', 'as', 'a', 'Machine', 'Learning', 'Engineer']

In the example below, we perform a simple sentence tokenization using the period as a separator.


In [42]:
sents = "I work as a Machine Learning Engineer. My office is in Mumbai"
sents.split(".")

['I work as a Machine Learning Engineer', ' My office is in Mumbai']
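
Splitting on `"."` drops the period and leaves a leading space on the second sentence. A slightly more careful sketch uses a regular expression that splits on the whitespace following sentence-ending punctuation. The `split_sentences` name is an assumption, and the rule is still naive.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("I work as a Machine Learning Engineer. My office is in Mumbai."))
# ['I work as a Machine Learning Engineer.', 'My office is in Mumbai.']
```

This rule still breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained sentence tokenizer for real text.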

#### NLTK based tokenizers

In [43]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize, TweetTokenizer, MWETokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [50]:
text = "I work as a Machine Learning Engineer.  My office is in Mumbai."

Word and Sentence tokenizer

In [51]:
print(word_tokenize(text))

['I', 'work', 'as', 'a', 'Machine', 'Learning', 'Engineer', '.', 'My', 'office', 'is', 'in', 'Mumbai', '.']


In [52]:
print(sent_tokenize(text))

['I work as a Machine Learning Engineer.', 'My office is in Mumbai.']


#### Punctuation-based tokenizer
This tokenizer splits text into words on whitespace and punctuation.

In [53]:
print(wordpunct_tokenize(text))

['I', 'work', 'as', 'a', 'Machine', 'Learning', 'Engineer', '.', 'My', 'office', 'is', 'in', 'Mumbai', '.']


#### Tweet tokenizer
When we tokenize text such as tweets, the tokenizers mentioned above often fail to produce useful tokens. To address this, NLTK provides a rule-based tokenizer designed for tweets. It can split emojis into separate tokens, which is useful for tasks like sentiment analysis.

In [54]:
text = "Hello😂😂🙌😆"
tok = TweetTokenizer()
print(tok.tokenize(text))

['Hello', '😂', '😂', '🙌', '😆']


#### MWE tokenizer

NLTK's multi-word expression tokenizer (MWETokenizer) provides a method `add_mwe()` that lets the user register multi-word expressions before using the tokenizer on the text. Put simply, it merges multi-word expressions into single tokens.

In [55]:
text = "I want to go to New York"
tok = MWETokenizer()
print(tok.tokenize(word_tokenize(text)))

['I', 'want', 'to', 'go', 'to', 'New', 'York']


In [56]:
text = "I want to go to New York"
tok = MWETokenizer()
tok.add_mwe(("New","York"))
print(tok.tokenize(word_tokenize(text)))

['I', 'want', 'to', 'go', 'to', 'New_York']
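
To make the merging step concrete, here is a small pure-Python sketch of the idea. It is an illustration of the technique only, not NLTK's implementation; the `merge_mwes` function and its parameters are assumptions.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Greedily merge known multi-word expressions into single tokens."""
    # Try longer expressions first so a three-word match beats its two-word prefix.
    ordered = sorted((e for e in expressions if e), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == tuple(expr):
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position: keep the token as it is.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "I want to go to New York City".split()
print(merge_mwes(tokens, [("New", "York"), ("New", "York", "City")]))
# ['I', 'want', 'to', 'go', 'to', 'New_York_City']
```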
