## Introduction 
In any machine learning task, cleaning and preprocessing the data is at least as important as building the model. For unstructured data such as text, it matters even more.

The objective of this kernel is to walk through the common text preprocessing steps with code examples.

Some of the common text preprocessing / cleaning steps are:

- Lower casing
- Removal of Punctuations
- Removal of Stopwords
- Removal of Frequent words
- Removal of Rare words
- Stemming
- Lemmatization
- Removal of emojis
- Removal of emoticons
- Removal of URLs
- Removal of HTML tags
- Chat words conversion
- Spelling correction

These are the common preprocessing steps for text data, but we do not need to apply all of them every time. The right steps depend on the use case, so we need to choose them carefully.
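
Since the choice of steps depends on the use case, one convenient way to organize them is as a list of functions applied in order. The following is a minimal sketch of that idea; the names `to_lower`, `strip_punctuation` and `preprocess` are hypothetical helpers introduced only for this illustration.

```python
import string

def to_lower(text):
    """Convert the text to lower case."""
    return text.lower()

def strip_punctuation(text):
    """Delete every character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def preprocess(text, steps):
    """Apply each preprocessing step to the text, in the given order."""
    for step in steps:
        text = step(text)
    return text

# One possible selection of steps; another task might keep punctuation
# or skip lower casing altogether.
chosen_steps = [to_lower, strip_punctuation]
print(preprocess("Hurray! Alexa WORKS.", chosen_steps))  # hurray alexa works
```

Adding or dropping a step then only means editing the `chosen_steps` list.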

In [254]:
# first import some necessary libraries
import numpy as np
import pandas as pd
import re
import nltk
import string
pd.options.mode.chained_assignment = None

full_df = pd.read_csv("Amazon_Alexa_Reviews.csv")
full_df.head()

Unnamed: 0.1,Unnamed: 0,Rating,Title,Review,Review_Date
0,0,4,4.0 out of 5 stars\nAwesome,I liked this gen very much. The only disadvant...,30-Aug-23
1,1,3,3.0 out of 5 stars\nVery Very Average. Very di...,This latest version of Echo Dot may be a bette...,15-Aug-23
2,2,4,4.0 out of 5 stars\ni was not be able to reset...,it could not be possible to reset it at my ne...,21-Aug-23
3,3,4,4.0 out of 5 stars\nHangs a lot,Nothing great then previous ones .,25-Aug-23
4,4,4,4.0 out of 5 stars\nAwesome,AwesomeBut some time can not listenOver all good,14-Aug-23


In [255]:
# Let's get the shape of the data
full_df.shape

(90, 5)

There are 90 rows and 5 columns in this dataset.

In [256]:
# information on data
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   90 non-null     int64 
 1   Rating       90 non-null     int64 
 2   Title        90 non-null     object
 3   Review       89 non-null     object
 4   Review_Date  90 non-null     object
dtypes: int64(2), object(3)
memory usage: 3.6+ KB


In [257]:
columns = full_df.columns
print(columns)

Index(['Unnamed: 0', 'Rating', 'Title', 'Review', 'Review_Date'], dtype='object')


##### Here we don't need the columns 'Unnamed: 0' and 'Title' for any processing.

In [258]:
columns_to_drop = ['Unnamed: 0', 'Title',]
full_df.drop(columns_to_drop,axis=1, inplace=True)
full_df.head()

Unnamed: 0,Rating,Review,Review_Date
0,4,I liked this gen very much. The only disadvant...,30-Aug-23
1,3,This latest version of Echo Dot may be a bette...,15-Aug-23
2,4,it could not be possible to reset it at my ne...,21-Aug-23
3,4,Nothing great then previous ones .,25-Aug-23
4,4,AwesomeBut some time can not listenOver all good,14-Aug-23


#### We have covered data wrangling before, so let's first do some cleaning.
- Changing the format of the date

In [259]:
# full_df['Review_Date'] = pd.to_datetime(full_df['Review_Date'])
# full_df['Review_Date'] = full_df['Review_Date'].dt.strftime('%Y-%m-%d')
# full_df.head(2)
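
To show what the date conversion does to a single value, here is a small sketch using only the standard library. It assumes the dates look like `30-Aug-23` (day, abbreviated English month name, two-digit year), as in the sample rows above; `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime

def to_iso_date(raw):
    """Convert a date such as '30-Aug-23' to '2023-08-30'."""
    # %d = day of month, %b = abbreviated month name, %y = two-digit year
    parsed = datetime.strptime(raw, "%d-%b-%y")
    return parsed.strftime("%Y-%m-%d")

print(to_iso_date("30-Aug-23"))  # 2023-08-30
```

The same format string can be passed to `pd.to_datetime` through its `format` argument when converting the whole column.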

### Now let's work on the Review column of the data.

In [260]:
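# Work on a copy so that later column assignments do not touch full_df
df = full_df[["Review"]].copy()
# Note: astype(str) turns the one missing review into the literal string "nan"
df["Review"] = df["Review"].astype(str)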
df.head(3)

Unnamed: 0,Review
0,I liked this gen very much. The only disadvant...
1,This latest version of Echo Dot may be a bette...
2,it could not be possible to reset it at my ne...


### Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text into a single casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

In [261]:
df["Review"] = df["Review"].str.lower()
df.head()

Unnamed: 0,Review
0,i liked this gen very much. the only disadvant...
1,this latest version of echo dot may be a bette...
2,it could not be possible to reset it at my ne...
3,nothing great then previous ones .
4,awesomebut some time can not listenover all good


### Removal of Punctuations
Another common text preprocessing technique is to remove punctuation from the text. This is again a text standardization step that helps treat 'hurray' and 'hurray!' the same way.
`string.punctuation` contains the following characters:

`` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ``

In [262]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["Review"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,Review,text_wo_punct
0,i liked this gen very much. the only disadvant...,i liked this gen very much the only disadvanta...
1,this latest version of echo dot may be a bette...,this latest version of echo dot may be a bette...
2,it could not be possible to reset it at my ne...,it could not be possible to reset it at my ne...
3,nothing great then previous ones .,nothing great then previous ones
4,awesomebut some time can not listenover all good,awesomebut some time can not listenover all good


### Removal of stopwords  
Stopwords are commonly occurring words in a language, such as 'the' and 'a'. They can usually be removed from the text because they carry little information for downstream analysis.

In [263]:
# Words in NLTK's English stopword list
# (run nltk.download('stopwords') first if the corpus is not installed)
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [264]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,Review,text_wo_punct,text_wo_stop
0,i liked this gen very much. the only disadvant...,i liked this gen very much the only disadvanta...,liked gen much disadvantage found alexa way du...
1,this latest version of echo dot may be a bette...,this latest version of echo dot may be a bette...,latest version echo dot may better version pre...
2,it could not be possible to reset it at my ne...,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...
3,nothing great then previous ones .,nothing great then previous ones,nothing great previous ones
4,awesomebut some time can not listenover all good,awesomebut some time can not listenover all good,awesomebut time listenover good
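
The order of the two steps matters. The stopword list contains contractions such as `doesn't`, but once punctuation has been stripped the token becomes `doesnt`, which is no longer in the list. The sketch below illustrates this with a small hand-picked stopword subset (an assumption, used only to keep the example self-contained instead of loading the NLTK corpus).

```python
import string

# Tiny stand-in for the NLTK list; only for illustration.
STOPWORDS_SAMPLE = {"it", "doesn't", "the", "to"}

def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text, stopwords):
    return " ".join(w for w in text.split() if w not in stopwords)

text = "it doesn't connect to the wifi."

# Order used above: punctuation first, then stopwords.
punct_first = remove_stopwords(remove_punctuation(text), STOPWORDS_SAMPLE)

# Alternative order: stopwords first, then punctuation.
stop_first = remove_punctuation(remove_stopwords(text, STOPWORDS_SAMPLE))

print(punct_first)  # doesnt connect wifi
print(stop_first)   # connect wifi
```

Neither order is perfect: with stopwords first, a stopword followed directly by punctuation (for example `it.`) would not match the list. Which trade-off is acceptable depends on the data.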


### Removal of Frequent words
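Some words occur very frequently in our corpus without being of much importance to us, and we may want to remove them as well.

**Note: a weighting scheme such as TF-IDF already down-weights words that appear in most documents, so this step may be unnecessary in that case.**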
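
To see why TF-IDF handles frequent words, here is a minimal sketch of the inverse document frequency part, using the plain textbook form idf(word) = log(N / df(word)). This is an assumption made for illustration; libraries typically use smoothed variants of the formula. The three short documents are made up.

```python
import math

def inverse_document_frequency(docs):
    """Return idf(word) = log(N / df(word)) for every word in the corpus."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        # set() so that a word is counted once per document
        for word in set(doc.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

docs = ["alexa good sound", "alexa hangs", "alexa good product"]
idf = inverse_document_frequency(docs)

print(idf["alexa"])  # 0.0, because the word appears in every document
print(idf["hangs"] > idf["good"] > idf["alexa"])  # True
```

A word that occurs in every document gets a weight of zero, which has the same effect as removing it.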




In [265]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('alexa', 28),
 ('good', 25),
 ('product', 21),
 ('like', 15),
 ('sound', 15),
 ('quality', 15),
 ('echo', 13),
 ('device', 13),
 ('dot', 11),
 ('times', 11)]

Here the word "alexa" appears most often in the data (28 times).
Let's remove the 10 most frequent words as well.

In [266]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,Review,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,i liked this gen very much. the only disadvant...,i liked this gen very much the only disadvanta...,liked gen much disadvantage found alexa way du...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,this latest version of echo dot may be a bette...,latest version echo dot may better version pre...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...
3,nothing great then previous ones .,nothing great then previous ones,nothing great previous ones,nothing great previous ones
4,awesomebut some time can not listenover all good,awesomebut some time can not listenover all good,awesomebut time listenover good,awesomebut time listenover


Let's check the frequent words again.

In [267]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stopfreq"].values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('music', 11),
 ('amazon', 11),
 ('working', 10),
 ('doesnt', 10),
 ('time', 9),
 ('one', 9),
 ('work', 8),
 ('smart', 8),
 ('gen', 7),
 ('understand', 7)]

We no longer need these two columns:
["text_wo_punct", "text_wo_stop"]

In [268]:

df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)


### Removal of Rare words
We may also want to remove rare words from the corpus, because words that occur only once or twice rarely have an impact on later analysis.

Let's find the 10 rarest words in the data.

In [269]:
n_rare_words = 10

RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
RAREWORDS

{'airtel',
 'another',
 'broadband',
 'charger',
 'havr',
 'launch',
 'refusing',
 'unhappy',
 'years',
 '😁superb'}

In [270]:
n_rare_words = 10

RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])

def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,Review,text_wo_stopfreq,text_wo_stopfreqrare
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...
3,nothing great then previous ones .,nothing great previous ones,nothing great previous ones
4,awesomebut some time can not listenover all good,awesomebut time listenover,awesomebut time listenover


In [271]:
# We dont need this column either.
df.drop(["text_wo_stopfreq"], axis=1, inplace=True) 
df.head(2)

Unnamed: 0,Review,text_wo_stopfreqrare
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...
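
The approach above removes a fixed number of words from the end of `most_common()`. When many words share the lowest count, which ten are picked depends on their order in the counter. An alternative is to remove every word below a minimum count. The sketch below assumes a threshold of 2; the function name and the sample texts are hypothetical.

```python
from collections import Counter

def remove_words_below_count(texts, min_count=2):
    """Drop every word that occurs fewer than min_count times in the corpus."""
    counts = Counter(word for text in texts for word in text.split())
    return [
        " ".join(w for w in text.split() if counts[w] >= min_count)
        for text in texts
    ]

texts = ["music working airtel", "music working", "music broadband"]
print(remove_words_below_count(texts, min_count=2))
# ['music working', 'music working', 'music']
```

On a small dataset like this one, a threshold can remove a large share of the vocabulary, so it is worth checking how many words fall below it first.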


### Stemming

Stemming reduces words to their root or base form by stripping affixes. The Porter stemmer used here strips suffixes only.  

  
  
For example, if the corpus contains the two words walks and walking, stemming strips the suffix from each and leaves walk. In another example, the two words console and consoling are both stemmed to consol, which is not a proper English word.
  
  There are several stemming algorithms available. One of the most widely used is the Porter stemmer, which is available in the nltk package.



In [272]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text_wo_stopfreqrare"].apply(lambda text: stem_words(text))
df.head()

Unnamed: 0,Review,text_wo_stopfreqrare,text_stemmed
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,like gen much disadvantag found way dumb last ...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previou one ...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possibl reset new locatioon custom help ...
3,nothing great then previous ones .,nothing great previous ones,noth great previou one
4,awesomebut some time can not listenover all good,awesomebut time listenover,awesomebut time listenov


**BUT**,  
 One of the main disadvantages of stemming is over-stemming, where the stemmer removes too much of a word, so that words with different meanings end up with the same stem. This can result in the loss of important information. On the flip side, under-stemming occurs when the stemmer fails to remove a suffix that should be removed, so related forms end up with different stems.   

 Stemming typically does not consider the meaning of words; it works purely on their textual patterns. This can lead to words with different meanings being stemmed to the same root, causing ambiguity.  

Stemming algorithms are language-dependent, and they may produce stems that are not standard words in the language.  

**SO**,  
More advanced techniques such as lemmatization, word embeddings (e.g. Word2Vec) or contextual embeddings (e.g. BERT) may be preferred over stemming to capture word meanings and relationships more accurately.
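
To make over-stemming and under-stemming concrete, here is a deliberately naive suffix stripper. It is a toy rule set written for this illustration, not the Porter algorithm, and the suffix list is an assumption.

```python
def naive_stem(word, suffixes=("ing", "es", "s", "e")):
    """Strip the first matching suffix from the word (toy rules, not Porter)."""
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["walks", "walking", "console", "consoling", "news", "studies", "study"]:
    print(word, "->", naive_stem(word))
# walks -> walk
# walking -> walk
# console -> consol
# consoling -> consol
# news -> new
# studies -> studi
# study -> study
```

Here `news` becoming `new` is over-stemming (two unrelated words now share a stem), while `studies` and `study` ending up with different stems is under-stemming.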



### Lemmatization
Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the base form (called the lemma) is a valid word in the language.

Because this requires a vocabulary lookup, lemmatization is generally slower than stemming. Depending on the speed requirement, we can choose either stemming or lemmatization.

In [273]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/sujan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [274]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["text_wo_stopfreqrare"].apply(lambda text: lemmatize_words(text))
df.head(10)

Unnamed: 0,Review,text_wo_stopfreqrare,text_stemmed,text_lemmatized
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,like gen much disadvantag found way dumb last ...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previou one ...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possibl reset new locatioon custom help ...,could possible reset new locatioon customer he...
3,nothing great then previous ones .,nothing great previous ones,noth great previou one,nothing great previous one
4,awesomebut some time can not listenover all good,awesomebut time listenover,awesomebut time listenov,awesomebut time listenover
5,freedom sale i brought this with light. workin...,freedom sale brought light working 👍,freedom sale brought light work 👍,freedom sale brought light working 👍
6,good product. stayed upto the expectations. co...,stayed upto expectations could atleast 1000 lo...,stay upto expect could atleast 1000 lower side...,stayed upto expectation could atleast 1000 low...
7,good,,,
8,very good connectivity with iphone and android...,connectivity iphone android build macho,connect iphon android build macho,connectivity iphone android build macho
9,the sound is so amazing that you are lost in y...,amazing lost ears feel skin whole body music p...,amaz lost ear feel skin whole bodi music play ...,amazing lost ear feel skin whole body music pl...


Lemmatization does not over-cut words the way stemming does; each result is still a valid word.

In [275]:
lemmatizer.lemmatize("Nothing")


'Nothing'

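We can also lemmatize verbs by passing the part-of-speech tag 'v' as the second argument. By default, `lemmatize` treats every word as a noun, which is why 'liked' and 'working' are left unchanged in the table above.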

In [276]:
lemmatizer.lemmatize("brought",'v')


'bring'
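
The effect of the part-of-speech argument can be pictured as a lookup keyed by both the word and its tag. The sketch below is a toy stand-in for that idea, not how WordNet is implemented; the table entries and the name `toy_lemmatize` are assumptions made for illustration.

```python
# Toy lemma table keyed by (word, part of speech); "n" = noun, "v" = verb.
# The entries are illustrative and not taken from WordNet.
LEMMAS = {
    ("ones", "n"): "one",
    ("ears", "n"): "ear",
    ("liked", "v"): "like",
    ("working", "v"): "work",
    ("brought", "v"): "bring",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given tag; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("brought"))       # brought (looked up as a noun, no entry)
print(toy_lemmatize("brought", "v"))  # bring
print(toy_lemmatize("ones"))          # one
```

To lemmatize verbs across the whole column, each word would first need a part-of-speech tag, which is an extra step not covered in this kernel.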

#### Let's continue with 'text_lemmatized', because the stemmed output is not good enough for this data.

### Removal of Emojis
With the growing use of social media platforms, emojis have become very common in everyday text. For some kinds of textual analysis we may need to remove them.

In [277]:
def remove_emoji(text):
    ''' user defined function that removes emoji from text'''
    # the parameter is named text so that it does not shadow the imported string module
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # broad range: also covers CJK and other scripts
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


In [278]:
# first, let's test this on : something amazing 😱️

remove_emoji("something amazing 😱️")
# remove_emoji("freedom sale brought light working 👍")

'something amazing '

In [279]:
df["text_lemmatized_wo_emoji"] = df["text_lemmatized"].apply(remove_emoji)
df.head(6)

Unnamed: 0,Review,text_wo_stopfreqrare,text_stemmed,text_lemmatized,text_lemmatized_wo_emoji
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,like gen much disadvantag found way dumb last ...,liked gen much disadvantage found way dumb las...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previou one ...,latest version may better version previous one...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possibl reset new locatioon custom help ...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...
3,nothing great then previous ones .,nothing great previous ones,noth great previou one,nothing great previous one,nothing great previous one
4,awesomebut some time can not listenover all good,awesomebut time listenover,awesomebut time listenov,awesomebut time listenover,awesomebut time listenover
5,freedom sale i brought this with light. workin...,freedom sale brought light working 👍,freedom sale brought light work 👍,freedom sale brought light working 👍,freedom sale brought light working
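
One thing to be aware of: the last range in `remove_emoji`, U+24C2 to U+1F251, is very wide and also covers non-emoji characters such as Chinese, Japanese and Korean text. For English reviews this does not matter, but for multilingual data a narrower pattern may be safer. The sketch below uses a narrower set of ranges; the selection is an assumption and not a complete list of emoji code points.

```python
import re

# The broad range from remove_emoji above, on its own.
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

# Narrower ranges (an illustrative selection, not exhaustive).
NARROW_EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2702-\u27B0"          # dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)

def remove_emoji_narrow(text):
    """Remove only characters from the narrower emoji ranges."""
    return NARROW_EMOJI.sub("", text)

print(repr(BROAD_RANGE.sub("", "日本語 ok")))        # ' ok'
print(repr(remove_emoji_narrow("日本語 ok 👍")))     # '日本語 ok '
```

The broad range deletes the Japanese text, while the narrower pattern keeps it and still removes the emoji.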


Let's remove some unnecessary columns again.


In [280]:
columns = ['text_wo_stopfreqrare','text_stemmed','text_lemmatized']
df.drop(columns, axis=1, inplace=True)
df.head(3)

Unnamed: 0,Review,text_lemmatized_wo_emoji
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...


### Removal of URLs
We may also need to remove any URLs present in the data. For example, in a Twitter analysis there is a good chance that a tweet contains a URL, which we would probably want to remove before further analysis.



In [281]:
def remove_urls(text):
    '''
    user defined function to remove urls from the text.
    '''
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [282]:
# Lets test on this: 
text = "Please visit my personal website through this link https://www.suzansharma.com.np"
remove_urls(text)


'Please visit my personal website through this link '

#### Apply this remove_urls function in case there are any URLs in our data.

In [283]:
df["text_wo_urls"] = df["text_lemmatized_wo_emoji"].apply(lambda text: remove_urls(text))
df.head(3)

Unnamed: 0,Review,text_lemmatized_wo_emoji,text_wo_urls
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...
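
Removing a URL from the middle of a sentence leaves two spaces behind, and a URL at the end leaves a trailing space (as in the test output above). A small variation, sketched below, replaces each URL with a space and then collapses the whitespace. The function name `remove_urls_and_tidy` and the example.com addresses are placeholders.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_and_tidy(text):
    """Replace URLs with a space, then collapse runs of whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls_and_tidy("see https://example.com for details"))  # see for details
print(remove_urls_and_tidy("(see www.example.com)"))                # (see
```

The second call shows a limitation that the original pattern shares: `\S+` also swallows punctuation attached to the URL, here the closing bracket.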


### Removal of HTML Tags
If we scrape data from websites, we may end up with HTML strings as part of our text. Removing them is an important step for text analysis and other NLP-related tasks.



In [284]:
def remove_html(text):
    ''' user defined function to remove html from text'''
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> Sujan Sharma</h1>
<p> ML Engineer</p>
<a href="https://www.suzansharma.com.np"> Visit ME</a>
</div>"""

print(remove_html(text))


 Sujan Sharma
 ML Engineer
 Visit ME



In [285]:
# using BeautifulSoup to remove the html tags
# (the "lxml" parser requires the lxml package to be installed)
from bs4 import BeautifulSoup

def remove_html(text):
    ''' 
    user defined function 
    that use BeautifulSoup 
    to remove html from text
    '''
    return BeautifulSoup(text, "lxml").text

text = """<div>
<h1> Sujan Sharma</h1>
<p> ML Engineer</p>
<a href="https://www.suzansharma.com.np"> Visit ME</a>
</div>"""

print(remove_html(text))


 Sujan Sharma
 ML Engineer
 Visit ME



#### Apply this remove_html function in case there are any HTML tags in our data.

In [286]:
df["text_wo_htmls"] = df["text_wo_urls"].apply(lambda text: remove_html(text))
df.head(3)

Unnamed: 0,Review,text_lemmatized_wo_emoji,text_wo_urls,text_wo_htmls
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...,liked gen much disadvantage found way dumb las...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...,latest version may better version previous one...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...,could possible reset new locatioon customer he...
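
Scraped text often contains HTML entities such as `&amp;` in addition to tags, and the regex version leaves those untouched. As a dependency-free alternative, the sketch below uses the standard library `html.parser` module, which drops the tags and decodes the entities. The names `_TextExtractor` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(strip_html("<p>Sound &amp; quality</p>"))  # Sound & quality
```

For badly broken markup, a dedicated library such as BeautifulSoup is usually the more forgiving choice.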


### Spelling Correction

Typos are common in text data, and we may want to correct them before we do our analysis.

**IF**  
we had continued with the stemmed text, many of the words would not have been valid English words. This step could have corrected some of them.

In [291]:
from textblob import TextBlob

def correct_spellings(text):
    '''
    user defined function to 
    correct the spelling of 
    the text with textblob package.
    '''
    textBLB = TextBlob(text)
    return textBLB.correct().string
        

In [296]:
# first lets check on this:
text = "Evrything is goig good in this training."
correct_spellings(text)

'Everything is going good in this training.'

In [297]:
df["Final_Review"] = df["text_wo_htmls"].apply(lambda text: correct_spellings(text))


In [298]:
# final removal of unnecessary columns in the data
columns = ['text_lemmatized_wo_emoji','text_wo_urls','text_wo_htmls']
df.drop(columns, axis=1, inplace=True)
df.head(10)

Unnamed: 0,Review,Final_Review
0,i liked this gen very much. the only disadvant...,liked gen much disadvantage found way dumb las...
1,this latest version of echo dot may be a bette...,latest version may better version previous one...
2,it could not be possible to reset it at my ne...,could possible rest new location customer help...
3,nothing great then previous ones .,nothing great previous one
4,awesomebut some time can not listenover all good,awesomebut time listener
5,freedom sale i brought this with light. workin...,freedom sale brought light working
6,good product. stayed upto the expectations. co...,stayed unto expectation could least 1000 lower...
7,good,
8,very good connectivity with iphone and android...,conductivity phone andros build much
9,the sound is so amazing that you are lost in y...,amazing lost ear feel skin whole body music pl...
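
The table above also shows that automatic correction can change valid words: 'connectivity' became 'conductivity', 'iphone' became 'phone' and 'reset' became 'rest'. One way to limit this is to skip words from a protected list. The sketch below assumes a hypothetical set of domain words and uses a toy corrector in place of a real one, so that it runs without extra packages.

```python
def correct_with_protected_words(text, correct_word, protected):
    """Apply correct_word to each token unless the token is in the protected set."""
    return " ".join(
        word if word in protected else correct_word(word)
        for word in text.split()
    )

# Toy stand-in for a real spelling corrector; it reproduces two of the
# changes seen in the table above.
TOY_CORRECTIONS = {"locatioon": "location", "iphone": "phone"}

def toy_corrector(word):
    return TOY_CORRECTIONS.get(word, word)

# Hypothetical domain vocabulary that should never be "corrected".
PROTECTED = {"iphone", "android", "alexa"}

print(correct_with_protected_words("locatioon iphone android", toy_corrector, PROTECTED))
# location iphone android
```

In the notebook, the `correct_spellings` function defined earlier could be passed as `correct_word` instead of the toy corrector.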


### The Final_Review column above is the output of all the preprocessing applied to the Review column. It can later go on to further processing such as vectorization.