# Text Data Preprocessing

In any machine learning task, cleaning or preprocessing the data is as important as model building, if not more so. For unstructured data such as text, this step matters even more.

The objective of this notebook is to understand the various text preprocessing steps through code examples.

Some of the common text preprocessing / cleaning steps are:

* Lower casing
* Removal of Punctuations
* Removal of Stopwords
* Removal of Frequent words
* Removal of Rare words
* Stemming
* Lemmatization
* Removal of emojis
* Removal of URLs


These are the preprocessing steps we can apply to text data, but we need not apply all of them every time. We should choose the steps carefully based on the use case, since that choice also plays an important role.

For example, in a sentiment analysis use case we need not remove emojis, as they convey important information about the sentiment. The other steps should be decided in the same way, based on the use case.
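One way to make the choice of steps explicit is to keep each step as a small function and pass the selected ones as a list. The sketch below is only an illustration: the helper names and the two step lists are assumptions, not part of the notebook's later cells.

```python
import re
import string


def lower_case(text):
    return text.lower()


def remove_punctuation(text):
    # A translation table with a third argument deletes those characters
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)


def preprocess(text, steps):
    """Apply the chosen preprocessing steps in order."""
    for step in steps:
        text = step(text)
    return text


# Hypothetical step choices for two different use cases
sentiment_steps = [remove_urls, lower_case]  # punctuation and emojis are kept
topic_steps = [remove_urls, lower_case, remove_punctuation]

sample = "Why! #iOS11 @AppleSupport https://t.co/BXrVfeIXxq"
print(preprocess(sample, sentiment_steps))
print(preprocess(sample, topic_steps))
```

The order of the steps matters: removing punctuation before URLs would turn the URL into a single word that the URL pattern no longer matches.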

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
#pd.options.mode.chained_assignment = None

## Read the data

In [2]:
pd.set_option('max_colwidth', 100)

In [3]:
pd.__version__

'1.0.5'

In [4]:
df = pd.read_csv('../data/text.csv', lineterminator='\n')
df.shape

(5000, 1)

In [5]:
df.head()

Unnamed: 0,text
0,@161252 What's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.


In [6]:
df.iloc[85:90]

Unnamed: 0,text
85,"@235730 and click on "" How to disable Automatic Restart"".\nShould you need any assistance, DM th..."
86,@AmericanAir So disappointed in your service. How can you not keep customers and staff updated.
87,"@811677 Hey, we're unable to tell when an item will be back in stock when it is listed as ""Tempo..."
88,Woke up....hyped for a Sunday stream......................................................\n\nCo...
89,"@542004 We can take a look for you, Jessi. Please DM your booking reference, full name, home and..."


## Lower Casing
Lower casing is a common text preprocessing technique. The idea is to convert the input text to a single casing format so that 'text', 'Text' and 'TEXT' are treated the same way.



In [7]:
df['text_lower'] = df.text.str.lower()
df.head()

Unnamed: 0,text,text_lower
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.


## Removal of Punctuations

Another common text preprocessing technique is to remove punctuation from the text data. This is again a text standardization step that helps treat 'hurray' and 'hurray!' in the same way.

We also need to choose the list of punctuation symbols to exclude carefully, depending on the use case. For example, `string.punctuation` in Python contains the following symbols:
```
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```

We can add or remove more punctuations as per our need.
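As a minimal sketch of customising the list, the example below keeps `#` and `@` so that hashtags and mentions survive. The choice of symbols to keep and the helper name `remove_punct` are assumptions made for illustration.

```python
import string

# Assumption: for tweets we want to keep '#' and '@'
keep = "#@"
punct_to_remove = "".join(c for c in string.punctuation if c not in keep)

# str.maketrans with a third argument builds a table that deletes those characters
table = str.maketrans("", "", punct_to_remove)


def remove_punct(text):
    return text.translate(table)


print(remove_punct("why! #ios11 @applesupport, what's up?"))
```

`str.translate` works on plain strings; on a pandas column the same function can be applied with `.apply(remove_punct)`.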

In [8]:
"".join(["\\"+c for c in string.punctuation])

'\\!\\"\\#\\$\\%\\&\\\'\\(\\)\\*\\+\\,\\-\\.\\/\\:\\;\\<\\=\\>\\?\\@\\[\\\\\\]\\^\\_\\`\\{\\|\\}\\~'

In [9]:
print("[" + "".join(["\\"+c for c in string.punctuation]) + "]")

[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~]


In [10]:
df["text_wo_punct"] = df.text_lower.str.replace("-"," ")

In [11]:
df.head()

Unnamed: 0,text,text_lower,text_wo_punct
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,@161252 what's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5 20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.


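In [12]:
# regex=True is passed explicitly: from pandas 2.0 onwards str.replace treats the pattern as a literal by default
df["text_wo_punct"] = df.text_wo_punct.str.replace("[" + "".join(["\\"+c for c in string.punctuation]) + "]", "", regex=True)
df.head()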

Unnamed: 0,text,text_lower,text_wo_punct
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced


## Removal of stopwords
Stopwords are commonly occurring words in a language, such as 'the' and 'a'. They can be removed from the text most of the time, as they provide little valuable information for downstream analysis. In cases like part-of-speech (POS) tagging, we should not remove them, as they provide very valuable information about the POS.

These stopword lists are already compiled for different languages and we can use them directly. For example, the stopword list for the English language from the nltk package can be seen below.

In [13]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kukre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [15]:
df.text_wo_punct.iloc[0]

'161252 whats that egg website people talk about'

In [16]:
text  = df.text_wo_punct.iloc[0]

" ".join([word for word in text.split() if word not in stopwords.words('english')])

'161252 whats egg website people talk'

In [17]:
def remove_words(text, wordlist):
    """Custom function to remove from text the words present in wordlist."""
    return " ".join([word for word in text.split() if word not in wordlist])

In [18]:
df['text_wo_stop'] = df.text_wo_punct.apply(lambda text: remove_words(text, stopwords.words('english')))
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced
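The cell above rebuilds the stopword list for every row and searches a list for every word. A possible refinement, sketched here under the assumption that the list does not change between rows, is to build a set once and reuse it. The small `STOPWORDS` set is a hand-picked stand-in for `set(stopwords.words('english'))` so that the example runs without downloading nltk data.

```python
# Hand-picked subset standing in for set(stopwords.words('english'))
STOPWORDS = frozenset({"that", "about", "we", "can", "you", "is", "of", "the", "than", "an"})


def remove_words(text, wordset):
    """Remove the words in wordset from text; set membership tests are fast."""
    return " ".join(word for word in text.split() if word not in wordset)


print(remove_words("161252 whats that egg website people talk about", STOPWORDS))
```

On the DataFrame this would be used as `df.text_wo_punct.apply(lambda text: remove_words(text, STOPWORDS))`, with the set created once outside the lambda.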


## Removal of Frequent words
In the previous preprocessing step, we removed stopwords based on language information. But if we have a domain-specific corpus, there may also be some frequent words that are of little importance to us.

So this step removes the frequent words in the given corpus. If we use a weighting scheme like TF-IDF, such words are down-weighted automatically.
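To see why TF-IDF handles frequent words, the sketch below computes a plain, unsmoothed inverse document frequency, log(N / df), on a toy corpus. The three documents are made up for illustration, and libraries such as scikit-learn apply smoothed variants of this formula by default, so their exact numbers differ.

```python
import math
from collections import Counter

# Toy corpus, invented for this illustration
docs = [
    "please dm us your order number",
    "please send us a dm",
    "sorry please check your connection",
]

n_docs = len(docs)

# Document frequency: the number of documents that contain each word
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# Unsmoothed inverse document frequency
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in ("please", "us", "connection"):
    print(word, round(idf[word], 3))
```

A word that appears in every document, such as "please" here, gets an IDF of zero, so it contributes nothing to the weighted representation even though it was never explicitly removed.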

Let us get the most common words and then remove them in the next step.

In [19]:
stop_word_list = stopwords.words('english')
stop_word_list.remove('when')
stop_word_list.remove('what')

", ".join(stop_word_list)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn,

In [20]:
from collections import Counter
cnt = Counter()

for text in df.text_wo_stop.values:
    for word in text.split():
        cnt[word] += 1
        
cnt.most_common(10)

[('us', 836),
 ('please', 747),
 ('dm', 633),
 ('help', 460),
 ('thanks', 405),
 ('hi', 404),
 ('get', 352),
 ('sorry', 314),
 ('like', 281),
 ('send', 276)]

In [21]:
[w for w,_ in cnt.most_common(10)]

['us', 'please', 'dm', 'help', 'thanks', 'hi', 'get', 'sorry', 'like', 'send']

In [22]:
df['text_wo_stopfreq'] = df.text_wo_stop.apply(lambda text: remove_words(text, [w for w,_ in cnt.most_common(10)]  ))
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk,161252 whats egg website people talk
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced


## Removal of Rare words
This is very similar to the previous preprocessing step, but here we remove the rare words from the corpus.

In [23]:
cnt.most_common()[:-20:-1]

[('custom', 1),
 ('puma', 1),
 ('inr1400', 1),
 ('170844', 1),
 ('bricked', 1),
 ('implementing', 1),
 ('407091', 1),
 ('reunion', 1),
 ('gravity', 1),
 ('319396', 1),
 ('684726', 1),
 ('hotmail', 1),
 ('sean', 1),
 ('457844', 1),
 ('703576', 1),
 ('598743', 1),
 ('hk', 1),
 ('313942', 1),
 ('httpstcobqcl3gv57t', 1)]

In [24]:
df['text_wo_stopfreqrare'] = df.text_wo_stopfreq.apply(lambda text: remove_words(text, [w for w,_ in cnt.most_common()[:-10:-1]]  ))
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_stopfreq,text_wo_stopfreqrare
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk,161252 whats egg website people talk,161252 whats egg website people talk
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced


In [25]:
pd.DataFrame(cnt.most_common(), columns=['word','count']).query('count == 1')

Unnamed: 0,word,count
4230,161252,1
4231,why🤷🏻‍♀️,1
4232,httpstcobxrvfeixxq,1
4233,693975,1
4234,futher,1
...,...,...
13472,bricked,1
13473,170844,1
13474,inr1400,1
13475,puma,1
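The removal cell above drops only the last nine entries of `cnt.most_common()`, while the table shows that many more words occur exactly once. An alternative, sketched here with a made-up corpus and a hypothetical `min_count` threshold, is to remove every word whose count falls below a chosen value.

```python
from collections import Counter

# Toy corpus, invented for this illustration
texts = [
    "assist recommend updating ios",
    "unstable connection drops every 5 20 mins",
    "updating ios fixed connection drops",
]

cnt = Counter(word for text in texts for word in text.split())

min_count = 2  # hypothetical threshold: drop words seen fewer than 2 times
rare_words = {word for word, count in cnt.items() if count < min_count}

cleaned = [
    " ".join(word for word in text.split() if word not in rare_words)
    for text in texts
]
print(cleaned)
```

The right threshold depends on corpus size; on a small corpus a threshold of 2 can remove most of the vocabulary.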


## Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (from Wikipedia).

For example, if the corpus contains the two words walks and walking, stemming strips the suffix from each to produce walk. In another example, with the two words console and consoling, the stemmer removes the suffix and produces consol, which is not a proper English word.

There are several types of stemming algorithms available; one of the best known and most widely used is the Porter stemmer. We can use the nltk package for it.
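The idea of suffix stripping can be illustrated with a deliberately naive function. This is a toy sketch, not the Porter algorithm: the suffix list and the `min_stem_len` parameter are assumptions chosen only to reproduce the walk and consol examples.

```python
def toy_stem(word, suffixes=("ing", "s", "e"), min_stem_len=3):
    """Strip the first matching suffix; a toy illustration, not the Porter algorithm."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["walks", "walking", "console", "consoling"]:
    print(w, "->", toy_stem(w))
```

Real stemmers apply many ordered rules with conditions on the remaining stem, which is why their output can differ from this sketch on other words.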

In [26]:
from nltk.stem.porter import PorterStemmer


stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df['text_stemmed'] = df.text_lower.apply(stem_words)
df.head()[['text','text_lower','text_stemmed']]

Unnamed: 0,text,text_lower,text_stemmed
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,@161252 what' that egg websit peopl talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updat to io 11.1.1 if you haven't had the chanc to do so...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 that better than have an unstabl connect that drop everi 5-20 min
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,@virginamerica is probabl one of the best airlin i'v ever experienced.


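Note: mapping informal abbreviations such as 'ppl' to their full form is a separate normalisation step that stemming does not cover.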

## Lemmatization
Lemmatization is similar to stemming in that it reduces inflected words to a base form, but it differs in making sure that the root word (also called the lemma) belongs to the language.

As a result, it is generally slower than stemming. So depending on the speed requirement, we can choose either stemming or lemmatization.

Let us use the WordNetLemmatizer in nltk to lemmatize our sentences.

In [27]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kukre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [28]:
from nltk.stem import WordNetLemmatizer


lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df['text_lemmatized'] = df.text_lower.apply(lemmatize_words)
df.head()[['text','text_lower','text_stemmed', 'text_lemmatized']]

Unnamed: 0,text,text_lower,text_stemmed,text_lemmatized
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,@161252 what' that egg websit peopl talk about,@161252 what's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updat to io 11.1.1 if you haven't had the chanc to do so...,@693975 we can assist you. we recommend updating to io 11.1.1 if you haven't had the chance to d...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 that better than have an unstabl connect that drop everi 5-20 min,@331912 @115955 thats better than having an unstable connection that drop every 5-20 min
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,@virginamerica is probabl one of the best airlin i'v ever experienced.,@virginamerica is probably one of the best airline i've ever experienced.


## Redo the lemmatization process with POS tag for our dataset.

In [29]:
lemmatizer.lemmatize('having')

'having'

In [30]:
lemmatizer.lemmatize('having', 'v') 

'have'

In [33]:
from nltk.corpus import wordnet
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}


In [34]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kukre\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [35]:
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}

def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_stopfreq,text_wo_stopfreqrare,text_stemmed,text_lemmatized
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk,161252 whats egg website people talk,161252 whats egg website people talk,@161252 what' that egg websit peopl talk about,@161252 What's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,@693975 we can assist you. we recommend updat to io 11.1.1 if you haven't had the chanc to do so...,@693975 We can assist you. We recommend update to iOS 11.1.1 if you haven't have the chance to d...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,@331912 @115955 that better than have an unstabl connect that drop everi 5-20 min,@331912 @115955 Thats good than have an unstable connection that drop every 5-20 min
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,@virginamerica is probabl one of the best airlin i'v ever experienced.,@VirginAmerica be probably one of the best airline I've ever experienced.
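The tag lookup inside `lemmatize_words` can be looked at in isolation. The sketch below assumes the single-letter values 'n', 'v', 'a' and 'r' for the WordNet noun, verb, adjective and adverb constants, and uses a hypothetical helper name `to_wordnet_pos`, so that it runs without nltk installed.

```python
# Assumed single-letter values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"
wordnet_map = {"N": NOUN, "V": VERB, "J": ADJ, "R": ADV}


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS by its first letter, defaulting to noun."""
    return wordnet_map.get(treebank_tag[:1], NOUN)


for tag in ["VBG", "NNS", "JJR", "RB", "IN"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Only the first letter of the tag is used, so all verb tags (VB, VBD, VBG and so on) map to the verb POS, and tags without a WordNet counterpart, such as IN for prepositions, fall back to noun.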


## Removal of Emojis

With the growing use of social media platforms, the use of emojis in day-to-day communication has exploded as well. We may need to remove these emojis for some of our text analysis.

Below is a helper function to remove emojis from our text, based on [this code](https://stackoverflow.com/a/58356570/8210613).

In [41]:
#https://stackoverflow.com/a/58356570/8210613
def remove_emoji(data):
    # Note: some ranges below are very broad and also match non-emoji characters
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also covers CJK and many other scripts
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

df["text_no_emoji"] = df["text"].apply(remove_emoji)
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_stopfreq,text_wo_stopfreqrare,text_stemmed,text_lemmatized,text_no_emoji,text_no_url
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk,161252 whats egg website people talk,161252 whats egg website people talk,@161252 what' that egg websit peopl talk about,@161252 What's that egg website people talk about,@161252 What's that egg website people talk about,@161252 What's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,Why! #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,Why!🤷🏻‍♀️ #iOS11 @AppleSupport
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,@693975 we can assist you. we recommend updat to io 11.1.1 if you haven't had the chanc to do so...,@693975 We can assist you. We recommend update to iOS 11.1.1 if you haven't have the chance to d...,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,@331912 @115955 that better than have an unstabl connect that drop everi 5-20 min,@331912 @115955 Thats good than have an unstable connection that drop every 5-20 min,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,@virginamerica is probabl one of the best airlin i'v ever experienced.,@VirginAmerica be probably one of the best airline I've ever experienced.,@VirginAmerica is probably one of the best airlines I've ever experienced.,@VirginAmerica is probably one of the best airlines I've ever experienced.
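Because `remove_emoji` includes the range U+24C2 to U+1F251, it also strips CJK text and other non-emoji characters. Where that matters, a narrower pattern may be preferable. The ranges below are an assumed, deliberately incomplete selection and `remove_emoji_narrow` is a hypothetical name; this is a sketch, not a full definition of what counts as an emoji.

```python
import re

# Assumed, deliberately narrow set of ranges; not a complete emoji definition
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport and related symbol blocks
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d"                 # zero-width joiner used in emoji sequences
    "\ufe0f"                 # variation selector-16
    "]+"
)


def remove_emoji_narrow(text):
    return EMOJI_PATTERN.sub("", text)


# Shrug emoji sequence written with escapes, followed by two CJK characters
sample = "Why!\U0001F937\U0001F3FB\u200d\u2640\ufe0f #iOS11 \u4f60\u597d"
print(remove_emoji_narrow(sample))
```

The trade-off is coverage: a narrow pattern leaves other scripts intact but can miss emojis outside the listed ranges.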


## Removal of URLs

The next preprocessing step is to remove any URLs present in the data. For example, if we are analysing Twitter data, there is a good chance that a tweet contains a URL, which we may need to remove before further analysis.

We can use the code snippet below to do that.

In [42]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)


df["text_no_url"] = df["text"].apply(remove_urls)
df.head()

Unnamed: 0,text,text_lower,text_wo_punct,text_wo_stop,text_wo_stopfreq,text_wo_stopfreqrare,text_stemmed,text_lemmatized,text_no_emoji,text_no_url
0,@161252 What's that egg website people talk about,@161252 what's that egg website people talk about,161252 whats that egg website people talk about,161252 whats egg website people talk,161252 whats egg website people talk,161252 whats egg website people talk,@161252 what' that egg websit peopl talk about,@161252 What's that egg website people talk about,@161252 What's that egg website people talk about,@161252 What's that egg website people talk about
1,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why🤷🏻‍♀️ ios11 applesupport httpstcobxrvfeixxq,why!🤷🏻‍♀️ #ios11 @applesupport https://t.co/bxrvfeixxq,Why!🤷🏻‍♀️ #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,Why! #iOS11 @AppleSupport https://t.co/BXrVfeIXxq,Why!🤷🏻‍♀️ #iOS11 @AppleSupport
2,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 we can assist you. we recommend updating to ios 11.1.1 if you haven't had the chance to ...,693975 we can assist you we recommend updating to ios 1111 if you havent had the chance to do so...,693975 assist recommend updating ios 1111 havent chance also dm us following link futher support...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,693975 assist recommend updating ios 1111 havent chance also following link futher support https...,@693975 we can assist you. we recommend updat to io 11.1.1 if you haven't had the chanc to do so...,@693975 We can assist you. We recommend update to iOS 11.1.1 if you haven't have the chance to d...,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...,@693975 We can assist you. We recommend updating to iOS 11.1.1 if you haven't had the chance to ...
3,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 thats better than having an unstable connection that drops every 5-20 mins,331912 115955 thats better than having an unstable connection that drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,331912 115955 thats better unstable connection drops every 5 20 mins,@331912 @115955 that better than have an unstabl connect that drop everi 5-20 min,@331912 @115955 Thats good than have an unstable connection that drop every 5-20 min,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins,@331912 @115955 Thats better than having an unstable connection that drops every 5-20 mins
4,@VirginAmerica is probably one of the best airlines I've ever experienced.,@virginamerica is probably one of the best airlines i've ever experienced.,virginamerica is probably one of the best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,virginamerica probably one best airlines ive ever experienced,@virginamerica is probabl one of the best airlin i'v ever experienced.,@VirginAmerica be probably one of the best airline I've ever experienced.,@VirginAmerica is probably one of the best airlines I've ever experienced.,@VirginAmerica is probably one of the best airlines I've ever experienced.
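In some use cases the presence of a link is itself informative. A possible variation, sketched below, substitutes each URL with a marker instead of deleting it and tidies the leftover whitespace. The function name `replace_urls` and the `[URL]` marker are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def replace_urls(text, placeholder="[URL]"):
    """Replace each URL with a placeholder and collapse repeated whitespace."""
    replaced = URL_PATTERN.sub(placeholder, text)
    return re.sub(r"\s+", " ", replaced).strip()


print(replace_urls("Why! #iOS11 @AppleSupport https://t.co/BXrVfeIXxq"))
print(replace_urls("see www.example.com for details", placeholder=""))
```

Since the pattern matches up to the next whitespace, punctuation attached to the end of a URL is consumed along with it.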


## Discussion activity:

* What use cases can you think of for NLP?
    - analysis of speech: news articles, transcription of speeches
    - topic modelling vs topic classification

* What role does preprocessing play in the application of NLP?