# Data Preprocessing: Cleaning

Let's analyze this tweet, posted when Sir Alex Ferguson underwent surgery for a brain haemorrhage.


# Tweet:

"Get well soon fergy lad &lt;3!!! u're a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery &amp; send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan 💙💙"

## 1. Eliminating HTML Characters

Data obtained from the web often contains HTML entities such as &lt;, &gt; and &amp; embedded in the original text. These entities need to be converted back into the characters they represent.


In [2]:
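def unescape(s):
    # Replace the three most common HTML entities with their characters.
    # "&amp;" goes last so that "&amp;lt;" becomes "&lt;" and is not unescaped twice.
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&amp;", "&")
    return s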

In [3]:
inputTweet = "Get well soon fergy lad &lt;3!!! u're a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery &amp; send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan 💙💙"
outputTweet = unescape(inputTweet)
print("Cleaned Tweet: ", outputTweet)

# Alternatively, use the standard library's html module

import html
inputTweet = html.unescape(inputTweet)
#print("Cleaned Tweet: ", inputTweet)

Cleaned Tweet:  Get well soon fergy lad <3!!! u're a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery & send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan 💙💙
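
The two approaches above differ once the text contains entities beyond the three handled by hand. The sketch below is only an illustration: the sample string and the helper name `unescape_basic` are made up for this example, while `html.unescape` is the standard-library function already used above.

```python
import html

def unescape_basic(s):
    # Same three replacements as the hand-written unescape above
    return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")

# Made-up sample containing a named entity (&quot;) and a numeric one (&#39;)
sample = "&quot;Fergie&#39;s&quot; &lt;3 &amp; more"

basic = unescape_basic(sample)   # leaves &quot; and &#39; untouched
full = html.unescape(sample)     # resolves named and numeric references

print(basic)
print(full)
```

For scraped text, `html.unescape` is usually the safer default because the set of entities that can appear is rarely known in advance.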


## 2. Data Decoding

Text data can arrive in different encodings, such as Latin-1 or UTF-8. For consistent analysis, it is necessary to keep all of the data in one standard encoding; UTF-8 is widely accepted and is the recommended choice. The cell below takes a blunter approach and drops every non-ASCII character, which is why the emoji disappear from the output.

In [4]:
inputTweet = inputTweet.encode('ascii', 'ignore').decode('utf-8', 'ignore')
print("Cleaned Tweet: ", inputTweet)

Cleaned Tweet:  Get well soon fergy lad <3!!! u're a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery & send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan 
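
When the goal is to standardize the encoding rather than discard characters, the usual pattern is to decode bytes with the encoding they were written in and then re-encode as UTF-8. A minimal sketch, assuming the input arrives as Latin-1 bytes (the sample text and variable names are made up for illustration):

```python
# Made-up example: text that arrives as Latin-1 encoded bytes
raw_latin1 = "Señor Fergie, olé".encode("latin-1")

# Decode with the encoding the bytes were actually written in...
text = raw_latin1.decode("latin-1")

# ...then keep everything as UTF-8 from here on
raw_utf8 = text.encode("utf-8")

# Decoding with the wrong codec does not fail here, it silently garbles the text
garbled = raw_utf8.decode("latin-1")

print(raw_latin1)
print(raw_utf8)
print(raw_utf8.decode("utf-8"))
print(garbled)
```

Unlike the ASCII round trip above, this keeps accented letters intact.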


## 3. Expanding Apostrophe 

Contractions written with an apostrophe should be expanded into their standard forms. A lookup table that maps each contraction to its expansion removes the ambiguity.

In [5]:
dictApost = {
    "'s" : "is",  # only matches a standalone "'s" token in the whole-word lookup
    "you're" : "you are",
    "u're" : "you are",
    "can't" : "cannot",
    "won't" : "will not",
    "isn't" : "is not",
    "it's" : "it is",
    "o'clock" : "of the clock"
}  # A realistic table needs many more entries

In [6]:
words = inputTweet.split()
reformed = [dictApost.get(word, word) for word in words]
inputTweet = " ".join(reformed)
print(inputTweet)

Get well soon fergy lad <3!!! you are a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery & send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan
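
The whole-token lookup above is case-sensitive and misses tokens with punctuation attached, such as "It's" or "can't.". One possible workaround is a single case-insensitive regular expression built from the table. This is only a sketch: `CONTRACTIONS` is a small made-up table and `expand_contractions` is a hypothetical helper name, and the expansion comes out lowercased.

```python
import re

# Hypothetical, deliberately small lookup table; a real one would be much larger
CONTRACTIONS = {
    "you're": "you are",
    "u're": "you are",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
    "it's": "it is",
}

# Longest keys first; the lookarounds stop matches inside longer words
_keys = sorted(CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercased match so "It's" and "it's" share one entry
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(1).lower()], text)

print(expand_contractions("It's fine, you're right; I can't."))
# it is fine, you are right; I cannot.
```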


## 4. Removing Punctuation


In [7]:
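import re
import string

print(inputTweet)

# Option 1: a regex that drops everything except word characters and whitespace
inputTweet1 = re.sub(r'[^\w\s]', '', inputTweet)
print(inputTweet1)

# Option 2: a translation table that deletes every character in string.punctuation
table = str.maketrans('', '', string.punctuation)
inputTweet = inputTweet.translate(table)
print(inputTweet)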


Get well soon fergy lad <3!!! you are a awsm legend. As a Man City fan ... I wish Sir Alex a speedy recovery & send my best wishes to his family at this difficult time. #ManchesterCityFan#footballfan
Get well soon fergy lad 3 you are a awsm legend As a Man City fan  I wish Sir Alex a speedy recovery  send my best wishes to his family at this difficult time ManchesterCityFanfootballfan
Get well soon fergy lad 3 you are a awsm legend As a Man City fan  I wish Sir Alex a speedy recovery  send my best wishes to his family at this difficult time ManchesterCityFanfootballfan
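
Deleting punctuation outright glues the two hashtags together into `ManchesterCityFanfootballfan` and leaves double spaces behind. If that matters for later steps, one option is to replace punctuation with a space and then collapse whitespace. A minimal sketch; the function name `strip_punctuation_keep_gaps` is made up for this example:

```python
import re

def strip_punctuation_keep_gaps(text):
    # Replace each punctuation character with a space instead of deleting it
    spaced = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace this leaves behind
    return " ".join(spaced.split())

print(strip_punctuation_keep_gaps("time. #ManchesterCityFan#footballfan <3!!!"))
# time ManchesterCityFan footballfan 3
```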


## 5. Removing Email IDs

In [8]:
# Named "text" rather than "string" so the imported string module is not shadowed
text = 'My name is CR7 wisdomic@panda.com'
pattern = r"\w+@[a-z]+\.[a-z]+"
regex = re.compile(pattern)
cleaned = regex.sub('', text)
print(cleaned)

My name is CR7 
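
The pattern above only accepts lowercase ASCII letters in the domain and a single dot, so addresses with dots or plus signs in the local part, uppercase letters, or multi-part domains are missed or only partly removed. The sketch below uses a broader pattern; it is an assumed heuristic, not a full address validator, and `EMAIL_RE` and `remove_emails` are names made up for this example.

```python
import re

# Assumed heuristic: local part with dots, plus and hyphen; domain with one or more dots
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def remove_emails(text):
    without = EMAIL_RE.sub("", text)
    # Tidy up the gap left where the address used to be
    return " ".join(without.split())

print(remove_emails("Mail Alex.Fan+news@Mail.Example.co.uk for tickets"))
# Mail for tickets
```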


## 6. Standardizing Text

In [9]:
lookup_dict = {
    'awsm': 'awesome',
    'lol': 'laughing out loud',
    'brb': 'be right back',
    'btw': 'by the way',
    'lmk': 'let me know',
    'g2g': 'got to go',
    'rt': 'retweet',
    'dm': 'direct message',
    'luv': 'love'}

In [10]:
def Standardize(input_text):
    """Replace slang and abbreviations with their full forms from lookup_dict."""
    words = input_text.split()
    new_words = []
    for word in words:
        # Look up the lowercased token; keep the original word if there is no entry
        new_words.append(lookup_dict.get(word.lower(), word))
    return " ".join(new_words)

In [11]:
print(Standardize(inputTweet))

Get well soon fergy lad 3 you are a awesome legend As a Man City fan I wish Sir Alex a speedy recovery send my best wishes to his family at this difficult time ManchesterCityFanfootballfan
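
The lookup above works here because punctuation was already removed in step 4. If standardization has to run on text that still contains punctuation, tokens such as "awsm!" would not match. A possible variant, sketched with a small made-up table `SLANG` and a hypothetical helper `standardize_words`, substitutes at word boundaries instead of splitting on whitespace:

```python
import re

# Small made-up table for illustration
SLANG = {
    "awsm": "awesome",
    "lol": "laughing out loud",
    "btw": "by the way",
    "g2g": "got to go",
}

def standardize_words(text):
    # \w+ matches runs of letters, digits and underscores, so "g2g" is one token
    def replace(match):
        word = match.group(0)
        return SLANG.get(word.lower(), word)
    return re.sub(r"\b\w+\b", replace, text)

print(standardize_words("btw, that goal was awsm! LOL"))
# by the way, that goal was awesome! laughing out loud
```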


## 7. Remove Non-ASCII Characters

In [12]:

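import unicodedata

def remove_non_ascii(words):
    """Strip accents and drop non-ASCII characters from each word in a list of word lists."""
    list_of_words = []
    for wordlist in words:
        # NFKD splits accented letters into a base letter plus combining marks;
        # encoding to ASCII with 'ignore' then drops everything that is not ASCII.
        wordlist = [unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore') for word in wordlist]
        list_of_words.append(wordlist)
    return list_of_words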
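
To see what the normalize-then-encode step does, here is a small self-contained sketch on a single string. The helper name `to_ascii` and the sample inputs are made up for illustration; the technique is the same NFKD normalization followed by an ASCII encode with `ignore`.

```python
import unicodedata

def to_ascii(text):
    # Decompose accented characters, then drop anything outside ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(repr(to_ascii("Café déjà vu 💙")))
# 'Cafe deja vu '

# Applied to a list of word lists, the shape remove_non_ascii expects
nested = [["naïve", "fan"], ["💙💙"]]
print([[to_ascii(word) for word in wordlist] for wordlist in nested])
# [['naive', 'fan'], ['']]
```

Note that a word made up entirely of non-ASCII characters becomes an empty string, so a later filtering step may be needed.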


## 8. Spelling Correction

Spelling correction is a useful preprocessing step because it reduces the number of variant spellings of the same word. For example, “Analytics” and “analytcs” would otherwise be treated as different words even though they are used in the same sense.

In [17]:
from textblob import TextBlob
blob = TextBlob("sciemce amd teknology")
print(blob.correct())

science and technology
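
To illustrate the idea without an external dependency, the sketch below swaps in a different and much simpler technique: fuzzy matching against a fixed vocabulary with the standard library's `difflib.get_close_matches`. It is not how TextBlob works internally; `VOCABULARY`, `correct_spelling` and the `cutoff` value are assumptions chosen for this example.

```python
import difflib

# Tiny made-up vocabulary; a real corrector needs a large word list
VOCABULARY = ["science", "and", "technology", "analytics"]

def correct_spelling(text, vocabulary=VOCABULARY, cutoff=0.6):
    corrected = []
    for word in text.split():
        # Pick the single closest vocabulary word, if any is similar enough
        matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_spelling("sciemce amd teknology"))
# science and technology
print(correct_spelling("Analytics analytcs"))
# analytics analytics
```

Both spellings in the second call map to the same token, which is the reduction in duplicate word forms described above.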
