# Twitter Data Cleaning

In this program, we're going to clean Twitter data. There are 4 main steps:<br><br>
1. Removing duplicate data<br>
2. Decoding HTML entities into regular characters (and clearing unnecessary newlines)<br>
3. Removing links, hash characters, usernames, and punctuation (if any)<br>
4. Removing stopwords<br><br>
The whole process is intended to produce clean Twitter data that can be used for further processing (analysis).

***
### Import modules
***
Import the modules needed for the cleaning process. Several modules are used here:<br><br>
1. **pandas** --> to open the data file and apply operations to the data.<br>
2. **html** --> to decode HTML entities into regular characters.<br>
3. **re** --> to filter and delete unnecessary links, hash characters, usernames, and punctuation.<br>
4. **nltk** --> to tokenize the tweets and remove stopwords.

In [2]:
import pandas as pd
import html
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

### Data Importing
***
Import the data file that needs to be cleaned. In this case, I'm using a sample CSV file of Twitter data that I found on the internet.<br><br>P.S. Please note that you may need a different approach to connect to or import the file, depending on the data source you're using.

In [3]:
pd.set_option('display.max_colwidth', None) #this code is intended to show the full data content in every column
data = pd.read_csv('sample.csv')
data.head()

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time (UTC),Tweet Content,Tweet Type,Client,Retweets Received,Likes Received,Tweet Location,Tweet Language,...,Name,Username,User Bio,Verified or Non-Verified,Profile URL,Protected or Non-protected,User Followers,User Following,User Account Creation Date,Impressions
0,"""1167429261210218497""",https://twitter.com/animalhealthEU/status/1167429261210218497,30 Aug 2019 13:30:00,Pets change our lives &amp; become a part of our families ❤️\nThat's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy &amp; healthy pet 🐱🐶\n#MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3,Tweet,Twitter Ads Composer,0,4,Brussels,English,...,AnimalhealthEurope,animalhealthEU,AnimalhealthEurope represents manufacturers of animal medicines in Europe #AnimalHealthMatters,Non-Verified,https://twitter.com/animalhealthEU,Non-Protected,3697,542,17 Dec 2012 09:14:15,7394
1,"""1167375334670557185""",https://twitter.com/PennyBrohnUK/status/1167375334670557185,30 Aug 2019 09:55:43,Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB,Tweet,Twitter Web App,0,5,"Pill, Bristol",English,...,Penny Brohn UK,PennyBrohnUK,"We help people live well with the impact of cancer through physical, psychological and emotional support. We rely on voluntary donations &amp; we need your support!",Non-Verified,https://twitter.com/PennyBrohnUK,Non-Protected,3227,1571,15 Sep 2010 09:44:02,6454
2,"""1167237977615097861""",https://twitter.com/lordbyronaf/status/1167237977615097861,30 Aug 2019 00:49:54,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA,ReTweet,Twitter for Android,0,0,"Ohio, USA",English,...,Lord ByronAF,lordbyronaf,"It's easier to be who you are, than it is to be who you think others want you to be. 18+ only.\nCultured Brute #NorseUp #Bearcats",Non-Verified,https://twitter.com/lordbyronaf,Non-Protected,7808,8617,25 Jul 2012 15:43:47,0
3,"""1167236897078480898""",https://twitter.com/CountessDavis/status/1167236897078480898,30 Aug 2019 00:45:37,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA,ReTweet,Twitter for Android,0,0,,English,...,Lisa Countess davis,CountessDavis,I am named after @ElvisPresley daughter Lisa Marie Presley\nI am nicknamed after @Dannycountkoker,Non-Verified,https://twitter.com/CountessDavis,Non-Protected,291,81,26 Jan 2017 18:21:42,0
4,"""1167228378191204353""",https://twitter.com/Local12/status/1167228378191204353,30 Aug 2019 00:11:46,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA,ReTweet,TweetDeck,0,0,"Cincinnati, OH",English,...,Local 12/WKRC-TV,Local12,Local 12 is #Cincinnati's trusted source for breaking news &amp; complete coverage from the Weather Authority! Add us on Snapchat: Local12,Verified,https://twitter.com/Local12,Non-Protected,198675,651,02 Sep 2008 20:09:44,0


## 1. Removing Data Duplicates

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;There are lots of duplicates in the data file. If we look at the raw data, the duplicated tweets were posted from different locations, which means they were posted by different accounts and (probably) different users. If the analysis cares about location, the duplicated tweets arguably should not be deleted; if the analysis only needs the tweet text and not the location or any other attribute, they can be deleted. In this case, I delete the duplicated tweets, because in this program we assume that we only need the tweets for NLP analysis or something similar.

In [4]:
new_data = data.drop_duplicates('Tweet Content',keep='first') #delete the duplicates by dropping them and store the result value to a new variable
new_data.head()

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time (UTC),Tweet Content,Tweet Type,Client,Retweets Received,Likes Received,Tweet Location,Tweet Language,...,Name,Username,User Bio,Verified or Non-Verified,Profile URL,Protected or Non-protected,User Followers,User Following,User Account Creation Date,Impressions
0,"""1167429261210218497""",https://twitter.com/animalhealthEU/status/1167429261210218497,30 Aug 2019 13:30:00,Pets change our lives &amp; become a part of our families ❤️\nThat's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy &amp; healthy pet 🐱🐶\n#MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3,Tweet,Twitter Ads Composer,0,4,Brussels,English,...,AnimalhealthEurope,animalhealthEU,AnimalhealthEurope represents manufacturers of animal medicines in Europe #AnimalHealthMatters,Non-Verified,https://twitter.com/animalhealthEU,Non-Protected,3697,542,17 Dec 2012 09:14:15,7394
1,"""1167375334670557185""",https://twitter.com/PennyBrohnUK/status/1167375334670557185,30 Aug 2019 09:55:43,Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB,Tweet,Twitter Web App,0,5,"Pill, Bristol",English,...,Penny Brohn UK,PennyBrohnUK,"We help people live well with the impact of cancer through physical, psychological and emotional support. We rely on voluntary donations &amp; we need your support!",Non-Verified,https://twitter.com/PennyBrohnUK,Non-Protected,3227,1571,15 Sep 2010 09:44:02,6454
2,"""1167237977615097861""",https://twitter.com/lordbyronaf/status/1167237977615097861,30 Aug 2019 00:49:54,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA,ReTweet,Twitter for Android,0,0,"Ohio, USA",English,...,Lord ByronAF,lordbyronaf,"It's easier to be who you are, than it is to be who you think others want you to be. 18+ only.\nCultured Brute #NorseUp #Bearcats",Non-Verified,https://twitter.com/lordbyronaf,Non-Protected,7808,8617,25 Jul 2012 15:43:47,0
6,"""1167163662051631104""",https://twitter.com/luapppank/status/1167163662051631104,29 Aug 2019 19:54:36,Will you be at #FIX19? Want a preview of @AG_EM33 story? Then check back Monday for #ChangeOfHeart where we sat down with Alin to discuss her new life since her heart #transplant #MoreThanMedicine https://t.co/Xl9zjr7kZ1,ReTweet,Twitter for iPhone,0,0,"Scottsdale, AZ",English,...,paul knapp,luapppank,16.2 (and rising) hdcp golfer. fairways and greens. I loathe mustard! extremely grateful heart transplant recipient. #lifeisgood #donatelife,Non-Verified,https://twitter.com/luapppank,Non-Protected,69,81,17 May 2009 12:58:43,0
12,"""1166892836165496835""",https://twitter.com/AndreaWestbyMD/status/1166892836165496835,29 Aug 2019 01:58:26,Y’all ⁦@UMNNMFamMedRes⁩ Rose Marie Leslie is a ⁦@mnstatefair⁩ blue ribbon winning vinaigrette maker #morethanmedicine #alltheskills https://t.co/QzY8wsq4DO,Tweet,Twitter for iPhone,0,27,"Minneapolis, MN",English,...,Andrea Westby,AndreaWestbyMD,"She/her/hers. Teacher of family medicine, former rural family doc. Runner yogi cyclist swimmer. Food enthusiast. Social &amp; repro justice advocate #kalefueledrage",Non-Verified,https://twitter.com/AndreaWestbyMD,Non-Protected,938,1247,15 Apr 2015 19:58:49,1876
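
Before dropping anything, it can help to check how many rows are duplicated and whether they really come from different locations. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are assumptions for illustration, only the column names come from the sample file):

```python
import pandas as pd

# Hypothetical stand-in for the real data, reusing the column names from sample.csv.
toy = pd.DataFrame({
    "Tweet Content": ["great team", "great team", "new story", "great team"],
    "Tweet Location": ["Ohio, USA", None, "Scottsdale, AZ", "Cincinnati, OH"],
})

# keep=False flags every member of a duplicate group, including the first one.
dupes = toy[toy.duplicated("Tweet Content", keep=False)]

# Count the distinct (non-missing) locations behind each duplicated tweet.
n_locations = dupes.groupby("Tweet Content")["Tweet Location"].nunique()

print(len(dupes))     # number of rows that belong to a duplicate group
print(n_locations)    # distinct locations per duplicated tweet
```

If the location counts matter for the analysis, this is the point to decide whether to keep the duplicates instead of dropping them.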


### Storing new sample data to a new csv file
***
We're going to store the new data in a new CSV file. Dropping rows leaves gaps in the row index (in the output above it jumps from 2 to 6 to 12), which breaks the data structure for the index-based loops used later. That's why we stored the result of dropping the duplicates in a new variable: we can save it to a new data file with `index = False` and read it back to get a clean, consecutive index for the next cleaning steps.

In [5]:
new_data.to_csv(r'new_sample.csv', index = False)

In [6]:
new_sample = pd.read_csv('new_sample.csv')
new_sample.head()

Unnamed: 0,Tweet Id,Tweet URL,Tweet Posted Time (UTC),Tweet Content,Tweet Type,Client,Retweets Received,Likes Received,Tweet Location,Tweet Language,...,Name,Username,User Bio,Verified or Non-Verified,Profile URL,Protected or Non-protected,User Followers,User Following,User Account Creation Date,Impressions
0,"""1167429261210218497""",https://twitter.com/animalhealthEU/status/1167429261210218497,30 Aug 2019 13:30:00,Pets change our lives &amp; become a part of our families ❤️\nThat's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy &amp; healthy pet 🐱🐶\n#MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3,Tweet,Twitter Ads Composer,0,4,Brussels,English,...,AnimalhealthEurope,animalhealthEU,AnimalhealthEurope represents manufacturers of animal medicines in Europe #AnimalHealthMatters,Non-Verified,https://twitter.com/animalhealthEU,Non-Protected,3697,542,17 Dec 2012 09:14:15,7394
1,"""1167375334670557185""",https://twitter.com/PennyBrohnUK/status/1167375334670557185,30 Aug 2019 09:55:43,Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB,Tweet,Twitter Web App,0,5,"Pill, Bristol",English,...,Penny Brohn UK,PennyBrohnUK,"We help people live well with the impact of cancer through physical, psychological and emotional support. We rely on voluntary donations &amp; we need your support!",Non-Verified,https://twitter.com/PennyBrohnUK,Non-Protected,3227,1571,15 Sep 2010 09:44:02,6454
2,"""1167237977615097861""",https://twitter.com/lordbyronaf/status/1167237977615097861,30 Aug 2019 00:49:54,What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA,ReTweet,Twitter for Android,0,0,"Ohio, USA",English,...,Lord ByronAF,lordbyronaf,"It's easier to be who you are, than it is to be who you think others want you to be. 18+ only.\nCultured Brute #NorseUp #Bearcats",Non-Verified,https://twitter.com/lordbyronaf,Non-Protected,7808,8617,25 Jul 2012 15:43:47,0
3,"""1167163662051631104""",https://twitter.com/luapppank/status/1167163662051631104,29 Aug 2019 19:54:36,Will you be at #FIX19? Want a preview of @AG_EM33 story? Then check back Monday for #ChangeOfHeart where we sat down with Alin to discuss her new life since her heart #transplant #MoreThanMedicine https://t.co/Xl9zjr7kZ1,ReTweet,Twitter for iPhone,0,0,"Scottsdale, AZ",English,...,paul knapp,luapppank,16.2 (and rising) hdcp golfer. fairways and greens. I loathe mustard! extremely grateful heart transplant recipient. #lifeisgood #donatelife,Non-Verified,https://twitter.com/luapppank,Non-Protected,69,81,17 May 2009 12:58:43,0
4,"""1166892836165496835""",https://twitter.com/AndreaWestbyMD/status/1166892836165496835,29 Aug 2019 01:58:26,Y’all ⁦@UMNNMFamMedRes⁩ Rose Marie Leslie is a ⁦@mnstatefair⁩ blue ribbon winning vinaigrette maker #morethanmedicine #alltheskills https://t.co/QzY8wsq4DO,Tweet,Twitter for iPhone,0,27,"Minneapolis, MN",English,...,Andrea Westby,AndreaWestbyMD,"She/her/hers. Teacher of family medicine, former rural family doc. Runner yogi cyclist swimmer. Food enthusiast. Social &amp; repro justice advocate #kalefueledrage",Non-Verified,https://twitter.com/AndreaWestbyMD,Non-Protected,938,1247,15 Apr 2015 19:58:49,1876
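
The CSV round trip is one way to get a consecutive index again. As a possible alternative, pandas can renumber the rows in memory with `reset_index(drop=True)`. A minimal sketch on made-up data (the `toy` frame is an assumption, not the sample file):

```python
import pandas as pd

# Hypothetical toy data with one duplicated tweet.
toy = pd.DataFrame({"Tweet Content": ["a", "b", "b", "c"]})

deduped = toy.drop_duplicates("Tweet Content", keep="first")
print(list(deduped.index))      # [0, 1, 3] -> label 2 is gone

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))   # [0, 1, 2]
```

With the gap still in place, a loop such as `for i in range(len(s)): s[i]` would fail on the missing label, because indexing a Series with an integer index is label-based.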


### Extract the tweets data

We assume that we only need the tweet text.

In [7]:
tweets = new_sample['Tweet Content'] #extract the tweets and store the values into a variable
tweets.head()

0    Pets change our lives &amp; become a part of our families ❤️\nThat's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy &amp; healthy pet 🐱🐶\n#MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3
1                                                                       Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB
2                                                                                                                                                                      What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA
3                                  Will you be at #FIX19? Want a preview of @AG_EM33 story? Then check back Monday for #ChangeOfHeart where we sat down with Alin to discuss her new life since her heart #transplant #MoreThanMedicine 

## 2. Decoding HTML Entities and Cleaning Newlines

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So far, we have the tweets, but they still contain encoded HTML entities and newlines (\n), so we need to decode or clean those first.

In [8]:
for i in range(len(tweets)):
    x = tweets[i].replace("\n"," ") #replace newline "\n" with a space
    tweets[i] = html.unescape(x) #decode html entities such as &amp; into regular characters
tweets.head()

0    Pets change our lives & become a part of our families ❤️ That's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy & healthy pet 🐱🐶 #MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3
1                                                             Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB
2                                                                                                                                                            What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA
3                        Will you be at #FIX19? Want a preview of @AG_EM33 story? Then check back Monday for #ChangeOfHeart where we sat down with Alin to discuss her new life since her heart #transplant #MoreThanMedicine https://t.co/Xl9zjr7kZ1
4               
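
The same step can also be written as a small function, which makes it easy to try on a single string. This is only a sketch: `decode_tweet` is a hypothetical helper name, not something from the original program.

```python
import html

def decode_tweet(text):
    """Replace newlines with spaces, then decode HTML entities."""
    return html.unescape(text.replace("\n", " "))

print(decode_tweet("Pets change our lives &amp; families\nThat's why"))
# Pets change our lives & families That's why

# On a pandas Series of strings, the helper could be applied with:
#     tweets = tweets.map(decode_tweet)
```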

## 3. Removing Unnecessary Elements Using Regular Expressions (Regex)
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In this step we remove links from the tweets, as well as hash characters, usernames, and punctuation, because those things are usually not needed and could disrupt the analysis by skewing the scores and leading to inaccurate results later on, so we're going to get rid of them! 😉

In [9]:
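for i in range(len(tweets)):
    #remove usernames, links, and every character that is not a letter, digit, underscore or whitespace (this includes "#")
    tweets[i] = re.sub(r"(@[A-Za-z0-9_]+)|[^\w\s]|#|http\S+", "", tweets[i])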
tweets.head()

0    Pets change our lives  become a part of our families  Thats why our members offer many solutions to help you to enjoy a longlasting bond with your happy  healthy pet  MorethanMedicine PetCare PetsareFamily 
1                                                       Another spot of our morethanmedicine bus in bristol this week If you need support with your cancer diagnosis call us on 0303 3000 118 livingwellwithcancer 
2                                                                                                                                                                             What a great team   morethanmedicine 
3                           Will you be at FIX19 Want a preview of  story Then check back Monday for ChangeOfHeart where we sat down with Alin to discuss her new life since her heart transplant MoreThanMedicine 
4                                                                                                                Yall  Rose Marie Leslie is a  blue ribb
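
To make the pattern easier to read, here is a sketch that spells out its parts with `re.VERBOSE`. The standalone `#` alternative is left out because `[^\w\s]` already covers it. `strip_noise` and its `collapse` option are hypothetical additions; collapsing the leftover double spaces is not part of the original step.

```python
import re

NOISE = re.compile(
    r"""
      http\S+          # links: "http" followed by any non-space characters
    | @[A-Za-z0-9_]+   # usernames (mentions)
    | [^\w\s]          # anything that is not a word character or whitespace
    """,
    re.VERBOSE,
)

def strip_noise(text, collapse=True):
    """Remove links, usernames and punctuation from one tweet."""
    cleaned = NOISE.sub("", text)
    if collapse:
        # Optional extra: squeeze the gaps left behind by removed items.
        cleaned = " ".join(cleaned.split())
    return cleaned

print(strip_noise("What a great team @Local12 #morethanmedicine https://t.co/g2YzMDUpVA"))
# What a great team morethanmedicine
print(strip_noise("That's a long-lasting bond!"))
# Thats a longlasting bond
```

Note the side effect visible in the second example: removing apostrophes and hyphens glues words together ("Thats", "longlasting"), which is also what happens in the output above.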

## 4. Removing Stopwords
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;According to Wikipedia, stop words are words that are filtered out before or after the processing of natural language data. Stopwords are removed from the text so that more focus can be given to the words that carry its meaning.

* **Preparing stopwords**
<br><br>Notice that in this step we take "not" out of the stopword list, because "not" is crucial to the literal meaning or context of the text: "not good" obviously doesn't mean the same as "good".

In [10]:
tweets_to_token = tweets.copy() #work on a copy so that tweets keeps the cleaned strings
sw = stopwords.words('english') #load the english stopwords list
sw.remove('not') #remove 'not' from stopwords

* **Tokenize tweets**
<br><br>Break the tweets into words.

In [11]:
for i in range(len(tweets_to_token)):
    tweets_to_token[i] = word_tokenize(tweets_to_token[i])

* **Removing stopwords from the tweets**
<br><br>Filter the stopwords out of the data.

In [12]:
for i in range(len(tweets_to_token)):
    tweets_to_token[i] = [word for word in tweets_to_token[i] if word not in sw] #keep only the words that are not stopwords

tweets_to_token

0                                   [Pets, change, lives, become, part, families, Thats, members, offer, many, solutions, help, enjoy, longlasting, bond, happy, healthy, pet, MorethanMedicine, PetCare, PetsareFamily]
1                                                                           [Another, spot, morethanmedicine, bus, bristol, week, If, need, support, cancer, diagnosis, call, us, 0303, 3000, 118, livingwellwithcancer]
2                                                                                                                                                                                  [What, great, team, morethanmedicine]
3                                                               [Will, FIX19, Want, preview, story, Then, check, back, Monday, ChangeOfHeart, sat, Alin, discuss, new, life, since, heart, transplant, MoreThanMedicine]
4                                                                                                                 [Yall, Rose, Marie
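
In the output above, capitalized words such as "What", "If", "Will" and "Then" survive the filter. The likely reason is that the comparison is case-sensitive while the stopword list is lowercase. If that is not wanted, one option is to compare the lowercase form of each token. The sketch below uses a tiny made-up stopword set (`demo_sw` is an assumption standing in for the NLTK list) and a hypothetical helper `remove_stopwords`:

```python
def remove_stopwords(tokens, stopword_set):
    """Drop tokens whose lowercase form is in stopword_set."""
    return [tok for tok in tokens if tok.lower() not in stopword_set]

# Tiny stand-in for the NLTK English list (illustration only).
demo_sw = {"what", "a", "if", "you", "will", "then", "not"}
demo_sw.discard("not")  # keep the negation, as in the step above

print(remove_stopwords(["What", "a", "great", "team"], demo_sw))
# ['great', 'team']
print(remove_stopwords(["If", "you", "not", "Will"], demo_sw))
# ['not']
```

With the real list, `set(sw)` could be passed as `stopword_set`; a set also makes each membership check faster than a list. The original casing of the kept tokens is preserved.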

#### Comparing data before and after cleaning
This is a comparison of the data before and after cleaning. The data after cleaning is free of HTML entities, links, and other unnecessary characters. Notice that the data before cleaning still contains duplicates, which is why we see the same tweet in rows 2, 3, and 4 there. Also notice that each cleaned entry is a list of words: in the analysis we will work with the data word by word rather than sentence by sentence, which is why we tokenized it earlier.

In [13]:
print("-----------------------------------------------------\n          DATA BEFORE CLEANING:\n-----------------------------------------------------")
data['Tweet Content'].head()

-----------------------------------------------------
          DATA BEFORE CLEANING:
-----------------------------------------------------


0    Pets change our lives &amp; become a part of our families ❤️\nThat's why our members offer many solutions to help you to enjoy a long-lasting bond with your happy &amp; healthy pet 🐱🐶\n#MorethanMedicine #PetCare #PetsareFamily https://t.co/fZNIXge9a3
1                                                                       Another spot of our #morethanmedicine bus in #bristol this week! If you need support with your cancer diagnosis call us on 0303 3000 118. #livingwellwithcancer https://t.co/eZGLz0BkXB
2                                                                                                                                                                      What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine https://t.co/g2YzMDUpVA
3                                                                                                                                                                      What a great team ⁦@HealthSourceOH⁩ ⁦@Local12⁩ #morethanmedicine 

In [14]:
print("-----------------------------------------------------\n            DATA AFTER CLEANING:\n-----------------------------------------------------")
tweets_to_token.head()

-----------------------------------------------------
            DATA AFTER CLEANING:
-----------------------------------------------------


0    [Pets, change, lives, become, part, families, Thats, members, offer, many, solutions, help, enjoy, longlasting, bond, happy, healthy, pet, MorethanMedicine, PetCare, PetsareFamily]
1                                            [Another, spot, morethanmedicine, bus, bristol, week, If, need, support, cancer, diagnosis, call, us, 0303, 3000, 118, livingwellwithcancer]
2                                                                                                                                                   [What, great, team, morethanmedicine]
3                                [Will, FIX19, Want, preview, story, Then, check, back, Monday, ChangeOfHeart, sat, Alin, discuss, new, life, since, heart, transplant, MoreThanMedicine]
4                                                                                  [Yall, Rose, Marie, Leslie, blue, ribbon, winning, vinaigrette, maker, morethanmedicine, alltheskills]
Name: Tweet Content, dtype: object