# Part 2: Data Pre-Processing

Author: `Arushi Bhandari`

This notebook details the pre-processing of the merged data.

In [1]:
# importing required packages
import pandas as pd
import re
import pickle

In [2]:
tweets = pd.read_csv("tweetswusers.csv", index_col=0)

In [3]:
tweets.head()

Unnamed: 0,level_0,tweetUrl,renderedContent,userId,source,media,username,displayname,rawDescription,verified,followersCount,location,protected,profileUrl
0,0,https://twitter.com/PSGrewal2/status/135421733...,@amaanbali Pray for their recovery 🙏🏼\n\n#Farm...,1576671000.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,PSGrewal2,P.S. Grewal,#WeTheNorth | #LeafsForever | #TFCLive,False,966,"Brampton, Ontario",False,https://twitter.com/PSGrewal2
1,1,https://twitter.com/PSGrewal2/status/135421126...,@HarvKudos What was he even thinking? 🤡\n\n#Fa...,1576671000.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,PSGrewal2,P.S. Grewal,#WeTheNorth | #LeafsForever | #TFCLive,False,966,"Brampton, Ontario",False,https://twitter.com/PSGrewal2
2,2,https://twitter.com/PSGrewal2/status/135420534...,@RaviSinghKA You owned him #FarmersProtest htt...,1576671000.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'thumbnailUrl': 'https://pbs.twimg.com/tweet...,PSGrewal2,P.S. Grewal,#WeTheNorth | #LeafsForever | #TFCLive,False,966,"Brampton, Ontario",False,https://twitter.com/PSGrewal2
3,3,https://twitter.com/PSGrewal2/status/135405595...,Look what's trending number 1 in Canada right ...,1576671000.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",[{'previewUrl': 'https://pbs.twimg.com/media/E...,PSGrewal2,P.S. Grewal,#WeTheNorth | #LeafsForever | #TFCLive,False,966,"Brampton, Ontario",False,https://twitter.com/PSGrewal2
4,4,https://twitter.com/PSGrewal2/status/135401934...,@JaskaranSandhu_ Beautiful sight to see 🙏🏽\n\n...,1576671000.0,"<a href=""https://mobile.twitter.com"" rel=""nofo...",,PSGrewal2,P.S. Grewal,#WeTheNorth | #LeafsForever | #TFCLive,False,966,"Brampton, Ontario",False,https://twitter.com/PSGrewal2


In [4]:
tweets.loc[1]['renderedContent']

'@HarvKudos What was he even thinking? 🤡\n\n#FarmersProtest'

### Text Cleaning and Preparation

##### Special character cleaning

I replace each of the following with a single space:

`\r`   
`\n`   
runs of four spaces  

In [5]:
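# Replace carriage returns, newlines and runs of four spaces with a single space
tweets['parsed'] = tweets['renderedContent'].str.replace("\r", " ", regex=False)
tweets['parsed'] = tweets['parsed'].str.replace("\n", " ", regex=False)
tweets['parsed'] = tweets['parsed'].str.replace("    ", " ", regex=False)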

In [6]:
tweets.loc[1]['parsed']

'@HarvKudos What was he even thinking? 🤡  #FarmersProtest'
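
The output above still contains a double space where the two newlines used to be. A minimal sketch of one way to collapse such runs, assuming it is acceptable to treat every run of whitespace as a single space (the helper name `collapse_whitespace` is hypothetical and not part of the original notebook):

```python
import re

def collapse_whitespace(text):
    # Replace every run of whitespace (spaces, tabs, carriage returns,
    # newlines) with one space, then trim the ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "@HarvKudos What was he even thinking? 🤡\n\n#FarmersProtest"
print(collapse_whitespace(sample))
```

If this behaviour were wanted in the notebook, the function could be applied to the column with `tweets['renderedContent'].apply(collapse_whitespace)`.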

I then make the text lowercase.

In [7]:
# Lowercasing the text
tweets['lower'] = tweets['parsed'].str.lower()

In [8]:
tweets.loc[1]['lower']

'@harvkudos what was he even thinking? 🤡  #farmersprotest'

Next, I remove punctuation.

In [9]:
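# Characters to strip; '@' and '#' are not in the list, so mentions and hashtags are kept
punctuation_signs = list("!()-[]{};:'\\,<>?$%^&*_~")
tweets['nopunc'] = tweets['lower']

for punct_sign in punctuation_signs:
    # regex=False treats characters such as '(' and '?' literally
    tweets['nopunc'] = tweets['nopunc'].str.replace(punct_sign, '', regex=False)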

In [10]:
tweets.loc[1]['nopunc']

'@harvkudos what was he even thinking 🤡  #farmersprotest'
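
The loop above makes one pass over the column per punctuation character. As an alternative sketch, assuming the same character list, a translation table can delete all of them in a single pass over each string (the names `PUNCTUATION`, `REMOVE_PUNCT` and `strip_punctuation` are hypothetical):

```python
# Same characters as the notebook's punctuation list, with the backslash
# escaped explicitly.
PUNCTUATION = "!()-[]{};:'\\,<>?$%^&*_~"

# Translation table that maps each punctuation character to None (delete it).
REMOVE_PUNCT = str.maketrans("", "", PUNCTUATION)

def strip_punctuation(text):
    # str.translate drops every character listed in the table in one pass.
    return text.translate(REMOVE_PUNCT)

print(strip_punctuation("@harvkudos what was he even thinking? 🤡  #farmersprotest"))
```

Because `@` and `#` are not in the list, mentions and hashtags survive, which matches the notebook's output.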

##### Dealing with emojis

I convert emojis to their text descriptions, since this text can help in understanding the tweet.

In [11]:
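# Load the emoji dictionary and invert it so that each emoji maps to its name
with open('Emoji_Dict.p', 'rb') as fp:
    Emoji_Dict = pickle.load(fp)
Emoji_Dict = {v: k for k, v in Emoji_Dict.items()}

def convert_emojis(text):
    for emot in Emoji_Dict:
        # re.escape guards against emojis that contain regex metacharacters
        text = re.sub(re.escape(emot), "_".join(Emoji_Dict[emot].replace(",","").replace(":","").split()), text)
    return text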

text = "I won 🥇 in 🏏"
convert_emojis(text)

'I won 1st_place_medal in cricket'

In [12]:
tweets['emojis'] = tweets['nopunc'].apply(convert_emojis)

In [13]:
tweets.loc[1]['emojis']

'@harvkudos what was he even thinking clown_face  #farmersprotest'
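
`convert_emojis` runs one substitution per dictionary entry for every tweet. A minimal sketch of a single-pass variant, assuming the inverted dictionary maps each emoji to a `:name:` string; the two-entry `EMOJI_TO_NAME` below is a hypothetical stand-in for the real `Emoji_Dict`, and `emoji_to_text` is not part of the original notebook:

```python
import re

# Hypothetical stand-in for the inverted Emoji_Dict (emoji -> ":name:");
# the real dictionary is loaded from Emoji_Dict.p in the notebook.
EMOJI_TO_NAME = {"🥇": ":1st_place_medal:", "🤡": ":clown_face:"}

# One alternation over all emojis. Longest keys come first so multi-codepoint
# emojis are tried before their prefixes; re.escape handles metacharacters.
EMOJI_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_TO_NAME, key=len, reverse=True))
)

def emoji_to_text(text):
    def _name(match):
        raw = EMOJI_TO_NAME[match.group(0)]
        # Same clean-up as convert_emojis: drop commas and colons,
        # then join the remaining words with underscores.
        return "_".join(raw.replace(",", "").replace(":", "").split())
    return EMOJI_PATTERN.sub(_name, text)

print(emoji_to_text("what was he even thinking 🤡"))
```

Compiling the pattern once means each tweet is scanned a single time, which may matter when the dictionary holds many entries.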

### Saving Datasets
After pre-processing, I export the dataset to be used for modelling.

In [14]:
tweets.to_csv('cleaned.csv', encoding='utf-8')

I also store a random sample of 800 tweets to label manually, which creates a training dataset for the model.

In [15]:
#randomsample = tweets.sample(n=800)
#randomsample = randomsample.reset_index()
#randomsample.to_csv('toclassify2.csv', encoding='utf-8')
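
The sampling cell is commented out, presumably so that re-running the notebook does not draw a different set of 800 tweets. A seeded draw is another way to keep the selection stable. The sketch below shows the idea with the standard library only; the row count, the seed and the helper name `sample_row_indices` are all hypothetical. In pandas, `DataFrame.sample` accepts a `random_state` argument that serves the same purpose.

```python
import random

def sample_row_indices(n_rows, k, seed):
    # A private generator seeded explicitly: the same seed always yields
    # the same k distinct row positions, and global random state is untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), k))

# Hypothetical numbers: 10,000 rows in the cleaned data, seed 42.
picked = sample_row_indices(n_rows=10_000, k=800, seed=42)
print(len(picked), picked == sample_row_indices(10_000, 800, 42))
```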