<a href="https://colab.research.google.com/github/YagyanshB/SemEval-Task6-CS408/blob/main/Pre_Processing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-Processing of the OLID 2019 Dataset:

---

This document walks through the steps taken to pre-process and cleanse the OLID 2019 dataset, obtained from the CodaLab competition. The link to the competition is [here](https://competitions.codalab.org/competitions/20011#learn_the_details).

Every step is explained in detail, and it is strongly recommended to follow them in order. We also recommend running the code in Google Colab because of its user-friendly environment and the services it offers.

# Importing Required Libraries:

---

To carry out data cleansing and data wrangling, we first import the required libraries into our development environment.

In [570]:
import pandas as pd
import numpy as np
import spacy 
from spacy.lang.en.stop_words import STOP_WORDS as stop

Since the data was uploaded to my personal Google Drive, the path to the files may differ for you. It is therefore strongly recommended to store the files on your own machine or Drive and adjust the paths before running the source code.

# Uploading & Exploring Dataset in Google Colab:

---

In this section we load the datasets downloaded from the CodaLab competition. It is important to specify the correct file paths so that Google Colab can find the files.

In [571]:
# Mount Google Drive in Google Colab. The files are already uploaded to my Google Drive,
# which makes this step convenient for me. If you run the program, be sure to place the
# dataset files on your own Google Drive as well.

from google.colab import drive
drive.mount('/content/drive')

train_file = 'drive/My Drive/olid-training-v1.0.tsv'

test_file_a = 'drive/My Drive/testset-levela.tsv' 
test_labels_a = 'drive/My Drive/labels-levela.csv' 

test_labels_b = 'drive/My Drive/labels-levelb.csv' 
test_file_b = 'drive/My Drive/testset-levelb.tsv' 

test_file_c = 'drive/My Drive/testset-levelc.tsv' 
test_labels_c = 'drive/My Drive/labels-levelc.csv' 

# In the code below, we load the data again to check that our files have been
# mounted correctly. The path to the directory must be set correctly;
# otherwise the files won't be loaded.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
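
Because the Drive location may differ between users, one option is to keep the base directory in a single variable and derive every file path from it. The sketch below assumes a hypothetical `DATA_DIR` variable and a hypothetical helper `olid_paths`; only the file names are taken from the cell above.

```python
from pathlib import Path

# Hypothetical base directory; change this one line to match your own setup.
DATA_DIR = Path("drive/My Drive")

def olid_paths(base_dir):
    """Return the OLID file paths used in this notebook, relative to base_dir."""
    base_dir = Path(base_dir)
    paths = {"train": base_dir / "olid-training-v1.0.tsv"}
    for level in ("a", "b", "c"):
        paths[f"test_{level}"] = base_dir / f"testset-level{level}.tsv"
        paths[f"labels_{level}"] = base_dir / f"labels-level{level}.csv"
    return paths

paths = olid_paths(DATA_DIR)

# List any files that are missing before trying to read them.
missing = [name for name, path in paths.items() if not path.exists()]
print("Missing files:", missing)
```

Checking for missing files up front gives a clearer message than a failed read later in the notebook.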


In [572]:
import pandas as pd

df = pd.read_csv(train_file, sep='\t')  # the training file is tab-separated

print(df)  # Skim the first and last rows of our dataset.

          id  ... subtask_c
0      86426  ...       NaN
1      90194  ...       IND
2      16820  ...       NaN
3      62688  ...       NaN
4      43605  ...       NaN
...      ...  ...       ...
13235  95338  ...       IND
13236  67210  ...       NaN
13237  82921  ...       OTH
13238  27429  ...       NaN
13239  46552  ...       NaN

[13240 rows x 5 columns]


In [573]:
df.shape  # The dimensions of our dataset: 13240 rows across 5 columns.

(13240, 5)

# Categorisation of Tweets According to Sub-Tasks:

---

After loading the OLID 2019 dataset, we take a high-level look at how the tweets are classified across the three sub-tasks.

In [574]:
df['subtask_a'].value_counts() # Categorisation of tweets into Offensive and Not Offensive Category

NOT    8840
OFF    4400
Name: subtask_a, dtype: int64

In [575]:
df['subtask_b'].value_counts() # Categorisation of tweets into Targeted Insult and Untargeted Insult.

TIN    3876
UNT     524
Name: subtask_b, dtype: int64

In [576]:
df['subtask_c'].value_counts() # Categorisation of tweets into Offense targeted towards Individual, Group or Other Category.

IND    2407
GRP    1074
OTH     395
Name: subtask_c, dtype: int64
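
The counts above are consistent with a hierarchical labelling: `subtask_b` appears to be filled in only for tweets marked `OFF`, and `subtask_c` only for tweets marked `TIN`. A quick arithmetic check, using only the numbers printed above:

```python
# Label counts copied from the value_counts() outputs above.
subtask_a = {"NOT": 8840, "OFF": 4400}
subtask_b = {"TIN": 3876, "UNT": 524}
subtask_c = {"IND": 2407, "GRP": 1074, "OTH": 395}

total = sum(subtask_a.values())
print(total)                                         # 13240
print(sum(subtask_b.values()) == subtask_a["OFF"])   # True
print(sum(subtask_c.values()) == subtask_b["TIN"])   # True

# Share of each class in subtask_a, to show the class imbalance.
shares = {label: round(count / total, 3) for label, count in subtask_a.items()}
print(shares)                                        # {'NOT': 0.668, 'OFF': 0.332}
```

Roughly two thirds of the tweets are labelled `NOT`, which is worth keeping in mind when evaluating a classifier on this data.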

# Lowercasing All Tweets Present:

---

As human beings, we can express different emotions by capitalising words in text messages or tweets. A machine cannot read emotion from letter case in the same way, so we lowercase all the tweets in our OLID 2019 dataset.

In [577]:
df.sample(5)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
5191,33006,@USER @USER He is probably dying due to dog fu...,NOT,,
5128,27384,"@USER OK, on second thought, Antifa wasn't eve...",NOT,,
4651,63857,@USER From the liberal media.....Yea...we shou...,NOT,,
10359,63019,@USER Funny. She had no problem sleeping her ...,NOT,,
5195,50275,@USER Look at how shamelessly liberals lie to ...,OFF,TIN,GRP


In [578]:
x = 'LOCKDOWN SUCKS'

In [579]:
x.lower()

'lockdown sucks'

In [580]:
df['processed_tweet'] = df['tweet'].apply(lambda x: str(x).lower())

In [581]:
df.sample(5)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c,processed_tweet
11154,31511,#boom @USER maybe that book deal isnt such a g...,NOT,,,#boom @user maybe that book deal isnt such a g...
3808,95018,@USER He will never live this down! You know h...,NOT,,,@user he will never live this down! you know h...
8261,36520,.@USER won this round. Yucker ALWAYS interrupt...,OFF,TIN,IND,.@user won this round. yucker always interrupt...
12864,82767,@USER bakkt is doing what an ETF would have do...,OFF,TIN,IND,@user bakkt is doing what an etf would have do...
9766,92852,@USER @USER @USER @USER @USER @USER @USER @USE...,OFF,UNT,,@user @user @user @user @user @user @user @use...
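
One detail of `str(x).lower()` is that non-string values are converted to text first, so a missing value would become the string `'nan'`. A minimal sketch of a hypothetical helper, `lower_text`, that lowercases strings and passes anything else through unchanged:

```python
def lower_text(value):
    """Lowercase strings; return any other value (e.g. NaN or None) unchanged."""
    return value.lower() if isinstance(value, str) else value

print(lower_text("LOCKDOWN SUCKS"))   # lockdown sucks
print(str(float("nan")).lower())      # nan  (what str(x).lower() gives for a missing value)
print(lower_text(None))               # None
```

It could be used in place of the lambda, for example `df['tweet'].apply(lower_text)`, if missing tweets should stay missing.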


# Expanding Contracted Words:

---

A contraction is a shortened form of a word (or group of words) that omits certain letters or sounds. In most contractions, an apostrophe represents the missing letters. The most common contractions are made up of verbs, auxiliaries, or modals attached to other words: He would=He'd. I have=I've.

For our model to understand what a tweet means, it helps to spell out the words that each sentence is built from. For this purpose, we expand the contractions that we as users so conveniently write.

In [582]:
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and ",
"you're": "you are"}


In [583]:
x = " i'm don't he'll" # I am do not he will

In [584]:
def cont_to_exp(x):
    # Expand every contraction found in x; non-string values are returned unchanged.
    if isinstance(x, str):
        for key, value in contractions.items():
            x = x.replace(key, value)
    return x

In [585]:
cont_to_exp(x)

' i am do not he will'
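
Two details of plain `str.replace` in a loop are worth noting. Keys are applied in dictionary order, so `"can't"` is replaced before `"can't've"` gets a chance to match, and a typographic apostrophe (as in `you’re`) does not match the ASCII `'` used in the keys. The sketch below is one possible alternative, not the notebook's method: a hypothetical `build_expander` helper that matches the longest keys first and normalises the apostrophe. It uses a small illustrative subset of the mapping.

```python
import re

# Illustrative subset; the full `contractions` dict could be passed in instead.
sample_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "you're": "you are",
    "i'm": "i am",
}

def build_expander(mapping):
    """Return a function that expands the contractions in `mapping`."""
    # Longest keys first, so "can't've" wins over its prefix "can't".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def expand(text):
        if not isinstance(text, str):
            return text
        # Map the typographic apostrophe (U+2019) to the ASCII one used in the keys.
        text = text.replace("\u2019", "'")
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return expand

expand = build_expander(sample_contractions)
print(expand("i'm sure you\u2019re right, it can't've happened"))
# i am sure you are right, it cannot have happened
```

Compiling a single pattern also means each tweet is scanned once rather than once per dictionary key.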

In [586]:
%%timeit
df['processed_tweet'] = df['processed_tweet'].apply(lambda x: cont_to_exp(x))

1 loop, best of 5: 353 ms per loop


In [587]:
df.sample(5)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c,processed_tweet
10151,96943,@USER @USER @USER @USER @USER @USER so stronge...,NOT,,,@user @user @user @user @user @user so stronge...
1496,67365,@USER @USER I have. Its horse dung. She can't ...,NOT,,,@user @user i have. its horse dung. she cannot...
3901,46418,"@USER Well, there's that one idiot you keep th...",OFF,TIN,IND,"@user well, there is that one idiot you keep t..."
11017,86899,@USER He is spending your contributions. Vote ...,NOT,,,@user he is spending your contributions. vote ...
12307,24912,@USER @USER He/she deserves those tattoos. Hop...,NOT,,,@user @user he/she deserves those tattoos. hop...


# Removal of @USER Mentions:

---



In [590]:
df['processed_tweet']

0         she should ask a few native americans what th...
1          go home you’re drunk!!!  #maga #trump2020 👊🇺...
2        amazon is investigating chinese employees who ...
3         someone should havetaken" this piece of shit ...
4          obama wanted liberals &amp; illegals to move...
                               ...                        
13235     sometimes i get strong vibes from people and ...
13236    benidorm ✅  creamfields ✅  maga ✅   not too sh...
13237     and why report this garbage.  we do not give ...
13238                                                pussy
13239    #spanishrevenge vs. #justice #humanrights and ...
Name: processed_tweet, Length: 13240, dtype: object

In [591]:
# The processed tweets are lowercased, so we match the lowercase '@user'.
# 12274 tweets contain @user mentions. These mentions are irrelevant for
# training our machine learning model, so we remove them.
df[df['processed_tweet'].str.contains('@user')]  # select the tweets that contain @user mentions
df['processed_tweet'] = df['processed_tweet'].str.replace('@user', '', regex=False)  # remove '@user' tokens
df['processed_tweet'] = df['processed_tweet'].str.replace('user', '', regex=False)  # remove any leftover 'user' text (this also strips it from inside longer words)

In [592]:
test_1 = df['processed_tweet'].str.contains('@user')
test_1a = df['processed_tweet'].str.contains('user')

test_1.value_counts() 
test_1a.value_counts()

# On testing, we see that no tweets contain any @user mentions.
# This shows that our tweets have completed this processing step.


False    13240
Name: processed_tweet, dtype: int64
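
Replacing the bare substring `user` also removes it from inside longer words. As a hedged alternative, the sketch below uses a regular expression with a word boundary so that only the `@user` mention itself is removed. `strip_mentions` is a hypothetical helper, not part of the notebook.

```python
import re

# '@user' followed by a word boundary, so '@username' would not match.
MENTION = re.compile(r"@user\b")

def strip_mentions(text):
    """Remove '@user' mentions and collapse the leftover whitespace."""
    return " ".join(MENTION.sub("", text).split())

sample = "@user @user many users agree"
print(strip_mentions(sample))                             # many users agree
print(sample.replace("@user", "").replace("user", ""))    # '  many s agree'
```

It could be applied with `df['processed_tweet'].apply(strip_mentions)`.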

# Removal of URLs:

---



In [593]:
import re  # Import the regular expression library

In [594]:
df['processed_tweet'] = df['processed_tweet'].str.replace('url', '', regex=False)  # remove 'url' tokens

In [595]:
df['processed_tweet'].head(10)

0     she should ask a few native americans what th...
1      go home you’re drunk!!!  #maga #trump2020 👊🇺🇸👊 
2    amazon is investigating chinese employees who ...
3     someone should havetaken" this piece of shit ...
4      obama wanted liberals &amp; illegals to move...
5                          liberals are all kookoo !!!
6                                 oh noes! tough shit.
7     was literally just talking about this lol all...
8                                 buy more icecream!!!
9     canada doesn’t need another cuck! we already ...
Name: processed_tweet, dtype: object
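
A substring replacement removes `url` wherever it occurs, including inside other words. A minimal sketch of a token-level alternative, assuming the placeholder always appears as its own whitespace-separated token (`drop_tokens` is a hypothetical helper):

```python
def drop_tokens(text, unwanted=("url",)):
    """Drop whole whitespace-separated tokens listed in `unwanted`."""
    return " ".join(tok for tok in text.split() if tok not in unwanted)

sample = "check this curly url"
print(drop_tokens(sample))          # check this curly
print(sample.replace("url", ""))    # 'check this cy '
```

A token with attached punctuation, such as `url.`, would not be dropped by this helper, so it fits best after punctuation has been handled.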

# Removal of Emojis Promoting Profanity:

---



In [596]:
emoji_dict = {'❤️': ' love ',
                       '❤️' : ' love ',
                       '❤' : ' love ',
                       '😘' : ' kisses ',
                      '😭' : ' cry ',
                      '💪' : ' strong ',
                      '🌍' : ' earth ',
                      '💰' : ' money ',
                      '👍' : ' ok ',
                       '👌' : ' ok ',
                      '😡' : ' angry ',
                      '🍆' : ' dick ',
                      '🤣' : ' haha ',
                      '😂' : ' haha ',
                      '🖕' : ' fuck you ',
                      '👊' : ' punch you',}

In [597]:
# Remove each of the listed emojis from the processed tweets, in this order.
emojis_to_remove = ['😭', '❤', '🤣', '🍆', '🖕', '😂', '😡', '👌', '👍',
                    '💰', '🌍', '💪', '😘', '❤️', '❤️', '👊', '🇺🇸', '✅']

for emoji in emojis_to_remove:
    df['processed_tweet'] = df['processed_tweet'].str.replace(emoji, '', regex=False)

In [598]:
df['processed_tweet'].head(10)

0     she should ask a few native americans what th...
1          go home you’re drunk!!!  #maga #trump2020  
2    amazon is investigating chinese employees who ...
3     someone should havetaken" this piece of shit ...
4      obama wanted liberals &amp; illegals to move...
5                          liberals are all kookoo !!!
6                                 oh noes! tough shit.
7     was literally just talking about this lol all...
8                                 buy more icecream!!!
9     canada doesn’t need another cuck! we already ...
Name: processed_tweet, dtype: object
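
The `emoji_dict` defined above maps emojis to words, while the cell that follows it deletes the emojis outright. If the aim were instead to keep each emoji's meaning as text, a minimal sketch could look like the following. It uses a small illustrative subset of the mapping and a hypothetical helper, `emojis_to_words`.

```python
# Illustrative subset of the emoji-to-word mapping.
emoji_words = {"😂": " haha ", "😡": " angry ", "👍": " ok "}

def emojis_to_words(text, mapping):
    """Replace each emoji in `mapping` with its word and tidy the whitespace."""
    for emoji, word in mapping.items():
        text = text.replace(emoji, word)
    return " ".join(text.split())

print(emojis_to_words("so funny 😂😂 👍", emoji_words))   # so funny haha haha ok
```

Whether to delete or translate emojis is a modelling choice; translating keeps a signal that may matter for offensive-language detection.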

# Removal of Special Characters

---



In [599]:
df.sample(5)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c,processed_tweet
1482,94619,@USER melbourne didnt like me can I join you i...,NOT,,,melbourne didnt like me can i join you instea...
6598,67185,There is only ONE reason why Christine Blasey ...,NOT,,,there is only one reason why christine blasey ...
8844,46787,@USER Dear Andrew I saw the endorsement by Cor...,OFF,TIN,IND,dear andrew i saw the endorsement by cory boo...
13123,94883,@USER We need to stop expecting liberals to ac...,NOT,,,we need to stop expecting liberals to act rea...
11008,12496,@USER @USER @USER @USER @USER @USER @USER @USE...,NOT,,,...


In [600]:
x = 'thanks for proving democrats can’t let thei...'

In [601]:
re.sub(r'[^a-zA-Z0-9\s]', "", x)  # keep only letters, digits and whitespace

'thanks for proving democrats cant let thei'

In [602]:
df['processed_tweet'] = df['processed_tweet'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', "", x))  # keep only letters, digits and whitespace

In [603]:
df['processed_tweet'].head(10)

0     she should ask a few native americans what th...
1                go home youre drunk  maga trump2020  
2    amazon is investigating chinese employees who ...
3     someone should havetaken this piece of shit t...
4      obama wanted liberals amp illegals to move i...
5                             liberals are all kookoo 
6                                   oh noes tough shit
7     was literally just talking about this lol all...
8                                    buy more icecream
9     canada doesnt need another cuck we already ha...
Name: processed_tweet, dtype: object
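
A common slip in character classes like the one above is writing `A-z` in place of `A-Z`. The range `A-z` also covers the ASCII characters that sit between `Z` and `a`, namely `[`, `\`, `]`, `^`, `_` and the backtick, so those characters would survive the clean-up. A short comparison:

```python
import re

loose = re.compile(r"[^a-zA-z0-9\s]")    # 'A-z' spans more than the letters
strict = re.compile(r"[^a-zA-Z0-9\s]")   # letters, digits and whitespace only

sample = "snake_case [tag] ^caret"
print(loose.sub("", sample))    # snake_case [tag] ^caret
print(strict.sub("", sample))   # snakecase tag caret
```

Using a raw string (`r'...'`) also avoids Python treating `\s` as a string escape.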

# Tokenisation of Tweets

---



In [604]:
df['processed_tweet'] = df['processed_tweet'].apply(lambda x: x.split())

In [605]:
df.sample(5)

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c,processed_tweet
8889,30602,@USER If you want more shooting just call G So...,NOT,,,"[if, you, want, more, shooting, just, call, g,..."
11906,11777,A 5th columnist always imagines himself as a p...,OFF,TIN,GRP,"[a, 5th, columnist, always, imagines, himself,..."
2895,65180,TRYING TO HIDE ALL THE EVIDENCE 😂😎✍👀 #MAGA #QA...,NOT,,,"[trying, to, hide, all, the, evidence, maga, q..."
367,14407,@USER Brown like you,NOT,,,"[brown, like, you]"
12641,27235,@USER Um bc she is??????,NOT,,,"[um, bc, she, is]"


# Stemming of Tweets

---



In [562]:
df['processed_tweet'].head(10)

0     she should ask a few native americans what th...
1                go home youre drunk  maga trump2020  
2    amazon is investigating chinese employees who ...
3     someone should havetaken this piece of shit t...
4      obama wanted liberals amp illegals to move i...
5                             liberals are all kookoo 
6                                   oh noes tough shit
7     was literally just talking about this lol all...
8                                    buy more icecream
9     canada doesnt need another cuck we already ha...
Name: processed_tweet, dtype: object

In [563]:
x = 'she should ask a few native americans what th...'

In [569]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Stem every token. Each row must already be a list of tokens (see the tokenisation step).
df['processed_tweet'] = df['processed_tweet'].apply(lambda x: [stemmer.stem(i) for i in x])

0    [@, U, S, E, R,  , S, h, e,  , s, h, o, u, l, ...
1    [@, U, S, E, R,  , @, U, S, E, R,  , G, o,  , ...
2    [A, m, a, z, o, n,  , i, s,  , i, n, v, e, s, ...
3    [@, U, S, E, R,  , S, o, m, e, o, n, e,  , s, ...
4    [@, U, S, E, R,  , @, U, S, E, R,  , O, b, a, ...
Name: tokenized_tweet, dtype: object
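
The output shown above contains single characters rather than words, which is consistent with iterating over a raw string instead of a list of tokens. A minimal sketch of a guard against this, using hypothetical helpers `ensure_tokens` and `apply_per_token`; `str.upper` stands in for `stemmer.stem` so that the example runs without NLTK.

```python
def ensure_tokens(value):
    """Return a list of word tokens, splitting on whitespace if given a string."""
    if isinstance(value, str):
        return value.split()
    return list(value)

def apply_per_token(value, func):
    """Apply func to each word token, never to single characters."""
    return [func(tok) for tok in ensure_tokens(value)]

print([c for c in "she should"][:4])                   # ['s', 'h', 'e', ' ']
print(apply_per_token("she should", str.upper))        # ['SHE', 'SHOULD']
print(apply_per_token(["she", "should"], str.upper))   # ['SHE', 'SHOULD']
```

With the stemmer from the cell above, the equivalent call would be `df['processed_tweet'].apply(lambda x: apply_per_token(x, stemmer.stem))`.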