In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
import re 
import nltk
# !pip install contractions
import contractions

# !pip install textblob
from textblob import TextBlob

# !pip install emot
import pickle

# !pip install inflect
import inflect

import string

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Data Exploration and Preprocessing**

Below is the annotated dataset of tweets for the OffensEval competition: Identifying and Categorizing Offensive Language in Social Media. 

Twitter user names are substituted by @USER and URLs by the word URL. 

In this assignment we focus on subtask_a which indicates whether a Tweet has been annotated as offensive or not. Two labels are present: 

- **(NOT)** Not Offensive - This post does not contain offense or profanity.
- **(OFF)** Offensive - This post contains offensive language or a targeted (veiled or direct) offense.

A Tweet is labeled as offensive if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct.

In [None]:
data_path = 'drive/MyDrive/NLP_Final_Assignment/data/olid-training-v1.0.tsv'

df = pd.read_csv(data_path, sep='\t', header=0)

In [None]:
df.head()

,id,tweet,subtask_a,subtask_b,subtask_c
0,86426,@USER She should ask a few native Americans wh...,OFF,UNT,
1,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF,TIN,IND
2,16820,Amazon is investigating Chinese employees who ...,NOT,,
3,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF,UNT,
4,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT,,


In [None]:
df = df.drop(columns=['subtask_b', 'subtask_c'])


In [None]:
# Change labels to binary digits (1 = offensive, 0 = not offensive)
df_labels = pd.get_dummies(df["subtask_a"], dtype=int).drop(columns='NOT')
df = pd.concat((df, df_labels), axis=1)
df = df.drop(columns='subtask_a').rename(columns={"OFF": "offensive"})

df.head()

  


,id,tweet,offensive
0,86426,@USER She should ask a few native Americans wh...,1
1,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,1
2,16820,Amazon is investigating Chinese employees who ...,0
3,62688,"@USER Someone should'veTaken"" this piece of sh...",1
4,43605,@USER @USER Obama wanted liberals &amp; illega...,0


In [None]:
# Dataset unique per tweet? 
print("Number of unique id's: " , str(df['id'].nunique()))
print("Number of rows: " , str(len(df)))

Number of unique id's:  13240
Number of rows:  13240


In [None]:
duplicated = df[df.duplicated()]
duplicated

# drop duplicates
df = df.drop_duplicates()
df.shape

(13212, 2)

In [None]:
# Set id as index, keep only text and label
df = df.set_index('id')
df.head()

id,tweet,offensive
86426,@USER She should ask a few native Americans wh...,1
90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,1
16820,Amazon is investigating Chinese employees who ...,0
62688,"@USER Someone should'veTaken"" this piece of sh...",1
43605,@USER @USER Obama wanted liberals &amp; illega...,0




Below are the first 20 tweets from the dataset:

In [None]:
df.tweet[0:20].values

array(['@USER She should ask a few native Americans what their take on this is.',
       '@USER @USER Go home you’re drunk!!! @USER #MAGA #Trump2020 👊🇺🇸👊 URL',
       'Amazon is investigating Chinese employees who are selling internal data to third-party sellers looking for an edge in the competitive marketplace. URL #Amazon #MAGA #KAG #CHINA #TCOT',
       '@USER Someone should\'veTaken" this piece of shit to a volcano. 😂"',
       '@USER @USER Obama wanted liberals &amp; illegals to move into red states',
       '@USER Liberals are all Kookoo !!!',
       '@USER @USER Oh noes! Tough shit.',
       '@USER was literally just talking about this lol all mass shootings like that have been set ups. it’s propaganda used to divide us on major issues like gun control and terrorism',
       '@USER Buy more icecream!!!',
       '@USER Canada doesn’t need another CUCK! We already have enough #LooneyLeft #Liberals f**king up our great country! #Qproofs #TrudeauMustGo',
       '@USER @USER @USER I

Tweets include the placeholders @USER and URL, which could be removed or kept. They may help a model detect the direction of an offense, but they are not needed to detect the presence of offensive language. Keeping them could also help POS tagging, which may in turn improve offensive-language detection, and removing them can leave some sentences ungrammatical. We therefore keep them.

Tweets also include emojis and hashtags, which need handling. Hashtags are cleaned by removing '#' and keeping the remaining text as a word. Emojis are translated into words during text preprocessing. 

Other aspects of text cleaning and normalization also need evaluating for this project: extra whitespace, special characters, upper- and lowercase characters, repeated letters (e.g. Hiiiiii instead of Hi) and contractions ('don't', 'yall').
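The repeated-letter case listed above is not handled by the cleaning code in this notebook. The following is a minimal sketch of one common approach, assuming that runs of three or more identical letters are shortened to two. The function name `collapse_repeats` and the `max_repeat` parameter are hypothetical and not part of the original pipeline.

```python
import re

def collapse_repeats(text, max_repeat=2):
    """Shorten runs of the same letter to at most `max_repeat` characters."""
    # ([A-Za-z]) captures one letter; \1{n,} matches n or more further copies of it
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % max_repeat)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)

print(collapse_repeats("Hiiiiii"))       # Hii
print(collapse_repeats("sooooo coool"))  # soo cool
print(collapse_repeats("good"))          # good
```

Collapsing to two letters keeps legitimate doubles such as "good" intact. Collapsing to one letter would turn "Hiiiiii" into "Hi" but would also turn "good" into "god", so the choice is a trade-off.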

Finally, there are some cases of self-censored words. Some users wrote f*uck or sh.t, using '.' or '\*' to censor profanity. Human annotators understand these as offensive and label them as such, but a model may miss them. Restoring self-censored words to their full form could therefore improve the model. 

Below are some tweets which are coded offensive: 


In [None]:
df[df['offensive'] == 1].tweet[100:120].values

array(["@USER @USER @USER @USER I wasn't proposing scare tactics. I really meant what I said. Trump can easily pull the good maga Republicans and can easily steal enough votes from GOP Dem and independents and will bring fresh breath to our rotten politics. Think about it.",
       '@USER @USER @USER @USER Only a liberal would support a liberal that spent a MILLION to get liberals elected to office. #LibFAIL! URL',
       '@USER It’s so weirdly vicious and bitter to extrapolate from ‘everyone should have access to decent healthcare’ to ‘Liberals think all criminals should be free.’  It reveals a pretty brutalist and impenetrable mind.',
       '@USER What the fuck game are you watching?',
       '@USER Why? Why are liberals so trashy?',
       '@USER Yes I saw this and I will say tapper kept basket her and she kept coming back with stupid answers And he finally gave up I don’t think he is a big fan of her policies he’s not that stupid please',
       '@USER If you go by anything other 

Before removing duplicates, 4400 tweets are tagged as offensive and 8840 as not offensive, so the dataset is not balanced. Balancing could benefit the classification task, but the unbalanced dataset represents reality, since tweets with offensive content are not the majority on Twitter (Zampieri et al. 2019). Hence the dataset will not be balanced. 

In [None]:
df.offensive.value_counts()

0    8817
1    4395
Name: offensive, dtype: int64
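To make the imbalance concrete, the two counts printed above can be turned into class shares. This sketch uses only those two numbers, and the variable names are illustrative. It also computes the accuracy a trivial classifier would reach by predicting "not offensive" for every tweet, which suggests that accuracy alone may be a weak yardstick on this dataset.

```python
# Counts copied from the value_counts() output above
counts = {"not_offensive": 8817, "offensive": 4395}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the majority class
majority_baseline = max(shares.values())

print(total)                                         # 13212
print({k: round(v, 3) for k, v in shares.items()})   # {'not_offensive': 0.667, 'offensive': 0.333}
print(round(majority_baseline, 3))                   # 0.667
```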

In [None]:
df.offensive.isnull().sum()

# No missing values in the label, all tweets are coded for subtask A.

0

Below I look for self-censored words in order to correct them. I do this before removing symbols or de-duplicating punctuation so that self-censored words are easy to find. The function below finds words that begin and end with a letter but contain, in between, symbols that I have seen used to self-censor profanity. Some examples are printed below: 

In [None]:
# Find words that begin and end with a letter but have censoring symbols
# (. * ? ! & ^) in between. Returns None when nothing is found.
def selfcensored(sentence):
    pattern = re.compile(r'[a-zA-Z]+[\.\*\?!&^]{1,}[a-zA-Z]+')
    found = re.findall(pattern, sentence.lower())
    if found:
        return found

# for sent in df['tweet_clean']:
#   if selfcensored(sent) != None:
#       print(selfcensored(sent))


# list of self-censored profanity cases (raw strings, used as regex patterns)

list_selfcensored = [r'f\*\*king', r'sh\*t', r'p\*\*sy', r'f\*cks', r'fu\*k',
                     r'f\*cking', r'b\*\*ch', r'bullsh\*t', r'f\*ck', r'sh\!t',
                     r'f\*\*ked', r'f\*\*\*ing', r'a\*\*hole', r'd\*mbasses', r'da\*n'
                     ]



In [None]:
df[df['tweet'].str.contains(r'[a-zA-Z]+[\*]{1,}[a-zA-Z]+')][['tweet']].values[0:3]

array([['@USER Canada doesn’t need another CUCK! We already have enough #LooneyLeft #Liberals f**king up our great country! #Qproofs #TrudeauMustGo'],
       ["@USER At this point in time... I don't think Pres. Trump gives a sh*t... and neither do I! LOL URL"],
       ['@USER @USER Did Chuck think Juanita Broderick credible-Keith Ellison’s girlfriend domestic abuse credible? The truth is Chuck is a sh*t stirrer for a cause. In this case- ruin a mans impeccable career-embarrass his wife &amp; daughters-all in a sleazy days work. Y?Liberals destroy what dont like']],
      dtype=object)
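The pattern used in `selfcensored` is deliberately broad, so it also returns tokens that are not censored profanity. The stand-alone sketch below applies an equivalent pattern to short excerpts of tweets shown in this notebook. The helper name `find_candidates` is hypothetical. The output illustrates why a manually curated list is a reasonable next step.

```python
import re

# Letters, then one or more censoring symbols, then letters again
CENSOR_PATTERN = re.compile(r"[a-zA-Z]+[.*?!&^]+[a-zA-Z]+")

def find_candidates(text):
    """Return all candidate self-censored tokens (empty list if none)."""
    return CENSOR_PATTERN.findall(text.lower())

print(find_candidates("Liberals f**king up our great country!"))   # ['f**king']
print(find_candidates("Trump gives a sh*t... and neither do I!"))  # ['sh*t']
# False positive: missing space after the question mark
print(find_candidates("a sleazy days work. Y?Liberals destroy"))   # ['y?liberals']
print(find_candidates("Oh noes! Tough shit."))                     # []
```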

In [None]:
# Text preprocessing 

def expand_contractions(text):
    clean_text = contractions.fix(text)
    
    return clean_text
    

def remove_punc(text):    # Removing special characters in string but keeping 
                 # punctuations: ",.;:!?" also * (bc of self-censored words) and
                 # @ because of referral to a user
  # raw string so that the backslash is a literal character, not an escape
  punc = r'''()-[]{}'"\<>/#$%^&_~'''

  for punctuation in punc:
    text = text.replace(punctuation, '')
  
  return text


def whitespace_be(text): # remove whitespace beginning and end of tweet due to cleaning

  text = re.sub(r'^\s+|\s+$', '', text, flags=re.UNICODE)

  return text

def whitespace_double(text): # remove duplicate whitespace 
  text = re.sub(r'\s+', ' ', text, flags=re.UNICODE)

  return text

# Convert emojis into words
from emot.emo_unicode import UNICODE_EMOJI, EMOTICONS_EMO

def convert_emojis(text):
    for emot in UNICODE_EMOJI:
        text = text.replace(emot, "_".join(UNICODE_EMOJI[emot].replace(",","").replace(":","").split()))
    return text

def clean_text(text):
    
    text = text.lower()
    
    text = expand_contractions(text)
        
    text = remove_punc(text)
    
    text = whitespace_be(text)  

    text = whitespace_double(text)  

    text = convert_emojis(text)

    return text

df['tweet_clean'] = df['tweet'].apply(clean_text)

In [None]:
df['tweet_clean'].values[0:5]

array(['@user she should ask a few native americans what their take on this is.',
       '@user @user go home you are drunk!!! @user maga trump2020 oncoming_fistUnited_Statesoncoming_fist url',
       'amazon is investigating chinese employees who are selling internal data to thirdparty sellers looking for an edge in the competitive marketplace. url amazon maga kag china tcot',
       '@user someone shouldvetaken this piece of shit to a volcano. face_with_tears_of_joy',
       '@user @user obama wanted liberals amp; illegals to move into red states'],
      dtype=object)
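The last cleaned tweet above still contains `amp;`, a leftover of the HTML entity `&amp;` after `&` was stripped as punctuation. One possible fix, which is not part of the original pipeline, is to decode HTML entities before removing punctuation. The sketch below uses the standard-library `html.unescape`. The helper name `decode_entities` is hypothetical, and the last step only imitates the relevant part of the cleaning.

```python
import html
import re

def decode_entities(text):
    """Turn HTML entities such as &amp; back into plain characters."""
    return html.unescape(text)

raw = "@USER @USER Obama wanted liberals &amp; illegals to move into red states"
decoded = decode_entities(raw)
print(decoded)

# Imitate the later steps: drop '&', collapse whitespace, lowercase
cleaned = re.sub(r"\s+", " ", decoded.replace("&", "")).strip().lower()
print(cleaned)
```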

Only 23 tweets contain self-censored words from the list. 18 of them are labeled offensive, apparently because of these words. While I expected more, it is still worth correcting them. The 5 tweets with self-censored profanity that are labeled non-offensive may reflect differences between annotators. 

In [None]:
df[df['tweet_clean'].str.contains('|'.join(list_selfcensored))]['offensive'].value_counts()

1    18
0     5
Name: offensive, dtype: int64

In [None]:
checkOFF = df[(df['tweet_clean'].str.contains('|'.join(list_selfcensored))) & (df['offensive'] == 1)]
checkOFF.tweet.values

array(['@USER Canada doesn’t need another CUCK! We already have enough #LooneyLeft #Liberals f**king up our great country! #Qproofs #TrudeauMustGo',
       "@USER At this point in time... I don't think Pres. Trump gives a sh*t... and neither do I! LOL URL",
       '@USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER @USER Even pedophiles in interviews will tell you the only way to stop them is to kill them. Pedophiles even admit to when out of prison they will offend again and that’s what keeps them alive in prison... the hope of another opportunity to molest an innocent child. They are sick f*cks',
       "@USER @USER This B**ch Behar needs to shut that big ugly hole on her face and think about the fact she's not giving Kavanaugh or his family the

In [None]:
checkNOT = df[(df['tweet_clean'].str.contains('|'.join(list_selfcensored))) & (df['offensive'] == 0)]
checkNOT.tweet.values
# I don't know why these tweets are not marked as offensive; they seem offensive

array(['@USER @USER Did Chuck think Juanita Broderick credible-Keith Ellison’s girlfriend domestic abuse credible? The truth is Chuck is a sh*t stirrer for a cause. In this case- ruin a mans impeccable career-embarrass his wife &amp; daughters-all in a sleazy days work. Y?Liberals destroy what dont like',
       '2/2 More from Mark Judge,Kavanaugh\'s bro:liberals are trying to take our fun away... Brent Musburger can’t call a hot girl hot... Obama wants to outlaw guns because it’s all about the children.The children,the children...No one can belch because of the f*cking children" URL',
       '@USER @USER Conservatives characterize an attempted rape allegation as “bullsh*t” then wonder why liberals describe them as anti-woman.  The right thing to do is properly investigate the allegation. If she lying then prosecute her. If she’s telling the truth Kavanaugh shouldn’t be confirmed.',
       '@USER She is really good for him and told him how he needed to straighten up. I like her and I l

It seems that most tweets containing words from the self-censored list are indeed labeled as offensive. There were cases where f\*cking and sh\*t were labeled as not offensive, but I will correct these words anyway because annotators mostly treat them as offensive.

Before cleaning these manually, I tried a spelling corrector (TextBlob) to see whether it would correct them. It did not, so I replaced the words in the list manually before applying the spelling corrector. 

In [None]:
# change self-censored words: regex pattern -> uncensored word
censored_to_plain = {
    r'f\*\*king': 'fucking', r'sh\*t': 'shit', r'p\*\*sy': 'pussy',
    r'f\*cks': 'fucks', r'fu\*k': 'fuck', r'f\*cking': 'fucking',
    r'b\*\*ch': 'bitch', r'bullsh\*t': 'bullshit', r'f\*ck': 'fuck',
    r'sh\!t': 'shit', r'f\*\*ked': 'fucked', r'f\*\*\*ing': 'fucking',
    r'a\*\*hole': 'asshole', r'd\*mbasses': 'dumbasses', r'da\*n': 'damn',
}
list_selfcensored = list(censored_to_plain)

# apply the replacements in the order listed above
for pattern, replacement in censored_to_plain.items():
    df['tweet_clean'] = df['tweet_clean'].str.replace(pattern, replacement, regex=True)

# finally remove *
df['tweet_clean'] = df['tweet_clean'].str.replace(r'\*', '', regex=True)

In [None]:
# how does it seem now

df[df['tweet'].str.contains('|'.join(list_selfcensored))][['tweet', 'tweet_clean']].values[0:3]

array([['@USER Canada doesn’t need another CUCK! We already have enough #LooneyLeft #Liberals f**king up our great country! #Qproofs #TrudeauMustGo',
        '@user canada does not need another cuck! we already have enough looneyleft liberals fucking up our great country! qproofs trudeaumustgo'],
       ["@USER At this point in time... I don't think Pres. Trump gives a sh*t... and neither do I! LOL URL",
        '@user at this point in time... i do not think pres. trump gives a shit... and neither do i! lol url'],
       ['@USER @USER Did Chuck think Juanita Broderick credible-Keith Ellison’s girlfriend domestic abuse credible? The truth is Chuck is a sh*t stirrer for a cause. In this case- ruin a mans impeccable career-embarrass his wife &amp; daughters-all in a sleazy days work. Y?Liberals destroy what dont like',
        '@user @user did chuck think juanita broderick crediblekeith ellison’s girlfriend domestic abuse credible? the truth is chuck is a shit stirrer for a because. in th

In [None]:
# de-duplicating punctuations: a run of two or more . ? ! is collapsed
# to its first character (e.g. "!!!" -> "!", "?.." -> "?")
def my_replacer(match):
    return match.group()[0]

regex = r"[\.\?\!]{2,}"

df['tweet_clean'] = df.tweet_clean.apply(lambda x: re.sub(regex, my_replacer, x))

df[df['tweet'].str.contains(r'[\.\?\!]{2,}')][['tweet', 'tweet_clean']].values


array([['@USER @USER Go home you’re drunk!!! @USER #MAGA #Trump2020 👊🇺🇸👊 URL',
        '@user @user go home you are drunk! @user maga trump2020 oncoming_fistUnited_Statesoncoming_fist url'],
       ['@USER Liberals are all Kookoo !!!',
        '@user liberals are all kookoo !'],
       ['@USER Buy more icecream!!!', '@user buy more icecream!'],
       ...,
       ['@USER @USER @USER @USER Right. Dang. She is the s...t',
        '@user @user @user @user right. dang. she is the s.t'],
       ['@USER @USER @USER So have the conservatives accepted the antisemitism definition yet?..',
        '@user @user @user so have the conservatives accepted the antisemitism definition yet?'],
       ['@USER @USER BUT GUN CONTROL!!!', '@user @user but gun control!']],
      dtype=object)

In [None]:
# apply spelling check - takes way too long
# df['tweet_clean'] = df.tweet_clean.apply(lambda txt: ''.join(TextBlob(txt).correct()))

**Dealing with emojis**

In the last step of text preprocessing, emojis were translated into words so that they can be taken into account as well. We check what this looks like:

In [None]:
# Convert emojis into words
from emot.emo_unicode import UNICODE_EMOJI

def convert_emojis(text):
    for emot in UNICODE_EMOJI:
        text = text.replace(emot, "_".join(UNICODE_EMOJI[emot].replace(",","").replace(":","").split()))
    return text

df['tweet_clean'] = df.tweet_clean.apply(lambda txt: convert_emojis(txt))


Below we can see that the emojis are turned into word representations:

In [None]:
df[['tweet', 'tweet_clean']].values[0:5] 

array([['@USER She should ask a few native Americans what their take on this is.',
        '@user she should ask a few native americans what their take on this is.'],
       ['@USER @USER Go home you’re drunk!!! @USER #MAGA #Trump2020 👊🇺🇸👊 URL',
        '@user @user go home you are drunk! @user maga trump2020 oncoming_fistUnited_Statesoncoming_fist url'],
       ['Amazon is investigating Chinese employees who are selling internal data to third-party sellers looking for an edge in the competitive marketplace. URL #Amazon #MAGA #KAG #CHINA #TCOT',
        'amazon is investigating chinese employees who are selling internal data to thirdparty sellers looking for an edge in the competitive marketplace. url amazon maga kag china tcot'],
       ['@USER Someone should\'veTaken" this piece of shit to a volcano. 😂"',
        '@user someone shouldvetaken this piece of shit to a volcano. face_with_tears_of_joy'],
       ['@USER @USER Obama wanted liberals &amp; illegals to move into red states',
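In the output above, adjacent emojis are merged into one token (`oncoming_fistUnited_Statesoncoming_fist`), and an emoji directly after a word is glued to it. A possible refinement, sketched here under the assumption that each emoji should become its own token, is to pad every replacement with spaces and then collapse whitespace. To stay self-contained, the example uses a small hypothetical mapping `EMOJI_NAMES` in place of the `UNICODE_EMOJI` dictionary from `emot`.

```python
import re

# Hypothetical three-entry stand-in for emot's UNICODE_EMOJI dictionary
EMOJI_NAMES = {
    "\U0001F44A": ":oncoming_fist:",
    "\U0001F1FA\U0001F1F8": ":United_States:",
    "\U0001F602": ":face_with_tears_of_joy:",
}

def convert_emojis_spaced(text, mapping=EMOJI_NAMES):
    """Replace each emoji with its name, surrounded by spaces."""
    for emoji, name in mapping.items():
        word = "_".join(name.replace(",", "").replace(":", "").split())
        text = text.replace(emoji, f" {word} ")
    # Collapse the extra whitespace introduced by the padding
    return re.sub(r"\s+", " ", text).strip()

tweet = "maga trump2020 \U0001F44A\U0001F1FA\U0001F1F8\U0001F44A url"
print(convert_emojis_spaced(tweet))
# maga trump2020 oncoming_fist United_States oncoming_fist url
```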
 

**Text Normalization : Lemmatization and Tokenization**

To normalize the tweet text, we apply a lemmatizer and a tokenizer. I will use spaCy to lemmatize but will rely on NLTK's TweetTokenizer to tokenize the tweets. 

In [None]:
# !pip install -U spacy
# !python3 -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
spacy.__version__

'2.2.4'

In [None]:
# function to get lemmas from spacy nlp 

def get_lemmas (tweet):
    tweets = nlp(tweet)
    return " ".join([token.lemma_ for token in tweets])

In [None]:
# function to get POS information

def get_pos (tweet):
    tweets = nlp(tweet)
    return " ".join([token.pos_ for token in tweets])

In [None]:
tweets = df.tweet_clean.values
tweet_lemmas = [get_lemmas(tweet) for tweet in tweets]
tweet_pos = [get_pos(tweet) for tweet in tweets]
df['tweet_lemmas'] = tweet_lemmas
df['tweet_pos'] = tweet_pos  # POS tags, possibly useful as a feature for modelling

In [None]:
df.head(10)

id,tweet,offensive,tweet_clean,tweet_lemmas,tweet_pos
86426,@USER She should ask a few native Americans wh...,1,@user she should ask a few native americans wh...,@user -PRON- should ask a few native americans...,X PRON VERB VERB DET ADJ ADJ PROPN PRON DET VE...
90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,1,@user @user go home you are drunk! @user maga ...,@user @user go home -PRON- be drunk ! @user ma...,X PUNCT VERB ADV PRON AUX ADJ PUNCT X PROPN PR...
16820,Amazon is investigating Chinese employees who ...,0,amazon is investigating chinese employees who ...,amazon be investigate chinese employee who be ...,PROPN AUX VERB ADJ NOUN PRON AUX VERB ADJ NOUN...
62688,"@USER Someone should'veTaken"" this piece of sh...",1,@user someone shouldvetaken this piece of shit...,@user someone shouldvetaken this piece of shit...,PUNCT PRON VERB DET NOUN ADP NOUN ADP DET NOUN...
43605,@USER @USER Obama wanted liberals &amp; illega...,0,@user @user obama wanted liberals amp; illegal...,@user @user obama want liberal amp ; illegal t...,PUNCT PUNCT PROPN VERB NOUN VERB PUNCT NOUN PA...
97670,@USER Liberals are all Kookoo !!!,1,@user liberals are all kookoo !,@user liberal be all kookoo !,ADJ NOUN AUX DET VERB PUNCT
77444,@USER @USER Oh noes! Tough shit.,1,@user @user oh noes! tough shit.,@user @user oh no ! tough shit .,X X INTJ NOUN PUNCT ADJ NOUN PUNCT
52415,@USER was literally just talking about this lo...,1,@user was literally just talking about this lo...,@user be literally just talk about this lol al...,PROPN AUX ADV ADV VERB ADP DET NOUN DET ADJ NO...
45157,@USER Buy more icecream!!!,0,@user buy more icecream!,@user buy more icecream !,X VERB ADJ NOUN PUNCT
13384,@USER Canada doesn’t need another CUCK! We alr...,1,@user canada does not need another cuck! we al...,@user canada do not need another cuck ! -PRON-...,PROPN PROPN AUX PART VERB DET NOUN PUNCT PRON ...


**Additional Features**

The first classification model, an SVM classifier, will take TF-IDF weighted vectors of the text as input. However, because I am not allowed to use embeddings or transformers in this model, I will add additional feature vectors to take semantics into account when classifying a tweet as offensive. 

The additional features represent emotion (through NRC lexicon) and hate speech (through hate speech lexicon of Bassignana et al. (2018)). The traditional model to detect offensive tweets follows the work of Markov and Daelemans (2021). 
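As a reminder of what TF-IDF weighting does, here is a minimal pure-Python sketch of the plain textbook variant, where the weight is term frequency times log(N / document frequency). It is only an illustration. Libraries such as scikit-learn use smoothed and normalized variants, so their values will differ. The function name `tfidf` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["you", "are", "stupid"], ["you", "are", "kind"]]
for doc_weights in tfidf(docs):
    print(doc_weights)
```

Terms that occur in every document ("you", "are") get weight 0, while terms specific to one document ("stupid", "kind") get a positive weight.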

<br>

*Emotion classification of tweets using NRC Lexicon*

In [None]:
!pip install NRCLex
!python -m textblob.download_corpora
from nrclex import NRCLex



In [None]:
def return_emotions (tweet):
    emotion = NRCLex(tweet)
    return emotion.affect_list

In [None]:
# check
print(tweets[67])
print(return_emotions(tweets[67]))

@user you are so straight forword manOK_hand i saw you in dance dewwane and you are just talk free ky ap kitnay porrany ho industry mein and i really like you are this quality that you even gather with you are seniorgreen_heart artist love for manmarziyaan cowboy_hat_facethumbs_up
['joy', 'positive', 'trust', 'positive', 'joy', 'positive']


In [None]:
tweet_emotions = [return_emotions(tweet) for tweet in tweets]
df['tweet_emotions'] = tweet_emotions
df['tweet_emotions'] = df['tweet_emotions'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
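The last line of the cell above removes repeated emotion labels while keeping their first-seen order, because dictionary keys are unique and preserve insertion order. A small stand-alone illustration, using the affect list printed for `tweets[67]` earlier (the helper name `unique_in_order` is hypothetical):

```python
def unique_in_order(labels):
    """Join labels into one string, dropping repeats but keeping order."""
    return " ".join(dict.fromkeys(labels))

affects = ['joy', 'positive', 'trust', 'positive', 'joy', 'positive']
print(unique_in_order(affects))  # joy positive trust
```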


<br>

*Using an insult-word lexicon, we can detect abusive words in tweets and take the TF-IDF vectors of these words as input to include semantics in the model.*

The insults lexicon is taken from the research of Bassignana et al. (2018) from the following GitHub repository: https://github.com/bgmartins/hate-speech-lexicons





In [None]:
lexicon_path = 'drive/MyDrive/NLP_Final_Assignment/data/abusive_words.txt'
insults_lexicon = pd.read_csv(lexicon_path, sep='\t', header=None)
insults_lexicon.columns = ['insults']
insults_lexicon.head(10)

,insults
0,lummox
1,cross-breed
2,dumbbell
3,bum
4,vagrant
5,rentboy
6,rent-boy
7,sonuvabitch
8,rats
9,sons of bitches


In [None]:
# Unique insults
insult_list = list(insults_lexicon['insults'].unique())

# Extract the words if there is an exact match 
df['insult_match'] = df['tweet_clean'].str.findall(r'\b(' + '|'.join(insult_list) + r')\b')
df['insult_match'] = [' '.join(map(str, l)) for l in df['insult_match']]


df.head()

id,tweet,offensive,tweet_clean,tweet_lemmas,tweet_pos,tweet_emotions,insult_match
86426,@USER She should ask a few native Americans wh...,1,@user she should ask a few native americans wh...,@user -PRON- should ask a few native americans...,X PRON VERB VERB DET ADJ ADJ PROPN PRON DET VE...,,
90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,1,@user @user go home you are drunk! @user maga ...,@user @user go home -PRON- be drunk ! @user ma...,X PUNCT VERB ADV PRON AUX ADJ PUNCT X PROPN PR...,,
16820,Amazon is investigating Chinese employees who ...,0,amazon is investigating chinese employees who ...,amazon be investigate chinese employee who be ...,PROPN AUX VERB ADJ NOUN PRON AUX VERB ADJ NOUN...,,
62688,"@USER Someone should'veTaken"" this piece of sh...",1,@user someone shouldvetaken this piece of shit...,@user someone shouldvetaken this piece of shit...,PUNCT PRON VERB DET NOUN ADP NOUN ADP DET NOUN...,anger disgust negative fear surprise,shit
43605,@USER @USER Obama wanted liberals &amp; illega...,0,@user @user obama wanted liberals amp; illegal...,@user @user obama want liberal amp ; illegal t...,PUNCT PUNCT PROPN VERB NOUN VERB PUNCT NOUN PA...,,
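The lexicon pattern above joins the entries directly into a regular expression. A more defensive variant is sketched below. It escapes each entry so that characters such as `-` or `.` are matched literally, and it tries longer entries first so that a multi-word phrase is not shadowed by a shorter entry starting at the same position. The function name `build_lexicon_pattern` and the short term list are hypothetical; "sons" is added only to show the ordering effect.

```python
import re

def build_lexicon_pattern(terms):
    """Compile a whole-word pattern from lexicon entries, longest first."""
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b")

terms = ["sons", "sons of bitches", "rent-boy", "bum"]
pattern = build_lexicon_pattern(terms)

print(pattern.findall("the sons of bitches and a rent-boy"))
# ['sons of bitches', 'rent-boy']
```

Note that `remove_punc` strips hyphens from `tweet_clean`, so hyphenated lexicon entries such as "rent-boy" may never match the cleaned text unless the lexicon is normalized in the same way.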


In [None]:
df.offensive.value_counts()

0    8817
1    4395
Name: offensive, dtype: int64

In [None]:
df.insult_match.value_counts()

                                         12054
shit                                       327
stupid                                      91
idiot                                       44
dumb                                        35
                                         ...  
shit foolish                                 1
asshole shit                                 1
ignorant fool ignorant ignorant idiot        1
idiot retard                                 1
midget                                       1
Name: insult_match, Length: 168, dtype: int64

In [None]:
df[df['offensive'] > 0].insult_match.value_counts() 
# from a quick look we can see that majority of the tweets with an insult word
# are under offensive label but majority of offensive tweets do not have 
# insult words (3516 out of 4395)

                      3516
shit                   288
stupid                  83
idiot                   42
ignorant                27
                      ... 
nonsense stupid          1
idiots stupidity         1
fool ignorant fool       1
quack                    1
dense                    1
Name: insult_match, Length: 142, dtype: int64

In [None]:
df[df['offensive'] == 0].insult_match.value_counts() 
# out of 8817 non-offensive tweets, 
# 8538 have no offensive word according to the lexicon

                8538
shit              39
mark              17
simple            12
nonsense          12
                ... 
shit foolish       1
fooled             1
rabble             1
chatterbox         1
midget             1
Name: insult_match, Length: 86, dtype: int64
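The counts above can be summarized by treating "contains at least one lexicon word" as a simple rule for predicting the offensive label. The arithmetic below uses only numbers printed in this notebook, with illustrative variable names. It indicates that the rule is fairly precise but covers only a small part of the offensive tweets, which is consistent with using the lexicon as an additional feature and not as a classifier on its own.

```python
# Counts copied from the outputs above
n_offensive, n_not_offensive = 4395, 8817
no_match_offensive, no_match_not_offensive = 3516, 8538

# Tweets with at least one lexicon match, per class
true_positives = n_offensive - no_match_offensive            # offensive, with match
false_positives = n_not_offensive - no_match_not_offensive   # not offensive, with match

precision = true_positives / (true_positives + false_positives)
recall = true_positives / n_offensive

print(true_positives, false_positives)       # 879 279
print(round(precision, 3), round(recall, 3)) # 0.759 0.2
```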

In [None]:
df.head()

id,tweet,offensive,tweet_clean,tweet_lemmas,tweet_pos,tweet_emotions,insult_match
86426,@USER She should ask a few native Americans wh...,1,@user she should ask a few native americans wh...,@user -PRON- should ask a few native americans...,X PRON VERB VERB DET ADJ ADJ PROPN PRON DET VE...,,
90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,1,@user @user go home you are drunk! @user maga ...,@user @user go home -PRON- be drunk ! @user ma...,X PUNCT VERB ADV PRON AUX ADJ PUNCT X PROPN PR...,,
16820,Amazon is investigating Chinese employees who ...,0,amazon is investigating chinese employees who ...,amazon be investigate chinese employee who be ...,PROPN AUX VERB ADJ NOUN PRON AUX VERB ADJ NOUN...,,
62688,"@USER Someone should'veTaken"" this piece of sh...",1,@user someone shouldvetaken this piece of shit...,@user someone shouldvetaken this piece of shit...,PUNCT PRON VERB DET NOUN ADP NOUN ADP DET NOUN...,anger disgust negative fear surprise,shit
43605,@USER @USER Obama wanted liberals &amp; illega...,0,@user @user obama wanted liberals amp; illegal...,@user @user obama want liberal amp ; illegal t...,PUNCT PUNCT PROPN VERB NOUN VERB PUNCT NOUN PA...,,


In [None]:
# Writing this dataset in a csv file to use for modelling 

writing_path = 'drive/MyDrive/NLP_Final_Assignment/data/df_preprocessed.csv'

df.to_csv(writing_path, encoding='utf-8', index=True)


**References**

Bassignana, E., Basile, V., & Patti, V. (2018). Hurtlex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018 (Vol. 2253, pp. 1-6). CEUR-WS.

De Smedt, T., Voué, P., Jaki, S., Röttcher, M., & De Pauw, G. (2020). Profanity & offensive words (POW).

Markov, I., & Daelemans, W. (2021, June). Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (pp. 17-22).

Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., & Kumar, R. (2019). Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666.
