## Thesis 2020-2021: Dictionary Baseline

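In this notebook, we build a dictionary-based (lexicon) baseline for hate speech detection on the English HatEval 2019 data: a tweet is labelled as hate speech (HS = 1) based on whether it contains words from a profanity lexicon.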

In [3]:
import pandas as pd
import numpy as np
import math

import matplotlib
import matplotlib.pyplot as plt

In [4]:
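df_train = pd.read_csv('data/hateval2019_en_train.csv')
df_dev = pd.read_csv('data/hateval2019_en_dev.csv')

# Combine train and dev into one frame (DataFrame.append was removed in pandas 2.0)
df_train_dev = pd.concat([df_train, df_dev], ignore_index=True)
# Keep only the hate speech label (HS); drop the TR and AG columns
df_train_dev = df_train_dev.drop(['TR', 'AG'], axis=1)
df_train_dev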

Unnamed: 0,id,text,HS
0,201,"Hurray, saving us $$$ in so many ways @potus @...",1
1,202,Why would young fighting age men be the vast m...,1
2,203,@KamalaHarris Illegals Dump their Kids at the ...,1
3,204,NY Times: 'Nearly All White' States Pose 'an A...,0
4,205,Orban in Brussels: European leaders are ignori...,0
...,...,...,...
9995,19196,@SamEnvers you unfollowed me? Fuck you pussy,0
9996,19197,@DanReynolds STFU BITCH! AND YOU GO MAKE SOME ...,1
9997,19198,"@2beornotbeing Honey, as a fellow white chick,...",0
9998,19199,I hate bitches who talk about niggaz with kids...,1


In [5]:
df_train_dev.text[1]

"Why would young fighting age men be the vast majority of the ones escaping a war &amp; not those who cannot fight like women, children, and the elderly?It's because the majority of the refugees are not actually refugees they are economic migrants trying to get into Europe.... https://t.co/Ks0SHbtYqn"

## PREPROCESSING

The very first step in NLP is _preprocessing_, that is, preparing the raw textual data such that we get rid of (most of the) noise and retain the most informative signal.

Steps:
- Convert to lowercase
- Remove stopwords
- Remove @mentions
- Remove URLs
- Remove all non-alphanumeric symbols (except #)
- ...
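
The regex-based steps above can be sketched on a single tweet as follows. This is a minimal illustration, not the notebook's cleaning function: `clean_sketch`, `TOY_STOPWORDS` and the example tweet are hypothetical, and the tiny stopword set stands in for NLTK's English list so that the snippet runs without extra downloads.

```python
import re

# Hypothetical, tiny stopword set used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"the", "of", "a", "is", "are", "to", "and"}

def clean_sketch(text):
    """Apply the listed preprocessing steps to one tweet."""
    text = text.lower()                        # convert to lowercase
    text = re.sub(r"@\w+", "", text)           # remove @mentions (including underscores)
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[^\w\s#]", " ", text)      # keep word characters, whitespace and '#'
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)                    # also collapses repeated whitespace

example = "@some_user The refugees are welcome! #BuildTheWall https://t.co/abc"
print(clean_sketch(example))  # refugees welcome #buildthewall
```

Mentions and URLs are removed before the punctuation step, since stripping `@`, `:` and `/` first would leave fragments that the two patterns no longer match.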


In [8]:
import re

from pattern.text.en import singularize
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

# Run once if the stopword list is missing: import nltk; nltk.download('stopwords')
tokenizer = TweetTokenizer()
stop_words = set(stopwords.words('english'))

# Create a function to clean the tweets
def cleanTxt(text):
    text = text.lower() # Convert everything to lower case
    text = re.sub(r'@[a-zA-Z0-9]+', '', text) # Remove @mentions (note: stops at '_', so '@brian_202' leaves '_202')
    text = re.sub(r'rt[\s]+', '', text) # Remove RT (retweet symbol); note: also strips 'rt ' at the end of words such as 'start '
    text = re.sub(r'&amp;', 'and', text) # Replace '&amp;' by 'and'
    text = re.sub(r'https?:\/\/\S+', '', text) # Remove hyperlinks
    #text = re.sub(r'\d+', '0', text) # Replace all numbers by a zero
    # Tokenize, remove stopwords and singularize the remaining words
    text = " ".join([singularize(word) for word in tokenizer.tokenize(text) if word not in stop_words])
    text = re.sub(r'[^\w\s#]', ' ', text) # Remove all non-alphanumeric symbols (excluding whitespace and # characters)
    text = re.sub(r'\s+', ' ', text) # Replace multiple whitespaces by a single whitespace
    text = text.strip() # Remove whitespaces at the beginning and at the end

    return text

In [7]:
# Clean the text (this cell must run after the cell that defines cleanTxt and stop_words)
df_train_dev['text'] = df_train_dev['text'].apply(cleanTxt)

# Show the cleaned text
df_train_dev

NameError: name 'stop_words' is not defined

## Dictionary model: Attempt 1 - baseline

In [75]:
# Load all 'hateful words' from the text file profanity_en.txt (one term per line)
with open('Data/profanity_en.txt') as f:
    # A set gives fast membership tests when checking every token of every tweet
    hs_lexicon = {line.rstrip('\n') for line in f}

In [76]:
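from nltk.tokenize import TweetTokenizer
t = TweetTokenizer()

# Return 1 if any token of the input text appears in the lexicon, else 0.
# Matching is exact and case-sensitive, so upper-case tokens in raw tweets
# do not match lower-case lexicon entries.
def contains_hs(text):
    for word in t.tokenize(text):
        if word in hs_lexicon:
            return 1
    return 0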
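
Because the lookup is an exact string match, casing matters when the function is applied to raw tweets. The sketch below illustrates this with a hypothetical three-word lexicon and whitespace tokenization instead of `TweetTokenizer`; `toy_lexicon` and `contains_term` are illustrative names, and it is an assumption that the entries in `profanity_en.txt` are lower-case.

```python
# Hypothetical three-word lexicon standing in for profanity_en.txt.
toy_lexicon = {"idiot", "scum", "moron"}

def contains_term(text, lexicon, lowercase=False):
    """Return 1 if any whitespace-separated token is in the lexicon, else 0."""
    if lowercase:
        text = text.lower()
    return int(any(token in lexicon for token in text.split()))

raw = "GET this SCUM out"
print(contains_term(raw, toy_lexicon))                  # 0: 'SCUM' != 'scum'
print(contains_term(raw, toy_lexicon, lowercase=True))  # 1: match after lowercasing
```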

In [81]:
df_test = pd.read_csv('data/hateval2019_en_test.csv')
df_test = df_test.drop(['TR', 'AG', 'HS'], axis=1)

df_test_contains_hs = df_test.copy()
df_test_contains_hs['HS'] = df_test_contains_hs['text'].apply(contains_hs)
df_test_contains_hs

Unnamed: 0,id,text,HS
0,34243,"@local1025 @njdotcom @GovMurphy Oh, I could ha...",1
1,30593,Several of the wild fires in #california and #...,0
2,31427,@JudicialWatch My question is how do you reset...,1
3,31694,"#Europe, you've got a problem! We must hurry...",0
4,31865,This is outrageous! #StopIllegalImmigration #...,0
...,...,...,...
2995,31368,you can never take a L off a real bitch😩 im ho...,1
2996,30104,@Brian_202 likes to call me a cunt & a bitch b...,1
2997,31912,@kusha1a @Camio_the_wise @shoe0nhead 1. Never ...,1
2998,31000,If i see and know you a hoe why would i hit yo...,1


In [91]:
import import_ipynb
import evaluate # here we import the local evaluate.ipynb jupyter notebook

# Create prediction file for the dictionary_baseline
df_test_contains_hs[['id', 'HS']].to_csv('predictions/dictionary_baseline.tsv', sep='\t', index=False, header=False)
df_test_contains_hs[['id', 'HS']].to_csv('input/res/en_a.tsv', sep='\t', index=False, header=False)

# Evaluate the result of the dictionary_baseline
evaluate.write_eval("scores_dictionary_baseline")

importing Jupyter notebook from evaluate.ipynb
taskA_fscore: 0.552776935073505
taskA_precision: 0.5630655506062644
taskA_recall: 0.5638615216201424
taskA_accuracy: 0.553
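
The local `evaluate` notebook is not shown here, so how it computes these scores is not known. As a point of reference, the sketch below computes macro-averaged precision, recall and F1 in plain Python (macro averaging is what the threshold search later in this notebook requests from `f1_score`). `macro_scores` and the toy labels are hypothetical, and whether `write_eval` averages the same way is an assumption.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging: every class counts equally, regardless of its size
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy gold labels and predictions (illustrative only)
precision, recall, f1 = macro_scores([1, 1, 0, 0], [1, 0, 0, 0])
print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.8333 0.75 0.7333
```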


### Now we will use the same exact model, except that we will first clean the text

In [136]:
df_test_clean_contains_hs = df_test.copy()
df_test_clean_contains_hs['text'] = df_test_clean_contains_hs['text'].apply(cleanTxt)
df_test_clean_contains_hs['HS'] = df_test_clean_contains_hs['text'].apply(contains_hs)
df_test_clean_contains_hs

Unnamed: 0,id,text,HS
0,34243,oh could gone taxe since current news nj guv w...,1
1,30593,several wild fire #californium #colorado inten...,1
2,31427,question resettle refugee refugee go home coun...,1
3,31694,#europe got problem must hurry #buildthewall b...,1
4,31865,outrageou #stopillegalimmigration #meritimmigr...,0
...,...,...,...
2995,31368,never take l real bitch im hotter ho chill w,1
2996,30104,_202 like call cunt bitch tell truth can t handle,1
2997,31912,_the_wise 1 never said taught 2 called bitch f...,1
2998,31000,see know hoe would hit back lol bitch got new ...,1


In [137]:
# Create prediction file for the clean_dictionary_baseline
df_test_clean_contains_hs[['id', 'HS']].to_csv('predictions/clean_dictionary_baseline.tsv', sep='\t', index=False, header=False)
df_test_clean_contains_hs[['id', 'HS']].to_csv('input/res/en_a.tsv', sep='\t', index=False, header=False)

# Evaluate the result of the clean_dictionary_baseline
evaluate.write_eval("scores_clean_dictionary_baseline")

taskA_fscore: 0.5267674779121141
taskA_precision: 0.598053074452919
taskA_recall: 0.5781472359058566
taskA_accuracy: 0.5383333333333333


Next steps:
- Done: Optimize the cleaning function cleanTxt to get better results (idea: check all tweets whose label differs between the raw and the cleaned run)
- Done: Count the number of hateful words in a tweet, normalize it, and use it as a feature (check online for the best dictionary models for hate speech detection)
- TODO: Hashtag segmentation (check whether a hashtag contains a hateful word, or add hateful hashtags to the dictionary)
- Extra: Create specialized dictionary

In [141]:
# Check which tweets get a different label with and without cleaning
temp = df_test_contains_hs.loc[df_test_contains_hs['HS']!=df_test_clean_contains_hs['HS']]
temp

Unnamed: 0,id,text,HS
1,30593,Several of the wild fires in #california and #...,0
3,31694,"#Europe, you've got a problem! We must hurry...",0
6,34192,"""GET this WORSE THAN SCUM OUT OF OUR COUNTRY! ...",0
8,30910,"These savages invade Our Country, disrupt citi...",0
13,32884,Illegals Cross Border Just in Time to Have #An...,0
...,...,...,...
2945,33043,@pastaelitist SAME SHIT. YOU SKINNY AF BITCH I...,0
2947,34113,@diianaaprince BITCH HOW DARE YOU SUPPORT NACY...,0
2953,30237,i'm so salty i tried to go in a store with no ...,0
2979,31970,HOES BE WANTING YOU TO FEEL SOME TYPE OF WAY A...,0


## Dictionary model: Attempt 2 - count hateful words + normalize

In [184]:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()

# Count the number of hateful words in the given text and normalize it
# by the length of the text (its number of whitespace-separated words)
def count_norm(text):
    length_text = len(text.split())
    if length_text == 0:
        return 0

    count_hs = 0
    for word in tokenizer.tokenize(text):
        if word in hs_lexicon:
            count_hs += 1

    return count_hs / length_text
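
As a worked example of the normalization, the sketch below uses a hypothetical two-word lexicon and splits on whitespace for both the count and the length. `toy_lexicon` and `normalized_count` are illustrative names; the notebook's `count_norm` counts matches over `TweetTokenizer` tokens instead.

```python
# Hypothetical two-word lexicon standing in for profanity_en.txt.
toy_lexicon = {"idiot", "moron"}

def normalized_count(text, lexicon):
    """Fraction of whitespace-separated tokens that appear in the lexicon."""
    tokens = text.split()
    if not tokens:
        return 0.0  # avoid dividing by zero for empty texts
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)

print(normalized_count("you idiot moron", toy_lexicon))  # 2 of 3 tokens match
print(normalized_count("have a nice day", toy_lexicon))  # 0.0
```

Dividing by the text length keeps a long tweet with one lexicon word from scoring the same as a three-word insult.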

### Find optimal threshold

In [160]:
df_train_dev['text'] = df_train_dev['text'].apply(cleanTxt)
df_train_dev['count_norm'] = df_train_dev['text'].apply(count_norm)
df_train_dev

Unnamed: 0,id,text,HS,count_norm
0,201,hurray saving u many way #lockthemup #buildthe...,1,0.000000
1,202,would young fighting age man vast majority one...,1,0.120000
2,203,illegal dump kid border like road kill refuse ...,1,0.208333
3,204,ny time nearly white state pose array problem ...,0,0.000000
4,205,orban brussel european leader ignoring person ...,0,0.125000
...,...,...,...,...
9995,19196,unfollowed fuck pussy,0,0.666667
9996,19197,stfu bitch go make satanic music u illuminatu ...,1,0.437500
9997,19198,honey fellow white chick let tell need shut fu...,0,0.217391
9998,19199,hate bitch talk niggaz kid everybody cant find...,1,0.312500


In [178]:
from sklearn.metrics import f1_score
from numpy import arange

# Iterate over different threshold values and save the results in a dictionary to find the optimal threshold

dict_threshold_f1 = {}
# The loop variable is named 'threshold' rather than 't', because 't' is the tokenizer used by contains_hs
for threshold in arange(0, 1, 0.01):
    df_train_dev['HS_c'] = (df_train_dev['count_norm'] >= threshold).astype(int)
    # Compute the macro-averaged F1-score with scikit-learn
    f1 = f1_score(df_train_dev['HS'].values, df_train_dev['HS_c'].values, average='macro')
    dict_threshold_f1[threshold] = f1

#dict_threshold_f1
sort_thresholds = sorted(dict_threshold_f1.items(), key=lambda x: x[1], reverse=True)
optimal_threshold = sort_thresholds[0][0]
print(f'Optimal threshold is: {optimal_threshold}')
sort_thresholds[:20]

Optimal threshold is: 0.13


[(0.13, 0.5981759105749043),
 (0.12, 0.59526778697202),
 (0.14, 0.5947650955275953),
 (0.15, 0.5934909975256196),
 (0.11, 0.5920512026950562),
 (0.16, 0.5891847698812984),
 (0.1, 0.5875672792231293),
 (0.17, 0.5839745676420859),
 (0.09, 0.5831243432852959),
 (0.18, 0.5824054887957064),
 (0.19, 0.5781317722949599),
 (0.08, 0.5781293780089316),
 (0.2, 0.5771748063368303),
 (0.07, 0.5722968820442701),
 (0.06, 0.5693519467213115),
 (0.21, 0.5672537536556869),
 (0.05, 0.5656072617951466),
 (0.22, 0.5614772993889874),
 (0.04, 0.5592136597011159),
 (0.03, 0.5561014013541175)]
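
The threshold search can be illustrated on toy data without pandas or scikit-learn. This is a sketch under stated assumptions: `macro_f1`, `ratios` and `labels` are hypothetical, and the sweep steps over integers divided by 100 so that the thresholds are clean two-decimal values.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for binary labels 0/1."""
    scores = []
    for label in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2

# Hypothetical toy data: normalized lexicon counts and gold labels.
ratios = [0.0, 0.05, 0.10, 0.20, 0.30, 0.50]
labels = [0, 0, 0, 1, 1, 1]

best_threshold, best_f1 = None, -1.0
for step in range(100):            # thresholds 0.00, 0.01, ..., 0.99
    threshold = step / 100
    preds = [1 if r >= threshold else 0 for r in ratios]
    score = macro_f1(labels, preds)
    if score > best_f1:            # strict '>' keeps the lowest threshold on ties
        best_threshold, best_f1 = threshold, score

print(best_threshold, best_f1)  # 0.11 1.0
```

On the toy data every threshold above 0.10 and up to 0.20 separates the classes perfectly, so the sweep reports the first such value.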

In [186]:
# Create new test dataframe and make HS column based on optimal threshold that we computed in previous cell
df_test_count_norm = df_test.copy()
df_test_count_norm['text'] = df_test_count_norm['text'].apply(cleanTxt)
df_test_count_norm['count_norm'] = df_test_count_norm['text'].apply(count_norm)
df_test_count_norm['HS'] = df_test_count_norm['count_norm'].apply(lambda x: 1 if (x >= optimal_threshold) else 0)

# Create prediction file for the dictionary_count_norm
df_test_count_norm[['id', 'HS']].to_csv('predictions/dictionary_count_norm.tsv', sep='\t', index=False, header=False)
df_test_count_norm[['id', 'HS']].to_csv('input/res/en_a.tsv', sep='\t', index=False, header=False)

# Evaluate the result of the dictionary_count_norm
evaluate.write_eval("scores_dictionary_count_norm")

taskA_fscore: 0.5786438836298406
taskA_precision: 0.5786389332644839
taskA_recall: 0.5800218938149972
taskA_accuracy: 0.5853333333333334


### GREAT NEWS! We have slightly beaten the polarity baseline! The F1 score went from 0.57823 to 0.57864.