# Thesis 2020-2021: Hate Speech Detection
## Baselines Subtask A

The following two baselines have been considered by the organizers of this competition in order to provide a benchmark for the comparison of the submitted systems: 
1. The MFC (Most Frequent Classifier) baseline: a trivial model that assigns the most frequent label, estimated on the training set, to all the instances in the test set.
2. The SVC (Support Vector Classifier) baseline: a linear Support Vector Machine (SVM) based on a TF-IDF representation, where the hyper-parameters are the default values set by the scikit-learn Python library.

In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

import import_ipynb
import evaluate # here we import the local evaluate.ipynb jupyter notebook

importing Jupyter notebook from evaluate.ipynb


We start by reading the training and development data into pandas dataframes.
The TR and AG columns are removed because they are irrelevant for Subtask A.

In [4]:
df_train = pd.read_csv('data/hateval2019_en_train.csv')
df_dev = pd.read_csv('data/hateval2019_en_dev.csv')
print(df_train)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
df_train_dev = pd.concat([df_train, df_dev], ignore_index=True)
df_train_dev = df_train_dev.drop(['TR', 'AG'], axis=1)
df_train_dev

        id                                               text  HS  TR  AG
0      201  Hurray, saving us $$$ in so many ways @potus @...   1   0   0
1      202  Why would young fighting age men be the vast m...   1   0   0
2      203  @KamalaHarris Illegals Dump their Kids at the ...   1   0   0
3      204  NY Times: 'Nearly All White' States Pose 'an A...   0   0   0
4      205  Orban in Brussels: European leaders are ignori...   0   0   0
...    ...                                                ...  ..  ..  ..
8995  9196  @mmdwriter @JRubinBlogger @BenSasse I am proud...   0   0   0
8996  9197  @CheriJacobus Hollywood is complicit in the ra...   0   0   0
8997  9198  @amaziah_filani What a fucking cunt I hate see...   1   1   1
8998  9199                  Hysterical woman like @CoryBooker   0   0   0
8999  9200  Nearly every woman I know has #meToo in their ...   0   0   0

[9000 rows x 5 columns]


Unnamed: 0,id,text,HS
0,201,"Hurray, saving us $$$ in so many ways @potus @...",1
1,202,Why would young fighting age men be the vast m...,1
2,203,@KamalaHarris Illegals Dump their Kids at the ...,1
3,204,NY Times: 'Nearly All White' States Pose 'an A...,0
4,205,Orban in Brussels: European leaders are ignori...,0
...,...,...,...
9995,19196,@SamEnvers you unfollowed me? Fuck you pussy,0
9996,19197,@DanReynolds STFU BITCH! AND YOU GO MAKE SOME ...,1
9997,19198,"@2beornotbeing Honey, as a fellow white chick,...",0
9998,19199,I hate bitches who talk about niggaz with kids...,1


The English dataset consists of 13,000 tweets, of which 10,000 are meant for training and development (9,000 training tweets + 1,000 development tweets). As expected, the dataframe has 10,000 rows because the training and development data were appended together.

In [5]:
print(df_train_dev.shape) 

(10000, 3)


## TODO: Plot some great visualizations with this DATA!

## 1. MFC baseline
#### Now we implement the MFC (Most Frequent Classifier) baseline, a trivial model that assigns the most frequent label, estimated on the training set, to all the instances in the test set.

First, we compute the most frequent label for HS (Hate Speech), estimated on the combined training and development data.

In [6]:
print(df_train_dev['HS'].value_counts())
most_frequent_label = df_train_dev['HS'].value_counts().index[0]
print(f'The most frequent label for HS is: {most_frequent_label}. This means that most tweets in the training set are not labelled as hate speech.')

0    5790
1    4210
Name: HS, dtype: int64
The most frequent label for HS is: 0. This means that most tweets in the training set are not labelled as hate speech.


Next, we read the test set into a dataframe and assign to it the most frequent label that we just computed.

In [7]:
df_test = pd.read_csv('data/hateval2019_en_test.csv')
df_test = df_test.drop(['TR', 'AG'], axis=1)
df_test_mfc = df_test.copy()
df_test_mfc['HS'] = [most_frequent_label]*df_test_mfc.shape[0]
df_test_mfc

Unnamed: 0,id,text,HS
0,34243,"@local1025 @njdotcom @GovMurphy Oh, I could ha...",0
1,30593,Several of the wild fires in #california and #...,0
2,31427,@JudicialWatch My question is how do you reset...,0
3,31694,"#Europe, you've got a problem! We must hurry...",0
4,31865,This is outrageous! #StopIllegalImmigration #...,0
...,...,...,...
2995,31368,you can never take a L off a real bitch😩 im ho...,0
2996,30104,@Brian_202 likes to call me a cunt & a bitch b...,0
2997,31912,@kusha1a @Camio_the_wise @shoe0nhead 1. Never ...,0
2998,31000,If i see and know you a hoe why would i hit yo...,0
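The same baseline can also be expressed with scikit-learn's `DummyClassifier`. The following is a minimal sketch, not part of the original notebook: the toy label array only reuses the class counts printed above (5790 zeros, 4210 ones), and the feature matrices are placeholders because the classifier ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels with the class counts reported above (assumption: used for illustration only)
y_train_toy = np.array([0] * 5790 + [1] * 4210)

# DummyClassifier ignores the features, so placeholder matrices are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((5, 1))

# strategy="most_frequent" always predicts the most frequent training label
mfc = DummyClassifier(strategy="most_frequent")
mfc.fit(X_train_toy, y_train_toy)
mfc_pred = mfc.predict(X_test_toy)
print(mfc_pred)
```

With the real data, `y_train_toy` would be replaced by `df_train_dev['HS'].values` and the test placeholder by any array with one row per test tweet.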


In [198]:
# Compute the macro-averaged F1 score directly with scikit-learn
f1_mfc = f1_score(df_test['HS'].values, df_test_mfc['HS'].values, average='macro')
print(f'The macro-averaged F1 score for the MFC baseline is: {f1_mfc}') 

# Great! This corresponds with the paper!

The macro-averaged F1 score for the MFC baseline is: 0.36708860759493667




In [68]:
# create prediction file for the mfc_baseline
df_test_mfc[['id', 'HS']].to_csv('predictions/mfc_baseline.tsv', sep='\t', index=False, header=False)
df_test_mfc[['id', 'HS']].to_csv('input/res/en_a.tsv', sep='\t', index=False, header=False)

# create file with all the evaluations for the mfc_baseline
evaluate.write_eval("scores_mfc")

taskA_fscore: 0.36708860759493667
taskA_precision: 0.29
taskA_recall: 0.5
taskA_accuracy: 0.58
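The scores above can be checked by hand. The sketch below assumes 3,000 test tweets (13,000 minus the 10,000 used for training and development) of which 1,740 are labelled 0; the second number is not stated in the notebook and is derived from the reported accuracy of 0.58. Because the MFC baseline never predicts class 1, that class's precision, recall and F1 are set to 0 by convention.

```python
# Assumptions (derived, not read from the data): 3,000 test tweets, 58% labelled 0
n_test = 3000
n_neg = 1740
n_pos = n_test - n_neg

# The MFC baseline predicts 0 for every tweet.
# Class 0: every true 0 is found (recall 1.0), but all true 1s are also predicted as 0.
precision_0 = n_neg / n_test
recall_0 = 1.0
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Class 1: never predicted, so its scores are 0 by convention.
precision_1 = recall_1 = f1_1 = 0.0

# Macro averaging is the unweighted mean over the two classes.
macro_precision = (precision_0 + precision_1) / 2
macro_recall = (recall_0 + recall_1) / 2
macro_f1 = (f1_0 + f1_1) / 2
accuracy = n_neg / n_test

print(f'precision={macro_precision:.2f} recall={macro_recall:.2f} '
      f'accuracy={accuracy:.2f} f1={macro_f1:.4f}')
```

This shows why the macro-averaged F1 score is well below the accuracy: the class that is never predicted contributes an F1 of 0 to the average.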




## 2. SVC baseline
#### Now we implement the SVC (Support Vector Classifier) baseline, a linear Support Vector Machine (SVM) based on a TF-IDF representation, where the hyper-parameters are the default values set by the scikit-learn Python library.

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

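# Train and test are concatenated so that a single TfidfVectorizer() yields the same feature space for both.
# Note: this lets the test-set vocabulary and document frequencies influence the features; fitting on the
# training data only and calling vectorizer.transform() on the test data also gives matching feature sizes.
df_train_test = pd.concat([df_train_dev, df_test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0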

vectorizer = TfidfVectorizer()
X_svc_train_test = vectorizer.fit_transform(df_train_test['text'].values)

X_svc_train = X_svc_train_test[:10000]
X_svc_test = X_svc_train_test[10000:]

y_svc_train = df_train_dev['HS'].values
y_svc_test = df_test['HS'].values

#print(vectorizer.get_feature_names())

# train classifier
clf = SVC(probability=True, kernel='rbf') # rbf is the default kernel
clf.fit(X_svc_train, y_svc_train)
predictions = clf.predict_proba(X_svc_test)
pred = [0 if x[0]>=0.5 else 1 for x in predictions]
f1_svc = f1_score(y_svc_test, pred, average='macro')
print(f'The macro-averaged F1 score for the SVC baseline is: {f1_svc}')

# Does NOT correspond with the paper (paper SVC F1 score for Subtask A: 0.451). TODO: fix the model.
# Note: the baseline is described as a linear SVM, whereas this cell uses the RBF kernel.

The macro-averaged F1 score for the SVC baseline is: 0.3175842695733467
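One possible direction for the TODO above is to fit the TF-IDF vocabulary on the training data only and to use a linear SVM. The sketch below is an assumption about what such a setup could look like, not the organizers' exact configuration, and it is not claimed to reproduce the paper's score. The toy texts and labels are hypothetical stand-ins for `df_train_dev` and `df_test`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real training and test tweets
train_texts = ["I love this sunny day", "what a great match",
               "I hate you all", "you are awful people"]
train_labels = [0, 0, 1, 1]
test_texts = ["I hate this awful match", "a sunny great day"]

# The pipeline fits the vectorizer on the training texts only;
# at prediction time the test texts are transformed with that same vocabulary.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
toy_pred = model.predict(test_texts)

# Both matrices share the training vocabulary, so the feature sizes match
tfidf = model.named_steps['tfidfvectorizer']
n_train_features = len(tfidf.vocabulary_)
n_test_features = tfidf.transform(test_texts).shape[1]
print(toy_pred, n_train_features, n_test_features)
```

Words that appear only in the test texts are ignored by `transform`, which is why no concatenation of train and test is needed to obtain matching feature sizes.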


In [95]:
df_test_svc = df_test.copy()
df_test_svc['HS'] = pred
df_test_svc

Unnamed: 0,id,text,HS
0,34243,"@local1025 @njdotcom @GovMurphy Oh, I could ha...",0
1,30593,Several of the wild fires in #california and #...,0
2,31427,@JudicialWatch My question is how do you reset...,1
3,31694,"#Europe, you've got a problem! We must hurry...",1
4,31865,This is outrageous! #StopIllegalImmigration #...,1
...,...,...,...
2995,31368,you can never take a L off a real bitch😩 im ho...,1
2996,30104,@Brian_202 likes to call me a cunt & a bitch b...,1
2997,31912,@kusha1a @Camio_the_wise @shoe0nhead 1. Never ...,1
2998,31000,If i see and know you a hoe why would i hit yo...,1


In [96]:
# create prediction file for the svc_baseline
df_test_svc[['id', 'HS']].to_csv('predictions/svc_baseline.tsv', sep='\t', index=False, header=False)
df_test_svc[['id', 'HS']].to_csv('input/res/en_a.tsv', sep='\t', index=False, header=False)

# create file with all the evaluations for the svc_baseline
evaluate.write_eval("scores_svc")

taskA_fscore: 0.3175842695733467
taskA_precision: 0.6139206970651438
taskA_recall: 0.5070607553366174
taskA_accuracy: 0.42933333333333334


In [97]:
# EXTRA: Easy TF-IDF example to understand how the TfidfVectorizer() works.

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
X = X.toarray()
X

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524],
       [0.        , 0.6876236 , 0.        , 0.28108867, 0.        ,
        0.53864762, 0.28108867, 0.        , 0.28108867],
       [0.51184851, 0.        , 0.        , 0.26710379, 0.51184851,
        0.        , 0.26710379, 0.51184851, 0.26710379],
       [0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
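To see where these numbers come from, the matrix can be recomputed by hand. The sketch below assumes the `TfidfVectorizer` defaults: lowercased tokens of at least two word characters, raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are made up for this illustration.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(doc)) for doc in corpus]
vocab = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(counts):
    # Term count times idf, followed by L2 normalization of the row
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

manual_tfidf = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(value, 8) for value in manual_tfidf[0]])
```

For the first document, "first" gets the highest weight because it occurs in only two of the four documents, while "is", "the" and "this" occur in every document and therefore receive the minimum idf of 1.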