# Thesis 2020-2021: Hate Speech Detection
## Baselines Subtask A

The competition organizers provide two baselines as a benchmark for comparing the submitted systems:
1. The MFC (Most Frequent Classifier) baseline: a trivial model that assigns the most frequent label, estimated on the training set, to all instances in the test set.
2. The SVC (Support Vector Classifier) baseline: a linear Support Vector Machine (SVM) based on a TF-IDF representation, with the hyper-parameters left at the default values set by the scikit-learn Python library.

In [1]:
import pandas as pd
import numpy as np

We start by reading the training and development data into a single pandas DataFrame.
The TR and AG columns are removed because they are irrelevant for Subtask A.

In [78]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
df_train = pd.read_csv('data/hateval2019_en_train.csv')
df_dev = pd.read_csv('data/hateval2019_en_dev.csv')
df_train_dev = pd.concat([df_train, df_dev], ignore_index=True)
df_train_dev = df_train_dev.drop(['TR', 'AG'], axis=1)
df_train_dev

Unnamed: 0,id,text,HS
0,201,"Hurray, saving us $$$ in so many ways @potus @...",1
1,202,Why would young fighting age men be the vast m...,1
2,203,@KamalaHarris Illegals Dump their Kids at the ...,1
3,204,NY Times: 'Nearly All White' States Pose 'an A...,0
4,205,Orban in Brussels: European leaders are ignori...,0
...,...,...,...
9995,19196,@SamEnvers you unfollowed me? Fuck you pussy,0
9996,19197,@DanReynolds STFU BITCH! AND YOU GO MAKE SOME ...,1
9997,19198,"@2beornotbeing Honey, as a fellow white chick,...",0
9998,19199,I hate bitches who talk about niggaz with kids...,1


The English dataset consists of 13,000 tweets, of which 10,000 are meant for training and development (9,000 training tweets + 1,000 development tweets). As expected, the dataframe has 10,000 rows because the training and development data have been appended together.

In [75]:
print(df_train_dev.shape) 

(10000, 3)


## TODO: Plot some great visualizations with this DATA!

## 1. MFC baseline
#### Now we will program the MFC (Most Frequent Classifier) baseline: a trivial model that assigns the most frequent label, estimated on the training set, to all instances in the test set.

First, we compute the most frequent label for HS (Hate Speech), estimated on the combined training and development data (`df_train_dev`).

In [77]:
print(df_train_dev['HS'].value_counts())
most_frequent_label = df_train_dev['HS'].value_counts().index[0]
print(f'The most frequent label for HS is: {most_frequent_label}. This means that most tweets in the training set are not labelled as hate speech.')

0    5790
1    4210
Name: HS, dtype: int64
The most frequent label for HS is: 0. This means that most tweets in the training set are not labelled as hate speech.
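The counts above also give the accuracy that a most-frequent-label classifier reaches on the data it was estimated from. The following is a minimal sketch using only the standard library; the two counts are copied from the output above, and the variable names are illustrative rather than taken from the notebook.

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
label_counts = Counter({0: 5790, 1: 4210})

# most_common(1) returns a list with the single (label, count) pair of the majority class.
majority_label, majority_count = label_counts.most_common(1)[0]

# Share of the majority class = accuracy of always predicting that class on this data.
majority_share = majority_count / sum(label_counts.values())

print(f"Majority label: {majority_label}, share: {majority_share:.1%}")
```

Any trained model should therefore exceed 57.9% accuracy on this data before it can be considered better than the trivial baseline.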


Next, we read the test set into a dataframe, drop its label columns, and assign the most frequent label that we just computed to every row.

In [64]:
df_test = pd.read_csv('data/hateval2019_en_test.csv')
df_test = df_test.drop(['HS', 'TR', 'AG'], axis=1)
# Assigning a scalar broadcasts the same predicted label to every row.
df_test['HS'] = most_frequent_label
df_test

Unnamed: 0,id,text,HS
0,34243,"@local1025 @njdotcom @GovMurphy Oh, I could ha...",0
1,30593,Several of the wild fires in #california and #...,0
2,31427,@JudicialWatch My question is how do you reset...,0
3,31694,"#Europe, you've got a problem! We must hurry...",0
4,31865,This is outrageous! #StopIllegalImmigration #...,0
...,...,...,...
2995,31368,you can never take a L off a real bitch😩 im ho...,0
2996,30104,@Brian_202 likes to call me a cunt & a bitch b...,0
2997,31912,@kusha1a @Camio_the_wise @shoe0nhead 1. Never ...,0
2998,31000,If i see and know you a hoe why would i hit yo...,0
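The notebook does not yet score these predictions. The sketch below shows one way a constant prediction could be evaluated, assuming accuracy and macro-averaged F1 as metrics (an assumption, since this section does not name a metric). The gold labels `y_true` are hypothetical toy values; in the notebook they would have to be saved from the test file before its `HS` column is dropped.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score of one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical gold labels (NOT taken from the dataset) and a constant prediction of 0.
y_true = [0, 0, 0, 1, 1]
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1(y_true, y_pred):.3f}")
```

Because the constant classifier never predicts class 1, that class contributes an F1 of 0, which pulls the macro average well below the accuracy.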


## 2. SVC baseline
#### Now we will program the SVC (Support Vector Classifier) baseline: a linear Support Vector Machine (SVM) based on a TF-IDF representation, with the hyper-parameters left at the default values set by the scikit-learn Python library.
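To make the TF-IDF representation concrete, here is a minimal standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector. The function name `tfidf_rows` and the example documents are illustrative, and the sketch returns one dictionary per document instead of a sparse matrix.

```python
import math
import re
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the vectorizer's default token pattern.
    tokenised = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


example_docs = ["migration policy debate", "policy update"]
for row in tfidf_rows(example_docs):
    print(row)
```

Terms that occur in fewer documents receive a larger weight, so in the first example document "migration" outweighs "policy", which appears in both.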

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_train_dev['text'].values)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out())
print(X.shape)

(10000, 26851)
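With the TF-IDF features in place, the remaining step is to fit the classifier. The sketch below chains the vectorizer and a linear SVM in a scikit-learn pipeline, both with default hyper-parameters. It assumes `LinearSVC` is the intended estimator (`SVC(kernel='linear')` would be the other reading of "linear SVM"), and the toy texts and labels are hypothetical stand-ins for `df_train_dev['text']` and `df_train_dev['HS']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the training tweets and their HS labels.
train_texts = [
    "you people are not welcome here go away",
    "they should all be sent away from here",
    "lovely weather for a walk in the park",
    "great match last night, well played",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer on the training texts only and
# reuses that vocabulary when transforming unseen texts.
svc_baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
svc_baseline.fit(train_texts, train_labels)

# Hypothetical unseen texts standing in for the test tweets.
test_texts = ["what a lovely park", "go away you are not welcome"]
predictions = svc_baseline.predict(test_texts)
print(predictions)
```

Wrapping both steps in one pipeline avoids fitting the vectorizer on test data by accident, which would leak vocabulary statistics into the evaluation.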
