# Hate Speech Detection

# Objective:

The goal of this challenge is to build a system that detects hate speech.

There is no legal definition of hate speech because people's opinions cannot easily be classified as hateful or offensive. Nevertheless, the United Nations defines hate speech as any type of verbal, written or behavioural communication that attacks or uses discriminatory language about a person or a group of people on the basis of their identity: religion, ethnicity, nationality, race, colour, ancestry, gender or any other identity factor.

With that definition in mind, the practical need is clear: social media platforms have to detect hate speech so that they can stop it from going viral or remove it in time.

# Dataset: 


The dataset I'm using for the hate speech detection task was downloaded from Kaggle. It was originally collected from Twitter and contains the following columns:

- index (appears as Unnamed: 0 when the CSV is loaded)

- count

- hate_speech

- offensive_language

- neither

- class

- tweet
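The sample rows printed further down suggest how these columns relate: `count` looks like the number of annotators per tweet, the three middle columns look like per-category vote counts, and `class` looks like the category with the most votes. That reading is an assumption drawn from the sample rows, not something the dataset description above states. A minimal sketch that checks it on the first five rows:

```python
# First five rows as shown by df.head():
# (count, hate_speech, offensive_language, neither, class)
rows = [
    (3, 0, 0, 3, 2),
    (3, 0, 3, 0, 1),
    (3, 0, 3, 0, 1),
    (3, 0, 2, 1, 1),
    (6, 0, 6, 0, 1),
]

def majority_class(votes):
    """Return the index (0, 1 or 2) of the category with the most votes."""
    return max(range(len(votes)), key=lambda i: votes[i])

# Assumption being checked: votes sum to `count`, and `class` is the majority category.
checks = [
    (sum(votes) == count, majority_class(votes) == cls)
    for count, *votes, cls in rows
]
print(checks)
```

Five rows cannot confirm the rule for the whole file, so the same check would need to run on the full DataFrame before relying on it.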

In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import cosine_similarity
import re
import nltk
import warnings
warnings.filterwarnings('ignore')
import pickle

In [2]:
df=pd.read_csv('twitter.csv')

In [3]:
df.head()

,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [4]:
df = df.drop(columns=['Unnamed: 0'])  # this column only repeats the row index

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   count               24783 non-null  int64 
 1   hate_speech         24783 non-null  int64 
 2   offensive_language  24783 non-null  int64 
 3   neither             24783 non-null  int64 
 4   class               24783 non-null  int64 
 5   tweet               24783 non-null  object
dtypes: int64(5), object(1)
memory usage: 1.1+ MB


In [6]:
df.shape

(24783, 6)

In [7]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
count,24783.0,3.243473,0.88306,3.0,3.0,3.0,3.0,9.0
hate_speech,24783.0,0.280515,0.631851,0.0,0.0,0.0,0.0,7.0
offensive_language,24783.0,2.413711,1.399459,0.0,2.0,3.0,3.0,9.0
neither,24783.0,0.549247,1.113299,0.0,0.0,0.0,0.0,9.0
class,24783.0,1.110277,0.462089,0.0,1.0,1.0,1.0,2.0


In [8]:
df.isnull().sum()

count                 0
hate_speech           0
offensive_language    0
neither               0
class                 0
tweet                 0
dtype: int64

In [9]:
df.duplicated().sum()

0

#### We will add a new column called labels to this dataset, mapping each class value to a name:

- 0: Hate Speech

- 1: Offensive Language

- 2: No Hate and Offensive

In [10]:
df['labels']=df['class'].map({0:'Hate Speech',
                            1:'Offensive Language',
                            2:'No Hate and Offensive'})

df.head()

,count,hate_speech,offensive_language,neither,class,tweet,labels
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,No Hate and Offensive
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,Offensive Language
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,Offensive Language
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,Offensive Language
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,Offensive Language
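`Series.map` with a dictionary yields `NaN` for any value missing from the dictionary, so it can be worth confirming that every row received a label and checking how the labels are distributed. A small sketch on a hypothetical five-row frame (`toy` is an illustrative name; its `class` values mirror the five rows above):

```python
import pandas as pd

# Hypothetical stand-in for df, using the class codes of the five rows above.
toy = pd.DataFrame({"class": [2, 1, 1, 1, 1]})

label_map = {0: "Hate Speech", 1: "Offensive Language", 2: "No Hate and Offensive"}
toy["labels"] = toy["class"].map(label_map)

# map() leaves NaN wherever a code has no entry in label_map.
n_unlabelled = int(toy["labels"].isna().sum())

# Share of each label; a strongly skewed split changes how accuracy should be read.
shares = toy["labels"].value_counts(normalize=True)
print(n_unlabelled)
print(shares)
```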


### Now we will select only the tweet and labels columns, which are all we need to train a hate speech detection model:

In [11]:
df1 = df[["tweet", "labels"]].copy()  # copy so that later column assignments do not act on a view of df
df1

,tweet,labels
0,!!! RT @mayasolovely: As a woman you shouldn't...,No Hate and Offensive
1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,Offensive Language
2,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,Offensive Language
3,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,Offensive Language
4,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,Offensive Language
...,...,...
24778,you's a muthaf***in lie &#8220;@LifeAsKing: @2...,Offensive Language
24779,"you've gone and broke the wrong heart baby, an...",No Hate and Offensive
24780,young buck wanna eat!!.. dat nigguh like I ain...,Offensive Language
24781,youu got wild bitches tellin you lies,Offensive Language


In [12]:
from nltk.corpus import stopwords
import string
# The stop-word list needs a one-time download: nltk.download('stopwords')
stopword = set(stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

In [13]:
def clean(text):
    """Lower-case a tweet, strip noise, drop stop words and stem the rest."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>+', '', text)                 # HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                     # newlines
    text = re.sub(r'\w*\d\w*', '', text)               # tokens containing digits
    words = [word for word in text.split(' ') if word not in stopword]
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)

df1["tweet"] = df1["tweet"].apply(clean)
df1

,tweet,labels
0,rt mayasolov woman shouldnt complain clean ho...,No Hate and Offensive
1,rt boy dat coldtyga dwn bad cuffin dat hoe ...,Offensive Language
2,rt urkindofbrand dawg rt ever fuck bitch sta...,Offensive Language
3,rt cganderson vivabas look like tranni,Offensive Language
4,rt shenikarobert shit hear might true might f...,Offensive Language
...,...,...
24778,yous muthafin lie coreyemanuel right tl tras...,Offensive Language
24779,youv gone broke wrong heart babi drove redneck...,No Hate and Offensive
24780,young buck wanna eat dat nigguh like aint fuck...,Offensive Language
24781,youu got wild bitch tellin lie,Offensive Language
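To make the effect of each substitution in `clean` easier to follow, here is a reduced sketch that applies the same regular expressions to one made-up tweet. It leaves out NLTK: `TINY_STOPWORDS` is a hypothetical stand-in for the English stop-word list and no stemming is applied, so its output differs from what the real function would return.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stop-word list.
TINY_STOPWORDS = {"a", "the", "is", "this", "at"}

def clean_sketch(text):
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # drop [bracketed] spans
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"<.*?>+", "", text)                 # drop HTML-like tags
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # drop punctuation
    text = re.sub(r"\n", "", text)                     # drop newlines
    text = re.sub(r"\w*\d\w*", "", text)               # drop tokens containing digits
    words = [w for w in text.split(" ") if w not in TINY_STOPWORDS]
    return " ".join(words)

example = "RT @user123: This is a TEST!!! Look at https://example.com <b>now</b>"
cleaned = clean_sketch(example)
print(repr(cleaned))
```

Removed tokens leave empty strings behind when the text is split on single spaces, so runs of consecutive spaces can remain in the result. `CountVectorizer` ignores them, but splitting with `text.split()` instead would avoid them.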


In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
X=df1['tweet']
y=df1['labels']

In [15]:
X.shape,y.shape

((24783,), (24783,))

In [16]:
cv=CountVectorizer()
X=cv.fit_transform(X)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)
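In the cell above the vectorizer is fitted on all tweets before the split, so its vocabulary also contains words that occur only in test tweets. One common alternative, shown here as a sketch on made-up sentences rather than as the notebook's method, is to split the raw text first and fit `CountVectorizer` on the training part only. The names `toy_texts`, `toy_labels`, `vec`, `Xtr` and `Xte` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical, already-cleaned texts and labels.
toy_texts = ["good day", "bad day", "good vibe", "bad vibe", "nice day", "awful vibe"]
toy_labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Split the raw strings first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    toy_texts, toy_labels, test_size=0.33, random_state=42
)

# ... then learn the vocabulary from the training texts only.
vec = CountVectorizer()
Xtr = vec.fit_transform(txt_train)
Xte = vec.transform(txt_test)  # words unseen during fitting are ignored

print(Xtr.shape, Xte.shape)
```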

In [17]:
from sklearn.model_selection import KFold

In [18]:
cv1=KFold(n_splits=10,shuffle=True,random_state=42)
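`cv1` is created here but is not used by the cells that follow. A sketch of how a `KFold` splitter is typically passed to `cross_val_score`, with the vectorizer and classifier wrapped in a pipeline so both are refitted inside each fold. The toy data, the 3-fold setting and the `f1_macro` scoring choice are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 12 short texts, two classes.
toy_texts = ["good day", "bad day", "good vibe", "bad vibe",
             "good mood", "bad mood", "good news", "bad news",
             "good idea", "bad idea", "good call", "bad call"]
toy_labels = ["pos", "neg"] * 6

# Vectorizer and classifier are fitted together on each training fold.
model = make_pipeline(CountVectorizer(), LinearSVC())

# Only 3 folds because the toy corpus is tiny; the notebook's cv1 uses 10.
folds = KFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, toy_texts, toy_labels, cv=folds, scoring="f1_macro")

print(scores, scores.mean())
```

Averaging the fold scores gives a less split-dependent estimate than the single 67/33 split used below.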

In [19]:
from sklearn.naive_bayes import BernoulliNB
BNB=BernoulliNB()
BNB.fit(X_train,y_train)

BernoulliNB()

In [20]:
y_pred=BNB.predict(X_test)


In [21]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,f1_score,precision_score,recall_score

In [22]:
# y_pred is passed first (scikit-learn expects y_true first), so rows of the
# confusion matrix are predictions and the reported precision/recall are swapped.
print('Confusion Matrix:', confusion_matrix(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))
print('Accuracy Score:', accuracy_score(y_pred, y_test))
# pos_label is ignored when average="macro", so it is omitted
print('F1 Score:', f1_score(y_pred, y_test, average="macro"))
print('Precision Score:', precision_score(y_pred, y_test, average="macro"))
print('Recall Score:', recall_score(y_pred, y_test, average="macro"))


Confusion Matrix: [[   0    0    0]
 [  28  474   47]
 [ 437  905 6288]]
Classification Report:                        precision    recall  f1-score   support

          Hate Speech       0.00      0.00      0.00         0
No Hate and Offensive       0.34      0.86      0.49       549
   Offensive Language       0.99      0.82      0.90      7630

             accuracy                           0.83      8179
            macro avg       0.45      0.56      0.46      8179
         weighted avg       0.95      0.83      0.87      8179

Accuracy Score: 0.8267514366059421
F1 Score: 0.4640794339137708
Precision Score: 0.44543607947147223
Recall Score: 0.5625011041163845
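Because the calls above pass `y_pred` first, each column of the printed matrix corresponds to a true class, so the column sums give the class sizes in the test split. The short calculation below derives those sizes and the accuracy a classifier would reach by always predicting the largest class, which helps put the 0.83 accuracy in context. The label order is assumed to be the alphabetical one shown in the classification report.

```python
# Confusion matrix printed above (rows: predicted class, columns: true class).
labels = ["Hate Speech", "No Hate and Offensive", "Offensive Language"]
matrix = [
    [0, 0, 0],
    [28, 474, 47],
    [437, 905, 6288],
]

total = sum(sum(row) for row in matrix)

# Column sums = number of test tweets that truly belong to each class.
true_counts = [sum(row[j] for row in matrix) for j in range(len(labels))]

# Accuracy of a baseline that always predicts the most frequent class.
baseline_accuracy = max(true_counts) / total

for name, n in zip(labels, true_counts):
    print(f"{name}: {n} tweets ({n / total:.1%})")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The all-zero first row shows that BernoulliNB never predicts Hate Speech on this split, even though the first column sums to 465 such tweets.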


In [23]:
from sklearn.svm import LinearSVC
classifier=LinearSVC()
classifier.fit(X_train,y_train)


LinearSVC()

In [24]:
y_pred=classifier.predict(X_test)

In [25]:
# y_pred is passed first (scikit-learn expects y_true first), so rows of the
# confusion matrix are predictions and the reported precision/recall are swapped.
print('Confusion Matrix:', confusion_matrix(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))
print('Accuracy Score:', accuracy_score(y_pred, y_test))
# pos_label is ignored when average="macro", so it is omitted
print('F1 Score:', f1_score(y_pred, y_test, average="macro"))
print('Precision Score:', precision_score(y_pred, y_test, average="macro"))
print('Recall Score:', recall_score(y_pred, y_test, average="macro"))

Confusion Matrix: [[ 143   31  190]
 [  39 1102  189]
 [ 283  246 5956]]
Classification Report:                        precision    recall  f1-score   support

          Hate Speech       0.31      0.39      0.34       364
No Hate and Offensive       0.80      0.83      0.81      1330
   Offensive Language       0.94      0.92      0.93      6485

             accuracy                           0.88      8179
            macro avg       0.68      0.71      0.70      8179
         weighted avg       0.89      0.88      0.88      8179

Accuracy Score: 0.8804254798875168
F1 Score: 0.695917161343886
Precision Score: 0.6822767748141855
Recall Score: 0.7132852369937952


In [26]:
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier()
dt.fit(X_train,y_train)

DecisionTreeClassifier()

In [27]:
y_pred=dt.predict(X_test)

In [28]:
# y_pred is passed first (scikit-learn expects y_true first), so rows of the
# confusion matrix are predictions and the reported precision/recall are swapped.
print('Confusion Matrix:', confusion_matrix(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))
print('Accuracy Score:', accuracy_score(y_pred, y_test))
# pos_label is ignored when average="macro", so it is omitted
print('F1 Score:', f1_score(y_pred, y_test, average="macro"))
print('Precision Score:', precision_score(y_pred, y_test, average="macro"))
print('Recall Score:', recall_score(y_pred, y_test, average="macro"))

Confusion Matrix: [[ 160   35  225]
 [  34 1129  219]
 [ 271  215 5891]]
Classification Report:                        precision    recall  f1-score   support

          Hate Speech       0.34      0.38      0.36       420
No Hate and Offensive       0.82      0.82      0.82      1382
   Offensive Language       0.93      0.92      0.93      6377

             accuracy                           0.88      8179
            macro avg       0.70      0.71      0.70      8179
         weighted avg       0.88      0.88      0.88      8179

Accuracy Score: 0.8778579288421567
F1 Score: 0.7020807772791996
Precision Score: 0.69756947060648
Recall Score: 0.7072243263075366


#### Pickling the model with the highest accuracy (LinearSVC)

In [31]:
# Save the trained LinearSVC; the with-block closes the file after writing
with open("HATE_SPEECH_linearSVC.pkl", 'wb') as f:
    pickle.dump(classifier, f)
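The pickled classifier expects the same bag-of-words columns it was trained on, so the fitted `CountVectorizer` is needed again at prediction time. A sketch of one way to handle that, assuming it is acceptable to store both objects in a single file. The toy data and the file name `bundle.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data standing in for the cleaned tweets.
toy_texts = ["good day", "bad day", "good vibe", "bad vibe"]
toy_labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()
clf = LinearSVC().fit(vec.fit_transform(toy_texts), toy_labels)

path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")

# Store vectorizer and classifier together so they cannot get out of sync.
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "classifier": clf}, f)

with open(path, "rb") as f:
    bundle = pickle.load(f)

# New text must go through the saved vectorizer before prediction.
new_X = bundle["vectorizer"].transform(["good mood"])
prediction = bundle["classifier"].predict(new_X)[0]
print(prediction)
```

In the notebook itself, new tweets would also have to pass through `clean` before being vectorized, since the model was trained on cleaned text.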