# Hate Speech Detection with Machine Learning

Hate speech is a serious problem that appears daily on social media platforms such as Twitter and Facebook. Many of the posts containing it come from accounts that share political opinions.

In [2]:
import pandas as pd
import numpy as np


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


In [10]:
import re

In [12]:
import nltk

In [14]:
stemmer = nltk.SnowballStemmer("english")

In [16]:
from nltk.corpus import stopwords

In [18]:
import string

In [22]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aser\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [24]:
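# Rebinds the name stopwords from the NLTK corpus reader to a plain set of English stop words,
# so re-running this cell raises an AttributeError (a set has no .words method)
stopwords = set(stopwords.words('english'))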

In [26]:
data = pd.read_csv("twitter.csv")
data.head()

| | Unnamed: 0 | count | hate_speech | offensive_language | neither | class | tweet |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0 | 3 | 2 | !!! RT @mayasolovely: As a woman you shouldn't... |
| 1 | 1 | 3 | 0 | 3 | 0 | 1 | !!!!! RT @mleew17: boy dats cold...tyga dwn ba... |
| 2 | 2 | 3 | 0 | 3 | 0 | 1 | !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby... |
| 3 | 3 | 3 | 0 | 2 | 1 | 1 | !!!!!!!!! RT @C_G_Anderson: @viva_based she lo... |
| 4 | 4 | 6 | 0 | 6 | 0 | 1 | !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you... |


In [28]:
print(data)

       Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0               0      3            0                   0        3      2   
1               1      3            0                   3        0      1   
2               2      3            0                   3        0      1   
3               3      3            0                   2        1      1   
4               4      6            0                   6        0      1   
...           ...    ...          ...                 ...      ...    ...   
24778       25291      3            0                   2        1      1   
24779       25292      3            0                   1        2      2   
24780       25294      3            0                   3        0      1   
24781       25295      6            0                   6        0      1   
24782       25296      3            0                   0        3      2   

                                                   tweet  
0      !!! RT @m

*Add a new column named labels to this dataset, containing one of the following values for each tweet:*

*Hate Speech*

*Offensive Language*

*No Hate and Offensive*

In [30]:
data["labels"] = data["class"].map({0: "Hate Speech", 
                                   1: "Offensive Language", 
                                   2: "No Hate and Offensive"})

In [32]:
print(data.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet                 labels  
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language  
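The mapping step can be checked on a small frame before trusting it on the full dataset. The sketch below is only an illustration: the `demo` frame and its `class` value of 3 are hypothetical and do not come from twitter.csv. It shows that `Series.map` with a dictionary leaves any unmapped value as NaN, and that `value_counts` gives a quick view of how many rows fall under each label.

```python
import pandas as pd

# Hypothetical mini-frame; the value 3 is deliberately outside the mapping
demo = pd.DataFrame({"class": [2, 1, 1, 0, 3]})

label_names = {0: "Hate Speech",
               1: "Offensive Language",
               2: "No Hate and Offensive"}

# Values missing from the dictionary become NaN instead of raising an error
demo["labels"] = demo["class"].map(label_names)

unmapped = int(demo["labels"].isna().sum())
counts = demo["labels"].value_counts()

print("unmapped rows:", unmapped)
print(counts)
```

Running the same two checks on the real `data` frame would show whether every row received a label and how balanced the three classes are.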


*Here I keep only the tweet and labels columns, which are all that is needed to train the hate speech detection model:*

In [39]:
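# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
data = data[["tweet", "labels"]].copy()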

In [41]:
print(data.head())

                                               tweet                 labels
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language


*Here is the function that cleans the text:*

In [47]:
def clean(text):
    # Lowercase, then strip bracketed text, URLs, HTML tags, punctuation,
    # newlines, and any token that contains a digit
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop English stop words, then stem each remaining word
    text = [word for word in text.split(' ') if word not in stopwords]
    text = " ".join(text)
    text = [stemmer.stem(word) for word in text.split(' ')]
    text = " ".join(text)
    return text

# Apply the cleaning function to every tweet
data["tweet"] = data["tweet"].apply(clean)
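To see what the regex steps do to a single string, here is a reduced sketch of the same substitutions. It is not the notebook's function: stemming is left out, `DEMO_STOPWORDS` is a tiny hypothetical stop word set standing in for NLTK's English list, and the example tweet is made up.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stop word list
DEMO_STOPWORDS = {"a", "as", "you", "the", "is"}

def clean_without_stemming(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                 # bracketed text
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # URLs
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # tokens containing digits
    words = [w for w in text.split(' ') if w not in DEMO_STOPWORDS]
    return " ".join(words)

example = "!!! RT @user123: Check https://example.com [link] <b>NOW</b> 2day"
cleaned = clean_without_stemming(example)

# split() with no argument collapses the leftover runs of spaces
print(cleaned.split())
```

Because the function splits on a single space, removed tokens leave empty strings behind and the joined result keeps runs of spaces. That is harmless for the bag-of-words step, since CountVectorizer's default tokenizer ignores whitespace.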

*Now split the dataset into training and test sets and train a machine learning model for the task of hate speech detection:*

In [52]:
x = np.array(data["tweet"])
y = np.array(data["labels"])

In [54]:
cv = CountVectorizer()

In [56]:
X = cv.fit_transform(x)

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [60]:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

In [62]:
clf.score(X_test, y_test)

0.8767575498227167

In [64]:
accuracy = clf.score(X_test, y_test) * 100
accuracy

87.67575498227167
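A single accuracy figure can hide weak performance on a class that has few examples. The sketch below uses hypothetical `y_true` and `y_pred` lists (not the notebook's predictions) to show how scikit-learn's `classification_report` breaks the score down per label. Applying it to `y_test` and `clf.predict(X_test)` would be the analogous step in the notebook.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical ground truth and predictions, for illustration only
y_true = ["Hate Speech", "Hate Speech",
          "Offensive Language", "Offensive Language", "Offensive Language",
          "No Hate and Offensive"]
y_pred = ["Offensive Language", "Hate Speech",
          "Offensive Language", "Offensive Language", "Offensive Language",
          "No Hate and Offensive"]

overall = accuracy_score(y_true, y_pred)

# output_dict=True returns nested dictionaries instead of a formatted string
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

print("accuracy:", overall)
print("Hate Speech recall:", report["Hate Speech"]["recall"])
```

In this toy case five of six predictions are correct, yet only half of the Hate Speech examples are recovered, which the overall accuracy does not reveal.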

### Now let’s test this machine learning model to see if it detects hate speech or not:

In [73]:
sample = "Let's unite and kill all the people who are protesting against the government"
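# Vectorize the sample with the fitted CountVectorizer; a separate name keeps the data DataFrame intact
# Note: the sample is vectorized as written, without passing through clean()
sample_vector = cv.transform([sample]).toarray()
print(clf.predict(sample_vector))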

['Offensive Language']


# Summary
So this is how you can train a machine learning model to detect hate speech using the Python programming language.

### Saving the model

In [79]:
import pickle

In [83]:
# The with block closes the file even if pickling fails
with open('hate_speech_detection_model.pkl', 'wb') as f:
    pickle.dump(clf, f)

### Loading the model

In [87]:
with open('hate_speech_detection_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

0.8767575498227167
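The pickle file above holds only the classifier, so predicting on new text in a fresh session also needs the fitted CountVectorizer. One possible approach, sketched here as an alternative rather than the notebook's method, is to wrap both steps in a scikit-learn Pipeline and pickle that single object. The training texts are hypothetical toy data, and the pipeline is serialized to bytes in memory to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, only to keep the example self-contained
texts = ["you are awful", "have a nice day",
         "awful awful person", "nice weather today"]
labels = ["Offensive Language", "No Hate and Offensive",
          "Offensive Language", "No Hate and Offensive"]

# The vectorizer and the classifier are fitted and stored as one object
pipeline = make_pipeline(CountVectorizer(),
                         DecisionTreeClassifier(random_state=0))
pipeline.fit(texts, labels)

# Serialize to bytes in memory; pickle.dump with a file handle works the same way
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw strings directly
sample = ["what an awful idea"]
original_prediction = pipeline.predict(sample)
restored_prediction = restored.predict(sample)
print(restored_prediction)
```

Any text cleaning applied before training would still have to be applied to new samples before calling `predict`, since the pipeline here only covers vectorizing and classifying.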
