# Hate Tweet Detection on Twitter

# Data link: https://www.kaggle.com/vkrahul/twitter-hate-speech

### Procedure Followed

1. The data is imported into a pandas DataFrame.
1. The DataFrame is checked for null values. None are found, so we go straight on to exploring the data.
1. The DataFrame has a tweet column and a label column, where 0 represents neutral and 1 represents hate speech.
1. The data is preprocessed using the NLTK and re libraries.
1. Preprocessing removes everything apart from letters, converts all letters to lowercase, removes stop words, and stems the remaining words.
1. The preprocessed sentences are collected into a corpus, which is a list of sentences.
1. The corpus is vectorized using bag of words (CountVectorizer()).
1. The dependent (label) and independent (vectorized corpus) features of the data are separated.
1. These are split into training and testing data.
1. The training data is used to train a variety of classifiers.
1. The testing data is used to test those classifiers.
1. Finally, the model with the best accuracy is identified.

# Importing Necessary Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Loading Dataset

#### The labelled data was obtained from Kaggle

In [3]:
df = pd.read_csv("train.csv")

In [4]:
df.head()

| | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


###### Label 0 --> Neutral, Label 1 --> Hate Speech

## Checking for null values

In [5]:
df.isna().sum() # no null values

id       0
label    0
tweet    0
dtype: int64

In [6]:
df.columns

Index(['id', 'label', 'tweet'], dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [8]:
# showing the number of data points in each class
df.label.value_counts()

0    29720
1     2242
Name: label, dtype: int64
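
The counts above show a strong class imbalance, which is worth keeping in mind when reading the accuracy figures later on. As a rough sketch that uses only the counts printed above (the variable names are illustrative), the accuracy of a trivial model that always predicts label 0 can be worked out like this:

```python
# Class counts copied from the value_counts() output above.
n_neutral = 29720
n_hate = 2242
total = n_neutral + n_hate

# A classifier that always predicts label 0 is correct on every neutral tweet.
majority_baseline = n_neutral / total
hate_share = n_hate / total

print(f"Share of hate-speech tweets: {hate_share:.3f}")
print(f"Always-predict-0 accuracy:   {majority_baseline:.3f}")
```

On the full dataset, a model therefore needs to exceed roughly 93% accuracy before it does better than this trivial baseline.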

# Preprocessing

### Importing Necessary Libraries

In [9]:
import re
from nltk.stem.porter import PorterStemmer
import nltk
from nltk.corpus import stopwords

### For each sentence:
1. Everything except letters is removed.
1. Letters are converted to lowercase.
1. The sentence is tokenized into words.
1. Stop words are removed.
1. The remaining words are stemmed.
1. The words are joined back together to form the sentence.

In [10]:
# preprocessing
# NLTK's stop-word and tokenizer data must be downloaded beforehand (nltk.download)
ps = PorterStemmer() # stemmer
stop_words = set(stopwords.words("english")) # built once instead of once per word
corpus = []
for tweet in df["tweet"]:
    review = re.sub(r"[^a-zA-Z]", " ", tweet) # keep letters only
    review = review.lower()
    review = nltk.word_tokenize(review)
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)

In [11]:
corpus[:5]

['user father dysfunct selfish drag kid dysfunct run',
 'user user thank lyft credit use caus offer wheelchair van pdx disapoint getthank',
 'bihday majesti',
 'model love u take u time ur',
 'factsguid societi motiv']
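
To make the individual cleaning steps easier to follow, here is a minimal sketch that applies them to a single made-up tweet using only the standard library. The function name `clean_tweet`, the sample text, and the tiny `STOP_WORDS` set are assumptions for illustration; the notebook itself uses NLTK's full English stop-word list, `nltk.word_tokenize`, and the Porter stemmer, and stemming is left out here.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English list via stopwords.words("english").
STOP_WORDS = {"i", "a", "is", "and", "the", "for", "your", "when", "can", "t"}

def clean_tweet(text):
    """Apply the cleaning steps to one tweet (stemming is omitted here)."""
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace non-letters with spaces
    text = text.lower()                      # lowercase everything
    words = text.split()                     # simple whitespace tokenization
    words = [w for w in words if w not in STOP_WORDS]  # drop stop words
    return " ".join(words)                   # join back into one string

print(clean_tweet("@user Thanks for the #lyft credit, I can't use it!"))
```

Because the regular expression turns the apostrophe in "can't" into a space, the word is split into "can" and "t", which is why both appear in the illustrative stop-word set.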

In [12]:
#Separating the dependent variable
y = df["label"]
y[3]

0

#### Using Bag of Words

In [13]:
# Vectorizing the corpus 
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000)
x = cv.fit_transform(corpus).toarray()
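
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for a toy corpus. It is a simplification and not scikit-learn's implementation: the toy `docs` list is made up, tokens are split on whitespace, and no vocabulary cap is applied. `CountVectorizer` with default settings also ignores single-character tokens such as "u", and `max_features=5000` keeps only the 5000 most frequent terms.

```python
from collections import Counter

# Toy corpus (illustrative), shaped like the preprocessed strings in corpus.
docs = ["user thank lyft credit", "model love u take u time", "user love model"]

# Build a sorted vocabulary of all distinct tokens.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word; values are counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is mostly zeros because a single tweet uses only a few of the vocabulary words, which is also why the array printed for `x` in the notebook shows almost nothing but zeros.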

In [14]:
# independent variable
# bag-of-words matrix, converted from a sparse matrix to a dense array by toarray()
x 

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [15]:
x.shape

(31962, 5000)

In [16]:
y.shape

(31962,)

## Splitting the data

In [17]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,
                                                 test_size = 0.25,
                                                 random_state =  0)

In [18]:
print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

(23971, 5000) (7991, 5000) (23971,) (7991,)
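
The printed shapes can be checked with a little arithmetic. The sketch below assumes the test-set size is the requested fraction rounded up, with the remainder going to the training set, which is consistent with the shapes shown above.

```python
import math

n_samples = 31962   # rows in x, from x.shape above
test_size = 0.25    # fraction passed to train_test_split

# Assumption: the test count is rounded up and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```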


# Using Multinomial Naive Bayes Classifier

In [19]:
# Importing Model
from sklearn.naive_bayes import MultinomialNB
model1 = MultinomialNB()

In [20]:
# Training
model1.fit(x_train,y_train)

MultinomialNB()

In [21]:
# Testing
y_pred = model1.predict(x_test)

In [22]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report
print("The accuracy of the model is ",accuracy_score(y_test,y_pred))
print("\nThe Confusion matrix: ")
print(confusion_matrix(y_test,y_pred))
print("\nThe classification Report :")
print(classification_report(y_test,y_pred))

The accuracy of the model is  0.9439369290451758

The Confusion matrix: 
[[7168  292]
 [ 156  375]]

The classification Report :
              precision    recall  f1-score   support

           0       0.98      0.96      0.97      7460
           1       0.56      0.71      0.63       531

    accuracy                           0.94      7991
   macro avg       0.77      0.83      0.80      7991
weighted avg       0.95      0.94      0.95      7991
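
The class 1 figures in the report follow directly from the confusion matrix. As a worked example using the four numbers printed above (rows are true labels, columns are predicted labels), the accuracy, precision, recall, and F1 score can be recomputed by hand:

```python
# Confusion matrix copied from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 7168, 292   # true neutral: predicted neutral / predicted hate
fn, tp = 156, 375    # true hate:    predicted neutral / predicted hate

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted hate tweets that are hate
recall = tp / (tp + fn)      # share of actual hate tweets that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The accuracy is dominated by the 7168 correctly classified neutral tweets, while precision and recall for class 1 depend only on the much smaller counts in the hate-speech row and column.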



# Using Random Forest Classifier

In [23]:
# Loading and Training Model
from sklearn.ensemble import RandomForestClassifier 
model2 = RandomForestClassifier(n_estimators=10, criterion = "entropy")
model2.fit(x_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=10)

In [24]:
# Testing
y_pred1 = model2.predict(x_test)

In [25]:
print("The accuracy of RandomForest Classifier: ", accuracy_score(y_test,y_pred1))
print("\nThe classification Report :")
print(classification_report(y_test,y_pred1))

The accuracy of RandomForest Classifier:  0.9562007258165436

The classification Report :
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      7460
           1       0.79      0.46      0.58       531

    accuracy                           0.96      7991
   macro avg       0.88      0.73      0.78      7991
weighted avg       0.95      0.96      0.95      7991



# Using Gaussian Naive Bayes Classifier

In [26]:
# Loading and Training model
from sklearn.naive_bayes import GaussianNB
model3 = GaussianNB()
model3.fit(x_train,y_train)

GaussianNB()

In [27]:
y_pred2 = model3.predict(x_test)
print("\n The Accuracy of model : ", accuracy_score(y_test,y_pred2))
print("\n The Classification report :",classification_report(y_test,y_pred2))


 The Accuracy of model :  0.7596045551245151

 The Classification report :               precision    recall  f1-score   support

           0       0.97      0.76      0.86      7460
           1       0.17      0.70      0.28       531

    accuracy                           0.76      7991
   macro avg       0.57      0.73      0.57      7991
weighted avg       0.92      0.76      0.82      7991



# Using Perceptron model

In [28]:
from sklearn.linear_model import Perceptron
model4 = Perceptron()
model4.fit(x_train,y_train)

Perceptron()

In [29]:
y_pred3 = model4.predict(x_test)
print("\n The Accuracy of model : ", accuracy_score(y_test,y_pred3))
print("\n The Classification report :",classification_report(y_test,y_pred3))


 The Accuracy of model :  0.9519459391815793

 The Classification report :               precision    recall  f1-score   support

           0       0.97      0.98      0.97      7460
           1       0.66      0.58      0.61       531

    accuracy                           0.95      7991
   macro avg       0.81      0.78      0.79      7991
weighted avg       0.95      0.95      0.95      7991



# Result

#### Gaussian NB had the lowest accuracy (about 76%), whereas Random Forest (about 95.6%) and Perceptron (about 95.2%) had the highest, with Multinomial NB close behind at about 94.4%
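
Since only about 7% of the tweets are labelled as hate speech, it can also help to compare the models on their class 1 scores. The sketch below simply collects the figures printed in the reports above (rounded) and ranks the models by accuracy and by class 1 F1 score; the dictionary layout and key names are illustrative.

```python
# Figures copied from the outputs above (rounded).
results = {
    "Multinomial NB": {"accuracy": 0.944, "recall_1": 0.71, "f1_1": 0.63},
    "Random Forest":  {"accuracy": 0.956, "recall_1": 0.46, "f1_1": 0.58},
    "Gaussian NB":    {"accuracy": 0.760, "recall_1": 0.70, "f1_1": 0.28},
    "Perceptron":     {"accuracy": 0.952, "recall_1": 0.58, "f1_1": 0.61},
}

# Rank model names from best to worst under each metric.
by_accuracy = sorted(results, key=lambda m: results[m]["accuracy"], reverse=True)
by_f1 = sorted(results, key=lambda m: results[m]["f1_1"], reverse=True)

print("By accuracy:  ", by_accuracy)
print("By class-1 F1:", by_f1)
```

The two rankings differ: Random Forest leads on accuracy, while Multinomial NB has the highest recall and F1 score for the hate-speech class in the reports above.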