Importing the `train`, `test` and `test_labels` CSV files.
---

`trainData` includes a column of text and six other columns of labels with values of 0s and 1s indicating whether the text belongs to those labels.

---
`testData` includes only a single column of text.

---
`testLabels` includes *1s and 0s* corresponding to each label, indicating whether the text belongs to that label or not.


In [None]:
import pandas as pd

trainData = pd.read_csv("./train.csv")

testData = pd.read_csv("./test.csv")

testLabels = pd.read_csv("./test_labels.csv")

# Converting pds to np arrays
trainData = trainData.values
ogTestData = testData.values
testData = testData.values
testLabels = testLabels.values

trainData = trainData[:, 1:]
testData = testData[:, 1:]
testLabels = testLabels[:, 1:]

print("Train Data:")
print(trainData.shape)
print("\n")
print("Test Data:")
print(testData.shape)
print("\n")
print("Test Labels:")
print(testLabels.shape)

Train Data:
(159571, 7)


Test Data:
(153164, 1)


Test Labels:
(153164, 6)
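
To make the slicing step concrete, here is a minimal sketch on a made-up three-row frame. The column names `id` and `comment_text` and the sample rows are assumptions for illustration only; they are not read from the CSV files. It shows how `.values[:, 1:]` drops the first column and keeps the text plus the label columns.

```python
import pandas as pd

# Toy stand-in for train.csv; the `id` and `comment_text` column names are assumed.
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "comment_text": ["hello there", "you are awful", "nice edit"],
    "toxic": [0, 1, 0],
    "insult": [0, 1, 0],
})

arr = toy.values      # object-dtype NumPy array, one row per text
arr = arr[:, 1:]      # drop the first column, keep text + labels

print(arr.shape)      # (3, 3)
print(arr[1])         # second row: the text followed by its two label values
```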


Downloading `stopwords` and defining my data cleaning functions to clean the text.
---

In [None]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Rebind the name to a set of English stopwords for fast membership tests.
stopwords = set(stopwords.words('english'))

def lowercase(txt):
    return txt.lower()

def remove_punctuation(txt):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', txt)

def remove_stopwords(txt):
    # Keep only the words that are not in the stopword collection.
    return " ".join(word for word in txt.split() if word not in stopwords)

def remove_numbers(txt):
    # Strip runs of digits.
    return re.sub(r'\d+', '', txt)

def remove_url(txt):
    # Strip http(s):// links and bare www. addresses.
    return re.sub(r'https?://\S+|www\.\S+', '', txt)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mshum\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
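
As a quick check of what these helpers do to a single sentence, the sketch below applies them step by step. The sample sentence and the tiny `toy_stopwords` set are assumptions made up for this example (the notebook itself uses NLTK's English list), so the block runs without any download.

```python
import re

# Tiny stand-in stopword set (assumption); the notebook uses NLTK's English list.
toy_stopwords = {"the", "is", "a", "at", "you", "are"}

def remove_punctuation(txt):
    return re.sub(r'[^\w\s]', '', txt)

def remove_stopwords(txt, stop=toy_stopwords):
    return " ".join(w for w in txt.split() if w not in stop)

def remove_numbers(txt):
    return re.sub(r'\d+', '', txt)

sample = "You are the 2nd WORST editor, ever!!!"
step1 = sample.lower()              # "you are the 2nd worst editor, ever!!!"
step2 = remove_punctuation(step1)   # "you are the 2nd worst editor ever"
step3 = remove_stopwords(step2)     # "2nd worst editor ever"
step4 = remove_numbers(step3)       # "nd worst editor ever"
print(step4)
```

Note that removing digits can leave fragments such as `nd` behind, which is one reason cleaned text sometimes contains odd tokens.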


Cleaning the data.
---
`trainData` and `testData` get repopulated with the cleaned sentences, row by row.

In [None]:
def clean_text(txt):
    # Apply the cleaning steps in sequence.
    txt = lowercase(txt)
    txt = remove_punctuation(txt)
    txt = remove_stopwords(txt)
    txt = remove_numbers(txt)
    txt = remove_url(txt)
    return txt

# Clean the text column (column 0) of every row.
newTrainData = [clean_text(row[0]) for row in trainData]
newTestData = [clean_text(row[0]) for row in testData]

# Write the cleaned sentences back into the arrays.
trainData[:, 0] = newTrainData
testData[:, 0] = newTestData
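
One detail worth checking is the order of the steps: the URL pattern looks for `://` or `www.`, and both contain punctuation. The sketch below, on a made-up sentence with a placeholder URL, compares the two orders. It only illustrates the regex behaviour; reordering the steps in this notebook may change the accuracies reported later.

```python
import re

def remove_punctuation(txt):
    return re.sub(r'[^\w\s]', '', txt)

def remove_url(txt):
    return re.sub(r'https?://\S+|www\.\S+', '', txt)

def squeeze(txt):
    # Collapse repeated whitespace, as splitting and re-joining words would.
    return " ".join(txt.split())

sample = "see https://example.com/page for details"

# Punctuation first: "://" and "." are stripped, so the URL pattern no longer matches.
punct_then_url = squeeze(remove_url(remove_punctuation(sample)))

# URL first: the whole link is removed before punctuation is touched.
url_then_punct = squeeze(remove_punctuation(remove_url(sample)))

print(punct_then_url)   # see httpsexamplecompage for details
print(url_then_punct)   # see for details
```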

Removing rows from `testData` and the corresponding `testLabels` *whose labels are -1, as those texts were not being tested*. They are invalid test data and would negatively affect our accuracy and results.

In [None]:
import numpy as np
rows_to_keep = np.all(testLabels != -1, axis=1)

testLabels = testLabels[rows_to_keep]
testData = testData[rows_to_keep]

testData = testData.flatten()
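
The filtering step can be seen on a small made-up example. The arrays `toy_labels` and `toy_texts` below are hypothetical stand-ins; the mask is built the same way as `rows_to_keep` above.

```python
import numpy as np

# Toy label matrix: the second and fourth rows are marked -1 in every column.
toy_labels = np.array([
    [0, 1, 0],
    [-1, -1, -1],
    [1, 0, 0],
    [-1, -1, -1],
])
toy_texts = np.array([["a"], ["b"], ["c"], ["d"]], dtype=object)

# True only for rows in which no entry equals -1.
rows_to_keep = np.all(toy_labels != -1, axis=1)

kept_labels = toy_labels[rows_to_keep]
kept_texts = toy_texts[rows_to_keep].flatten()   # (n, 1) -> (n,)

print(rows_to_keep)   # [ True False  True False]
print(kept_texts)     # ['a' 'c']
```

Applying the same boolean mask to both arrays keeps texts and labels aligned row by row.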

Extracting `features` and `labels` from `trainData`.

In [None]:
features = trainData[:, 0]
labels = trainData[:, 1:]

First, I convert the text into word-count vectors using `CountVectorizer`, for both `testData` and `trainData`.

---

After that, I extract the labels of each class as a separate column vector.
This gives six train and six test label vectors, one per label, i.e. `toxic, severe_toxic, obscene, threat, insult, identity_hate`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

# Vectorize the text data using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(features)
X_test_vectorized = vectorizer.transform(testData.tolist())

# Labels = toxic, severe_toxic, obscene, threat, insult, identity_hate

yTrainToxic = labels[: , 0].astype(np.float64)
yTestToxic = testLabels[: , 0].astype(np.float64)

yTrainSevereToxic = labels[: , 1].astype(np.float64)
yTestSevereToxic = testLabels[: , 1].astype(np.float64)

yTrainObscene = labels[: , 2].astype(np.float64)
yTestObscene = testLabels[: , 2].astype(np.float64)

yTrainThreat = labels[: , 3].astype(np.float64)
yTestThreat = testLabels[: , 3].astype(np.float64)

yTrainInsult = labels[: , 4].astype(np.float64)
yTestInsult = testLabels[: , 4].astype(np.float64)

yTrainIHate = labels[: , 5].astype(np.float64)
yTestIHate = testLabels[: , 5].astype(np.float64)

Now I run the Multinomial Naive Bayes classifier for each label.
---
1. Creating a separate instance of `MultinomialNB()` for each label.
2. Fitting it on the vectorized train data and the column vector of the specific train label class, i.e. `toxic, severe_toxic, obscene, threat, insult, identity_hate`.
3. Getting the instance to predict on the vectorized test data.
4. Comparing the predictions with the column vector of the specific test label class to get the accuracy.
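
The four steps above can also be written as a single loop. The following is only a sketch on made-up data with two labels instead of six; the texts, label values and the `classifiers` and `predictions` dictionaries are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

label_names = ["toxic", "insult"]  # shortened list for the toy example

train_texts = ["you idiot", "nice work", "awful idiot", "thanks friend"]
y_train = np.array([[1, 1], [0, 0], [1, 1], [0, 0]])
test_texts = ["idiot", "thanks"]
y_test = np.array([[1, 1], [0, 0]])

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifiers, predictions = {}, {}
for i, name in enumerate(label_names):
    clf = MultinomialNB()                       # step 1: one instance per label
    clf.fit(X_train, y_train[:, i])             # step 2: fit on that label's column
    predictions[name] = clf.predict(X_test)     # step 3: predict the test set
    acc = accuracy_score(y_test[:, i], predictions[name])  # step 4: accuracy
    classifiers[name] = clf
    print(f"{name} test accuracy: {acc}")
```

Keeping the fitted models in a dictionary keyed by label name makes it easy to reuse any of them later.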

In [None]:
toxic_classifier = MultinomialNB()

toxic_classifier.fit(X_train_vectorized, yTrainToxic)

y_test_toxic = toxic_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestToxic, y_test_toxic)

print("Toxic Test Accuracy:", accuracy)

Toxic Test Accuracy: 0.9209884647847698


In [None]:
severeToxic_classifier = MultinomialNB()

severeToxic_classifier.fit(X_train_vectorized, yTrainSevereToxic)

y_test_sToxic = severeToxic_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestSevereToxic, y_test_sToxic)

print("Severe Toxic Test Accuracy:", accuracy)

Severe Toxic Test Accuracy: 0.9820407014911375


In [None]:
obscene_classifier = MultinomialNB()

obscene_classifier.fit(X_train_vectorized, yTrainObscene)

y_test_obscene = obscene_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestObscene, y_test_obscene)

print("Obscene Test Accuracy:", accuracy)

Obscene Test Accuracy: 0.9488730501109757


In [None]:
threat_classifier = MultinomialNB()

threat_classifier.fit(X_train_vectorized, yTrainThreat)

y_test_threat = threat_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestThreat, y_test_threat)

print("Threat Test Accuracy:", accuracy)

Threat Test Accuracy: 0.9898871487073682


In [None]:
insult_classifier = MultinomialNB()
insult_classifier.fit(X_train_vectorized, yTrainInsult)

y_test_insult = insult_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestInsult, y_test_insult)

print("Insult Test Accuracy:", accuracy)

Insult Test Accuracy: 0.9462158867110569


In [None]:
hate_classifier = MultinomialNB()
hate_classifier.fit(X_train_vectorized, yTrainIHate)

y_test_hate = hate_classifier.predict(X_test_vectorized)

accuracy = accuracy_score(yTestIHate, y_test_hate)

print("Hate Test Accuracy:", accuracy)

Hate Test Accuracy: 0.9806339679264747
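
When reading accuracies like these, it helps to remember that accuracy alone can look high if a label is rare. The sketch below uses made-up numbers (not taken from this dataset) to show a "classifier" that never predicts the positive class and still scores 98% accuracy, while its recall is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: only 2 of 100 samples are positive.
y_true = np.zeros(100, dtype=int)
y_true[:2] = 1

# A "classifier" that always predicts the negative class.
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print(acc)  # 0.98
print(rec)  # 0.0
```

Reporting recall or a full `classification_report` next to the accuracy gives a fuller picture for rare labels.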


Combining the six prediction vectors into a single NumPy array with one column per class.
---

In [None]:
import numpy as np

# Stack the six prediction vectors side by side:
# one row per text, one column per label.
predictedValues = np.column_stack((
    y_test_toxic,
    y_test_sToxic,
    y_test_obscene,
    y_test_threat,
    y_test_insult,
    y_test_hate,
))

Formatting the labels so that the label names are output instead of *1s and 0s*, for both the predicted labels (`predictedValues`) and the true labels, i.e. `testLabels`.

In [None]:
label_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def format_labels(matrix):
    # Turn each row of 0/1 flags into a string of label names, each followed
    # by a space; rows with no label set become the empty string.
    output = []
    for row in matrix:
        output.append("".join(name + " " for name, flag in zip(label_names, row) if flag == 1))
    return output

testLabelsOutput = format_labels(testLabels)
predictedValuesOutput = format_labels(predictedValues)

Outputting each `testData` text with its true labels and its predicted labels, which we got by predicting the `testData` for each class separately and then combining the predictions into a single matrix.
---
This way we get a clear comparison between the true labels of the text and what our model was able to predict. Only rows where both the true and the predicted labels are non-empty are printed.

In [None]:
for row in range(testData.shape[0]):
    if (testLabelsOutput[row] != "" and predictedValuesOutput[row] != ""):
        print("Text: ", testData[row])
        print("True Labels: ", testLabelsOutput[row])
        print("Naive Bayes Predicted Labels: ", predictedValuesOutput[row])
        print("\n")

Text:  dj robinson gay hell sucks dick much
True Labels:  toxic obscene insult identity_hate 
Naive Bayes Predicted Labels:  toxic obscene insult 


Text:  fuck antisemitic cunt
True Labels:  toxic obscene insult 
Naive Bayes Predicted Labels:  toxic severe_toxic obscene insult 


Text:  arrogant self serving immature idiot get right
True Labels:  toxic obscene insult 
Naive Bayes Predicted Labels:  toxic insult 


Text:  hate america going bomb shit cities quezas rain
True Labels:  toxic obscene threat 
Naive Bayes Predicted Labels:  toxic 


Text:  bold textyou suck u suck hannah montana
True Labels:  toxic obscene insult 
Naive Bayes Predicted Labels:  toxic severe_toxic obscene insult 


Text:  ghay ass fucker
True Labels:  toxic obscene insult identity_hate 
Naive Bayes Predicted Labels:  toxic severe_toxic obscene insult 


Text:  simple stupid
True Labels:  toxic obscene insult 
Naive Bayes Predicted Labels:  toxic 


Text:  random deletion deleted xanax bars fuck mah nigga fuck