<a href="https://colab.research.google.com/github/frnchskolymps/CCMACLRL_EXERCISES_COM222ML/blob/main/Exercise_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 7: Hate Speech Classification using Multinomial Naive Bayes

Instructions:
- You do not need to split your data. Use the training, validation, and test sets provided below.
- Use Multinomial Naive Bayes to train a model that classifies a sentence as hate speech or non-hate speech.
- A label of zero (0) means the sentence is non-hate speech.
- A label of one (1) means the sentence is hate speech.

Apply text pre-processing techniques such as:
- Converting to lowercase
- Removing stop words
- Removing digits and special characters
- Stemming or lemmatization, but not both
- Count Vectorizer or TF-IDF Vectorizer, but not both

Evaluate your model by:
- Providing your own input
- Creating a Confusion Matrix
- Calculating the Accuracy, Precision, Recall and F1-Score

In [194]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import seaborn as sns
import re
import os, types

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, precision_score, recall_score, accuracy_score, balanced_accuracy_score, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [195]:
splits = {'train': 'unique_train_dataset.csv', 'validation': 'unique_validation_dataset.csv', 'test': 'unique_test_dataset.csv'}

**Training Set**

Use this to train your model

In [196]:
df_train = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["train"])

**Validation Set**

Use this set to evaluate your model

In [197]:
df_validation = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["validation"])

**Test Set**
  
Use this set to test your model

In [198]:
df_test = pd.read_csv("hf://datasets/mapsoriano/2016_2022_hate_speech_filipino/" + splits["test"])

## A. Understanding your training data

1. Check the first 10 rows of the training dataset

In [199]:
df_train.head(10)

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
5,"""Ang sinungaling sa umpisa ay sinungaling hang...",1
6,Leni Kiko,0
7,Nahiya si Binay sa Makati kaya dito na lang sa...,1
8,Another reminderHalalan,0
9,[USERNAME] Maybe because VP Leni Sen Kiko and ...,0


2. Check how many rows and columns are in the training dataset using `.info()`

In [200]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21773 entries, 0 to 21772
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    21773 non-null  object
 1   label   21773 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 340.3+ KB


3. Check for NaN values

In [201]:
df_train.isnull().sum()

Unnamed: 0,0
text,0
label,0


4. Check for duplicate rows

In [202]:
df_train.duplicated().sum()

0

5. Check how many rows belong to each class

In [203]:
print(len(df_train))

21773
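
The cell above prints the total number of rows, not the number per class. A minimal sketch of a per-class count, using a hypothetical toy frame (`toy` and its values are made up for illustration); on the notebook's data the same call would be `df_train["label"].value_counts()`:

```python
import pandas as pd

# Hypothetical toy data standing in for df_train.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": [1, 0, 1, 1, 0],
})

# Absolute number of rows per label.
class_counts = toy["label"].value_counts()

# Share of rows per label, useful for spotting class imbalance.
class_share = toy["label"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```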


## B. Text pre-processing

6. Remove duplicate rows

In [204]:
df_train = df_train.drop_duplicates()  # assign the result; drop_duplicates() does not modify in place
df_train

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
...,...,...
21768,Marcos Talunan Marcos Magnanakaw,1
21769,Grabe kayo kay binay ??????????,0
21770,[USERNAME] Cnu ba naman ang hindImabibighani s...,0
21771,RT [USERNAME]: Tabi tabi yung mga nagsasabing ...,1


In [205]:
df_train.duplicated().sum()

0

7. Remove rows with NaN values

In [206]:
df_train = df_train.dropna()  # assign the result; dropna() does not modify in place
df_train

Unnamed: 0,text,label
0,Presidential candidate Mar Roxas implies that ...,1
1,Parang may mali na sumunod ang patalastas ng N...,1
2,Bet ko. Pula Ang Kulay Ng Posas,1
3,[USERNAME] kakampink,0
4,Bakit parang tahimik ang mga PINK about Doc Wi...,1
...,...,...
21768,Marcos Talunan Marcos Magnanakaw,1
21769,Grabe kayo kay binay ??????????,0
21770,[USERNAME] Cnu ba naman ang hindImabibighani s...,0
21771,RT [USERNAME]: Tabi tabi yung mga nagsasabing ...,1
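
Both `drop_duplicates()` and `dropna()` return a new DataFrame by default, so the result has to be assigned for the removal to take effect. A small sketch on a hypothetical toy frame (the names `toy` and `cleaned` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data: one duplicated row and one missing text value.
toy = pd.DataFrame({"text": ["a", "a", None], "label": [0, 0, 1]})

# Calling the methods without assignment leaves the original untouched.
toy.drop_duplicates()
toy.dropna()
unchanged_len = len(toy)

# Assigning (or chaining) keeps the cleaned result.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(unchanged_len, len(cleaned))
```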


In [207]:
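def preprocess_text(text):
    # Bundles the cleaning steps that the cells below apply one at a time.
    # `tagalog_stop_words` is defined in a later cell, so call this only after that cell has run.
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)          # links
    text = re.sub(r"\w+@\w+\.com", "", text)              # email addresses
    text = re.sub(r"[.,;:!\?\"'`]", "", text)             # punctuation marks
    text = re.sub(r"[@#$%^&*\/\+-_=\{\}<>]", "", text)    # special characters
    text = re.sub(r"½m|½s|½t|½ï", "", text)               # encoding debris
    text = " ".join(word for word in text.split() if word not in tagalog_stop_words)
    lemmatizer = WordNetLemmatizer()                      # create once, not once per word
    text = " ".join(lemmatizer.lemmatize(word, "v") for word in text.split())
    return text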

8. Convert all text to lowercase

In [208]:
# convert text to lowercase
df_train["text"] = df_train["text"].str.lower()

In [209]:
df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas implies that ...,1
1,parang may mali na sumunod ang patalastas ng n...,1
2,bet ko. pula ang kulay ng posas,1
3,[username] kakampink,0
4,bakit parang tahimik ang mga pink about doc wi...,1


9. Remove digits, URLs and special characters

In [210]:
# removing links
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"http\S+|www\.\S+", "", x))

# removing email addresses
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"\w+@\w+\.com", "", x))

# removing punctuation marks
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[.,;:!\?\"'`]", "", x))

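# removing special characters
# note: inside the brackets, `\+-_` is a range from "+" to "_", so this step also strips
# digits, square brackets, and backslashes (which is why "[username]" becomes "username")
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"[@#$%^&*\/\+-_=\{\}<>]", "", x))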

# removing unnecessary characters
df_train["text"] = df_train["text"].apply(lambda x: re.sub(r"½m|½s|½t|½ï", "", x))


In [211]:
df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas implies that ...,1
1,parang may mali na sumunod ang patalastas ng n...,1
2,bet ko pula ang kulay ng posas,1
3,username kakampink,0
4,bakit parang tahimik ang mga pink about doc wi...,1
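
In the output above, `[username]` has lost its square brackets even though no pattern lists them. The likely cause is that an unescaped `-` between two characters in a regex character class defines a range, so `+-_` covers every character from `+` to `_`, including digits and brackets. A minimal sketch of the difference (the sample string is made up for illustration):

```python
import re

sample = "abc 123 [user] a-b"

# Unescaped: "+-_" is read as the range U+002B..U+005F (digits, brackets, "-", ...).
as_written = re.sub(r"[+-_]", "", sample)

# Escaped hyphen: only the three literal characters "+", "-", "_" are removed.
literal_only = re.sub(r"[+\-_]", "", sample)

print(repr(as_written))
print(repr(literal_only))
```

Here the range happens to do what step 9 asks for (removing digits), but writing `\d` and the brackets explicitly would make that intent visible.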


In [212]:
tagalog_stop_words = [
    "ako", "ako'y", "alam", "at", "dahil", "habang", "ito", "iyan",
    "iyon", "ka", "kailangan", "kasama", "mga", "mas", "na", "ng",
    "ngunit", "o", "sa", "saan", "sino", "tayo", "tungkol", "upang", "wala",
    "ang", "anong", "bawat", "bago", "bakit", "dati", "dito", "ganito",
    "ganoon", "hindi", "iba", "ilalim", "isinasaalang-alang", "iyong",
    "kahit", "kaya", "kung", "maging", "muli", "narito", "ngayon",
    "pagkatapos", "paano", "pati", "tulad", "yung"
]


10. Remove stop words

In [213]:
# Removing stopwords from the data
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (tagalog_stop_words)]))
df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas implies that ...,1
1,parang may mali sumunod patalastas nescaf coff...,1
2,bet ko pula kulay posas,1
3,username kakampink,0
4,parang tahimik pink about doc willie ong no re...,1
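
Membership tests against a list scan the whole list for every word; converting the stop words to a `set` keeps each lookup constant-time on average. A sketch assuming a small subset of the list above (the helper name `remove_stop_words` is hypothetical):

```python
# Small subset of the stop-word list above, stored as a set for fast lookups.
stop_set = {"ang", "ng", "na", "mga", "sa"}

def remove_stop_words(text, stop_words=stop_set):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stop_words("bet ko pula ang kulay ng posas")
print(cleaned)  # bet ko pula kulay posas
```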



11. Use Stemming or Lemmatization

In [215]:
lemmatizer = WordNetLemmatizer()
df_train['text'] = df_train['text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))
df_train.head()

Unnamed: 0,text,label
0,presidential candidate mar roxas implies that ...,1
1,parang may mali sumunod patalastas nescaf coff...,1
2,bet ko pula kulay posas,1
3,username kakampink,0
4,parang tahimik pink about doc willie ong no re...,1


## C. Training your model

12. Put all text training data in variable **X_train**

In [216]:
X_train = df_train['text']


13. Put all training data labels in variable **y_train**

In [217]:
y_train = df_train['label']

14. Use `CountVectorizer()` or `TfidfVectorizer()` to convert text data to its numerical form.

Put the converted data to **X_train_transformed** variable

In [218]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
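
To make the transformation concrete, here is a small sketch of what `CountVectorizer` produces on two toy sentences (the sentences and variable names are made up for illustration). The vocabulary is learned during `fit_transform`; words first seen at `transform` time are ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["bet ko pula", "pula kulay posas"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x vocabulary

# The default token pattern keeps tokens of two or more word characters.
toy_vocab = sorted(toy_vectorizer.vocabulary_)

# "bago" was never seen during fitting, so only "pula" is counted here.
unseen = toy_vectorizer.transform(["pula bago"])

print(toy_vocab)
print(toy_matrix.toarray())
print(unseen.toarray())
```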

15. Create an instance of `MultinomialNB()`

In [219]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=0.1)
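
`alpha` is the additive smoothing parameter: it is added to every word count so that a word never seen in a class does not get zero probability. The sketch below, on a hypothetical toy count matrix, recomputes the smoothed log-likelihoods by hand and compares them with what scikit-learn stores in `feature_log_prob_`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: 4 documents, 3 vocabulary terms.
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])
alpha = 0.1

nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

# Per-class term counts, one row per class.
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in (0, 1)])

# Smoothed estimate: (count + alpha) / (class total + alpha * vocabulary size).
n_terms = X_toy.shape[1]
manual = np.log((counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_terms))

print(np.allclose(manual, nb.feature_log_prob_))
```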

16. Train the model using `.fit()`

In [220]:
model.fit(X_train_transformed, y_train)

## D. Evaluate your model

17. Use `.predict()` to generate model predictions using the **validation dataset**


- Put all text validation data in **X_validation** variable

- Convert **X_validation** to its numerical form.

- Put the converted data to **X_validation_transformed**

- Put all predictions in **y_validation_pred** variable

In [221]:
X_validation = df_validation['text']
X_validation_transformed = vectorizer.transform(X_validation)
y_validation_pred = model.predict(X_validation_transformed)
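
The cell above vectorizes the raw validation text, while the training text went through stop-word removal and lemmatization first. `CountVectorizer` lowercases and drops punctuation by default, so the gap is small, but one way to keep every split on the same path is to bundle cleaning, vectorizing, and the classifier in a scikit-learn `Pipeline`. This is a sketch on made-up toy sentences with a simplified, hypothetical `clean` helper, not the notebook's full preprocessing:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def clean(text):
    # Simplified stand-in for the notebook's cleaning steps (hypothetical helper).
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", "", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)         # keep ASCII letters and whitespace only
    return " ".join(text.split())

# Toy data, made up for illustration only.
toy_texts = ["good day friend", "nice kind friend", "bad ugly enemy", "ugly bad liar"]
toy_labels = [0, 0, 1, 1]

pipe = Pipeline([
    ("vectorize", CountVectorizer(preprocessor=clean)),
    ("classify", MultinomialNB(alpha=0.1)),
])
pipe.fit(toy_texts, toy_labels)

# The same cleaning runs automatically on anything passed to predict().
toy_pred = pipe.predict(["Kind FRIEND!!", "bad liar 123"])
print(toy_pred)
```

Applying the same idea to the notebook would change the reported metrics, since the validation and test text would then be cleaned before vectorizing.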

18. Get the Accuracy, Precision, Recall and F1-Score of the model using the **validation dataset**

- Put all validation data labels in **y_validation** variable

In [222]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_validation = df_validation['label']

accuracy = accuracy_score(y_validation, y_validation_pred)
precision = precision_score(y_validation, y_validation_pred)
recall = recall_score(y_validation, y_validation_pred)
f1 = f1_score(y_validation, y_validation_pred)

print(f"Accuracy (Validation): {accuracy}")
print(f"Precision (Validation): {precision}")
print(f"Recall (Validation): {recall}")
print(f"F1 Score (Validation): {f1}")

Accuracy (Validation): 0.8167857142857143
Precision (Validation): 0.7998670212765957
Recall (Validation): 0.8501766784452297
F1 Score (Validation): 0.8242548818088387


19. Create a confusion matrix using the **validation dataset**

In [223]:
confusion_matrix(y_validation, y_validation_pred)

array([[1084,  301],
       [ 212, 1203]])
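
The four metrics from step 18 can be recovered from this matrix. scikit-learn puts true labels on the rows and predicted labels on the columns, so the layout is `[[TN, FP], [FN, TP]]`. A short sketch using the values printed above (the `_cm` names are chosen here for illustration):

```python
# Validation confusion matrix from the cell above.
tn, fp = 1084, 301
fn, tp = 212, 1203

accuracy_cm = (tp + tn) / (tp + tn + fp + fn)
precision_cm = tp / (tp + fp)   # of the predicted hate speech, how much was correct
recall_cm = tp / (tp + fn)      # of the actual hate speech, how much was found
f1_cm = 2 * precision_cm * recall_cm / (precision_cm + recall_cm)

print(round(accuracy_cm, 4), round(precision_cm, 4), round(recall_cm, 4), round(f1_cm, 4))
# 0.8168 0.7999 0.8502 0.8243
```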

20. Use `.predict()` to generate the model predictions using the **test dataset**


- Put all text test data in **X_test** variable

- Convert **X_test** to its numerical form.

- Put the converted data to **X_test_transformed**

- Put all predictions in **y_test_pred** variable

In [226]:
X_test = df_test['text']
X_test_transformed = vectorizer.transform(X_test)
y_test_pred = model.predict(X_test_transformed)

21. Get the Accuracy, Precision, Recall and F1-Score of the model using the **test dataset**

- Put all test data labels in **y_test** variable



In [227]:
y_test = df_test['label']
accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)

print(f"Accuracy (Test): {accuracy_test}")
print(f"Precision (Test): {precision_test}")
print(f"Recall (Test): {recall_test}")
print(f"F1 Score (Test): {f1_test}")


Accuracy (Test): 0.8231316725978648
Precision (Test): 0.8070892978868439
Recall (Test): 0.8469241773962805
F1 Score (Test): 0.8265270506108202


22. Create a confusion matrix using the **test dataset**

In [228]:
confusion_matrix(y_test, y_test_pred)

array([[1129,  283],
       [ 214, 1184]])
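
Raw counts are easier to compare across classes once each row is divided by its total, which gives the per-class recall on the diagonal. Their mean is the balanced accuracy, the quantity that `balanced_accuracy_score` (imported at the top but not used) reports. A sketch with NumPy on the matrix above:

```python
import numpy as np

# Test confusion matrix from the cell above: rows are true labels, columns are predictions.
cm_test = np.array([[1129, 283],
                    [214, 1184]])

# Divide each row by its total so every row sums to 1.
row_rates = cm_test / cm_test.sum(axis=1, keepdims=True)

# Diagonal entries are the recall of class 0 and class 1.
per_class_recall = row_rates.diagonal()
balanced_acc = per_class_recall.mean()

print(row_rates.round(3))
print(per_class_recall.round(3), round(float(balanced_acc), 3))
```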

## E. Test the model

23. Test the model by providing a non-hate speech input. The model should predict it as 0

In [234]:
new_input = ["mahalaga ka katulad ng pagpapahalaga sa mama ko"]
new_input_transformed = vectorizer.transform(new_input)
prediction = model.predict(new_input_transformed)
print("Prediction:", prediction)

Prediction: [0]


In [235]:
new_input = ["Ang init ng ulo ko ang gulo ng paligid ang sarap talaga sumigaw ng putangina"]
new_input_transformed = vectorizer.transform(new_input)
prediction = model.predict(new_input_transformed)
print("Prediction:", prediction)

Prediction: [1]


24. Test the model by providing a hate speech input. The model should predict it as 1

In [236]:
new_input = ["TANGINA MO NAMAN"]
new_input_transformed = vectorizer.transform(new_input)
prediction = model.predict(new_input_transformed)

print("Prediction:", prediction)

Prediction: [1]
