<a href="https://colab.research.google.com/github/akramlodi/Hate-Speech-Classification/blob/main/Hate_speech_Detection_implementing_stemming_and_lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Hate Speech Recognition Classification Algorithm
The objective is to implement stemming and lemmatization and compare their effect on classifier performance.
Both techniques normalize the words of already-cleaned text by reducing them to a base form, using the Python NLTK (Natural Language Toolkit) package.
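
To make the difference between the two techniques concrete, here is a deliberately tiny sketch in plain Python. It is not the Porter algorithm or WordNet: `toy_stem`, `toy_lemmatize`, and the `TOY_LEMMAS` table are hypothetical stand-ins that only illustrate the idea that a stemmer strips suffixes by rule (and may produce non-words), while a lemmatizer looks words up and returns dictionary forms.

```python
def toy_stem(word):
    """Strip a common suffix by rule; the result need not be a real word."""
    for suffix, replacement in (("ing", ""), ("ies", "i"), ("y", "i"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


# Hypothetical lookup table standing in for a lexical database
TOY_LEMMAS = {"cries": "cry", "women": "woman"}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ["cleaning", "cry", "cries", "women"]:
    print(f"{word:>10}  stem={toy_stem(word):<8} lemma={toy_lemmatize(word)}")
```

The rule-based function turns `cry` into the non-word `cri`, while the lookup leaves it alone and handles the irregular plural `women`, which no suffix rule catches.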

In [None]:
import re

import nltk
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # WordNet data required by WordNetLemmatizer

df = pd.read_csv('sample_data/labeled_data.csv')

df['label'] = df.apply(lambda row: 'Hate speech' if row['hate_speech'] > 0 or row['offensive_language'] > 0 else 'none', axis=1)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'#\w+', '', text)     # Remove hashtags
    text = re.sub(r'\d+', '', text)      # Remove numbers
    text = text.lower()                  # Convert to lowercase
    text = re.sub(r'\W+', ' ', text)     # Replace special characters (including newlines) with a space
    # Remove the retweet marker 'rt' as a whole word only
    # (a bare r'rt' also cuts 'start' to 'sta')
    text = re.sub(r'\brt\b', '', text)
    text = re.sub(r'\s+', ' ', text)     # Collapse extra spaces
    text = text.strip()                  # Remove leading/trailing spaces

    return text

# Function to apply stemming
def stem_text(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])

# Function to apply lemmatization
def lemmatize_text(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

# Apply text cleaning
df['cleaned_tweet'] = df['tweet'].apply(clean_text)

# Apply stemming
df['stemmed_tweet'] = df['cleaned_tweet'].apply(stem_text)

# Apply lemmatization
df['lemmatized_tweet'] = df['cleaned_tweet'].apply(lemmatize_text)


# Select relevant columns
df = df[['cleaned_tweet', 'label','stemmed_tweet', 'lemmatized_tweet']]
df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,cleaned_tweet,label,stemmed_tweet,lemmatized_tweet
0,as a woman you shouldn t complain about clean...,none,as a woman you shouldn t complain about clean ...,a a woman you shouldn t complain about cleanin...
1,boy dats cold tyga dwn bad for cuffin dat hoe...,Hate speech,boy dat cold tyga dwn bad for cuffin dat hoe i...,boy dat cold tyga dwn bad for cuffin dat hoe i...
2,dawg you ever fuck a bitch and she sta to cr...,Hate speech,dawg you ever fuck a bitch and she sta to cri ...,dawg you ever fuck a bitch and she sta to cry ...
3,she look like a tranny,Hate speech,she look like a tranni,she look like a tranny
4,the shit you hear about me might be true or i...,Hate speech,the shit you hear about me might be true or it...,the shit you hear about me might be true or it...
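
Row 2 of the output above contains `sta` where the tweet presumably said "start". A minimal sketch of how an unanchored `rt` pattern produces this, and how a word-boundary pattern avoids it, is shown here. The example sentence is made up for illustration and is not taken from the dataset.

```python
import re

text = "rt she start to cry at the concert"

# Unanchored pattern: deletes "rt" wherever it occurs, even inside words
unanchored = re.sub(r'rt', '', text)

# Word-boundary pattern: deletes "rt" only when it stands alone as a token
anchored = re.sub(r'\brt\b', '', text).strip()

print(repr(unanchored))
print(repr(anchored))
```

With the unanchored pattern, "start" and "concert" both lose characters; with `\b` on each side only the standalone retweet marker is removed.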


#Vectorization for Stemmed tweets

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['stemmed_tweet'])
y = df['label']

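
For reference, the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF followed by L2 normalization of each row) can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper `tfidf_sketch` is hypothetical, it tokenizes on whitespace only, and it ignores `max_features`.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Smoothed, L2-normalized TF-IDF over whitespace tokens."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_sketch(["she look like", "she look", "boy dat cold"])
print(rows[0])
```

In the first document, `like` receives a higher weight than `she` or `look` because it appears in fewer documents.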

#Splitting the dataset into training and testing

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Training Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

#Classification Report and Confusion Matrix for Stemming Method

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

 Hate speech       0.93      0.98      0.96      4384
        none       0.79      0.47      0.59       573

    accuracy                           0.92      4957
   macro avg       0.86      0.73      0.77      4957
weighted avg       0.92      0.92      0.92      4957

[[4314   70]
 [ 305  268]]
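
The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading the two together. The sketch below copies the matrix printed above and assumes scikit-learn's convention: rows are true labels, columns are predicted labels, and labels appear in sorted order (`Hate speech`, then `none`).

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
labels = ["Hate speech", "none"]
cm = [[4314, 70],
      [305, 268]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    precision = true_pos / sum(row[i] for row in cm)  # column sum: predicted as label
    recall = true_pos / sum(cm[i])                    # row sum: actually label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{label:>12}  precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The low recall for `none` comes directly from the second row: 305 of the 573 `none` tweets were predicted as `Hate speech`.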


#Vectorization for Lemmatized Tweets

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text data
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['lemmatized_tweet'])
y = df['label']

#Splitting the dataset into training and testing

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Training Logistic Regression Model

In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

#Classification Report and Confusion Matrix for Lemmatization Method

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

 Hate speech       0.93      0.98      0.96      4384
        none       0.79      0.46      0.58       573

    accuracy                           0.92      4957
   macro avg       0.86      0.72      0.77      4957
weighted avg       0.92      0.92      0.91      4957

[[4312   72]
 [ 307  266]]


##Results and Conclusion
Comparing the classification reports and confusion matrices for the two techniques, stemming and lemmatization perform almost identically: both reach 0.92 accuracy, and the confusion matrices differ by only two predictions in each row. Recall for the 'none' class is low in both cases (0.47 with stemming, 0.46 with lemmatization), so neither technique appears to add much value to the model.

In addition, a model trained on the cleaned tweets alone, which are the input to both techniques, gave very similar results.

Further testing is required.
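
One inexpensive check that could be part of such testing is a majority-class baseline, since the test set is heavily imbalanced. The sketch below uses only numbers already printed in the reports above (the `support` column and the confusion matrix diagonal for the stemmed-tweet model); it is an illustration of the comparison, not an additional experiment.

```python
# Test-set class counts, taken from the "support" column of the reports above
support = {"Hate speech": 4384, "none": 573}
total = sum(support.values())

# A baseline that always predicts the most frequent class
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Correct predictions of the stemmed-tweet model: its confusion matrix diagonal
model_accuracy = (4314 + 268) / total

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:    {model_accuracy:.3f}")
```

Because always predicting `Hate speech` already scores about 0.88 accuracy, per-class recall and macro averages are more informative than accuracy when comparing preprocessing variants on this data.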