
# **Detecting Hate Speech on Twitter using Python: A Text Preprocessing and Classification Approach**

Detecting hate speech is an important task for social media platforms in order to maintain a safe and respectful environment for all users. In this notebook, we explored a method to detect hate speech in Twitter data using Python. We first loaded a raw CSV file containing Twitter data into a Pandas dataframe, and then preprocessed the text data by removing URLs, HTML tags, punctuation, and stop words. We also applied stemming to the words in the text data.

Next, we converted the cleaned tweets into bag-of-words count vectors with scikit-learn's CountVectorizer and trained a DecisionTreeClassifier on them. The labels come from the dataset's own `class` column, which marks each tweet as hate speech, offensive language, or neither.

By using this method, we were able to detect and flag potentially harmful tweets, which could then be reviewed by moderators to determine if further action was necessary. This approach could be further refined and applied at scale to help identify and mitigate the impact of hate speech on social media platforms.

In [4]:
import csv
import requests

# URL of the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/amankharwal/Website-data/master/twitter.csv"

# Make an HTTP GET request to the CSV URL and fail early on an HTTP error
response = requests.get(csv_url)
response.raise_for_status()

# Parse the response text using the csv module
csv_reader = csv.reader(response.text.splitlines())

# Loop through each row in the CSV file and print it
for row in csv_reader:
    print(row)

Streaming output truncated to the last 5000 lines.
['20223', '3', '0', '3', '0', '1', 'RT @oddfuckingtaco: Damn I hate a bitch that like to argue and shit']
['20224', '6', '0', '6', '0', '1', 'RT @oddfuckingtaco: https://t.co/nkCXCZwrXa my nigga really tried to save his bitch lol']
['20225', '3', '0', '3', '0', '1', "RT @odotkay: Mother's Day is n 16 mins. Y'all bitches picked y'all kids up from yo mama house so u can have 24hrs of custody n Instagram p&#8230;"]
['20226', '3', '0', '3', '0', '1', 'RT @officialbskip: I want a crazy bitch, they be loyal af &#128553;']
['20227', '3', '0', '3', '0', '1', 'RT @officialmckell: &amp; ima eat that pussy all filthy &#128525;']
['20228', '6', '2', '4', '0', '1', "RT @ogkaykay_: y'all hoes so annoying &#128530;"]
['20229', '3', '0', '3', '0', '1', 'RT @ogkels_: brought to you by bad bitches and good weed']
['20230', '3', '0', '3', '0', '1', 'RT @oh_sh1t: Damn, I got bitches *kanye voice*']
['20231', '3', '0', '3', '0

In [28]:
import tweepy
import numpy as np
import pandas as pd
import string
import re
from textblob import TextBlob
from io import StringIO
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import csv
import requests
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Next, make an HTTP GET request to the CSV URL and load the response text into a pandas DataFrame:

In [12]:
response = requests.get(csv_url).text

# Create a pandas DataFrame from the response text using StringIO
df = pd.read_csv(StringIO(response))

# Print the first few rows of the DataFrame
print(df.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  
0  !!! RT @mayasolovely: As a woman you shouldn't...  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...  


In [15]:
df["labels"] = df["class"].map({0: "Hate Speech", 
                                    1: "Offensive Language", 
                                    2: "No Hate and Offensive"})
print(df.head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet                 labels  
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language  
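
`Series.map` with a dictionary leaves any value that has no key in the dictionary as `NaN`, so it can be worth checking for unmapped rows and looking at the label counts before training. The following is a minimal sketch on a hypothetical toy frame (`toy`, with a deliberately unmapped class value `3`), not on the downloaded dataset:

```python
import pandas as pd

# Hypothetical toy data: the class value 3 is deliberately absent from the mapping.
toy = pd.DataFrame({"class": [2, 1, 1, 1, 0, 3]})

label_names = {0: "Hate Speech",
               1: "Offensive Language",
               2: "No Hate and Offensive"}
toy["labels"] = toy["class"].map(label_names)

# Rows whose class had no entry in the mapping end up as NaN.
unmapped = int(toy["labels"].isna().sum())

# value_counts() ignores NaN by default and shows how balanced the labels are.
counts = toy["labels"].value_counts()

print(counts)
print("unmapped rows:", unmapped)
```

On the real DataFrame the same two checks would be `df["labels"].isna().sum()` and `df["labels"].value_counts()`.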


In [16]:
df = df[["tweet", "labels"]]
print(df.head())

                                               tweet                 labels
0  !!! RT @mayasolovely: As a woman you shouldn't...  No Hate and Offensive
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...     Offensive Language
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...     Offensive Language
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...     Offensive Language
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...     Offensive Language


In [25]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [26]:
# Build the stop-word set and the stemmer once instead of on every call
stopwords_set = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

# Define a function to clean the tweets
def clean(text):
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # remove URLs
    text = re.sub(r'<.*?>+', '', text)                                # remove HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # remove punctuation
    text = re.sub(r'\n', '', text)                                    # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # remove tokens containing digits
    text = text.lower()
    # Drop stop words, then stem the remaining words
    words = [word for word in text.split() if word not in stopwords_set]
    return " ".join(stemmer.stem(word) for word in words)

# Apply the clean function to the tweets column of the DataFrame
df["tweet"] = df["tweet"].apply(clean)
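
To see what each regular expression in `clean` contributes, the sketch below applies the same substitutions one at a time, in the same order, to a hypothetical sample string. It uses only the standard library and stops before stop-word removal and stemming, so it illustrates the regex stage rather than reproducing the full function:

```python
import re
import string

# Hypothetical, harmless sample text used only to illustrate the regex steps.
sample = "RT @user: Check <b>this</b> out!!! https://t.co/abc123 It's 2 good"

# The same patterns as in clean(), applied in the same order.
steps = [
    ("strip URLs", r"https?://\S+|www\.\S+"),
    ("strip HTML tags", r"<.*?>+"),
    ("strip punctuation", "[%s]" % re.escape(string.punctuation)),
    ("strip newlines", r"\n"),
    ("strip tokens containing digits", r"\w*\d\w*"),
]

text = sample
for name, pattern in steps:
    text = re.sub(pattern, "", text)
    print(f"{name:32} -> {text!r}")

# Lowercase and collapse runs of whitespace left behind by the removals.
normalized = " ".join(text.lower().split())
print(normalized)
```

Because punctuation is stripped before stop-word filtering, a contraction such as "It's" reaches the filter as "its", with the apostrophe already gone.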



Now let’s convert the cleaned tweets into bag-of-words count vectors, split the dataset into training and test sets, and train a decision tree classifier for the task of hate speech detection:

In [29]:
x = np.array(df["tweet"])
y = np.array(df["labels"])

cv = CountVectorizer()
X = cv.fit_transform(x) # Fit the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
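
The cell above holds out a third of the data, but the cells shown here never score the model on it. The following is a minimal sketch of a held-out evaluation, assuming a hypothetical toy corpus (`texts`, `labels`) in place of the cleaned tweets. As a variation on the cell above, it fits the vectorizer on the training text only, so the vocabulary is not learned from the test split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus standing in for the cleaned tweets; labels are illustrative.
texts = ["good day", "nice day", "good morning", "nice weather",
         "bad day", "awful day", "bad morning", "awful weather"] * 5
labels = (["positive"] * 4 + ["negative"] * 4) * 5

# Split the raw text first so the vectorizer never sees the test split.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_txt)  # learn the vocabulary from training text
X_test = vectorizer.transform(X_test_txt)        # reuse that vocabulary for the test text

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Score the model on the held-out split.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

With the notebook's own objects, the equivalent check would be `accuracy_score(y_test, clf.predict(X_test))`. The per-class report is often more informative than a single accuracy figure when the classes are unevenly represented.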

Now let’s run the trained model on a sample sentence to see which label it predicts:

In [30]:
sample = "true but then again I'm my opinion a hoe is someone who goes and sleeps with everybody"
data = cv.transform([sample]).toarray()
print(clf.predict(data))

['Offensive Language']
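
The sample above is vectorized as raw text, whereas the vectorizer was fit on tweets that had already passed through `clean`. One way to keep training and prediction in sync is to hand the cleaning step to the vectorizer itself and wrap everything in a scikit-learn pipeline. The sketch below assumes a hypothetical, reduced cleaner (`simple_clean`, without stop-word removal or stemming, so it does not need NLTK) and toy training data:

```python
import re
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


def simple_clean(text):
    """Hypothetical, reduced stand-in for clean(): no stop-word removal or stemming."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                 # remove URLs
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)  # remove punctuation
    return " ".join(text.lower().split())                            # lowercase, collapse spaces


# Toy training data; the labels are illustrative, not taken from the dataset.
train_texts = ["Great day!!!", "What a great morning",
               "Terrible day...", "Such a terrible morning"]
train_labels = ["positive", "positive", "negative", "negative"]

# The preprocessor runs on every document at both fit and predict time.
pipeline = make_pipeline(
    CountVectorizer(preprocessor=simple_clean),
    DecisionTreeClassifier(random_state=42),
)
pipeline.fit(train_texts, train_labels)

# Raw input goes straight in; the pipeline applies the same cleaning as in training.
prediction = pipeline.predict(["GREAT!!! https://example.com"])
print(prediction)
```

In the notebook, a smaller change with the same intent would be to call `cv.transform([clean(sample)])`, although the predicted label for the cleaned sample is not shown here.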
