# Problem Statement: Hate Speech Classification

This dataset of Twitter data was used to research hate-speech detection. Each tweet is classified as hate speech, offensive language, or neither. Due to the nature of the study, it’s important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.

## Column Description:
- count:
number of CrowdFlower (CF) users who coded each tweet (the minimum is 3; more users coded a tweet when CF determined the judgments to be unreliable)

- hate_speech:
number of CF users who judged the tweet to be hate speech

- offensive_language:
number of CF users who judged the tweet to be offensive

- neither:
number of CF users who judged the tweet to be neither hate speech nor offensive

- class:
class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither

- tweet:
text of the tweet
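
To make the relationship between the vote columns and `class` concrete, here is a minimal sketch, assuming `class` is simply the index of the category with the most votes. The helper name `majority_class` is hypothetical, and the description above does not say how ties were resolved, so the tie handling below is an assumption.

```python
def majority_class(hate_speech, offensive_language, neither):
    """Return 0, 1 or 2 for the category with the most votes.

    Assumption: ties go to the lowest class index; the dataset description
    does not say how ties were actually resolved.
    """
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))

# First training row shown later in this notebook: 1 hate-speech vote, 2 offensive votes
print(majority_class(1, 2, 0))  # 1 (offensive language)
```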


## Target Column:
Predict the class column for the test data by applying suitable NLP-based algorithms.

- class:
class label assigned by the majority of CF users: 0 = hate speech, 1 = offensive language, 2 = neither


In [1]:
import pandas as pd
import numpy as np
import nltk

In [2]:
train_dataset = pd.read_csv('speech_train.csv', dtype={'count':int, 'hate_speech':int, 
                                                       'offensive_language':int,'neither':int, 
                                                       'class':int, 'tweet':str})
test_dataset = pd.read_excel('speech_test.xlsx', dtype={'class':float, 'tweet': str})

In [3]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 499 entries, 0 to 498
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   count               499 non-null    int32 
 1   hate_speech         499 non-null    int32 
 2   offensive_language  499 non-null    int32 
 3   neither             499 non-null    int32 
 4   class               499 non-null    int32 
 5   tweet               499 non-null    object
dtypes: int32(5), object(1)
memory usage: 13.8+ KB


In [4]:
test_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 164 entries, 0 to 163
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   class   0 non-null      float64
 1   tweet   164 non-null    object 
dtypes: float64(1), object(1)
memory usage: 2.7+ KB


In [5]:
train_dataset['class'].value_counts() / len(train_dataset)

1    0.763527
2    0.168337
0    0.068136
Name: class, dtype: float64
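
Because class 1 covers about 76% of the training tweets, a model that always predicts class 1 already reaches roughly that accuracy. A small sketch of this baseline, using the proportions printed above (the variable names are illustrative):

```python
# Class proportions copied from the value_counts() output above
class_share = {1: 0.763527, 2: 0.168337, 0: 0.068136}

# Always predicting the most frequent class is right exactly as often as that class occurs
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

print(majority_label, round(baseline_accuracy, 3))  # 1 0.764
```

Any accuracy reported for a model on this data is best read against this baseline; per-class recall or macro-averaged F1 are less sensitive to the imbalance.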

In [6]:
train_dataset.head(5)

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,1,2,0,1,RT @HENNERGIZED: &#128557;&#128557;&#128557;&#...
1,3,0,3,0,1,We are back bitches! @vnpacheco21 @xoxoclaire_...
2,3,0,3,0,1,"RT @SuperrrrMcNasty: Lmfao , this bitch was gi..."
3,3,0,3,0,1,@bradley_eckman don't say shit like that lil n...
4,3,1,2,0,1,Young jezzy dat nigguh


In [7]:
train_dataset.tweet.head(20).to_list()

['RT @HENNERGIZED: &#128557;&#128557;&#128557;&#8220;@StopBeingSober: Yall say "Why you dating lil girls" like mature hoes just on a rampage outside.&#8221;',
 'We are back bitches! @vnpacheco21 @xoxoclaire_ @treyhunter_ @Medgmon @delaney_jade @sean_steezy @alexistiefa',
 'RT @SuperrrrMcNasty: Lmfao , this bitch was giving head in the back of a classroom she better go tf ahead !',
 "@bradley_eckman don't say shit like that lil nig .. Or I'll give you a &#127812; stamp",
 'Young jezzy dat nigguh',
 'Same shit RT @Che_TheYello1: But your bm tho? ... "@viaNAWF: If my ex wanna go fuck my enemy, may God be with her. Ain\'t my hoe no more."',
 'RT @SumGurl07: So cute! :) RT @iTweetFacts: Shy bunny... http://t.co/z4u6NpORdz',
 'This bitch always got shit to say about what other people like',
 "RT @drewbillionaire: I just wanna pimp these hoe$ &amp; Ride Rarri's",
 'I heard them same pussy niggas hatin !',
 "RT @michael4h2o: An #Ohio inmate sums up what #LeBronJames' return means for #Clevelan

In [8]:
import html
train_dataset['tweet'].apply(html.unescape).head(20).to_list()

['RT @HENNERGIZED: 😭😭😭“@StopBeingSober: Yall say "Why you dating lil girls" like mature hoes just on a rampage outside.”',
 'We are back bitches! @vnpacheco21 @xoxoclaire_ @treyhunter_ @Medgmon @delaney_jade @sean_steezy @alexistiefa',
 'RT @SuperrrrMcNasty: Lmfao , this bitch was giving head in the back of a classroom she better go tf ahead !',
 "@bradley_eckman don't say shit like that lil nig .. Or I'll give you a 🍄 stamp",
 'Young jezzy dat nigguh',
 'Same shit RT @Che_TheYello1: But your bm tho? ... "@viaNAWF: If my ex wanna go fuck my enemy, may God be with her. Ain\'t my hoe no more."',
 'RT @SumGurl07: So cute! :) RT @iTweetFacts: Shy bunny... http://t.co/z4u6NpORdz',
 'This bitch always got shit to say about what other people like',
 "RT @drewbillionaire: I just wanna pimp these hoe$ & Ride Rarri's",
 'I heard them same pussy niggas hatin !',
 "RT @michael4h2o: An #Ohio inmate sums up what #LeBronJames' return means for #Cleveland: http://t.co/a68mtZQOeL http://t.co/bMrzFWstVA

In [9]:
# !conda install -c conda-forge langdetect
# !conda install -c conda-forge emoji

In [10]:
%%time
# detect language of the tweets
from langdetect import detect
train_dataset['lang'] = train_dataset.tweet.apply(detect)

Wall time: 6.47 s


In [11]:
train_dataset.loc[train_dataset['lang'] != 'en', ['lang', 'tweet']].head(10)

Unnamed: 0,lang,tweet
4,id,Young jezzy dat nigguh
21,nl,@Channnteeel pussy
23,af,Wish I got hoes like James Bond
26,cs,Fuk dat bitch lol @dropolo voice
34,tl,Fucking gook
59,id,"' Nah I ain't no hoe niggah , no bitch niggah ..."
88,cy,"I'll pay yall niggas to get lost, how much y'a..."
106,fi,Ronny kamm is a pussy
145,cy,@GoHard_Brown @your_daddy9 &amp; xavier you bi...
148,af,Big night of &#127936; #hoosiers #iubb #big10 ...
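
Most of the rows above are short English slang that the detector assigns to other languages, so the `lang` column is unreliable for very short tweets. One possible guard, sketched here under assumptions, is to skip detection when a tweet has too few letters. The helper `detect_or_unknown`, its `min_letters` threshold of 20, and the `'unknown'` label are all hypothetical choices, not part of `langdetect`, and the guard would not catch the longer misclassified tweets.

```python
def detect_or_unknown(text, detect_fn, min_letters=20):
    """Return detect_fn(text), or 'unknown' when the text is too short to trust.

    detect_fn is any callable mapping a string to a language code,
    for example langdetect.detect.
    """
    letters = sum(ch.isalpha() for ch in text)
    if letters < min_letters:
        return "unknown"
    try:
        return detect_fn(text)
    except Exception:
        # detectors may raise on text without usable features (e.g. only digits)
        return "unknown"


def fake_detect(text):
    """Stand-in detector so the sketch runs without langdetect installed."""
    return "en"


print(detect_or_unknown("too short", fake_detect))  # unknown
print(detect_or_unknown("a sentence with clearly more than twenty letters", fake_detect))  # en
```

In this notebook it could be called as `train_dataset.tweet.apply(lambda t: detect_or_unknown(t, detect))`.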


In [12]:
train_dataset['0'] = train_dataset['hate_speech'] / train_dataset['count']
train_dataset['1'] = train_dataset['offensive_language'] / train_dataset['count']
train_dataset['2'] = train_dataset['neither'] / train_dataset['count']
train_dataset['combined_class_votes'] = train_dataset['0'] * 0 + \
                                        train_dataset['1'] * 1 + \
                                        train_dataset['2'] * 2
train_dataset['class2'] = np.round(train_dataset['combined_class_votes'], 1).astype(str)
train_dataset.head(3)

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,lang,0,1,2,combined_class_votes,class2
0,3,1,2,0,1,RT @HENNERGIZED: &#128557;&#128557;&#128557;&#...,en,0.333333,0.666667,0.0,0.666667,0.7
1,3,0,3,0,1,We are back bitches! @vnpacheco21 @xoxoclaire_...,en,0.0,1.0,0.0,1.0,1.0
2,3,0,3,0,1,"RT @SuperrrrMcNasty: Lmfao , this bitch was gi...",en,0.0,1.0,0.0,1.0,1.0


In [13]:
train_dataset.class2.unique()

array(['0.7', '1.0', '2.0', '0.8', '1.7', '0.3', '1.3', '1.8', '0.0',
       '1.2'], dtype=object)
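
`combined_class_votes` is the vote-weighted mean of the class indices, so it treats the three labels as points on a 0 to 2 scale. Rounding that mean back to an integer does not always recover the majority label. The sketch below works through one hypothetical vote split (not taken from the dataset) where the two disagree; the function names are illustrative, and the vote counts are assumed to sum to `count`.

```python
def weighted_score(hate_speech, offensive_language, neither):
    """Vote-weighted mean of the class indices 0, 1 and 2."""
    count = hate_speech + offensive_language + neither  # assumes votes sum to count
    return (0 * hate_speech + 1 * offensive_language + 2 * neither) / count


def majority_label(hate_speech, offensive_language, neither):
    """Index of the category with the most votes."""
    votes = [hate_speech, offensive_language, neither]
    return votes.index(max(votes))


# Row 0 of the training data: 1 hate-speech vote, 2 offensive votes
print(round(weighted_score(1, 2, 0), 1), majority_label(1, 2, 0))  # 0.7 1

# Hypothetical split: 1 hate-speech vote, 0 offensive votes, 2 "neither" votes
score = weighted_score(1, 0, 2)
print(round(score, 1), round(score), majority_label(1, 0, 2))  # 1.3 1 2
```

In the second case the rounded score says "offensive language" although no annotator chose that label, which is worth keeping in mind when predictions on `class2` are rounded back to `class`.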

In [14]:
# source: https://codereview.stackexchange.com/questions/163446/cleaning-and-extracting-meaningful-text-from-tweets
import re
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer
import unicodedata
import emoji
import html

twitter_stop_words = ['rt']
stopword_set = set(stopwords.words("english") + twitter_stop_words)
# stemmer_func = WordNetLemmatizer().lemmatize
stemmer_func = PorterStemmer().stem
# stemmer_func = SnowballStemmer(language='english').stem
# stemmer_func = lambda x: x

def preprocess(raw_text, stop_words=stopword_set, stemmer_func=stemmer_func):
    # normalize unicode text to NFKD and drop any non-ASCII characters
    normalized_text = unicodedata.normalize('NFKD', raw_text)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')

    # convert html entities to unicode
    text = html.unescape(normalized_text)

    # convert emojis to their corresponding text
    emojiwords_text = emoji.demojize(text)

    # remove hyperlinks (raw string avoids invalid-escape warnings)
    link_free_text = re.sub(r"(https?:\/\/)(\s)?(www\.)?(\s?)(\w+\.)*([\w\-\s]+\/)*([\w-]+)\/?", " ", emojiwords_text)

    # remove user mentions
    mention_free_text = re.sub("@([A-Za-z0-9_]+)", " ", link_free_text)

    # remove unwanted characters
    text = re.sub("[^a-zA-Z_]", " ", mention_free_text)

    # convert to lower case and split 
    words = text.lower().split()

    # remove stopwords, then stem the remaining words
    stemmed_words = [stemmer_func(w) for w in words if w not in stop_words]

    # join the cleaned words into a single space-separated string
    cleaned_text = " ".join(stemmed_words)

    return cleaned_text

train_dataset.tweet.apply(preprocess).head(20).to_list()

['loudly_crying_fac loudly_crying_fac loudly_crying_fac yall say date lil girl like matur hoe rampag outsid',
 'back bitch',
 'lmfao bitch give head back classroom better go tf ahead',
 'say shit like lil nig give mushroom stamp',
 'young jezzi dat nigguh',
 'shit bm tho ex wanna go fuck enemi may god hoe',
 'cute shi bunni',
 'bitch alway got shit say peopl like',
 'wanna pimp hoe ride rarri',
 'heard pussi nigga hatin',
 'ohio inmat sum lebronjam return mean cleveland leb',
 'want money ian trippin hoe',
 'men need watch cream pie bitch yeast infect day monistat treatment know differ',
 'gradeschool danc tri worm impress bitch slam dick unforgiv ceram',
 'perhap diehard yanke fan point watch debacl yanke tiger mlb els lol',
 'homi reallli love fat bitch face_with_tears_of_joy loudly_crying_fac face_with_tears_of_joy like love',
 'orr tht lor bae tho wyd u fuck hoe',
 'time bounc bronx letsgoyanke walkoff',
 'bitch rather petti hoe instead post bail ah realli fck em psa',
 'piss lad p
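
One detail of `preprocess` worth noting: the ASCII normalization runs before `html.unescape`. Emojis stored as HTML entities (such as `&#128557;`) are plain ASCII at that point and survive to be converted by `emoji.demojize`, whereas an emoji already present as a Unicode character would be dropped. A standard-library sketch of that effect (the helper name `to_ascii` is illustrative):

```python
import html
import unicodedata


def to_ascii(text):
    """Same normalization step as in preprocess: NFKD, then drop non-ASCII characters."""
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("utf-8", "ignore"))


entity_form = "so funny &#128557;"          # emoji written as an HTML entity
literal_form = html.unescape(entity_form)   # same emoji as a Unicode character

print(repr(to_ascii(entity_form)))   # 'so funny &#128557;'  (entity survives)
print(repr(to_ascii(literal_form)))  # 'so funny '           (emoji is dropped)
```

If another dataset contains literal emoji characters, running the ASCII step after `emoji.demojize` instead would likely preserve them.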

In [15]:
# keep only the labels and vote shares; the cleaned tweet text is added back below
clean_train_dataset = train_dataset.drop(
    columns=['hate_speech', 'offensive_language', 'neither', 'count', 'tweet', 'combined_class_votes', 'lang'])
clean_test_dataset = test_dataset.drop(columns=['tweet'])
clean_train_dataset['tweet'] = train_dataset['tweet'].apply(preprocess)
clean_test_dataset['tweet'] = test_dataset['tweet'].apply(preprocess)
clean_train_dataset.head(5)

Unnamed: 0,class,0,1,2,class2,tweet
0,1,0.333333,0.666667,0.0,0.7,loudly_crying_fac loudly_crying_fac loudly_cry...
1,1,0.0,1.0,0.0,1.0,back bitch
2,1,0.0,1.0,0.0,1.0,lmfao bitch give head back classroom better go...
3,1,0.0,1.0,0.0,1.0,say shit like lil nig give mushroom stamp
4,1,0.333333,0.666667,0.0,0.7,young jezzi dat nigguh


# Supervised Learning

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import numpy as np

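# NOTE: the vocabulary and IDF weights are fitted on train + test tweets together
# (no labels are involved, but the validation fold is not unseen by the vectorizer)
corpus = clean_train_dataset['tweet'].to_list() + clean_test_dataset['tweet'].to_list()
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)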

X = vectorizer.transform(clean_train_dataset['tweet'])
y = clean_train_dataset[['class', 'class2']]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

# model = LogisticRegression(multi_class='multinomial', max_iter=700)
model = RandomForestClassifier()

predict_use_more_classes = True

if not predict_use_more_classes:
    # train and evaluate directly on the 3-class majority label
    model.fit(X_train, y_train['class'])
    y_pred = model.predict(X_valid)
    print(accuracy_score(y_valid['class'], y_pred))
else:
    # train on the finer-grained vote-weighted label, then round back to 0/1/2
    model.fit(X_train, y_train['class2'])
    y_pred = model.predict(X_valid)
    print(y_pred[:15])
    y_pred = np.round(y_pred.astype(float), 0).astype(int)
    print(accuracy_score(y_valid['class'], y_pred))

['1.0' '1.0' '1.0' '1.0' '1.0' '2.0' '1.0' '1.0' '1.0' '1.0' '0.7' '1.0'
 '1.0' '1.0' '2.0']
0.82
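
The cell above fits the TF-IDF vocabulary on all tweets before splitting, and the split is neither seeded nor stratified. A possible alternative, sketched below on a tiny made-up corpus, wraps the vectorizer and classifier in a `Pipeline` so the vocabulary is learned from the training fold only, and passes `stratify` and `random_state` to `train_test_split`. The toy texts, labels, and the choice of `LogisticRegression` are assumptions for illustration; this is not the configuration that produced the 0.82 above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 4 documents for each of 3 classes
texts = [
    "red apple pie", "green apple juice", "apple tart recipe", "baked apple slices",
    "fast blue car", "old car engine", "car race today", "new car tyres",
    "rainy cold day", "cold winter wind", "cold snowy night", "very cold morning",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# Seeded, stratified split keeps the class proportions in both folds
X_train, X_valid, y_train, y_valid = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The vectorizer is fitted inside the pipeline, i.e. on the training fold only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_valid)

print("accuracy:", accuracy_score(y_valid, y_pred))
print("macro F1:", f1_score(y_valid, y_pred, average="macro"))
```

Macro-averaged F1 is printed next to accuracy because, with one class covering about three quarters of the tweets, accuracy alone says little about the two minority classes.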


In [17]:
X_test = vectorizer.transform(clean_test_dataset['tweet'])
if not predict_use_more_classes:
    clean_test_dataset['class'] = model.predict(X_test)
else:
    predictions = model.predict(X_test)
    clean_test_dataset['class'] = np.round(predictions.astype(float), 0).astype(int)
    
clean_test_dataset.to_csv('speech_predictions.csv')

# LDA Topic Modeling

In [18]:
import gensim

def get_dictionary(train_docs, test_docs):
    dictionary = gensim.corpora.Dictionary(train_docs)
    dictionary.merge_with(gensim.corpora.Dictionary(test_docs))
    return dictionary

train_docs = clean_train_dataset['tweet'].apply(str.split)
test_docs = clean_test_dataset['tweet'].apply(str.split)
dictionary = get_dictionary(train_docs, test_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in train_docs]
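
`doc2bow` turns each tokenized tweet into a sparse list of `(token_id, count)` pairs. The plain-Python sketch below mimics that idea with a hand-built vocabulary and two toy documents; it is a stand-in for illustration, and the ids it assigns will not match those of `gensim.corpora.Dictionary`.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [["cute", "shi", "bunni"], ["time", "bounc", "bronx", "time"]]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))  # [(3, 2), (4, 1), (5, 1)]
```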



In [None]:
%%time
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=3,
                                       id2word=dictionary,
                                       passes=100,
                                       workers=2)


In [None]:
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")
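
With three topics and three classes, a natural follow-up is to check whether the topics line up with the labels. A minimal sketch, assuming per-document topic distributions in the `(topic_id, probability)` list format that `lda_model.get_document_topics(bow)` returns; the helper `dominant_topic` is hypothetical and the numbers are made up.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability, or None for an empty list."""
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Made-up distribution in the (topic_id, probability) format
print(dominant_topic([(0, 0.12), (1, 0.71), (2, 0.17)]))  # 1
```

Applying it to every entry of `bow_corpus` and cross-tabulating the result against `class` (for example with `pd.crosstab`) would show how closely the topics follow the labels.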