# Graduation Project - Hate Speech on Reddit

### Data Pre-processing
#### We will perform the following steps:

- Rows whose text or author is '[deleted]' are removed.
- Lowercase the words and remove punctuation.
- Replace some custom words.
- Tokenization: split the text into words.
- Remove stop words.
- Stemming: words are reduced to their root form.
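
As a rough illustration of the lowercasing, tokenization and stop-word steps, here is a minimal standard-library sketch on a made-up sentence. The function name `toy_preprocess` and the tiny stop list are assumptions for illustration only; the notebook itself uses NLTK's `RegexpTokenizer` and much larger stop-word lists, and stemming is left out here because it needs NLTK.

```python
import re

# Hypothetical mini stop list, for illustration only; the notebook combines
# NLTK's and stop_words' English lists with custom entries.
TOY_STOP_WORDS = {"i", "that", "he", "is", "such", "a", "the"}

def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on runs of word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

print(toy_preprocess("I agree, that is SUCH a bad idea!"))
```

Splitting on `\w+` removes punctuation as a side effect, which is why the notebook does not need a separate punctuation step.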

Import all necessary packages: Gensim, NLTK and the other libraries used below.

In [12]:
import pandas as pd
import numpy as np
import re
import gensim
from gensim import models, corpora
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from stop_words import get_stop_words
# Note: newer pyLDAvis releases expose this module as pyLDAvis.gensim_models
import pyLDAvis.gensim as gensimvis
import pyLDAvis

Load the DataFrame from the CSV file and remove rows with '[deleted]' values.

In [2]:
# Load the CSV, drop rows with a deleted author or text, and sort by author
def dataframe():
    df = pd.read_csv('GRADUATE PROJECT/df.csv', sep=',', engine='python')
    df = df[(df.Author != '[deleted]') & (df.Text != '[deleted]')]
    return df.sort_values('Author')

df = dataframe()
df.head()

Unnamed: 0,Subreddit,Author,Text
288291,"""The_Donald""","""------_______""","""I agree I hate that russian dick! he is such..."
255138,"""BlackPeopleTwitter""","""-----_------_---""","""No, but most people who can say \""x isn't rac..."
255072,"""BlackPeopleTwitter""","""-----_------_---""","""It's not as bad as it sounds. Although somewh..."
207009,"""dankmemes""","""-----_------_---""","""Get fucked, bitches"""
252805,"""BlackPeopleTwitter""","""-----_------_---""","""🔥😡 fuck these bitches making the rest of us l..."


Text preprocessing

In [3]:
def preprocess_text(df):
    # Lower case
    df.Text = df.Text.str.lower()
    # Custom replacing: match whole words only, longest alternative first,
    # so 'hahaha' does not become 'funnyha' and 'lol' inside other words is untouched
    df.Text = df.Text.str.replace(r'\b(?:hahaha|haha|lol)\b', 'funny', regex=True)
    # Tokenize on word characters, which also drops punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    df.Text = df.Text.apply(tokenizer.tokenize)
    return df

def stop_words(df):
    # Customized stop words list
    custom_stop = set(['Heh','heh','language','know', 'wow', 'hah', 'hey','really','year'
                       , 'yeah','wtf', 'tfw', 'meh', 'oops', 'nah', 'yea','doesnt','dont','make'
                       , 'huh', 'mar', 'umm', 'like', 'think','right', 'duh', 'sigh', 'wheres'
                       , 'hmm', 'interesting', 'article','good','know','say', 'hello', 'yup'
                       ,'im', 'ltsarcasmgt', 'hehe', 'blah', 'nope', 'ouch', 'uh', 'to', 'is'
                       , 'are', 's', 't'] + stopwords.words('english') + get_stop_words('en'))
    # Remove stop words from each token list
    df.Text = df.Text.apply(lambda x: [item for item in x if item not in custom_stop])
    return df

# Remove quotes from whole dataframe
df['Subreddit'] = df['Subreddit'].str.replace('"', '')
df['Author'] = df['Author'].str.replace('"', '')

df = preprocess_text(df)
df = stop_words(df)
df.head()

Unnamed: 0,Subreddit,Author,Text
288291,The_Donald,------_______,"[agree, hate, russian, dick, basterd, idiot, p..."
255138,BlackPeopleTwitter,-----_------_---,"[people, x, racist, nigger, fascists, phrase, ..."
255072,BlackPeopleTwitter,-----_------_---,"[bad, sounds, although, somewhere, north, muni..."
207009,dankmemes,-----_------_---,"[get, fucked, bitches]"
252805,BlackPeopleTwitter,-----_------_---,"[fuck, bitches, making, rest, us, look, bad]"
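
A note on the custom replacement step in `preprocess_text`: plain substring replacement is sensitive to the order of the words and also matches inside longer words. The sketch below uses a made-up sentence to compare replacing 'haha', 'hahaha' and 'lol' one after another with a single whole-word regular expression. It uses Python's built-in string methods rather than pandas, so treat it as an illustration of the idea only.

```python
import re

text = "hahaha lol, what a lollipop"

# Substring replacement in list order: 'haha' fires first and leaves a stray 'ha',
# and 'lol' is also replaced inside 'lollipop'.
naive = text
for word in ["haha", "hahaha", "lol"]:
    naive = naive.replace(word, "funny")

# Whole-word replacement, longest alternative first.
pattern = re.compile(r"\b(?:hahaha|haha|lol)\b")
whole_word = pattern.sub("funny", text)

print(naive)
print(whole_word)
```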


Stemming function

In [4]:
# Stemmer function
def stemmer_func(df):
    stemmer = SnowballStemmer("english")
    #df.Text = df.Text.map(lambda x: ' '.join([stemmer.stem(y) for y in x.split(' ')]))
    df.Text = df.Text.apply(lambda x: [stemmer.stem(y) for y in x])
    return df
df = stemmer_func(df)
df.head()

Unnamed: 0,Subreddit,Author,Text
288291,The_Donald,------_______,"[agre, hate, russian, dick, basterd, idiot, pe..."
255138,BlackPeopleTwitter,-----_------_---,"[peopl, x, racist, nigger, fascist, phrase, sm..."
255072,BlackPeopleTwitter,-----_------_---,"[bad, sound, although, somewher, north, munici..."
207009,dankmemes,-----_------_---,"[get, fuck, bitch]"
252805,BlackPeopleTwitter,-----_------_---,"[fuck, bitch, make, rest, us, look, bad]"


### Bag of Words on the Data set

Create a dictionary from 'texts' that maps each unique token to an integer ID. Gensim also records how many documents each token appears in, which the filtering step below relies on.

In [5]:
texts = df.Text.tolist()
dictionary = gensim.corpora.Dictionary(texts)
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 agre
1 antisemit
2 basterd
3 bitch
4 dick
5 facist
6 fuck
7 go
8 hate
9 hell
10 idiot


#### Gensim filter_extremes

#### Filter out tokens that appear in

- fewer than 15 documents (absolute number) or
- more than 50% of documents (no_above=0.5 is a fraction of the total corpus size, not an absolute number).
- After the above two steps, keep only the 100000 most frequent tokens.

In [6]:
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
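
The three filtering rules can be sketched in plain Python as follows. `filter_tokens` is a hypothetical stand-in that works on a token-to-document-frequency mapping with made-up counts; gensim's exact rounding of the `no_above` threshold may differ, so this is an approximation of the behaviour rather than its implementation.

```python
def filter_tokens(doc_freq, num_docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the tokens that survive the three filtering rules."""
    kept = [
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Most frequent first (by document frequency), then truncate to keep_n.
    kept.sort(key=lambda tok: doc_freq[tok], reverse=True)
    return kept[:keep_n]

# Made-up document frequencies for a corpus of 100 documents.
toy_doc_freq = {"peopl": 60, "hate": 40, "law": 20, "rare": 3}
print(filter_tokens(toy_doc_freq, num_docs=100))
```

Here 'peopl' is dropped for appearing in more than half of the documents and 'rare' for appearing in fewer than 15.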

### Gensim doc2bow

For each document, doc2bow produces a list of (token ID, count) pairs recording which dictionary words appear in it and how many times. The result is saved to 'bow_corpus'; a TF-IDF model is then fitted on it to produce 'corpus_tfidf'.

In [None]:
# 'texts' holds the tokenized, stemmed documents built above
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
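
For intuition, the two representations built in this cell can be sketched without gensim. The helpers `to_bow` and `tfidf_weights` and all the numbers below are hypothetical. The TF-IDF part assumes a base-2 logarithm for the inverse document frequency and scaling each vector to unit length, which is my understanding of the `TfidfModel` defaults; check this against the gensim documentation before relying on exact values.

```python
import math
from collections import Counter

def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

def tfidf_weights(bow, doc_freq, num_docs):
    """Weight counts by log2(num_docs / doc_freq), then scale the vector to unit length."""
    weighted = [(tid, cnt * math.log2(num_docs / doc_freq[tid])) for tid, cnt in bow]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    # Tokens present in every document get weight 0 and are dropped.
    return [(tid, w / norm) for tid, w in weighted if norm > 0 and w > 0]

# Made-up vocabulary and document frequencies for a corpus of 4 documents.
toy_token2id = {"bad": 0, "look": 1, "rest": 2}
toy_doc_freq = {0: 4, 1: 2, 2: 1}

bow = to_bow(["look", "bad", "look", "rest", "unseen"], toy_token2id)
weights = tfidf_weights(bow, toy_doc_freq, num_docs=4)
print(bow)
print(weights)
```

Note that 'bad' occurs in all four toy documents, so its TF-IDF weight is zero even though it appears in the bag-of-words vector.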

### Running LDA using Bag of Words

Train our LDA model using gensim.models.LdaMulticore and save it to 'lda_model'. For each topic, we will explore the words occurring in that topic and their relative weights.

In [9]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=30, id2word=dictionary, passes=2, workers=2)
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.041*"anonym" + 0.035*"peopl" + 0.032*"use" + 0.032*"word" + 0.028*"hate" + 0.023*"call" + 0.018*"op" + 0.017*"say" + 0.016*"someon" + 0.012*"mean"
Topic: 1 
Words: 0.024*"polic" + 0.020*"law" + 0.017*"u" + 0.013*"report" + 0.013*"crime" + 0.010*"court" + 0.010*"state" + 0.010*"arrest" + 0.009*"offic" + 0.008*"np"
Topic: 2 
Words: 0.028*"homosexu" + 0.028*"peopl" + 0.010*"gay" + 0.008*"person" + 0.008*"sexual" + 0.007*"one" + 0.007*"accept" + 0.006*"communiti" + 0.006*"way" + 0.006*"thing"
Topic: 3 
Words: 0.109*"white" + 0.072*"black" + 0.045*"peopl" + 0.025*"race" + 0.021*"racist" + 0.015*"latino" + 0.014*"asian" + 0.012*"american" + 0.011*"racism" + 0.010*"negro"
Topic: 4 
Words: 0.044*"watch" + 0.033*"video" + 0.024*"show" + 0.016*"play" + 0.014*"game" + 0.013*"movi" + 0.011*"music" + 0.010*"tv" + 0.010*"youtub" + 0.009*"listen"
Topic: 5 
Words: 0.076*"fuck" + 0.028*"train" + 0.027*"dead" + 0.022*"secret" + 0.020*"top" + 0.020*"littl" + 0.018*"mayb" + 0.016*"confi
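
When only the printed topic strings are available, they can be turned back into (word, weight) pairs with a small regular expression. `parse_topic` is a hypothetical helper written for this output format; with a live model, gensim's `show_topic` returns such pairs directly. An unterminated word at the end of a truncated line is simply skipped.

```python
import re

def parse_topic(topic_str):
    """Turn '0.041*"anonym" + 0.035*"peopl"' into [('anonym', 0.041), ('peopl', 0.035)]."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of Topic 1 from the output above.
topic_1 = '0.024*"polic" + 0.020*"law" + 0.017*"u"'
print(parse_topic(topic_1))
```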

## Visualization for LDA with TF

In [10]:
vis_data = gensimvis.prepare(lda_model, bow_corpus, dictionary)
pyLDAvis.display(vis_data)