In [17]:
import pandas as pd
import numpy as np

from matplotlib import figure
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')

import string


mt = pd.read_csv('MeTooHate.csv')

[nltk_data] Downloading package stopwords to /Users/alex/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/alex/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

## To start:

This dataset is too big for me to upload to GitHub. So I clean the dataset in this notebook first, export the cleaned version, and continue the work in a new notebook. The reason for the split is that the cleaning takes a few minutes to run.

# Me Too Hate Comments
***

The goal of this project is to separate hateful and non-hateful tweets.

## Step 1: Cleaning the dataset

As always, we start by taking a look at the big picture:
- a small view of the dataset
- the size of the dataset

In [5]:
mt.head()

Unnamed: 0,status_id,text,created_at,favorite_count,retweet_count,location,followers_count,friends_count,statuses_count,category
0,1046207313588236290,"Entitled, obnoxious, defensive, lying weasel. ...",2018-09-30T01:17:15Z,5,1,"McAllen, TX",2253,2303,23856,0
1,1046207328113086464,Thank you and for what you did for the women...,2018-09-30T01:17:19Z,5,2,"Tampa, FL",2559,4989,19889,0
2,1046207329589493760,Knitting (s) &amp; getting ready for January 1...,2018-09-30T01:17:19Z,0,0,"St Cloud, MN",16,300,9,0
3,1046207341283168256,Yep just like triffeling women weaponized thei...,2018-09-30T01:17:22Z,1,0,flyover country,3573,3732,38361,1
4,1046207347016826880,"No, the President wants to end movement posin...",2018-09-30T01:17:23Z,0,0,World,294,312,7635,0


In [6]:
mt.shape

(807174, 10)

Ok, so now we've seen what the data looks like and that we have a big dataset to work with (807,174 rows).
***
The next step is looking for
### Missing values

In [7]:
IsNull = mt.isnull().sum()
print(IsNull)

status_id               0
text                 3536
created_at              0
favorite_count          0
retweet_count           0
location           190768
followers_count         0
friends_count           0
statuses_count          0
category                0
dtype: int64
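The raw counts above are easier to judge as a share of all rows. A minimal sketch of that, using a small made-up frame called `toy` as a stand-in for `mt` (the values are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for `mt`; the values are made up for illustration.
toy = pd.DataFrame({
    "text": ["a tweet", None, "another tweet", "a third tweet"],
    "location": [None, None, "World", None],
})

missing_count = toy.isnull().sum()   # number of null values per column
missing_share = toy.isnull().mean()  # fraction of rows that are null, per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Applied to the counts printed above, this works out to roughly 0.4% missing for "text" (3,536 of 807,174) and roughly 24% for "location" (190,768 of 807,174).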


### My cleaning plan:
***
First we remove the columns we don't need. "location" is dropped as well: about 24% of its values are missing (190,768 of 807,174 rows), and it is not useful for telling hateful and non-hateful comments apart. Next, we remove the rows where "text" is null, because these are not useful for our model either. Without any input, those rows cannot help it predict anything.

As you can see in the code below, we remove every column except "text". The other columns are not used in the rest of this notebook.

In [8]:
# Remove the columns we don't need
mt = mt.drop(columns=[
    'status_id', 'location', 'created_at', 'followers_count', 'friends_count',
    'statuses_count', 'category', 'retweet_count', 'favorite_count',
])

In [9]:
mt = mt.dropna()
mt.shape

(803638, 1)
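The same two steps (keep only "text", then drop the empty rows) can also be written as a selection instead of a long drop list. This is only a sketch on a made-up frame called `toy`, not a replacement for the cells above:

```python
import pandas as pd

# Toy frame standing in for `mt`; the values are made up for illustration.
toy = pd.DataFrame({
    "status_id": [1, 2, 3],
    "text": ["first tweet", None, "third tweet"],
    "location": ["World", None, None],
})

# Double brackets return a DataFrame (not a Series) with just the "text" column,
# then dropna removes the rows where "text" is missing.
cleaned = toy[["text"]].dropna(subset=["text"])

print(cleaned.shape)
print(list(cleaned.index))  # the original row labels are kept
```

Note that dropna keeps the original row labels. That is why the outputs further down still show labels up to 807173 even though only 803,638 rows remain; calling reset_index(drop=True) would renumber them if needed.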

Well, that looks a little smaller :)
***
Next up we clean off punctuation (with Python's built-in "string" module) and stopwords (with NLTK, the Natural Language Toolkit). The first step is removing the punctuation.

### Cleaning punctuation

In [10]:
def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)
    text_without_punct = text.translate(translator)
    return text_without_punct

mt['text_without_punct'] = mt['text'].apply(remove_punctuation)
mt['text_without_punct']

0         Entitled obnoxious defensive lying weasel This...
1         Thank you  and  for what you did for the women...
2         Knitting s amp getting ready for January 19 20...
3         Yep just like triffeling women weaponized thei...
4         No the President wants to end  movement posing...
                                ...                        
807169    Let’s not forget that this “iconic kiss” was u...
807170    DEFINITELYthe only one any of us should suppor...
807171    Did the  movement count the dollars of Erin An...
807172    This is one of my all time fav songs amp video...
807173     I watched your news on the death of the sailo...
Name: text_without_punct, Length: 803638, dtype: object
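In the output above, curly quotes such as ’ and “ survive the cleaning. That is expected: string.punctuation only contains ASCII punctuation. One possible way to catch the rest is to also drop every character whose Unicode category is a punctuation category. The helper name remove_punctuation_unicode below is hypothetical and the sample sentence is made up; treat this as a sketch, not as the method used in this notebook.

```python
import string
import unicodedata

_ASCII_PUNCT = str.maketrans("", "", string.punctuation)

def remove_punctuation_unicode(text):
    # Hypothetical variant: first strip ASCII punctuation as before, then drop
    # any remaining character whose Unicode category starts with "P"
    # (punctuation), which covers curly quotes.
    text = text.translate(_ASCII_PUNCT)
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))

sample = "Let’s not forget that this “iconic kiss” was uninvited."

ascii_only = sample.translate(_ASCII_PUNCT)          # curly quotes remain
unicode_aware = remove_punctuation_unicode(sample)   # curly quotes removed

print(ascii_only)
print(unicode_aware)
```

One design caveat: removing the apostrophe glues "Let’s" into "Lets", so whether this is an improvement depends on how the tokens are used afterwards.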

In the next step we also tokenize the text, which breaks it down into individual words. This is very useful when you're preprocessing text.

***


Next step
### Removing stopwords

"Stop words are commonly used words (e.g. 'a', 'an', 'the') that often do not carry significant meaning and are typically removed during text processing."

In [11]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = nltk.word_tokenize(text)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]  #.lower() -> lowercase
    text_without_stopwords = ' '.join(filtered_tokens) #joins words back together in a string
    return text_without_stopwords

mt['text_without_stopwords'] = mt['text_without_punct'].apply(remove_stopwords)
mt['text_without_stopwords'] 

0         Entitled obnoxious defensive lying weasel thin...
1                                Thank women survivors week
2                Knitting amp getting ready January 19 2019
3         Yep like triffeling women weaponized poon Wond...
4              President wants end movement posing movement
                                ...                        
807169    Let ’ forget “ iconic kiss ” uninvited sexual ...
807170    DEFINITELYthe one us support unconditionally G...
807171        movement count dollars Erin Andrews wondering
807172    one time fav songs amp videos brutally honest ...
807173    watched news death sailor famous WW2 Kiss phot...
Name: text_without_stopwords, Length: 803638, dtype: object
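Row 4 above shows a side effect of this step: the leading "No" is gone, because NLTK's English list also contains negations such as "no" and "not". For a hate/non-hate task that may or may not be what we want. The sketch below illustrates the effect with a tiny hand-picked stop-word set and str.split() as a simple stand-in for nltk.word_tokenize; both are assumptions made for illustration, and remove_stopwords_simple is a hypothetical helper.

```python
# Tiny hand-picked stop-word set for illustration only; the notebook itself
# uses NLTK's much longer English list.
toy_stop_words = {"no", "not", "the", "to", "that", "this"}

def remove_stopwords_simple(text, stop_words):
    # str.split() is a simplified stand-in for nltk.word_tokenize.
    tokens = text.split()
    # Compare in lowercase so that "No" and "no" are treated the same.
    return " ".join(t for t in tokens if t.lower() not in stop_words)

sentence = "No the President wants to end movement"

without_negation = remove_stopwords_simple(sentence, toy_stop_words)
with_negation = remove_stopwords_simple(sentence, toy_stop_words - {"no", "not"})

print(without_negation)
print(with_negation)
```

Subtracting the negation words from the set before filtering is one simple way to keep them, if they turn out to matter for the model.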

### Stemming and Lemmatization

To give you (and myself) a better understanding, I'm going to give a small explanation of what these two techniques are.

Stemming:

Stemming reduces words to a base or root form by stripping affixes according to a set of predefined rules; the Porter stemmer used below strips suffixes. For example, "running" and "runs" are both reduced to the stem "run". Because the rules only look at spelling, an irregular form such as "ran" is left unchanged, and a stem does not have to be a real word ("famous" becomes "famou").

Suffix (plural: suffixes): a morpheme added at the end of a word to form a derivative (e.g. -ation, -fy, -ing, -itis).
"for the last few decades we've appended the suffix ‘gate’ to basically any scandal"

Lemmatization:

Lemmatization is a more advanced technique than stemming. It determines the base form (lemma) of a word using vocabulary and morphological analysis. For example, the adjective "better" can be lemmatized to "good", which no suffix rule could produce. NLTK's WordNetLemmatizer needs the part of speech for this: its lemmatize method treats every word as a noun unless a different "pos" argument is passed, so "better" only maps to "good" with pos='a'.
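To make the stemming rules concrete, here is a small sketch that runs NLTK's PorterStemmer on a few words. The word list is my own choice for illustration; the stemmer itself needs no extra NLTK data download.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words picked for illustration: two regular forms, one irregular form,
# and one word whose stem is not a dictionary word.
words = ["running", "runs", "ran", "famous"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The regular forms collapse to "run", while "ran" stays as it is because no suffix rule applies to it.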

In [13]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    
    # Stemming
    stemmed_words = [stemmer.stem(word) for word in tokens]
    
    # Lemmatization of the stems (default part of speech: noun)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
    
    return lemmatized_words

mt['processed_text'] = mt['text_without_stopwords'].apply(preprocess_text)
mt['processed_text']

0         [entitl, obnoxi, defens, lie, weasel, thing, m...
1                            [thank, woman, survivor, week]
2                [knit, amp, get, readi, januari, 19, 2019]
3         [yep, like, triffel, woman, weapon, poon, wond...
4             [presid, want, end, movement, pose, movement]
                                ...                        
807169    [let, ’, forget, “, icon, kiss, ”, uninvit, se...
807170    [definitelyth, one, u, support, uncondit, godi...
807171      [movement, count, dollar, erin, andrew, wonder]
807172    [one, time, fav, song, amp, video, brutal, hon...
807173    [watch, news, death, sailor, famou, ww2, kiss,...
Name: processed_text, Length: 803638, dtype: object

***
Now that we have cleaned the dataset, we export the processed text from "mt" to a new CSV file.

In [15]:
new_dataset = pd.DataFrame({'processed_text': mt['processed_text']})
new_dataset.to_csv('new_dataset.csv', index=False)
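One thing to keep in mind for the follow-up notebook: a CSV file has no list type, so every token list is written as its text representation and comes back as a plain string. A minimal sketch of the round trip, using an in-memory buffer instead of 'new_dataset.csv' and a made-up two-row frame:

```python
import ast
import io

import pandas as pd

# Made-up two-row frame with the same column layout as new_dataset.
toy = pd.DataFrame({
    "processed_text": [["thank", "woman", "survivor", "week"], ["movement", "count"]],
})

buffer = io.StringIO()  # in-memory stand-in for 'new_dataset.csv'
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
first_raw = reloaded["processed_text"].iloc[0]  # a string, not a list

# ast.literal_eval safely parses the text representation back into a list.
reloaded["processed_text"] = reloaded["processed_text"].apply(ast.literal_eval)
first_parsed = reloaded["processed_text"].iloc[0]

print(type(first_raw).__name__, type(first_parsed).__name__)
```

An alternative would be to join the tokens into one space-separated string before saving, which avoids the parsing step at the cost of losing the explicit token boundaries.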