## Data Cleaning, Pre-Processing, and Analysis

# Data Cleaning
This section covers data cleaning: finding missing values and keeping only the data that is meaningful for the classifier.

In [16]:
# data analysis imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# NLP Imports
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

from sklearn.feature_extraction.text import CountVectorizer

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import wordninja

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\91897\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91897\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
# data configurations
pd.set_option('display.max_columns', 100)
sns.set_style("darkgrid")

In [18]:
# loading the csv files
depression = pd.read_csv('depression.csv')
suicide_watch = pd.read_csv('suicide_watch.csv')
casual_convo = pd.read_csv('casual_conversation_vs_suicide.csv')

In [19]:
# viewing the columns of the casual conversation dataset
pd.set_option('display.max_columns', 500)
casual_convo.head()
casual_convo.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'parent_whitelist_status', 'stickied', 'url', 'subreddit_subscribers',
       'created_utc', 'num_crossposts', 'media', 'is_video', 'author_cakeday',
       'is_suicide'],
      dtype='object', length=107)

### Relevant Data

After viewing the data, there are 107 columns, but very few of them are needed for our classifier. We will choose the relevant columns and go from there.

To start, we will look at the title, text body, author username, number of comments, the is_suicide label, and the URL of the post.

In [20]:
casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].head(5)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,r/CasualConversation Welcome Thread - Month of...,Welcome to r/CasualConversation! Thank you for...,AutoModerator,65,0,https://www.reddit.com/r/CasualConversation/co...
1,"So, we bought a bidet...","Yes, a bidet. And I would like to tell everyon...",Zappavishnu,592,0,https://www.reddit.com/r/CasualConversation/co...
2,I started the process of legally changing my n...,Tldr; Birth name holds trauma. After 8 years o...,Ox-Moi,48,0,https://www.reddit.com/r/CasualConversation/co...
3,I pushed myself out of my comfort zone and sta...,I always wanted to be an artist but never been...,QueenPersephoneia,14,0,https://www.reddit.com/r/CasualConversation/co...
4,I got the job!,After over two years (about 4-5 months were af...,Fristi2147,15,0,https://www.reddit.com/r/CasualConversation/co...


These rows show a few of the posts people wrote. The first post is from AutoModerator: it is a welcome thread rather than an ordinary user post, and it has many comments. Posts like these may have to be removed.
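One way such moderator posts might be filtered out is sketched below. It assumes that pinned or automated threads can be recognized either by the `stickied` column (which appears in the column listing above) or by a known set of moderator usernames. The `MODERATOR_AUTHORS` set and the `drop_moderator_posts` helper are hypothetical names, and the toy frame only stands in for the real data.

```python
import pandas as pd

# Hypothetical set of accounts treated as moderators; adjust after inspecting the data.
MODERATOR_AUTHORS = {"AutoModerator"}

def drop_moderator_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without stickied posts or posts by known moderator accounts."""
    is_moderator = df["author"].isin(MODERATOR_AUTHORS)
    # eq(True) treats missing `stickied` values as "not stickied".
    is_stickied = df["stickied"].eq(True)
    return df[~(is_moderator | is_stickied)].reset_index(drop=True)

# Toy frame standing in for casual_convo (the `stickied` values are made up).
toy = pd.DataFrame({
    "title": ["Welcome Thread", "So, we bought a bidet...", "I got the job!"],
    "author": ["AutoModerator", "Zappavishnu", "Fristi2147"],
    "stickied": [True, False, False],
})
filtered = drop_moderator_posts(toy)
print(filtered["author"].tolist())
```

Whether removing these posts actually helps the classifier would need to be checked on the real data.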

In [21]:
suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].head(5)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,New wiki on how to avoid accidentally encourag...,We've been seeing a worrying increase in pro-s...,SQLwitch,251,1,https://www.reddit.com/r/SuicideWatch/comments...
1,Please remember that NO ACTIVISM of any kind i...,"Activism, i.e. advocating or fundraising for s...",SQLwitch,43,1,https://www.reddit.com/r/SuicideWatch/comments...
2,"""iT gEtS bEtTeR"", ""yOu'Re LoVeD"" and other cli...",Sick of hearing it. It's all just platitudes t...,peteau89,52,1,https://www.reddit.com/r/SuicideWatch/comments...
3,My therapist fired me for being too depressed.,That’s it that’s the post lmao,bunnyeatscarbs,13,1,https://www.reddit.com/r/SuicideWatch/comments...
4,if i kill myself all my problems will go away.,i just need to kill myself.,Sufficient_Resist169,14,1,https://www.reddit.com/r/SuicideWatch/comments...


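This is the suicide dataset. Just from the preview, the titles and posts clearly differ from the casual ones, but it is hard to say exactly what separates them and how to turn that into a classification rule. The post at index 3 has almost nothing in its body beyond a pointer to the title, and posts with no body at all would show up as missing values, which could be problematic.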

In [22]:
# viewing shapes of datasets
print(casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].shape)
print(suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].shape)

(789, 6)
(990, 6)


They differ somewhat in size (789 versus 990 rows), but after cleaning and processing this shouldn't be an issue.
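To put a number on the size difference, the short calculation below works out each dataset's share of the total from the shapes printed above. It is only a quick sanity check; the variable names are illustrative.

```python
# Row counts taken from the shapes printed above.
row_counts = {"casual_convo": 789, "suicide_watch": 990}

total = sum(row_counts.values())
shares = {name: count / total for name, count in row_counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of {total} posts")

# Ratio of the larger class to the smaller one; 1.0 would be perfectly balanced.
imbalance_ratio = max(row_counts.values()) / min(row_counts.values())
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```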

In [23]:
# reading post 80 from the depression dataset
print(depression["selftext"][80])
len(depression["selftext"][80])

I need to figure out how to be someone else. I’ve been happy with who I am my whole life. I used to. But it wasn’t enough for anyone else. 

What is so amazing about everyone else that people like them? Why do I have to live and pretend I’m happy when no one cares that I’m there? What did I do deserve to be like this?


321

In [24]:
# reading post 118 from the suicide dataset
print(suicide_watch["selftext"][118])
len(suicide_watch["selftext"][118])

I think I might actually do it now. I lost everything. It's time. I got nothing left to lose. I don't even have my heart anymore. They took that too.


149

Just from reading them, the two posts look fairly similar. However, the situations of the two people are different, and one of them is suicidal. The second person says outright that they might act now, while the first post makes no such statement. A classifier may be able to learn to distinguish posts like these. Let's read two more.

In [25]:
print(depression["selftext"][6])

It’s so weird that feeling nothing is the new normal for many , I used to think people were being super dramatic but that is really what it feels like . I think when I feel “ better “ , it’s not really like happiness , it’s more so that empty feeling that could pass for happiness / normality . I keep trying to force tears just to feel something ( as dramatic as it sounds ) , but nothing happens, I can never cry naturally , I have to force myself , and the minute I stop focusing on it , the tears dry up oops  . I feel isolated from everyone - like I can see my family / friends interacting but there’s like this chain keeping me rooted in place if that makes sense ? All communication is too much effort ? Let alone movement . The feeling of nothingness is almost safer and more comfortable and reliable ? I kinda prefer it than being happy I guess . Happy makes me feel vulnerable or uncomfortable . Everything good and bad in life comes from feelings .


In [26]:
print(suicide_watch["selftext"][100])

I can’t take the pain anymore. i’m going to either hang myself or throw myself in front of a train… i’m sorry to everyone. i just cant do it anymore.


Between these two posts the distinction is clearer, although a casual reader could still be unsure in less explicit cases. A classifier trained on many such posts may pick up on these differences more consistently. The first post describes emotional numbness without mentioning self-harm, while the second states a concrete intention to die.

In [88]:
# casual_convo = casual_convo.rename(columns={'causal': 'is_suicide'})

In [27]:
# the 6 columns we chose seem good, so let's shorten the datasets.
dep_columns = depression[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]
sui_columns = suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]
cas_columns = casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]

# let's combine the datasets into one dataset.
# combined_data = pd.concat([dep_columns,sui_columns, cas_columns],axis=0, ignore_index=True) 
# combined_data = pd.concat([dep_columns, cas_columns],axis=0, ignore_index=True)  
combined_data = pd.concat([sui_columns, cas_columns],axis=0, ignore_index=True)  
combined_data

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,New wiki on how to avoid accidentally encourag...,We've been seeing a worrying increase in pro-s...,SQLwitch,251,1,https://www.reddit.com/r/SuicideWatch/comments...
1,Please remember that NO ACTIVISM of any kind i...,"Activism, i.e. advocating or fundraising for s...",SQLwitch,43,1,https://www.reddit.com/r/SuicideWatch/comments...
2,"""iT gEtS bEtTeR"", ""yOu'Re LoVeD"" and other cli...",Sick of hearing it. It's all just platitudes t...,peteau89,52,1,https://www.reddit.com/r/SuicideWatch/comments...
3,My therapist fired me for being too depressed.,That’s it that’s the post lmao,bunnyeatscarbs,13,1,https://www.reddit.com/r/SuicideWatch/comments...
4,if i kill myself all my problems will go away.,i just need to kill myself.,Sufficient_Resist169,14,1,https://www.reddit.com/r/SuicideWatch/comments...
...,...,...,...,...,...,...
1774,The gym girl,"So what happened today was , I went to the Gym...",Maddy_Rock,4,0,https://www.reddit.com/r/CasualConversation/co...
1775,"Had a great evening, how are y'all?",Just got home from a really good evening. I wo...,LandSquiddy,10,0,https://www.reddit.com/r/CasualConversation/co...
1776,What would you bring to a syrup themed pot luc...,It can be any type of syrup. I'm thinking of m...,sumadviceplz,9,0,https://www.reddit.com/r/CasualConversation/co...
1777,"When flies die, do they die mid flight or land...",I thought of this when I saw a fly flying arou...,Spiritual-Clock5624,0,0,https://www.reddit.com/r/CasualConversation/co...


In [28]:
# saving the combined data in our datasets folder
combined_data.to_csv('suicide_vs_nothing.csv', index = False)

In [29]:
# checking for missing values
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1779 entries, 0 to 1778
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1779 non-null   object
 1   selftext      1724 non-null   object
 2   author        1779 non-null   object
 3   num_comments  1779 non-null   int64 
 4   is_suicide    1779 non-null   int64 
 5   url           1779 non-null   object
dtypes: int64(2), object(4)
memory usage: 83.5+ KB


It looks like the only missing values are in the text body (55 of 1779 rows), which makes sense.

In [30]:
# looking at the posts with missing text values
combined_data[combined_data["selftext"].isnull()].head(10)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
8,I’m not going to tell anyone close to me that ...,,That_Girl30,41,1,https://www.reddit.com/r/SuicideWatch/comments...
10,I wish I can go back to being a kid again,,Forsaken_Ad9089,2,1,https://www.reddit.com/r/SuicideWatch/comments...
29,"48 year old, I am unable to take care of mysel...",,this_isntmy_bestwork,0,1,https://www.reddit.com/r/SuicideWatch/comments...
30,My teacher literally jokes about being suicida...,,Murky_Coat7627,5,1,https://www.reddit.com/r/SuicideWatch/comments...
57,"Genuinely find no purpose in life , literally ...",,1idekbro_,0,1,https://www.reddit.com/r/SuicideWatch/comments...
108,Hi i need so fucking help can someone message ...,,Pale_Improvement2629,7,1,https://www.reddit.com/r/SuicideWatch/comments...
112,I need solitudinous autonomy.,,26982537,1,1,https://www.reddit.com/r/SuicideWatch/comments...
120,Any reasons not to ghost all of my friends and...,,fuckme-emoboy,2,1,https://www.reddit.com/r/SuicideWatch/comments...
178,Im highkey tempted to off myself,,Keeeeeeeepppppp,0,1,https://www.reddit.com/r/SuicideWatch/comments...
187,It does not get better.It just doesn't.Keep tr...,,TheManWhoEatsWomen,1,1,https://www.reddit.com/r/SuicideWatch/comments...


The posts with missing values are either very concise in the title and to the point, or the main text is basically in the title. Luckily, there aren't that many posts with missing values. However, most of the null values are in the suicide dataset, which makes sense but also could be troublesome for our classifier. Maybe using the titles as the text would be a good approach. 
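The title-as-text idea could look roughly like the sketch below. It is an alternative shown for comparison, not the approach taken in the following cells. The `text_or_title` column name is hypothetical, and the toy frame stands in for the combined data.

```python
import pandas as pd

# Toy frame standing in for combined_data (the body text is illustrative only).
toy = pd.DataFrame({
    "title": ["I wish I can go back to being a kid again", "I got the job!"],
    "selftext": [None, "Illustrative body text"],
})

# Hypothetical new column: use the body when present, otherwise fall back to the title.
toy["text_or_title"] = toy["selftext"].fillna(toy["title"])
print(toy["text_or_title"].tolist())
```

Passing a Series to `fillna` fills each missing value with the value at the same index, so each empty post receives its own title.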

In [31]:
combined_data["is_suicide"][combined_data["selftext"].isnull()].value_counts()

is_suicide
1    55
Name: count, dtype: int64

In [32]:
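# a simple approach for the null values is to fill them with the placeholder "emptypost"
# (assigning back avoids the chained inplace fillna, which newer pandas versions warn about)
combined_data["selftext"] = combined_data["selftext"].fillna("emptypost")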

In [33]:
# checking if filling missing values worked
combined_data[combined_data["selftext"].isin(["emptypost"])].head()

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
8,I’m not going to tell anyone close to me that ...,emptypost,That_Girl30,41,1,https://www.reddit.com/r/SuicideWatch/comments...
10,I wish I can go back to being a kid again,emptypost,Forsaken_Ad9089,2,1,https://www.reddit.com/r/SuicideWatch/comments...
29,"48 year old, I am unable to take care of mysel...",emptypost,this_isntmy_bestwork,0,1,https://www.reddit.com/r/SuicideWatch/comments...
30,My teacher literally jokes about being suicida...,emptypost,Murky_Coat7627,5,1,https://www.reddit.com/r/SuicideWatch/comments...
57,"Genuinely find no purpose in life , literally ...",emptypost,1idekbro_,0,1,https://www.reddit.com/r/SuicideWatch/comments...


In [34]:
# checking entire dataset for missing values
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1779 entries, 0 to 1778
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1779 non-null   object
 1   selftext      1779 non-null   object
 2   author        1779 non-null   object
 3   num_comments  1779 non-null   int64 
 4   is_suicide    1779 non-null   int64 
 5   url           1779 non-null   object
dtypes: int64(2), object(4)
memory usage: 83.5+ KB


# Data Preprocessing
The posts differ in punctuation and capitalization, so standardizing the text is an important first step.

### Preprocessing Functions
Let's begin by lowercasing the text, reducing words to their base forms (lemmatization), removing punctuation, and removing stop words. We will add the results as new columns in our data.

In [35]:
def processing_text(series_to_process):
    new_list = []
    tokenizer = RegexpTokenizer(r'\w+')
    lemmatizer = WordNetLemmatizer()
    # build the stop word set once instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    for text in series_to_process:
        # lowercase, then tokenize into words only (punctuation is dropped)
        words_only = tokenizer.tokenize(text.lower())
        # lemmatize each word
        words_only_lem = [lemmatizer.lemmatize(word) for word in words_only]
        # remove stop words
        words_without_stop = [word for word in words_only_lem if word not in stop_words]
        # join the remaining words back into one string
        new_list.append(" ".join(words_without_stop))
    return new_list

In [36]:
# adding the cleaned columns and checking the result
combined_data["selftext_clean"] = processing_text(combined_data["selftext"])
combined_data["title_clean"] = processing_text(combined_data["title"])
pd.set_option("display.max_colwidth", 100)
combined_data.head(8)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url,selftext_clean,title_clean
0,"New wiki on how to avoid accidentally encouraging suicide, and how to spot covert incitement","We've been seeing a worrying increase in pro-suicide content showing up here and, and also going...",SQLwitch,251,1,https://www.reddit.com/r/SuicideWatch/comments/cz6nfd/new_wiki_on_how_to_avoid_accidentally_enco...,seeing worrying increase pro suicide content showing also going unreported undermines purpose wa...,new wiki avoid accidentally encouraging suicide spot covert incitement
1,Please remember that NO ACTIVISM of any kind is ever allowed here. No matter what day it is.,"Activism, i.e. advocating or fundraising for social change or raising awareness of social issues...",SQLwitch,43,1,https://www.reddit.com/r/SuicideWatch/comments/pl9suy/please_remember_that_no_activism_of_any_ki...,activism e advocating fundraising social change raising awareness social issue suicide inescapab...,please remember activism kind ever allowed matter day
2,"""iT gEtS bEtTeR"", ""yOu'Re LoVeD"" and other cliches",Sick of hearing it. It's all just platitudes that don't even help me and probably many others' s...,peteau89,52,1,https://www.reddit.com/r/SuicideWatch/comments/12k0015/it_gets_better_youre_loved_and_other_clic...,sick hearing platitude even help probably many others situation,get better loved cliche
3,My therapist fired me for being too depressed.,That’s it that’s the post lmao,bunnyeatscarbs,13,1,https://www.reddit.com/r/SuicideWatch/comments/12jwn0h/my_therapist_fired_me_for_being_too_depre...,post lmao,therapist fired depressed
4,if i kill myself all my problems will go away.,i just need to kill myself.,Sufficient_Resist169,14,1,https://www.reddit.com/r/SuicideWatch/comments/12jsv5k/if_i_kill_myself_all_my_problems_will_go_...,need kill,kill problem go away
5,Great actually everyone hates depressed ppl,They just don’t want to be the bad guy and admit the fact.\nThey just give u the cold shoulder a...,Initial_Pineapple942,18,1,https://www.reddit.com/r/SuicideWatch/comments/12jl52i/great_actually_everyone_hates_depressed_ppl/,want bad guy admit fact give u cold shoulder pretend care u say really give fuck sick constant c...,great actually everyone hate depressed ppl
6,"Why do people try to prevent you from suicide, won't help you in living?","Been trying my best lately to work hard, be positive, and do better. Some people around definite...",kindofmischief,1,1,https://www.reddit.com/r/SuicideWatch/comments/12k71c3/why_do_people_try_to_prevent_you_from_sui...,trying best lately work hard positive better people around definitely make tougher expect functi...,people try prevent suicide help living
7,"Another bad day 😔 I don't know, my parents did the worst thing to me by bringing me to this worl...",Pain. Suffering. Depression. Bad days. Mental torture. Finance problem.. Career problem. Social ...,soulofamonk,0,1,https://www.reddit.com/r/SuicideWatch/comments/12k1yz3/another_bad_day_i_dont_know_my_parents_di...,pain suffering depression bad day mental torture finance problem career problem social anxiety f...,another bad day know parent worst thing bringing world good sensitive weak like 7th year suicida...


Cleaning the titles and text worked. This matters for our classifier because it simplifies the input and should make the distinction between the two datasets clearer.

In [37]:
# checking selftext_clean
pd.set_option("display.max_colwidth", 1000)
combined_data[["selftext","selftext_clean"]].tail(2)

Unnamed: 0,selftext,selftext_clean
1777,I thought of this when I saw a fly flying around my room. Do flies comprehend their mortality? Do any animal or insect comprehend their mortality? (I just discovered you can make a horrible mistake mixing up the ‘c’ and ‘s’ in insect),thought saw fly flying around room fly comprehend mortality animal insect comprehend mortality discovered make horrible mistake mixing c insect
1778,They are fun and friendly! I started to say good morning to them and always try to make them feel human. It is all too often do people dehumanize service workers. \n\nI always try to make quick casual conversations with baristas. It really makes their day! \n\nI just don't feel like treating them like subhuman trash.,fun friendly started say good morning always try make feel human often people dehumanize service worker always try make quick casual conversation baristas really make day feel like treating like subhuman trash


In [38]:
# testing wordninja
author_test = []
for i in range(10):
    splits_list = wordninja.split(combined_data["author"][i])
    combined_string = " ".join(splits_list)
    author_test.append(combined_string)
test_dict = {combined_data["author"][i]:author_test[i] for i in range(10)}
print(test_dict)

{'SQLwitch': 'SQL witch', 'peteau89': 'pete au 89', 'bunnyeatscarbs': 'bunny eats carb s', 'Sufficient_Resist169': 'Sufficient Resist 169', 'Initial_Pineapple942': 'Initial Pineapple 942', 'kindofmischief': 'kind of mischief', 'soulofamonk': 'soul of a monk', 'That_Girl30': 'That Girl 30', 'CatgirlSophie': 'Cat girl Sophie'}


In [40]:
# let's also clean the author names
def processing_author_names(series_to_process):
    # split run-together usernames into likely words, e.g. "kindofmischief" -> "kind of mischief"
    author_split = [" ".join(wordninja.split(name)) for name in series_to_process]
    # reuse the same lowercasing, lemmatizing, and stop word removal as for the post text
    return processing_text(author_split)

In [41]:
combined_data["author_clean"]= processing_author_names(combined_data["author"])

# checking author_clean
pd.set_option("display.max_colwidth", 100)
combined_data[["author","author_clean"]].tail(10)

Unnamed: 0,author,author_clean
1769,Intelligent_Quote823,intelligent quote 823
1770,DonMartiniMacaroni,martini macaroni
1771,B_Nicoleo,b nicole
1772,DaddyCeiling,daddy ceiling
1773,Riskyredhead,risky redhead
1774,Maddy_Rock,maddy rock
1775,LandSquiddy,land squid dy
1776,sumadviceplz,sum advice plz
1777,Spiritual-Clock5624,spiritual clock 5624
1778,Thegrandcultivator,grand cultivator


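The splitting is imperfect (for example, LandSquiddy becomes "land squid dy", and stop word removal drops the "the" from Thegrandcultivator), but this isn't a big deal: the author names matter less than the post text, and a roughly simplified form is good enough.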
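For comparison, a rule-based splitter that only breaks names at explicit boundaries (underscores, hyphens, case changes, and letter/digit transitions) is sketched below. This is not what the notebook uses; `split_username` is a hypothetical helper. It avoids over-splitting names such as LandSquiddy, but unlike wordninja it cannot split all-lowercase names such as sumadviceplz.

```python
import re

def split_username(name: str) -> str:
    """Split a username on underscores, hyphens, case changes, and letter/digit boundaries."""
    # Replace separators with spaces.
    spaced = re.sub(r"[_\-]+", " ", name)
    # Insert a space between a lowercase letter or digit and a following uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    # Insert a space at letter/digit boundaries, in both directions.
    spaced = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", spaced)
    # Lowercase and collapse any repeated whitespace.
    return " ".join(spaced.lower().split())

for example in ["Maddy_Rock", "LandSquiddy", "Spiritual-Clock5624", "sumadviceplz"]:
    print(example, "->", split_username(example))
```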

In [42]:
# making sure no new missing values were added
combined_data.isnull().sum()

title             0
selftext          0
author            0
num_comments      0
is_suicide        0
url               0
selftext_clean    0
title_clean       0
author_clean      0
dtype: int64

In [43]:
combined_data.to_csv('suicide_vs_nothing.csv', index = False)

## Data Preprocessing Complete
This was a relatively simple process because we only have a few attributes to adjust. We now have 3 cleaned attributes to train our model on: selftext_clean, title_clean, and author_clean.
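The cleaned columns are plain strings, so they would still need to be turned into numbers before any model can use them. The sketch below shows the bag-of-words idea in plain Python on two made-up cleaned posts. It is a hand-rolled illustration of the kind of counting that the imported CountVectorizer automates, not that class's implementation, and the notebook does not state how the features are actually built.

```python
from collections import Counter

# Toy cleaned texts in the style of the selftext_clean column (illustrative only).
cleaned_posts = [
    "fun friendly started say good morning",
    "good evening got home good evening",
]

# Vocabulary: every distinct word across all posts, in sorted order.
vocabulary = sorted({word for post in cleaned_posts for word in post.split()})

# One row of word counts per post, aligned with the vocabulary.
count_matrix = []
for post in cleaned_posts:
    counts = Counter(post.split())
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```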