## Data Cleaning, Pre-Processing, and Analysis

# Data Cleaning
This section covers data cleaning: finding missing values and keeping only the data that is meaningful for the classifier. There aren't many missing values, but dealing with them is still important.

In [77]:
# data analysis imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# NLP Imports
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re

from sklearn.feature_extraction.text import CountVectorizer

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import wordninja

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/ayaanhaque/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ayaanhaque/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [78]:
# data configurations
pd.set_option('display.max_columns', 100)
sns.set_style("darkgrid")

In [79]:
# loading the csv files
depression = pd.read_csv('../data/depression.csv')
suicide_watch = pd.read_csv('../data/suicide_watch.csv')
casual_convo = pd.read_csv('../data/casual_conversation_vs_suicide.csv')

In [80]:
# previewing the casual conversation dataset and its columns
pd.set_option('display.max_columns', 500)
casual_convo.head()
casual_convo.columns

Index(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved',
       'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext',
       ...
       'stickied', 'url', 'subreddit_subscribers', 'created_utc',
       'num_crossposts', 'media', 'is_video', 'poll_data', 'author_cakeday',
       'is_suicide'],
      dtype='object', length=107)

### Relevant Data

After viewing the data, there are 107 columns, but very few of them are needed for our classifier. We will choose the relevant columns and go from there.

To start, we will look at the title, text body, author username, number of comments, the `is_suicide` label, and lastly the URL of the post.
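
Because the same column selection is repeated for every dataset, it can help to name the list once and reuse it. A minimal sketch, assuming a toy frame in place of the real CSV files; `RELEVANT_COLUMNS` and `toy` are illustrative names that are not used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical constant: the six columns used throughout this notebook.
RELEVANT_COLUMNS = ["title", "selftext", "author", "num_comments", "is_suicide", "url"]

# Toy stand-in for one of the Reddit dataframes (made-up values).
toy = pd.DataFrame({
    "title": ["a post"],
    "selftext": ["some text"],
    "author": ["someone"],
    "num_comments": [3],
    "is_suicide": [0],
    "url": ["https://example.com"],
    "stickied": [False],  # one of the many columns we do not need
})

# Selecting with a list of names keeps only those columns, in that order.
subset = toy[RELEVANT_COLUMNS]
print(subset.shape)
```

Keeping the list in one place means a later change to the selection only has to be made once.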

In [81]:
casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].head(5)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,r/CasualConversation Lounge,Let's chat!,tizorres,1049,0,https://www.reddit.com/r/CasualConversation/comments/i9pr7m/rcasualconversation_lounge/
1,September Monthly Meta - r/CasualConversation Fireside Chat,Monthly Meta is back - [follow the collection](https://www.reddit.com/r/CasualConversation/colle...,tizorres,6,0,https://www.reddit.com/r/CasualConversation/comments/iko4ik/september_monthly_meta_rcasualconver...
2,I got a hotel room here in my town because I needed a time for myself.,"Look, I love my family and they're awesome. But, sometimes, I need to be alone. I like being alo...",nicedudefinallyhappy,161,0,https://www.reddit.com/r/CasualConversation/comments/inxmyy/i_got_a_hotel_room_here_in_my_town_b...
3,Your height is totally fine,Lately I’ve noticed many guys around my circle and on the internet that are very self conscious ...,PersianAss,1034,0,https://www.reddit.com/r/CasualConversation/comments/inlj6g/your_height_is_totally_fine/
4,I remember this conversation with an old acquaintance years ago and it changed me,"I was at a Halloween party with my best friend, Chelsea, and she had some other friends over. Th...",liverloo96,68,0,https://www.reddit.com/r/CasualConversation/comments/inlvis/i_remember_this_conversation_with_an...


From these rows, we can see a few posts that people posted. The first two posts look like they are from a moderator: both are by the same author, and one is a lounge thread while the other is a monthly meta post. Posts like these may have to be removed.
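
One possible way to drop such posts is to filter on the `stickied` column, which appears in the column listing above. This is only a sketch on made-up rows: whether the moderator threads in this dataset are actually flagged as stickied has not been checked here, and `toy` and `without_pinned` are illustrative names.

```python
import pandas as pd

# Toy rows standing in for the casual conversation dataset (made-up values).
toy = pd.DataFrame({
    "title": ["Lounge", "Monthly Meta", "I got a hotel room"],
    "author": ["mod_account", "mod_account", "regular_user"],
    "stickied": [True, True, False],
})

# Assumption: pinned (stickied) posts are moderator announcements.
# ~ inverts the boolean mask, so only non-pinned posts are kept.
without_pinned = toy[~toy["stickied"]].reset_index(drop=True)
print(without_pinned["title"].tolist())
```

If the flag turned out to be unreliable, filtering on a hand-checked list of moderator usernames would be another option.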

In [82]:
suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].head(5)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,"New wiki on how to avoid accidentally encouraging suicide, and how to spot covert incitement","We've been seeing a worrying increase in pro-suicide content showing up here and, and also going...",SQLwitch,260,1,https://www.reddit.com/r/SuicideWatch/comments/cz6nfd/new_wiki_on_how_to_avoid_accidentally_enco...
1,Reminder: Absolutely no activism of any kind is allowed here. Any day.,"If you want to recognise an occasion, please do so by offering extra support to those who've ask...",SQLwitch,124,1,https://www.reddit.com/r/SuicideWatch/comments/d2370x/reminder_absolutely_no_activism_of_any_kin...
2,To every single poster here i wanne say one thing,I really fucking feel you,NussNougatCroissant,46,1,https://www.reddit.com/r/SuicideWatch/comments/fe7bca/to_every_single_poster_here_i_wanne_say_on...
3,I just want it all to stop,Everyone ends up hating me eventually. \nMy psychologist of almost ten years blew up at me and k...,hda-SVN-njhdsx,5,1,https://www.reddit.com/r/SuicideWatch/comments/fee4k7/i_just_want_it_all_to_stop/
4,"Nobody gives a fuck until you die, and even then you're still not valid.",,lil_peemis,3,1,https://www.reddit.com/r/SuicideWatch/comments/fea9x1/nobody_gives_a_fuck_until_you_die_and_even...


This is the suicide dataset. Just from the preview, the titles and posts are clearly different from the casual conversation ones, but it is hard to say exactly what separates them or how a classifier should capture that. The fifth post (index 4) has no body, which could be problematic because it is a missing value.

In [83]:
# viewing shapes of datasets
print(casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].shape)
print(suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]].shape)

(835, 6)
(980, 6)


They are a bit different in size (835 vs. 980 posts), but after cleaning and processing this shouldn't be an issue.
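
To put the size difference in perspective, the class shares follow directly from the two row counts printed above. A quick sketch of the arithmetic; the variable names are illustrative:

```python
# Row counts taken from the shapes printed above.
n_casual = 835
n_suicide = 980
total = n_casual + n_suicide

# Share of each class in the combined data.
share_suicide = n_suicide / total
share_casual = n_casual / total
print(f"suicide: {share_suicide:.1%}, casual: {share_casual:.1%}")
```

A split of roughly 54% to 46% is close to balanced, which supports the expectation that the size difference should not be a problem.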

In [84]:
# reading the post at index 80 from the depression dataset
print(depression["selftext"][80])
len(depression["selftext"][80])

Hi! I've been battling depression for the past two years, officially diagnosed in summer 2019, currently on medication &amp; therapy. Beating myself up every. single. day. for just being "lazy", and "not being able to get it together", as well as struggling with the thought that i somehow just tricked everybody (including the doctors) into thinking that i have depression when really i dont. 🥴  
Anyways, after another horrifying therapy session of me complaining that i am really just a piece of shit that only wants to sleep 24/7, wasting her life away not accomplishing anything, my therapist urged me to list all the things which i do DO and sort of look at them from afar and try to practice gratitude and appreciation. Here it is:

1) most days, I am able to make myself brush my teeth and on the good days even take a shower  
2) i could have given up on all work  to kind of "sleep through" my depression (my partner can support me financially with no trouble) however as tempting as it is 

1874

In [85]:
# reading post 118 from the suicide dataset
print(suicide_watch["selftext"][118])
len(suicide_watch["selftext"][118])

I'm an 18 year old with severe depression, anxiety, ADHD, borderline personality disorder and DID at a point in my life. But I'm slowly recovering.

I have always felt like I never belonged to this world, because I am "too imaginative" and "too kind". I behave "not like how people should". I'm mostly well-liked, but I know they think I'm a weirdo. Everyone does.

It's final exam time for us, and I have to secure good marks, otherwise I won't be able to enter any good streams in a good college. It would be the end of my future. My family is in a financial crunch, but they give up everything for me. I am dead-set on helping my friends get through their depression and suicidal tendencies, but I'm failing at that. I have had three attempts, but I was saved by a person, who himself was alexithymic (I think) and suicidal. He still is, and he says if he doesn't do well, he will commit suicide.

I don't know what I'm gonna do if I don't perform the way I should. But I certainly know that I wil

1172

Just from reading them, they look pretty similar. However, the stories of the two people are completely different, and one of them is suicidal. The second person explicitly says they will die in 3 months, while the first post makes no such remark. A classifier could do a good job of distinguishing between these two. Let's read two more.

In [86]:
print(depression["selftext"][6])

why does it hurt so much? Why can’t I be happy without it? There’s this empty void in my heart that gets bigger everyday. I’m just waiting until it eats me up, since I’ll never have 2 sided love.


In [87]:
print(suicide_watch["selftext"][100])

I wanted to die starting in Jan 2018, but things have only gotten worse.

In summer 2018, those fucks on the Suicide Prevention chatroom called the police when I expressed suicidal ideations, so I was kidnapped and sent to a series of hospital-prisons with junk medical staff. I lost my job due to the hospital stay - and my apartment, car, and dog shortly followed.

Can't get a decent job because my resume is now all fucked up and I have no connections, and I refuse to go back to miserable jobs that pay horribly. I'd rather die than do that for life.

I am about to be sued on $4K debt, and then yesterday I was handed a $6.5K medical bill for treatment that would have been 100% unnecessary had I still had insurance and was able to go to regular check ups.

You fucks on the Suicide Prevention line made my life demonstrably worse. You destroyed the mechanisms that kept me going as an independent and self-sufficient human. Now I have nothing and am in a far worse position (logistically spea

Between these two posts, there is a clear distinction, but it would be hard for a regular person reading them to be sure. Only a classifier trained on thousands of people's experiences could do this. The first post talks about heartbreak and loneliness, while the second explains that the person would rather die than go on.

In [88]:
# casual_convo = casual_convo.rename(columns={'causal': 'is_suicide'})

In [89]:
# the 6 columns we chose seem good, so let's shorten the datasets.
dep_columns = depression[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]
sui_columns = suicide_watch[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]
cas_columns = casual_convo[["title", "selftext", "author",  "num_comments", "is_suicide","url"]]

# let's combine the datasets into one dataset.
# combined_data = pd.concat([dep_columns,sui_columns, cas_columns],axis=0, ignore_index=True) 
# combined_data = pd.concat([dep_columns, cas_columns],axis=0, ignore_index=True)  
combined_data = pd.concat([sui_columns, cas_columns],axis=0, ignore_index=True)  
combined_data

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
0,"New wiki on how to avoid accidentally encouraging suicide, and how to spot covert incitement","We've been seeing a worrying increase in pro-suicide content showing up here and, and also going...",SQLwitch,260,1,https://www.reddit.com/r/SuicideWatch/comments/cz6nfd/new_wiki_on_how_to_avoid_accidentally_enco...
1,Reminder: Absolutely no activism of any kind is allowed here. Any day.,"If you want to recognise an occasion, please do so by offering extra support to those who've ask...",SQLwitch,124,1,https://www.reddit.com/r/SuicideWatch/comments/d2370x/reminder_absolutely_no_activism_of_any_kin...
2,To every single poster here i wanne say one thing,I really fucking feel you,NussNougatCroissant,46,1,https://www.reddit.com/r/SuicideWatch/comments/fe7bca/to_every_single_poster_here_i_wanne_say_on...
3,I just want it all to stop,Everyone ends up hating me eventually. \nMy psychologist of almost ten years blew up at me and k...,hda-SVN-njhdsx,5,1,https://www.reddit.com/r/SuicideWatch/comments/fee4k7/i_just_want_it_all_to_stop/
4,"Nobody gives a fuck until you die, and even then you're still not valid.",,lil_peemis,3,1,https://www.reddit.com/r/SuicideWatch/comments/fea9x1/nobody_gives_a_fuck_until_you_die_and_even...
...,...,...,...,...,...,...
1810,Do anyone else just sit in their car because it’s one of the places you feel safe and no longer ...,"Sitting in my car, is a place where I cry, think and decompress. \n\nMy car has seen me cry more...",mayoeater,764,0,https://www.reddit.com/r/CasualConversation/comments/i8r3pc/do_anyone_else_just_sit_in_their_car...
1811,"As a male, I’m so tired of the lack of unique clothing available.",As a male it’s insanely frustrating to browse women’s clothing and see all of the unique styles ...,Childish_Brandino,2416,0,https://www.reddit.com/r/CasualConversation/comments/i8bea3/as_a_male_im_so_tired_of_the_lack_of...
1812,My teenager made me so proud tonight with a simple gesture,"I have 4 boys ranging from ages 3 to 14. Tonight the 3 year old was very tired, which of course ...",SedativeCorpse,251,0,https://www.reddit.com/r/CasualConversation/comments/i7mcfx/my_teenager_made_me_so_proud_tonight...
1813,"After 30 years of being open, my family’s restaurant is closing tonight.",My family has owned a fine dining italian restaurant since before i was born. Most all of my chi...,retirereddit,769,0,https://www.reddit.com/r/CasualConversation/comments/i65fce/after_30_years_of_being_open_my_fami...


In [90]:
# saving the combined data in our datasets folder
combined_data.to_csv('../data/suicide_vs_nothing.csv', index = False)

In [91]:
# checking for missing values
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1815 entries, 0 to 1814
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1815 non-null   object
 1   selftext      1751 non-null   object
 2   author        1815 non-null   object
 3   num_comments  1815 non-null   int64 
 4   is_suicide    1815 non-null   int64 
 5   url           1815 non-null   object
dtypes: int64(2), object(4)
memory usage: 85.2+ KB


It looks like the only missing values are in the text body (64 of 1,815 posts), which makes sense.

In [92]:
# looking at the posts with missing text values
combined_data[combined_data["selftext"].isnull()].head(10)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
4,"Nobody gives a fuck until you die, and even then you're still not valid.",,lil_peemis,3,1,https://www.reddit.com/r/SuicideWatch/comments/fea9x1/nobody_gives_a_fuck_until_you_die_and_even...
6,"I have two brothers who have killed themselves, and that fact is the only thing keeping me from ...",,ArsenalOwl,1,1,https://www.reddit.com/r/SuicideWatch/comments/feenlk/i_have_two_brothers_who_have_killed_themse...
8,I want to die I want to die I want to die,,alynde,4,1,https://www.reddit.com/r/SuicideWatch/comments/fecvpg/i_want_to_die_i_want_to_die_i_want_to_die/
17,"I am so sorry, but it has gotten worse",,SmushyKidK,4,1,https://www.reddit.com/r/SuicideWatch/comments/feerry/i_am_so_sorry_but_it_has_gotten_worse/
20,I want to douse my body in gasoline and set myself on fire,,SalehRobbins,3,1,https://www.reddit.com/r/SuicideWatch/comments/fefk02/i_want_to_douse_my_body_in_gasoline_and_set/
24,I can't do this anymore,,sappy_banana,4,1,https://www.reddit.com/r/SuicideWatch/comments/feeq4w/i_cant_do_this_anymore/
40,"If I had a gun, I’d blow my fucking brains out right now",,CGM2004,1,1,https://www.reddit.com/r/SuicideWatch/comments/fecg6p/if_i_had_a_gun_id_blow_my_fucking_brains_out/
43,This world is a joke.,,crybaby1577,11,1,https://www.reddit.com/r/SuicideWatch/comments/fe3uns/this_world_is_a_joke/
47,I think I’m ready,,___horse___,4,1,https://www.reddit.com/r/SuicideWatch/comments/fedy44/i_think_im_ready/
56,IM ABOUT To kill my self help,,myusernameisunknown1,0,1,https://www.reddit.com/r/SuicideWatch/comments/fehrt0/im_about_to_kill_my_self_help/


The posts with missing values are either very concise and to the point in the title, or the main text is essentially in the title. Luckily, there aren't many posts with missing values. However, most of the null values appear to be in the suicide dataset, which makes sense but could also be troublesome for our classifier. Using the titles as the text might be a good approach.
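
The title-as-text idea could look like the sketch below. It is shown only as an alternative on made-up rows; the notebook itself fills the gaps with a fixed placeholder instead. `toy` and `selftext_filled` are illustrative names.

```python
import pandas as pd

# Toy rows: the first post has a title but no body (made-up values).
toy = pd.DataFrame({
    "title": ["I can't do this anymore", "Your height is totally fine"],
    "selftext": [None, "Lately I've noticed many guys..."],
})

# Passing a Series to fillna fills each missing body with the value
# at the same index, here the post's own title.
toy["selftext_filled"] = toy["selftext"].fillna(toy["title"])
print(toy["selftext_filled"].tolist())
```

One trade-off is that the title text would then appear in two features for those posts, whereas a placeholder keeps the "no body" signal explicit.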

In [93]:
combined_data["is_suicide"][combined_data["selftext"].isnull()].value_counts()

1    64
Name: is_suicide, dtype: int64

In [94]:
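# the best approach for the null values is to just fill them with "emptypost"
# assigning the result back avoids relying on in-place modification of a column selection
combined_data["selftext"] = combined_data["selftext"].fillna("emptypost")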

In [95]:
# checking if filling missing values worked
combined_data[combined_data["selftext"].isin(["emptypost"])].head()

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url
4,"Nobody gives a fuck until you die, and even then you're still not valid.",emptypost,lil_peemis,3,1,https://www.reddit.com/r/SuicideWatch/comments/fea9x1/nobody_gives_a_fuck_until_you_die_and_even...
6,"I have two brothers who have killed themselves, and that fact is the only thing keeping me from ...",emptypost,ArsenalOwl,1,1,https://www.reddit.com/r/SuicideWatch/comments/feenlk/i_have_two_brothers_who_have_killed_themse...
8,I want to die I want to die I want to die,emptypost,alynde,4,1,https://www.reddit.com/r/SuicideWatch/comments/fecvpg/i_want_to_die_i_want_to_die_i_want_to_die/
17,"I am so sorry, but it has gotten worse",emptypost,SmushyKidK,4,1,https://www.reddit.com/r/SuicideWatch/comments/feerry/i_am_so_sorry_but_it_has_gotten_worse/
20,I want to douse my body in gasoline and set myself on fire,emptypost,SalehRobbins,3,1,https://www.reddit.com/r/SuicideWatch/comments/fefk02/i_want_to_douse_my_body_in_gasoline_and_set/


In [96]:
# checking entire dataset for missing values
combined_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1815 entries, 0 to 1814
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1815 non-null   object
 1   selftext      1815 non-null   object
 2   author        1815 non-null   object
 3   num_comments  1815 non-null   int64 
 4   is_suicide    1815 non-null   int64 
 5   url           1815 non-null   object
dtypes: int64(2), object(4)
memory usage: 85.2+ KB


# Data Preprocessing
The posts all use different punctuation and capitalization, so standardizing the text is an important first step.

### Preprocessing Functions
Let's begin by removing capitalization and punctuation, reducing words to their base forms (lemmatization), and removing stop words. We will add the results as new columns to our data.
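
Before the full function, here is a stripped-down sketch of the same idea using only the standard library. It skips lemmatization and uses a tiny hand-picked stop list, so it is an illustration of the steps rather than the notebook's actual method; `TOY_STOP_WORDS` and `toy_clean` are hypothetical names.

```python
import re

# Assumption: a tiny hand-picked stop list for illustration only;
# the notebook itself uses NLTK's English stop word list.
TOY_STOP_WORDS = {"i", "you", "the", "a", "to", "it", "all", "just"}

def toy_clean(text):
    """Lowercase, keep runs of word characters, drop toy stop words."""
    # \w+ matches letters, digits and underscores, so punctuation is discarded
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

print(toy_clean("I just want it all to stop."))
print(toy_clean("Let's chat!"))
```

Note that the apostrophe in "Let's" splits the word into "let" and "s", which is a side effect of tokenizing on word characters only.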

In [97]:
def processing_text(series_to_process):
    """Lowercase, tokenize, lemmatize, and remove stop words from each entry."""
    tokenizer = RegexpTokenizer(r'\w+')
    lemmatizer = WordNetLemmatizer()
    # build the stop word set once instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    new_list = []
    for dirty_string in series_to_process:
        # words_only is a list of only the words, no punctuation
        words_only = tokenizer.tokenize(dirty_string.lower())
        # lemmatize each word
        words_only_lem = [lemmatizer.lemmatize(word) for word in words_only]
        # remove stop words
        words_without_stop = [word for word in words_only_lem if word not in stop_words]
        # join the separated words back into one string
        new_list.append(" ".join(words_without_stop))
    return new_list

In [98]:
# adding the cleaned columns and checking that they were added
combined_data["selftext_clean"] = processing_text(combined_data["selftext"])
combined_data["title_clean"] = processing_text(combined_data["title"])
pd.set_option("display.max_colwidth", 100)
combined_data.head(8)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url,selftext_clean,title_clean
0,"New wiki on how to avoid accidentally encouraging suicide, and how to spot covert incitement","We've been seeing a worrying increase in pro-suicide content showing up here and, and also going...",SQLwitch,260,1,https://www.reddit.com/r/SuicideWatch/comments/cz6nfd/new_wiki_on_how_to_avoid_accidentally_enco...,seeing worrying increase pro suicide content showing also going unreported undermines purpose wa...,new wiki avoid accidentally encouraging suicide spot covert incitement
1,Reminder: Absolutely no activism of any kind is allowed here. Any day.,"If you want to recognise an occasion, please do so by offering extra support to those who've ask...",SQLwitch,124,1,https://www.reddit.com/r/SuicideWatch/comments/d2370x/reminder_absolutely_no_activism_of_any_kin...,want recognise occasion please offering extra support asked good citizen community mindful tip g...,reminder absolutely activism kind allowed day
2,To every single poster here i wanne say one thing,I really fucking feel you,NussNougatCroissant,46,1,https://www.reddit.com/r/SuicideWatch/comments/fe7bca/to_every_single_poster_here_i_wanne_say_on...,really fucking feel,every single poster wanne say one thing
3,I just want it all to stop,Everyone ends up hating me eventually. \nMy psychologist of almost ten years blew up at me and k...,hda-SVN-njhdsx,5,1,https://www.reddit.com/r/SuicideWatch/comments/fee4k7/i_just_want_it_all_to_stop/,everyone end hating eventually psychologist almost ten year blew kicked last session doubting ev...,want stop
4,"Nobody gives a fuck until you die, and even then you're still not valid.",emptypost,lil_peemis,3,1,https://www.reddit.com/r/SuicideWatch/comments/fea9x1/nobody_gives_a_fuck_until_you_die_and_even...,emptypost,nobody give fuck die even still valid
5,I want to die,Dude I just want death. I’ve never actually wanted to die before but I just really really don’t ...,My21SabbathChemicals,6,1,https://www.reddit.com/r/SuicideWatch/comments/fee4yo/i_want_to_die/,dude want death never actually wanted die really really want everyone hate honestly hate everyon...,want die
6,"I have two brothers who have killed themselves, and that fact is the only thing keeping me from ...",emptypost,ArsenalOwl,1,1,https://www.reddit.com/r/SuicideWatch/comments/feenlk/i_have_two_brothers_who_have_killed_themse...,emptypost,two brother killed fact thing keeping looked found sub instead
7,"8 years ago I posted here, wanting to die. My life is so much better now.","When I was 15-16 years old, I posted here in my darkest moments. I couldn't see any reason to ke...",deppressionthrowaway,36,1,https://www.reddit.com/r/SuicideWatch/comments/fe6pma/8_years_ago_i_posted_here_wanting_to_die_m...,wa 15 16 year old posted darkest moment see reason keep living thought wa nothing special next 8...,8 year ago posted wanting die life much better


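Cleaning the titles and text worked, which is important for our classifier: it simplifies the text and should create a clearer distinction between the two datasets. Note that because lemmatization runs before stop word removal, a few stop words survive in truncated form (for example, "was" becomes "wa" and "has" becomes "ha").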

In [99]:
# checking selftext_clean
pd.set_option("display.max_colwidth", 1000)
combined_data[["selftext","selftext_clean"]].tail(2)

Unnamed: 0,selftext,selftext_clean
1813,"My family has owned a fine dining italian restaurant since before i was born. Most all of my childhood memories are in that restaurant. Everyday after school i’d go do my homework at the bar, i’d follow my dad around the kitchen and help with little things like making salads, and i will never forget making my first pizza at 4 years old. Whenever it stormed really bad and we lost power my family would go and sleep on the floor in the dining room — it was always a safe space. It was always somewhere for us to go, something for us to do, something that needed constant watering and attention. It’s been my family’s livelihood for my entire existence. It’s kept my belly full as well as my heart. It’s my father’s lifelong work and it’s made me respect him sooo much after 30 years of being there to cook for 14+ hours a day. I don’t know who i’d be without this restaurant. It’s shaped me in ways that i couldn’t possibly explain over a reddit post. It’s made me confident, brave, not scared o...",family ha owned fine dining italian restaurant since wa born childhood memory restaurant everyday school go homework bar follow dad around kitchen help little thing like making salad never forget making first pizza 4 year old whenever stormed really bad lost power family would go sleep floor dining room wa always safe space wa always somewhere u go something u something needed constant watering attention family livelihood entire existence kept belly full well heart father lifelong work made respect sooo much 30 year cook 14 hour day know without restaurant shaped way possibly explain reddit post made confident brave scared little heat built incredible relationship staff lucky work long started taking seriously wa 15 wa busgirl first wa hostess got little older became server good one especially spilling red wine one many older woman embarrassing beyond compare realized 18 like father wa cook 24 cooking alongside dad everyday since realized knack beautiful exhausting exhilarating 
men...
1814,"First of all, I’m not Japanese. I came to Japan about 12 years ago to work as a programmer in Tokyo as a contractual job. A couple years of that and I felt I love it here, so I decided to live in Japan and took a full-time job.\n\nFortunately, my personality fits the culture and I feel like I’m thriving here. There are pros and cons in every single country, of course, and Japan is no exception. But I feel the pros outweigh the cons, so I’m a bit biased. \n\nI’ve lived in both the urban metropolis and the provincial rural towns. My current neighborhood is a suburb that’s somewhere in the middle. Less than an hour from central Tokyo, yet surrounded by forests and mountains, two rivers, and everything is walkable.\n\nI realized that there are a lot of people on Reddit who are interested in Japan, both lovers and haters. If you have anything you want to know, I’ll try to answer it with what I know and have experienced.\n\nIf you’re not comfortable in making your post public, you can ju...",first japanese came japan 12 year ago work programmer tokyo contractual job couple year felt love decided live japan took full time job fortunately personality fit culture feel like thriving pro con every single country course japan exception feel pro outweigh con bit biased lived urban metropolis provincial rural town current neighborhood suburb somewhere middle le hour central tokyo yet surrounded forest mountain two river everything walkable realized lot people reddit interested japan lover hater anything want know try answer know experienced comfortable making post public send private message


In [100]:
# testing wordninja
author_test = []
for i in range(10):
    splits_list = wordninja.split(combined_data["author"][i])
    combined_string = " ".join(splits_list)
    author_test.append(combined_string)
test_dict = {combined_data["author"][i]:author_test[i] for i in range(10)}
print(test_dict)

{'SQLwitch': 'SQL witch', 'NussNougatCroissant': 'Nuss Nougat Croissant', 'hda-SVN-njhdsx': 'hd a SVN nj hds x', 'lil_peemis': 'lil pee mis', 'My21SabbathChemicals': 'My 21 Sabbath Chemicals', 'ArsenalOwl': 'Arsenal Owl', 'deppressionthrowaway': 'dep press ion throwaway', 'alynde': 'a lyn de', 'Frocharocha': 'F rocha rocha'}


In [101]:
# let's also clean the author names
def processing_author_names(series_to_process):
    """Split each username into likely words with wordninja, then clean it like the post text."""
    # wordninja.split breaks a concatenated string into probable English words
    author_split = [" ".join(wordninja.split(name)) for name in series_to_process]
    # reuse the same lowercase / tokenize / lemmatize / stop word steps as processing_text
    return processing_text(author_split)

In [102]:
combined_data["author_clean"]= processing_author_names(combined_data["author"])

# checking author_clean
pd.set_option("display.max_colwidth", 100)
combined_data[["author","author_clean"]].tail(10)

Unnamed: 0,author,author_clean
1805,Cael450,ca el 450
1806,jaycub84,jay cub 84
1807,mirarom,mira rom
1808,SilyTheGoose,sil goose
1809,flawmyy,flaw
1810,mayoeater,mayo eater
1811,Childish_Brandino,childish brandi
1812,SedativeCorpse,sedative corpse
1813,retirereddit,retire reddit
1814,BeardedGlass,bearded glass


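So the splitting doesn't work that well (for example, "Childish_Brandino" becomes "childish brandi" and "flawmyy" becomes "flaw"), but it isn't too big a deal because the author names don't matter as much. As long as they are simplified, this is good enough.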
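
Some usernames already mark their own word boundaries with capital letters, digits, or underscores, so a rule-based split could complement wordninja for those cases. A sketch under that assumption; `split_username` is a hypothetical helper and not part of this notebook:

```python
import re

def split_username(name):
    """Split on separators, case changes, and letter/digit boundaries."""
    # Alternatives, tried in order at each position:
    #   [A-Z]?[a-z]+  an optionally capitalized lowercase word, e.g. "Sabbath"
    #   [A-Z]+        a run of capitals, e.g. "SVN"
    #   \d+           a run of digits, e.g. "21"
    # Underscores and hyphens match nothing, so they act as separators.
    pieces = re.findall(r"[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return [p.lower() for p in pieces]

print(split_username("My21SabbathChemicals"))
print(split_username("Childish_Brandino"))
print(split_username("mayoeater"))
```

The last call shows the limitation: an all-lowercase name such as "mayoeater" stays in one piece, which is exactly the case where a dictionary-based splitter like wordninja is still needed.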

In [103]:
# making sure no new missing values were added
combined_data.isnull().sum()

title             0
selftext          0
author            0
num_comments      0
is_suicide        0
url               0
selftext_clean    0
title_clean       0
author_clean      0
dtype: int64

In [104]:
combined_data.to_csv('../data/suicide_vs_nothing.csv', index = False)

## Data Preprocessing Complete
This was a relatively simple process because we only have a few attributes to adjust. We now have three cleaned text attributes to train our model on: `selftext_clean`, `title_clean`, and `author_clean`.
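
If a later step needs a single text field rather than three, the cleaned columns could be joined per row. This is only one possible way to use them, shown on made-up rows; the `all_text` column is a hypothetical name that does not exist in the saved CSV.

```python
import pandas as pd

# Toy rows with the three cleaned columns (made-up values).
toy = pd.DataFrame({
    "title_clean": ["want stop", "height totally fine"],
    "selftext_clean": ["emptypost", "lately noticed many guy"],
    "author_clean": ["arsenal owl", "bearded glass"],
})

# Element-wise string concatenation, with a space between the parts.
toy["all_text"] = (
    toy["title_clean"] + " " + toy["selftext_clean"] + " " + toy["author_clean"]
)
print(toy["all_text"].tolist())
```

Keeping the columns separate instead would let a model weight titles, bodies, and usernames differently.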