# Suicide Detection - Data Wrangling

### Table of Contents
* [Step 1. Imports](#Step-1:--Imports) 
* [Step 2. Load the Data](#Step-2:--Load-the-Data)
* [Step 3.  Review Summary Data](#Step-3:--Look-at-Summary-of-Data)
* [Step 4.  Further Data Review (Head)](#Step-4:--Call-head-of-Data)
* [Step 5.  Identify Duplicates and Missing Data](#Step-5:--Identify-duplicates-and-missing-data)
* [Step 6.  Text Normalization](#Step-6:--Text-Normalization)
    * [Step 6a)  Convert Text to Lowercase](#Step-6a:--Convert-all-letters-to-lowercase.)
    * [Step 6b)  Convert urls and links to standardized text](#Step-6b:--Convert-urls-and-links-to-standardized-text)
    * [Step 6c)  Convert emojis and emoticons to text](#Step-6c:-Convert-emojis-and-emoticons-to-text)
    * [Step 6d)  Remove punctuation and numerals](#Step-6d:-Remove-punctuation-and-numerals)
    * [Step 6e)  Remove White Spaces](#Step-6e:-Remove-White-Spaces)
    * [Step 6f)  Language Detection](#Step-6f:-Language-Detection)
    * [Step 6g)  Convert Text to Lowercase](#Step-6g:--Expand-Contractions)
    * [Step 6h)  Spell Check](#Step-6h:--Spell-Check)
    * [Step 6i)  Filler Words](#Step:-6i:--Filler-Words)
    * [Step 6j)  Stop Words and Tokenization](#Step-6j:-Stop-Words-and-Tokenization)
    * [Step 6k)  Lemmatization](#Step-6k:-Lemmatization)
* [Step 7.  Save Cleaned File](#Step-7:--Save-Cleaned-File)

# Step 1:  Imports

In [1]:
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

import textstat
import re
import string
# !pip install watermark
#%load_ext watermark
from spacy.lang.en.stop_words import STOP_WORDS

import warnings
warnings.filterwarnings("ignore")



# Step 2:  Load the Data

In [2]:
df = pd.read_csv('Suicide_Detection.csv')

# Step 3:  Look at Summary of Data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232074 entries, 0 to 232073
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  232074 non-null  int64 
 1   text        232074 non-null  object
 2   class       232074 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.3+ MB


# Step 4:  Call head of Data

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide


# Step 5:  Identify duplicates and missing data

In [5]:
df['text'].value_counts().sum()

232074

This basically tells us that each entry  in the 'text' column is unique

In [6]:
df['text'].isnull().value_counts()

False    232074
Name: text, dtype: int64

In [7]:
df['class'].isnull().value_counts()

False    232074
Name: class, dtype: int64

Great we do not appear to have any duplicates or missing data.  In this regard the data is very clean.

# Step 6:  Text Normalization

#### Step 6a:  Convert all letters to lowercase.

In [8]:
df['text'] = df['text'].str.lower()

#### Step 6b:  Convert urls and links to standardized text

In [9]:
def standardize_url(text):
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '_link_to_site_', text)

In [10]:
#  used code snippets from - https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python

for i in range(len(df)):
    df['text'].iloc[i] = standardize_url(df['text'].iloc[i])

#### Step 6c: Convert emojis and emoticons to text

First we will convert emojis to text.

In [11]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,ex wife threatening suiciderecently i left my ...,suicide
1,3,am i weird i don't get affected by compliments...,non-suicide
2,4,finally 2020 is almost over... so i can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"i’m so losthello, my name is adam (16) and i’v...",suicide


In [12]:
import emoji

In [13]:
def no_emoji(text):
    return emoji.demojize(text)

In [14]:
for i in range(len(df)):
    df['text'].iloc[i] = no_emoji(df['text'].iloc[i])

Next we will convert emoticons to text.

In [15]:
from emot.emo_unicode import UNICODE_EMOJI # For emojis
from emot.emo_unicode import EMOTICONS_EMO# For EMOTICONS

In [16]:
def convert_emoticons(text):
    for emot in EMOTICONS_EMO:
        text = text.replace(emot, EMOTICONS_EMO[emot].replace(" ","_"))
    return text

In [17]:
text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

In [18]:
for item in range(len(df)):
    df['text'].iloc[item] = convert_emoticons(df['text'].iloc[item])

In [19]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,ex wife threatening suiciderecently i left my ...,suicide
1,3,am i weird i don't get affected by compliments...,non-suicide
2,4,finally 2020 is almost over... so i can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"i’m so losthello, my name is adam (16) and i’v...",suicide


#### Step 6d: Remove punctuation and numerals

In [20]:
df['text'].str.contains('2').any()

True

In [21]:

punctuation = ['-', '-', '⠿','ᒷ','⣵ ⣿','┘','┌','ℸ','⢿ ⣷','⣁','⣵','⣳','⡉ ⡉','¿','↸','⡇', '!', '@', '?', '.', '#', '▀ ▀', '▀', '⢠', '$', '· -·-·-', '⠁' , '%', '^', '&', '€', '*', '(', ')', ':', ';', '<', '>', '"',  '/', ',', '?', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[', ']', '{', '}', '|', '⣿ ⢸', '⣶ ⣶', '⣿ ⣶', '⡉ ⡉', '⣿', '⠄', '⣦', '⣷', '⢉', '⠙', '⠟', '▓', '⠉', '⠋', '⠛', '⡟', 'ㅤ ㅤ', '·', '⣧']

for item in punctuation:
    df['text'] = df['text'].str.replace(item, '')
    
print(df['text'].iloc[3])

i need helpjust help me im crying so hard


In [22]:
df['text'].str.contains('2').any()

False

#### Step 6e: Remove White Spaces

In [23]:
df['text'] = df['text'].str.strip()

#### Step 6f: Language Detection

In [24]:
from langdetect import detect

for iter in range(len(df)):
    if detect(str(df['text'].iloc[1])) != 'en':
        df.drop(i, inplace = True)

In [25]:
df.shape

(232074, 3)

#### Step 6g:  Expand Contractions

In [26]:
def decontracted(phrase):
  """decontracted takes text and convert contractions into natural form.
     ref: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490"""

  # specific
  phrase = re.sub(r"won\'t", "will not", phrase)
  phrase = re.sub(r"can\'t", "can not", phrase)
  phrase = re.sub(r"won\’t", "will not", phrase)
  phrase = re.sub(r"can\’t", "can not", phrase)

  # general
  phrase = re.sub(r"n\'t", " not", phrase)
  phrase = re.sub(r"\'re", " are", phrase)
  phrase = re.sub(r"\'s", " is", phrase)
  phrase = re.sub(r"\'d", " would", phrase)
  phrase = re.sub(r"\'ll", " will", phrase)
  phrase = re.sub(r"\'t", " not", phrase)
  phrase = re.sub(r"\'ve", " have", phrase)
  phrase = re.sub(r"\'m", " am", phrase)

  phrase = re.sub(r"n\’t", " not", phrase)
  phrase = re.sub(r"\’re", " are", phrase)
  phrase = re.sub(r"\’s", " is", phrase)
  phrase = re.sub(r"\’d", " would", phrase)
  phrase = re.sub(r"\’ll", " will", phrase)
  phrase = re.sub(r"\’t", " not", phrase)
  phrase = re.sub(r"\’ve", " have", phrase)
  phrase = re.sub(r"\’m", " am", phrase)

  return phrase

In [27]:
for i in range(len(df)):
    df['text'].iloc[i] = decontracted(df['text'].iloc[i])

In [28]:
df.to_csv("Suicide_Detection_Interim_Clean8622.csv", index = False)

#### Step 6h:  Spell Check

In [29]:
pip install textblob




In [30]:
from textblob import TextBlob

In [31]:
def spell_check(text):
    return str((TextBlob(text)).correct())

In [32]:
print(df['text'].iloc[0])
print('')
print(spell_check(df['text'].iloc[0]))

ex wife threatening suiciderecently i left my wife for good because she has cheated on me twice and lied to me so much that i have decided to refuse to go back to her as of a few days ago she began threatening suicide i have tirelessly spent these paat few days talking her out of it and she keeps hesitating because she wants to believe i will come back i know a lot of people will threaten this in order to get their way but what happens if she really does what do i do and how am i supposed to handle her death on my hands i still love my wife but i cannot deal with getting cheated on again and constantly feeling insecure i am worried today may be the day she does it and i hope so much it does not happen

ex wife threatening suiciderecently i left my wife for good because she has created on me twice and lied to me so much that i have decided to refuse to go back to her as of a few days ago she began threatening suicide i have carelessly spent these part few days talking her out of it and 

##### Note
The spell check function does not seem perfect for these purposes.  In the above example it changed a word from 'cheated' to 'created' which really changes the meaning of the sentence.  Since I was reluctant to change the meaning of the text, I opted not to run a spell check function on the entire data set.  

However, there were a number of issues with the dataset particularly in joining the word i with the word prior to it.  So I have developed a list of these examples and decided to resolve these spelling mistakes.  These may have been original to the postings or may have been joined while scraping the text.

In [33]:
my_list = ['anymoremy','myselfmy', 'meit', 'diemy', 'domy', 'myselfso', 'helphi', 'helpmy', 'lifemy']

In [34]:
i_list = ['diei','suicidei', 'tiredi', 'tonighti', 'anymorei', 'thoughtsi', 'myselfi', 'endi', 'betteri', 'alonei', 'helpi', 'alivei',
'sooni', 'livingi', 'pleasei', 'suicidali', 'losti', 'pointi', 'hopei', 'tomorrowi', 'advicei', 'stopi', 'goingi', 
'hopelessi', 'livei', 'deathi', 'paini', 'personi', 'weeki', 'lifei', 'carei', 'donei', 'depressedi', 'lonelyi', 'attempti',
'everythingi', 'scaredi', 'worse', 'everyonei', 'optioni', 'readyi', 'longeri', 'plani', 'headi', 'here', 'goi',
'emptyi', 'worthlessi', 'dyingi', 'doi', 'failurei', 'worsei', 'herei', 'worldi', 'anyonei', 'thisi', 'goodbyei', 'notei', 
'backi', 'someonei', 'iti', 'lefti', 'whyi', 'outi', 'myselft', 'upi', 'nighti', 'todayi', 'yearsi', 'toi', 'mindi', 'anymoret', 'happyi',
         'depressioni', 'wayi', 'ssri', 'downi', 'overi', 'nowi', 'sorryi', 'enoughi', 'hospitali', 'optionsi', 'pillsi', 'titlei', 'trappedi', 
         'edgei', 'yourselfi', 'uselessi', 'ranti', 'okayi', 'tryingi', 'caresi', 'existi']


In [35]:
def remove_i_my(text):
    for i in i_list:
        text = text.replace(i, i[:-1])
        for j in my_list:
            text = text.replace(j, j[:-2])
    return text

In [36]:
for i in range(len(df)):
    df['text'].iloc[i] = remove_i_my(df['text'].iloc[i])

In [37]:
df['text'] = df['text'].str.replace('worsi', 'worst')
df['text'] = df['text'].str.replace('helpso', 'help so')
df['text'] = df['text'].str.replace('ther', 'the')
df['text'] = df['text'].str.replace('fuck fuck', 'fuck')
df['text'] = df['text'].str.replace('gtpoplt gtpoplt', 'gtpoplt')
df['text'] = df['text'].str.replace('cheese cheese', 'cheese')
df['text'] = df['text'].str.replace('cecil cecil', 'cecil')
df['text'] = df['text'].str.replace('sus sus', 'sus')
df['text'] = df['text'].str.replace('cum cum', 'cum')
df['text'] = df['text'].str.replace('ni ni', 'ni')

#### Step: 6i:  Filler Words

For whatever reason there are many iterations of the word 'filler' throughout the data.  See below for an example of scraped postings with this word repeated many times.  Since this word does not provide any insight into the intention of the author, I am going to remove the word from the data set.

In [38]:
for i in range(0, 600):
    if 'filler filler' in df['text'].iloc[i]:
        print('iteration' , i , df['text'].iloc[i])

iteration 253 can constant masturbation lead to disinterest in girls asking for a friend 
filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler
iteration 414 what are some good halloween movies filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler filler


In [39]:
df['text'] = df['text'].str.replace('filler', '')

In [40]:
df.to_csv('Suicide_Detection_Normalized.csv')

#### Step 6j: Stop Words and Tokenization

In [41]:
# df = pd.read_csv('Suicide_Detection_Normalized.csv')

In [42]:
df.dropna(inplace = True)

In [43]:
df['word'] = ''

In [44]:
def tokenize(text):
    token = word_tokenize(text)
    result = []
    for item in token:
        if item in STOP_WORDS:
            pass
        else:
            result.append(item)
    return result        

In [45]:
for i in range(len(df)):
    df['word'].iloc[i] = tokenize(df['text'].iloc[i])

In [46]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class,word
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,..."
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliments, coming, know, i..."
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god..."
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, crying, hard]"
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, years, afraid, p..."


#### Step 6k: Lemmatization

In [47]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

for iter in range(len(df)):
    result = []
    for word in df['word'].iloc[iter]:
        result.append(lemmatizer.lemmatize(word))
    df['word'].iloc[iter] = result

In [48]:
print(lemmatizer.lemmatize('struggling'))

struggling


In [49]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class,word
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,..."
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliment, coming, know, ir..."
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god..."
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, cry, hard]"
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, year, afraid, pa..."


In [50]:
df['clean_text'] = ''

In [51]:
for iter in range(len(df)):
    words = df['word'].iloc[iter]
    words = ' '.join(words)
    df['clean_text'].iloc[iter] = words

In [52]:
# one last replace of repeated or misspelled items found in the clean text

In [53]:
df['clean_text'] = df['clean_text'].str.replace('ther', 'the')
df['clean_text'] = df['clean_text'].str.replace('fuck fuck', 'fuck')
df['clean_text'] = df['clean_text'].str.replace('sus sus', 'sus')
df['clean_text'] = df['clean_text'].str.replace('cheese cheese', 'cheese')


In [54]:
df.head(9)

Unnamed: 0.1,Unnamed: 0,text,class,word,clean_text
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,...",ex wife threatening suiciderecently left wife ...
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliment, coming, know, ir...",weird affected compliment coming know irl feel...
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god...",finally hear bad year swear fucking god annoying
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, cry, hard]",need helpjust help im cry hard
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, year, afraid, pa...",losthello adam struggling year afraid past yea...
5,11,honetly idki dont know what im even dong her i...,suicide,"[honetly, idki, dont, know, im, dong, feel, li...",honetly idki dont know im dong feel like nowhe...
6,12,trigger warning excuse for self inflicted burn...,suicide,"[trigger, warning, excuse, self, inflicted, bu...",trigger warning excuse self inflicted burnsi k...
7,13,it ends tonight can not do it anymore \ni quit,suicide,"[end, tonight, anymore, quit]",end tonight anymore quit
8,16,everyone wants to be edgy and it is making me ...,non-suicide,"[want, edgy, making, self, conscious, feel, li...",want edgy making self conscious feel like stan...


# Step 7:  Save Cleaned File

In [55]:
df.to_csv('Suicide_Detection_DataWrangling.csv')