# Suicide Detection - Data Wrangling

### Table of Contents
* [Step 1. Imports](#Step-1:--Imports) 
* [Step 2. Load the Data](#Step-2:--Load-the-Data)
* [Step 3.  Review Summary Data](#Step-3:--Look-at-Summary-of-Data)
* [Step 4.  Further Data Review (Head)](#Step-4:--Call-head-of-Data)
* [Step 5.  Identify Duplicates and Missing Data](#Step-5:--Identify-duplicates-and-missing-data)
* [Step 6.  Text Normalization](#Step-6:--Text-Normalization)
    * [Step 6a)  Convert Text to Lowercase](#Step-6a:--Convert-all-letters-to-lowercase.)
    * [Step 6b)  Convert urls and links to standardized text](#Step-6b:--Convert-urls-and-links-to-standardized-text)
    * [Step 6c)  Convert emojis and emoticons to text](#Step-6c:-Convert-emojis-and-emoticons-to-text)
    * [Step 6d)  Remove punctuation and numerals](#Step-6d:-Remove-punctuation-and-numerals)
    * [Step 6e)  Remove White Spaces](#Step-6e:-Remove-White-Spaces)
    * [Step 6f)  Language Detection](#Step-6f:-Language-Detection)
    * [Step 6g)  Expand Contractions](#Step-6g:--Expand-Contractions)
    * [Step 6h)  Spell Check](#Step-6h:--Spell-Check)
    * [Step 6i)  Stop Words and Tokenization](#Step-6i:-Stop-Words-and-Tokenization)
    * [Step 6j)  Lemmatization](#Step-6j:-Lemmatization)
* [Step 7.  Save Cleaned File](#Step-7:--Save-Cleaned-File)

# Step 1:  Imports

In [1]:
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range
from sklearn.feature_extraction.text import CountVectorizer
from tqdm import tqdm 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

import textstat
import re
import string
# !pip install watermark
#%load_ext watermark
from spacy.lang.en.stop_words import STOP_WORDS

import warnings
warnings.filterwarnings("ignore")



# Step 2:  Load the Data

In [2]:
df = pd.read_csv('Suicide_Detection.csv')

# Step 3:  Look at Summary of Data

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232074 entries, 0 to 232073
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  232074 non-null  int64 
 1   text        232074 non-null  object
 2   class       232074 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.3+ MB


# Step 4:  Call head of Data

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide


# Step 5:  Identify duplicates and missing data

In [5]:
df['text'].value_counts().sum()

232074

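This sum counts every non-null entry in the 'text' column. It equals the number of rows (232074), so no text is missing. It does not by itself show that each entry is unique; `df['text'].nunique()` or `df['text'].duplicated().sum()` would test that directly.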

In [6]:
df['text'].isnull().value_counts()

False    232074
Name: text, dtype: int64

In [7]:
df['class'].isnull().value_counts()

False    232074
Name: class, dtype: int64

Neither column has missing values, so in this regard the data is very clean. Note that the checks above do not test for duplicate texts.
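
The difference between these checks can be seen on a tiny example. The following is a minimal sketch, assuming a made-up four-row frame (`toy` and its contents are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for the real data; the rows are invented for illustration.
toy = pd.DataFrame({"text": ["i need help", "hello", "i need help", None]})

n_rows = len(toy)
n_non_null = toy["text"].value_counts().sum()  # non-null rows, duplicates included
n_unique = toy["text"].nunique()               # distinct non-null values
n_duplicated = toy["text"].duplicated().sum()  # rows that repeat an earlier value
n_missing = toy["text"].isna().sum()           # missing values

print(n_rows, n_non_null, n_unique, n_duplicated, n_missing)
```

Here `value_counts().sum()` reports 3 even though only 2 distinct texts exist, which is why `nunique()` or `duplicated()` is the more direct test for duplicates.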

# Step 6:  Text Normalization

#### Step 6a:  Convert all letters to lowercase.

In [8]:
df['text'] = df['text'].str.lower()

#### Step 6b:  Convert urls and links to standardized text

In [9]:
#  used code snippets from - https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python


# Replace every URL with a placeholder token in one vectorized pass
url_pattern = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'
df['text'] = df['text'].str.replace(url_pattern, '_link_to_site_', regex=True)
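
To see what this pattern does and does not catch, here is a small standalone sketch using only `re`. The sample sentences and the `replace_links` helper are illustrative assumptions, not part of the notebook:

```python
import re

# Same pattern as in the cell above: it requires a scheme such as "https://".
URL_PATTERN = r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'

def replace_links(text):
    """Swap each matched URL for the placeholder token."""
    return re.sub(URL_PATTERN, '_link_to_site_', text)

with_scheme = replace_links("see https://example.com/a/b?x=1 now")
without_scheme = replace_links("see www.example.com now")

print(with_scheme)
print(without_scheme)
```

The first sentence has its link replaced, while the second is left unchanged because the pattern only matches text that contains `://`. Bare `www.` links therefore stay in the data unless a second pattern is added.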

#### Step 6c: Convert emojis and emoticons to text

First we will convert emojis to text.

In [10]:
import emoji

In [11]:
text1 = "I won 🥇 in 🏏"
text2 = 'Python is the 💣'

print(emoji.demojize(text1))
print(emoji.demojize(text2))

I won :1st_place_medal: in :cricket_game:
Python is the :bomb:


In [12]:
  
# Apply demojize to every row; avoids chained assignment through .iloc
df['text'] = df['text'].apply(emoji.demojize)

Next we will convert emoticons to text.

In [13]:
from emot.emo_unicode import UNICODE_EMOJI # For emojis
from emot.emo_unicode import EMOTICONS_EMO# For EMOTICONS

In [14]:
def convert_emoticons(text):
    for emot in EMOTICONS_EMO:
        text = text.replace(emot, EMOTICONS_EMO[emot].replace(" ","_"))
    return text

In [15]:
text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

In [16]:
# Apply the emoticon conversion to every row
df['text'] = df['text'].apply(convert_emoticons)

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class
0,2,ex wife threatening suiciderecently i left my ...,suicide
1,3,am i weird i don't get affected by compliments...,non-suicide
2,4,finally 2020 is almost over... so i can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"i’m so losthello, my name is adam (16) and i’v...",suicide


#### Step 6d: Remove punctuation and numerals

In [18]:
df['text'].str.contains('2').any()

True

In [19]:

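# Characters to remove: punctuation, symbols, and digits.
# Apostrophes are not in the list, so contractions are still intact for Step 6g.
punctuation = ['!', '@', '?', '.', '#', '$', '%', '^', '&', '€', '*', '(', ')', ':', ';', '<', '>', '"',  '/', ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[', ']', '{', '}', '|']

for item in punctuation:
    # regex=False treats each character literally ('.', '?', '*' are regex metacharacters)
    df['text'] = df['text'].str.replace(item, '', regex=False)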
    
print(df['text'].iloc[3])

i need helpjust help me im crying so hard


In [20]:
df['text'].str.contains('2').any()

False
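
The character-by-character removal in this step could also be expressed as a single translation table. This is a minimal sketch using only the standard library and assuming the same character list; `strip_chars` and the sample sentence are illustrative names, not part of the notebook:

```python
# Same characters as the punctuation list in this step, written as one string.
CHARS_TO_REMOVE = '!@?.#$%^&€*():;<>"/,0123456789[]{}|'

# maketrans with a third argument maps every listed character to None (deletion).
TABLE = str.maketrans('', '', CHARS_TO_REMOVE)

def strip_chars(text):
    """Delete every listed character in a single pass over the string."""
    return text.translate(TABLE)

sample = "finally 2020 is almost over... so i don't hear it again!"
cleaned = strip_chars(sample)
print(repr(cleaned))
```

Removing "2020" leaves two consecutive spaces behind, and the apostrophe in "don't" survives because it is not in the list. On a pandas column the same table could be applied with `df['text'].str.translate(TABLE)`.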

#### Step 6e: Remove White Spaces

In [21]:
df['text'] = df['text'].str.strip()
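
Note that `str.strip()` only trims the ends of each string; whitespace inside the text, such as the `\n` still visible in a later `df.head(9)` output, is left alone. If interior whitespace should be collapsed as well, one possible approach is sketched below (`collapse_whitespace` is a hypothetical helper, and the sample string is made up):

```python
import re

def collapse_whitespace(text):
    """Replace each run of spaces, tabs, or newlines with one space, then trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

collapsed = collapse_whitespace("  it ends tonight \ni quit  ")
print(repr(collapsed))
```

On the dataframe, the equivalent would be `df['text'].str.replace(r'\s+', ' ', regex=True).str.strip()`.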

#### Step 6f: Language Detection

In [22]:
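from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        return detect(str(text)) == 'en'
    except LangDetectException:  # raised for empty or feature-less text
        return False

df = df[df['text'].apply(is_english)]  # keep only rows detected as English

In [23]:
df.shape


(232074, 3)

Note:  The shape above comes from a run in which no rows were removed (it matches the 232074 rows loaded in Step 2). Checking every row individually may drop some rows, so the count can be lower on a fresh run.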

#### Step 6g:  Expand Contractions

In [24]:
def decontracted(phrase):
  """decontracted takes text and convert contractions into natural form.
     ref: https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python/47091490#47091490"""

  # specific
  phrase = re.sub(r"won\'t", "will not", phrase)
  phrase = re.sub(r"can\'t", "can not", phrase)
  phrase = re.sub(r"won\’t", "will not", phrase)
  phrase = re.sub(r"can\’t", "can not", phrase)

  # general
  phrase = re.sub(r"n\'t", " not", phrase)
  phrase = re.sub(r"\'re", " are", phrase)
  phrase = re.sub(r"\'s", " is", phrase)
  phrase = re.sub(r"\'d", " would", phrase)
  phrase = re.sub(r"\'ll", " will", phrase)
  phrase = re.sub(r"\'t", " not", phrase)
  phrase = re.sub(r"\'ve", " have", phrase)
  phrase = re.sub(r"\'m", " am", phrase)

  phrase = re.sub(r"n\’t", " not", phrase)
  phrase = re.sub(r"\’re", " are", phrase)
  phrase = re.sub(r"\’s", " is", phrase)
  phrase = re.sub(r"\’d", " would", phrase)
  phrase = re.sub(r"\’ll", " will", phrase)
  phrase = re.sub(r"\’t", " not", phrase)
  phrase = re.sub(r"\’ve", " have", phrase)
  phrase = re.sub(r"\’m", " am", phrase)

  return phrase

In [25]:

# Expand contractions in every row
df['text'] = df['text'].apply(decontracted)
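
The effect of these rules is easiest to judge on a few short strings. The sketch below is a condensed variant, not the function above: it assumes the same rules but folds the straight and curly apostrophe into one character class. `RULES` and `decontract` are illustrative names:

```python
import re

APOS = "['’]"  # matches either a straight or a curly apostrophe

# Specific cases first, then the general suffix rules, mirroring the order used above.
RULES = [
    (rf"won{APOS}t", "will not"),
    (rf"can{APOS}t", "can not"),
    (rf"n{APOS}t", " not"),
    (rf"{APOS}re", " are"),
    (rf"{APOS}s", " is"),
    (rf"{APOS}d", " would"),
    (rf"{APOS}ll", " will"),
    (rf"{APOS}t", " not"),
    (rf"{APOS}ve", " have"),
    (rf"{APOS}m", " am"),
]

def decontract(phrase):
    """Apply each substitution rule in order."""
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

expanded = decontract("i can't go and she won’t say i’m here")
possessive = decontract("adam's book")
print(expanded)
print(possessive)
```

The second example shows a limitation worth keeping in mind: the `'s` rule cannot tell a contraction from a possessive, so "adam's book" becomes "adam is book".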


In [26]:
df.to_csv("Suicide_Detection_Interim_Clean8622.csv", index = False)

#### Step 6h:  Spell Check

In [27]:
from textblob import TextBlob


In [28]:
df['text'].iloc[0]

'ex wife threatening suiciderecently i left my wife for good because she has cheated on me twice and lied to me so much that i have decided to refuse to go back to her as of a few days ago she began threatening suicide i have tirelessly spent these paat few days talking her out of it and she keeps hesitating because she wants to believe i will come back i know a lot of people will threaten this in order to get their way but what happens if she really does what do i do and how am i supposed to handle her death on my hands i still love my wife but i cannot deal with getting cheated on again and constantly feeling insecure i am worried today may be the day she does it and i hope so much it does not happen'

In [29]:
str(TextBlob(df['text'].iloc[0]).correct())

'ex wife threatening suiciderecently i left my wife for good because she has created on me twice and lied to me so much that i have decided to refuse to go back to her as of a few days ago she began threatening suicide i have carelessly spent these part few days talking her out of it and she keeps hesitating because she wants to believe i will come back i know a lot of people will threaten this in order to get their way but what happens if she really does what do i do and how am i supposed to handle her death on my hands i still love my wife but i cannot deal with getting created on again and constantly feeling insecure i am worried today may be the day she does it and i hope so much it does not happen'

In [30]:
# importing the nltk suite 
import nltk
  
# importing edit distance  
from nltk.metrics.distance  import edit_distance

nltk.download('words')
from nltk.corpus import words

[nltk_data] Downloading package words to C:\Users\Beth &
[nltk_data]     Andrew\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [31]:
correct_words = words.words()

incorrect_words=['thehappy', 'elephant', 'azmaing', 'intelliengt']

for word in incorrect_words:
    temp = [(edit_distance(word, w),w) for w in correct_words if w[0]==word[0]]
    print(sorted(temp, key = lambda val:val[0])[0][1])

therapy
elephant
aiming
intelligent


Note:  In this notebook I attempted two different methods for spell checking our dataset.  Neither method was perfect. While some improvements were made, many of the changes made by the spell checker actually changed the meaning of the text (for example, "cheated" became "created" and "tirelessly" became "carelessly").  It was decided that leaving spelling errors in place is better than altering the meaning of the original text.  Therefore the spell checker was NOT applied to the whole dataset.
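
One way to inspect the problem described in this note is to list the tokens a corrector changed. This is a small sketch, assuming whitespace tokenization and using fragments copied from the In [28] and In [29] outputs above; `changed_tokens` is a hypothetical helper:

```python
def changed_tokens(before, after):
    """Return (original, corrected) pairs for tokens that differ at the same position."""
    before_tokens, after_tokens = before.split(), after.split()
    if len(before_tokens) != len(after_tokens):
        raise ValueError("token counts differ; a position-wise comparison is not meaningful")
    return [(b, a) for b, a in zip(before_tokens, after_tokens) if b != a]

# Fragments of the original text and of the TextBlob-corrected text shown above.
pairs = [
    ("she has cheated on me twice", "she has created on me twice"),
    ("i have tirelessly spent these paat few days", "i have carelessly spent these part few days"),
]

changes = []
for original, corrected in pairs:
    changes.extend(changed_tokens(original, corrected))

print(changes)
```

Of the three changes in these fragments, only "paat" was an actual misspelling, and its correction to "part" is questionable as well; the other two replace valid words with different ones.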

In [32]:
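# index=False keeps the row index out of the file (avoids an extra 'Unnamed: 0' column on reload)
df.to_csv('Suicide_Detection_Normalized.csv', index = False)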

#### Step 6i: Stop Words and Tokenization

In [33]:
# df = pd.read_csv('Suicide_Detection_Normalized.csv')

In [34]:
df.dropna(inplace = True)

In [35]:
df['word'] = ''

In [36]:

def remove_stop_words(text):
    # Tokenize, then keep only the tokens that are not spaCy stop words
    return [token for token in word_tokenize(text) if token not in STOP_WORDS]

df['word'] = df['text'].apply(remove_stop_words)

In [37]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class,word
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,..."
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliments, coming, know, i..."
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god..."
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, crying, hard]"
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, years, afraid, p..."


#### Step 6j: Lemmatization

In [38]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()

# Lemmatize every token in each row's token list
df['word'] = df['word'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])

In [39]:
df.head()

Unnamed: 0.1,Unnamed: 0,text,class,word
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,..."
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliment, coming, know, ir..."
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god..."
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, cry, hard]"
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, year, afraid, pa..."


In [40]:
df['clean_text'] = ''

In [43]:
# Join the lemmatized tokens back into a single string per row
df['clean_text'] = df['word'].apply(' '.join)

In [44]:
df.head(9)

Unnamed: 0.1,Unnamed: 0,text,class,word,clean_text
0,2,ex wife threatening suiciderecently i left my ...,suicide,"[ex, wife, threatening, suiciderecently, left,...",ex wife threatening suiciderecently left wife ...
1,3,am i weird i do not get affected by compliment...,non-suicide,"[weird, affected, compliment, coming, know, ir...",weird affected compliment coming know irl feel...
2,4,finally is almost over so i can never hear h...,non-suicide,"[finally, hear, bad, year, swear, fucking, god...",finally hear bad year swear fucking god annoying
3,8,i need helpjust help me im crying so hard,suicide,"[need, helpjust, help, im, cry, hard]",need helpjust help im cry hard
4,9,i am so losthello my name is adam and i have ...,suicide,"[losthello, adam, struggling, year, afraid, pa...",losthello adam struggling year afraid past yea...
5,11,honetly idki dont know what im even doing here...,suicide,"[honetly, idki, dont, know, im, feel, like, fe...",honetly idki dont know im feel like feel unbea...
6,12,trigger warning excuse for self inflicted burn...,suicide,"[trigger, warning, excuse, self, inflicted, bu...",trigger warning excuse self inflicted burnsi k...
7,13,it ends tonighti can not do it anymore \ni quit,suicide,"[end, tonighti, anymore, quit]",end tonighti anymore quit
8,16,everyone wants to be edgy and it is making me ...,non-suicide,"[want, edgy, making, self, conscious, feel, li...",want edgy making self conscious feel like stan...


# Step 7:  Save Cleaned File

In [45]:
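# index=False keeps the row index out of the file (avoids an extra 'Unnamed: 0' column on reload)
df.to_csv('Suicide_Detection_DataWrangling.csv', index = False)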