# **Insert Title Here**
**DATA103 S11 Group 4**
- GOZON, Jean Pauline D.
- JAMIAS, Gillian Nicole A.
- MARCELO, Andrea Jean C.
- REYES, Anton Gabriel G.
- VICENTE, Francheska Josefa

## **Introduction**

### **Requirements and Imports**

#### Imports

**Basic Libraries**

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis



In [None]:
import numpy as np
import pandas as pd

**Visualization Libraries**

* `matplotlib.pyplot` contains functions to create static and interactive plots
* `seaborn` is a library based on matplotlib for statistical data visualization
* `spacy` is an open-source Python library for processing text data; its stopword list is used when building the word clouds
* `wordcloud` contains functions for generating word clouds from text data

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

**Natural Language Processing Libraries**
* `re` is a module that allows the use of regular expressions
* `nltk` provides functions for processing text data
* `stopwords` is a corpus from NLTK, which includes a compiled list of stopwords
* `Counter` is from Python's collections module; it counts hashable items, which is helpful for computing word frequencies
* `string` contains functions for string operations
* `TfidfVectorizer` converts the given text documents into a matrix of TF-IDF features
* `CountVectorizer` converts the given text documents into a matrix, which has the counts of the tokens

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk import ngrams

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('vader_lexicon')

from collections import Counter
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

**Machine Learning Libraries**

* `torch` is an open-source machine learning library for building deep neural networks
* `transformers` contains pre-trained models


In [None]:
#!pip install transformers

In [None]:
import torch
from transformers import BertTokenizer, BertTokenizerFast, BertModel 
from transformers import AutoTokenizer

**Google Drive**
* `google.colab` is a library that allows the Colab notebook to mount Google Drive

#### Datasets and Files

The following `.csv` files were used in this project:
* `Suicide_Detection.csv` contains the text and two classes, namely suicide and non-suicide. Retrieved from the "Suicide and Depression Detection" dataset on Kaggle.
* `twitter-suicidal-intention-dataset.csv` is similar to `Suicide_Detection.csv`, but the intention is numbered: in the intention column, 1 means the tweet is suicidal and 0 means it is not. Retrieved from GitHub.
* `500_anonymized_Reddit_users_posts_labels.csv` contains the text of a post and its intention label (five labels: Ideation, Indicator, Behavior, Attempt, and Supportive).
* `suicide notes.csv` contains the text of suicide notes but has no column labelling the notes as suicidal.

## **Data Collection**




In [None]:
#importing the .csv file from kaggle
watch_df = pd.read_csv('data/Suicide_Detection.csv')
watch_df.head()

In [None]:
print(watch_df["class"].unique())

In [None]:
# importing the twitter dataset
url = "https://raw.githubusercontent.com/laxmimerit/twitter-suicidal-intention-dataset/master/twitter-suicidal_data.csv"
twit_df = pd.read_csv(url)
twit_df.head()

In [None]:
print(twit_df["intention"].unique())

In [None]:
anon_df = pd.read_csv("data/500_anonymized_Reddit_posts.csv")
anon_df.head()

In [None]:
print(anon_df["Label"].unique())

In [None]:
notes_df = pd.read_csv("data/suicide notes.csv")
notes_df.head()

## **Description of the Dataset**

In [None]:
#getting the shape of the four datasets
display(watch_df.shape, twit_df.shape, anon_df.shape, notes_df.shape)

In [None]:
#info() prints its own summary and returns None, so it is called directly
for df in (watch_df, twit_df, anon_df, notes_df):
    df.info()

After inspecting the non-null counts of each column per dataset, the dataframes `watch_df`, `twit_df`, and `anon_df` are complete. However, `notes_df` contains null values.

## **Data Preprocessing**

### **Pre-Processing**

#### DataFrames

Since the dataframe `notes_df` has null values, we remove those rows using pandas' `dropna()` function. Setting `axis` to 0 drops rows that contain missing values. Setting the `how` parameter to `"any"` removes a row if it has at least one null value. Setting `inplace` to `True` modifies the existing dataframe.

In [None]:
notes_df.dropna(axis = 0, how = "any", inplace=True)
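
As a minimal sketch of how these `dropna()` parameters behave, the toy dataframe below (hypothetical data named `toy_notes`, not taken from `suicide notes.csv`) has two rows with a missing `text` value. It returns a new dataframe instead of using `inplace`, so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for notes_df
toy_notes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "text": ["first note", np.nan, "third note", None],
})

# axis=0 drops rows; how="any" drops a row if at least one value is missing
cleaned = toy_notes.dropna(axis=0, how="any")

print(cleaned.shape)                 # number of rows and columns that remain
print(cleaned.isnull().sum().sum())  # total number of remaining nulls
```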

After checking the total number of null values in the whole `notes_df`, we can see it is equal to zero.

In [None]:
notes_df.isnull().sum().sum()

For some of the dataframes, the `User`, `id`, and `Unnamed: 0` columns are not needed and are therefore dropped.

In [None]:
anon_df = anon_df.drop("User", axis = 1)
notes_df = notes_df.drop("id", axis = 1)
watch_df = watch_df.drop("Unnamed: 0", axis = 1)

In [None]:
display("anon_df",anon_df.columns, "notes_df", notes_df.columns, "watch_df", watch_df.columns)

In [None]:
display("anon_df",anon_df.head(), 
        "notes_df", notes_df.head(), 
        "twit_df", twit_df.head(), 
        "watch_df", watch_df.head())

After dropping the unnecessary columns, the next step is to convert the values of the labeling columns so that joining the dataframes later causes no complications.

Reviewing the columns of all the datasets we imported

In [None]:
display("anon_df columns", list(anon_df.columns),"notes_df columns", list(notes_df.columns),
        "twit_df columns", list(twit_df.columns), "watch_df columns", list(watch_df.columns))

Creating a copy of watch_df before modifying the values to match twit_df (1 means text is suicidal and 0 means it is not)

In [None]:
integerwatch_df = watch_df.copy(deep=True)

Using pandas replace() function to change multiple values with multiple new values for an individual DataFrame column

In [None]:
integerwatch_df['class'] = integerwatch_df['class'].replace(['suicide', 'non-suicide'], ['1', '0'])

In [None]:
integerwatch_df.head() #checking if the replace function reflected

Using .info() to check the datatypes of the dataframe integerwatch_df

In [None]:
integerwatch_df.info()

Using pandas `astype()` function allows us to convert the object data type in the class column to integer for uniformity with other dataframes

In [None]:
integerwatch_df['class'] = integerwatch_df['class'].astype('int')
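
The two-step conversion above (string labels to numeric strings, then numeric strings to integers) can be sketched on a small example. The dataframe `toy` and its contents are hypothetical and only illustrate the dtype change.

```python
import pandas as pd

# Hypothetical labels in the same format as the `class` column
toy = pd.DataFrame({"text": ["a", "b", "c"],
                    "class": ["suicide", "non-suicide", "suicide"]})

# Step 1: swap the string labels for the strings '1' and '0' (still a string column)
toy["class"] = toy["class"].replace(["suicide", "non-suicide"], ["1", "0"])
dtype_before = toy["class"].dtype

# Step 2: cast the numeric strings to integers
toy["class"] = toy["class"].astype("int")
dtype_after = toy["class"].dtype

print(dtype_before, "->", dtype_after)
print(toy["class"].tolist())
```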

In [None]:
integerwatch_df.info() #checking that the class column is now an integer

Creating a copy of notes_df before adding a label column that matches twit_df (1 means text is suicidal and 0 means it is not)

In [None]:
integernotes_df = notes_df.copy(deep=True)

Creating a new column named class and setting it to a constant value of 1, since all texts are posted by users with suicidal thoughts

In [None]:
integernotes_df['class'] = 1

In [None]:
integernotes_df.head() #checking if the new class column was added

Creating a copy of twit_df before modifying the column names to match integerwatch_df and integernotes_df

In [None]:
new_twit = twit_df.copy(deep=True)

Renaming the columns using pandas `rename()` function with a dictionary that maps old column names to new ones. Setting `inplace` to `True` modifies the existing dataframe.

In [None]:
new_twit.rename(columns={"tweet": "text", "intention": "class"}, inplace=True)

In [None]:
new_twit.head() #checking if column names are renamed

For `anon_df`, there are five unique values with their respective counts:

In [None]:
anon_df.Label.value_counts()

Columns for `anon_df` were renamed for consistency with the other dataframes.

In [None]:
anon_df.rename(columns={"Post": "text", "Label": "class"}, inplace=True)

Copying `anon_df` before modifying other values.

In [None]:
intanon_df = anon_df.copy(deep =  True)

Using pandas `replace()` function to change multiple values with multiple new values for an individual DataFrame column. For `intanon_df` the 5 values were replaced with a corresponding `1` or `0` value.

In [None]:
intanon_df['class'] = intanon_df['class'].replace(['Ideation', 'Indicator','Behavior','Attempt','Supportive'], ['1','1','1','1','0'])

Using the `.astype()` function to convert the `class` values to the `int` (integer) type.

In [None]:
intanon_df['class'] = intanon_df['class'].astype('int')

Looking at the tail of `intanon_df` to check if the `class` values were replaced and converted accordingly. 

In [None]:
intanon_df.tail()

After renaming the columns of `anon_df`, the label values were mapped as follows:

* Ideation = 1
* Indicator = 1
* Behavior = 1
* Attempt = 1
* Supportive = 0
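
One way to double-check a label mapping like this is to cross-tabulate the original labels against the binary class. The sketch below is an assumption-laden illustration: it uses a hypothetical sample `labels` and a dictionary with `Series.map` rather than the notebook's `replace()` call.

```python
import pandas as pd

# Hypothetical sample using the five label names found in anon_df
labels = pd.Series(["Ideation", "Supportive", "Attempt",
                    "Indicator", "Behavior", "Supportive"])

binary_map = {"Ideation": 1, "Indicator": 1, "Behavior": 1,
              "Attempt": 1, "Supportive": 0}
binary = labels.map(binary_map)

# Any label missing from the mapping would show up as NaN here
assert binary.notna().all()

# Rows are original labels, columns are the binary class
check = pd.crosstab(labels, binary)
print(check)
```

Each original label should have counts in exactly one of the two columns.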

#### All dataframes

Displaying the dataframes we have now

In [None]:
display("intanon_df",intanon_df.head(), 
        "integernotes_df", integernotes_df.head(), 
        "new_twit", new_twit.head(), 
        "integerwatch_df", integerwatch_df.head())

In [None]:
#getting the shape of the four datasets 
display(intanon_df.shape, integernotes_df.shape, new_twit.shape, integerwatch_df.shape)

Using the `concat()` function with the axis set to 0 allows us to stitch dataframes together along the rows. We combine `intanon_df`, `integernotes_df`, `new_twit`, and `integerwatch_df` into one dataframe.

In [None]:
concat = pd.concat([intanon_df, integernotes_df, new_twit, integerwatch_df], axis=0)

We can check that the number of rows equals the total of the combined dataframes. For example, `integernotes_df` and `new_twit` together contribute 9586 rows, the sum of 467 rows and 9119 rows.

As well as checking if the unique values are still integers 1 and 0.

In [None]:
display(concat.shape , concat["class"].unique())
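
The row-count check can be sketched on two small stand-in dataframes. The names `df_a` and `df_b` are hypothetical, and `ignore_index=True` is an optional extra (not used in the cell above) that renumbers the rows so the combined index has no duplicates.

```python
import pandas as pd

# Hypothetical stand-ins for two of the labelled dataframes
df_a = pd.DataFrame({"text": ["a1", "a2"], "class": [1, 0]})
df_b = pd.DataFrame({"text": ["b1", "b2", "b3"], "class": [1, 1, 0]})

# axis=0 stacks rows; ignore_index=True renumbers them 0..n-1
combined = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# The combined row count should equal the sum of the parts
expected_rows = len(df_a) + len(df_b)
print(combined.shape[0] == expected_rows)
print(sorted(combined["class"].unique()))
```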

We switch the placement of the `text` and `class` columns in all four dataframes.

In [None]:
new_column_order = ['class', 'text'] #making the new column order

intanon_df = intanon_df[new_column_order]
integernotes_df = integernotes_df[new_column_order]
new_twit = new_twit[new_column_order]
integerwatch_df = integerwatch_df[new_column_order]

### **Data Cleaning**

#### Removing unnecessary character sequences

We created a RegEx function to remove unnecessary character sequences that might potentially interfere with the next steps before modeling.

In [None]:
def remove_unnecessary(text):
    text = re.sub(r'\bRT\b', '', text) #standalone RT marker (words containing "RT" are kept)
    text = re.sub(r'@[^\s]+', '', text) #usernames
    text = re.sub(r'http[^\s]+', '', text) #media links
    text = re.sub(r'\[|\]', '', text) #square brackets
    text = re.sub(r'#[^ ]+', '', text) #hashtags
    return text
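
To see what this kind of cleaning does to a single tweet, here is a minimal, self-contained sketch. The helper `strip_tweet_noise` and the sample tweet are hypothetical; the final whitespace-collapsing step is an addition for readability and is not part of the function above.

```python
import re

def strip_tweet_noise(text):
    """Hypothetical helper mirroring the cleaning steps described above."""
    text = re.sub(r'\bRT\b', '', text)        # retweet marker
    text = re.sub(r'@\S+', '', text)          # usernames
    text = re.sub(r'http\S+', '', text)       # links
    text = re.sub(r'[\[\]]', '', text)        # square brackets
    text = re.sub(r'#\S+', '', text)          # hashtags
    return re.sub(r'\s+', ' ', text).strip()  # collapse leftover whitespace

sample = "RT @user123 I can't sleep [again] #insomnia http://t.co/abc"
print(strip_tweet_noise(sample))
```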

But before applying the function, a copy of the `concat` dataframe was created.

In [None]:
master = concat.copy(deep = True)

Here, the function was applied to the `master` dataframe.

In [None]:
master['text'] = master['text'].apply(remove_unnecessary)

Checking if the `remove_unnecessary` function was applied.

In [None]:
master.head()

Checking the shape of the `master` dataframe.

In [None]:
master.shape

#### Functions for Feature Engineering

**Batch Processing Function**

Because the `master` dataframe is big, the four dataframes will be processed and cleaned by batch with a function.

In [None]:
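def batch_processing_bert(df):
    #relies on the global `tokenizer` created in the BERT section below

    def tokenize(sentence):
        #split a sentence into BERT WordPiece tokens (no stopwords are removed here)
        return tokenizer.tokenize(sentence)

    df['token'] = df['text'].apply(tokenize)
    #rebuild a string from the tokens, keeping only those longer than 2 characters
    df['string'] = df['token'].apply(lambda x: ' '.join([item for item in x if len(item)>2]))

    return df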

**Removing UNK Function**

In [None]:
unk_pattern = re.compile(r'\bUNK\b')

In [None]:
def remove_UNK(text):
    return unk_pattern.sub("", text) #reuses the compiled pattern above
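
The word-boundary anchors (`\b`) mean that only a standalone `UNK` is removed. A minimal sketch, using a hypothetical helper `drop_unk` that also tidies the leftover spaces:

```python
import re

unk_pattern = re.compile(r'\bUNK\b')

def drop_unk(text):
    # Hypothetical helper: remove standalone UNK tokens, then tidy spaces
    return re.sub(r'\s+', ' ', unk_pattern.sub('', text)).strip()

print(drop_unk("i feel UNK today"))
print(drop_unk("UNKNOWN caller"))  # unchanged: UNK is part of a longer word
```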

### **Feature Engineering**

#### **Tokenizing with Bert**

We get the tokenizer for BERT

In [None]:
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_wordpiece=True)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Making copies of all the four dataframes for easier access without affecting the original dataframes

In [None]:
bert_df1 = intanon_df.copy(deep = True)
bert_df2 = integernotes_df.copy(deep = True)
bert_df3 = new_twit.copy(deep = True)
bert_df4 = integerwatch_df.copy(deep = True)

Merging all the copied dataframes

In [None]:
bert_concat = pd.concat([bert_df1, bert_df2, bert_df3, bert_df4], axis=0)

In [None]:
bert_text_data = bert_concat['text'].tolist()

# Initialize the tokenizer
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenize the text data
tokenized_data = bert_tokenizer(bert_text_data, padding=True, truncation=True, return_tensors='pt')

# Add the tokenized data as columns to the DataFrame
token_columns = ['token_{}'.format(i) for i in range(tokenized_data.input_ids.shape[1])]
df_tokens = pd.DataFrame(tokenized_data.input_ids.numpy(), columns=token_columns)
#reset the index so the rows of bert_concat line up with df_tokens
bert_master = pd.concat([bert_concat.reset_index(drop=True), df_tokens], axis=1)

##### **Tokenized Text**

Creating a list of the copied dataframes

In [None]:
bert_dflist = [bert_df1, bert_df2, bert_df3, bert_df4]

Looping through the list to batch process

In [None]:
for df in bert_dflist:
    df = batch_processing_bert(df)

After the loop, we display the first 5 rows of all the dataframes:

In [None]:
display(bert_df1.head(), bert_df2.head(), bert_df3.head(), bert_df4.head())

Merging all the looped dataframes into one master dataframe.

After that, we make a copy of the dataframe and use the `remove_unnecessary` function.

In [None]:
bert_concat_text = pd.concat([bert_df1, bert_df2, bert_df3, bert_df4], axis=0)

bert_master_text = bert_concat_text.copy(deep = True)

bert_master_text['string'] = bert_master_text['string'].apply(remove_unnecessary)

bert_master_text.head()

#### **Tokenizing with NLTK**

Creating copies and concatenating the copied dataframes

In [None]:
nltk_df1 = intanon_df.copy(deep = True)
nltk_df2 = integernotes_df.copy(deep = True)
nltk_df3 = new_twit.copy(deep = True)
nltk_df4 = integerwatch_df.copy(deep = True)

nltk_concat = pd.concat([nltk_df1, nltk_df2, nltk_df3, nltk_df4], axis=0)

We lowercase the text and create a `RegexpTokenizer` that splits text into runs of word characters.

In [None]:
nltk_concat['text'] = nltk_concat['text'].astype(str).str.lower()
regexp = RegexpTokenizer(r'\w+')

We create a new column in the `nltk_concat` dataframe to store the tokenized text.

In [None]:
nltk_concat['text_token']=nltk_concat['text'].apply(regexp.tokenize)
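
The effect of a `\w+` pattern can be sketched with the standard-library `re` module instead of NLTK. The helper `regex_tokenize` is hypothetical and is assumed to behave like the tokenizer above for this pattern.

```python
import re

def regex_tokenize(text):
    # Lowercase, then keep every run of word characters as one token
    return re.findall(r'\w+', text.lower())

print(regex_tokenize("I can't do this anymore..."))
```

Punctuation is discarded, and the apostrophe splits contractions into fragments such as `can` and `t`.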

Creating a copy of the `nltk_concat` dataframe named `nltk_master`, to be consistent with the other tokenized dataframes.

In [None]:
nltk_master = nltk_concat.copy(deep = True)

We display the head of the dataframe to see the result of the tokenizer.

In [None]:
nltk_master.head()

#### **Tokenizing with TfidfVectorizer**

Creating copies and concatenating the copied dataframes

In [None]:
tfidf_df1 = intanon_df.copy(deep = True)
tfidf_df2 = integernotes_df.copy(deep = True)
tfidf_df3 = new_twit.copy(deep = True)
tfidf_df4 = integerwatch_df.copy(deep = True)

tfidf_concat = pd.concat([tfidf_df1, tfidf_df2, tfidf_df3, tfidf_df4], axis=0)

We extract the text data into a list.

In [None]:
tfidf_text_data = tfidf_concat['text'].tolist()

We then create a `TfidfVectorizer` object.

In [None]:
tfidf_vectorizer = TfidfVectorizer()

We fit the `tfidf_vectorizer` onto the text data

In [None]:
tfidf_vectorizer.fit(tfidf_text_data)

We transform the text data into the TF-IDF matrix

In [None]:
tfidf_matrix = tfidf_vectorizer.transform(tfidf_text_data) 

Lastly, we add the TF-IDF matrix as columns in the `tfidf_master` dataframe

In [None]:
tfidf_columns = ['tfidf_{}'.format(i) for i in range(tfidf_matrix.shape[1])]
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_columns)
#reset the index so the rows of tfidf_concat line up with df_tfidf
tfidf_master = pd.concat([tfidf_concat.reset_index(drop=True), df_tfidf], axis=1)

We display the head of the dataframe to see the result.

In [None]:
tfidf_master.head()

#### **Tokenizing with CountVectorizer**

Creating copies and concatenating the copied dataframes

In [None]:
count_df1 = intanon_df.copy(deep = True)
count_df2 = integernotes_df.copy(deep = True)
count_df3 = new_twit.copy(deep = True)
count_df4 = integerwatch_df.copy(deep = True)

count_concat = pd.concat([count_df1, count_df2, count_df3, count_df4], axis=0)

We extract the text data into a list.

In [None]:
count_text_data = count_concat['text'].tolist()

We then create a `CountVectorizer` object.

In [None]:
count_vectorizer = CountVectorizer()

We fit the `count_vectorizer` onto the text data

In [None]:
count_vectorizer.fit(count_text_data)

We transform the text data into bag-of-words matrix

In [None]:
bow_matrix = count_vectorizer.transform(count_text_data)

Lastly, we add the bag-of-words matrix as columns in the `count_master` dataframe

In [None]:
bow_columns = ['bow_{}'.format(i) for i in range(bow_matrix.shape[1])]
df_bow = pd.DataFrame(bow_matrix.toarray(), columns=bow_columns)
#reset the index so the rows of count_concat line up with df_bow
count_master = pd.concat([count_concat.reset_index(drop=True), df_bow], axis=1)

We display the head of the `count_master` dataframe

In [None]:
count_master.head()

## **Exploratory Data Analysis (EDA)**

### **EDA Questions:**
1. What are the most occurring words under the suicide class?
2. What are the most occurring words under the non-suicide class?

A copy of the dataframe containing the combined and tokenized dataset is created for the EDA.

In [None]:
#bert_master_text is the combined dataframe that carries the token column
eda = bert_master_text[['class', 'token', 'text']].copy(deep=True)


The eda dataframe is separated into their respective classes: ns for non-suicide (class = 0) and s for suicide (class = 1)

In [None]:
ns = eda[eda['class'] == 0]
s = eda[eda['class'] == 1]

#### **What are the most occurring words under the non-suicide class?**

In [None]:
text = " ".join(i for i in ns.text).lower()
wordcloud = WordCloud(background_color="white").generate(text)

In [None]:
plt.figure(figsize=(15,10))
plt.imshow( wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
sp = spacy.load('en_core_web_sm')
all_stopwords = sp.Defaults.stop_words
new_stopwords_ns=["filler", " ", "S", "t", "s", "m"]
comb_stopwords_ns=list(new_stopwords_ns)+list(all_stopwords)
wordcloud = WordCloud(stopwords=comb_stopwords_ns, background_color="white").generate(text)
print(new_stopwords_ns)

In [None]:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
txt_ns = " ".join(ns['text'])
words_ns = word_tokenize(txt_ns)

In [None]:
def cleaned_words(new_tokens):
    #build the combined stopword set once instead of on every token
    stop_set = set(stopwords.words('english')) | set(comb_stopwords_ns)
    new_tokens = [t.lower() for t in new_tokens]
    new_tokens = [t for t in new_tokens if t not in stop_set]
    new_tokens = [t for t in new_tokens if t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    new_tokens = [lemmatizer.lemmatize(t) for t in new_tokens]
    return new_tokens
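
When filtering tokens against two stopword collections, operator precedence matters: `t not in A and B` is evaluated as `(t not in A) and B`, so a non-empty `B` never removes anything. A minimal sketch with hypothetical stopword sets:

```python
# Hypothetical stopword collections for illustration
base_stopwords = {"the", "a", "is"}
extra_stopwords = {"filler", "t", "s", "m"}
tokens = ["the", "filler", "night", "is", "long"]

# Only base_stopwords is checked; extra_stopwords is just a truthy value
only_base_checked = [t for t in tokens if t not in base_stopwords and extra_stopwords]

# Merging both collections into one set checks every stopword
all_stop = base_stopwords | extra_stopwords
both_checked = [t for t in tokens if t not in all_stop]

print(only_base_checked)
print(both_checked)
```

A set also makes each membership test fast, which helps on a large token list.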

In [None]:
lowered_ns = cleaned_words(words_ns)

In [None]:
bow_ns = Counter(lowered_ns)

In [None]:
data_ns = pd.DataFrame(bow_ns.items(),columns=['word','frequency']).sort_values(by='frequency',ascending=False)
data_ns = data_ns.head(20)
sns.barplot(x='frequency',y='word',data=data_ns)
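
The top-N selection can also be read directly from the `Counter`. The sketch below uses a hypothetical token list; `most_common(n)` returns the `n` most frequent items, already sorted by count.

```python
from collections import Counter

# Hypothetical token list
tokens = ["night", "long", "night", "day", "night", "long"]
counts = Counter(tokens)

# most_common(n) returns the n highest-frequency (word, count) pairs
top_two = counts.most_common(2)
print(top_two)
```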

#### **What are the most occurring words under the suicide class?**

In [None]:
text_s = " ".join(i for i in s.text).lower()
wordcloud_s = WordCloud(background_color="white").generate(text_s)

In [None]:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud_s, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
new_stopwords_s=["filler", " ", "S", "t", "s", "m"]
comb_stopwords_s=list(new_stopwords_s)+list(all_stopwords)
wordcloud_s = WordCloud(stopwords=comb_stopwords_s, background_color="white").generate(text_s)
print(new_stopwords_s)

In [None]:
plt.figure(figsize=(15,10))
plt.imshow(wordcloud_s, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
txt_s = " ".join(s['text'])
words_s = word_tokenize(txt_s)

In [None]:
def cleaned_words_s(new_tokens_s):
    #build the combined stopword set once instead of on every token
    stop_set_s = set(stopwords.words('english')) | set(comb_stopwords_s)
    new_tokens_s = [t.lower() for t in new_tokens_s]
    new_tokens_s = [t for t in new_tokens_s if t not in stop_set_s]
    new_tokens_s = [t for t in new_tokens_s if t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    new_tokens_s = [lemmatizer.lemmatize(t) for t in new_tokens_s]
    return new_tokens_s

In [None]:
lowered_s = cleaned_words_s(words_s)

In [None]:
bow_s = Counter(lowered_s)

In [None]:
data_s = pd.DataFrame(bow_s.items(),columns=['word','frequency']).sort_values(by='frequency',ascending=False)
data_s = data_s.head(20)
sns.barplot(x='frequency',y='word',data=data_s)

## **Modeling and Evaluation**

### **Modeling**

#### **Model Training**

#### **Hyperparameter Tuning**

### **Evaluation**

#### **Feature Importance**

## **Conclusion**

## **Try out our model!**

## References