# You Matter, Words Matter: A Suicide Prevention Tool
In this notebook, you can try out the different models that we trained to detect whether an input text is suicidal or not.

**DATA103 S11 Group 4**
- GOZON, Jean Pauline D.
- JAMIAS, Gillian Nicole A.
- MARCELO, Andrea Jean C.
- REYES, Anton Gabriel G.
- VICENTE, Francheska Josefa

## Requirements and Imports

Before starting, the relevant libraries and files in building and training the model should be loaded into the notebook first.

**Basic Libraries**

Import `numpy`, `pandas`, and `datasets`.

* `numpy` contains a large collection of mathematical functions
* `pandas` contains functions that are designed for data manipulation and data analysis
* `datasets` contains functions that allow easier pre-processing for datasets and smart caching for easier loading of data

In [1]:
import numpy as np
import pandas as pd
import datasets

**Natural Language Processing Libraries** 

The next imports are libraries that can implement feature engineering techniques on the text input. 
* `re` is a module that allows the use of regular expressions
* `TfidfVectorizer` converts the given text documents into a matrix of TF-IDF features
* `CountVectorizer` converts the given text documents into a matrix of token counts

In [2]:
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
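
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch of count features and TF-IDF features on a made-up three-sentence corpus. It only imitates scikit-learn's default behavior (lowercasing, tokens of two or more characters, smoothed IDF, and L2 row normalization); the corpus and the helper names `tokenize` and `tfidf_row` are assumptions for illustration, not part of the project.

```python
import math
import re
from collections import Counter

# Toy corpus (made up for illustration; not from the project's dataset).
docs = ["i feel fine today", "i feel so alone", "today is fine"]

def tokenize(text):
    # Imitates scikit-learn's default token pattern: lowercase words of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Count features: one row per document, one column per vocabulary term.
count_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocabulary}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}

def tfidf_row(counts):
    # Weight raw counts by IDF, then scale the row to unit Euclidean length.
    weighted = [c * idf[t] for c, t in zip(counts, vocabulary)]
    norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
    return [w / norm for w in weighted]

tfidf_matrix = [tfidf_row(row) for row in count_matrix]

print(vocabulary)
print(count_matrix[1])
print([round(v, 3) for v in tfidf_matrix[1]])
```

In the second document, the rarer words ("alone", "so") end up with a larger TF-IDF weight than "feel", which also appears in another document, even though all three have the same raw count.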

**Machine Learning Libraries**

`pickle` is a module that can serialize and deserialize objects. In this notebook, it is used to save and load models.

In [3]:
import pickle
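
As a quick illustration of what saving and loading with `pickle` looks like, the sketch below writes a stand-in object to a temporary file and reads it back. The dictionary `fake_model` and the file name are placeholders chosen for this example; a fitted vectorizer or classifier would be handled the same way.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model or vectorizer; any picklable Python object works.
fake_model = {"name": "logreg", "vocabulary": {"alone": 0, "fine": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"  # hypothetical file name

    # Serialize: write the object to disk in binary mode.
    with open(model_path, "wb") as file:
        pickle.dump(fake_model, file)

    # Deserialize: read it back, again in binary mode.
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

print(restored == fake_model)
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.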

The following classes are classifiers that implement different methods of classification.
* `LogisticRegression` is a class under the linear models module that implements regularized logistic regression
* `MultinomialNB` is a class under the Naive Bayes module that allows the classification of discrete features
* `RandomForestClassifier` is a class under the ensemble module that fits a number of decision trees on sub-samples of the dataset and averages their predictions

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

Last, `requests` is a library that allows us to send requests to the Huggingface Hub.

In [5]:
import requests

## Model Imports
To use the models, let us load each of the saved models, starting with the vectorizers.

In [6]:
with open ('./saved_models/trad_ml/vectorizers/count.pkl', "rb") as file:
    count_vectorizer = pickle.load(file)
    
with open ('./saved_models/trad_ml/vectorizers/tfidf.pkl', "rb") as file:
    tfidf_vectorizer = pickle.load(file)

Before we move on to importing the models, let us first declare the lists that will hold the models that we will be using.

In [7]:
count_models = []
tfidf_models = []

Now, we can continue with importing the models that utilize traditional machine learning algorithms.

In [8]:
file_names = ['logreg', 'logreg_tuned', 'mnb', 'mnb_tuned']

for temp in file_names:
    with open ('./saved_models/trad_ml/' + temp + '/count/model.pkl', "rb") as file:
        model = pickle.load(file)
        count_models.append(model)
        
    with open ('./saved_models/trad_ml/' + temp + '/tfidf/model.pkl', "rb") as file:
        model = pickle.load(file)
        tfidf_models.append(model)  

Next, let us declare the URL and the header that will allow us to send requests to the BERT and the RoBERTa models from the Huggingface Hub.

In [9]:
import os

# HF_API_TOKEN is an assumed environment variable name; set it to your own access token.
HF_API_TOKEN = os.environ.get("HF_API_TOKEN", "")

ROBERTA_API_URL = "https://api-inference.huggingface.co/models/francheska-vicente/data103-roberta-base-v1"
roberta_headers = {"Authorization": "Bearer " + HF_API_TOKEN}

BERT_API_URL = "https://api-inference.huggingface.co/models/francheska-vicente/data103-bert-base-v2"
bert_headers = {"Authorization": "Bearer " + HF_API_TOKEN}

Last, we define the function that sends a request to the Hub and returns the parsed JSON response.

In [10]:
def query(payload, url, header):
    response = requests.post(url, headers=header, json=payload)
    return response.json()
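
The `query` function returns whatever JSON the Hub sends back, which may be an error message rather than predictions (for example, while a model is still loading). The sketch below shows one way to tell the two apart without making a network call. The helper `check_response`, the shape of the error object, and the `Non-Suicidal` label name are assumptions for illustration; the payloads are hand-written, not real API output.

```python
def check_response(data):
    """Return the parsed response if it looks like predictions, else raise."""
    # Assumption: failures arrive as a JSON object with an "error" key,
    # while successful classifications arrive as a list.
    if isinstance(data, dict) and "error" in data:
        raise RuntimeError("Inference request failed: " + str(data["error"]))
    return data

# Hand-written sample payloads (not real API output).
ok = [[{"label": "Suicidal", "score": 0.9}, {"label": "Non-Suicidal", "score": 0.1}]]
failed = {"error": "Model is currently loading", "estimated_time": 20.0}

print(check_response(ok) is ok)

try:
    check_response(failed)
except RuntimeError as err:
    print(err)
```

A check like this could be applied to the result of `query` before the response is formatted, so that the error message reaches the user directly.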

## Declaration of Functions 

Before we start with the prediction proper, we will have to define the functions that will be used.

First, the `remove_unnecessary` function will remove the parts of the text that we deem unnecessary (i.e., retweet markers, usernames, media links, square brackets, and hashtags).

In [11]:
def remove_unnecessary(text):
    # Raw strings keep regex escapes such as \s from being read as string escapes.
    text = re.sub(r'RT', '', text) # RT
    text = re.sub(r'@[^\s]+', '', text) # usernames
    text = re.sub(r'http[^\s]+', '', text) # media links
    text = re.sub(r'\[|\]', '', text) # square brackets
    text = re.sub(r'#[^ ]+', '', text) # hashtags
    return text
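
To see what the cleaning step does in practice, the sketch below applies the same substitutions to a made-up tweet. It also shows a side effect worth knowing about: the `RT` pattern matches anywhere, so it also removes those letters from inside words. The word-boundary variant at the end is only a possible alternative, not what the notebook uses, and switching to it may change the inputs relative to what the saved models were trained on.

```python
import re

def remove_unnecessary(text):
    # Same substitutions as the notebook's function, in the same order.
    text = re.sub(r'RT', '', text)           # RT
    text = re.sub(r'@[^\s]+', '', text)      # usernames
    text = re.sub(r'http[^\s]+', '', text)   # media links
    text = re.sub(r'\[|\]', '', text)        # square brackets
    text = re.sub(r'#[^ ]+', '', text)       # hashtags
    return text

# Made-up sample input.
tweet = "RT @user123: I feel alone [sigh] #sad https://t.co/abc"
cleaned = remove_unnecessary(tweet)
print(cleaned.split())

# Side effect: "RT" is also stripped from inside ordinary words.
print(remove_unnecessary("SMART move"))

# Hypothetical alternative: match "RT" only as a standalone token.
print(re.sub(r'\bRT\b', '', "SMART move"))
```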

Next, the `clean_input` function calls all of the functions used for pre-processing and cleaning. It is applied to the input of the models that utilize traditional machine learning algorithms.

In [12]:
def clean_input (text):
    text = remove_unnecessary (text)
    
    return text

This is followed by the functions that will be used for formatting the output of the predictions. The `determine_output` function converts the response of the Huggingface Hub into the same format as the output of the traditional machine learning models.

In [13]:
def determine_output (output):

    positive = output[0][1]['score']
    negative = output[0][0]['score']
    label = 0
    
    if output[0][0]['label'] == 'Suicidal':
        positive = output[0][0]['score']
        negative = output[0][1]['score']
    
    probability = negative 
    
    if positive >= 0.5:
        label = 1
        probability = positive
 
    return [label, probability * 100]
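
The sketch below runs `determine_output` on two hand-written responses to show the conversion. It assumes the Hub returns a nested list with one `{'label', 'score'}` entry per class; the `Suicidal` label comes from the function itself, while `Non-Suicidal` is an assumed name for the other class. The scores are made up.

```python
def determine_output(output):
    # Copy of the notebook's function so that this example runs on its own.
    positive = output[0][1]['score']
    negative = output[0][0]['score']
    label = 0

    if output[0][0]['label'] == 'Suicidal':
        positive = output[0][0]['score']
        negative = output[0][1]['score']

    probability = negative

    if positive >= 0.5:
        label = 1
        probability = positive

    return [label, probability * 100]

# Hand-written sample responses (scores are made up).
suicidal_second = [[{'label': 'Non-Suicidal', 'score': 0.25},
                    {'label': 'Suicidal', 'score': 0.75}]]
suicidal_first = [[{'label': 'Suicidal', 'score': 0.25},
                   {'label': 'Non-Suicidal', 'score': 0.75}]]

print(determine_output(suicidal_second))  # label 1 with the 'Suicidal' score as a percentage
print(determine_output(suicidal_first))   # label 0 with the other class's score as a percentage
```

The order of the two entries does not matter: the function checks which entry carries the `Suicidal` label before choosing the positive score.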

Last, the `output_probabilities` function will format the output of the models to make it more understandable and readable.

In [14]:
def output_probabilities (predictions, probabilities):
    labels = ['Logistic Regression (Count)', 'Tuned Logistic Regression  (Count)',
              'Multinomial Naive Bayes  (Count)', 'Tuned Multinomial Naive Bayes  (Count)',
              'Logistic Regression (TF-IDF)', 'Tuned Logistic Regression (TF-IDF)',
              'Multinomial Naive Bayes (TF-IDF)', 'Tuned Multinomial Naive Bayes (TF-IDF)']

    temp_index = 0
    
    for i in range (len(labels)):
        predict_label = 'non-suicidal'
        
        curr_probability = probabilities [i][0]
        probability_label = round(curr_probability [0] * 100, 2) 
        if (predictions [i] == 1):
            predict_label = 'suicidal'
            probability_label = round(curr_probability [1] * 100, 2)  
        
        print ('According to the ' + labels [i] + ' model, there is a ' + str(probability_label) + 
               '% chance that your text is ' + predict_label + '.')
        
        temp_index = i
    
    temp_index = temp_index + 1
    next_labels = ['BERT', 'RoBERTa']
    for temp in (next_labels):
        if temp_index != len(probabilities): 
            probability_label = round(probabilities [temp_index], 2)   
            predict_label = predictions [temp_index]
            
            if predict_label == 1:
                predict_label = 'suicidal'
            else:
                predict_label = 'non-suicidal'
            
            print ('According to the ' + temp + ' model, there is a ' + str(probability_label) + 
                   '% chance that your text is ' + predict_label + '.')
            temp_index = temp_index + 1
        
    print()
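
The indexing `probabilities [i][0]` in the function above follows from how scikit-learn classifiers report results: `predict` returns one label per sample, and `predict_proba` returns one row per sample with one column per class. Since only one text is classified at a time, there is a single row. The sketch below imitates this with plain lists; the numbers are made up, and it assumes the columns follow the class order `[0, 1]`.

```python
# Hand-written stand-ins for what a fitted binary classifier returns for ONE text.
# Assumption: the columns of the probability row follow the class order [0, 1].
prediction = [1]                 # like model.predict(features): one label per sample
probability = [[0.375, 0.625]]   # like model.predict_proba(features): one row per sample

row = probability[0]             # the only sample in the batch
label = prediction[0]
percent = round(row[label] * 100, 2)
name = "suicidal" if label == 1 else "non-suicidal"

print(f"{percent}% chance that your text is {name}.")
```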

## Try out our model!
Now, you can use our models! Enter **STOP** if you want to stop.

In [17]:
text = input ('Enter text: ').strip ()

while (text.lower() != 'stop'):
    count_text = count_vectorizer.transform ([clean_input (text)])
    tfidf_text = tfidf_vectorizer.transform ([clean_input (text)])
    
    predictions = []
    probabilities = []
    
    for curr_model in count_models:
        predictions.append(curr_model.predict(count_text))
        probabilities.append(curr_model.predict_proba(count_text))
        
    for curr_model in tfidf_models:
        predictions.append(curr_model.predict(tfidf_text))
        probabilities.append(curr_model.predict_proba(tfidf_text))
    
    try:
        bert_output = determine_output(query(text, BERT_API_URL, bert_headers))
        roberta_output = determine_output(query(text, ROBERTA_API_URL, roberta_headers))
        
        predictions.append(bert_output [0])
        probabilities.append(bert_output [1])
        
        predictions.append(roberta_output [0])
        probabilities.append(roberta_output [1])
    except Exception as e:
        print(e)
        print("The BERT and/or RoBERTa models are still being loaded to the Huggingface Hub.")
    
    output_probabilities (predictions, probabilities)
    
    text = input ('Enter text: ').strip ()

Enter text: I don't want to commit suicide.
According to the Logistic Regression (Count) model, there is a 62.34% chance that your text is suicidal.
According to the Tuned Logistic Regression  (Count) model, there is a 70.93% chance that your text is suicidal.
According to the Multinomial Naive Bayes  (Count) model, there is a 99.38% chance that your text is suicidal.
According to the Tuned Multinomial Naive Bayes  (Count) model, there is a 99.32% chance that your text is suicidal.
According to the Logistic Regression (TF-IDF) model, there is a 99.92% chance that your text is suicidal.
According to the Tuned Logistic Regression (TF-IDF) model, there is a 99.92% chance that your text is suicidal.
According to the Multinomial Naive Bayes (TF-IDF) model, there is a 95.45% chance that your text is suicidal.
According to the Tuned Multinomial Naive Bayes (TF-IDF) model, there is a 94.58% chance that your text is suicidal.
According to the BERT model, there is a 97.51% chance that your text 