## Our Model
Here we will be creating different models to classify our data. 

In [12]:
import numpy as np
from numpy import array
from numpy import asarray
from numpy import zeros
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set configurations
pd.set_option('display.max_columns', 100)
sns.set_style("white")

# keras imports
import keras
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten
from keras.layers import GlobalMaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer

# model imports
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score
import pickle
import joblib

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)


In [20]:
model_data = pd.read_csv('../data/suicide_vs_nothing.csv', keep_default_na=False)

In [21]:
model_data.tail(3)

| Unnamed: 0 | title | selftext | author | num_comments | is_suicide | url | selftext_clean | title_clean | author_clean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1812 | My teenager made me so proud tonight with a si... | I have 4 boys ranging from ages 3 to 14. Tonig... | SedativeCorpse | 251 | 0 | https://www.reddit.com/r/CasualConversation/co... | 4 boy ranging age 3 14 tonight 3 year old wa t... | teenager made proud tonight simple gesture | sedative corpse |
| 1813 | After 30 years of being open, my family’s rest... | My family has owned a fine dining italian rest... | retirereddit | 769 | 0 | https://www.reddit.com/r/CasualConversation/co... | family ha owned fine dining italian restaurant... | 30 year open family restaurant closing tonight | retire reddit |
| 1814 | I’ve been living in Japan for more than a deca... | First of all, I’m not Japanese. I came to Japa... | BeardedGlass | 18 | 0 | https://www.reddit.com/r/CasualConversation/co... | first japanese came japan 12 year ago work pro... | living japan decade anyone ha question life co... | bearded glass |


In [22]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1815 entries, 0 to 1814
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   title           1815 non-null   object
 1   selftext        1815 non-null   object
 2   author          1815 non-null   object
 3   num_comments    1815 non-null   int64 
 4   is_suicide      1815 non-null   int64 
 5   url             1815 non-null   object
 6   selftext_clean  1815 non-null   object
 7   title_clean     1815 non-null   object
 8   author_clean    1815 non-null   object
dtypes: int64(2), object(7)
memory usage: 127.7+ KB


Establishing a baseline accuracy is important for judging whether later models actually improve on a trivial guess. Let's see what our accuracy would be if every prediction were 1.

In [23]:
model_data['is_suicide'].mean()

0.5399449035812672

Our baseline accuracy is about 54.0%.
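Because `is_suicide` is coded 0/1, its mean is the share of posts labeled 1, which is exactly the accuracy of predicting 1 for every row. A slightly more general sketch takes the share of the most frequent class, so it also works when 0 is the majority. The helper name `majority_baseline` and the toy labels are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def majority_baseline(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Toy labels (hypothetical): three positives and one negative.
toy_labels = pd.Series([1, 1, 1, 0])
baseline = majority_baseline(toy_labels)
print(f"Baseline accuracy: {baseline:.3f}")
```

On the real data this could be called as `majority_baseline(model_data['is_suicide'])`.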

### Selection of Features and Attributes
Here we fit a CountVectorizer and Multinomial Naive Bayes model on each candidate column to determine which column scores best and should be used for our model.

Our initial model is a binary classifier. Once the user gets their classification, they will be passed to a second model that gives them more specific support.

Label Encoding: r/SuicideWatch = 1, r/Depression = 0

TP: model predicts suicide, and it is correct

TN: model predicts depression, and it is correct

FP: model predicts suicide, but it is really depression; not good

FN: model predicts depression, but the user really is suicidal; this is the worst case, because it misses an at-risk patient
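The four outcomes above can be read directly off scikit-learn's confusion matrix. The sketch below uses made-up labels and predictions purely for illustration; for binary labels ordered `[0, 1]`, `confusion_matrix(...).ravel()` returns the counts in the order TN, FP, FN, TP. Since false negatives are the costliest error here, recall (TP / (TP + FN)) is a useful number to track alongside accuracy.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical ground truth and predictions (1 = suicide, 0 = other class).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

# With labels=[0, 1], the flattened matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Recall is the share of truly positive posts that the model caught.
recall = recall_score(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} recall={recall:.2f}")
```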

In [24]:
# using the function
columns_list = ['selftext', "author", "title",'selftext_clean', "author_clean", "title_clean", "megatext_clean"]
model = "CountVec + MultinomialNB"
df_list=[]
# multi_modelling(columns_list, model)

Based on our results, the megatext_clean column is the best option for our model because of its good metrics, such as its AUC value. False negatives are the highest count in its confusion matrix, which we take to mean that it gives the model the most to learn from.
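The `multi_modelling` helper called in the cell above is not shown in this section. As a rough idea of what such a column comparison might look like, here is a minimal sketch: it fits a CountVectorizer + MultinomialNB pipeline on each text column and records accuracy and AUC. The function name `score_columns`, the toy DataFrame, and the 25% test split are assumptions for illustration, not the author's implementation.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def score_columns(df, columns, target="is_suicide", seed=42):
    """Fit CountVectorizer + MultinomialNB on each column and collect metrics."""
    rows = []
    for col in columns:
        X_train, X_test, y_train, y_test = train_test_split(
            df[col], df[target], test_size=0.25, random_state=seed, stratify=df[target]
        )
        pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
        pipe.fit(X_train, y_train)
        # Probability of the positive class (column 1) drives the AUC.
        proba = pipe.predict_proba(X_test)[:, 1]
        rows.append({
            "column": col,
            "accuracy": pipe.score(X_test, y_test),
            "auc": roc_auc_score(y_test, proba),
        })
    return pd.DataFrame(rows).sort_values("auc", ascending=False).reset_index(drop=True)


# Toy data (hypothetical): 8 posts per class, two candidate text columns.
sad = ["feel hopeless alone", "cannot go on", "want end pain", "hopeless tired alone"] * 2
happy = ["great day park", "love new puppy", "fun movie night", "nice coffee chat"] * 2
toy = pd.DataFrame({
    "title_clean": sad + happy,
    "selftext_clean": [t + " today" for t in sad + happy],
    "is_suicide": [1] * 8 + [0] * 8,
})
results = score_columns(toy, ["title_clean", "selftext_clean"])
print(results)
```

Wrapping the vectorizer and classifier in a `Pipeline` keeps the vocabulary fitted on the training split only, so the test split does not leak into the features.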

### Production Model
We will test a few different models by running some permutations and recording the results.

After reviewing all the performances, two models stand out: the Hashing Vectorizer with Multinomial Naive Bayes, and the TF-IDF Vectorizer with Multinomial Naive Bayes. Both have good AUC values and generalize well.

### Tuning Hyperparameters

### TF-IDF Vectorizer + Multinomial Naive Bayes
This model has the best overall metrics and should work best for our data. While it doesn't have the highest AUC, it generalizes better and has more true positives.

## Running our Optimized Model
This model combines a TF-IDF (Term Frequency - Inverse Document Frequency) vectorizer with Multinomial Naive Bayes. The vectorizer scores only the 70 most frequent terms (single words and n-grams of up to three words) in our selected feature. TF-IDF down-weights terms that appear in many documents, helping the model find specific key words. The model makes its prediction from the resulting matrix of term scores and gives the probability of a post falling into each class.
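To make the down-weighting concrete, the sketch below fits a `TfidfVectorizer` on three made-up sentences and compares the inverse document frequency of a word that appears in every document with one that appears in only one. It also shows `max_features` capping the vocabulary to the most frequent terms. The corpus is a toy assumption, not the project data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "feel" is in every document, "hopeless" in one.
docs = ["feel sad today", "feel tired today", "feel hopeless"]

tvec = TfidfVectorizer()
tvec.fit(docs)

# vocabulary_ maps each term to its column index; idf_ holds the learned weights.
idf_common = tvec.idf_[tvec.vocabulary_["feel"]]
idf_rare = tvec.idf_[tvec.vocabulary_["hopeless"]]
print(f"idf('feel')={idf_common:.3f}  idf('hopeless')={idf_rare:.3f}")

# max_features keeps only the terms with the highest corpus-wide frequency.
capped = TfidfVectorizer(max_features=2).fit(docs)
vocab_size = len(capped.vocabulary_)
print(sorted(capped.vocabulary_))
```

The rarer word receives the larger IDF weight, which is why distinctive terms end up contributing more to the scores than words that appear everywhere.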

In [18]:
# getting ready for training

X = model_data["selftext_clean"]
y = model_data['is_suicide']
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

tvec_optimised = TfidfVectorizer(max_df=0.5, max_features=70, min_df=2, ngram_range=(1, 3), stop_words='english')
# tvec_optimised = TfidfVectorizer(max_df=0.5, min_df=2, ngram_range=(1, 3), stop_words='english')
# .toarray() returns a plain ndarray; .todense() returns np.matrix, which recent scikit-learn versions reject
X_train_tvec = tvec_optimised.fit_transform(X_train).toarray()
X_test_tvec = tvec_optimised.transform(X_test).toarray()

In [19]:
# getting accuracies of the model
nb = MultinomialNB()
nb.fit(X_train_tvec, y_train)
accuracy = nb.score(X_test_tvec, y_test)

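# calculating AUC from the predicted probability of the positive class (column 1)
pred_proba = nb.predict_proba(X_test_tvec)[:, 1]
auc = roc_auc_score(y_test, pred_proba)

# note: joblib writes a pickle, not an HDF5 file, despite the ".h5" extension;
# only the classifier is saved here, so the fitted vectorizer is needed separately to score new text
joblib.dump(nb, "suicide_vs_nothing.h5")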

print("Accuracy: {}\nAUC Score: {}".format(accuracy, auc) )

Accuracy: 0.717948717948718
AUC Score: 0.8066258786774277
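The saved file contains only the Naive Bayes classifier, so scoring new text later also requires the fitted vectorizer. One possible approach, sketched here under assumptions, is to bundle both steps in a scikit-learn `Pipeline` and persist that single object. The toy texts, the step names, and the file name `toy_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical): 1 = suicide, 0 = other class.
texts = ["feel hopeless alone", "want end pain", "great day park", "love new puppy"]
labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and classifies it in one object.
pipe = Pipeline([("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
pipe.fit(texts, labels)

# Persist and reload the whole pipeline, vectorizer included.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# The reloaded pipeline accepts raw strings directly.
proba = loaded.predict_proba(["feel alone"])
print(proba)
```

With this layout, the serving code never has to rebuild the vocabulary or keep two files in sync.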


In [None]:
# OneVsOneClassifier (multiclass); with only two classes it fits a single LinearSVC
ovoc = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X_train_tvec, y_train)
accuracy = ovoc.score(X_test_tvec, y_test)
print(accuracy)
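The cell above reports accuracy only. `LinearSVC` has no `predict_proba`, but its `decision_function` returns signed distances from the separating hyperplane, and `roc_auc_score` accepts those as ranking scores. The sketch below shows that on made-up data so the SVM could be compared with Naive Bayes on AUC as well; it uses a plain `LinearSVC`, since one-vs-one over two classes trains just one pairwise classifier. All texts and labels are toy assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.svm import LinearSVC

# Toy data (hypothetical): 1 = suicide, 0 = other class.
train_texts = [
    "feel hopeless alone", "want end pain", "hopeless tired alone", "cannot go on",
    "great day park", "love new puppy", "fun movie night", "nice coffee chat",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["feel tired alone", "fun day park", "want pain end", "love coffee"]
test_labels = [1, 0, 1, 0]

tvec = TfidfVectorizer()
svc = LinearSVC(random_state=0)
svc.fit(tvec.fit_transform(train_texts), train_labels)

# Signed distance to the hyperplane; larger values lean toward class 1.
scores = svc.decision_function(tvec.transform(test_texts))
svc_auc = roc_auc_score(test_labels, scores)
print(f"LinearSVC AUC: {svc_auc:.3f}")
```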

### Model Performance
Our model performed reasonably well, with a recorded accuracy of 70% and an AUC score of 75%; the run shown above reached roughly 71.8% accuracy and an AUC of about 0.807. There are most likely ways to improve the classifier, and we will continue to train the model.

For our new labeling scheme, we got an accuracy of 76% and an AUC score of 78%.

For the clustered labels, we got an accuracy of 75.3% and an AUC score of 81.1%.