## Our Model
Here we will be creating different models to classify our data. 

In [78]:
import numpy as np
from numpy import array
from numpy import array
from numpy import asarray
from numpy import zeros
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set configurations
pd.set_option('display.max_columns', 100)
sns.set_style("white")

# keras imports
import keras
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers.core import Activation, Dropout, Dense
from keras.layers import Flatten
from keras.layers import GlobalMaxPooling1D
from keras.layers.embeddings import Embedding
from keras.preprocessing.text import Tokenizer

# model imports
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score
import pickle
import joblib

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)


In [79]:
model_data = pd.read_csv('../data/scheme1.csv', keep_default_na=False)

In [80]:
model_data.head(3)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,title,selftext,author,num_comments,is_suicide,url,selftext_clean,title_clean,author_clean,selftext_length,title_length,megatext_clean,Clustered Labels,New Labels
0,0,0,"Our most-broken and least-understood rules is ""helpers may not invite private contact as a first resort"", so we've made a new wiki to explain it","We understand that most people who reply immediately to an OP with an invitation to talk privately mean only to help, but this type of response usually leads to either disappointment or disaster....",SQLwitch,133,0,https://www.reddit.com/r/depression/comments/doqwow/our_mostbroken_and_leastunderstood_rules_is/,understand people reply immediately op invitation talk privately mean help type response usually lead either disappointment disaster usually work quite differently say pm anytime casual social con...,broken least understood rule helper may invite private contact first resort made new wiki explain,sql witch,4792,144,sql witch understand people reply immediately op invitation talk privately mean help type response usually lead either disappointment disaster usually work quite differently say pm anytime casual ...,1,1
1,1,1,Regular Check-In Post,Welcome to /r/depression's check-in post - a place to take a moment and share what is going on and how you are doing. If you have an accomplishment you want to talk about (these shouldn't be stand...,circinia,1644,0,https://www.reddit.com/r/depression/comments/exo6f1/regular_checkin_post/,welcome r depression check post place take moment share going accomplishment want talk standalone post sub violate role model rule welcome tough time prefer make post place share try best keep spa...,regular check post,c irc,650,21,c irc welcome r depression check post place take moment share going accomplishment want talk standalone post sub violate role model rule welcome tough time prefer make post place share try best ke...,1,1
2,2,2,"I hate it so much when you try and express your feelings to your parents, but they turn it around and compare your suffering with theirs.","I've been feeling really depressed and lonely lately from my job, I'm a full time late night janitor for a courthouse just 10 miles away from my hometown. Working there has been pretty easy, but I...",TheNewKiller69,8,0,https://www.reddit.com/r/depression/comments/fedwbi/i_hate_it_so_much_when_you_try_and_express_your/,feeling really depressed lonely lately job full time late night janitor courthouse 10 mile away hometown working ha pretty easy wound feeling super lonely lot considering really get talk friend ha...,hate much try express feeling parent turn around compare suffering,new killer 69,1866,137,new killer 69 feeling really depressed lonely lately job full time late night janitor courthouse 10 mile away hometown working ha pretty easy wound feeling super lonely lot considering really get ...,0,0


In [81]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1897 entries, 0 to 1896
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Unnamed: 0        1897 non-null   int64 
 1   Unnamed: 0.1      1897 non-null   int64 
 2   title             1897 non-null   object
 3   selftext          1897 non-null   object
 4   author            1897 non-null   object
 5   num_comments      1897 non-null   int64 
 6   is_suicide        1897 non-null   int64 
 7   url               1897 non-null   object
 8   selftext_clean    1897 non-null   object
 9   title_clean       1897 non-null   object
 10  author_clean      1897 non-null   object
 11  selftext_length   1897 non-null   int64 
 12  title_length      1897 non-null   int64 
 13  megatext_clean    1897 non-null   object
 14  Clustered Labels  1897 non-null   int64 
 15  New Labels        1897 non-null   int64 
dtypes: int64(8), object(8)
memory usage: 237.2+ KB


Establishing a baseline accuracy is important for evaluating the model's progression. If every prediction was 1, let's see what our accuracy would be.

In [82]:
model_data['is_suicide'].mean()

0.5166051660516605

Our baseline accuracy is about 51.7%

For our initial model, it will be a binary classifier. Once the user gets their classification, they will go to another model that will give them specific support. 

Label Encoding: r/SuicideWatch = 1, r/Depression = 0

TP: model predicts suicide, and it is correct

TN: model predicts depression, and it is correct

FP: model predicts suicide, but it is really depression, not good

FN: model predicts depression, but they really are suicidal, this is the worst, misses an at risk patient

## Running our Optimized Model
This model is a combination of TF-IDF("Term Frequency - Inverse Document" Frequency) Vectorizer and the Multinomial Naive Bayes. It assigns scores for the top 70 words in our selected feature. TF-IDF will penalize common words, helping the model find specific key words. The model makes a prediction based on a matrix of word scores and gives a probability of falling into a certain classification. 

In [85]:
# getting ready for training

X = model_data["selftext_clean"]
y = model_data["is_suicide"]
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

tvec_optimised = TfidfVectorizer(max_df= 0.5, max_features=70, min_df=2, ngram_range=(1, 3),stop_words = 'english')
X_train_tvec = tvec_optimised.fit_transform(X_train).todense()
X_test_tvec = tvec_optimised.transform(X_test).todense()

In [90]:
print('Maximum review length: {}'.format(
len(max((X_train + X_test), key=len))))

TypeError: object of type 'float' has no len()

In [89]:
from keras.preprocessing import sequence

vocabulary_size = 5000
max_words = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_words)
X_test = sequence.pad_sequences(X_test, maxlen=max_words)

ValueError: invalid literal for int() with base 10: ' want loose dont know healthy keep seeing hanging mean kinda depressed bad feeling thought come go also worried distance would intensify insecurites love another guy probalby stop seeing tho least to

In [88]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size= 32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

batch_size = 64
num_epochs = 10
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=num_epochs)

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 500, 32)           160000    
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


ValueError: Error when checking input: expected embedding_11_input to have shape (500,) but got array with shape (1,)

In [None]:
scores = model.evaluate(X_valid, y_valid, verbose=0)
print('Test accuracy:', scores[1])

In [91]:
model.save("RNN.h5")

In [94]:
from keras.models import load_model
model = load_model("RNN.h5")
model.summary()

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 500, 32)           160000    
_________________________________________________________________
lstm_8 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________


In [95]:
scores = model.evaluate(X_valid, y_valid, verbose=0)
print('Test accuracy:', scores[1])

ValueError: Error when checking input: expected embedding_11_input to have shape (500,) but got array with shape (1,)

In [None]:
embeddings_dictionary = dict()

embedding_matrix = zeros((vocab_size, 70))
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [None]:
model = Sequential()

embedding_layer = Embedding(vocab_size, 70, weights=[embedding_matrix], input_length=70 , trainable=False)
model.add(embedding_layer)

model.add(keras.layers.Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

In [None]:
model1 = Sequential()
model1.add(Dense(10, input_dim=70, activation='relu'))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
model2 = Sequential()

model2.add(Flatten())
model2.add(Dense(10, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
from keras.models import Sequential
from keras import layers
from keras.layers import Dense, Activation, Embedding, Flatten, GlobalMaxPool1D, Dropout, Conv1D

model3 = Sequential()
model3.add(Dense(1, activation='sigmoid'))

model3.add(Dense(10, activation='relu'))
model3.add(Dense(1, activation='sigmoid'))

model3.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
from keras.layers.convolutional import MaxPooling1D

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length= 70))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = model.fit(X_train_tvec, y_train, batch_size=32, epochs=100, verbose=1, validation_data=(X_test_tvec, y_test))