## Our Model
Here we build several models to classify each post as coming from r/SuicideWatch or r/depression.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# set configurations
pd.set_option('display.max_columns', 100)
sns.set_style("white")

# model imports
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score
import pickle

In [2]:
model_data = pd.read_csv('../data/data_for_model.csv', keep_default_na=False)

In [3]:
model_data.head(3)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url,selftext_clean,title_clean,author_clean,selftext_length,title_length,megatext_clean
0,Our most-broken and least-understood rules is ...,We understand that most people who reply immed...,SQLwitch,133,0,https://www.reddit.com/r/depression/comments/d...,understand people reply immediately op invitat...,broken least understood rule helper may invite...,sql witch,4792,144,sql witch understand people reply immediately ...
1,Regular Check-In Post,Welcome to /r/depression's check-in post - a p...,circinia,1644,0,https://www.reddit.com/r/depression/comments/e...,welcome r depression check post place take mom...,regular check post,c irc,650,21,c irc welcome r depression check post place ta...
2,I hate it so much when you try and express you...,I've been feeling really depressed and lonely ...,TheNewKiller69,8,0,https://www.reddit.com/r/depression/comments/f...,feeling really depressed lonely lately job ful...,hate much try express feeling parent turn arou...,new killer 69,1866,137,new killer 69 feeling really depressed lonely ...


In [4]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1897 entries, 0 to 1896
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            1897 non-null   object
 1   selftext         1897 non-null   object
 2   author           1897 non-null   object
 3   num_comments     1897 non-null   int64 
 4   is_suicide       1897 non-null   int64 
 5   url              1897 non-null   object
 6   selftext_clean   1897 non-null   object
 7   title_clean      1897 non-null   object
 8   author_clean     1897 non-null   object
 9   selftext_length  1897 non-null   int64 
 10  title_length     1897 non-null   int64 
 11  megatext_clean   1897 non-null   object
dtypes: int64(4), object(8)
memory usage: 178.0+ KB


Establishing a baseline accuracy gives us a reference point for judging each model's progress. Let's see what our accuracy would be if every prediction were 1.

In [5]:
model_data['is_suicide'].mean()

0.5166051660516605

Our baseline accuracy is about 51.7%: a model that always predicts 1 would be correct for roughly 51.7% of the posts.
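As a minimal sketch of why the mean of a 0/1 label column is the accuracy of an "always predict 1" model, the snippet below uses a small hypothetical label series (`labels` is made up for illustration and is not the project data). It also computes the majority-class share, which is the more general baseline when 1 is not the larger class.

```python
import pandas as pd

# Hypothetical toy labels standing in for model_data['is_suicide']
labels = pd.Series([1, 1, 1, 0, 0, 1, 0, 1])

# Accuracy of a model that always predicts 1 = share of rows labelled 1
always_one_accuracy = labels.mean()

# Majority-class baseline: share of the most frequent label
majority_baseline = labels.value_counts(normalize=True).max()

print(always_one_accuracy, majority_baseline)
```

Here both values coincide because 1 is the majority class in the toy series, as it is in our data.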

### Selection of Features and Attributes
Here we fit a CountVectorizer followed by a Multinomial Naive Bayes model on each candidate text column, one column at a time, to determine which column scores best and should be used for our model.

In [7]:
# Creating a function to score the same model on different columns in the dataset

columns_list = ['column_1', "column_2", "column_3"]
model = "CountVec + MultinomialNB"
df_list=[] 

def multi_modelling(columns_list, model):
    for i in columns_list:
        
        X = model_data[i]
        y = model_data['is_suicide']
        
        # train/test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
        
        cvec = CountVectorizer()
        cvec.fit(X_train)
        
        # creating dataframes
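        # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        feature_names = cvec.get_feature_names_out()
        X_train = pd.DataFrame(cvec.transform(X_train).toarray(),
                               columns=feature_names)
        X_test = pd.DataFrame(cvec.transform(X_test).toarray(),
                              columns=feature_names)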
        
        nb = MultinomialNB()
        nb.fit(X_train,y_train)
        
        # get predictions from model
        pred = nb.predict(X_test)
        
        # create confusion matrix
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        
        # calculating AUC from the predicted probability of class 1
        pred_proba = nb.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, pred_proba)

        # classification report
        classi_dict = (classification_report(y_test,pred, output_dict=True))

        # storing our results
        model_results = {}
        model_results['series used (X)'] = i
        model_results['model'] = model
        model_results['AUC Score'] = auc
        model_results['precision']= classi_dict['weighted avg']['precision']
        # sensitivity is recall for the positive class (1); the weighted-average recall equals accuracy
        model_results['recall (sensitivity)']= tp/(tp+fn)
        model_results['confusion matrix']={"TP": tp,"FP":fp, "TN": tn, "FN": fn}
        model_results['train accuracy'] = nb.score(X_train, y_train)
        model_results['test accuracy'] = nb.score(X_test, y_test)
        # share of posts labelled 1, computed instead of hard-coded
        model_results['baseline accuracy']=y.mean()
        model_results['specificity']= tn/(tn+fp)  
        model_results['f1-score']= classi_dict['weighted avg']['f1-score']
        #model_results['support']= classi_dict['weighted avg']['support']
        df_list.append(model_results)

    pd.set_option("display.max_colwidth", 50)
    return (pd.DataFrame(df_list)).round(2)

Our initial model is a binary classifier. Once a user receives their classification, they are passed to a second model that gives them more specific support.

Label Encoding: r/SuicideWatch = 1, r/Depression = 0

TP: the model predicts suicide, and it is correct

TN: the model predicts depression, and it is correct

FP: the model predicts suicide, but the post is really depression; not good

FN: the model predicts depression, but the poster really is suicidal; this is the worst outcome because it misses an at-risk person
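The four outcomes above can be read directly off scikit-learn's `confusion_matrix`. The following is a minimal sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not model output). It also derives sensitivity, the share of truly suicidal posts that are caught, and specificity, the share of depression posts that are correctly left unflagged.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = r/SuicideWatch, 0 = r/depression
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # recall for class 1; a low value means many FNs
specificity = tn / (tn + fp)  # recall for class 0; a low value means many FPs

print({"TP": int(tp), "FP": int(fp), "TN": int(tn), "FN": int(fn)})
print(sensitivity, specificity)
```

Because a false negative is the costliest error here, sensitivity is the metric to watch most closely when comparing models.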

In [None]:
# using the function
columns_list = ['selftext', "author", "title",'selftext_clean', "author_clean", "title_clean", "megatext_clean"]
model = "CountVec + MultinomialNB"
df_list=[]
multi_modelling(columns_list, model)
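For comparison, the same column-by-column scoring could be written with a `Pipeline`, which keeps the vectorized text as a sparse matrix and avoids building a dense DataFrame for every column. This is only a sketch under stated assumptions: the `toy` frame and the `score_columns` helper are hypothetical stand-ins, not part of the notebook, and only accuracy is recorded.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for model_data
toy = pd.DataFrame({
    "title_clean": ["feel sad today", "want end life", "tired lonely", "no reason live",
                    "sad lonely again", "plan end tonight", "feel empty tired", "cannot go on"],
    "selftext_clean": ["job stress sad", "goodbye everyone end", "nobody talks lonely", "done with life",
                       "cry every night", "wrote note tonight", "numb all day", "no hope left"],
    "is_suicide": [0, 1, 0, 1, 0, 1, 0, 1],
})

def score_columns(frame, columns, target="is_suicide"):
    """Fit CountVectorizer + MultinomialNB on each column and collect accuracies."""
    rows = []
    for col in columns:
        X_train, X_test, y_train, y_test = train_test_split(
            frame[col], frame[target],
            test_size=0.25, random_state=42, stratify=frame[target])
        # The pipeline fits the vectorizer on the training split only
        pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
        pipe.fit(X_train, y_train)
        rows.append({
            "series used (X)": col,
            "train accuracy": pipe.score(X_train, y_train),
            "test accuracy": pipe.score(X_test, y_test),
        })
    return pd.DataFrame(rows).round(2)

results = score_columns(toy, ["title_clean", "selftext_clean"])
print(results)
```

Returning a fresh list from inside the function also removes the need to reset a global `df_list` before each call.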