# Naive Bayes Classifier

## Importing Libraries

In [1]:
import requests, json, time, re
import pandas as pd
import numpy as np

from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('../data/df_combined.csv',index_col='Unnamed: 0')
df.head(2)

Unnamed: 0,author,author_fullname,subreddit_suicidewatch,id,created_utc,retrieved_on,permalink,url,num_comments,title,selftext,selftext_char_cnt,selftext_word_count
0,gothic_reality,t2_39c20e2f,0,b7vfje,1554081327,1554081328,/r/depression/comments/b7vfje/how_do_i_explain...,https://www.reddit.com/r/depression/comments/b...,0,How do i explain to my mom that liking dark st...,"I love my mom, but sometimes she overreacts to...",391,80
1,3453456346346,t2_335d0qqi,0,b7vf2d,1554081250,1554081251,/r/depression/comments/b7vf2d/i_want_to_lie_do...,https://www.reddit.com/r/depression/comments/b...,0,I want to lie down on a bed and curl up to sle...,I hate doing anything - i hate typing this but...,762,153


In [3]:
df.shape

(18102, 13)

## Calculating Baseline Accuracy

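The baseline accuracy is the accuracy we would get by always predicting the most frequent class, so it equals the largest share in the distribution of our response variable. Any model we fit should beat this number.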

In [4]:
df.subreddit_suicidewatch.value_counts(normalize=True)

1    0.500884
0    0.499116
Name: subreddit_suicidewatch, dtype: float64
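
As a quick check on how to read this output, the baseline is the share of the most frequent class, which is the accuracy of a model that always predicts that class. A minimal sketch, assuming a small made-up list of labels (`toy_labels` is hypothetical and is not the project data):

```python
from collections import Counter

# Hypothetical labels for illustration only: 1 = suicidewatch, 0 = depression
toy_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]

counts = Counter(toy_labels)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is correct exactly this often
baseline_accuracy = majority_count / len(toy_labels)
print(majority_class, round(baseline_accuracy, 3))
```

For the real data the two classes are almost evenly split, so the baseline is roughly 50%.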

## Assigning X and Y Variables

Our y, 'subreddit_suicidewatch', is the response variable that we're trying to classify correctly. The 'selftext' column will be the foundation of our predictor variables. We will vectorize this text using the TF-IDF method.

In [5]:
X = df[['selftext']]
y = df['subreddit_suicidewatch']

## Train/Test Split

Next we'll use `train_test_split` to create a train and test set for our data. We'll train our model on the training data and evaluate the fitted model on the test data to measure our accuracy. By default, `train_test_split` assigns 75% of the observations to the training set and the remaining 25% to the test set. Passing `stratify=y` keeps the class balance the same in both sets.

In [6]:
np.random.seed(23)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    stratify=y
                                                   )
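
To illustrate what the default split size and `stratify` do, here is a minimal sketch on made-up data. The arrays `toy_X` and `toy_y` are hypothetical, and `random_state=23` is an assumption added here only so the toy split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 200 rows with a perfectly balanced binary label
toy_X = np.arange(200).reshape(-1, 1)
toy_y = np.array([0, 1] * 100)

# With no test_size given, 25% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=23
)

print(len(X_tr), len(X_te))       # sizes of the two sets
print(y_tr.mean(), y_te.mean())   # share of class 1 in each set
```

Because the split is stratified, the share of class 1 is the same in the training and test sets.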

## Instantiate TF-IDF

TF-IDF stands for 'Term Frequency-Inverse Document Frequency'. It measures how often a word is present in a given document while taking into account how often the word appears across the entire corpus of documents.

The 'TF' stands for 'Term Frequency' and is the ratio of the number of times the word appears in a document to the total number of words in that document. If the word appears more frequently in the document, the value increases.

The 'IDF' stands for 'Inverse Document Frequency'. It applies a weight corresponding to how often a word occurs across all documents. For instance, words that are present in many documents will have a lower IDF value, while words that are rarer will have a higher IDF value.
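
The two definitions above can be sketched by hand. This is a minimal illustration of the textbook formulas, not the vectorizer's exact computation; the corpus `docs` and the function names are hypothetical:

```python
import math

# Hypothetical three-document corpus, already lowercased and tokenized
docs = [
    ["sad", "tired", "sad"],
    ["tired", "today"],
    ["sad", "alone", "today", "alone"],
]

def term_frequency(term, doc):
    # Share of the document's words that are `term`
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    # log(number of documents / number of documents containing `term`)
    # Assumes `term` occurs in at least one document
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

# "alone" appears in one document and "sad" in two, so "alone" gets the larger IDF
print(inverse_document_frequency("alone", docs))
print(inverse_document_frequency("sad", docs))
print(tf_idf("alone", docs[2], docs))
```

By default, `TfidfVectorizer` uses a smoothed variant (raw counts for TF, `ln((1 + n) / (1 + df)) + 1` for IDF, followed by L2 normalization of each row), so its values will differ from this sketch while following the same idea.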

In [8]:
tfidf = TfidfVectorizer(stop_words='english', min_df=5, max_df=.95)

In [9]:
train_raw = tfidf.fit_transform(X_train['selftext'])

## Creating a Sparse Matrix

We'll convert the TF-IDF matrix to a sparse data frame. A sparse matrix is one that contains mostly zeros. Storing such data in a sparse format lets it be stored more efficiently and typically speeds up machine learning processes.

In [10]:
train_df = pd.SparseDataFrame(train_raw, columns=tfidf.get_feature_names())
train_df.head()

Unnamed: 0,00,000,00am,01,03,10,100,1000,100k,100mg,...,yup,zaps,zealand,zen,zero,zoloft,zombie,zone,zoned,zoning
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


## Fill NAs with Zero

In [11]:
train_df.fillna(0, inplace=True)

In [12]:
train_df.head()

Unnamed: 0,00,000,00am,01,03,10,100,1000,100k,100mg,...,yup,zaps,zealand,zen,zero,zoloft,zombie,zone,zoned,zoning
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Replicate the Process for X-test
1. Creating a Sparse Matrix
1. Filling Nulls with Zero

In [13]:
test_df = pd.SparseDataFrame(tfidf.transform(X_test['selftext']), columns=tfidf.get_feature_names())
test_df.fillna(0, inplace=True)

In [14]:
test_df.head()

Unnamed: 0,00,000,00am,01,03,10,100,1000,100k,100mg,...,yup,zaps,zealand,zen,zero,zoloft,zombie,zone,zoned,zoning
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Fit Naive Bayes Classifier

The Naive Bayes classifier seeks to predict the probability of observing an event given a previously observed event or series of previously observed events. The previously observed events in this case refer to the vectorized text features.

Naive Bayes is a parametric modeling technique that has a number of advantages, as well as disadvantages:

**Advantages**
- Trains/Predicts relatively fast
- Empirically, predictions are surprisingly accurate

**Disadvantages**
- Assumes the independence of predictors. In real life, it is almost impossible to get a set of predictors that are completely independent. This is particularly evident in text analysis, where words tend to co-occur.
- While Naive Bayes accuracy is competitive with other classification techniques, the predicted probabilities it generates are generally not reliable.

We'll fit our data to a Gaussian Naive Bayes model because our features have been vectorized using the TF-IDF method, which produces continuous values.
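
To make the idea concrete, here is a minimal sketch of Gaussian Naive Bayes on a single continuous feature, written by hand. The data in `toy_train` and all function names are hypothetical; this illustrates the technique and is not scikit-learn's implementation:

```python
import math
from statistics import mean, pvariance

# Hypothetical one-feature training data, grouped by class label
toy_train = {
    0: [0.10, 0.15, 0.20, 0.12],   # e.g. TF-IDF weight of some word in class 0
    1: [0.40, 0.55, 0.45, 0.50],   # the same word in class 1
}

def fit_gaussian_nb(data):
    # For each class: prior, feature mean, and feature variance
    n_total = sum(len(values) for values in data.values())
    return {
        label: (len(values) / n_total, mean(values), pvariance(values))
        for label, values in data.items()
    }

def log_posterior(x, prior, mu, var):
    # log P(class) + log N(x | mu, var); the shared evidence term is dropped
    return (
        math.log(prior)
        - 0.5 * math.log(2 * math.pi * var)
        - (x - mu) ** 2 / (2 * var)
    )

def predict(x, params):
    # Pick the class with the largest (unnormalized) log posterior
    return max(params, key=lambda label: log_posterior(x, *params[label]))

params = fit_gaussian_nb(toy_train)
print(predict(0.14, params), predict(0.48, params))
```

With many features, the independence assumption means the per-feature log-likelihoods are simply added together. scikit-learn's `GaussianNB` also adds a small smoothing term to the variances for numerical stability.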

### Instantiate and Fit the Model

In [15]:
nb = GaussianNB()
nb.fit(train_df, y_train)

### Compute Accuracy on the Training/Test Data

In [17]:
nb.score(train_df, y_train)

0.7959634649381261

In [18]:
nb.score(test_df, y_test)

0.586389748121962

The training accuracy of 80% is slightly higher than that of the logistic regression model (77% train accuracy), but the test accuracy of 59% is considerably lower than logistic regression's (73% test accuracy). The gap of roughly 21 percentage points between training and test accuracy suggests Naive Bayes is significantly overfitting the data.

Next we'll look at the confusion matrix to assess if our predictions are better/worse in a particular class.

## Create Confusion Matrix

The confusion matrix is a simple visual representation of how our predictions in each class correspond to the actual classification.

In [21]:
from sklearn.metrics import confusion_matrix

Generating predictions based on the test data set

In [23]:
predictions = nb.predict(test_df)

Producing the Confusion Matrix

In [24]:
confusion_matrix(y_test, predictions)

array([[1153, 1106],
       [ 766, 1501]])

The confusion matrix can be interpreted as follows:
- **True Negative** (top left): observations we correctly identified as the negative class, which in this case refers to the depression subreddit (1153)
- **False Positive** (top right): observations we incorrectly identified as the positive class, meaning we predicted it was from the suicidewatch subreddit but it was from depression (1106)
- **False Negative** (bottom left): observations we incorrectly identified as the negative class, meaning we predicted it was from the depression subreddit but it was from suicidewatch (766)
- **True Positive** (bottom right): observations we correctly identified as the positive class, which in this case refers to the suicidewatch subreddit (1501)
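
The layout described above (rows are actual classes, columns are predicted classes) can be reproduced with a small hand-rolled tally. This is a sketch on made-up labels; `toy_actual`, `toy_predicted`, and the `toy_` counts are hypothetical:

```python
# Hypothetical labels for illustration: 1 = suicidewatch, 0 = depression
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 0]
toy_predicted = [0, 1, 0, 1, 0, 1, 1, 1]

# matrix[actual][predicted]: row = actual class, column = predicted class
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(toy_actual, toy_predicted):
    matrix[actual][predicted] += 1

# Reading the matrix row by row gives TN, FP, FN, TP
(toy_tn, toy_fp), (toy_fn, toy_tp) = matrix
print(matrix)
print(toy_tn, toy_fp, toy_fn, toy_tp)
```

Reading row by row is the same order that flattening a 2x2 confusion matrix with `.ravel()` produces.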

Assigning the confusion matrix values to variables

In [25]:
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

### Confusion Matrix Summary

In [43]:
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
print(f"Sensitivity: {tp/(tp+fn)}")
print(f"Specificity: {tn/(tn+fp)}")
print(f"Precision: {tp/(tp+fp)}")
print(f"Accuracy: {(tp+tn)/(tp+tn+fp+fn)}")

True Negatives: 1153
False Positives: 1106
False Negatives: 766
True Positives: 1501
Sensitivity: 0.6621085134539039
Specificity: 0.5104028331119964
Precision: 0.5757575757575758
Accuracy: 0.586389748121962


Of the suicidewatch subreddit posts, we identified 66% correctly. This is the Sensitivity metric in the output above. Of the depression subreddit posts, we correctly identified 51%, which is the Specificity metric.

Though our overall accuracy is 59%, we are more accurate on suicidewatch posts than on depression posts.