## Introduction

Reddit is a popular online discussion board where people can create groups, commonly known as 'subreddits', to chat and share content with like-minded people around the world. 

Today, we will be looking at how we can use machine learning to make an attempt in trying to predict sentiments, more specifically sarcasm/humour/irony.

## Problem Statement

Natural Language Processing (NLP) has been one of the longest running computing science fields, starting all the way back from the 1950s with Alan Turing's infamous Turing test - where a machine is said to pass the test if an evaluator cannot tell, behind the scenes, who is the one truly answering a posed question: Machine? or human?

Whilst it sounds like a rather trivial scenario, many decades on we've still yet to truly crack the language code, something that befuddles even the best of us mere humans. That is not to say we've made little progress - we've come far enough to now have chatbots being able to understand, guide, and assist us in services such as product support centres, as well as the more advanced voice assistants residing in each and every smartphone being made today.

Language, however, is not something static. It's not just a mere tool for simple communication. Language is a dynamic, fluid entity that can be moulded to resemble something, but yet truly mean something else. Things like humour, irony, and sarcasm can be complex and at times, thought-intensive to decipher, let alone create.

This project will use tools such as Scikit-Learn, Keras and the various feature selection tools out there to find out from Reddit, what differentiates a legitimate science question (r/askscience) from a humourous/sarcastic pseudoscience question (r/shittyaskscience).

In [1]:
# Importing all the required libraries for this project

import praw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.python.keras import backend as k
from xgboost import XGBClassifier
%matplotlib inline

In [2]:
# Using PRAW as Reddit's API wrapper to start scraping.

r = praw.Reddit(client_id='nNc6VxDLF9gUtw',
                client_secret='mmtCGSil2ML5ncye6axK_gAO8wg',
                user_agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15')

sas = r.subreddit('shittyaskscience')
sci = r.subreddit('askscience')

In [3]:
# Initiating dataframe and getting all posts limited to 10k for subreddit. Scrape begins.

sas_df = pd.DataFrame((), columns=['id', 'title', 'author', 'score'])
sas_new = sas.new(limit=10000)
sas_list = [(post.id, post.title, post.author, post.score) for post in sas_new]
sas_df_list = pd.DataFrame(sas_list, columns=['id', 'title', 'author', 'score'])
sas_df = pd.concat([sas_df, sas_df_list], axis=0)
sas_df.reset_index()
sas_df.drop_duplicates(subset='id', keep='first', inplace=True)

sas_df.to_csv('./datasets/sas.csv', mode='w', header=True, index=False)

In [4]:
sas_df = pd.read_csv('./datasets/sas.csv') # Loading csv file back in

In [5]:
sas_df.head() # Head looks good.

Unnamed: 0,id,title,author,score
0,chpnge,If the two European heatwaves this year were c...,Asian_Canadaball,1
1,cho8v8,Who invented being gay?,TigerpanzerIV,2
2,chmghz,Why did gravity ignore this bird?,zlicht,0
3,chm4p1,Do cats hunt watermelons?,killerbunnyfamily,856
4,chkqxs,"If air is such a good insulator, why so I stil...",hudzell,6


In [6]:
sas_df.shape # Checking shape to see what we've got

(996, 4)

In [8]:
# Initiating dataframe and getting all posts limited to 10k for subreddit

sci_df = pd.DataFrame((), columns=['id', 'title', 'author', 'score'])
sci_new = sci.new(limit=10000)
sci_list = [(post.id, post.title, post.author, post.score) for post in sci_new]
sci_df_list = pd.DataFrame(sci_list, columns=['id', 'title', 'author', 'score'])
sci_df = pd.concat([sci_df, sci_df_list], axis=0)
sci_df.reset_index()
sci_df.drop_duplicates(subset='id', keep='first', inplace=True)

sci_df.to_csv('./datasets/sci.csv', mode='w', header=True, index=False)

In [9]:
sci_df = pd.read_csv('./datasets/sci.csv') # Loading csv file back in

In [10]:
sci_df.shape # Checking shape to see what we've got

(750, 4)

In [11]:
sci_df.head()

Unnamed: 0,id,title,author,score
0,chm70g,AskScience AMA Series: We're from the Pacific ...,AskScienceModerator,350
1,chkmci,Why are beta blockers restricted to prescripti...,dontknowhowtoprogram,5
2,chiol7,"To my best understanding, space is (for the mo...",ctcsoccer17,3
3,chij26,Is Ceres a dwarf planet or an asteroid?,louisprimaasamonkey,3
4,chh4nk,If the event horizon is the region in space wh...,bigmaxporter,3


In [12]:
# Assigning subreddit encoding to each dataframe

sci_df['type'] = 'sci'
sas_df['type'] = 'sas'

In [13]:
df = pd.concat([sas_df, sci_df], axis=0) # Joining both subreddits' data together

In [14]:
df.shape

(1746, 5)

In [15]:
df.drop(['id', 'score', 'author'], axis=1, inplace=True) # Dropping unneeded columns

In [16]:
df['type'] = df['type'].replace({'sas':1, 'sci':0}) # Creating binary elements for subreddit type

In [17]:
df.isnull().any() # Checking for null values

title    False
type     False
dtype: bool

In [18]:
df.drop_duplicates(keep=False, inplace=True) # Dropping any duplicate posts with identical titles

In [19]:
X = df['title'] # Setting up features and target columns
y = df['type']

In [20]:
# Splitting up into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=34, shuffle=True)

Right here we are, we've finally settled down with our training and test data sets.

So let's start things off with a simple CountVectorizer from scikit to tokenise words.

There'll be a few iterations for this, with the parameters being:

- lowercase (True/False): 
There might be a difference between having non-lowercase words and lowercase words 
between the 2 subs, this will be something that I'll switch and see if it improves.   

- stop_words (english/none): 
Having stop_words on 'english' might seem like a no-brainer, but we aren't dealing with
proper text here. These are all one-phrase post titles where intended "stylistic"/wrong phrasing of words
might make a difference between the 2 subs.   

- n-grams (default/(1,2)/(1,3)): 
Having different number of words grouped as features might make make a difference
since I'm expecting humour/sarcastic science questions to have more common question terms (since they don't need
to get to specifics eg "Why is the protein that binds VCAM-1 called Very Late Antigen-4 (VLA-4)?" vs 
"what kind of bird is this?"))

- min_df/max_df (default/0.25-0.5):
Somewhat counterintuitive if I'm switching stop_words to False, but this will also be tweaked around, mostly to see if there might be a certain proportion that hits an acceptable spot.

- max_features:
To be adjusted during model tweaking to get the right range.

In [21]:
cvec = CountVectorizer(lowercase=True, stop_words=None, ngram_range=(1,2), min_df=3, max_df=0.3)

X_train_cvec  = pd.DataFrame(cvec.fit_transform(X_train).todense(), columns=cvec.get_feature_names())
X_test_cvec = pd.DataFrame(cvec.transform(X_test).todense(), columns=cvec.get_feature_names())

In [22]:
X_train_cvec.shape

(1219, 1351)

In [23]:
y_train.shape

(1219,)

In [24]:
# Running a simple Logistic Regression as a baseline

lr = LogisticRegression(solver='lbfgs')

lr_model = lr.fit(X_train_cvec, y_train)
lr_model_score = cross_val_score(lr_model, X_train_cvec, y_train, cv=5)

print("CV Score: " + str(lr_model_score.mean()))
print("Test Score: " + str(lr_model.score(X_test_cvec, y_test)) + "\n")

lr_pred = pd.DataFrame(lr_model.predict(X_test_cvec))

tn, fp, fn, tp = confusion_matrix(y_test, lr_pred).ravel()
lr_cvec_auc = roc_auc_score(y_test, lr_pred)
lr_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
lr_cvec_df = pd.DataFrame([[tn,fp,fn,tp,lr_cvec_acc,lr_cvec_auc]], index=['LR_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("LR CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(lr_cvec_acc))
print("AUC ROC: {}".format(lr_cvec_auc))

CV Score: 0.757144302772718
Test Score: 0.7667304015296367

LR CVEC Matrix:
True Negatives: 154
False Positives: 52
False Negatives: 70
True Positives: 247
Accuracy: 0.7667304015296367
AUC ROC: 0.7633763131297663


In [25]:
# Running a basic Multinomial Naive Bayes as a baseline

mnb = MultinomialNB(alpha=0.5)

mnb_model = mnb.fit(X_train_cvec, y_train)
mnb_model_score = cross_val_score(mnb_model, X_train_cvec, y_train, cv=5)

print("CV Score: " + str(mnb_model_score.mean()))
print("Test Score: " + str(mnb_model.score(X_test_cvec, y_test)) + "\n")

mnb_pred = mnb_model.predict(X_test_cvec)

tn, fp, fn, tp = confusion_matrix(y_test, mnb_pred).ravel()
mnb_cvec_auc = roc_auc_score(y_test, mnb_pred)
mnb_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
mnb_cvec_df = pd.DataFrame([[tn,fp,fn,tp,mnb_cvec_acc,mnb_cvec_auc]], index=['MNB_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("MultinomialNB CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(mnb_cvec_acc))
print("AUC ROC: {}".format(mnb_cvec_auc))

CV Score: 0.7563448694596235
Test Score: 0.7954110898661568

MultinomialNB CVEC Matrix:
True Negatives: 160
False Positives: 46
False Negatives: 61
True Positives: 256
Accuracy: 0.7954110898661568
AUC ROC: 0.7921350035220972


In [26]:
# Running a basic Bernoulli Naive Bayes as a baseline

bn = BernoulliNB(alpha=0.4)

bn_model = bn.fit(X_train_cvec, y_train)
bn_model_score = cross_val_score(bn_model, X_train_cvec, y_train, cv=5)

print("CV Score: " + str(bn_model_score.mean()))
print("Test Score: " + str(bn_model.score(X_test_cvec, y_test)) + "\n")

bn_pred = bn_model.predict(X_test_cvec)

tn, fp, fn, tp = confusion_matrix(y_test, bn_pred).ravel()
bn_cvec_auc = roc_auc_score(y_test, bn_pred)
bn_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
bn_cvec_df = pd.DataFrame([[tn,fp,fn,tp,bn_cvec_acc,bn_cvec_auc]], index=['BN_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("BernoulliNB CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(bn_cvec_acc))
print("AUC ROC: {}".format(bn_cvec_auc))

CV Score: 0.7612696485191932
Test Score: 0.7858508604206501

BernoulliNB CVEC Matrix:
True Negatives: 153
False Positives: 53
False Negatives: 59
True Positives: 258
Accuracy: 0.7858508604206501
AUC ROC: 0.7782992863924536


In [27]:
score = pd.concat([lr_cvec_df, mnb_cvec_df, bn_cvec_df], axis=0)
score.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_CVEC,160,46,61,256,0.795411,0.792135
BN_CVEC,153,53,59,258,0.785851,0.778299
LR_CVEC,154,52,70,247,0.76673,0.763376


## First Run

We now have the scores from our first batch run of 3 models. As expected, the Naive Bayes are doing better than the regular Logistic Regression model.   

It must be noted that due to the nature of the similarities in content, tracking the choice of model via its pure accuracy can be flawed, as we can be getting a lot of hits on True Positives due to our imbalanced data whilst clocking up a sizeable number of False Positives. I've gone with tracking the ranking via the AUC instead.

We'll use the TF-IDF vectoriser next for our run of our models once more.

## Using TF-IDF vectoriser 

In [28]:
tfidf = TfidfVectorizer(lowercase=True, stop_words=None, ngram_range=(1,2), max_df=0.3)
X_train_tfidf  = pd.DataFrame(tfidf.fit_transform(X_train).todense(), columns=tfidf.get_feature_names())
X_test_tfidf = pd.DataFrame(tfidf.transform(X_test).todense(), columns=tfidf.get_feature_names())

In [29]:
# Running logistic regression with TFIDF

lr_model1 = lr.fit(X_train_tfidf, y_train)
lr_model1_score = cross_val_score(lr_model1, X_train_tfidf, y_train, cv=7)

print("CV Score: " + str(lr_model1_score.mean()))
print("Test Score: " + str(lr_model1.score(X_test_tfidf, y_test)) + "\n")

lr1_pred = lr_model1.predict(X_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, lr1_pred).ravel()
lr_tfidf_auc = roc_auc_score(y_test, lr1_pred)
lr_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
lr_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,lr_tfidf_acc,lr_tfidf_auc]], index=['LR_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("LR TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(lr_tfidf_acc))
print("AUC ROC: {}".format(lr_tfidf_auc))

CV Score: 0.7243859584894352
Test Score: 0.7973231357552581

LR TF-IDF Matrix:
True Negatives: 138
False Positives: 68
False Negatives: 38
True Positives: 279
Accuracy: 0.7973231357552581
AUC ROC: 0.7750145477933293


In [30]:
# Running Multinomial NB with TFIDF

mnb1 = MultinomialNB(alpha=0.4)

mnb_model1 = mnb1.fit(X_train_tfidf, y_train)
mnb_model1_score = cross_val_score(mnb_model1, X_train_tfidf, y_train, cv=7)

print("CV Score: " + str(mnb_model1_score.mean()))
print("Test Score: " + str(mnb_model1.score(X_test_tfidf, y_test)) + "\n")

mnb1_pred = mnb_model1.predict(X_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, mnb1_pred).ravel()
mnb_tfidf_auc = roc_auc_score(y_test, mnb1_pred)
mnb_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
mnb_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,mnb_tfidf_acc,mnb_tfidf_auc]], index=['MNB_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("MultinomialNB TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(mnb_tfidf_acc))
print("AUC ROC: {}".format(mnb_tfidf_auc))

CV Score: 0.7613467136904593
Test Score: 0.8087954110898662

MultinomialNB TF-IDF Matrix:
True Negatives: 131
False Positives: 75
False Negatives: 25
True Positives: 292
Accuracy: 0.8087954110898662
AUC ROC: 0.7785289883923923


In [31]:
# Running Bernoulli NB with TFIDF

bn1 = BernoulliNB(alpha=0.2)

bn_model1 = bn1.fit(X_train_tfidf, y_train)
bn_model1_score = cross_val_score(bn_model1, X_train_tfidf, y_train,cv=7)

print("CV Score: " + str(bn_model1_score.mean()))
print("Test Score: " + str(bn_model1.score(X_test_tfidf, y_test)) + "\n")

bn1_pred = bn_model1.predict(X_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, bn1_pred).ravel()
bn_tfidf_auc = roc_auc_score(y_test, bn1_pred)
bn_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
bn_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,bn_tfidf_acc,bn_tfidf_auc]], index=['BN_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("BernoulliNB CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(bn_tfidf_acc))
print("AUC ROC: {}".format(bn_tfidf_auc))

CV Score: 0.7425712577237392
Test Score: 0.7992351816443595

BernoulliNB CVEC Matrix:
True Negatives: 123
False Positives: 83
False Negatives: 22
True Positives: 295
Accuracy: 0.7992351816443595
AUC ROC: 0.7638433738629751


In [32]:
score = pd.concat([lr_cvec_df,mnb_cvec_df,bn_cvec_df,lr_tfidf_df,mnb_tfidf_df,bn_tfidf_df])
score.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_CVEC,160,46,61,256,0.795411,0.792135
MNB_TFIDF,131,75,25,292,0.808795,0.778529
BN_CVEC,153,53,59,258,0.785851,0.778299
LR_TFIDF,138,68,38,279,0.797323,0.775015
BN_TFIDF,123,83,22,295,0.799235,0.763843
LR_CVEC,154,52,70,247,0.76673,0.763376


## Second Run

Checking out the results from the 2nd run, it's clear that the Multinomial NB model is our top model choice, for both the CVEC and TFIDF runs. The other 2 models seem to be middling at the moment, we'll proceed with feature engineering our data with lemmatisation, as well as doing hyperparameter tweakings.

### Proceeding with Lemmatisation with SpaCy

In [33]:
import spacy # Loading SpaCy and its large English core kit for lemmatisation.
nlp = spacy.load('en_core_web_lg')

def lem_sentences(sentence):
    doc = nlp(sentence)
    return" ".join([token.lemma_ for token in doc])

In [34]:
df['lem'] = df['title'].apply(lem_sentences) # Applying our function to lemmatise the post titles
X_lem = df['lem']

In [35]:
# Creating our training and testing data

X_lem_train, X_lem_test, y_train, y_test = train_test_split(X_lem, y, test_size=0.3, random_state=34, shuffle=True)

In [36]:
# Initialising our count vectoriser with our lemmatised data

cvec = CountVectorizer(lowercase=True, stop_words=None, ngram_range=(1,2), min_df=3, max_df=0.5)

X_lem_train_cvec  = pd.DataFrame(cvec.fit_transform(X_lem_train).todense(), columns=cvec.get_feature_names())
X_lem_test_cvec = pd.DataFrame(cvec.transform(X_lem_test).todense(), columns=cvec.get_feature_names())

In [37]:
X_lem_train_cvec.shape # Checking shape

(1219, 1358)

In [38]:
# Running logistic regression on lemmatised data

lr2_model = lr.fit(X_lem_train_cvec, y_train)
lr2_model_score = cross_val_score(lr2_model, X_lem_train_cvec, y_train, cv=7)

print("CV Score: " + str(lr2_model_score.mean()))
print("Test Score: " + str(lr2_model.score(X_lem_test_cvec, y_test)) + "\n")

lr2_pred = pd.DataFrame(lr2_model.predict(X_lem_test_cvec))

tn, fp, fn, tp = confusion_matrix(y_test, lr2_pred).ravel()
lr2_cvec_auc = roc_auc_score(y_test, lr2_pred)
lr2_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
lr2_cvec_df = pd.DataFrame([[tn,fp,fn,tp,lr2_cvec_acc,lr2_cvec_auc]], index=['LR2_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("LR2 CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(lr2_cvec_acc))
print("AUC ROC: {}".format(lr2_cvec_auc))

CV Score: 0.7449213354865566
Test Score: 0.7667304015296367

LR2 CVEC Matrix:
True Negatives: 154
False Positives: 52
False Negatives: 70
True Positives: 247
Accuracy: 0.7667304015296367
AUC ROC: 0.7633763131297663


In [39]:
# Running Multinomial NB on lemmatised data

mnb2 = MultinomialNB(alpha=0.5)

mnb_model2 = mnb2.fit(X_lem_train_cvec, y_train)
mnb_model2_score = cross_val_score(mnb_model2, X_lem_train_cvec, y_train, cv=5)

print("CV Score: "+ str(mnb_model2_score.mean()))
print("Test Score: "+ str(mnb_model2.score(X_lem_test_cvec, y_test)) + "\n")

mnb2_pred = mnb_model2.predict(X_lem_test_cvec)

tn, fp, fn, tp = confusion_matrix(y_test, mnb2_pred).ravel()
mnb2_cvec_auc = roc_auc_score(y_test, mnb2_pred)
mnb2_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
mnb2_cvec_df = pd.DataFrame([[tn,fp,fn,tp,mnb2_cvec_acc,mnb2_cvec_auc]], index=['MNB2_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("MultinomialNB2 CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(mnb2_cvec_acc))
print("AUC ROC: {}".format(mnb2_cvec_auc))

CV Score: 0.7727787897186804
Test Score: 0.7782026768642447

MultinomialNB2 CVEC Matrix:
True Negatives: 159
False Positives: 47
False Negatives: 69
True Positives: 248
Accuracy: 0.7782026768642447
AUC ROC: 0.7770895225261094


In [40]:
# Running Bernoulli NB on lemmatised data

bn2 = BernoulliNB(alpha=0.2)

bn_model2 = bn2.fit(X_lem_train_cvec, y_train)
bn_model2_score = cross_val_score(bn_model2, X_lem_train_cvec, y_train, cv=5)

print("CV Score: "+ str(bn_model2_score.mean()))
print("Test Score: " + str(bn_model2.score(X_lem_test_cvec, y_test)) + "\n")

bn2_pred = bn_model2.predict(X_lem_test_cvec)

tn, fp, fn, tp = confusion_matrix(y_test, bn2_pred).ravel()
bn2_cvec_auc = roc_auc_score(y_test, bn2_pred)
bn2_cvec_acc = (tn+tp)/(tn+fp+fn+tp)
bn2_cvec_df = pd.DataFrame([[tn,fp,fn,tp,bn2_cvec_acc,bn2_cvec_auc]], index=['BN2_CVEC'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("BernoulliNB CVEC Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(bn2_cvec_acc))
print("AUC ROC: {}".format(bn2_cvec_auc))

CV Score: 0.7719422519058221
Test Score: 0.7820267686424475

BernoulliNB CVEC Matrix:
True Negatives: 157
False Positives: 49
False Negatives: 65
True Positives: 252
Accuracy: 0.7820267686424475
AUC ROC: 0.7785443018590549


In [41]:
score2 = pd.concat([lr2_cvec_df, mnb2_cvec_df, bn2_cvec_df], axis=0).sort_values('AUC', ascending=False)
score2

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
BN2_CVEC,157,49,65,252,0.782027,0.778544
MNB2_CVEC,159,47,69,248,0.778203,0.77709
LR2_CVEC,154,52,70,247,0.76673,0.763376


In [42]:
# Running TFIDF vectoriser on lemmatised data

tfidf = TfidfVectorizer(lowercase=True, stop_words=None, ngram_range=(1,2), max_df=0.3)
X_lem_train_tfidf  = pd.DataFrame(tfidf.fit_transform(X_lem_train).todense(), columns=tfidf.get_feature_names())
X_lem_test_tfidf = pd.DataFrame(tfidf.transform(X_lem_test).todense(), columns=tfidf.get_feature_names())

In [43]:
X_lem_train_tfidf.shape # Checking shape of TFIDF data

(1219, 13879)

In [44]:
# Running logistic regression on TFIDF lemmatised data

lr_model3 = lr.fit(X_lem_train_tfidf, y_train)
lr_model3_score = cross_val_score(lr_model3, X_lem_train_tfidf, y_train, cv=7)

print("CV Score: " + str(lr_model3_score.mean()))
print("Test Score: " + str(lr_model3.score(X_lem_test_tfidf, y_test)) + "\n")

lr3_pred = lr_model3.predict(X_lem_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, lr3_pred).ravel()
lr3_tfidf_auc = roc_auc_score(y_test, lr3_pred)
lr3_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
lr3_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,lr3_tfidf_acc,lr3_tfidf_auc]], index=['LR3_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("LR3 TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(lr3_tfidf_acc))
print("AUC ROC: {}".format(lr3_tfidf_auc))

CV Score: 0.7400234847776064
Test Score: 0.780114722753346

LR3 TF-IDF Matrix:
True Negatives: 132
False Positives: 74
False Negatives: 41
True Positives: 276
Accuracy: 0.780114722753346
AUC ROC: 0.7557195797984746


In [45]:
# Running Multinomail NB on TFIDF lemmatised data

mnb3 = MultinomialNB(alpha=0.3)
mnb_model3 = mnb3.fit(X_lem_train_tfidf, y_train)
mnb_model3_score = cross_val_score(mnb_model3, X_lem_train_tfidf, y_train, cv=7)

print("CV Score: " + str(mnb_model3_score.mean()))
print("Test Score: " + str(mnb_model3.score(X_lem_test_tfidf, y_test)) + "\n")

mnb3_pred = mnb_model3.predict(X_lem_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, mnb3_pred).ravel()
mnb3_tfidf_auc = roc_auc_score(y_test, mnb3_pred)
mnb3_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
mnb3_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,mnb3_tfidf_acc,mnb3_tfidf_auc]], index=['MNB3_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("MultinomialNB3 TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(mnb3_tfidf_acc))
print("AUC ROC: {}".format(mnb3_tfidf_auc))

CV Score: 0.7654848074370271
Test Score: 0.7896749521988528

MultinomialNB3 TF-IDF Matrix:
True Negatives: 130
False Positives: 76
False Negatives: 34
True Positives: 283
Accuracy: 0.7896749521988528
AUC ROC: 0.7619062203301583


In [46]:
# Running Bernoulli NB on TFIDF lemmatised data

bn3 = BernoulliNB(alpha=0.2)

bn_model3 = bn3.fit(X_lem_train_tfidf, y_train)
bn_model3_score = cross_val_score(bn_model3, X_lem_train_tfidf, y_train, cv=7)


print("CV Score: " + str(bn_model3_score.mean()))
print("Test Score: " + str(bn_model3.score(X_lem_test_tfidf, y_test)) + "\n")

bn3_pred = bn_model3.predict(X_lem_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, bn3_pred).ravel()
bn3_tfidf_auc = roc_auc_score(y_test, bn3_pred)
bn3_tfidf_acc = (tn+tp)/(tn+fp+fn+tp)
bn3_tfidf_df = pd.DataFrame([[tn,fp,fn,tp,bn3_tfidf_acc,bn3_tfidf_auc]], index=['BN3_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("BernoulliNB TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(bn3_tfidf_acc))
print("AUC ROC: {}".format(bn3_tfidf_auc))

CV Score: 0.7606387262897983
Test Score: 0.7858508604206501

BernoulliNB TF-IDF Matrix:
True Negatives: 119
False Positives: 87
False Negatives: 25
True Positives: 292
Accuracy: 0.7858508604206501
AUC ROC: 0.7494027748001593


In [47]:
score3 = pd.concat([lr2_cvec_df, mnb2_cvec_df, bn2_cvec_df,lr3_tfidf_df,mnb3_tfidf_df,bn3_tfidf_df])
score3.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
BN2_CVEC,157,49,65,252,0.782027,0.778544
MNB2_CVEC,159,47,69,248,0.778203,0.77709
LR2_CVEC,154,52,70,247,0.76673,0.763376
MNB3_TFIDF,130,76,34,283,0.789675,0.761906
LR3_TFIDF,132,74,41,276,0.780115,0.75572
BN3_TFIDF,119,87,25,292,0.785851,0.749403


## 3rd Run

Amongst our model runs with lemmatisation, it looks like the TF-IDF vectorisers didn't do much to help the cause, perhaps it was dropping off too many rare terms that were indeed important to discern between the two subreddits.

## Hyperparameter tuning via Pipeline, GridSearchCV

The next step will be to run a pipeline to find the best parameters to use for each of our models as well as a corresponding suited vectoriser setting using our lemmatised data from before. 

Here we go.

In [48]:
# Setting up of pipeline and the parameters to be iterated through to find best settings for Multinomial NB

pipeline = Pipeline([
    ('cvec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('mnb', MultinomialNB()),
])

parameters = {
    'cvec__lowercase': ('True', 'False'),
    'cvec__stop_words': (None, 'english'),    
    'cvec__min_df': (1, 2, 3),
    'cvec__max_df': (0.1, 0.2, 0.3, 0.4, 0.5),
    'cvec__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'mnb__alpha': (0.1, 0.2, 0.3, 0.4, 0.5)
}

# Running pipeline to find best settings in 3 folds.

grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

grid_search.fit(X_lem_train, y_train)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 3600 candidates, totalling 10800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    1.9s
[Parallel(n_jobs=-1)]: Done 274 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 774 tasks      | elapsed:    9.5s
[Parallel(n_jobs=-1)]: Done 1474 tasks      | elapsed:   16.8s
[Parallel(n_jobs=-1)]: Done 2374 tasks      | elapsed:   27.8s
[Parallel(n_jobs=-1)]: Done 3474 tasks      | elapsed:   39.4s
[Parallel(n_jobs=-1)]: Done 4774 tasks      | elapsed:   54.4s
[Parallel(n_jobs=-1)]: Done 6274 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 7974 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 9874 tasks      | elapsed:  1.9min


Best parameters set:
	cvec__lowercase: 'True'
	cvec__max_df: 0.2
	cvec__min_df: 2
	cvec__ngram_range: (1, 2)
	cvec__stop_words: None
	mnb__alpha: 0.2
	tfidf__norm: 'l2'
	tfidf__use_idf: False


[Parallel(n_jobs=-1)]: Done 10800 out of 10800 | elapsed:  2.1min finished


In [49]:
# Setting up count vectoriser with settings from pipeline

cvec = CountVectorizer(lowercase=True, stop_words=None, ngram_range=(1,2), min_df=2, max_df=0.2)

X_lem_train_cvec  = pd.DataFrame(cvec.fit_transform(X_lem_train).todense(), columns=cvec.get_feature_names())
X_lem_test_cvec = pd.DataFrame(cvec.transform(X_lem_test).todense(), columns=cvec.get_feature_names())

trans = TfidfTransformer(norm='l2', use_idf=False)
X_lem_train_final = trans.fit_transform(X_lem_train_cvec)
X_lem_test_final = trans.transform(X_lem_test_cvec)

In [50]:
# Setting up Multinomial NB with settings from pipeline

mnb4 = MultinomialNB(alpha=0.2)
mnb_model4 = mnb4.fit(X_lem_train_final, y_train)

print("Test Score: " + str(mnb_model4.score(X_lem_test_final, y_test)) + "\n")

mnb4_pred = mnb_model4.predict(X_lem_test_final)

tn, fp, fn, tp = confusion_matrix(y_test, mnb4_pred).ravel()
mnb4_final_auc = roc_auc_score(y_test, mnb4_pred)
mnb4_final_acc = (tn+tp)/(tn+fp+fn+tp)
mnb4_final_df = pd.DataFrame([[tn,fp,fn,tp,mnb4_final_acc,mnb4_final_auc]], index=['MNB_FINAL'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("MultinomialNB3 TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(mnb4_final_acc))
print("AUC ROC: {}".format(mnb4_final_auc))

Test Score: 0.8087954110898662

MultinomialNB3 TF-IDF Matrix:
True Negatives: 153
False Positives: 53
False Negatives: 47
True Positives: 270
Accuracy: 0.8087954110898662
AUC ROC: 0.7972267311874063


In [51]:
# Setting up of pipeline and the parameters to be iterated through to find best settings for Bernoulli NB

pipeline = Pipeline([
    ('cvec', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('bn', BernoulliNB()),
])

parameters = {
    'cvec__lowercase': ('True', 'False'),
    'cvec__stop_words': (None, 'english'),    
    'cvec__min_df': (1, 2, 3),
    'cvec__max_df': (0.1, 0.2, 0.3, 0.4, 0.5),
    'cvec__ngram_range': ((1, 1), (1, 2), (1, 3)),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'bn__alpha': (0.1, 0.2, 0.3, 0.4, 0.5)
}

# Running pipeline to find best settings in 3 folds.

grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

grid_search.fit(X_lem_train, y_train)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 3600 candidates, totalling 10800 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  32 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done 1632 tasks      | elapsed:   15.8s
[Parallel(n_jobs=-1)]: Done 3032 tasks      | elapsed:   31.0s
[Parallel(n_jobs=-1)]: Done 4832 tasks      | elapsed:   51.1s
[Parallel(n_jobs=-1)]: Done 7032 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 9632 tasks      | elapsed:  1.8min


Best parameters set:
	bn__alpha: 0.5
	cvec__lowercase: 'True'
	cvec__max_df: 0.4
	cvec__min_df: 2
	cvec__ngram_range: (1, 3)
	cvec__stop_words: None
	tfidf__norm: 'l1'
	tfidf__use_idf: True


[Parallel(n_jobs=-1)]: Done 10800 out of 10800 | elapsed:  2.0min finished


In [52]:
# Setting up count vectoriser with settings from pipeline

cvec = CountVectorizer(lowercase=True, stop_words=None, ngram_range=(1,3), min_df=2, max_df=0.4)

X_lem_train_cvec  = pd.DataFrame(cvec.fit_transform(X_lem_train).todense(), columns=cvec.get_feature_names())
X_lem_test_cvec = pd.DataFrame(cvec.transform(X_lem_test).todense(), columns=cvec.get_feature_names())

trans = TfidfTransformer(norm='l1', use_idf='True')
X_lem_train_final = trans.fit_transform(X_lem_train_cvec)
X_lem_test_final = trans.transform(X_lem_test_cvec)

In [53]:
# Setting up Bernoulil NB with settings from pipeline

bn4 = BernoulliNB(alpha=0.5)

bn_model4 = bn4.fit(X_lem_train_final, y_train)

print("Test Score: " + str(bn_model4.score(X_lem_test_final, y_test)) + "\n")

bn4_pred = bn_model4.predict(X_lem_test_final)

tn, fp, fn, tp = confusion_matrix(y_test, bn4_pred).ravel()
bn4_final_auc = roc_auc_score(y_test, bn4_pred)
bn4_final_acc = (tn+tp)/(tn+fp+fn+tp)
bn4_final_df = pd.DataFrame([[tn,fp,fn,tp,bn4_final_acc,bn4_final_auc]], index=['BN_FINAL'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("BernoulliNB Final TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(bn4_final_acc))
print("AUC ROC: {}".format(bn4_final_auc))

Test Score: 0.7858508604206501

BernoulliNB Final TF-IDF Matrix:
True Negatives: 154
False Positives: 52
False Negatives: 60
True Positives: 257
Accuracy: 0.7858508604206501
AUC ROC: 0.7791491837922268


In [54]:
score_bayes_final = pd.concat([mnb4_final_df, bn4_final_df], axis=0)
score_bayes_final.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_FINAL,153,53,47,270,0.808795,0.797227
BN_FINAL,154,52,60,257,0.785851,0.779149


## 4th Run

There's a good improvement in both models, with the Multinomial NB model now hitting 0.80 accuracy with a close-enough AUC of 0.79.

## XGBoost

The XGBoost classifier can be used here as well, and it'll be used to compare to how the other models we have above has done.

In [55]:
# Setting up XGBoost run using our TFIDF data, this can take up to 2 mins to run

xgb = XGBClassifier()

xgb = xgb.fit(X_lem_train_tfidf, y_train)
xgb_score = cross_val_score(xgb, X_lem_train_tfidf, y_train, cv=3)

print("CV Score: " + str(xgb_score.mean()))
print("Test Score: " + str(xgb.score(X_lem_test_tfidf, y_test)) + "\n")

xgb_pred = xgb.predict(X_lem_test_tfidf)

tn, fp, fn, tp = confusion_matrix(y_test, xgb_pred).ravel()
xgb_final_auc = roc_auc_score(y_test, xgb_pred)
xgb_final_acc = (tn+tp)/(tn+fp+fn+tp)
xgb_final_df = pd.DataFrame([[tn,fp,fn,tp,xgb_final_acc,xgb_final_auc]], index=['XGB_TFIDF'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("XGBClassifier Final TF-IDF Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(xgb_final_acc))
print("AUC ROC: {}".format(xgb_final_auc))

CV Score: 0.69977164804751
Test Score: 0.6959847036328872

XGBClassifier Final TF-IDF Matrix:
True Negatives: 140
False Positives: 66
False Negatives: 93
True Positives: 224
Accuracy: 0.6959847036328872
AUC ROC: 0.6931181280818351


In [56]:
# Setting up of pipeline and the parameters to be iterated through to find best settings for XGBoost, using our
# lemmatised data this time.

pipeline = Pipeline([
    ('xgb', XGBClassifier()),
])

parameters = {    
    'xgb__eta': (0.1, 0.2),
    'xgb__gamma': (0, 1, 2, 3, 4),
    'xgb__max_depth': (2, 3, 4),
    'xgb__alpha': (0, 0.1, 0.2, 0.3)
}

# Running pipeline to find best settings in 3 folds

grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

grid_search.fit(X_lem_train_final, y_train)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   31.7s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:  1.0min finished


Best parameters set:
	xgb__alpha: 0
	xgb__eta: 0.1
	xgb__gamma: 2
	xgb__max_depth: 4


In [57]:
# Running XGBoost with tuned parameters

xgb1 = XGBClassifier(eta=0.1, alpha=0, gamma=2, max_depth=4)

xgb1 = xgb1.fit(X_lem_train_final, y_train)
xgb1_score = cross_val_score(xgb1, X_lem_train_final, y_train, cv=3)

print("CV Score: " + str(xgb1_score.mean()))
print("Test Score: " + str(xgb1.score(X_lem_test_final, y_test)) + "\n")

xgb1_pred = xgb1.predict(X_lem_test_final)

tn, fp, fn, tp = confusion_matrix(y_test, xgb1_pred).ravel()
xgb1_final_auc = roc_auc_score(y_test, xgb1_pred)
xgb1_final_acc = (tn+tp)/(tn+fp+fn+tp)
xgb1_final_df = pd.DataFrame([[tn,fp,fn,tp,xgb1_final_acc,xgb1_final_auc]], index=['XGB_FINAL'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("XGBClassifier Final Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(xgb1_final_acc))
print("AUC ROC: {}".format(xgb1_final_auc))

CV Score: 0.7120828844966777
Test Score: 0.7036328871892925

XGBClassifier Final Matrix:
True Negatives: 140
False Positives: 66
False Negatives: 89
True Positives: 228
Accuracy: 0.7036328871892925
AUC ROC: 0.6994272763468193


In [58]:
final_scores = pd.concat([xgb_final_df, score_bayes_final, score3, score])
final_scores.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_FINAL,153,53,47,270,0.808795,0.797227
MNB_CVEC,160,46,61,256,0.795411,0.792135
BN_FINAL,154,52,60,257,0.785851,0.779149
BN2_CVEC,157,49,65,252,0.782027,0.778544
MNB_TFIDF,131,75,25,292,0.808795,0.778529
BN_CVEC,153,53,59,258,0.785851,0.778299
MNB2_CVEC,159,47,69,248,0.778203,0.77709
LR_TFIDF,138,68,38,279,0.797323,0.775015
BN_TFIDF,123,83,22,295,0.799235,0.763843
LR2_CVEC,154,52,70,247,0.76673,0.763376


## XGBoost Run

Our quick run on XGBoost didn't turn out the results we hoped for, and that can be down to not having enough parameters for it to be tweaked due to computational limits. The TF-IDF data would've taken too long as well for hyperparameter tuning due to the number of folds required to have a good gauge of which settings to use.

Next, we'll also include in a simple LSTM model from Keras via Tensorflow to compare against the models we already have.

## Keras LSTM model run (Tensorflow 2.0 beta)

In [59]:
X = df['title'].values
y = df['type'].values

In [60]:
# Running the same procedures before to tokenise our text data, but using Keras' processing instead

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

tokeniser_obj = Tokenizer(lower=True)
total_posts = X
tokeniser_obj.fit_on_texts(X)

Using TensorFlow backend.


In [61]:
# Typically, max_length should be the maximum length of the longest post title, but due to computing constraints
# a length of 20 is used, which is mostly fine as our post titles don't usually exceed 20 words.

max_length = 20 
vocab_size = len(tokeniser_obj.word_index) + 1
vocab_size

4923

In [62]:
# Setting up our training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=34, shuffle=True)

In [63]:
# Setting up our tokenised words into sequences, and then padding them for equal length to be fed into our model

X_train_tokens = tokeniser_obj.texts_to_sequences(X_train)
X_test_tokens = tokeniser_obj.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_tokens, maxlen=max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen=max_length, padding='post')

In [159]:
"""
The model will be running a single LSTM layer with 200 units in it.
"""

embedding_dim = 200 # Word dimension matrix of 200 used
batch_size = 24 # Model sees 24 of our training samples at a time

model = Sequential()

model.add(Embedding(vocab_size, embedding_dim, input_length=max_length))
model.add(LSTM(200))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) 
print(model.summary()) 

Model: "sequential_29"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_29 (Embedding)     (None, 20, 200)           984600    
_________________________________________________________________
lstm_29 (LSTM)               (None, 200)               320800    
_________________________________________________________________
dense_29 (Dense)             (None, 1)                 201       
Total params: 1,305,601
Trainable params: 1,305,601
Non-trainable params: 0
_________________________________________________________________
None


In [160]:
model.fit(X_train_pad, y_train, epochs=3, batch_size=batch_size, validation_data=(X_test_pad, y_test), verbose=2)

Train on 1219 samples, validate on 523 samples
Epoch 1/3
1219/1219 - 3s - loss: 0.6876 - accuracy: 0.5570 - val_loss: 0.6504 - val_accuracy: 0.7151
Epoch 2/3
1219/1219 - 1s - loss: 0.3967 - accuracy: 0.8376 - val_loss: 0.4776 - val_accuracy: 0.7782
Epoch 3/3
1219/1219 - 1s - loss: 0.1336 - accuracy: 0.9606 - val_loss: 0.4981 - val_accuracy: 0.7897


<tensorflow.python.keras.callbacks.History at 0x1ff0cad30>

In [161]:
tf_pred = model.predict(X_test_pad) # Getting predictions and converting them into binary from their probabilities

tf_df = pd.DataFrame(tf_pred, columns=['y_hat'])
tf_df['y_hat'] = tf_df['y_hat'].apply(lambda x: 1 if x >= 0.5 else 0)

In [162]:
tn, fp, fn, tp = confusion_matrix(y_test, tf_df).ravel()
tf_final_auc = roc_auc_score(y_test, tf_df)
tf_final_acc = (tn+tp)/(tn+fp+fn+tp)
tf_final_df = pd.DataFrame([[tn,fp,fn,tp,tf_final_acc,tf_final_auc]], index=['LSTM_FINAL'], columns=['TN','FP','FN','TP','ACC','AUC'])

print("LSTM Final Matrix:")
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
print("Accuracy: {}".format(tf_final_acc))
print("AUC ROC: {}".format(tf_final_auc))

LSTM Final Matrix:
True Negatives: 154
False Positives: 52
False Negatives: 58
True Positives: 259
Accuracy: 0.7896749521988528
AUC ROC: 0.7823037579247191


In [163]:
final_scorers = pd.concat([final_scores, tf_final_df])
final_scorers.sort_values('AUC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_FINAL,153,53,47,270,0.808795,0.797227
MNB_CVEC,160,46,61,256,0.795411,0.792135
LSTM_FINAL,154,52,58,259,0.789675,0.782304
BN_FINAL,154,52,60,257,0.785851,0.779149
BN2_CVEC,157,49,65,252,0.782027,0.778544
MNB_TFIDF,131,75,25,292,0.808795,0.778529
BN_CVEC,153,53,59,258,0.785851,0.778299
MNB2_CVEC,159,47,69,248,0.778203,0.77709
LR_TFIDF,138,68,38,279,0.797323,0.775015
BN_TFIDF,123,83,22,295,0.799235,0.763843


## Final Run

##### AUC ranking (above):   

After running all our models with various settings and hyperparameter tweaks, it's clear that our hyperparameter-tuned Multinomial NB model did the best, with the cvec version of it coming in second. Keras' LSTM model rounds it off at 3rd with a single LSTM layer.

##### ACC ranking (below):   

Similar story on the accuracy ranking, with the Multinomial NB models taking top 2, albeit the TF-IDF version in 2nd place instead of the CVEC version. Looks like the TF-IDF versions of the models did well in the accuracy section, with non-TFIDF versions taking up the other spots below the top 5. 

In [166]:
final_scorers.sort_values('ACC', ascending=False)

Unnamed: 0,TN,FP,FN,TP,ACC,AUC
MNB_FINAL,153,53,47,270,0.808795,0.797227
MNB_TFIDF,131,75,25,292,0.808795,0.778529
BN_TFIDF,123,83,22,295,0.799235,0.763843
LR_TFIDF,138,68,38,279,0.797323,0.775015
MNB_CVEC,160,46,61,256,0.795411,0.792135
MNB3_TFIDF,130,76,34,283,0.789675,0.761906
LSTM_FINAL,154,52,58,259,0.789675,0.782304
BN_FINAL,154,52,60,257,0.785851,0.779149
BN3_TFIDF,119,87,25,292,0.785851,0.749403
BN_CVEC,153,53,59,258,0.785851,0.778299


## Conclusions and Limitations

It's evident that trying to discern whether a post was made in jest or made as a serious question is no easy task. Even though our highest model had an accuracy of 0.80, it still generated a significant number of false positives, and our AUC lags behind the accuracy as a result.

Given this, however, it's still interesting that the model still managed to do as well as it did, considering that even as a human reading the post titles (that would be me here), I couldn't tell at times which subreddit a post belongs without checking for any reference elsewhere. 

Which brings me to the part on limitations. It might have been easier to get a better picture and a higher predictive score if we could also add in data within each post itself, for example, clicking into some of the posts from AskScience leads to long walls of text which actually discuss the core science question at hand.

Within ShittyAskScience, it tends to be memes, funny images, or just short replies, since there was never really anything to discuss of in the first place. The key part in all of the limitations would be context - something even us humans need in order to be able to classify things the analog way.

### Future Work

Adding in comments within each post is definitely beneficial, as it provides a closer link to each post and hence allowing the machine to be able to learn and identify better. 