## 0.2 Bot or Human?: Binary Classification of Reddit Posts


### Contents:
- [Import Libraries](#Import-Libraries)
- [Read in data scraped from subreddits](#Read-in-data-scraped-from-subreddits)
- [Data Processing](#Data-Processing)
- [NLP Modelling](#NLP-Modelling)
- [Models Evaluation & Recommendation](#Models-Evaluation-and-Recommendation)

## Import Libraries

In [29]:
import requests
import time
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.corpus import stopwords
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,HashingVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## Read in data scraped from subreddits

In [2]:
df_bot = pd.read_csv('../Data/bot.csv')

In [3]:
df_human = pd.read_csv('../Data/human.csv')

## Data Processing

In [4]:
#Labels posts from SubredditSimulator as 1 (bot)
df_bot['bot'] = 1

In [5]:
df_bot.head()

Unnamed: 0,post_title,name,bot
0,What is /r/SubredditSimulator?,t3_3g9ioz,1
1,"Only bots can post in /r/SubredditSimulator, c...",t3_3g9k92,1
2,Cops tow golden Porsche for being 'rude' claim...,t3_cg1wva,1
3,Are you happy now that I’m currently pet sitting,t3_cfpfnu,1
4,Harold's drinking problem hindered his love li...,t3_cg17b6,1


In [6]:
#Labels posts from Showerthoughts as 0 (human)
df_human['bot'] = 0

In [7]:
df_human.head()

Unnamed: 0,post_title,name,bot
0,Your Essential Guide to Showerthoughts,t3_bg71oh,0
1,"The Quintessential Showerthought, Issue #1 - O...",t3_bpql00,0
2,A clear toothpaste tube would make so much sense.,t3_cg37tn,0
3,"When someone dies doing a dangerous hobby, it'...",t3_cg0v70,0
4,Your bed has probably seen you go through more...,t3_cg4dqt,0


In [8]:
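#Combines all scraped data into one frame, skipping the first two rows of each
#(these appear to be pinned subreddit announcements rather than regular posts)
df = pd.concat([df_bot[2:], df_human[2:]])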

In [9]:
#Identifies posts with empty title as NaN
df = df.replace('',np.nan)

In [10]:
#Checks for NaN cells
df.isnull().sum()

post_title    0
name          0
bot           0
dtype: int64

In [11]:
#Checks that both categories are evenly represented
df.bot.value_counts()

0    857
1    850
Name: bot, dtype: int64

In [12]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [13]:
#Converts to lowercase and maps curly quotes (’ ‘ “), which are not in string.punctuation,
#to ':' so that the punctuation-removal step below strips them
df['post_title'] = df['post_title'].str.lower().str.replace('’',':').str.replace('‘',':').str.replace('“',':')

In [14]:
#Removes punctuation
df['post_title_clean'] = df['post_title'].map(lambda x : ''.join(k for k in x if k not in string.punctuation))
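
The two cleaning steps above can be checked on a single title. The sketch below is a minimal stand-alone version; `clean_title` is a hypothetical helper name, not something defined in the notebook. Like the cells above, it does not handle the closing curly double quote (”).

```python
import string

def clean_title(title):
    """Lowercase a title, map curly quotes to ':' and drop ASCII punctuation."""
    lowered = title.lower()
    # Curly quotes are not in string.punctuation, so map them to ':' first
    for quote in ('’', '‘', '“'):
        lowered = lowered.replace(quote, ':')
    # Keep only the characters that are not ASCII punctuation
    return ''.join(ch for ch in lowered if ch not in string.punctuation)

print(clean_title("Are you happy now that I’m currently pet sitting"))
# are you happy now that im currently pet sitting
```

The printed result matches the `post_title_clean` value shown for that post in the next output.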

In [15]:
df.head()

Unnamed: 0,post_title,name,bot,post_title_clean
2,cops tow golden porsche for being 'rude' claim...,t3_cg1wva,1,cops tow golden porsche for being rude claims ...
3,are you happy now that i:m currently pet sitting,t3_cfpfnu,1,are you happy now that im currently pet sitting
4,harold's drinking problem hindered his love li...,t3_cg17b6,1,harolds drinking problem hindered his love lif...
5,i:ve read so many women embarassed about buyin...,t3_cg7qpo,1,ive read so many women embarassed about buying...
6,what is with that one weird weight bar that do...,t3_cg6k1o,1,what is with that one weird weight bar that do...


In [16]:
#Shuffles dataset before modelling
df = shuffle(df)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,post_title,name,bot,post_title_clean
0,it would really suck if you had to charge your...,t3_cg0wms,0,it would really suck if you had to charge your...
1,someone probably had a fortune cookie with a w...,t3_cg4ql3,0,someone probably had a fortune cookie with a w...
2,maybe you are the only real human on the world...,t3_cg89eu,0,maybe you are the only real human on the world...
3,what should my pay actually be? i7 7700k overc...,t3_cc5unz,1,what should my pay actually be i7 7700k overcl...
4,comment on a vacation where he was high hopes ...,t3_byy6bs,1,comment on a vacation where he was high hopes ...


## NLP Modelling
Post titles are used as the features in the design matrix.
<br>The model target is the 'bot' column, which indicates whether a post is from SubredditSimulator (bot, i.e. 1) or Showerthoughts (not a bot, i.e. 0).

In [17]:
X = df['post_title_clean']
y = df['bot']

In [18]:
#Splits data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42,stratify=y)
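
As a rough check of what `stratify=y` does in the split above, the sketch below repeats the call on placeholder data with the same class counts (857 human, 850 bot). `X_demo` and `y_demo` are illustrative names; the real titles are not needed to see the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder labels with the same class counts as the notebook
y_demo = [0] * 857 + [1] * 850
X_demo = list(range(len(y_demo)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Sizes of the two partitions and the class balance inside each
print(len(y_tr), len(y_te))
print(Counter(y_tr), Counter(y_te))
```

Both partitions should keep roughly the same near 50/50 class balance as the full dataset, which is why plain accuracy is a reasonable first metric here.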

<b>The following classifiers are tested:
<br>(1) Multinomial Naive Bayes (MNB) Classifier
<br>(2) Bernoulli Naive Bayes (BNB) Classifier
<br>(3) Logistic Regression (LR) Classifier</b>

## Count Vectorizer 

Converts the text of each post title into a matrix of token counts.
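
To make "matrix of token counts" concrete, here is a small sketch on two made-up sentences (not taken from the dataset). It uses the default settings, so unlike the pipelines below it applies no stop-word removal and no `max_features` cap.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]

cvec_demo = CountVectorizer()
counts = cvec_demo.fit_transform(docs)

# Order the vocabulary by column index so it lines up with the matrix columns
vocab = sorted(cvec_demo.vocabulary_, key=cvec_demo.vocabulary_.get)
print(vocab)
# ['cat', 'on', 'sat', 'the']
print(counts.toarray())
# [[1 0 1 1]
#  [2 1 1 2]]
```

Each row is one document and each column is one token; the entries are how often that token occurs in that document.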

<b>Using MNB on count vectorised features</b>

In [19]:
pipe = Pipeline([
    ('cvec', CountVectorizer(max_features=500, stop_words='english')),
    ('nb', MultinomialNB()),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [20]:
cvec_nb_score = pipe.score(X_test,y_test)
cvec_nb_score 

0.6346604215456675

In [21]:
cross_val_score(pipe, X, y, cv=5).mean()

0.6784037317144278

<b>Using BNB on count vectorised features</b>

In [22]:
pipe = Pipeline([
    ('cvec', CountVectorizer(max_features=500, stop_words='english')),
    ('bn', BernoulliNB()),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('bn', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])

In [23]:
cvec_bn_score = pipe.score(X_test,y_test)
cvec_bn_score 

0.6838407494145199

In [24]:
cross_val_score(pipe, X, y, cv=5).mean()

0.7082883160981635

<b>Using LR on count vectorised features</b>

In [25]:
pipe = Pipeline([
    ('cvec', CountVectorizer(max_features=500, stop_words='english')),
    ('lr', LogisticRegression(solver='lbfgs')),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        s...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [26]:
cvec_lr_score = pipe.score(X_test,y_test)
cvec_lr_score

0.65807962529274

In [27]:
cross_val_score(pipe, X, y, cv=5).mean()

0.6936307043267995

## TF-IDF Vectorizer
Converts the text of each post title into a matrix of term frequency–inverse document frequency (TF-IDF) features. This statistic reflects how important a word is to a single title relative to the whole collection of titles: words that appear in many titles are weighted down.
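
A small sketch of that weighting, using three made-up documents and the default `TfidfVectorizer` settings: "post" appears in every document while "bot" appears in only one, so within the first document "bot" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["bot post", "human post", "another post"]

tvec_demo = TfidfVectorizer()
weights = tvec_demo.fit_transform(docs).toarray()
col = tvec_demo.vocabulary_  # maps each token to its column index

# Weights of the two tokens in the first document
w_bot = weights[0, col["bot"]]
w_post = weights[0, col["post"]]
print(w_bot > w_post)
# True
```

With the default `norm='l2'`, each row is also scaled to unit length, so the weights are comparable across titles of different lengths.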

<b>Using MNB on tf-idf vectorised features</b>

In [35]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer(max_features=500, stop_words='english')),
    ('nb', MultinomialNB()),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...True,
        vocabulary=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [36]:
tvec_nb_score = pipe.score(X_test,y_test)
tvec_nb_score

0.629976580796253

In [37]:
cross_val_score(pipe, X, y, cv=5).mean()

0.6737065047761142

<b>Using BNB on tf-idf vectorised features</b>

In [38]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer(max_features=500, stop_words='english')),
    ('bn', BernoulliNB()),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...  vocabulary=None)), ('bn', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])

In [39]:
tvec_bn_score = pipe.score(X_test,y_test)
tvec_bn_score

0.6838407494145199

In [40]:
cross_val_score(pipe, X, y, cv=5).mean()

0.7082883160981635
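
The BNB scores here are identical to those from the count-vectorised run. A likely explanation is that `BernoulliNB` with its default `binarize=0.0` only looks at whether a feature is greater than zero, and count and tf-idf matrices built from the same vocabulary are non-zero in exactly the same cells. The sketch below checks that on made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "bots post odd titles",
    "humans post shower thoughts",
    "odd odd thoughts",
]

counts = CountVectorizer().fit_transform(docs).toarray()
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

# After thresholding at 0 (what BernoulliNB does internally by default),
# both representations reduce to the same presence/absence matrix
same_pattern = np.array_equal(counts > 0, tfidf > 0)
print(same_pattern)
# True
```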

<b>Using LR on tf-idf vectorised features</b>

In [41]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer(max_features=500, stop_words='english')),
    ('lr', LogisticRegression(solver='lbfgs')),
])

pipe.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('tvec', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=500, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [42]:
tvec_lr_score = pipe.score(X_test,y_test)
tvec_lr_score 

0.6604215456674473

In [43]:
cross_val_score(pipe, X, y, cv=5).mean()

0.6936307043267995

## Hash Vectorizer
Converts the text of each post title into a matrix of token occurrences. A hash function is applied to each token and the resulting hash value is used as its column index, so no vocabulary needs to be stored.
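
Note that the `non_negative` argument used in the cells below was deprecated in scikit-learn 0.19 and is not accepted by recent releases. The sketch below assumes `alternate_sign=False` as the closest replacement for keeping the hashed features non-negative, which `MultinomialNB` requires. The titles and labels are made up, and results on a newer version may not match the scores recorded in this notebook exactly.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up titles and labels, for illustration only
titles = [
    "clear toothpaste tube makes sense",
    "cops tow golden porsche",
    "bed has seen more",
    "weird weight bar",
]
labels = [0, 1, 0, 1]

hash_pipe = Pipeline([
    # alternate_sign=False keeps every hashed value >= 0 (assumed replacement)
    ('hvec', HashingVectorizer(stop_words='english', alternate_sign=False)),
    ('nb', MultinomialNB()),
])
hash_pipe.fit(titles, labels)

hashed = hash_pipe.named_steps['hvec'].transform(titles)
print(hashed.shape[1])    # number of hash buckets (default n_features)
print(hashed.min() >= 0)  # no negative feature values
```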

<b>Using MNB on hash vectorised features</b>

In [44]:
pipe = Pipeline([
    ('hvec', HashingVectorizer(stop_words='english', non_negative='total')),
    ('nb', MultinomialNB()),
])

pipe.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('hvec', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative='total',
         norm='l2', preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None)), ('nb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [45]:
hvec_nb_score = pipe.score(X_test,y_test)
hvec_nb_score



0.7096018735362998

In [46]:
cross_val_score(pipe, X, y, cv=5).mean()



0.7228859048893005

<b>Using BNB on hash vectorised features</b>

In [47]:
pipe = Pipeline([
    ('hvec', HashingVectorizer(stop_words='english', non_negative='total')),
    ('bn', BernoulliNB()),
])

pipe.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('hvec', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative='total',
         norm='l2', ...   tokenizer=None)), ('bn', BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True))])

In [48]:
hvec_bn_score = pipe.score(X_test,y_test)
hvec_bn_score



0.5011709601873536

In [49]:
cross_val_score(pipe, X, y, cv=5).mean()



0.5020493560391693

<b>Using LR on hash vectorised features</b>

In [50]:
pipe = Pipeline([
    ('hvec', HashingVectorizer(stop_words='english', non_negative='total')),
    ('lr', LogisticRegression(solver='lbfgs')),
])

pipe.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('hvec', HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative='total',
         norm='l2', ...enalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))])

In [51]:
hvec_lr_score = pipe.score(X_test,y_test)
hvec_lr_score 



0.7236533957845434

In [52]:
cross_val_score(pipe, X, y, cv=5).mean()



0.71764332630207

## Models Evaluation & Recommendation