# Contents
- [Imports](#import)
- [Punctuation and Numeric Values](#clean)
- [Models](#model)
- [Tf-Idf](#tfidf)
- [Max Features](#maxft)
- [2-gram Count Vectorization](#2gram)

---
# Imports<a id=import></a>

In [127]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

lemmatizer = WordNetLemmatizer()
stop=stopwords.words('english')

We import the csv created from the earlier steps in our process.

In [7]:
cleaned=pd.read_csv(r'..\datasets\cleaned.csv')

---
# Punctuation and Numeric Values<a id=clean></a>

We create a function which allows us to specify if we wish to clean punctuation and numeric values.

In [122]:
def furtherclean(df,feature,stop_words=True,digits=True,punct=True):
    df1=df.copy()
    df1[feature]=df1['text']
    # clean with stop words
    if stop_words==True:
        df1[feature]=df1[feature].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    # drop numeric values
    if digits==True:
        df1[feature]=df1[feature].str.replace('\d+', ' ')
    # drop all punctuation
    if punct==True:
        df1[feature]=df1[feature].str.replace(r'[^\s\w]+', ' ')
    return df1

We create 4 DataFrames with different forms of cleaning and check their output.

In [9]:
cleaned_stop=furtherclean(cleaned,'stoptext',digits=False,punct=False) # only stop words are removed
cleaned_stoppunct=furtherclean(cleaned,'stoptext',digits=False) # stop words and punctuation are removed
cleaned_stopdigit=furtherclean(cleaned,'stoptext',punct=False) # stop words and numeric values are removed
cleaned_all=furtherclean(cleaned,'stoptext') # stop words, punctuation and numeric values are removed 

In [8]:
print('cleaned: ' + cleaned.text[5]+'\n')
print('cleaned_stop: ' + cleaned_stop.stoptext[5]+'\n')
print('cleaned_stoppunct: ' + cleaned_stoppunct.stoptext[5]+'\n')
print('cleaned_stopdigit: ' + cleaned_stopdigit.stoptext[5]+'\n')
print('cleaned_all: ' + cleaned_all.stoptext[5]+'\n')

cleaned: does pressure in a sealed container rise as it ascends in altitude?the source of this discussion is talking about inflating an inflatable stand up paddleboard at lower altitude and then driving it up to a mountain lake. these paddle boards have a stiff strong structure that holds their shape. they are designed to hold aprox. 15psi. if someone was to fill the paddleboard to 15psi say at 4,000ft altitude, then drive it to a mountain lake at say 8,000ft altitude, will the pressure in the paddleboard change from the change in altitude or is 15psi in a container, 15psi regardless of ambient pressure?

cleaned_stop: pressure sealed container rise ascends altitude?the source discussion talking inflating inflatable stand paddleboard lower altitude driving mountain lake. paddle boards stiff strong structure holds shape. designed hold aprox. 15psi. someone fill paddleboard 15psi say 4,000ft altitude, drive mountain lake say 8,000ft altitude, pressure paddleboard change change altitude 1

We observe that our cleaning has taken place without any issues for the 4 dataframes.<br/>

Next, we create a function what will perform count vecorization and score a logistic regression on whatever dataframe we send as an input.

In [9]:
def logregr(df,framename):
    logreg=LogisticRegression()
    cvec = CountVectorizer(stop_words='english',strip_accents='unicode')
    X_train,X_test,y_train,y_test=train_test_split(df.stoptext,df.subreddit,random_state=42,stratify=df.subreddit)
    X_train=pd.DataFrame(cvec.fit_transform(X_train).todense(),columns=cvec.get_feature_names())
    X_test=pd.DataFrame(cvec.transform(X_test).todense(),columns=cvec.get_feature_names())
    logreg.fit(X_train,y_train)
    score=logreg.score(X_test,y_test)
    print('The score for '+framename+' is '+str(score))

In [10]:
logregr(cleaned_stop,'cleaned_stop')
logregr(cleaned_stoppunct,'cleaned_stoppunct')
logregr(cleaned_stopdigit,'cleaned_stopdigit')
logregr(cleaned_all,'cleaned_all')



The score for cleaned_stop is 0.8734939759036144




The score for cleaned_stoppunct is 0.8734939759036144




The score for cleaned_stopdigit is 0.8714859437751004
The score for cleaned_all is 0.8714859437751004




There seems to be minimal difference in the methods of cleaning.<br/>
Following the logic that digits and punctuation should not be part of our classifier, we will utilize the 'cleaned_all' dataframe for the following model tests.

---
# Models<a id=model></a>

The following code has been adapted from David S. Batista from his [blog](http://www.davidsbatista.net/blog/2018/02/23/model_optimization/).<br/>
This class is written to allow us to perform a GridSearch on a predefined dictionary of models and their parameters and output their scores in a nicely formatted dataframe.<br/>

In [10]:
#code found from http://www.davidsbatista.net/blog/2018/02/23/model_optimization/
class EstimatorSelectionHelper:

    def __init__(self, models, params):
        if not set(models.keys()).issubset(set(params.keys())):
            missing_params = list(set(models.keys()) - set(params.keys()))
            raise ValueError("Some estimators are missing parameters: %s" % missing_params)
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv=3, n_jobs=-1, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print("Running GridSearchCV for %s." % key)
            model = self.models[key]
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, refit=refit,
                              return_train_score=True)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            print(k)
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns]


In [48]:
classifier_models = {
    'LogisticRegression' : LogisticRegression(random_state = 42),
    'KNN': KNeighborsClassifier(), 
    'NaiveBayes' : MultinomialNB(),
    'DecisionTree' : DecisionTreeClassifier(random_state = 42), 
    'BaggedDecisionTree' : BaggingClassifier(random_state = 42),
    'RandomForest' : RandomForestClassifier(random_state = 42), 
    'ExtraTrees' : ExtraTreesClassifier(random_state = 42), 
    'AdaBoost' : AdaBoostClassifier(random_state=42), 
    'GradientBoosting' : GradientBoostingClassifier(random_state = 42),
    'SVM' : SVC(random_state=42),
}

In [12]:
classifier_model_params = {
    'LogisticRegression' : {
        'penalty' : ['l1', 'l2'],
        'C' : np.arange(.05, 1, .05) },
    'KNN' : {
        'n_neighbors' : np.arange(3, 22, 2) },
    'NaiveBayes' : {
        'alpha' : np.arange(.05, 2, .05)},
    'DecisionTree': {
        'max_depth' : [None, 6, 10, 14], 
        'min_samples_leaf' : [1, 2],
        'min_samples_split': [2, 3] },
    'BaggedDecisionTree' : {
        'n_estimators' : [20, 60, 100] },
    'RandomForest' : {
        'n_estimators' : [20, 60, 100],
        'max_depth' : [None, 2, 6, 10],
        'min_samples_split' : [2, 3, 4] },
    'ExtraTrees' : {
        'n_estimators' : [20, 60, 100],
        'max_depth' : [None, 6, 10, 14],
        'min_samples_leaf' : [1, 2], 
        'min_samples_split' : [2, 3], },
    'AdaBoost' : {
        'n_estimators' : np.arange(100, 151, 25),
        'learning_rate' : np.linspace(0.05, 1, 10) },
    'GradientBoosting' : {
        'n_estimators' : np.arange(5, 150, 15),
        'learning_rate' : np.linspace(0.05, 1, 10),
        'max_depth' : [1, 2, 3] },
    'SVM' : {
        'C' : np.arange(0.05, 1, .05),
        'kernel' : ['rbf', 'linear'] },
        }

We count vectorize our 'cleaned_all' dataframe,

In [130]:
df=cleaned_all
cvec = CountVectorizer(stop_words='english',strip_accents='unicode')
X_train,X_test,y_train,y_test=train_test_split(df.stoptext,df.subreddit,random_state=42,stratify=df.subreddit)
X_train=pd.DataFrame(cvec.fit_transform(X_train).todense(),columns=cvec.get_feature_names())
X_test=pd.DataFrame(cvec.transform(X_test).todense(),columns=cvec.get_feature_names())    

and we perform our GridSearchCV model fittings.<br/>
We use the 'f1' scoring method to grade the precision and recall of our classification models. 

In [14]:
search = EstimatorSelectionHelper(classifier_models, classifier_model_params)
search.fit(X_train, y_train, scoring='f1', n_jobs=-1)

Running GridSearchCV for LogisticRegression.
Fitting 3 folds for each of 38 candidates, totalling 114 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done 114 out of 114 | elapsed:   15.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for KNN.
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  3.3min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for NaiveBayes.
Fitting 3 folds for each of 39 candidates, totalling 117 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 117 out of 117 | elapsed:   10.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for DecisionTree.
Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:   24.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for BaggedDecisionTree.
Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Done   4 out of   9 | elapsed:  1.2min remaining:  1.5min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.4min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for RandomForest.
Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   30.2s
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   48.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for ExtraTrees.
Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  2.8min finished


We then score our different models and output our results to a csv for archival.

In [15]:
score1=search.score_summary(sort_by='max_score')

LogisticRegression
KNN
NaiveBayes
DecisionTree
BaggedDecisionTree
RandomForest
ExtraTrees


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.




In [23]:
score1.to_csv(r'..\datasets\score1.csv',index=False)

In [123]:
# run this to read in the different model scores on future runs
score1=pd.read_csv(r'..\datasets\score1.csv')

Next, we will sort our scores according to the highest mean 'f1' scores obtained for the models.

In [124]:
score1[score1['mean_score'] == score1.groupby('estimator')['mean_score'].transform('max')]

Unnamed: 0,C,alpha,estimator,kernel,learning_rate,max_depth,max_score,mean_score,min_samples_leaf,min_samples_split,min_score,n_estimators,n_neighbors,penalty,std_score
2,,1.15,NaiveBayes,,,,0.924603,0.912368,,,0.899804,,,,0.010127
39,0.05,,LogisticRegression,,,,0.900862,0.872827,,,0.847682,,,l2,0.021807
45,,,ExtraTrees,,,,0.89441,0.874189,1.0,2.0,0.851613,60.0,,,0.017551
66,,,RandomForest,,,,0.87965,0.864767,,4.0,0.849673,100.0,,,0.012239
78,,,BaggedDecisionTree,,,,0.858388,0.84216,,,0.827133,100.0,,,0.012788
103,,,DecisionTree,,,,0.810127,0.802633,1.0,2.0,0.794355,,,,0.006463
180,,,KNN,,,,0.438272,0.390985,,,0.355263,,3.0,,0.034861
190,,,AdaBoost,,0.577778,,0.872017,0.848122,,,0.828194,125.0,,,0.018109
194,0.7,,SVM,linear,,,0.87069,0.850494,,,0.83691,,,,0.014562
241,,,GradientBoosting,,0.366667,3.0,0.848214,0.842022,,,0.836689,110.0,,,0.004744


It seems that the Multinomial Naive Bayes model gives us the best mean f1 score at 0.91 as opposed to the other models.<br/>
In general, only the Multinomial (max f1=0.92), Logistic Regression (max f1=0.90) and Extra Trees model (max f1=0.89) performed better than our baseline model with an f1 score of 0.889.<br/>
The rest of the models still performed admirably well with the f1 scores sitting above 0.79, except for the k-Nearest Neighbouts model which gave us an absymal best score of 0.43.<br/>

In [131]:
mnb=MultinomialNB(alpha=1.15)
mnb.fit(X_train,y_train)
pred=mnb.predict(X_test)
f1_score(y_test,pred)

0.8988326848249028

Predicting on our test set, our Multinomial classifier with an alpha parameter of 1.15 gives us an f1 score of 0.89 which beats our baseline by 0.01.<br/>
MultinomialNB is our best classifier though the difference  from our baseline Logistic Regression classifier is minimal. 

---
# Tf-Idf <a id=tfidf></a>

Next, we investigate if weighting our tokens with the *term frequency.inverse document frequency* method will yield any significant improvements to our model. 

In [132]:
df=cleaned_all
tvec = TfidfVectorizer(stop_words='english',strip_accents='unicode')
X_train,X_test,y_train,y_test=train_test_split(df.stoptext,df.subreddit,random_state=42,stratify=df.subreddit)
X_train=pd.DataFrame(tvec.fit_transform(X_train).todense(),columns=tvec.get_feature_names())
X_test=pd.DataFrame(tvec.transform(X_test).todense(),columns=tvec.get_feature_names())    

We perform a GridSearch with all our different classifiers again,

In [49]:
search = EstimatorSelectionHelper(classifier_models, classifier_model_params)
search.fit(X_train, y_train, scoring='f1', n_jobs=-1)

Running GridSearchCV for AdaBoost.
Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  9.3min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for GradientBoosting.
Fitting 3 folds for each of 300 candidates, totalling 900 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 18.6min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed: 43.8min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 118.4min
[Parallel(n_jobs=-1)]: Done 900 out of 900 | elapsed: 132.9min finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Running GridSearchCV for SVM.
Fitting 3 folds for each of 38 candidates, totalling 114 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 114 out of 114 | elapsed: 13.7min finished


and we archive our GridSearch f1 scores externally in a csv.

In [None]:
score2=search.score_summary(sort_by='max_score')

In [56]:
score2.to_csv(r'..\datasets\score2.csv')

In [125]:
# run this to read in the different model scores on future runs
score2=pd.read_csv(r'..\datasets\score2.csv')

In [126]:
score2[score2['mean_score'] == score2.groupby('estimator')['mean_score'].transform('max')]

Unnamed: 0.1,Unnamed: 0,C,alpha,estimator,kernel,learning_rate,max_depth,max_score,mean_score,min_samples_leaf,min_samples_split,min_score,n_estimators,n_neighbors,penalty,std_score
5,79,,1.6,NaiveBayes,,,,0.921212,0.916491,,,0.909449,,,,0.005075
8,78,,1.55,NaiveBayes,,,,0.921212,0.916491,,,0.909449,,,,0.005075
39,37,0.95,,LogisticRegression,,,,0.91134,0.900907,,,0.883817,,,l2,0.012182
40,33,0.85,,LogisticRegression,,,,0.91134,0.900907,,,0.883817,,,l2,0.012182
41,35,0.9,,LogisticRegression,,,,0.91134,0.900907,,,0.883817,,,l2,0.012182
59,153,,,ExtraTrees,,,,0.89916,0.886625,2.0,3.0,0.868085,100.0,,,0.013378
60,150,,,ExtraTrees,,,,0.89916,0.886625,2.0,2.0,0.868085,100.0,,,0.013378
73,114,,,RandomForest,,,,0.869565,0.86319,,4.0,0.852747,100.0,,,0.007444
86,47,,,KNN,,,,0.858333,0.833795,,,0.806452,,21.0,,0.021273
101,105,,,BaggedDecisionTree,,,,0.844639,0.831524,,,0.823266,100.0,,,0.009377


Upon sorting our models by the highest mean score, we observe that the Multinomial Naive Bayes classifier still gives us the best results as expected.<br/>
However, the f1 score did not have any significant improvement with the mean score still sitting at 0.91.<br/>
We will use the best alpha parameter of 1.60 for our MultinomialNB to predict on our training set.

In [134]:
mnb=MultinomialNB(alpha=1.60)
mnb.fit(X_train,y_train)
pred=mnb.predict(X_test)
f1_score(y_test,pred)

0.8949416342412451

Predicting on our test set, our f1 score actually drops to 0.894 though the drop is minimal. <br/>
With worse results in our Tf-Idf vectorizer score, we choose to use count vectorization for all following steps.

---
# Max Features<a id=maxft></a>

Next, we attempt to limit our maximum features (tokens) to 1000, from 7000+ features.<br/>
We investigate if limiting the amount of features cause any significant detriment to the predictive power of our models.

In [128]:
df=cleaned_all
cvec = CountVectorizer(stop_words='english',strip_accents='unicode',max_features=1000) #limit maximum features to 1000
X_train,X_test,y_train,y_test=train_test_split(df.stoptext,df.subreddit,random_state=42,stratify=df.subreddit)
X_train=pd.DataFrame(cvec.fit_transform(X_train).todense(),columns=cvec.get_feature_names())
X_test=pd.DataFrame(cvec.transform(X_test).todense(),columns=cvec.get_feature_names())    

We perform a GridSearch with only the Multinomial Naive Bayes classifier here.

In [73]:
gcv=GridSearchCV(MultinomialNB(),classifier_model_params['NaiveBayes'],scoring='f1',n_jobs=-1,verbose=1,cv=3)

In [74]:
gcv.fit(X_train,y_train)

Fitting 3 folds for each of 39 candidates, totalling 117 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 117 out of 117 | elapsed:    1.1s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'alpha': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  , 1.05, 1.1 ,
       1.15, 1.2 , 1.25, 1.3 , 1.35, 1.4 , 1.45, 1.5 , 1.55, 1.6 , 1.65,
       1.7 , 1.75, 1.8 , 1.85, 1.9 , 1.95])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=1)

In [79]:
pd.DataFrame(gcv.cv_results_).sort_values('mean_test_score',ascending=False)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
26,0.038894,0.004308,0.017951,0.011313,1.35,{'alpha': 1.35},0.909091,0.888889,0.903491,0.900494,0.008519,1,0.942132,0.950655,0.93712,0.943302,0.005587
27,0.032581,0.002618,0.012632,0.003672,1.4,{'alpha': 1.4000000000000001},0.909091,0.888889,0.903491,0.900494,0.008519,1,0.942132,0.950655,0.93712,0.943302,0.005587
28,0.041221,0.011753,0.009974,0.000814,1.45,{'alpha': 1.4500000000000002},0.909091,0.888889,0.903491,0.900494,0.008519,1,0.942132,0.950655,0.93712,0.943302,0.005587
30,0.031913,0.002935,0.009973,0.00141,1.55,{'alpha': 1.55},0.909091,0.88668,0.903093,0.899625,0.009476,4,0.942132,0.950655,0.93401,0.942266,0.006796
32,0.041887,0.007236,0.01762,0.009152,1.65,{'alpha': 1.6500000000000001},0.909091,0.88668,0.903093,0.899625,0.009476,4,0.941176,0.950655,0.93401,0.941947,0.006817
31,0.04189,0.00881,0.009308,0.000469,1.6,{'alpha': 1.6},0.909091,0.88668,0.903093,0.899625,0.009476,4,0.941176,0.950655,0.93401,0.941947,0.006817
35,0.045211,0.010628,0.009642,0.002619,1.8,{'alpha': 1.8},0.909091,0.884462,0.904959,0.899507,0.010771,7,0.940102,0.949597,0.931841,0.940513,0.007254
38,0.023271,0.013164,0.00532,0.002618,1.95,{'alpha': 1.9500000000000002},0.91129,0.882236,0.904959,0.899499,0.012478,8,0.937945,0.950555,0.931841,0.940114,0.007792
18,0.034575,0.001695,0.009973,0.000814,0.95,{'alpha': 0.9500000000000001},0.905051,0.887574,0.905738,0.899454,0.008405,9,0.943205,0.951613,0.941414,0.945411,0.004446
25,0.038901,0.006366,0.009636,0.001242,1.3,{'alpha': 1.3},0.906883,0.887129,0.903491,0.89917,0.008626,10,0.942132,0.950655,0.93712,0.943302,0.005587


We obtain our best alpha parameter of 1.35 and predict on our test set in the following step.

In [129]:
mnb=MultinomialNB(alpha=1.35)
mnb.fit(X_train,y_train)
pred=mnb.predict(X_test)
f1_score(y_test,pred)

0.8651911468812877

Our f1 score drops from 0.89 to 0.86. The drop is not too significant for an 85% reduction in the dimensionality of our features.<br/>
It seems like the first 1000 features are sufficient predictors for our subreddit classification purposes, though we will still use our full set of features for our most optimized model.

---
# 2-gram Count Vectorization<a id=2gram></a>

Here, we investigate how including bigrams into our list of features will affect the predictive capabilities of our model. 

In [135]:
df=cleaned_all
cvec = CountVectorizer(stop_words='english',strip_accents='unicode',max_features=7236,ngram_range=(1, 2)) #set max features to 7236 for a fair comparison to our baseline model.
X_train,X_test,y_train,y_test=train_test_split(df.stoptext,df.subreddit,random_state=42,stratify=df.subreddit)
X_train=pd.DataFrame(cvec.fit_transform(X_train).todense(),columns=cvec.get_feature_names())
X_test=pd.DataFrame(cvec.transform(X_test).todense(),columns=cvec.get_feature_names())    

In [136]:
word_counts = X_train.sum(axis=0)
print(word_counts.sort_values(ascending = False).head(50))

amp           646
question      451
help          389
know          388
time          350
like          333
physics       294
answer        268
problem       267
force         256
energy        244
need          235
ve            196
equation      190
number        186
speed         183
thanks        183
different     176
way           173
use           172
right         172
point         167
understand    164
mass          163
velocity      158
work          156
solve         153
light         145
field         144
possible      136
say           134
math          131
make          130
really        130
want          127
ball          126
using         125
example       124
correct       121
object        120
gt            120
think         119
numbers       119
wrong         119
function      119
got           118
sure          118
given         117
calculate     115
water         109
dtype: int64


Printing the most recurring features, it seems that recurring bigrams are not necessarily common as they do not appear in our top 50 features.<br/>

We attempt to list the most recurring bigrams according to the subreddit they appear in.

In [137]:
# Most common words in AskPhysics
physcom=X_train.loc[:,(pd.DataFrame(y_train)==1).subreddit.tolist()].sum(axis=0).sort_values(ascending=False)
# Most common words in askmath
mathcom=X_train.loc[:,(pd.DataFrame(y_train)==0).subreddit.tolist()].sum(axis=0).sort_values(ascending=False)

In [138]:
# we extract a sorted list of the most recurring physics bigrams in descending order
phys2gram=[x for x in physcom.index.tolist() if re.match(r'\w+\s\w+',x)]
# and we mask our original dataframe to return the sum of counts of the bigram from our AskPhysics subreddit
physcom[physcom.index.isin(phys2gram)==1]

amp question               17
cos sin                    16
amp edit                   15
amp know                   13
correct answer             13
amp help                   12
appreciate help            10
bits bits                  10
amp nbsp                    9
cos cos                     9
answer amp                  9
currently working           9
buoyant force               8
cos theta                   8
angular velocity            7
constant speed              7
chance winning              7
amp sure                    7
chance getting              6
acceleration gravity        6
cannon ball                 6
cross product               6
confidence interval         6
dark matter                 6
arrow time                  6
able help                   6
charged particle            6
classical mechanics         5
car traveling               5
decimal places              5
                           ..
answer appreciate           2
consumed products           2
contact an

Observing the most recurring bigrams, it seems that our bigrams are not very helpful as they rarely deal with physics related topics.<br/>
The most recurring physics-related bigram is 'buoyant force' with a sum of count of 8. 

In [139]:
# we extract a sorted list of the most recurring math bigrams in descending order
math2gram=[x for x in mathcom.index.tolist() if re.match(r'\w+\s\w+',x)]
# and we mask our original dataframe to return the sum of counts of the bigram from our askmath subreddit
mathcom[mathcom.index.isin(math2gram)==1]

amp amp                     38
black hole                  27
amp thanks                  23
black holes                 15
amp thank                   13
angular momentum            12
charge density              11
answer question             11
alembert principle          10
answer key                  10
age universe                10
center mass                  9
answer answer                9
amp ve                       8
actual price                 8
amp need                     7
cos amp                      7
dark energy                  7
ap physics                   7
amp example                  6
ap calc                      6
ball hits                    6
amp sin                      6
average speed                5
circular loop                5
correct amp                  5
centre mass                  5
best way                     5
bar labeled                  5
amp distance                 5
                            ..
considering acceleration     2
constant

Surprisingly, many physics related terms appear in our askmath subreddit.<br/>
Examples of these terms include *black hole*, *angular momentum* and *charge density*.<br/>
Perhaps many of the posters in the askmath subreddit are asking for math help regarding physics questions.<br/>

Next, we perform our GridSearch to obtain our best Multinomial parameters.

In [119]:
gcv=GridSearchCV(MultinomialNB(),classifier_model_params['NaiveBayes'],scoring='f1',n_jobs=-1,verbose=1,cv=3)

In [120]:
gcv.fit(X_train,y_train)

Fitting 3 folds for each of 39 candidates, totalling 117 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 117 out of 117 | elapsed:    9.7s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'alpha': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  , 1.05, 1.1 ,
       1.15, 1.2 , 1.25, 1.3 , 1.35, 1.4 , 1.45, 1.5 , 1.55, 1.6 , 1.65,
       1.7 , 1.75, 1.8 , 1.85, 1.9 , 1.95])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1', verbose=1)

In [121]:
pd.DataFrame(gcv.cv_results_).sort_values('mean_test_score',ascending=False)



Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
2,0.284904,0.016067,0.054521,0.006582,0.15,{'alpha': 0.15000000000000002},0.924901,0.900585,0.925852,0.917112,0.011693,1,0.988967,0.990991,0.982036,0.987331,0.003834
3,0.254985,0.029922,0.05452,0.012466,0.2,{'alpha': 0.2},0.922772,0.900196,0.927711,0.91689,0.011975,2,0.986961,0.98998,0.981019,0.985987,0.003723
38,0.277257,0.01017,0.041223,0.008553,1.95,{'alpha': 1.9500000000000002},0.927022,0.899408,0.924,0.916812,0.012368,3,0.974,0.978979,0.969093,0.974024,0.004036
32,0.280581,0.006324,0.051529,0.001245,1.65,{'alpha': 1.6500000000000001},0.927022,0.899408,0.923695,0.91671,0.01231,4,0.974,0.97996,0.971087,0.975016,0.003693
33,0.268613,0.012196,0.045213,0.008157,1.7,{'alpha': 1.7000000000000002},0.927022,0.899408,0.923695,0.91671,0.01231,4,0.974,0.97996,0.97012,0.974693,0.004047
29,0.276927,0.011866,0.051196,0.007021,1.5,{'alpha': 1.5000000000000002},0.927022,0.901186,0.921844,0.916687,0.011163,6,0.974,0.97996,0.971087,0.975016,0.003693
30,0.28457,0.010844,0.047872,0.008619,1.55,{'alpha': 1.55},0.927022,0.901186,0.921844,0.916687,0.011163,6,0.974,0.97996,0.971087,0.975016,0.003693
1,0.29521,0.021406,0.057513,0.003291,0.1,{'alpha': 0.1},0.931238,0.899225,0.918812,0.916433,0.013183,8,0.98998,0.990991,0.984032,0.988334,0.00307
35,0.284239,0.017522,0.053857,0.003732,1.8,{'alpha': 1.8},0.925197,0.899408,0.923695,0.916101,0.011819,9,0.974975,0.97996,0.97012,0.975018,0.004017
34,0.260303,0.006463,0.053524,0.002351,1.75,{'alpha': 1.7500000000000002},0.925197,0.899408,0.923695,0.916101,0.011819,9,0.974,0.97996,0.97012,0.974693,0.004047


In [140]:
mnb=MultinomialNB(alpha=0.15)
mnb.fit(X_train,y_train)
pred=mnb.predict(X_test)
f1_score(y_test,pred)

0.8875968992248061

Our score drops to 0.88 from 0.89. It is actually better to use only single words as our features in our classifier model.