In [1]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk.corpus import stopwords
from utils import run_classifiers

In [2]:
#load source data

path="../data/combined.pickle"

try:
    with open(path,'rb') as handle:
        pickleload=pickle.load(handle)
except FileNotFoundError as e:
    e.strerror = "Please run 01_scrape_reddit first to pull the data, then 02_EDA to merge it."
    raise e

df=pd.DataFrame(pickleload)

df

Unnamed: 0,post,label
0,this is the nail in the coffin for the idea of...,0
1,"i’m closeted, always been, and always will be....",0
2,the fifa world cup in qatar should be a remind...,0
3,we moved from the dc metro area last year to t...,0
4,they spend so much time focusing on arbitrary ...,0
...,...,...
10078,"if jesus died for our sins, what's keeping u f...",1
10079,hello everybody it may seem like a dumb questi...,1
10080,today's readings: 1 corinthian 1:4-8 &gt;i tha...,1
10081,i don't propose this question in the sense of ...,1


### Prepare the dataset, pass through `CountVectorizer` and look at the features

In [3]:
X=df['post']
y=df['label']

In [4]:
X

0        this is the nail in the coffin for the idea of...
1        i’m closeted, always been, and always will be....
2        the fifa world cup in qatar should be a remind...
3        we moved from the dc metro area last year to t...
4        they spend so much time focusing on arbitrary ...
                               ...                        
10078    if jesus died for our sins, what's keeping u f...
10079    hello everybody it may seem like a dumb questi...
10080    today's readings: 1 corinthian 1:4-8 &gt;i tha...
10081    i don't propose this question in the sense of ...
10082    i'd like to preface this by saying that i've n...
Name: post, Length: 10083, dtype: object

In [5]:
cv=CountVectorizer(stop_words=stopwords.words("english"))
X=df['post']

cv.fit_transform(X)
feat_list=cv.get_feature_names_out()
#number of features
print(f"Number of features: {len(feat_list)}")
print("Features:")
print(feat_list)

Number of features: 33838
Features:
['00' '000' '000ish' ... '𝚖𝚊𝚔𝚎𝚛' '𝚗𝚘' '𝚠𝚊𝚝𝚌𝚑']


Note that the vocabulary has about 34k features (33,838), and many of them are numbers.  
Let's remove the numbers and see what remains.

In [6]:
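# raw string avoids invalid-escape warnings; the pattern keeps runs of letters only (no digits or underscores)
cv=CountVectorizer(stop_words=stopwords.words("english"),token_pattern=r"[^\W\d_]+")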
X=df['post']

cv.fit_transform(X)
feat_list=cv.get_feature_names_out()
#number of features
print(f"Number of features: {len(feat_list)}")
print("Features:")
print(feat_list)

Number of features: 32819
Features:
['aa' 'aaaaaaamen' 'aaaand' ... '𝚖𝚊𝚔𝚎𝚛' '𝚗𝚘' '𝚠𝚊𝚝𝚌𝚑']


The feature list shrinks to about 33k (32,819).
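
To see what the custom `token_pattern` does on its own, here is a minimal sketch using only Python's `re` module. The sample sentence is invented for illustration and is not from the dataset; `CountVectorizer` also lowercases text and drops stop words, which this sketch does not replicate.

```python
import re

# Same pattern as passed to CountVectorizer above: one or more characters
# that are word characters but neither digits nor underscores, i.e. letters.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")

# Hypothetical sample text, made up for this example.
sample = "top 10 posts of 2022 by user_42 were 000ish"

tokens = LETTERS_ONLY.findall(sample)
print(tokens)
# ['top', 'posts', 'of', 'by', 'user', 'were', 'ish']
```

Note that mixed tokens are split rather than dropped: `000ish` still contributes `ish`, and `user_42` contributes `user`.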

## Run Preliminary Classification on Naive Bayes

The hyperparameters are tuned with `RandomizedSearchCV` at the same time.
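
The `run_classifiers` helper comes from the project's `utils` module, which is not shown in this notebook. As a rough idea of what such a helper could do, here is a minimal sketch that wraps a vectorizer and a classifier in a scikit-learn `Pipeline` and searches it with `RandomizedSearchCV`. The function name `search_text_classifier`, its arguments, the `"clf"` step name, and the toy corpus are all assumptions made for illustration, not the actual implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def search_text_classifier(clf, X, y, float_params, tfidf=False, n_iter=30, cv=5):
    """Hypothetical stand-in for utils.run_classifiers (the real helper is not shown)."""
    # The step name ("cvec" or "tvec") must match the prefix in the parameter keys.
    step = ("tvec", TfidfVectorizer()) if tfidf else ("cvec", CountVectorizer())
    pipe = Pipeline([step, ("clf", clf)])
    search = RandomizedSearchCV(
        pipe,
        param_distributions=float_params,
        n_iter=n_iter,
        scoring="roc_auc",  # rank candidates by cross-validated ROC AUC
        cv=cv,
        random_state=42,
    )
    search.fit(X, y)
    return search


# Toy corpus, invented for illustration only.
texts = [
    "church prayer faith", "prayer faith bible",
    "faith bible church", "bible church prayer",
    "election vote policy", "vote policy senate",
    "policy senate election", "senate election vote",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search = search_text_classifier(
    MultinomialNB(), texts, labels,
    float_params={"cvec__ngram_range": [(1, 1), (1, 2)]},
    n_iter=2, cv=2,
)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the vocabulary never sees the held-out fold.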

Note that we use ROC AUC as the metric instead of the usual accuracy.  
ROC AUC is more comprehensive because it accounts for both the true positive rate (TPR) and the false positive rate (FPR) across all classification thresholds.  
It is also a suitable choice here because the labels are split equally, i.e. the classes are balanced.  
Accuracy, by contrast, only counts the share of correctly predicted classes at a single threshold.
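
To make the difference concrete, ROC AUC can be read as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below computes it that way in plain Python and compares it with accuracy at a 0.5 threshold. The function name `pairwise_auc` and the scores are made up for illustration; the notebook itself relies on scikit-learn's scoring.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.45, 0.8]

auc = pairwise_auc(y_true, scores)

# Accuracy depends on where the decision threshold is placed.
preds = [1 if s >= 0.5 else 0 for s in scores]
acc = sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

print(auc, acc)
# 1.0 0.75
```

Every positive outranks every negative here, so the AUC is perfect even though the 0.5 threshold misclassifies one post.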

In [7]:
classifiers_list=[
    {
        'cls':MultinomialNB(),
        'name':'NaiveBayes',
        'float_params':{
            'cvec__max_features':range(3000,4000,100),
            'cvec__max_df':[0.6,0.7,0.8],
            'cvec__min_df':[0.05,0.1,0.15],
        }
    },
]
run_classifiers(classifiers_list,X,y)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best parameters and accuracy
{'cvec__min_df': 0.05, 'cvec__max_features': 3000, 'cvec__max_df': 0.8}
ROC AUC with CV=5: 0.8118935063167214


That's an AUC of about 0.81.  
However, `CountVectorizer` uses raw counts, so longer posts carry more weight, which is somewhat unfair.  
In addition, a word that shows up in every post has little value for classification.  
To address both issues, we apply a Term Frequency-Inverse Document Frequency (TF-IDF) transformation.
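
Both effects can be sketched by hand on a tiny example. The code below assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as defaults for `TfidfVectorizer`; the three documents and the helper names `idf` and `tfidf` are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: the second post repeats the first three times.
docs = [
    "god is love",
    "god is love god is love god is love",
    "the match is tonight",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many posts each term appears.
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(doc):
    counts = Counter(doc)
    weights = {t: c * idf(t) for t, c in counts.items()}
    # L2 normalization removes the effect of document length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


short, long_, _ = (tfidf(d) for d in tokenized)
print(short)
print(long_)
print(idf("is"), idf("god"))
```

After normalization the short and the long post get the same vector, and `is`, which appears in every post, receives the lowest IDF weight.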

In [8]:
classifiers_list=[
    {
        'cls':MultinomialNB(),
        'name':'NaiveBayes',
        'float_params':{
            'tvec__max_features':range(3000,4000,100),
            'tvec__max_df':[0.6,0.7,0.8],
            'tvec__min_df':[0.05,0.1,0.15],
        }
    },
]
run_classifiers(classifiers_list,X,y,tfidf=True)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best parameters and accuracy
{'tvec__min_df': 0.05, 'tvec__max_features': 3000, 'tvec__max_df': 0.8}
ROC AUC with CV=5: 0.8233414478852168


That's a slight improvement of about 1 percentage point (0.812 to 0.823).  
Let's try tuning some other parameters.

In [9]:
classifiers_list=[
    {
        'cls':MultinomialNB(),
        'name':'NaiveBayes',
        'fixed_params':{'min_df': 0.05, 'max_features': 3500, 'max_df': 0.8},
        'float_params':{'tvec__ngram_range':[(1,1),(1,2)],'tvec__use_idf':(True,False)}
    },
]
run_classifiers(classifiers_list,X,y,tfidf=True)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
Best parameters and accuracy
{'tvec__use_idf': False, 'tvec__ngram_range': (1, 2)}
ROC AUC with CV=5: 0.8257169538682693


Interesting! It turns out that **not** using inverse document frequency gives a better ROC AUC, albeit an improvement of only about 0.2 percentage points (0.8233 to 0.8257).  
In other words, the best model uses term frequency alone, without inverse document frequency weighting.  
Not surprisingly, an n-gram range of 1 to 2 gave a better score than unigrams alone.  
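
For reference, this is what widening the n-gram range adds to the feature set. The function `word_ngrams` below is a small hand-written sketch, not the vectorizer's own code, and the tokens are made up.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with lengths from ngram_range[0] to ngram_range[1], inclusive."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


# Hypothetical tokens from a single post.
tokens = ["holy", "spirit", "guides"]

unigrams = word_ngrams(tokens, (1, 1))
uni_and_bigrams = word_ngrams(tokens, (1, 2))

print(unigrams)
# ['holy', 'spirit', 'guides']
print(uni_and_bigrams)
# ['holy', 'spirit', 'guides', 'holy spirit', 'spirit guides']
```

With `(1, 2)` the model can treat a phrase such as `holy spirit` as a feature of its own, in addition to the individual words.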