### Naive Bayes
We investigate Naive Bayes methods, which assume conditional independence between every pair of features (here, words) given the class.
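
For reference, the conditional independence assumption leads to the standard Naive Bayes decision rule. The symbols are chosen for this note rather than taken from the notebook: $y$ is the class and $x_1, \dots, x_n$ are the features.

```latex
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The variants tried below differ mainly in how they model $P(x_i \mid y)$.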

We fit Bernoulli and Multinomial NB, both common choices for text classification. We skip Gaussian NB because the features are not continuous. We also try the complement naive Bayes (CNB) algorithm, an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets (https://scikit-learn.org/stable/modules/naive_bayes.html), on the non-downsampled data.

We use RandomizedSearchCV instead of GridSearchCV, which "can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions."
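
To make the sampling concrete, here is a minimal sketch of the kind of candidates a randomized search draws for `alpha`. It uses `ParameterSampler`, the helper `RandomizedSearchCV` relies on to generate candidates; the seed `22` and `n_iter=10` (the `RandomizedSearchCV` default) are choices made for this illustration.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Same search space as the cells below: alpha is log-uniform on [1e-4, 1].
param_dist = {'alpha': loguniform(1e-4, 1e0)}

# Draw 10 candidate settings; random_state is an arbitrary seed for repeatability.
candidates = list(ParameterSampler(param_dist, n_iter=10, random_state=22))
alphas = sorted(c['alpha'] for c in candidates)

# Each order of magnitude between 1e-4 and 1 is equally likely to be sampled.
for alpha in alphas:
    print(f'{alpha:.5f}')
```

A grid search over the same range would instead evaluate every value on a fixed grid, so its cost grows with the grid size rather than with `n_iter`.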

In [1]:
import pickle
import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
# loguniform ships with SciPy (>= 1.4); sklearn.utils.fixes only provided a backport for older SciPy
from scipy.stats import loguniform
from sklearn.metrics import average_precision_score, roc_auc_score
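import random
random.seed(22)
# RandomizedSearchCV draws from NumPy's global random state when random_state is not set
np.random.seed(22)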

In [2]:
# Importing labels
with open('../data/train_labels.pckl', 'rb') as f:
    train_labels = pickle.load(f)

with open('../data/dev_labels.pckl', 'rb') as f:
    dev_labels = pickle.load(f)

In [3]:
# Load the binary-vectorized, downsampled train/dev features as a first experiment.
# The other vectorizers are loaded later.

with open('../data/train_binary_downsampled_data.pckl', 'rb') as f:
    train = pickle.load(f)
with open('../data/dev_binary_downsampled_data.pckl', 'rb') as f:
    dev = pickle.load(f)

In [3]:
def get_data(dataset, vectorizer):
    '''
    Return the downsampled feature matrix for the given dataset and vectorizer.
    @param dataset: string specifying the dataset split, e.g. "train" or "dev"
    @param vectorizer: string specifying the vectorizer, e.g. "binary", "count", "tfidf"
    '''
    with open(f'../data/{dataset}_{vectorizer}_downsampled_data.pckl', 'rb') as f:
        return pickle.load(f)

### Multinomial
"Empirical comparisons provide evidence that the multinomial model tends to outperform the multi-variate Bernoulli model if the vocabulary size is relatively large [13]. However, the performance of machine learning algorithms is highly dependent on the appropriate choice of features. In the case of naive Bayes classifiers and text classification, large differences in performance can be attributed to the choices of stop word removal, stemming, and token-length [14]." (Citation: https://sebastianraschka.com/Articles/2014_naive_bayes_1.html) 

In [6]:
from sklearn.naive_bayes import MultinomialNB

vectorizers = ['count', 'tfidf', 'binary'] # 'hashing', 'hashing_binary'

# specify parameters and distributions to sample from
param_dist = {'alpha': loguniform(1e-4, 1e0)}

for vectorizer in vectorizers:
    print('----- ', vectorizer, ' -----')
    train = get_data('train', vectorizer)
    dev = get_data('dev', vectorizer)

    nb_multi = MultinomialNB()  
        
    # run randomized search
    random_search = RandomizedSearchCV(nb_multi, param_distributions=param_dist)
    
    random_search.fit(train, train_labels)
    
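    # NOTE: predict() returns hard 0/1 labels, so the AUC and AP below are
    # computed from thresholded predictions, not from predict_proba scores.
    nb_train = random_search.predict(train)
    nb_dev = random_search.predict(dev)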
    
    nb_train_auc = roc_auc_score(train_labels, nb_train)
    nb_dev_auc = roc_auc_score(dev_labels, nb_dev)
    nb_train_ap = average_precision_score(train_labels, nb_train)
    nb_dev_ap = average_precision_score(dev_labels, nb_dev)
    
    print(f'Train AUC: {nb_train_auc:.4f}\n'
          f'Dev   AUC: {nb_dev_auc:.4f}\n'
          f'Train AP:  {nb_train_ap:.4f}\n'
          f'Dev   AP:  {nb_dev_ap:.4f}')

-----  count  -----
Train AUC: 0.9868
Dev   AUC: 0.6616
Train AP:  0.9784
Dev   AP:  0.1584
-----  tfidf  -----
Train AUC: 0.9642
Dev   AUC: 0.6906
Train AP:  0.9564
Dev   AP:  0.1667
-----  binary  -----
Train AUC: 0.9862
Dev   AUC: 0.6622
Train AP:  0.9776
Dev   AP:  0.1597
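
The scores above are computed from `predict`, which returns hard 0/1 labels. `roc_auc_score` and `average_precision_score` are ranking metrics and are usually given continuous scores, so values computed from hard labels can differ noticeably from those computed from probabilities. The toy example below uses made-up labels and scores (not the notebook's data) to sketch the difference.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up labels and classifier scores, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.6, 0.4, 0.7, 0.9])

# Thresholding at 0.5 roughly mimics what predict() returns for a binary classifier.
y_hard = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_hard)
ap_from_scores = average_precision_score(y_true, y_score)
ap_from_labels = average_precision_score(y_true, y_hard)

print(f'AUC from scores: {auc_from_scores:.4f}  AUC from hard labels: {auc_from_labels:.4f}')
print(f'AP  from scores: {ap_from_scores:.4f}  AP  from hard labels: {ap_from_labels:.4f}')
```

With the search object from the cell above, the corresponding scores would come from `random_search.predict_proba(dev)[:, 1]`, assuming the positive class is labeled 1.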


### Bernoulli

V. Metsis, I. Androutsopoulos and G. Paliouras (2006). Spam filtering with Naive Bayes – Which Naive Bayes? 3rd Conf. on Email and Anti-Spam (CEAS).

In [7]:
from sklearn.naive_bayes import BernoulliNB

vectorizers = ['count', 'tfidf', 'binary'] # 'hashing', 'hashing_binary'

# specify parameters and distributions to sample from
param_dist = {'alpha': loguniform(1e-4, 1e0)}

for vectorizer in vectorizers:
    print('----- ', vectorizer, ' -----')
    train = get_data('train', vectorizer)
    dev = get_data('dev', vectorizer)

    nb_bern = BernoulliNB()  
        
    # run randomized search
    random_search = RandomizedSearchCV(nb_bern, param_distributions=param_dist)
    
    random_search.fit(train, train_labels)
    
    nb_train = random_search.predict(train)
    nb_dev = random_search.predict(dev)
    
    nb_train_auc = roc_auc_score(train_labels, nb_train)
    nb_dev_auc = roc_auc_score(dev_labels, nb_dev)
    nb_train_ap = average_precision_score(train_labels, nb_train)
    nb_dev_ap = average_precision_score(dev_labels, nb_dev)
    
    print(f'Train AUC: {nb_train_auc:.4f}\n'
          f'Dev   AUC: {nb_dev_auc:.4f}\n'
          f'Train AP:  {nb_train_ap:.4f}\n'
          f'Dev   AP:  {nb_dev_ap:.4f}')

-----  count  -----
Train AUC: 0.9680
Dev   AUC: 0.6274
Train AP:  0.9398
Dev   AP:  0.1340
-----  tfidf  -----
Train AUC: 0.9745
Dev   AUC: 0.6272
Train AP:  0.9515
Dev   AP:  0.1345
-----  binary  -----
Train AUC: 0.9699
Dev   AUC: 0.6276
Train AP:  0.9432
Dev   AP:  0.1342


## Non-downsampled explorations

### Complement NB
The Complement Naive Bayes classifier was designed to correct the “severe assumptions” made by the standard Multinomial Naive Bayes classifier. It is particularly suited for imbalanced data sets. [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.ComplementNB.html#sklearn.naive_bayes.ComplementNB)
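
As a rough illustration, the sketch below fits `ComplementNB` next to `MultinomialNB` on a tiny, made-up imbalanced count matrix. The data, the seed, and `alpha=1.0` are assumptions for this example only. CNB estimates each class's feature weights from the counts of all the other classes, which is the mechanism intended to make it less sensitive to class imbalance.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB

rng = np.random.default_rng(22)

# Made-up word counts: 90 negative and 10 positive documents over 20 "words".
X_neg = rng.poisson(lam=1.0, size=(90, 20))
X_pos = rng.poisson(lam=1.0, size=(10, 20))
X_pos[:, :3] += rng.poisson(lam=2.0, size=(10, 3))  # positives use the first 3 words more often
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 90 + [1] * 10)

# Count how many documents each model labels as positive on the training data.
n_positive_predictions = {}
for model in (MultinomialNB(alpha=1.0), ComplementNB(alpha=1.0)):
    model.fit(X, y)
    name = type(model).__name__
    n_positive_predictions[name] = int(model.predict(X).sum())
    print(f'{name}: predicted {n_positive_predictions[name]} positives out of {len(y)}')
```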

This uses non-downsampled data, which I generated by skipping downsample.ipynb, editing the file names, and re-running vectorize-count.ipynb and concat-features.ipynb.

In [13]:
def get_data_not_ds(dataset, vectorizer):
    '''
    Return the non-downsampled feature matrix for the given dataset and vectorizer.
    @param dataset: string specifying the dataset split, e.g. "train" or "dev"
    @param vectorizer: string specifying the vectorizer, e.g. "binary", "count", "tfidf"
    '''
    with open(f'../data/{dataset}_{vectorizer}_NOT_downsampled_data.pckl', 'rb') as f:
        return pickle.load(f)

In [9]:
# Importing labels NOT DOWNSAMPLED

with open('../data/train_labels_nods.pckl', 'rb') as f:
    train_labels_not_ds = pickle.load(f)

with open('../data/dev_labels_nods.pckl', 'rb') as f:
    dev_labels_not_ds = pickle.load(f)

In [10]:
from sklearn.naive_bayes import ComplementNB

vectorizers = ['count', 'tfidf', 'binary'] # 'hashing', 'hashing_binary'

# specify parameters and distributions to sample from
param_dist = {'alpha': loguniform(1e-4, 1e0)}

for vectorizer in vectorizers:
    print('----- ', vectorizer, ' -----')
    train = get_data_not_ds('train', vectorizer)
    dev = get_data_not_ds('dev', vectorizer)

    nb_comp = ComplementNB()  
        
    # run randomized search
    random_search = RandomizedSearchCV(nb_comp, param_distributions=param_dist)
    
    random_search.fit(train, train_labels_not_ds)
    
    nb_train = random_search.predict(train)
    nb_dev = random_search.predict(dev)
    
    nb_train_auc = roc_auc_score(train_labels_not_ds, nb_train)
    nb_dev_auc = roc_auc_score(dev_labels_not_ds, nb_dev)
    nb_train_ap = average_precision_score(train_labels_not_ds, nb_train)
    nb_dev_ap = average_precision_score(dev_labels_not_ds, nb_dev)
    
    print(f'Train AUC: {nb_train_auc:.4f}\n'
          f'Dev   AUC: {nb_dev_auc:.4f}\n'
          f'Train AP:  {nb_train_ap:.4f}\n'
          f'Dev   AP:  {nb_dev_ap:.4f}')

-----  count  -----
Train AUC: 0.8738
Dev   AUC: 0.5043
Train AP:  0.7713
Dev   AP:  0.1066
-----  tfidf  -----
Train AUC: 0.5013
Dev   AUC: 0.5000
Train AP:  0.1052
Dev   AP:  0.1016
-----  binary  -----
Train AUC: 0.6449
Dev   AUC: 0.5022
Train AP:  0.3624
Dev   AP:  0.1052


### Trying the other two NB models with non-downsampled data
Checking whether the other two models are affected by downsampling in the same way.
#### Bernoulli

In [12]:
# Bernoulli with raw (non-downsampled) data
vectorizers = ['count', 'tfidf', 'binary'] # 'hashing', 'hashing_binary'

# specify parameters and distributions to sample from
param_dist = {'alpha': loguniform(1e-4, 1e0)}

for vectorizer in vectorizers:
    print('----- ', vectorizer, ' -----')
    train = get_data_not_ds('train', vectorizer)
    dev = get_data_not_ds('dev', vectorizer)

    nb_bern = BernoulliNB()  
        
    # run randomized search
    random_search = RandomizedSearchCV(nb_bern, param_distributions=param_dist)
    
    random_search.fit(train, train_labels_not_ds)
    
    nb_train = random_search.predict(train)
    nb_dev = random_search.predict(dev)
    
    nb_train_auc = roc_auc_score(train_labels_not_ds, nb_train)
    nb_dev_auc = roc_auc_score(dev_labels_not_ds, nb_dev)
    nb_train_ap = average_precision_score(train_labels_not_ds, nb_train)
    nb_dev_ap = average_precision_score(dev_labels_not_ds, nb_dev)
    
    print(f'Train AUC: {nb_train_auc:.4f}\n'
          f'Dev   AUC: {nb_dev_auc:.4f}\n'
          f'Train AP:  {nb_train_ap:.4f}\n'
          f'Dev   AP:  {nb_dev_ap:.4f}')

-----  count  -----
Train AUC: 0.5353
Dev   AUC: 0.5001
Train AP:  0.1393
Dev   AP:  0.1017
-----  tfidf  -----
Train AUC: 0.7144
Dev   AUC: 0.5021
Train AP:  0.4700
Dev   AP:  0.1038
-----  binary  -----
Train AUC: 0.5286
Dev   AUC: 0.5001
Train AP:  0.1309
Dev   AP:  0.1017


#### Multinomial

In [11]:
# Multinomial with raw (non-downsampled) data
vectorizers = ['count', 'tfidf', 'binary'] # 'hashing', 'hashing_binary'

# specify parameters and distributions to sample from
param_dist = {'alpha': loguniform(1e-4, 1e0)}

for vectorizer in vectorizers:
    print('----- ', vectorizer, ' -----')
    train = get_data_not_ds('train', vectorizer)
    dev = get_data_not_ds('dev', vectorizer)

    nb_multi = MultinomialNB()  
        
    # run randomized search
    random_search = RandomizedSearchCV(nb_multi, param_distributions=param_dist)
    
    random_search.fit(train, train_labels_not_ds)
    
    nb_train = random_search.predict(train)
    nb_dev = random_search.predict(dev)
    
    nb_train_auc = roc_auc_score(train_labels_not_ds, nb_train)
    nb_dev_auc = roc_auc_score(dev_labels_not_ds, nb_dev)
    nb_train_ap = average_precision_score(train_labels_not_ds, nb_train)
    nb_dev_ap = average_precision_score(dev_labels_not_ds, nb_dev)
    
    print(f'Train AUC: {nb_train_auc:.4f}\n'
          f'Dev   AUC: {nb_dev_auc:.4f}\n'
          f'Train AP:  {nb_train_ap:.4f}\n'
          f'Dev   AP:  {nb_dev_ap:.4f}')

-----  count  -----
Train AUC: 0.9122
Dev   AUC: 0.5049
Train AP:  0.8391
Dev   AP:  0.1070
-----  tfidf  -----
Train AUC: 0.5209
Dev   AUC: 0.5001
Train AP:  0.1404
Dev   AP:  0.1018
-----  binary  -----
Train AUC: 0.6442
Dev   AUC: 0.5019
Train AP:  0.3612
Dev   AP:  0.1048
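
By default, `RandomizedSearchCV` selects `alpha` using the estimator's own `score` method, which is accuracy for these classifiers. On imbalanced data, one option worth trying (an assumption here, not something this notebook ran) is to select on average precision, fix `random_state` so the sampled alphas are reproducible, and evaluate on `predict_proba` scores. A minimal sketch on synthetic data:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.metrics import average_precision_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(22)

# Synthetic count data standing in for the vectorized text: 40 positives, 360 negatives.
y = np.array([1] * 40 + [0] * 360)
X = rng.poisson(lam=1.0, size=(400, 30))
X[:40, :5] += rng.poisson(lam=1.5, size=(40, 5))  # positives use the first 5 "words" more often

search = RandomizedSearchCV(
    MultinomialNB(),
    param_distributions={'alpha': loguniform(1e-4, 1e0)},
    n_iter=10,
    scoring='average_precision',  # rank-based metric instead of the default accuracy
    cv=3,
    random_state=22,              # makes the sampled alphas reproducible
)
search.fit(X, y)

# Use the positive-class probability as the score for ranking metrics.
scores = search.predict_proba(X)[:, 1]
print(search.best_params_)
print(f'Train AP: {average_precision_score(y, scores):.4f}')
```

Whether this changes the dev results reported above would need to be checked on the actual data.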
