## Naive Bayes Baseline Classifier

See [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/4.pdf) for details on Naive Bayes.

### Preprocessing
To obtain useful probability estimates for Naive Bayes, we lemmatize the tweets, build the vocabulary, and collect the words into bag-of-words (BoW) form.

In [1]:
import pandas as pd
from loading import load_train
from preprocessing import remove_tags
from preprocessing import tokenize
from preprocessing import remove_stopwords
from preprocessing import lemmatize

[nltk_data] Downloading package stopwords to /Users/franz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/franz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Preprocess tweets:

In [2]:
df = load_train(full=True)
remove_tags(df)
tokenize(df)
remove_stopwords(df)
lemmatize(df)
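
As an illustration of what "BoW form" means here, the following minimal sketch turns token lists into per-tweet word counts and a vocabulary. The `tweets`, `bags`, and `toy_vocabulary` names and the toy data are assumptions made for this example; the actual `preprocessing` helpers are not shown in this notebook and may work differently.

```python
from collections import Counter

# Hypothetical, already-preprocessed tweets (token lists), standing in for df.x
tweets = [
    ["love", "sunny", "day", "love"],
    ["hate", "rain"],
]

# Bag-of-words: word order is discarded, only per-word counts are kept
bags = [Counter(tokens) for tokens in tweets]

# Vocabulary: the set of distinct tokens across all tweets
toy_vocabulary = set().union(*tweets)
```

Naive Bayes only needs these counts, which is why the order of the tokens within a tweet does not matter for the rest of the notebook.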

### Training the classifier

The class priors $P(c)$ are fixed in our task, with both classes weighted equally:

In [3]:
prior = {"pos": 0.5, "neg": 0.5}
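
For comparison, if the priors were not fixed, they could be estimated as relative class frequencies, $\hat P(c) = N_c / N$. The sketch below assumes a plain list of class labels; `estimate_priors` and `toy_labels` are hypothetical names, not part of this notebook's modules.

```python
from collections import Counter

def estimate_priors(labels):
    """Estimate P(c) as the fraction of documents carrying label c."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# Hypothetical labels: three positive documents, one negative
toy_labels = ["pos", "neg", "pos", "pos"]
toy_prior = estimate_priors(toy_labels)
```

With perfectly balanced classes this estimate reduces to the fixed 0.5 / 0.5 prior used above.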

Collect the vocabulary:

In [4]:
vocabulary = {word for sentence in df.x for word in sentence}
len(vocabulary)

578234

Concatenate the tokens of all tweets in each class into one big document per class:

In [5]:
bigdoc = {}
bigdoc["pos"] = [word for sentence in df.loc[df['y'] == 1].x for word in sentence]
bigdoc["neg"] = [word for sentence in df.loc[df['y'] == 0].x for word in sentence]

Count the number of occurrences of each word per class:

In [6]:
count = {"pos":{}, "neg":{}}
for word in vocabulary:
    count["pos"][word] = 0
    count["neg"][word] = 0

for word in bigdoc["pos"]:
    count["pos"][word] += 1
for word in bigdoc["neg"]:
    count["neg"][word] += 1
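
The same per-class counts can also be sketched with `collections.Counter`, which returns 0 for unseen keys and therefore needs no initialization loop. The `toy_bigdoc` data below is made up for illustration.

```python
from collections import Counter

# Hypothetical per-class token lists, standing in for bigdoc
toy_bigdoc = {
    "pos": ["good", "good", "fun"],
    "neg": ["bad", "fun"],
}

# One Counter per class; missing words implicitly have count 0
toy_count = {c: Counter(words) for c, words in toy_bigdoc.items()}
```

Here `toy_count["neg"]["good"]` evaluates to 0 without raising a `KeyError`, which matches the explicit zero-initialization used in the cell above.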

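Estimate the word probability given a class, using the maximum likelihood estimate with Laplace (add-one) smoothing over the vocabulary $W$. The denominator equals the total number of words in class $c$ plus the vocabulary size, which is what the code computes:
$\hat P(w_i|c) = \frac{count(w_i,c)+1}{\sum_{w\in W}(count(w,c)+1)} = \frac{count(w_i,c)+1}{\left(\sum_{w\in W} count(w,c)\right)+|W|}$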

In [7]:
likelihood = {}
for word in vocabulary:
    likelihood[word] = {}
    likelihood[word]["pos"] = (count["pos"][word] + 1) / (len(bigdoc["pos"]) + len(vocabulary))
    likelihood[word]["neg"] = (count["neg"][word] + 1) / (len(bigdoc["neg"]) + len(vocabulary))
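
A small worked example may make the smoothing concrete. The sketch below applies the same formula to a toy class document and checks that the smoothed probabilities still sum to 1; `smoothed_likelihood` and the toy data are hypothetical, and the sketch assumes every word of the class document is contained in the vocabulary.

```python
from collections import Counter

def smoothed_likelihood(words, vocab):
    """Laplace-smoothed P(w | c) for one class, given that class's token list."""
    counts = Counter(words)
    # Total tokens in the class plus one pseudo-count per vocabulary entry
    denom = len(words) + len(vocab)
    return {w: (counts[w] + 1) / denom for w in vocab}

toy_vocab = {"good", "bad", "fun"}
toy_pos = ["good", "good", "fun"]  # hypothetical "pos" class document
toy_likelihood = smoothed_likelihood(toy_pos, toy_vocab)
```

In this toy case "good" gets (2 + 1) / (3 + 3) = 0.5, while "bad", which never occurs in the class, still gets the nonzero probability 1 / 6. Without smoothing, a single unseen word would drive the whole product for that class to zero.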

### Predictions on training data

In [8]:
import numpy as np

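def classify_preprocessed_sentence(tokens, prior, likelihood, classes, vocab):
    # Accumulate log prior + log likelihoods per class to avoid floating-point underflow
    ssum = {}
    for sentiment in classes:
        ssum[sentiment] = np.log(prior[sentiment])
        for word in tokens:
            # Skip words unseen in training; use the vocab argument, not the global
            if word in vocab:
                ssum[sentiment] += np.log(likelihood[word][sentiment])
    # Return the class with the highest log posterior score
    return max(ssum, key=ssum.get)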
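
The classifier sums log probabilities instead of multiplying raw probabilities. The following sketch, using made-up probability values, shows why: a long product of small numbers underflows to zero in floating point, while the equivalent sum of logs stays finite.

```python
import math

# Hypothetical per-word likelihoods for a very long document
probs = [1e-5] * 1000

# Naive product: underflows to exactly 0.0 in double precision
product = 1.0
for p in probs:
    product *= p

# Sum of logs: same information, but numerically stable
log_sum = sum(math.log(p) for p in probs)
```

Since the logarithm is monotonic, taking the argmax over log scores selects the same class as the argmax over the raw products would.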

In [9]:
df['y_pred'] = df.x.apply(lambda tokens: 1 if classify_preprocessed_sentence(tokens, prior, likelihood, ["pos","neg"], vocabulary) == "pos" else -1)

### Evaluation

In [10]:
import torch
import logging
from evaluation import evaluate

logging.basicConfig(level=logging.INFO)

In [11]:
y = torch.tensor(df['y'])
y_pred = torch.tensor(df['y_pred'])

In [12]:
evaluate(y, y_pred)

INFO:root:---
* accuracy: 0.753086
* precision: 0.6893940074247157
* recall: 0.9212328
* f1: 0.7886275937236656
* bce: 8.528273375547503
* auc: 0.7530859999999999
---


(0.753086,
 0.6893940074247157,
 0.9212328,
 0.7886275937236656,
 8.528273375547503,
 0.7530859999999999)
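
The internals of the `evaluate` helper are not shown in this notebook. As a rough guide to reading its output, the sketch below computes precision, recall, and F1 from their standard definitions on toy labels, and recomputes F1 from the precision and recall reported above. The function name and the toy data are assumptions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Standard binary precision, recall and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: one true positive, one false positive, one false negative
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 1, 0]
toy_metrics = precision_recall_f1(toy_true, toy_pred)

# F1 recomputed from the precision and recall reported above
reported_f1 = 2 * 0.6893940074247157 * 0.9212328 / (0.6893940074247157 + 0.9212328)
```

The recomputed value agrees with the reported F1 of about 0.7886. The combination of high recall (about 0.92) and lower precision (about 0.69) indicates that the classifier labels tweets as positive more often than is warranted.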

### Predictions on test data

In [15]:
from loading import load_test

In [16]:
df2 = load_test()
remove_tags(df2)
tokenize(df2)
remove_stopwords(df2)
lemmatize(df2)

In [17]:
df2['Prediction'] = df2.x.apply(lambda tokens: 1 if classify_preprocessed_sentence(tokens, prior, likelihood, ["pos","neg"], vocabulary) == "pos" else -1)

In [18]:
df2.to_csv("naive_bayes.csv", columns=['Prediction'])