# Naive Bayes and Sentiment Classification

**Classification**: assigning a category to an input.

**Text categorization**: assigning a label or category to an entire text or document.

A common *text categorization* task is **sentiment analysis**: extracting the positive or negative orientation that a writer expresses toward some object.

The simplest version of sentiment analysis is a *binary classification* task.

Other text classification tasks include:
* *Spam detection*: assigning an email to one of two classes, spam or not-spam.
* *Language id*: identifying the language a document is written in.

Most classification tasks are solved with *supervised machine learning*, where each input observation in the dataset is associated with a correct output.

Two of the many approaches to classification are:
* *Generative* classifiers: given an observation, they return the class most likely to have generated it. Naive Bayes is a generative classifier.

* *Discriminative* classifiers: they learn which features of the input are most useful for discriminating between the possible classes. Logistic regression is a discriminative classifier.

Discriminative classifiers are often more accurate.

## Naive Bayes Classifiers

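The task of *supervised classification* is to take an input $d$ and a fixed set of output classes $C = \{c_1, ..., c_M\}$ and return a predicted class $c \in C$.
Naive Bayes is a probabilistic classifier: for a document $d$, it returns the class $\hat{c}$ with the highest posterior probability, which Bayes' rule lets us rewrite as: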

$$\hat{c} = \underset{c \in C}{\operatorname{argmax}}\;P(c|d) = \underset{c \in C}{\operatorname{argmax}}\;\frac{P(d|c)P(c)}{P(d)} $$

We can drop the denominator $P(d)$: we compute $\frac{P(d|c)P(c)}{P(d)}$ for each possible class, and $P(d)$ is the same for every class, so it does not change which class attains the maximum.
$$\underset{c \in C}{\operatorname{argmax}}\;P(d|c)P(c) $$
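
A small numeric check of why dropping $P(d)$ is safe. The class names and scores below are made up purely for illustration; the point is only that dividing every score by the same constant leaves the argmax unchanged.

```python
# Hypothetical values of P(d|c) * P(c) for three classes (illustrative only).
unnormalized = {"sports": 0.012, "politics": 0.003, "tech": 0.009}

# P(d) is one constant shared by every class: the sum of the scores above.
p_d = sum(unnormalized.values())
posterior = {c: score / p_d for c, score in unnormalized.items()}

# The winning class is the same with or without the normalization.
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)

print(best_unnormalized, best_posterior)
```

Normalizing is only needed when the actual posterior probabilities are of interest, not just the predicted class.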

$P(d|c)$ is called the *likelihood* and $P(c)$ the *prior probability*.
A document $d$ can be represented as a set of $n$ features $(f_1, f_2, ..., f_n)$:

$$\underset{c \in C}{\operatorname{argmax}}\;P(f_1, f_2, ..., f_n|c)P(c) $$
Under the *naive Bayes assumption*, the features $f_i$ are conditionally independent given the class $c$, so the joint likelihood factors into a product:
$$\underset{c \in C}{\operatorname{argmax}}\;P(f_1, f_2, ..., f_n|c)P(c)  = \underset{c \in C}{\operatorname{argmax}}\;P(f_1|c)\,P(f_2|c) \cdots P(f_n|c) \; P(c)$$
The final equation is:

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\;P(c) \prod_{i=1}^{n} P(f_i|c)$$

To apply the naive Bayes classifier to text, we treat the word at each position of the document as a feature, replacing $f_i$ with $w_i$, the word at position $i$.

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\;P(c) \prod_{i \in \text{positions}} P(w_i|c)$$
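Because the logarithm is monotonic, we can take the log of the score without changing the argmax. The product becomes a sum, which avoids numerical underflow:

$$c_{NB} = \underset{c \in C}{\operatorname{argmax}}\;\left(\log P(c) + \sum_{i \in \text{positions}} \log P(w_i|c)\right)$$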
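
A minimal sketch of why the computation is done in log space. The per-word likelihoods are invented for illustration: multiplying many small probabilities underflows to zero in floating point, while the sum of their logs stays finite.

```python
import math

# Hypothetical per-word likelihoods for a 100-word document (illustrative only).
word_probs = [1e-5] * 100

# The raw product is 1e-500, which is below the smallest double and becomes 0.0.
product = 1.0
for p in word_probs:
    product *= p

# The sum of logs is an ordinary finite number, so classes can still be compared.
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

Once every class score has collapsed to `0.0`, the argmax is meaningless, which is why the log form is used in practice.
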
Recall that a **bag of words** is an unordered collection of words: word positions are ignored and only word frequencies are kept.

The training documents together form a bag of words, and its distinct words make up the vocabulary.
The model treats a word as a feature only if it appears in that bag of words.
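
As a rough illustration of the bag-of-words idea, the sketch below uses `collections.Counter`. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this example, not part of the original notebook.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    # Whitespace tokenization and lowercasing are simplifying assumptions.
    return Counter(document.lower().split())

bag_a = bag_of_words("the most fun film of the summer")
bag_b = bag_of_words("summer the film of the fun most")

# Frequencies are kept: "the" occurs twice.
print(bag_a["the"])
# Order is discarded: the shuffled sentence yields the same bag.
print(bag_a == bag_b)
```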

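We apply add-one (Laplace) smoothing so that a word never seen with a class in training does not get zero probability, which would zero out the whole product for that class.
At test time, words that do not appear in the training vocabulary are ignored.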
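
A minimal sketch of add-one smoothing, where the likelihood is estimated as (count of the word in the class + 1) / (total tokens in the class + vocabulary size). The function name `smoothed_likelihood` and the toy counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from collections import Counter

def smoothed_likelihood(word, class_counts, vocabulary):
    # (count(w, c) + 1) / (total tokens in class c + |V|)
    total = sum(class_counts.values())
    return (class_counts[word] + 1) / (total + len(vocabulary))

# Toy counts for one class, not taken from any dataset.
positive_counts = Counter({"great": 3, "fun": 2})
vocabulary = {"great", "fun", "boring"}

# Seen word: (3 + 1) / (5 + 3)
print(smoothed_likelihood("great", positive_counts, vocabulary))
# Word never seen with this class: (0 + 1) / (5 + 3), small but not zero
print(smoothed_likelihood("boring", positive_counts, vocabulary))
```

The smoothed estimates over the whole vocabulary still sum to one, so they remain a valid probability distribution.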

Removing *stop words* (the, a, ...) usually does not improve performance, so we keep all the words in the bag.

### Application

Example taken from Daniel Jurafsky and James H. Martin, *Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition*, page 63.

<div align = "center">
  <table id="table_01">
    <tr>
      <th></th>
      <th>Cat</th>
      <th>Document</th>
    </tr>
    <tr>
      <th rowspan="5">Training</th>
      <td>-</td>
      <td>just plain boring</td>
    </tr>
    <tr>
      <td>-</td>
      <td>entirely predictable and lacks energy</td>
    </tr>
    <tr>
      <td>-</td>
      <td>no surprises and very few laughs</td>
    </tr>
    <tr>
      <td>+</td>
      <td>very powerful</td>
    </tr>
    <tr>
      <td>+</td>
      <td>the most fun film of the summer</td>
    </tr>
    <tr>
      <th>Test</th>
      <td>?</td>
      <td>predictable with no fun</td>
    </tr>

  </table>
</div>


* As we can see, the classes are $-$ and $+$.

* Encoding $+$ as $1$ and $-$ as $0$, we want to predict the class of the phrase *predictable with no fun*.
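
The arithmetic for this example can be worked by hand from the table. This is a sketch that keeps repeated words such as *the* as separate tokens: the negative documents contain 14 tokens, the positive documents contain 9, and the training vocabulary has $|V| = 20$ distinct words. The word *with* never occurs in training, so it is dropped from the test sentence $S$.

$$P(-) = \frac{3}{5} \qquad P(+) = \frac{2}{5}$$

$$P(\text{predictable}|-) = \frac{1+1}{14+20} \qquad P(\text{no}|-) = \frac{1+1}{14+20} \qquad P(\text{fun}|-) = \frac{0+1}{14+20}$$

$$P(\text{predictable}|+) = \frac{0+1}{9+20} \qquad P(\text{no}|+) = \frac{0+1}{9+20} \qquad P(\text{fun}|+) = \frac{1+1}{9+20}$$

$$P(-)\,P(S|-) = \frac{3}{5} \times \frac{2 \times 2 \times 1}{34^3} \approx 6.1 \times 10^{-5}$$

$$P(+)\,P(S|+) = \frac{2}{5} \times \frac{1 \times 1 \times 2}{29^3} \approx 3.3 \times 10^{-5}$$

The negative class has the higher score, so under these counts the model predicts $-$.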

In [14]:
classes = { 0:'-', 1:'+'}

In [15]:
import pandas as pd
import numpy as np
from functools import reduce

In [16]:
# Load the data (the CSV is expected to have `document` and `cat` columns, as used below)
data_bayes = pd.read_csv(r'assets/naive_bayes/data.txt')

In [17]:
class naivebayes:
    """
    The naive Bayes algorithm, using add-1 smoothing.
    Extracted from [Speech and Language Processing - An Introduction to Natural Language Processing,
    Computational Linguistics, and Speech Recognition - Daniel Jurafsky and James H. Martin]
    Page: 62
    """
    def __init__(self):
        pass

    def token(self, sentence:str) -> list:
        # Split on whitespace and keep repeated words: the likelihoods are
        # estimated from word counts, so duplicates must not be collapsed.
        return sentence.split()
    
    def fit(self, data:np.ndarray, target:np.ndarray) -> None:
        x = data.flatten()
        self.classes = np.unique(target)
        n_doc = len(target)
        log_prior, loglikelihood = {}, {}

        # The vocabulary is the set of distinct words over all training documents.
        vocabulary = set(word for d in x for word in self.token(d))

        for c in self.classes:
            condition = target == c
            log_prior[c] = np.log(np.sum(condition) / n_doc)

            # Concatenate every document of class c and count its words once.
            big_doc = [word for d in x[condition] for word in self.token(d)]
            uniq_doc, count_doc = np.unique(big_doc, return_counts=True)
            counts = dict(zip(uniq_doc, count_doc))

            # Add-one smoothing: (count + 1) / (tokens in class + |V|).
            denominator = len(big_doc) + len(vocabulary)
            loglikelihood[c] = {
                word: np.log((counts.get(word, 0) + 1) / denominator)
                for word in vocabulary
            }

        self.log_prior, self.loglikelihood, self.vocabulary = log_prior, loglikelihood, vocabulary

    def predict(self, data:np.ndarray) -> np.ndarray:
        x = data.flatten()
        results = []
        for sentence in x:
            log_posterior = {}
            word_test = self.token(sentence)
            for c in self.classes:
                # Start from the log prior and add the log likelihood of each word.
                log_posterior[c] = self.log_prior[c]
                for word in word_test:
                    # Words outside the training vocabulary are ignored.
                    if word in self.vocabulary:
                        log_posterior[c] += self.loglikelihood[c][word]
            # Pick the class with the highest log posterior.
            results.append(max(log_posterior, key=log_posterior.get))
        return np.array(results)


In [20]:
nb = naivebayes()

In [23]:
X = np.array(data_bayes[['document']])
y = np.array(data_bayes.cat)
nb.fit(X, y)

In [25]:
input_document = np.array(['predictable with no fun'])
y_predict = nb.predict(input_document)
for pred, doc in zip(y_predict, input_document):
    print(doc, classes[pred])

predictable with no fun -


### Improvements

* *Dealing with negation*:
  
  A very simple baseline that is commonly used in sentiment analysis to deal with negation is the following: during text normalization, prepend the prefix NOT_ to every word after a token of logical negation (n’t, not, no, never) until the next punctuation mark.
   
  * didn’t like this movie $\rightarrow$ didn’t NOT_like NOT_this NOT_movie.

* *Lexicons*: 

  In some situations we might have insufficient labeled training data to train an accurate naive Bayes classifier that uses all the words in the training set to estimate positive and negative sentiment. In such cases we can instead derive the positive and negative word features from sentiment lexicons, lists of words that are pre-annotated with positive or negative sentiment.

  * $+$:  admirable, beautiful, confident
  * $-$: awful, bad, bias, catastrophe

  A common way to use a lexicon in a naive Bayes classifier is to add a feature that is counted whenever a word from that lexicon occurs.
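
The two improvements above can be sketched as small preprocessing helpers. The function names `mark_negation` and `lexicon_features`, the punctuation set, and the feature names are assumptions for illustration only. The tiny lexicons reuse the example words listed above and the input is assumed to be already tokenized.

```python
import re

NEGATIONS = {"not", "no", "never"}
PUNCTUATION = re.compile(r"^[.,!?;:]+$")

# Example words from the lists above; a real lexicon would be far larger.
POSITIVE = {"admirable", "beautiful", "confident"}
NEGATIVE = {"awful", "bad", "bias", "catastrophe"}

def mark_negation(tokens):
    """Prefix NOT_ to every token after a negation until the next punctuation mark."""
    marked, negating = [], False
    for tok in tokens:
        if PUNCTUATION.match(tok):
            # Punctuation closes the scope of the negation.
            negating = False
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
            lowered = tok.lower()
            # Handle both straight and curly apostrophes in "n't".
            if lowered in NEGATIONS or lowered.endswith(("n't", "n’t")):
                negating = True
    return marked

def lexicon_features(tokens):
    """Count how many tokens fall in each sentiment lexicon."""
    lowered = [t.lower() for t in tokens]
    return {
        "positive_lexicon_hits": sum(t in POSITIVE for t in lowered),
        "negative_lexicon_hits": sum(t in NEGATIVE for t in lowered),
    }

negated = mark_negation(["didn't", "like", "this", "movie", ",", "but", "I"])
features = lexicon_features("a beautiful but awful and bad film".split())
print(negated)
print(features)
```

Negation-marked tokens such as `NOT_like` become ordinary vocabulary entries, so the classifier can learn separate likelihoods for them, while the lexicon counts can be added as two extra features alongside the word features.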

### References 
  * [1] Daniel Jurafsky and James H. Martin. *Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition*. Pages 61-64.