# Naive Bayes

## Probability and Bayes Rule
Bayes' theorem is widely applied in fields such as medicine and education, and notably in natural language processing (NLP). Understanding its principles enables tasks like sentiment analysis on tweets, which is your assignment for this week. In the next course, you will also use the same principles to build an auto-correct feature. Consider a large collection of tweets, each labeled as having either positive or negative sentiment. In this collection, the word "happy" appears in some tweets labeled positive and in others labeled negative.

To see why, think of probabilities in terms of how often events occur. If you define event A as a tweet being labeled positive, then the probability of A, denoted P(A), is the number of positive tweets divided by the total number of tweets. In our example, this is 13 out of 20, or 0.65, which can also be expressed as 65%.
Because every tweet is either positive or negative, the probability of a negative tweet is simply one minus the probability of a positive tweet.
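As a quick check of the arithmetic above, here is a minimal Python sketch. The counts (13 positive tweets out of 20) come from the example; the variable names are only illustrative.

```python
# Counts from the running example: 20 tweets, 13 of them labeled positive.
total_tweets = 20
positive_tweets = 13

# P(A): relative frequency of positive tweets.
p_positive = positive_tweets / total_tweets

# Complement rule, valid because every tweet is either positive or negative.
p_negative = 1 - p_positive

print(f"P(positive) = {p_positive:.2f}")
print(f"P(negative) = {p_negative:.2f}")
```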

<img src="../../img/probability.PNG" alt="probability" width="70%" height="70%">


Similarly, define event B as the occurrence of the word "happy" in a tweet. Let's say "happy" appears in 4 tweets. Now consider the part of the diagram where tweets are both positive and contain "happy". The probability that a tweet is labeled positive and contains "happy" is the number of such tweets divided by the size of the entire collection. So, if there are 20 tweets and 3 of them are positive and contain "happy", the probability is 3 out of 20, or 0.15. This is the probability of the intersection of A and B. You now know how to compute the probability that a tweet is both positive and contains a given word, like "happy". The next section builds on this with conditional probabilities.
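The intersection can also be computed by counting. The sketch below uses a synthetic stand-in for the corpus: only the counts (20 tweets, 13 positive, 4 containing "happy", 3 of those positive) come from the example, and the tuple layout is an assumption made for illustration.

```python
# Synthetic stand-in for the 20-tweet corpus: (is_positive, contains_happy).
# Only the counts match the example; the tweet texts are not modeled.
tweets = (
    [(True, True)] * 3      # positive and contains "happy"
    + [(True, False)] * 10  # positive, no "happy"
    + [(False, True)] * 1   # negative and contains "happy"
    + [(False, False)] * 6  # negative, no "happy"
)

n = len(tweets)

# P(B): fraction of tweets that contain "happy".
p_happy = sum(has_happy for _, has_happy in tweets) / n

# P(A and B): fraction of tweets that are positive AND contain "happy".
p_positive_and_happy = sum(pos and has_happy for pos, has_happy in tweets) / n

print(f"P(happy) = {p_happy:.2f}")
print(f"P(positive and happy) = {p_positive_and_happy:.2f}")
```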

<img src="../../img/probability_intersect.PNG" alt="probability intersect" width="70%" height="70%">

## Bayes' Rule
Conditional probabilities allow us to narrow down our search within a sample space. For instance, suppose you already know that a tweet contains the word 'happy':

<img src="../../img/bayes_rule.PNG" alt="bayes rule" width="70%" height="70%">




In that case, your search would be confined to the area within the blue circle mentioned earlier. The numerator would represent the red section, while the blue section would serve as the denominator. From this, we can draw the following conclusion:

<img src="../../img/conditional_probability.PNG" alt="conditional probability" width="70%" height="70%">


Substituting the numerator in the right hand side of the first equation, you get the following: 

<img src="../../img/conditional_probability_2.PNG" alt="conditional probability 2" width="70%" height="70%">


Note that the intersection has been rewritten as P('happy' | positive) multiplied by P(positive), which follows from the definition of conditional probability. This brings us to Bayes' rule, which is formulated as:

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

The main takeaway for now is that Bayes' rule follows directly from the definition of conditional probability. With Bayes' rule, you can calculate the probability of X given Y if you already know the probability of Y given X and the ratio of the probabilities of X and Y.
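To make the rule concrete, the following sketch plugs in the numbers from the running example (20 tweets, 13 positive, 4 containing "happy", 3 of those positive). The function name `bayes_rule` is hypothetical and chosen for this illustration.

```python
def bayes_rule(p_y_given_x: float, p_x: float, p_y: float) -> float:
    """Return P(X | Y) computed from P(Y | X), P(X), and P(Y)."""
    return p_y_given_x * p_x / p_y


# Numbers from the running example (20 tweets in total).
p_positive = 13 / 20             # P(positive)
p_happy = 4 / 20                 # P("happy")
p_happy_given_positive = 3 / 13  # 3 of the 13 positive tweets contain "happy"

p_positive_given_happy = bayes_rule(p_happy_given_positive, p_positive, p_happy)
print(f"P(positive | happy) = {p_positive_given_happy:.2f}")
```

The result agrees with the direct count: 3 of the 4 tweets that contain "happy" are positive.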
 



## Naive Bayes for sentiment analysis

Naive Bayes is a highly effective and straightforward baseline method for many text classification projects, including sentiment analysis. It is an instance of supervised machine learning and bears similarities to logistic regression, which you’ve previously worked with. It’s termed 'naive' because it assumes the features used for classification are independent, which is rarely true in practice. Yet, you’ll see it performs well for simple tasks like sentiment analysis.

As before, we'll start with two groups of tweets: positive and negative. The task involves extracting all unique words and their frequencies from these groups, just like before. For positive tweets, we might count a total of 13 words, and for negative tweets, 12 words. This step is crucial in Naive Bayes because it lets you calculate each word's conditional probability given the class.

<img src="../../img/bayes_table.PNG" alt="conditional probability 2" width="70%" height="70%">

For instance, the word ‘I’ would have a conditional probability of 3/13 in the positive class. This value, about 0.23, is stored in a new table, and similarly, for the negative class, ‘I’ would have a probability of 3/12, or 0.25. You'll repeat this process for every word in your vocabulary to complete the table of conditional probabilities. A useful property of this table is that the probabilities in each class column sum to 1.
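The table of conditional probabilities can be sketched in a few lines of Python. The word counts below are assumptions: they are chosen so that the class totals are 13 and 12 and so that ‘I’ occurs 3 times in each class, as in the text, but they are not necessarily the counts shown in the figure.

```python
# Illustrative (positive, negative) word counts. These are assumed values,
# picked to match the totals in the text, not copied from the figure.
freqs = {
    "I": (3, 3),
    "am": (3, 3),
    "happy": (2, 1),
    "because": (1, 0),
    "learning": (1, 1),
    "NLP": (1, 1),
    "sad": (1, 2),
    "not": (1, 1),
}

n_pos = sum(pos for pos, _ in freqs.values())  # total words in positive tweets
n_neg = sum(neg for _, neg in freqs.values())  # total words in negative tweets

# P(w | class) = freq(w, class) / N_class
cond_prob = {w: (pos / n_pos, neg / n_neg) for w, (pos, neg) in freqs.items()}

for word, (p_pos, p_neg) in cond_prob.items():
    print(f"{word:>9}  P(w|pos)={p_pos:.2f}  P(w|neg)={p_neg:.2f}")

# Each column is a probability distribution over the vocabulary.
total_pos = sum(p for p, _ in cond_prob.values())
total_neg = sum(p for _, p in cond_prob.values())
print(f"column sums: {total_pos:.2f}, {total_neg:.2f}")
```

Note that "because" gets a probability of exactly zero in the negative class here, which is the situation smoothing is meant to handle.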

<img src="../../img/probability.PNG" alt="probability" width="70%" height="70%">


When examining this table, notice that some words have almost equal conditional probabilities across sentiments, like ‘I’, ‘learning’, and ‘NLP’. These neutral words contribute little to determining sentiment. In contrast, words like ‘happy’, ‘sad’, and ‘not’ show significant differences in probabilities, indicating a strong inclination toward one sentiment. However, words that appear in only one corpus, like ‘because’ in positive tweets, pose a challenge: their conditional probability in the other class is zero, which breaks the ratio used at inference time. To resolve this, you apply probability smoothing.

<img src="../../img/bayes_inference.PNG" alt="bayes inference" width="70%" height="70%">

Imagine analyzing a new tweet: "I'm happy today, I'm learning." Using the Naive Bayes inference rule, you take, for each word in the tweet, the ratio of its probability in the positive class to its probability in the negative class, and multiply these ratios together. Neutral words like ‘I’ and ‘I’m’ have ratios close to 1 and barely change the product, leaving the significant words to dictate the sentiment. In this case, the result is more than 1, suggesting the tweet is positive.
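A minimal sketch of this inference rule follows. The conditional probabilities and the tokenization of the tweet are assumptions made for illustration (they are consistent with the 3/13 and 3/12 values quoted earlier, but the remaining entries are hypothetical), and words missing from the table are simply skipped.

```python
# Illustrative (P(w|pos), P(w|neg)) pairs, assumed for this sketch.
cond_prob = {
    "I": (3 / 13, 3 / 12),
    "am": (3 / 13, 3 / 12),
    "happy": (2 / 13, 1 / 12),
    "learning": (1 / 13, 1 / 12),
}

# "I'm happy today, I'm learning" after a simple (assumed) tokenization.
tokens = ["I", "am", "happy", "today", "I", "am", "learning"]

score = 1.0
for word in tokens:
    if word not in cond_prob:
        continue  # words outside the table (here "today") are skipped
    p_pos, p_neg = cond_prob[word]
    score *= p_pos / p_neg  # ratio near 1 for neutral words

prediction = "positive" if score > 1 else "negative"
print(f"product of ratios = {score:.3f} -> {prediction}")
```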


## Laplacian Smoothing

**Standard Probability Computation**:

We typically calculate the probability of a word $w_i$ given a class (either Positive or Negative) using the formula:

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}}, \qquad \text{class} \in \{ \text{Positive}, \text{Negative} \}$$

**Issue with Unseen Words**:

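If $w_i$ never appears in the training data for a class, it gets a probability of 0 for that class. This is problematic because a single zero makes the whole product of probabilities zero, and a zero in the denominator leaves the ratio undefined.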

**Applying Smoothing**:

To address this, we use Laplacian Smoothing:

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$$

In this formula:
- We add 1 to the numerator so that a word with a count of zero still gets a nonzero probability.
- $V$, the total number of unique words in the vocabulary, is added to the denominator so that the probabilities in each class still sum to 1.

**Notations**:

- $N_{\text{class}}$: Total count of all word occurrences in the class.
- $V$: Number of unique words in the vocabulary.

**Additional Considerations**:

The same problem arises when estimating the probability of one word following another. That probability is the number of times the two words appear consecutively, divided by the frequency of the first word. If the pair never appears together in the training data, the probability becomes zero. Laplacian smoothing prevents these zero probabilities.
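The word-pair case can be sketched as follows. The toy sentence is hypothetical, the helper name `bigram_prob` is an assumption, and the sketch only handles a first word that occurs in the training text.

```python
from collections import Counter

# Hypothetical toy training text, used only for illustration.
tokens = "i am happy because i am learning".split()
vocab_size = len(set(tokens))

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def bigram_prob(first: str, second: str, smooth: bool = False) -> float:
    """Estimate P(second | first), optionally with add-one smoothing."""
    pair = bigram_counts[(first, second)]  # Counter returns 0 for unseen pairs
    if smooth:
        return (pair + 1) / (unigram_counts[first] + vocab_size)
    return pair / unigram_counts[first]


print(bigram_prob("i", "am"))                         # pair seen in training
print(bigram_prob("happy", "learning"))               # unseen pair: 0.0
print(bigram_prob("happy", "learning", smooth=True))  # unseen pair, now nonzero
```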

<img src="../../img/smoothing1.PNG" alt="probability" width="70%" height="70%">


For instance, let's apply this to our data. Suppose we have 8 unique words in the vocabulary. For the word 'I' in the positive class, we calculate:

$$P(\text{'I'} \mid \text{Positive}) = \frac{3 + 1}{13 + 8} \approx 0.19$$

And in the negative class:

$$P(\text{'I'} \mid \text{Negative}) = \frac{3 + 1}{12 + 8} = 0.2$$

The process is repeated for the other words. After smoothing, the probabilities in each class still sum to one (up to rounding), and no word has a zero probability.
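The worked example can be reproduced with a short sketch. The values for 'I' (count 3, class totals 13 and 12, vocabulary size 8) come from the text; the function name `laplace_prob` and the list of positive-class counts are assumptions chosen so that the counts total 13.

```python
def laplace_prob(freq: int, n_class: int, vocab_size: int) -> float:
    """P(w | class) with add-one (Laplacian) smoothing."""
    return (freq + 1) / (n_class + vocab_size)


# Values from the text: "I" occurs 3 times per class, N_pos=13, N_neg=12, V=8.
p_i_pos = laplace_prob(3, 13, 8)
p_i_neg = laplace_prob(3, 12, 8)
print(f"P('I'|pos) = {p_i_pos:.2f}, P('I'|neg) = {p_i_neg:.2f}")

# Assumed positive-class counts for the 8 vocabulary words (they total 13).
pos_counts = [3, 3, 2, 1, 1, 1, 1, 1]
smoothed = [laplace_prob(c, sum(pos_counts), len(pos_counts)) for c in pos_counts]
print(f"sum of smoothed probabilities = {sum(smoothed):.2f}")

# A word that never occurs in a class still gets a nonzero probability.
p_unseen_neg = laplace_prob(0, 12, 8)
print(f"P(unseen|neg) = {p_unseen_neg:.2f}")
```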

<img src="../../img/smoothing.PNG" alt="probability" width="70%" height="70%">

This is the essence of Laplacian smoothing - it's crucial for ensuring that probabilities in our calculations don't fall to zero, especially in cases of word pairs that never appear together in the training set.

## Log Likelihood

To calculate the log likelihood, you first compute, for each word, the ratio of its probability in the positive class to its probability in the negative class, and then use these ratios to calculate a score. This score is used to judge whether a tweet has a positive or negative sentiment. A higher ratio indicates a more positive connotation of the word:

<img src="../../img/log_likelihood_1.PNG" alt="loglikelihood1" width="70%" height="70%">

To do inference, you can compute the following: 

<img src="../../img/log_likelihood_2.PNG" alt="loglikelihood2" width="30%" height="30%">

As m gets larger, we can run into numerical underflow issues, so we introduce the log, which gives you the following equation:

<img src="../../img/log_likelihood_3.PNG" alt="loglikelihood3" width="40%" height="40%">
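The underflow problem is easy to reproduce. In the sketch below, the ratio value and the document length m are hypothetical numbers picked to trigger the effect with 64-bit floats; the point is only that the product collapses to zero while the sum of logs stays finite.

```python
import math

# Hypothetical setting: a document of m words, each with a tiny ratio.
ratio = 1e-5
m = 100

product = 1.0
for _ in range(m):
    product *= ratio  # shrinks below the smallest representable float

# The same quantity in log space: a sum instead of a product.
log_sum = sum(math.log(ratio) for _ in range(m))

print(product)  # underflows to 0.0
print(log_sum)  # finite negative number
```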

The first component is called the log prior and the second component is the log likelihood. We further introduce λ as follows: 

<img src="../../img/log_likelihood_4.PNG" alt="loglikelihood4" width="70%" height="70%">

Once you have computed the λ dictionary, inference becomes straightforward:

<img src="../../img/log_likelihood_5.PNG" alt="loglikelihood5" width="70%" height="70%">

As you can see above, since 3.3 > 0, we classify the document as positive. If we had obtained a negative number, we would have classified it as negative.
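The score is just the log prior plus the sum of the λ values of the words in the document. The sketch below uses an assumed λ dictionary and an assumed example tweet, with values picked so that the score comes out at 3.3 as in the text; the table in the figure may differ.

```python
# Illustrative lambda values (assumed; the figure's table may differ).
lambdas = {
    "I": 0.0, "am": 0.0, "happy": 2.2, "because": 0.0,
    "learning": 1.1, "NLP": 0.0, "sad": -3.3, "not": -1.2,
}
logprior = 0.0  # balanced classes assumed

# Assumed example document, already tokenized.
tokens = ["I", "am", "happy", "because", "I", "am", "learning"]

# score = logprior + sum of lambda(w) over the words of the document
score = logprior + sum(lambdas[w] for w in tokens)
label = "positive" if score > 0 else "negative"
print(f"score = {score:.1f} -> {label}")
```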


## Training naïve Bayes

To train your naïve Bayes classifier, you have to perform the following steps:

1) Get or annotate a dataset with positive and negative tweets
2) Preprocess the tweets: process_tweet(tweet) ➞ [w1, w2, w3, ...]:
    - Lowercase
    - Remove punctuation, URLs, names
    - Remove stop words
    - Stemming
    - Tokenize sentences

<img src="../../img/training_naive1.PNG" alt="training_naive_1" width="70%" height="70%">
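The preprocessing step could look roughly like the sketch below. It uses only the standard library: the stop-word list is a tiny illustrative one, `crude_stem` is a very rough hypothetical stand-in for a real stemmer such as Porter's, and "names" is interpreted here as @handles, which is an assumption.

```python
import re
import string

# Tiny illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"i", "a", "an", "the", "am", "is", "are", "at", "to", "and"}


def crude_stem(word: str) -> str:
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_tweet(tweet: str) -> list[str]:
    """Lowercase, strip URLs/handles/punctuation, drop stop words, stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove @handles
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                     # whitespace tokenization
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(process_tweet("@NLP I am tuning GREAT models at https://example.com !!!"))
```

In practice a dedicated tweet tokenizer and a proper stemmer would replace the whitespace split and the suffix stripping used here.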

3) Compute freq(w, class):


<img src="../../img/training_naive2.PNG" alt="training_naive_2" width="70%" height="70%">
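Counting freq(w, class) amounts to one pass over the preprocessed tweets. In this sketch the function name `count_freqs`, the label encoding (1 for positive, 0 for negative), and the toy tweets are all assumptions made for illustration.

```python
from collections import defaultdict


def count_freqs(tweets: list[list[str]], labels: list[int]) -> dict[tuple[str, int], int]:
    """Map (word, label) pairs to counts; label 1 = positive, 0 = negative."""
    freqs: dict[tuple[str, int], int] = defaultdict(int)
    for tokens, label in zip(tweets, labels):
        for word in tokens:
            freqs[(word, label)] += 1
    return dict(freqs)


# Hypothetical, already preprocessed tweets with their labels.
tweets = [["happi", "learn"], ["sad", "learn"], ["happi", "happi"]]
labels = [1, 0, 1]

freqs = count_freqs(tweets, labels)
print(freqs)
```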

4) Get $P(w \mid \text{pos}), P(w \mid \text{neg})$  
You can use the table above to compute the probabilities.


5) Get $\lambda(w)$  
$$\lambda(w) = \log \frac{P(w \mid \text{pos})}{P(w \mid \text{neg})}$$

<img src="../../img/training_naive3.PNG" alt="training_naive_3" width="70%" height="70%">

6) Compute $\text{logprior}$  
$$\text{logprior} = \log \left( \frac{P(\text{pos})}{P(\text{neg})} \right) = \log \frac{D_{\text{pos}}}{D_{\text{neg}}}$$  
where $D_{\text{pos}}$ and $D_{\text{neg}}$ correspond to the number of positive and negative documents respectively.

<img src="../../img/training_naive4.PNG" alt="training_naive_4" width="70%" height="70%">
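Steps 4 to 6 can be combined into one small training function. This is a sketch under stated assumptions: it applies Laplacian smoothing when computing the conditional probabilities, it takes λ(w) as the log of P(w | pos) over P(w | neg) so that positive scores indicate positive sentiment (consistent with the inference rule in the Log Likelihood section), and the function name, label encoding, and toy counts are hypothetical.

```python
import math


def train_naive_bayes(freqs, n_pos_docs, n_neg_docs):
    """Return (logprior, lambdas) from (word, label) counts; 1 = pos, 0 = neg."""
    vocab = {word for word, _ in freqs}
    v = len(vocab)
    n_pos = sum(c for (_, label), c in freqs.items() if label == 1)
    n_neg = sum(c for (_, label), c in freqs.items() if label == 0)

    lambdas = {}
    for word in vocab:
        # Laplacian-smoothed conditional probabilities.
        p_w_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + v)
        p_w_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + v)
        lambdas[word] = math.log(p_w_pos / p_w_neg)

    # logprior = log(D_pos / D_neg)
    logprior = math.log(n_pos_docs / n_neg_docs)
    return logprior, lambdas


# Hypothetical counts gathered from two positive and two negative tweets.
freqs = {("happi", 1): 3, ("learn", 1): 1, ("sad", 0): 2, ("learn", 0): 2}
logprior, lambdas = train_naive_bayes(freqs, n_pos_docs=2, n_neg_docs=2)

print(logprior)
print({w: round(l, 2) for w, l in sorted(lambdas.items())})
```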


## Testing naïve Bayes

<img src="../../img/test_naive.PNG" alt="test naive" width="50%" height="50%">

The example above shows how you can make a prediction given your $\lambda$ dictionary. In this example the $\text{logprior}$ is 0 because we have the same number of positive and negative tweets.
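A prediction function along these lines might look as follows. The function name, the λ values, and the example tokens are hypothetical, and treating words missing from the dictionary as neutral (contributing 0) is an assumption of this sketch.

```python
def naive_bayes_predict(tokens, logprior, lambdas):
    """Return logprior + sum of lambda(w); unknown words contribute 0."""
    return logprior + sum(lambdas.get(word, 0.0) for word in tokens)


# Hypothetical lambda dictionary from a previous training run.
lambdas = {"happi": 1.39, "learn": -0.41, "sad": -1.10}
logprior = 0.0  # equal numbers of positive and negative training tweets

tokens = ["happi", "today", "learn"]  # "today" is not in the dictionary
score = naive_bayes_predict(tokens, logprior, lambdas)
print(f"score = {score:.2f} -> {'positive' if score > 0 else 'negative'}")
```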


## Applications of Naive Bayes

There are many applications of naive Bayes including: 

- Author identification

- Spam filtering 

- Information retrieval 

- Word disambiguation 

This method is usually used as a simple baseline. It is also really fast.

## Naïve Bayes Assumptions
Naïve Bayes makes the independence assumption and is affected by the word frequencies in the corpus. For example, consider the following:

<img src="../../img/naive_bayes.PNG" alt="test naive" width="40%" height="40%">

In the first image, you can see that the words "sunny" and "hot" tend to depend on each other and are correlated to a certain extent with the word "desert". Naive Bayes nevertheless treats every word as independent of the others. Furthermore, if you were to fill in the sentence on the right, this naive model would assign equal weight to the words "spring", "summer", "fall", and "winter", regardless of the surrounding context.

## Relative frequencies in corpus

<img src="../../img/naive_bayes1.PNG" alt="test naive" width="20%" height="20%">

On Twitter, there are usually more positive tweets than negative ones. However, some "clean" datasets you may find are artificially balanced to have the same number of positive and negative tweets. Just keep in mind that in the real world, the data could be much noisier.
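The class balance enters the model through the log prior. The short sketch below contrasts a balanced corpus with an imbalanced one; the document counts and the function name `log_prior` are hypothetical.

```python
import math


def log_prior(n_pos_docs: int, n_neg_docs: int) -> float:
    """logprior = log(D_pos / D_neg)."""
    return math.log(n_pos_docs / n_neg_docs)


balanced = log_prior(500, 500)    # artificially balanced dataset
imbalanced = log_prior(800, 200)  # hypothetical corpus with more positives

print(balanced)    # 0.0, so the prior does not shift the score
print(imbalanced)  # positive, so every score is shifted toward "positive"
```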

## Error Analysis
Several things can cause you to misclassify an example or a tweet. For example:
- Removing punctuation
- Removing words

<img src="../../img/error_analysis.PNG" alt="test naive" width="70%" height="70%">

- Adversarial attacks, which include sarcasm, irony, and euphemisms
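To illustrate the first item in the list, the sketch below shows how stripping punctuation can delete the only sentiment cue in a tweet. The tweet and the helper name `strip_punctuation` are hypothetical.

```python
import string


def strip_punctuation(text: str) -> str:
    """Remove every ASCII punctuation character from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


tweet = "My beloved grandmother :("  # hypothetical sad tweet
cleaned = strip_punctuation(tweet).split()

# The ":(" emoticon, the only negative cue, is gone after cleaning.
print(cleaned)
```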