# Naive Bayes

### General concept
The naive Bayes classifier uses Bayes' theorem and the concept of _conditional probabilities_:

$$P(A|B) = \frac{P(B|A)P(A )}{P(B)}$$

We want to find the class with the highest posterior probability. This is the class that we predict:

$$
\hat{y} = \underset{k \in {(1, 2)}}{\arg \max} ~ P(Y=k | X_1 = x_1, X_2 = x_2)
$$

Bayes' rule applied to classification:

$$
P(Y=k | X_1 = x_1, X_2 = x_2) = \frac{P(X_1 = x_1, X_2 = x_2|Y=k)P(Y=k)}{P(X_1 = x_1, X_2 = x_2)}
$$

The denominator is a constant as it does not depend on the value of `k`. Dropping this term does not change the optimization problem:

$$
\hat{y} = \underset{k \in {(1, 2)}}{\arg \max} ~ P(Y=k | X_1 = x_1, X_2 = x_2) = \underset{k \in {(1, 2)}}{\arg \max} ~ P(X_1 = x_1, X_2 = x_2|Y=k)P(Y=k)
$$

Now we can make the **"naive"** assumption that the input **features are conditionally independent** of each other given the class:

$$
\hat{y} = \underset{k \in {(1, 2)}}{\arg \max} ~ P(X_1 = x_1 | Y = k)P(X_2 = x_2|Y=k)P(Y=k)
$$ 

Because our $X$ is binary (Bernoulli distributed), we can now easily estimate $P(X_1 = x_1 | Y = 1), P(X_1 = x_1 | Y = 2), \dots$ by looking at the data and counting relative frequencies.
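The counting estimate can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy data set, the helper names `fit_counts` and `predict`, and the class labels `1` and `2` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: two binary features (x1, x2) and a class label in {1, 2}.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 1)]
y = [1, 1, 1, 2, 2, 2]


def fit_counts(X, y):
    """Estimate P(Y=k) and P(X_j=x | Y=k) as relative frequencies."""
    n = len(y)
    class_counts = Counter(y)
    prior = {k: c / n for k, c in class_counts.items()}
    # feature_counts[k][j][x] = number of class-k rows where feature j equals x
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, k in zip(X, y):
        for j, x in enumerate(row):
            feature_counts[k][j][x] += 1
    likelihood = {
        k: {j: {x: c / class_counts[k] for x, c in counts.items()}
            for j, counts in feats.items()}
        for k, feats in feature_counts.items()
    }
    return prior, likelihood


def predict(row, prior, likelihood):
    """Return the class maximizing P(Y=k) * prod_j P(X_j=x_j | Y=k)."""
    scores = {}
    for k in prior:
        score = prior[k]
        for j, x in enumerate(row):
            # A value never seen for this class gets probability 0 (no smoothing here).
            score *= likelihood[k][j].get(x, 0.0)
        scores[k] = score
    return max(scores, key=scores.get), scores


prior, likelihood = fit_counts(X, y)
label, scores = predict((1, 0), prior, likelihood)
print(label, scores)
```

The scores are unnormalized because the denominator was dropped; only their ranking matters for the prediction.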

For song lyrics, we calculate the conditional probability that a given text belongs to an artist, $P(artist | text)$:

$$P(artist1|text) = \frac{P(text|artist1)P(artist1)}{P(text)}$$

and compare it to

$$P(artist2|text) = \frac{P(text|artist2)P(artist2)}{P(text)}$$

We do this for each of the artists (classes) and assign the text to the artist with the highest probability:

$$\text{IF } P(artist1|text) > P(artist2|text): \quad \text{song by artist1}$$

$$\text{IF } P(artist2|text) > P(artist1|text): \quad \text{song by artist2}$$

--- 

### Theoretical minimal example: 
- **text** = submarine  
- **artist 1** = Beatles
- **artist 2** = Eminem

We want to calculate $P(Beatles|submarine)$:

$$P(Beatles|submarine) = \frac{P(submarine|Beatles)P(Beatles )}{P(submarine)}$$

- $P(submarine|Beatles)$ := probability of 'submarine' appearing in a Beatles song
- $P(Beatles)$ := prior probability of the Beatles, i.e. the share of songs in the data set that are by the Beatles
- $P(submarine)$ := probability of 'submarine' appearing in any song in the data set

Compare to:

$$P(Eminem|submarine) = \frac{P(submarine|Eminem)P(Eminem )}{P(submarine)}$$
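To make the two formulas concrete, here is a small sketch that plugs in numbers. The song counts are invented purely for illustration (they are not taken from any real data set), and probabilities are estimated as "share of songs containing the word".

```python
# Hypothetical counts, invented for illustration only.
n_songs = {"Beatles": 6, "Eminem": 4}            # songs per artist in the data set
n_songs_with_word = {"Beatles": 3, "Eminem": 1}  # songs containing 'submarine'

total_songs = sum(n_songs.values())
# P(submarine): share of all songs that contain the word
p_word = sum(n_songs_with_word.values()) / total_songs

posterior = {}
for artist in n_songs:
    likelihood = n_songs_with_word[artist] / n_songs[artist]  # P(submarine | artist)
    prior = n_songs[artist] / total_songs                     # P(artist)
    posterior[artist] = likelihood * prior / p_word           # Bayes' theorem

prediction = max(posterior, key=posterior.get)
print(posterior, prediction)
```

With these made-up counts the posteriors come out at roughly 0.75 for the Beatles and 0.25 for Eminem, so 'submarine' would be assigned to the Beatles. Because both posteriors share the denominator $P(submarine)$, comparing the numerators alone would give the same decision.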

#### More than one word, as our text corpus contains many words

$$P( words | artist ) = P( word_1 | artist ) * P( word_2 | artist ) * \dots * P(word_N | artist)$$

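Example (the constant denominator $P(\text{I have submarine})$ is dropped, so the posterior is only proportional to the right-hand side):

$$P(Eminem|(\text{I have submarine})) \propto P(\text{I have submarine}|Eminem)P(Eminem )$$

$$P(Eminem|(\text{I have submarine})) \propto P(\text{I}|Eminem)*P(\text{have}|Eminem)*P(\text{submarine}|Eminem)*P(Eminem )$$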

This is where the word *naive* in the classifier's name comes from: it assumes that all these word probabilities are conditionally independent given the artist, so it can simply multiply them.
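The multiplication for a multi-word text can be sketched as follows. The per-word probabilities and the priors are hypothetical values chosen for this example, and `unnormalized_score` is an illustrative helper name.

```python
import math

# Hypothetical per-artist word probabilities and priors (not estimated from real data).
word_probs = {
    "Beatles": {"i": 0.05, "have": 0.02, "submarine": 0.01},
    "Eminem": {"i": 0.08, "have": 0.03, "submarine": 0.001},
}
priors = {"Beatles": 0.5, "Eminem": 0.5}


def unnormalized_score(text, artist):
    """P(artist) * prod_i P(word_i | artist), using the naive independence assumption."""
    words = text.lower().split()
    return priors[artist] * math.prod(word_probs[artist][w] for w in words)


scores = {artist: unnormalized_score("I have submarine", artist) for artist in priors}

# Dividing by the sum of the scores restores proper posterior probabilities.
total = sum(scores.values())
posteriors = {artist: s / total for artist, s in scores.items()}
print(scores, posteriors)
```

Normalizing by the sum of the scores is equivalent to dividing by $P(text)$, which is why the dropped denominator never has to be computed separately.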

### Problems and solutions

##### I. What if a word never appears in an artist's songs, so that one of the terms $P( word1 | artist ) = 0$?
---> Smoothing term `C`

A single zero factor makes the whole product zero:

$P(Eminem|(\text{I have submarine})) \propto P(\text{I}|Eminem)*P(\text{have}|Eminem)*P(\text{submarine}|Eminem)*P(Eminem )$

* we assume that every word occurs at least `C` times
* so that the probability is always > 0
* it's like assuming that the artist attached `C` copies of each word in the dictionary to the song
* IMPORTANT: If we increase the smoothing term, we *dilute* the information from the song; hence **the smoothing term is a regularization hyperparameter!**

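A minimal sketch of additive smoothing, assuming the common form $(count + C) / (total + C \cdot |V|)$ where $|V|$ is the vocabulary size. The vocabulary, the token list, and the helper name `smoothed_word_probs` are made up for this example.

```python
from collections import Counter


def smoothed_word_probs(tokens, vocabulary, C=1.0):
    """Estimate P(word | artist) with additive smoothing term C.

    Every word in the vocabulary is treated as if it had been seen C extra times.
    """
    counts = Counter(tokens)
    total = len(tokens) + C * len(vocabulary)
    return {w: (counts[w] + C) / total for w in vocabulary}


# Hypothetical vocabulary and word list for one artist.
vocabulary = ["love", "yellow", "submarine", "loyalty"]
beatles_tokens = ["love", "yellow", "submarine", "submarine", "love", "love"]

unsmoothed = smoothed_word_probs(beatles_tokens, vocabulary, C=0.0)
light = smoothed_word_probs(beatles_tokens, vocabulary, C=1.0)
heavy = smoothed_word_probs(beatles_tokens, vocabulary, C=100.0)

print(unsmoothed["loyalty"])  # zero without smoothing: 'loyalty' never occurs
print(light["loyalty"])       # strictly positive with C = 1
print(heavy)                  # close to uniform: the counts are diluted
```

With a very large `C` all words end up with almost the same probability, which illustrates why the smoothing term acts as a regularizer.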
##### II. Don't we end up with a very small number if we multiply many small probabilities?

--> Yes, this is problematic, because the product can underflow to zero in floating-point arithmetic:

$p(w_1) \cdot p(w_2) \cdot \ldots \cdot p(w_n)$

--> Therefore: **calculate log-probabilities instead:**

$\log(p(w_1)) + \log(p(w_2)) + \ldots + \log(p(w_n))$
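The underflow problem is easy to reproduce. The sketch below uses 100 hypothetical word probabilities of $10^{-5}$ each; the numbers are arbitrary and only serve to show the effect.

```python
import math

# Hypothetical: a text with 100 words, each with probability 1e-5.
probs = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is far below what a float can represent.
product = 1.0
for p in probs:
    product *= p

# Summing logarithms instead keeps the number in a comfortable range.
log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number
```

Because the logarithm is monotonically increasing, the class with the largest sum of log-probabilities is also the class with the largest product, so the prediction does not change.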

---

### Practical example

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline

In [2]:
corpus = [
    "we all love a yellow submarine",             # Beatles
    "yesterday, my submarine was in love",        # Beatles
    "we are love trouble with loyalty here",      # Eminem
    "loyalty to us is worth more than love is"    # Eminem
]

In [3]:
text_labels = ["Beatles"]*2 + ["Eminem"]*2

In [4]:
def preTrainedTextClassifier(text_features, text_labels):
    """Takes raw texts and their labels.
    Returns a pipeline fitted on the text data."""
    # TF-IDF turns each text into a weighted word vector,
    # which is then classified with multinomial naive Bayes.
    m = make_pipeline(
        TfidfVectorizer(),
        MultinomialNB()
    )
    m.fit(text_features, text_labels)
    return m

In [5]:
def bandPredictor(new_text_data, model):
    """Takes a single new text and a fitted model.
    Returns the predicted band and the class probabilities
    (columns ordered as in model.classes_)."""
    new_text_data = [new_text_data]  # the pipeline expects an iterable of texts
    band_predictions = model.predict(new_text_data)
    probs = model.predict_proba(new_text_data)
    return band_predictions[0], probs

In [6]:
pre_trained_model = preTrainedTextClassifier(corpus, text_labels)

In [7]:
new_text_data = "We love our loyal submarine"

In [8]:
bandPredictor(new_text_data, pre_trained_model)

('Beatles', array([[0.61988988, 0.38011012]]))