# Naive Bayes Classification

Suppose we are using an API to gather articles from a news website and grabbing phrases from two different types of articles: on **sports** and on **politics**.

Is there a way we can use machine learning to help us label the articles quickly?

#### Example Data

In [None]:
sports = ['the match was close',
          'the coaches agreed on strategy',
          'played in a sold out stadium']

politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

test_statement = 'world leaders agreed to fund the stadium'

### Bringing Back Bayes

> "Naive Bayes classifiers are linear classifiers that are known for being **simple yet very efficient**. The probabilistic model of naive Bayes classifiers is based on Bayes’ theorem, and the adjective naive comes from the assumption that the features in a dataset are **mutually independent**. In practice, the independence assumption is often violated, but naive Bayes classifiers **still tend to perform very well** under this unrealistic assumption. Especially for small sample sizes, naive Bayes classifiers can outperform the more powerful alternatives."

[Source: Sebastian Raschka: Naive Bayes and Text Classification](https://sebastianraschka.com/Articles/2014_naive_bayes_1.html) (emphasis is mine)

#### Revisiting the theorem itself:

![bayes theorem](images/bayes_theorem.svg)

#### AKA

![breaking down the function behind naive bayes](images/naive_bayes_icon.png)

####  Another way of looking at it

![further breakdown of the pieces of naive bayes](images/another_one.png)

### So, in the context of our problem:


## $ P(politics | document) = \frac{P(document|politics)P(politics)}{P(document)}$


## We need to calculate each piece in turn...

### How should we calculate $ P(politics) $ ?

This is the prior: the probability of each type of article before we look at any of its words. We have three of each type of article in our current set of example documents, so we assume that either type is equally likely.


## $ P(politics) = \frac{\# politics\ documents}{\# all\ documents} $

In [None]:
# we can check this, though
# going back to our intro example data...
p_politics = len(politics)/(len(politics) + len(sports))
p_politics

In [None]:
p_sports = len(sports)/(len(politics) + len(sports))
p_sports

### How should we calculate $ P(document | politics) $ ?

We need to break the documents down into individual words, in the hope that the words tell us more than the whole phrase does, since we have likely never seen this exact document before.

### $ P(document | politics) = \prod_{i=1}^{n} P(word_{i} | politics) $

where $n$ is the number of words in the document.

### $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ docs} {\#\ of\ total\ words\ in\ politics\ docs} $

#### Can you foresee any issues with this?

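- If a word in the test document never appears in the politics documents, its count is 0, so $P(word_{i} | politics) = 0$ and the entire product collapses to 0, no matter what the other words say.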
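
The sketch below makes the zero-count problem concrete by computing the unsmoothed product for the test sentence. It repeats the example lists from the top of the notebook so it runs on its own, and the helper name `unsmoothed_likelihood` is made up for illustration.

```python
from fractions import Fraction

sports = ['the match was close',
          'the coaches agreed on strategy',
          'played in a sold out stadium']
politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']
test_statement = 'world leaders agreed to fund the stadium'

def unsmoothed_likelihood(phrase, category):
    """Product of raw word frequencies, with no smoothing (illustrative helper)."""
    category_words = ' '.join(category).split()
    likelihood = Fraction(1)
    for word in phrase.split():
        # count / total words in the category; 0 if the word was never seen
        likelihood *= Fraction(category_words.count(word), len(category_words))
    return likelihood

print(unsmoothed_likelihood(test_statement, politics))
print(unsmoothed_likelihood(test_statement, sports))
```

Both products come out to 0, because each category is missing at least one word from the test sentence, so without a correction the two classes cannot be compared.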


#### Enter: Laplace Smoothing

### $ P(word_{i} | politics) = \frac{\#\ of\ word_{i}\ in\ politics\ docs + \alpha} {\#\ of\ total\ words\ in\ politics\ docs + \alpha d} $

Laplace smoothing adds $\alpha$ to every word count so that no word ends up with a probability of exactly 0:

- $d$ : number of features (in this instance, the number of distinct words in the vocabulary)
- $\alpha$ can be any number greater than 0 (it is usually 1)
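
As a rough sketch of the smoothed formula, assuming the vocabulary is every distinct word in the example documents: the helper name `smoothed_word_prob` is hypothetical, and the example lists are repeated so the block runs by itself.

```python
from fractions import Fraction

sports = ['the match was close',
          'the coaches agreed on strategy',
          'played in a sold out stadium']
politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

def smoothed_word_prob(word, category, vocabulary, alpha=1):
    """Laplace-smoothed P(word | category) (illustrative helper)."""
    alpha = Fraction(alpha)  # keep the arithmetic exact
    category_words = ' '.join(category).split()
    return Fraction(category_words.count(word) + alpha,
                    len(category_words) + alpha * len(vocabulary))

# vocabulary = every distinct word seen in the training documents
vocabulary = set(' '.join(sports + politics).split())

print(len(vocabulary))
print(smoothed_word_prob('world', politics, vocabulary))
print(smoothed_word_prob('world', sports, vocabulary))
```

With 22 distinct words and 15 words per category, 'world' gets 2/37 under politics and 1/37 under sports, so a word that sports has never seen still keeps a small non-zero probability.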

### How should we calculate $ P(document) $ ?

- Well... we don't have to. $P(document)$ is the same whether we're looking at sports or politics, so we only need to compare the two numerators and see which is bigger.
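
Writing the two posteriors as a ratio makes the cancellation explicit, using the same symbols as the formula for $P(politics | document)$ above:

$$ \frac{P(politics | document)}{P(sports | document)} = \frac{P(document | politics) P(politics) / P(document)}{P(document | sports) P(sports) / P(document)} = \frac{P(document | politics) P(politics)}{P(document | sports) P(sports)} $$

A ratio above 1 means politics wins, and a ratio below 1 means sports wins.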


### So why is this 'naive' ?

> "Naive Bayes (NB) is ‘naive’ because it makes the assumption that features of a measurement are independent of each other. This is naive because it is (almost) never true."

[Source - 'What's So Naive About Naive Bayes?', a Towards Data Science blog post all about this](https://towardsdatascience.com/whats-so-naive-about-naive-bayes-58166a6a9eba)

### Now let's calculate this...

In [None]:
print(sports)
print(politics)

| word       | frequency in politics | frequency in sports |
| ---------- | --------------------- | ------------------- |
| the        |  2                    | 2                   |
| match      |  0                    | 1                   |
| was        |  1                    | 1                   |
| close      |  1                    | 1                   |
| coaches    |  0                    | 1                   |
| agreed     |  1                    | 1                   |
| on         |  1                    | 1                   |
| strategy   |  0                    | 1                   |
| played     |  0                    | 1                   |
| in         |  0                    | 1                   |
| a          |  1                    | 1                   |
| sold       |  0                    | 1                   |
| out        |  0                    | 1                   |
| stadium    |  0                    | 1                   |
| world      |  1                    | 0                   |
| leaders    |  1                    | 0                   |
| met        |  1                    | 0                   |
| last       |  1                    | 0                   |
| week       |  1                    | 0                   |
| election   |  1                    | 0                   |
| officials  |  1                    | 0                   |
| compromise |  1                    | 0                   |

> Test sentence: 'world leaders agreed to fund the stadium'

| word    | $ P( word \mid politics) $             | $ P( word \mid sports) $               |
| ------- | -------------------------------------- | -------------------------------------- |
| world   | $\frac{1 + 1}{15 + 22} = \frac{2}{37}$ | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ |
| leaders | $\frac{1 + 1}{15 + 22} = \frac{2}{37}$ | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ |
| agreed  | $\frac{1 + 1}{15 + 22} = \frac{2}{37}$ | $\frac{1 + 1}{15 + 22} = \frac{2}{37}$ |
| to      | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ |
| fund    | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ |
| the     | $\frac{2 + 1}{15 + 22} = \frac{3}{37}$ | $\frac{2 + 1}{15 + 22} = \frac{3}{37}$ |
| stadium | $\frac{0 + 1}{15 + 22} = \frac{1}{37}$ | $\frac{1 + 1}{15 + 22} = \frac{2}{37}$ |

Each denominator is the 15 total words in that category plus $\alpha d = 1 \times 22$, since the frequency table above has 22 distinct vocabulary words.

I dunno about you... but I'm already exhausted trying to do this from scratch, and that's just a single sentence. Let's move into Python.

In [None]:
# Initial imports

import numpy as np
np.random.seed(123)
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict

Remember, we want

$ P(document|sports) * P(sports) $ **vs.** $ P(document|politics) * P(politics) $

We already calculated $P(sports)$ and $P(politics)$ above - they're both 50% since we have an equal number of documents in each. But the $P(document | label)$ for the two labels is going to require breaking out each word to find the likelihood that each word is in each category (plus Laplace smoothing).

In [None]:
def vocab_maker(category):
    """
    returns the vocabulary for a given type of article
    """
    vocab_category = set()
    for art in category:
        words = art.split()
        for word in words:
            vocab_category.add(word)
    return vocab_category
        
voc_sports = vocab_maker(sports)
voc_pol = vocab_maker(politics)
total_vocabulary = voc_sports.union(voc_pol)

In [None]:
voc_sports

In [None]:
voc_pol

In [None]:
total_vocabulary

In [None]:
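total_vocab_count = len(total_vocabulary) # number of distinct words overall, used for Laplace smoothing
# total words (counting repeats) in each category, matching the denominator of the formula
total_sports_count = sum(len(doc.split()) for doc in sports)
total_politics_count = sum(len(doc.split()) for doc in politics)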

In [None]:
def find_number_words_in_category(phrase, category):
    '''
    returns number of words in the phrase previously found in the category
    
    inputs:
    phrase - string, test phrase to classify
    category - list, all training phrases associated with that category
    
    output:
    word_count - default dictionary, with each word in the phrase as a key 
                 with a value of the number of times the words have 
                 appeared in the category in the train set
    '''
    # gets each word out - statement is a list object now
    statement = phrase.split()
    
    # creating one big string from the provided category list
    str_category=' '.join(category)
    # splitting now so it's a single list of the words found in the category
    cat_word_list = str_category.split()
    # default dict allows us to create new keys easily
    word_count = defaultdict(int) 
    
    for word in statement:
        # number of times this word appears in the category (0 if never seen)
        word_count[word] = cat_word_list.count(word)
    return word_count

In [None]:
test_sports_word_count = find_number_words_in_category(test_statement,sports)
test_sports_word_count

In [None]:
test_politic_word_count = find_number_words_in_category(test_statement,politics)
test_politic_word_count

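### $ P(politics | article) \propto P(politics) \times \prod_{i=1}^{n} P(word_{i} | politics) $

Here $n$ is the number of words in the article, and $\propto$ reflects that we dropped the shared $P(article)$ denominator.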

In [None]:
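def find_likelihood(category_count, test_category_count, alpha):
    """
    returns the Laplace-smoothed likelihood of the test phrase for one category
    """
    # product of (count + alpha) over the words in the test phrase
    # (np.prod replaces np.product, which was removed in NumPy 2.0)
    num = np.prod(np.array(list(test_category_count.values())) + alpha)
    # every word shares the same denominator, so raise it to the number of words
    denom = (category_count + total_vocab_count*alpha)**(len(test_category_count))
    
    return num/denom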

In [None]:
likelihood_sports = find_likelihood(total_sports_count,test_sports_word_count,1)

In [None]:
likelihood_politics = find_likelihood(total_politics_count,test_politic_word_count,1)

In [None]:
# yeah... these are unnormalized scores, not real probabilities - just worry about which is bigger
print(likelihood_sports)
print(likelihood_politics)

#### Determining the winner of our model:

![](images/solvingforyhat.png)

In [None]:
# p(politics|article) > p(sports|article)
(likelihood_politics * p_politics) > (likelihood_sports * p_sports)
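
With longer documents, multiplying many small probabilities can underflow to 0.0, so a common variant sums log probabilities instead; the ranking is unchanged because the logarithm is monotonic. The sketch below is one way to do that for the toy data, not the notebook's own helper: `log_score` is a hypothetical name, and the example lists are repeated so the block runs by itself.

```python
import math

sports = ['the match was close',
          'the coaches agreed on strategy',
          'played in a sold out stadium']
politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']
test_statement = 'world leaders agreed to fund the stadium'

# every distinct word seen in the training documents
vocabulary = set(' '.join(sports + politics).split())

def log_score(phrase, category, n_docs_total, alpha=1):
    """log P(category) + sum of log Laplace-smoothed P(word | category)."""
    category_words = ' '.join(category).split()
    denom = len(category_words) + alpha * len(vocabulary)
    score = math.log(len(category) / n_docs_total)  # log prior
    for word in phrase.split():
        score += math.log((category_words.count(word) + alpha) / denom)
    return score

n_docs = len(sports) + len(politics)
log_politics = log_score(test_statement, politics, n_docs)
log_sports = log_score(test_statement, sports, n_docs)
print(log_politics > log_sports)
```

Politics scores higher here as well, which agrees with the product-based comparison above.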

### Pros:

* It is an efficient way to predict the class of a test data set, and it performs well in multi-class prediction.
* When the assumption of independence holds, a Naive Bayes classifier requires less training data and can perform better than models like logistic regression.
* It performs well with categorical inputs. For numerical inputs, Gaussian Naive Bayes assumes each feature is normally distributed within each class.

### Cons:

* Naive Bayes is known to be a poor probability estimator, so the probability outputs from predict_proba are not to be taken too seriously.
* We are assuming independent predictors, but in real life it is almost impossible to get a set of predictors that are completely independent (amazingly, it still works a lot of the time though!).

... but let's be real, we don't need to use hand-written functions for this

### Using Naive Bayes in sklearn

In [None]:
# more imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
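from sklearn.metrics import accuracy_score, f1_score, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; this alias keeps the calls below working
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator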

from sklearn.naive_bayes import GaussianNB

In [None]:
# fetching our data
news_train = fetch_20newsgroups(subset='train', 
                                categories = ['rec.sport.baseball', 
                                              'talk.politics.misc'])
news_test = fetch_20newsgroups(subset='test', 
                               categories = ['rec.sport.baseball', 
                                              'talk.politics.misc'])

In [None]:
# collecting data in dataframe
df_train = pd.DataFrame()
df_train['Data'] = news_train.data
df_train['Target'] = news_train.target

df_test = pd.DataFrame()
df_test['Data'] = news_test.data
df_test['Target'] = news_test.target

In [None]:
# grabbing our target classes so we know which is which
target_classes = dict(enumerate(news_test.target_names))
target_classes

In [None]:
df_train.info()
df_train.head()

In [None]:
df_test.info()
df_test.head()

In [None]:
print(f'Train Target Ratio: {df_train["Target"].mean():.4f}')
print(f'Test Target Ratio: {df_test["Target"].mean():.4f}')
# roughly equivalent breakdowns between classes in train and test set

#### Need to turn our text data into numbers...

In [None]:
# Using a Count Vectorizer
# Goes through each doc and counts how many of each word
vectorizer = CountVectorizer()
# Fitting and transforming our train data
X_train = vectorizer.fit_transform(df_train['Data']).toarray() # .toarray() because GaussianNB needs a dense array, not a sparse matrix
# Just transforming our test data
X_test = vectorizer.transform(df_test['Data']).toarray()

In [None]:
# What does this look like?
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
X_train_vectorized = pd.DataFrame(X_train, columns=vectorizer.get_feature_names_out())
X_train_vectorized.head()
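
To see roughly what the count matrix contains, here is a plain-Python sketch of the same idea on the toy phrases from the start of the notebook. It is only an approximation of `CountVectorizer`: by default the real class also lowercases text and ignores single-character tokens, so it would have no column for 'a'. The names `vocabulary` and `count_matrix` are chosen for this example.

```python
sports = ['the match was close',
          'the coaches agreed on strategy',
          'played in a sold out stadium']
politics = ['world leaders met last week',
            'the election was close',
            'the officials agreed on a compromise']

docs = sports + politics

# column order: alphabetically sorted distinct words
vocabulary = sorted(set(' '.join(docs).split()))

# one row per document, one column per vocabulary word
count_matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary[:5])
print(count_matrix[0])
```

Each row is mostly zeros with a count wherever that document uses the word, which is the same shape of data the models below are trained on.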

#### Let's explore a single example of our new vectorized X

In [None]:
# Before
df_train.iloc[[0]]

In [None]:
# Full text before
df_train['Data'][0]

In [None]:
# After
X_train_vectorized.iloc[0].sort_values(ascending=False).head(20)

#### Now time to model!

In [None]:
# Setting our y values
y_train = df_train['Target']
y_test = df_test['Target']

In [None]:
# Instantiating our model - just using default values
model = GaussianNB()
# Fitting our model
model.fit(X_train,y_train)
# Making predictions on our test set
y_preds = model.predict(X_test)

# How'd we do?
print(f'Naive Bayes Test Accuracy: {accuracy_score(y_test, y_preds):.4f}')
print(f'Naive Bayes Test F1-Score: {f1_score(y_test, y_preds):.4f}')

In [None]:
plot_confusion_matrix(model, X_test, y_test, display_labels = target_classes.values())
plt.show()

In [None]:
# for comparison...

from sklearn.linear_model import LogisticRegression

# Instantiating our model - just using default values
logreg = LogisticRegression(random_state=123)
# Fitting our model
logreg.fit(X_train,y_train)
# Making predictions on our test set
y_preds_lr = logreg.predict(X_test)

# How'd we do?
print(f'Logistic Regression Test Accuracy: {accuracy_score(y_test, y_preds_lr):.4f}')
print(f'Logistic Regression Test F1-Score: {f1_score(y_test, y_preds_lr):.4f}')

plot_confusion_matrix(logreg, X_test, y_test, display_labels = target_classes.values())
plt.show()

In [None]:
# another comparison...

from sklearn.ensemble import RandomForestClassifier

# Instantiating our model - just using default values
rf = RandomForestClassifier(random_state=123)
# Fitting our model
rf.fit(X_train,y_train)
# Making predictions on our test set
y_preds_rf = rf.predict(X_test)

# How'd we do?
print(f'Untuned Random Forest Test Accuracy: {accuracy_score(y_test, y_preds_rf):.4f}')
print(f'Untuned Random Forest Test F1-Score: {f1_score(y_test, y_preds_rf):.4f}')

plot_confusion_matrix(rf, X_test, y_test, display_labels = target_classes.values())
plt.show()