<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Bayes Classifiers (Naive Bayes)
              
</p>
</div>

Data Science Cohort Live NYC Feb 2022
<p>Phase 3: Topic 26</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.naive_bayes import MultinomialNB, GaussianNB

import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
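from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; keep the old name as an alias
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator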
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

# Motivating the Bayesian Classifier

So far:

- We have classified by finding a separating hyperplane:
    - maximizing the predicted probability of the correct class;
    - equivalently, minimizing the log loss of the sigmoid output.

> Let's take a second to go through an example to get a feel for how Bayes's Theorem can help us with classification, specifically document classification.

Spam, Spam, Spam, Spam, Spam...

![Many cans of spam](images/wall_of_spam.jpeg)

> This is the classic example: detecting email spam!

**The Problem Setup**

> We get emails that can be either emails we care about (***ham*** 🐷) or emails we don't care about (***spam*** 🥫). 
>
> We can probably look at the words in an email and get an idea of whether it is spam or not just by observing whether it contains red-flag words 🚩
> 
> We won't always be right, but if we see an email that uses word(s) more often associated with spam, then we can feel more confident in labeling that email as spam!

## Naive Bayes Setup

What we gotta do:

1. Look at spam and not spam (ham) emails
2. Identify words that suggest classification
3. Determine probability that words occur in each classification
4. Profit (classify new emails as "spam" or "ham")
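
As a rough sketch of steps 1 through 3, the cell below counts how often each word shows up in spam and ham emails. The five emails and the helper `word_given_class` are made up for illustration; a real dataset would of course be much larger.

```python
from collections import Counter

# Hypothetical toy emails (made up for illustration), labeled spam or ham
emails = [
    ("win cash now", "spam"),
    ("cheap cash prizes", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
    ("send cash for lunch", "ham"),
]

# For each class, count how many emails contain each word
doc_counts = {"spam": Counter(), "ham": Counter()}
class_totals = Counter()
for text, label in emails:
    class_totals[label] += 1
    doc_counts[label].update(set(text.split()))


def word_given_class(word, label):
    """Fraction of emails in `label` that contain `word`."""
    return doc_counts[label][word] / class_totals[label]


print(word_given_class("cash", "spam"))  # share of spam emails containing "cash"
print(word_given_class("cash", "ham"))   # share of ham emails containing "cash"
```

Using `set(text.split())` counts each word at most once per email, so the result is the fraction of emails containing the word rather than a raw word frequency.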

### What's So Great About This?

- We can keep updating our belief based on the emails we detect
- Relatively simple
- Can expand to multiple words

## The Naive Assumption

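$P(A,B) = P(A\cap B) = P(A)\ P(B)$ only if $A$ and $B$ are independent.

The "naive" assumption is that the words in an email are independent of one another given its class. In practice this is rarely exactly true, but it is usually a good enough approximation.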

## The Formula

Let's say the word that occurs is "cash":

$$ P(🥫 | "cash") = \frac{P("cash" | 🥫)P(🥫)}{P("cash")}$$

### What Parts Can We Find?

- $P("cash")$
    * That's just the probability of finding the word "cash"! Frequency of the word!
- $P(🥫)$
    * Well, we start with some data (*prior knowledge*): The frequency of the spam occurring!
- $P("cash" | 🥫)$
    * How frequently "cash" is used in known spam emails. Count the frequency across all spam emails
    

## Calculating the Probability That Our Email Is Spam

In [None]:
# Let's just say 2% of all emails have the word "cash" in them
p_cash = 0.02

# We normally would measure this from our data, but we'll take 
# it that 10% of all emails we collected were spam
p_spam = 0.10

# 12% of all spam emails have the word "cash"
p_cash_given_its_spam = 0.12

In [None]:
p_spam_given_cash = p_cash_given_its_spam * p_spam / p_cash
print(f'If the email has the word "cash" in it, there is a '
      f'{p_spam_given_cash:.0%} chance the email is spam')

> **Check it**: Does this make sense? <br>
> Suppose I had 250 total emails.
> - How many should I expect to have the word 'cash' in them? (Ans. 5)
> - How many should I expect to be spam? (Ans. 25)
> - How many *of the spam emails* should I expect to have the word 'cash' in them? (Ans. 3)
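
The same check can be written out as a short calculation. This is just a sketch that restates the numbers above; the variable names are chosen here for readability.

```python
total_emails = 250
p_cash, p_spam, p_cash_given_spam = 0.02, 0.10, 0.12

n_cash = total_emails * p_cash                 # emails containing "cash"
n_spam = total_emails * p_spam                 # spam emails
n_spam_with_cash = n_spam * p_cash_given_spam  # spam emails containing "cash"

# Of the emails that contain "cash", what fraction are spam?
frac_spam_given_cash = n_spam_with_cash / n_cash

print(round(n_cash), round(n_spam), round(n_spam_with_cash))
print(round(frac_spam_given_cash, 2))
```

The final fraction is the same quantity Bayes's Theorem produced, obtained here by counting emails instead of multiplying probabilities.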

## Extending It With Multiple Words

> The more words we take into account, the more certain we can be about whether an email is or isn't spam

Spam:

$$ P(🥫\ |"buy",\ "cash") \propto P("buy",\ "cash"|\ 🥫)\ P(🥫)$$


But because of the (naive) independence assumption: 
    
$$ P("buy",\ "cash"|\ 🥫) = P("buy"|\ 🥫)\ P("cash"|\ 🥫)$$

Normalize by dividing by the sum over both classes:

$$
P(🥫\ |"buy",\ "cash")  =
    \frac
        {P("buy"|\ 🥫)P("cash"|\ 🥫)\ P(🥫)}
        {P("buy"|\ 🥫)P("cash"|\ 🥫)\ P(🥫) + P("buy"|\ 🐷)P("cash"|\ 🐷)\ P(🐷)}
$$



> **Note:** If we only want the most probable class (especially useful for *multiclass* problems), we just pick the class with the largest numerator, since the denominator is the same for every class
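
Here is a minimal sketch of the two-word calculation. All of the probabilities below are made-up numbers chosen for illustration, and `posteriors` is a hypothetical helper, not part of any library.

```python
# Hypothetical per-class word probabilities (made-up numbers)
likelihoods = {
    "spam": {"buy": 0.20, "cash": 0.12},
    "ham": {"buy": 0.05, "cash": 0.01},
}
priors = {"spam": 0.10, "ham": 0.90}


def posteriors(words):
    """Posterior for each class, treating words as independent given the class."""
    numerators = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]
        numerators[label] = score
    evidence = sum(numerators.values())  # shared denominator
    return {label: num / evidence for label, num in numerators.items()}


post = posteriors(["buy", "cash"])
print(post)
print(max(post, key=post.get))  # class with the largest numerator
```

Because every class shares the same denominator, taking the maximum of the numerators would give the same predicted class as taking the maximum of the normalized posteriors.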

# Naive Bayes Modeling Example

## Using Bayes's Theorem for Classification

Let's recall Bayes's Theorem:

$\large P(h|e) = \frac{P(h)P(e|h)}{P(e)}$

### Does this look like a classification problem?

- Suppose we have three competing hypotheses $\{h_1, h_2, h_3\}$ that would explain our evidence $e$.
    - Then we could use Bayes's Theorem to calculate the posterior probabilities for each of these three:
        - $P(h_1|e) = \frac{P(h_1)P(e|h_1)}{P(e)}$
        - $P(h_2|e) = \frac{P(h_2)P(e|h_2)}{P(e)}$
        - $P(h_3|e) = \frac{P(h_3)P(e|h_3)}{P(e)}$
        
- Suppose the evidence is a collection of elephant heights.
- Suppose each of the three hypotheses claims that the elephant whose measurements we have belongs to one of the three extant elephant species (*L. africana*, *L. cyclotis*, and *E. maximus*).

In that case the left-hand sides of these equations represent the probability that the elephant in question belongs to a given species.

If we think of the species as our target, then **this is just an ordinary classification problem**.

What about the right-hand sides of the equations? **These other probabilities we can calculate from our dataset.**

- The priors can simply be taken to be the percentages of the different classes in the dataset.
- What about the likelihoods?
    - If the relevant features are **categorical**, we can simply count the numbers of each category in the dataset. For example, if the feature is whether the elephant has tusks or not, then, to calculate the likelihoods, we'll just count the tusked and non-tusked elephants per species.
    - If the relevant features are **numerical**, we'll have to do something else. A good way of proceeding is to rely on (presumed) underlying distributions of the data. [Here](https://medium.com/analytics-vidhya/use-naive-bayes-algorithm-for-categorical-and-numerical-data-classification-935d90ab273f) is an example of using the normal distribution to calculate likelihoods. We'll follow this idea below for our elephant data.
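
For the categorical case, counting can be sketched with `pd.crosstab`. The tusk observations below are invented for illustration and do not come from the elephant dataset.

```python
import pandas as pd

# Hypothetical tusk observations (made up for illustration)
toy = pd.DataFrame({
    "species": ["africana"] * 4 + ["maximus"] * 4,
    "tusked": ["yes", "yes", "yes", "no", "yes", "no", "no", "no"],
})

# Likelihoods P(tusked | species): counts normalized within each species
likelihood_table = pd.crosstab(toy["species"], toy["tusked"], normalize="index")
print(likelihood_table)

# Priors P(species): share of each species in the data
priors = toy["species"].value_counts(normalize=True)
print(priors)
```

Each row of `likelihood_table` sums to 1, since it describes the distribution of the feature within a single class.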

## Elephant Example

Suppose we have a dataset that looks like this:

In [None]:
elephs = pd.read_csv('data/elephants.csv', usecols=['height (cm)',
                                                   'species'])

In [None]:
elephs.head()

In [None]:
elephs.shape

In [None]:
plt.style.use('fivethirtyeight')

fig, ax = plt.subplots()

sns.kdeplot(data=elephs[elephs['species'] == 'maximus']['height (cm)'],
            ax=ax, label='maximus')
sns.kdeplot(data=elephs[elephs['species'] == 'africana']['height (cm)'],
            ax=ax, label='africana')
sns.kdeplot(data=elephs[elephs['species'] == 'cyclotis']['height (cm)'],
            ax=ax, label='cyclotis')

plt.legend();

### Naive Bayes by Hand

Suppose we want to predict the species of some new elephant whose height we've just recorded. We'll suppose the new elephant has:

In [None]:
new_ht = 263

What we want to calculate is the mean and standard deviation for height for each elephant species. We'll use these to calculate the relevant likelihoods.

So:

In [None]:
max_stats = elephs[elephs['species'] == 'maximus'].describe().loc[['mean', 'std'], :]
max_stats

In [None]:
cyc_stats = elephs[elephs['species'] == 'cyclotis'].describe().loc[['mean', 'std'], :]
cyc_stats

In [None]:
afr_stats = elephs[elephs['species'] == 'africana'].describe().loc[['mean', 'std'], :]
afr_stats

In [None]:
elephs['species'].value_counts()

### Calculation of Likelihoods

We'll use the PDFs of the normal distributions with the discovered means and standard deviations to calculate likelihoods:

In [None]:
stats.norm(loc=max_stats.loc['mean', 'height (cm)'],
           scale=max_stats.loc['std', 'height (cm)']).pdf(new_ht)

In [None]:
stats.norm(loc=cyc_stats.loc['mean', 'height (cm)'],
           scale=cyc_stats.loc['std', 'height (cm)']).pdf(new_ht)

In [None]:
stats.norm(loc=afr_stats.loc['mean', 'height (cm)'],
           scale=afr_stats.loc['std', 'height (cm)']).pdf(new_ht)

### Posteriors

What we have just calculated are the likelihoods (strictly, values of the probability density at 263 cm rather than probabilities), i.e.:

- $P(height=263 | species=maximus) \approx 0.0204$
- $P(height=263 | species=cyclotis) \approx 0.0150$
- $P(height=263 | species=africana) \approx 0.0090$

(Notice that they do NOT sum to 1!) But what we'd really like to know are the posteriors, i.e. what are:

- $P(species=maximus | height=263)$?
- $P(species=cyclotis | height=263)$?
- $P(species=africana | height=263)$?

Since we have equal numbers of each species, every prior is equal to $\frac{1}{3}$. Thus we can calculate the probability of the evidence:

$P(height=263) = \frac{1}{3}(0.0204 + 0.0150 + 0.0090) = 0.0148$

And therefore calculate the posteriors using Bayes's Theorem:

- $P(species=maximus | height=263) = \frac{1}{3}\frac{0.0204}{0.0148} = 45.9\%$;
- $P(species=cyclotis | height=263) = \frac{1}{3}\frac{0.0150}{0.0148} = 33.8\%$;
- $P(species=africana | height=263) = \frac{1}{3}\frac{0.0090}{0.0148} = 20.3\%$.

Bayes's Theorem shows us that the largest posterior belongs to the *maximus* species. (Note also that, since the priors are all the same, the largest posterior will necessarily belong to the species with the largest likelihood!)

Therefore, the *maximus* species will be our prediction for an elephant of this height.
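
The arithmetic above can be reproduced in a few lines. This sketch hard-codes the three likelihood values quoted in the text instead of recomputing them from the data.

```python
import numpy as np

species = ["maximus", "cyclotis", "africana"]
likelihoods = np.array([0.0204, 0.0150, 0.0090])  # PDF values quoted above
priors = np.full(3, 1 / 3)                        # equal class sizes

evidence = np.sum(priors * likelihoods)           # P(height=263)
posteriors = priors * likelihoods / evidence

for name, post in zip(species, posteriors):
    print(f"{name}: {post:.1%}")
print("prediction:", species[int(np.argmax(posteriors))])
```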

### More Dimensions

In fact, we also have elephant *weight* data available in addition to their heights. To accommodate multiple features we can make use of **multivariate normal** distributions.

![multivariate-normal](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/MultivariateNormal.png/440px-MultivariateNormal.png)

#### What's "Naive" about This?

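For multiple predictors, we make the simplifying assumption that **our predictors are probabilistically independent** within each class. This will often be unrealistic, but it simplifies our calculations a great deal. Note that a multivariate normal with a full covariance matrix, as used in the cells below, also models the correlation between height and weight; the strictly naive version would keep only the diagonal of the covariance matrix.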
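
To see what the independence assumption buys us, here is a small sketch. Under independence, the joint likelihood is just a product of one-dimensional normal PDFs, which matches a multivariate normal whose covariance matrix is diagonal. The means and standard deviations below are made-up numbers, not statistics from `elephants.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical class statistics (made-up numbers for illustration)
means = np.array([270.0, 7000.0])   # height (cm), weight (lbs)
stds = np.array([20.0, 900.0])
x_new = np.array([263.0, 7009.0])

# Naive likelihood: multiply one-dimensional normal PDFs, one per feature
naive = np.prod(stats.norm(loc=means, scale=stds).pdf(x_new))

# Same value from a multivariate normal with a diagonal covariance matrix
diagonal = stats.multivariate_normal(mean=means,
                                     cov=np.diag(stds ** 2)).pdf(x_new)

print(naive, diagonal)
```

The two printed values should agree up to floating-point error, which is the sense in which the naive model ignores any correlation between features.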

In [None]:
elephants = pd.read_csv('data/elephants.csv',
                       usecols=['height (cm)', 'weight (lbs)', 'species'])

In [None]:
elephants.head()

In [None]:
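# Keep only the numeric columns so that .mean() and .cov() work on them
features = ['height (cm)', 'weight (lbs)']
maximus = elephants.loc[elephants['species'] == 'maximus', features]
cyclotis = elephants.loc[elephants['species'] == 'cyclotis', features]
africana = elephants.loc[elephants['species'] == 'africana', features]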

Suppose our new elephant with a height of 263 cm also has a weight of 7009 lbs.

In [None]:
maximus.mean()

In [None]:
likeli_max = stats.multivariate_normal(mean=maximus.mean(),
                          cov=maximus.cov()).pdf([263, 7009])
likeli_max

In [None]:
likeli_cyc = stats.multivariate_normal(mean=cyclotis.mean(),
                         cov=cyclotis.cov()).pdf([263, 7009])
likeli_cyc

In [None]:
likeli_afr = stats.multivariate_normal(mean=africana.mean(),
                         cov=africana.cov()).pdf([263, 7009])
likeli_afr

#### Posteriors

In [None]:
# The priors are equal (1/3 each), so they cancel out of Bayes's Theorem
# and each posterior is just a normalized likelihood.
total_likeli = sum([likeli_max, likeli_cyc, likeli_afr])
post_max = likeli_max / total_likeli
post_cyc = likeli_cyc / total_likeli
post_afr = likeli_afr / total_likeli

print(post_max)
print(post_cyc)
print(post_afr)

### [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

In [None]:
gnb = GaussianNB(priors=[1/3, 1/3, 1/3])

In [None]:
X = elephants.drop('species', axis=1)
y = elephants['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gnb.fit(X_train, y_train)

In [None]:
new_elephant = pd.DataFrame({'height (cm)': [263], 'weight (lbs)': [7009]})
gnb.predict_proba(new_elephant[X.columns])

In [None]:
gnb.score(X_test, y_test)

In [None]:
plot_confusion_matrix(gnb, X_test, y_test);

## Comma Survey Example

In [None]:
commas = pd.read_csv('data/comma-survey.csv')

In [None]:
commas.head()

In [None]:
commas.isna().sum().sum()

We'll go ahead and drop the NaNs:

In [None]:
commas = commas.dropna()

In [None]:
commas.shape

The first question on the survey was about the Oxford comma.

In [None]:
commas['In your opinion, which sentence is more gramatically correct?'].value_counts()

Personally, I like the Oxford comma, since it can help eliminate ambiguities, such as:

"This book is dedicated to my parents, Ayn Rand, and God" <br/> vs. <br/>
"This book is dedicated to my parents, Ayn Rand and God"

Let's see how a Naive Bayes model would make a prediction here. We'll think of the comma preference as our target.

In [None]:
commas['Age'].value_counts()

Suppose we want to make a prediction about Oxford comma usage for a new person who falls into the **45-60 age group**.

### Calculating Priors and Likelihoods

The following code makes a table of values that count up the number of survey respondents who fall into each of eight bins (the four age groups and the two answers to the first comma question). 

In [None]:
table = np.zeros((2, 4))

question = 'In your opinion, which sentence is more gramatically correct?'
oxford = "It's important for a person to be honest, kind, and loyal."
no_oxford = "It's important for a person to be honest, kind and loyal."

# Row 0 counts Oxford-comma answers, row 1 counts the alternative
for idx, value in enumerate(commas['Age'].value_counts().index):
    in_age_group = commas['Age'] == value
    table[0, idx] = ((commas[question] == oxford) & in_age_group).sum()
    table[1, idx] = ((commas[question] == no_oxford) & in_age_group).sum()

In [None]:
table

In [None]:
df = pd.DataFrame(table, columns=['Age45-60',
                            'Age>60',
                            'Age30-44',
                            'Age18-29'])
df

In [None]:
df['Oxford'] = [True, False]
df = df[['Age>60', 'Age45-60', 'Age30-44', 'Age18-29', 'Oxford']]
df

Since all we have is a single categorical feature here, we can just read our likelihoods and priors right off of this table:

Likelihoods:

- Age45-60:
    - P(Age45-60 | Oxford=True) = $\frac{123}{470} = 0.2617$;
    - P(Age45-60 | Oxford=False) = $\frac{125}{355} = 0.3521$.

Priors:

- P(Oxford=True) = $\frac{470}{825} = 0.5697$;
- P(Oxford=False) = $\frac{355}{825} = 0.4303$.

### Calculating Posteriors

First we'll calculate the probability of the evidence:

$$\begin{align} 
    P(Age45-60) &= P(Age45-60 | Oxford=True) \cdot P(Oxford=True) \\
                & \hspace{1cm} + P(Age45-60 | Oxford=False) \cdot P(Oxford=False)\\ 
                &= 0.2617 \cdot 0.5697 + 0.3521 \cdot 0.4303 \\
                &= 0.3006
\end{align}$$

In [None]:
# This calculation should also yield P(e):
# It's the proportion of 45-60-yr.-olds
# in the data.

(123+125) / 825

Now use Bayes's Theorem to calculate the posteriors:

$$\begin{align}
P(Oxford=True | Age45-60) &= P(Oxford=True) \cdot P(Age45-60 | Oxford=True) / P(Age45-60) \\
                          &= 0.5697 \cdot 0.2617 / 0.3006 \\
                          &= 0.4960 \\
                          \\
P(Oxford=False | Age45-60) &= P(Oxford=False) \cdot P(Age45-60 | Oxford=False) / P(Age45-60) \\ 
                          &= 0.4303 \cdot 0.3521 / 0.3006 \\
                          &= 0.5040
\end{align}$$

Close! But our prediction for someone in the 45-60 age group will be that they **do not** favor the Oxford comma.
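
The same posteriors can be computed directly from the four counts read off the table. The variable names below are chosen here for readability.

```python
# Counts read from the table above
n_oxford, n_no_oxford = 470, 355           # respondents per answer
n_age_oxford, n_age_no_oxford = 123, 125   # 45-60-year-olds per answer
n_total = n_oxford + n_no_oxford

prior_true = n_oxford / n_total
prior_false = n_no_oxford / n_total
like_true = n_age_oxford / n_oxford
like_false = n_age_no_oxford / n_no_oxford

# Probability of the evidence, then Bayes's Theorem for each answer
evidence = like_true * prior_true + like_false * prior_false
post_true = prior_true * like_true / evidence
post_false = prior_false * like_false / evidence

print(round(post_true, 4), round(post_false, 4))
```

With a single feature the priors and class totals cancel, so each posterior reduces to a simple ratio of counts within the age group (for example, 123 out of 248).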

### Comparison with [`MultinomialNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [None]:
comma_model = MultinomialNB()

ohe = OneHotEncoder()
ohe.fit(commas['Age'].values.reshape(-1, 1))

X = ohe.transform(commas['Age'].values.reshape(-1, 1)).toarray()
y = commas['In your opinion, which sentence is more gramatically correct?']

In [None]:
comma_model.fit(X, y)

In [None]:
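# One-hot row for the 45-60 age group, assuming it is the third
# category in ohe.categories_ (worth checking before trusting the output)
comma_model.predict_proba(np.array([0, 0, 1, 0]).reshape(1, -1))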
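
To compare against the hand calculation without the survey file, the sketch below rebuilds a small one-hot dataset from the counts quoted earlier (123 of 470 and 125 of 355). Lumping every other age group into a single column is a simplification made here, and the tiny `alpha` is chosen so that smoothing barely moves the estimates.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Column 0: respondent is 45-60; column 1: any other age group
in_group, other = [1, 0], [0, 1]
X_toy = np.array([in_group] * 123 + [other] * (470 - 123)
                 + [in_group] * 125 + [other] * (355 - 125))
y_toy = np.array([True] * 470 + [False] * 355)   # prefers the Oxford comma?

# A tiny alpha keeps Laplace smoothing from noticeably shifting the estimates
model = MultinomialNB(alpha=1e-9).fit(X_toy, y_toy)
proba = model.predict_proba(np.array([in_group]))[0]

print(model.classes_)   # column order of predict_proba
print(proba)
```

With the default `alpha=1.0` the probabilities would shift slightly because of smoothing, which is one reason a fitted model may not match a hand calculation exactly.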