# Lecture 16 – Naive Bayes

## DSC 40A, Fall 2021

In [None]:
import pandas as pd
import numpy as np

In [None]:
comics = pd.read_csv('data/comics_compact.csv')
comics

We have a large dataset with four columns: `'HAIR'`, `'SEX'`, `'COMPANY'`, and `'ALIGN'`. Let's use Naive Bayes with smoothing to predict the `'ALIGN'` value for a character given its values in the other three columns.

### What is the predicted alignment (bad, good, neutral) of a female Marvel character with blond hair?

There are several probabilities we need to compute.
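As a sketch of where those probabilities end up, here is the Naive Bayes estimate for the "bad" class, written out for this particular character. The factoring of the likelihood into one term per feature relies on the conditional independence assumption.

$$P(\text{bad|features}) = \frac{P(\text{bad}) \cdot P(\text{female|bad}) \cdot P(\text{Marvel|bad}) \cdot P(\text{blond|bad})}{P(\text{features})}$$

The denominator $P(\text{features})$ is the same for all three classes, so to pick the most likely class we only need to compare the numerators.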

In [None]:
n_bad = comics[comics.get('ALIGN') == 'Bad'].shape[0] + 1
n_good = comics[comics.get('ALIGN') == 'Good'].shape[0] + 1
n_neutral = comics[comics.get('ALIGN') == 'Neutral'].shape[0] + 1

p_bad = n_bad / (comics.shape[0] + 1)
p_good = n_good / (comics.shape[0] + 1)
p_neutral = n_neutral / (comics.shape[0] + 1)

# P(bad|features)
p_female_bad = (comics[(comics.get('SEX') == 'Female') & (comics.get('ALIGN') == 'Bad')].shape[0] + 1) / n_bad
p_marvel_bad = (comics[(comics.get('COMPANY') == 'Marvel') & (comics.get('ALIGN') == 'Bad')].shape[0] + 1) / n_bad
p_blond_bad = (comics[(comics.get('HAIR') == 'Blond Hair') & (comics.get('ALIGN') == 'Bad')].shape[0] + 1) / n_bad

p_bad_features_numerator = p_bad * p_female_bad * p_marvel_bad * p_blond_bad

# P(good|features)
p_female_good = (comics[(comics.get('SEX') == 'Female') & (comics.get('ALIGN') == 'Good')].shape[0] + 1) / n_good
p_marvel_good = (comics[(comics.get('COMPANY') == 'Marvel') & (comics.get('ALIGN') == 'Good')].shape[0] + 1) / n_good
p_blond_good = (comics[(comics.get('HAIR') == 'Blond Hair') & (comics.get('ALIGN') == 'Good')].shape[0] + 1) / n_good

p_good_features_numerator = p_good * p_female_good * p_marvel_good * p_blond_good

# P(neutral|features)
p_female_neutral = (comics[(comics.get('SEX') == 'Female') & (comics.get('ALIGN') == 'Neutral')].shape[0] + 1) / n_neutral
p_marvel_neutral = (comics[(comics.get('COMPANY') == 'Marvel') & (comics.get('ALIGN') == 'Neutral')].shape[0] + 1) / n_neutral
p_blond_neutral = (comics[(comics.get('HAIR') == 'Blond Hair') & (comics.get('ALIGN') == 'Neutral')].shape[0] + 1) / n_neutral

p_neutral_features_numerator = p_neutral * p_female_neutral * p_marvel_neutral * p_blond_neutral

In [None]:
numerators = p_bad_features_numerator, p_good_features_numerator, p_neutral_features_numerator
numerators

Since `p_good_features_numerator` is the largest, we'd predict that a female Marvel character with blond hair is good.
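Rather than reading off the largest value by eye, we could let Python choose it. The following is a minimal sketch in plain Python; the numbers in `example_numerators` are made-up placeholders, not the actual output of the cell above.

```python
# Hypothetical numerators, in the order (bad, good, neutral).
# These values are placeholders for illustration, not real results.
classes = ['Bad', 'Good', 'Neutral']
example_numerators = (0.010, 0.014, 0.004)

# Pair each class with its numerator and keep the class whose numerator is largest.
predicted = max(zip(classes, example_numerators), key=lambda pair: pair[1])[0]
print(predicted)
```

The same idea works with the real `numerators` tuple, as long as the class names are listed in the same order as the numerators.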

### What is the predicted alignment for any character, given their hair color, sex, and company?

What if we want to generalize this so that it works for any combination of `'HAIR'`, `'SEX'`, and `'COMPANY'`? Sounds like a job for a function.

In [None]:
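def predict_alignment(hair, sex, company, return_numerators=False):
    """Predict 'ALIGN' for the given features using Naive Bayes with smoothing.

    If return_numerators is True, return the array of numerators for
    ('Bad', 'Good', 'Neutral') instead of the predicted class.
    """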
    numerators = np.array([])
    unique_align = ['Bad', 'Good', 'Neutral']
    for align in unique_align:
        n_align = comics[comics.get('ALIGN') == align].shape[0] + 1
        p_align = n_align / (comics.shape[0] + 1)
        
        p_sex_align = (comics[(comics.get('SEX') == sex) & (comics.get('ALIGN') == align)].shape[0] + 1) / n_align
        p_company_align = (comics[(comics.get('COMPANY') == company) & (comics.get('ALIGN') == align)].shape[0] + 1) / n_align
        p_hair_align = (comics[(comics.get('HAIR') == hair) & (comics.get('ALIGN') == align)].shape[0] + 1) / n_align

        p_align_features_numerator = p_align * p_sex_align * p_company_align * p_hair_align
        
        numerators = np.append(numerators, p_align_features_numerator)
        
    align_numerators = pd.DataFrame().assign(classes=unique_align,
                                             numerators=numerators)
    
    if return_numerators:
        return numerators
    return align_numerators.sort_values('numerators', ascending=False).get('classes').iloc[0]

In [None]:
predict_alignment('Blond Hair', 'Female', 'Marvel')

In [None]:
predict_alignment('Black Hair', 'Male', 'DC')

Cool! We've built our very own classifier.

### Extra: Naive Bayes in `sklearn`

**Note:** The rest of this notebook is entirely optional; you are not expected to know or remember any of this.

Just to show you what is out there, we'll try out the implementation of Naive Bayes in `sklearn`, an industry-standard Python package for machine learning that you'll get more experience with in DSC 80.

In [None]:
from sklearn.naive_bayes import CategoricalNB

`CategoricalNB` is the Naive Bayes classifier in `sklearn` that we'll use. (It has "Categorical" in the name because it expects all of the features to be categorical; we will not look at Naive Bayes for non-categorical features, but it exists.)

A slight annoyance with `sklearn`'s Naive Bayes is that it requires all columns to be stored as integers, even though they're all categorical.
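The next cell writes out each integer mapping by hand. As a rough sketch of how such a mapping could instead be built automatically, here is a plain-Python version; `encode_column` is a hypothetical helper and `toy_company` is a made-up list, not the real `comics` data.

```python
def encode_column(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    codes = [mapping[value] for value in values]
    return codes, mapping

# Made-up example column; the real data lives in the comics DataFrame.
toy_company = ['Marvel', 'DC', 'DC', 'Marvel']
codes, mapping = encode_column(toy_company)
print(codes)
print(mapping)
```

Because the codes are assigned in sorted order, `'DC'` gets 0 and `'Marvel'` gets 1 here. Keeping the `mapping` dictionary around matters, since we need it to translate predictions back into category names.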

In [None]:
comics_reformatted = comics.copy()
comics_reformatted['HAIR'] = (comics_reformatted['HAIR'] == 'Blond Hair').astype(int)
comics_reformatted['ALIGN'] = comics_reformatted['ALIGN'].replace({'Bad': 0, 'Good': 1, 'Neutral': 2})
comics_reformatted['SEX'] = comics_reformatted['SEX'].replace({'Male': 0, 'Female': 1})
comics_reformatted['COMPANY'] = comics_reformatted['COMPANY'].replace({'DC': 0, 'Marvel': 1})

In [None]:
comics_reformatted

Here's the mapping between integers and categories:

- `'ALIGN'`: 0 is bad, 1 is good, 2 is neutral.
- `'HAIR'`: 0 is not blond, 1 is blond.
- `'SEX'`: 0 is male, 1 is female.
- `'COMPANY'`: 0 is DC, 1 is Marvel.

Below, we'll create a `CategoricalNB` object and fit it to our training data. (By setting `alpha=1`, we're telling `sklearn` to use smoothing the way that we've defined it in class.)
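To give a rough idea of what a smoothing parameter like `alpha` controls, here is a sketch of one common form of additive smoothing. The function name `smoothed_conditional` and the counts are made up for illustration, and the exact formula used in class or inside `sklearn` may differ in its details.

```python
def smoothed_conditional(count_feature_and_class, count_class, n_categories, alpha=1):
    """Estimate P(feature value | class) with additive smoothing.

    Adds alpha to the numerator and alpha * n_categories to the denominator,
    so the estimates over all values of the feature still sum to 1.
    """
    return (count_feature_and_class + alpha) / (count_class + alpha * n_categories)

# Hypothetical counts: 4 characters in a class, 3 of one sex and 1 of the other.
p_first = smoothed_conditional(3, 4, n_categories=2)   # (3 + 1) / (4 + 2)
p_second = smoothed_conditional(1, 4, n_categories=2)  # (1 + 1) / (4 + 2)
print(p_first, p_second, p_first + p_second)
```

With `alpha=0` this reduces to the unsmoothed ratio of counts; a positive `alpha` keeps a combination that never appears in the data from getting probability 0.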

In [None]:
model = CategoricalNB(alpha=1)
model.fit(X=comics_reformatted[['HAIR', 'SEX', 'COMPANY']], y=comics_reformatted['ALIGN'])

Now we can use `model.predict` to make predictions about other characters. Each time we use it, we'll compare its result to that of `predict_alignment` to confirm that they're the same (spoiler alert: they always will be).

**Black hair (0), male (0), DC (0)**

In [None]:
model.predict([[0, 0, 0]])

In [None]:
predict_alignment('Black Hair', 'Male', 'DC')

**Blond hair (1), female (1), DC (0)**

In [None]:
model.predict([[1, 1, 0]])

In [None]:
predict_alignment('Blond Hair', 'Female', 'DC')

**Blond hair (1), male (0), Marvel (1)**

In [None]:
model.predict([[1, 0, 1]])

In [None]:
predict_alignment('Blond Hair', 'Male', 'Marvel')

### Extra extra: `predict_proba`

It seems like `model.predict` and `predict_alignment` give the same results each time. It turns out that we can "peek under the hood" for `model.predict` and see the probabilities it's calculating in order to make a decision.

Let's consider the most recent example, where we predicted the alignment of a male Marvel character with blond hair.

The output below says that `sklearn` estimates a 0.429 chance that the character is bad, a 0.409 chance that the character is good, and a 0.162 chance that the character is neutral.


In [None]:
model.predict_proba([[1, 0, 1]])

We can extract the corresponding quantities from `predict_alignment` by passing in the argument `return_numerators=True`.

In [None]:
numerators = predict_alignment('Blond Hair', 'Male', 'Marvel', return_numerators=True)
numerators

These are only the numerators of $P(\text{bad|features})$, $P(\text{good|features})$, and $P(\text{neutral|features})$, respectively. Note that they don't add up to 1.

To instead compute $P(\text{bad|features})$ in its entirety, we'd need to know $P(\text{features})$, which we haven't really discussed. To estimate this denominator, we'll use the fact that $P(\text{bad|features})$, $P(\text{good|features})$, and $P(\text{neutral|features})$ should add up to 1, since given some features, a character must be either bad, good, or neutral.
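Written out as a sketch, with $\text{num}_{\text{bad}}$, $\text{num}_{\text{good}}$, and $\text{num}_{\text{neutral}}$ as shorthand (introduced here, not used elsewhere) for the three numerators, the estimate for the "bad" class is:

$$P(\text{bad|features}) \approx \frac{\text{num}_{\text{bad}}}{\text{num}_{\text{bad}} + \text{num}_{\text{good}} + \text{num}_{\text{neutral}}}$$

The sum in the denominator plays the role of $P(\text{features})$. The estimates for "good" and "neutral" use the same denominator, with their own numerator on top.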

In [None]:
numerators / np.sum(numerators)

The above array represents the estimated values of $P(\text{bad|features})$, $P(\text{good|features})$, and $P(\text{neutral|features})$. Note that they're almost identical to those resulting from `model.predict_proba([[1, 0, 1]])`!