In [3]:
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
import numpy as np
from plotnine import *


from sklearn.naive_bayes import GaussianNB, BernoulliNB, CategoricalNB # Naive Bayes

from sklearn import metrics 
from sklearn.preprocessing import StandardScaler #Z-score variables

from sklearn.model_selection import train_test_split # simple TT split cv
from sklearn.model_selection import KFold # k-fold cv
from sklearn.model_selection import LeaveOneOut #LOO cv
from sklearn.model_selection import cross_val_score # cross validation metrics
from sklearn.model_selection import cross_val_predict # cross validation metrics
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay # plot_confusion_matrix was removed in scikit-learn 1.2

#set precision to get rid of some scientific notation
%precision %.7g

'%.7g'

## 0. Together

### 0.0 Probability and Conditional Probability

#### *Question*
What is the difference between a conditional probability and a regular probability (for example $P(dog)$ vs $P(dog | kids)$)?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />

Using the table below, how would we calculate $P(dog)$? $P(dog | kids)$?

|           | dog | kid |
|-----------|-----|-----|
| Person 1  | 1   | 1   |
| Person 2  | 1   | 1   |
| Person 3  | 1   | 0   |
| Person 4  | 1   | 1   |
| Person 5  | 1   | 0   |
| Person 6  | 1   | 1   |
| Person 7  | 0   | 0   |
| Person 8  | 0   | 1   |
| Person 9  | 0   | 1   |
| Person 10 | 0   | 1   |
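One way to check the two quantities is to type the table into pandas and take means. This is only a sketch; the name `pets` is made up for illustration, and the values are copied from the table above.

```python
import pandas as pd

# Rows copied from the table above (Person 1 through Person 10).
pets = pd.DataFrame({
    "dog": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "kid": [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
})

# P(dog): the share of all ten people who have a dog.
p_dog = pets["dog"].mean()

# P(dog | kids): keep only the people with kids, then take the share with a dog.
p_dog_given_kids = pets.loc[pets["kid"] == 1, "dog"].mean()

print(p_dog, p_dog_given_kids)
```

The conditional probability uses a smaller denominator (the 7 people with kids) instead of all 10 people.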

Using the table below, how would we calculate $P(dog | kids, over20)$?

|           | dog | kid | over20 |
|-----------|-----|-----|--------|
| Person 1  | 1   | 1   | 1      |
| Person 2  | 1   | 1   | 0      |
| Person 3  | 1   | 0   | 0      |
| Person 4  | 1   | 1   | 1      |
| Person 5  | 1   | 0   | 1      |
| Person 6  | 1   | 1   | 1      |
| Person 7  | 0   | 0   | 1      |
| Person 8  | 0   | 1   | 0      |
| Person 9  | 0   | 1   | 0      |
| Person 10 | 0   | 1   | 1      |
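Conditioning on two things works the same way: filter on both conditions, then take the share with a dog. A minimal sketch, with the hypothetical name `pets` and values copied from the table above:

```python
import pandas as pd

# Rows copied from the table above (Person 1 through Person 10).
pets = pd.DataFrame({
    "dog":    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "kid":    [1, 1, 0, 1, 0, 1, 0, 1, 1, 1],
    "over20": [1, 0, 0, 1, 1, 1, 1, 0, 0, 1],
})

# Keep only the people who have kids AND are over 20.
has_kids_and_over20 = (pets["kid"] == 1) & (pets["over20"] == 1)

# P(dog | kids, over20): share of that subgroup with a dog.
p_dog_given_both = pets.loc[has_kids_and_over20, "dog"].mean()

print(p_dog_given_both)
```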

### 0.1 Naive
Naive Bayes is a classification algorithm which assumes (incorrectly) that within a group/class, the probability of a combination of predictor values (like $P(diabetic, obese, smoker)$) is equal to the product of the individual predictor probabilities. In other words, it assumes that they are *independent* and that knowing someone is a smoker does *not* affect the probability of being diabetic. In mathematical terms, for example:

$$P(D,O,S) = P(D) * P(O) * P(S)$$

In real life we know that this independence is very unlikely (hence: *naive*). But it turns out that this inappropriate assumption usually doesn't have a huge effect on the accuracy of the model, and it saves a LOT of computational time because we can simply calculate independent probabilities and multiply them, rather than calculating complex conditional probabilities.
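To see what the naive assumption does numerically, the sketch below compares the observed joint probability with the product of the individual probabilities. The `patients` data are made up for illustration and are assumed to all belong to one class.

```python
import pandas as pd

# Hypothetical patients from a single class (made-up values).
patients = pd.DataFrame({
    "diabetic": [1, 1, 1, 0, 0, 1, 0, 0],
    "obese":    [1, 1, 1, 0, 0, 0, 1, 0],
    "smoker":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# Observed P(D, O, S): share of rows where all three predictors are 1.
joint = (patients == 1).all(axis=1).mean()

# Naive estimate: P(D) * P(O) * P(S), each taken as a column mean.
naive = patients.mean().prod()

print(joint, naive)
```

In this toy example the two numbers differ because the predictors are related to each other, which is exactly the situation the naive assumption ignores.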

### 0.2 Bayes
The Bayes part of the Naive Bayes algorithm refers to the fact that we calculate "scores" that measure how likely a data point is to belong to some class, $C$. These "scores" are proportional to the probability of a data point belonging to class $C$. Once we have a "score" for each possible category, we choose whichever category has the highest score. 

The "score" is based on Bayes' Theorem, which says:

$$P(Category | Data) \underbrace{\propto}_\text{is proportional to} \underbrace{P(Data | Category)}_{\text{How common this combination of predictors is for that Category}^1} * \underbrace{P(Category)}_\text{How common that category is in the dataset}$$


$^1$
For example, what is the probability that someone is diabetic, obese, and a smoker, given that they have heart disease?
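The scoring step can be sketched in a few lines of plain Python. All of the numbers below are hypothetical and chosen only to illustrate the arithmetic.

```python
# Hypothetical values, for illustration only.
priors = {"heart disease": 0.3, "no heart disease": 0.7}         # P(Category)
likelihoods = {"heart disease": 0.20, "no heart disease": 0.05}  # P(Data | Category)

# Score for each category: P(Data | Category) * P(Category).
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Predict whichever category has the highest score.
prediction = max(scores, key=scores.get)

# Dividing each score by the total turns the scores into probabilities that sum to 1.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

print(scores, prediction, posteriors)
```

Because every score is divided by the same total, normalizing never changes which category wins, which is why the proportional "score" is enough for classification.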

### 0.3 NB in sklearn
In sklearn there are three main classes you can use to perform Naive Bayes:

* `GaussianNB()`: Assumes that features follow a Normal/Gaussian Distribution.
* `BernoulliNB()`: Assumes features are binary (0/1)
* `CategoricalNB()`: Assumes features are discrete categories (can have more than 2 categories)

This means that if your features are continuous you'd use `GaussianNB()`, if they are only binary, use `BernoulliNB()` and if they are only Categorical, use `CategoricalNB()`. In practice, we'll often use either `GaussianNB()` or `CategoricalNB()` (since `CategoricalNB()` can also handle it when we have binary + categorical).

As a result, a single sklearn NB model cannot mix continuous and categorical predictors. (There are workarounds for this: see [here](https://stackoverflow.com/questions/14254203/mixing-categorial-and-continuous-data-in-naive-bayes-classifier-using-scikit-lea), but for now, we'll be using only one or the other).
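As a small illustration of the binary + categorical case, here is a sketch of `CategoricalNB()` on made-up data. The arrays `X` and `y` are hypothetical; the sketch assumes each category is coded as a non-negative integer (0, 1, 2, ...), which is what `CategoricalNB()` expects.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data: column 0 is binary, column 1 has three categories coded 0, 1, 2.
X = np.array([[0, 0], [0, 1], [1, 2], [1, 2],
              [0, 0], [1, 1], [1, 2], [0, 0]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

model = CategoricalNB()
model.fit(X, y)

# Predict the class of two new rows.
new_points = np.array([[1, 2], [0, 0]])
print(model.predict(new_points))
```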


## 1. Naive Bayes By Hand

### 1.1 Calculating Probabilities for Each Category

The dataframe `d` below is a (fake) dataset that we'll use to predict whether someone owns a home or not (the `own` column). For each outcome category (own-`1`, not own-`0`), calculate the probability of having a `1` in each of the predictor categories (having an income > 100k, being over 40, having kids, and having more than one income).


Store these probabilities in a dataframe. The dataframe should look like the table below, but with the actual probabilities instead of 1's.


<img src="https://drive.google.com/uc?export=view&id=1imX0dbPjiEy56kruM8c86A3wA1EqpA8Q" width = 250px/>


In [7]:
d = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/HomeOwnership.csv")
d.head(20)

### YOUR CODE HERE ###

Unnamed: 0,incomeOver100k,ageOver40,kids,morethan1Income,own
0,1,1,1,0,1
1,1,1,1,1,1
2,1,1,0,1,1
3,0,1,0,1,1
4,1,1,1,0,1
5,1,1,1,1,1
6,1,1,1,1,1
7,1,1,1,0,1
8,1,1,1,1,1
9,0,1,1,1,1


In [30]:
doesOwn = d["own"] == 1
doesntOwn = d["own"] == 0

homeowners = d.loc[doesOwn]
notHomeowners = d.loc[doesntOwn]

inOv100 = homeowners["incomeOver100k"].sum()
ageOv40 = homeowners["ageOver40"].sum()
kids = homeowners["kids"].sum()
moreIn = homeowners["morethan1Income"].sum()

own = [inOv100/len(homeowners), ageOv40/len(homeowners), kids/len(homeowners), moreIn/len(homeowners)]

inOv100 = notHomeowners["incomeOver100k"].sum()
ageOv40 = notHomeowners["ageOver40"].sum()
kids = notHomeowners["kids"].sum()
moreIn = notHomeowners["morethan1Income"].sum()

notOwn = [inOv100/len(notHomeowners), ageOv40/len(notHomeowners), kids/len(notHomeowners), moreIn/len(notHomeowners)]

names = ["incomeOver100k", "ageOver40", "kids", "morethan1Income"]

proba = pd.DataFrame({"names" : names,
                      "own" : own,
                      "not" : notOwn})

print(proba["own"][1])

proba.head()

1.0


Unnamed: 0,names,own,not
0,incomeOver100k,0.8,0.8
1,ageOver40,1.0,0.7
2,kids,0.8,0.3
3,morethan1Income,0.7,0.9


### 1.2 Predicting Category
Using the formula we learned in the Naive Bayes lecture, choose which category (own-`1` or not own-`0`) the following two people should be classified as:

| incomeOver100k | ageOver40 | kids | morethan1Income |
|----------------|-----------|------|-----------------|
| 0              | 1         | 1    | 0               |
| 1              | 1         | 0    | 1               |

In [21]:
### YOUR CODE HERE ###

# hint: to predict a single data point, use: data_point = np.array(dp).reshape(1,-1), where dp is a list with
# the predictor values, and then call .predict(data_point) on your model 

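def predict(p):
    x = np.array(p)  # 0/1 predictor values, in the same order as the rows of proba
    # P(value | class): the stored probability when the value is 1, otherwise 1 - probability
    like1 = np.where(x == 1, proba["own"], 1 - proba["own"])
    like0 = np.where(x == 1, proba["not"], 1 - proba["not"])
    # score = P(data | class) * P(class), using each class's share of the rows as the prior
    score1 = like1.prod() * len(homeowners) / len(d)
    score0 = like0.prod() * len(notHomeowners) / len(d)
    return 1 if score1 > score0 else 0

print(predict([0, 1, 1, 0]), predict([1, 1, 0, 1]))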

### 1.3 Build a NB in sklearn

#### *Question*
Now, build a Naive Bayes model using `d` (no need for model validation here). Then use the `.predict()` function to predict the category for the two people from 1.2. Does the model's predicted category match the one you calculated by hand?

<img src="https://drive.google.com/uc?export=view&id=1ghyQPx1N8dmU3MV4TrANvqNhGwnLni72" width = 200px />



In [22]:
### YOUR CODE HERE ###

# Use BernoulliNB or CategoricalNB since we have categorical variables
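A minimal sketch of the fit-then-predict pattern is shown here on a made-up stand-in called `toy` (same column names as `d`, invented values), so the predictions are not the answers for `d`.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

# Made-up stand-in for d: same column names, invented values.
toy = pd.DataFrame({
    "incomeOver100k":  [1, 1, 0, 1, 0, 0],
    "ageOver40":       [1, 1, 1, 0, 1, 0],
    "kids":            [1, 0, 1, 0, 0, 1],
    "morethan1Income": [1, 1, 0, 1, 1, 0],
    "own":             [1, 1, 1, 0, 0, 0],
})
predictors = ["incomeOver100k", "ageOver40", "kids", "morethan1Income"]

# Fit on plain arrays so that predicting from a reshaped array matches the training format.
X = toy[predictors].to_numpy()
y = toy["own"].to_numpy()

nb = BernoulliNB()
nb.fit(X, y)

# A single data point must be 2-D: one row, one column per predictor.
dp = [1, 1, 0, 1]
data_point = np.array(dp).reshape(1, -1)

pred = nb.predict(data_point)
pred_proba = nb.predict_proba(data_point)
print(pred, pred_proba)
```

Note that `BernoulliNB()` applies smoothing by default (`alpha=1.0`), so its probabilities can differ slightly from raw proportions calculated by hand even when the predicted category is the same.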

### 1.4 Build a CONTINUOUS NB in sklearn

While we won't do the math by hand for the continuous (Gaussian) version of Naive Bayes, let's practice running it in sklearn.

Using the `diabetes` dataset, create and fit a `GaussianNB()` model to predict whether or not someone has diabetes (`1`-diabetes, `0`-no diabetes). Use Train Test Split and evaluate how well your model does on unseen data.

In [23]:
diabetes = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/diabetes2.csv")
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [24]:
### YOUR CODE HERE ###

# Use GaussianNB because we have all continuous predictors
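The general workflow can be sketched as follows. To keep the example runnable on its own, it uses randomly generated stand-in data rather than the `diabetes` file; the two predictors and their means are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in data (NOT the diabetes dataset): two continuous predictors
# whose means depend on the class.
rng = np.random.default_rng(42)
n = 200
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(100 + 30 * y, 15),  # invented glucose-like measurement
    rng.normal(28 + 4 * y, 5),     # invented BMI-like measurement
])

# Hold out 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)

# Compare performance on the training data and on the held-out data.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(train_acc, test_acc)
print(confusion_matrix(y_test, model.predict(X_test)))
```

For the actual exercise, `X` would be the predictor columns of `diabetes` and `y` would be the `Outcome` column.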

## 2. Why Being Naive is...good!

We mentioned in lecture why the naive assumption in NB is useful computationally. But now, it's your turn to experience it firsthand! Using the LARGER home ownership dataset `d2`, first calculate the probability $P(1,0,1,1)$ (where `[1,0,1,1]` represents a person's values for the 4 predictors, `incomeOver100k`, `ageOver40`, `kids`, and `morethan1Income`) the **naive** way for *both* home owners and non-owners.

$$ P(A,B,C,D) = P(A)*P(B)*P(C)*P(D)$$

In [25]:
d2 = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/HomeOwnership2.csv")
d2.head()
### YOUR CODE HERE ###

#p(1,0,1,1) for homeowners

#p(1,0,1,1) for non-homeowners


Unnamed: 0,incomeOver100k,ageOver40,kids,morethan1Income,own
0,1,0,1,1,1
1,1,1,1,1,1
2,0,1,1,1,1
3,1,0,0,1,1
4,1,0,1,1,1
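A sketch of the naive calculation is shown below on a made-up stand-in called `toy2` (same column names as `d2`, invented values), so the numbers are not the answers for `d2`. The helper name `naive_probability` is also made up for illustration.

```python
import pandas as pd

# Made-up stand-in for d2: same column names, invented values.
toy2 = pd.DataFrame({
    "incomeOver100k":  [1, 1, 1, 0, 1, 0, 0, 1],
    "ageOver40":       [0, 0, 1, 0, 0, 1, 0, 1],
    "kids":            [1, 1, 1, 0, 1, 0, 1, 0],
    "morethan1Income": [1, 1, 0, 1, 1, 0, 1, 0],
    "own":             [1, 1, 1, 1, 0, 0, 0, 0],
})
predictors = ["incomeOver100k", "ageOver40", "kids", "morethan1Income"]
person = [1, 0, 1, 1]

def naive_probability(group, values):
    """Multiply the individual probabilities of each predictor value within one group."""
    result = 1.0
    for col, val in zip(predictors, values):
        # Share of rows in this group whose predictor equals the requested value.
        result *= (group[col] == val).mean()
    return result

p_own_naive = naive_probability(toy2[toy2["own"] == 1], person)
p_not_naive = naive_probability(toy2[toy2["own"] == 0], person)
print(p_own_naive, p_not_naive)
```

Each predictor needs only one pass over its own column, which is what keeps the naive calculation cheap.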


Now calculate $P(1,0,1,1 | \text{own})$ and $P(1,0,1,1 | \text{not own})$ (where `[1,0,1,1]` represents a person's values for the 4 predictors, `incomeOver100k`, `ageOver40`, `kids`, and `morethan1Income`) the **regular** way, that is, for *both* home owners and non-owners.

Using the *chain rule* of probabilities, the probability of multiple events, $P(A,B,C,D)$ is equal to:

$$P(A,B,C,D)= P(A|B,C,D)*P(B|C,D)*P(C|D)*P(D)$$

In [26]:
### YOUR CODE HERE ###
#p(1,0,1,1) for homeowners

#p(1,0,1,1) for non-homeowners
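One way to apply the chain rule is to filter the group one predictor at a time, starting from $D$ and working back to $A$. The sketch below again uses a made-up stand-in called `toy2` (invented values, not `d2`), and the helper name `chain_rule_probability` is hypothetical.

```python
import pandas as pd

# Made-up stand-in for d2: same column names, invented values.
toy2 = pd.DataFrame({
    "incomeOver100k":  [1, 1, 1, 0, 1, 0, 0, 1],
    "ageOver40":       [0, 0, 1, 0, 0, 1, 0, 1],
    "kids":            [1, 1, 1, 0, 1, 0, 1, 0],
    "morethan1Income": [1, 1, 0, 1, 1, 0, 1, 0],
    "own":             [1, 1, 1, 1, 0, 0, 0, 0],
})
predictors = ["incomeOver100k", "ageOver40", "kids", "morethan1Income"]
person = [1, 0, 1, 1]

def chain_rule_probability(group, values):
    """P(A|B,C,D) * P(B|C,D) * P(C|D) * P(D) within one (non-empty) group."""
    prob = 1.0
    subset = group
    # Start with P(D), then condition on more predictors at each step.
    for col, val in reversed(list(zip(predictors, values))):
        matches = subset[subset[col] == val]
        if matches.empty:
            return 0.0  # the combination never occurs in this group
        prob *= len(matches) / len(subset)
        subset = matches
    return prob

owners = toy2[toy2["own"] == 1]
non_owners = toy2[toy2["own"] == 0]

p_own_chain = chain_rule_probability(owners, person)
p_not_chain = chain_rule_probability(non_owners, person)

# Check: the chain rule equals the share of rows matching all four values at once.
p_own_direct = (owners[predictors] == person).all(axis=1).mean()
print(p_own_chain, p_not_chain, p_own_direct)
```

Every factor here depends on the rows that survived the previous filters, so the conditional probabilities cannot be calculated one column at a time as in the naive version.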

See how much simpler the naive way is? And this is a TINY dataset with very few features.