## Naive Bayes

[Naive Bayes classifiers](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features.


Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector $\mathbf{x} = (x_1, \dots, x_n)$ of $n$ features (independent variables), it assigns to this instance probabilities

$$p(C_k \mid x_1, \dots, x_n)\,$$

for each of $K$ possible outcomes or classes $C_k$.

The problem with the above formulation is that if the number of features $n$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible.  We therefore reformulate the model to make it more tractable.  Using Bayes' theorem, the conditional probability can be decomposed as

$$p(C_k \mid \mathbf{x}) = \frac{p(C_k) \ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})} \,$$

In plain English, using Bayesian probability terminology, the above equation can be written as

$$\mbox{posterior} = \frac{\mbox{prior} \times \mbox{likelihood}}{\mbox{evidence}} \,$$

In practice, only the numerator of that fraction is of interest, because the denominator does not depend on the class $C_k$ and the values of the features $x_i$ are given, so the denominator is effectively constant.
The numerator is equivalent to the joint probability model

$$p(C_k, x_1, \dots, x_n)\,$$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$
\begin{align}
p(C_k, x_1, \dots, x_n) & = p(x_1, \dots, x_n, C_k) \\
                        & = p(x_1 \mid x_2, \dots, x_n, C_k) p(x_2, \dots, x_n, C_k) \\
                        & = p(x_1 \mid x_2, \dots, x_n, C_k) p(x_2 \mid x_3, \dots, x_n, C_k) p(x_3, \dots, x_n, C_k) \\
                        & = \dots \\
                        & = p(x_1 \mid x_2, \dots, x_n, C_k) p(x_2 \mid x_3, \dots, x_n, C_k) \dots   p(x_{n-1} \mid x_n, C_k) p(x_n \mid C_k) p(C_k)  \\
\end{align}
$$

Now the "naive" conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ for $j\neq i$, given the class $C_k$.  This means that

$$p(x_i \mid x_{i+1}, \dots ,x_{n}, C_k ) = p(x_i \mid C_k)\,.$$

Thus, the joint model can be expressed as

$$
\begin{align}
p(C_k \mid x_1, \dots, x_n) & \varpropto p(C_k, x_1, \dots, x_n) \\
                            & \varpropto p(C_k) \ p(x_1 \mid C_k) \ p(x_2\mid C_k) \ p(x_3\mid C_k) \ \cdots \\
                            & \varpropto p(C_k) \prod_{i=1}^n p(x_i \mid C_k)\,.
\end{align}
$$

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is

$$p(C_k \mid x_1, \dots, x_n) = \frac{1}{Z} p(C_k) \prod_{i=1}^n p(x_i \mid C_k)$$

where the evidence $Z = p(\mathbf{x})$ is a scaling factor dependent only on $x_1, \dots, x_n$, that is, a constant if the values of the feature variables are known.

### Constructing a classifier from the probability model   

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model.  The naive Bayes classifier combines this model with a decision rule.  One common rule is to pick the hypothesis that is most probable; this is known as the 'maximum a posteriori' or 'MAP' decision rule.  The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some $k$ as follows

$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \ p(C_k) \displaystyle\prod_{i=1}^n p(x_i \mid C_k).$$
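The MAP rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full classifier: the function name `map_predict`, the class labels, and the probability values are all hypothetical, and every probability is assumed to be strictly positive so that its logarithm exists. The products are computed as sums of logs, which picks the same class while avoiding numerical underflow.

```python
import math

def map_predict(priors, likelihoods):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k).

    priors: dict mapping class label -> p(C_k)
    likelihoods: dict mapping class label -> list of p(x_i | C_k)
                 for the observed feature values x_1..x_n
    """
    scores = {}
    for label, prior in priors.items():
        # Summing logs is equivalent to multiplying the probabilities.
        scores[label] = math.log(prior) + sum(math.log(p) for p in likelihoods[label])
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical two-class problem with three observed feature values.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.1, 0.5], "ham": [0.2, 0.3, 0.5]}
label, scores = map_predict(priors, likelihoods)
print(label)
```

Here the unnormalized scores are $0.4 \times 0.8 \times 0.1 \times 0.5 = 0.016$ for "spam" and $0.6 \times 0.2 \times 0.3 \times 0.5 = 0.018$ for "ham", so "ham" is chosen.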

## Gaussian naive Bayes

When dealing with continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a normal (Gaussian) distribution. For example, suppose the training data contains a continuous attribute, $x$. We first segment the data by class, and then compute the mean and variance of $x$ in each class. Let $\mu_c$ be the mean of the values of $x$ associated with class $c$, and let $\sigma^2_c$ be the variance of the values of $x$ associated with class $c$. Suppose we have collected some observation value $v$. Then the probability density of $v$ given a class $c$, $p(x=v \mid c)$, can be computed by plugging $v$ into the equation for a normal distribution parameterized by $\mu_c$ and $\sigma^2_c$. That is,

$$
p(x=v \mid c)=\frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(v-\mu_c)^2}{2\sigma^2_c} }
$$
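As a rough sketch of the procedure just described, the snippet below estimates $\mu_c$ and $\sigma^2_c$ from a handful of values and evaluates the normal density at an observation $v$. The helper names and the sample values are hypothetical, and the unbiased sample variance (dividing by $n-1$) is one possible choice; some implementations divide by $n$ instead.

```python
import math

def mean_and_variance(values):
    """Sample mean and unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / (n - 1)
    return mu, var

def gaussian_likelihood(v, mu, var):
    """Normal density with mean mu and variance var, evaluated at v."""
    return math.exp(-((v - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical values of attribute x observed for a single class c.
values_c = [4.0, 5.0, 6.0]
mu_c, var_c = mean_and_variance(values_c)

# Density of a new observation v = 5.5 under the fitted class-c Gaussian.
density = gaussian_likelihood(5.5, mu_c, var_c)
print(mu_c, var_c, density)
```

Note that the result is a density, not a probability, so it can exceed 1 when the variance is small; only its relative size across classes matters for the decision rule.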

Another common technique for handling continuous values is to discretize the feature values by binning, to obtain a new set of Bernoulli-distributed features; some literature in fact suggests that this is necessary to apply naive Bayes, but it is not, and the discretization may throw away discriminative information.


## Multinomial Naive Bayes 

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for rolling a $k$-sided die $n$ times. For $n$ independent trials, each of which leads to a success for exactly one of $k$ categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial $(p_1, \dots, p_n)$ where $p_i$ is the probability that event $i$ occurs (or $K$ such multinomials in the multiclass case). A feature vector $\mathbf{x} = (x_1, \dots, x_n)$ is then a histogram, with $x_i$ counting the number of times event $i$ was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption). The likelihood of observing a histogram $\mathbf{x}$ is given by

$$
p(\mathbf{x} \mid C_k) = \frac{(\sum_i x_i)!}{\prod_i x_i !} \prod_i {p_{ki}}^{x_i}
$$

The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space

$$
\begin{align}
\log p(C_k \mid \mathbf{x}) & \varpropto \log \left( p(C_k) \prod_{i=1}^n {p_{ki}}^{x_i} \right) \\
 & = \log p(C_k) + \sum_{i=1}^n x_i \cdot \log p_{ki} \\
 & = b + \mathbf{w}_k^\top \mathbf{x}
\end{align}
$$

where $b = \log p(C_k)$ and $w_{ki} = \log p_{ki}$.

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.
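The following sketch combines the two ideas above: additive (pseudocount) smoothing of the per-class term probabilities $p_{ki}$, and the linear log-space score $b + \mathbf{w}_k^\top \mathbf{x}$. The function names, the toy counts, and the prior of 0.5 are assumptions made for illustration only.

```python
import math

def smoothed_term_probs(counts, alpha=1.0):
    """Estimate p_ki for one class from term counts with additive smoothing.

    counts: total count of each term i within class k
    alpha: pseudocount (1.0 corresponds to Laplace smoothing)
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def log_score(x, log_prior, weights):
    """Linear score b + w . x with b = log p(C_k) and w_i = log p_ki."""
    return log_prior + sum(xi * wi for xi, wi in zip(x, weights))

# Hypothetical term counts for class k over a 3-word vocabulary.
# The third word never occurs with this class in the training data.
counts_k = [3, 1, 0]
p_k = smoothed_term_probs(counts_k, alpha=1.0)   # [4/7, 2/7, 1/7]
w_k = [math.log(p) for p in p_k]

# Score a document that uses word 1 twice and word 3 once.
score = log_score([2, 0, 1], math.log(0.5), w_k)
print(p_k, score)
```

Without smoothing, the third probability would be exactly zero and the document above would receive a score of $-\infty$ for this class regardless of its other words.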

Rennie et al. discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of tf–idf weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with support vector machines.


## Bernoulli naive Bayes

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is the probability distribution of a random variable which takes the value 1 with probability $p$ and the value 0 with probability $q=1-p$ —i.e., the probability distribution of any single experiment that asks a yes–no question; the question results in a boolean-valued function, a single bit of information whose value is success/yes/true/one with probability p and failure/no/false/zero with probability q. It can be used to represent a coin toss where 1 and 0 would represent "head" and "tail" (or vice versa), respectively. In particular, unfair coins would have $p \neq 0.5$. 

The Bernoulli distribution is a special case of the binomial distribution where a single experiment/trial is conducted (n=1). It is also a special case of the two-point distribution, for which the outcome need not be a bit, i.e., the two possible outcomes need not be 0 and 1.

In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies. If $x_i$ is a boolean expressing the occurrence or absence of the $i$'th term from the vocabulary, then the likelihood of a document given a class $C_k$ is given by

$$
p(\mathbf{x} \mid C_k) = \prod_{i=1}^n p_{ki}^{x_i} (1 - p_{ki})^{(1-x_i)}
$$

where $p_{ki}$ is the probability of class $C_k$ generating the term $w_i$. This event model is especially popular for classifying short texts. It has the benefit of explicitly modelling the absence of terms. Note that a naive Bayes classifier with a Bernoulli event model is not the same as a multinomial NB classifier with frequency counts truncated to one.
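A small sketch of the Bernoulli likelihood may make the role of absent terms concrete. The function name and the probabilities below are hypothetical.

```python
def bernoulli_likelihood(x, p):
    """p(x | C_k) = prod_i p_i^x_i * (1 - p_i)^(1 - x_i) for binary x_i."""
    likelihood = 1.0
    for xi, pi in zip(x, p):
        # A present term contributes p_i; an absent term contributes 1 - p_i.
        likelihood *= pi if xi == 1 else (1.0 - pi)
    return likelihood

# Hypothetical per-term probabilities p_ki for one class, 3-term vocabulary.
p_k = [0.9, 0.2, 0.5]

# A document containing terms 1 and 3 but not term 2.
doc = [1, 0, 1]
value = bernoulli_likelihood(doc, p_k)
print(value)
```

The absent second term still contributes a factor of $1 - 0.2 = 0.8$, giving $0.9 \times 0.8 \times 0.5 = 0.36$.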



## Bayesian Probability

In this lesson we will discuss Bayesian probability theory. There are no data sets or libraries to be installed.

## Probability and Statistics

Probability is a measure of the likelihood of a random phenomenon or chance behavior.  Probability describes the long-term proportion with which a certain outcome will occur in situations with short-term uncertainty. 

### The Axioms of Probability


#### First axiom - The probability of an event is a non-negative real number:
$$
P(E)\in\mathbb{R}, P(E)\geq 0 \qquad \forall E\in F
$$
where $F$ is the event space

#### Second axiom -  unit measure:

The probability that some elementary event in the entire sample space will occur is 1.

$$
P(\Omega) = 1.
$$  

#### Third axiom - the assumption of $\sigma$-additivity:

Any countable sequence of disjoint (synonymous with mutually exclusive) events $E_1, E_2, ...$ satisfies

$$
P\left(\bigcup_{i = 1}^\infty E_i\right) = \sum_{i=1}^\infty P(E_i).
$$

The total probability of all possible outcomes always sums to 1.

### Consequences of these axioms

The probability of the empty set:
$$
P(\varnothing)=0.
$$

Monotonicity   
$$
\quad\text{if}\quad A\subseteq B\quad\text{then}\quad P(A)\leq P(B).
$$

The numeric bound between 0 and 1:  

$$
0\leq P(E)\leq 1\qquad \forall E\in F.
$$


![Probability is expressed in numbers between 0 and 1](http://nikbearbrown.com/YouTube/MachineLearning/M10/Probability_0_1.png)    
*Probability is expressed in numbers between 0 and 1.*   


Probability of a certain event is 1:

$$
P(True) = 1
$$

Probability = 1 means it always happens.


Probability of an impossible event is 0:

$$
P(False) = 0
$$

Probability = 0 means the event never happens.  

Probability of A or B:

$$
P(A \quad or \quad B) = P(A) + P(B) - P(A \quad and \quad B) 
$$

or

$$
P(A \cup B) = P(A) + P(B) - P(A \cap B) 
$$


Probability of not A:

$$
P(not  \quad A) = 1- P(A) 
$$
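These rules can be checked by brute-force counting on a small sample space. The sketch below assumes a fair six-sided die and two arbitrarily chosen events; exact fractions are used so that the equalities hold without rounding.

```python
from fractions import Fraction

omega = set(range(1, 7))  # outcomes of a fair six-sided die (an assumption)

def prob(event):
    """Probability of an event as (favorable outcomes) / (all outcomes)."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}  # the roll is even
B = {4, 5, 6}  # the roll is greater than 3

p_union = prob(A | B)                           # P(A or B) counted directly
p_incl_excl = prob(A) + prob(B) - prob(A & B)   # P(A) + P(B) - P(A and B)
p_not_a = prob(omega - A)                       # P(not A)

print(p_union, p_incl_excl, p_not_a)
```

Both ways of computing $P(A \cup B)$ give $2/3$, and $P(\text{not } A) = 1 - P(A) = 1/2$.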


## Conditional Probability

In probability theory, a [conditional probability](https://en.wikipedia.org/wiki/Conditional_probability) measures the probability of an event given that another event has occurred. It is read as "the conditional probability of A given B."

The conditional probability of A given B is defined as the quotient of the joint probability of events A and B and the probability of B:
 
$$ 
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

This may be visualized using a Venn diagram. 

![P(A and B) Venn](http://nikbearbrown.com/YouTube/MachineLearning/M10/Conditional_Probability_Venn_Diagram.png)     
*$P(A \cap B)$*

### A Corollary of Conditional Probability: The Chain Rule

If we multiply both sides by $P(B)$ then

$$ 
P(A|B) = \frac{P(A \cap B)}{P(B)}
$$

becomes

$$ 
P(A|B) P(B) = P(A \cap B) 
$$





### Statistical independence

Events A and B are defined to be statistically independent if:

$$
\begin{align}
             P(A \cap B) &= P(A) P(B) \\
  \Leftrightarrow P(A|B) &= P(A) \\
  \Leftrightarrow P(B|A) &= P(B)
\end{align}
$$

That is, the occurrence of A does not affect the probability of B, and vice versa.
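The definition can be tested by counting outcomes. The sketch below again assumes a fair six-sided die; the events are hypothetical and chosen so that one pair is independent and the other is not.

```python
from fractions import Fraction

omega = set(range(1, 7))  # outcomes of a fair six-sided die (an assumption)

def prob(event):
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)

def independent(a, b):
    """True when P(A and B) equals P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}      # the roll is even
B = {1, 2, 3, 4}   # the roll is at most 4
C = {4, 5, 6}      # the roll is greater than 3

print(independent(A, B), cond_prob(A, B))  # knowing B leaves P(A) at 1/2
print(independent(A, C), cond_prob(A, C))  # knowing C raises P(A) to 2/3
```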


Probability of A or B for mutually exclusive events, where $P(A \cap B) = 0$. Note that mutual exclusivity is a different condition from independence:

$$
P(A \cup B) = P(A) + P(B) 
$$

## Bayes Rule

[Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) (alternatively Bayes' law or Bayes' rule) describes the probability of an event given prior knowledge of related events; that is, it describes a conditional probability.

$$
P(A|B) = \frac{P(A)\, P(B | A)}{P(B)},
$$

where A and B are events.

* P(A) and P(B) are the marginal probabilities of A and B, each observed without regard to the other.  
* P(A | B), a conditional probability, is the probability of observing event A given that B is true.  
* P(B | A), is the probability of observing event B given that A is true.  


## Bayesian inference

[Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference) is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence becomes available. Bayesian inference derives the posterior probability as a consequence of two antecedents, a prior probability and a "likelihood function" derived from a statistical model for the observed data. 

Bayesian inference computes the posterior probability according to Bayes' theorem:

$$
P(H\mid E) = \frac{P(E\mid H) \cdot P(H)}{P(E)}
$$

where,    

$P(H\mid E)$ the posterior probability, denotes a conditional probability of $\textstyle H$ (the hypothesis) whose probability may be affected by the evidence $\textstyle E$.   

$\textstyle P(H)$, the prior probability, is an estimate of the probability that a hypothesis is true, before observing the current evidence.   

$\textstyle P(E\mid H)$ is the probability of observing $\textstyle E$ given $\textstyle H$. It indicates the compatibility of the evidence with the given hypothesis.   

$\textstyle P(E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered. 

Note that Bayes' rule can also be written as follows:
$$
P(H\mid E) = \frac{P(E\mid H)}{P(E)} \cdot P(H)
$$
where the factor $\textstyle \frac{P(E\mid H)}{P(E)}$ represents the impact of $E$ on the probability of $H$.   

## Bayesian probability example

Suppose a certain disease has an incidence rate of 0.01% (that is, it afflicts 0.01% of the population).  A test has been devised to detect this disease.  The test does not produce false negatives (that is, anyone who has the disease will test positive for it), but the false positive rate is 1% (that is, about 1% of people who take the test will test positive, even though they do not have the disease).  Suppose a randomly selected person takes the test and tests positive.  What is the probability that this person actually has the disease?

Bayes' theorem would ask the question, what is the probability of disease given a positive result, or $P(disease\mid positive)$. 

What do we know?  

$P(positive\mid disease)=1$ (i.e. The test does not produce false negatives.)     
$P(disease)=0.0001$ (i.e.  1/10,000 have the disease)      
$P(positive\mid no \quad disease)=0.01$ (i.e. the false positive rate is 1%. This means 1% of people who do not have the disease will still test positive)      

Bayes’ Theorem

$$
P(A|B) = \frac{P(A)\, P(B | A)}{P(B)},
$$

which can be rewritten as  

$$
P(A|B) = \frac{P(A)\, P(B | A)}{P(A)P(B|A)+P(\bar{A})P(B|\bar{A})},
$$

which in our example is
$$
P(disease|positive) = \frac{P(disease)\, P(positive | disease)}{P(disease)P(positive|disease)+P(no \quad  disease)P(positive|no \quad disease)},
$$

plugging in the numbers gives

$$
P(disease|positive)= \frac{(0.0001)\, (1)}{(0.0001)(1)+(0.9999)(0.01)} \approx 0.01
$$

So even though the test is 99% accurate, of all people who test positive, over 99% do not have the disease.  
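The calculation above can be reproduced with a short function. The function and argument names are hypothetical; the numbers are the ones given in the example.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """P(H | positive) via Bayes' theorem with the evidence expanded
    over the two cases H and not-H."""
    evidence = prior * p_pos_given_h + (1.0 - prior) * p_pos_given_not_h
    return prior * p_pos_given_h / evidence

p_disease_given_positive = posterior(
    prior=0.0001,            # P(disease)
    p_pos_given_h=1.0,       # P(positive | disease): no false negatives
    p_pos_given_not_h=0.01,  # P(positive | no disease): false positive rate
)
print(p_disease_given_positive)
```

The result is just under 0.01, because the roughly 1 in 100 healthy people who test positive vastly outnumber the 1 in 10,000 who actually have the disease.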


## Bayesians versus Frequentists

 
[Frequentist inference](https://en.wikipedia.org/wiki/Frequentist_inference) or frequentist statistics is a scheme for making statistical inference based on the frequency or proportion of the data. This effectively requires that conclusions should only be drawn with a set of repetitions.    

Frequentists will only generate statistical inference given a large enough set of repetitions. In contrast, a Bayesian approach to inference does allow probabilities to be associated with unknown parameters.   

![Count Von Count](http://nikbearbrown.com/YouTube/MachineLearning/M10/Count_von_Count_kneeling.png)   
*Count Von Count*   
- from https://en.wikipedia.org/wiki/File:Count_von_Count_kneeling.png  

While "probabilities" are involved in both approaches to inference, frequentist probability is essentially equivalent to counting. The Bayesian approach allows these estimates of probabilities to be based upon counting but also allows for subjective estimates (i.e. guesses) of prior probabilities.

Bayesian probability, also called evidential probability, or subjectivist probability, can be assigned to any statement whatsoever, even when no random process is involved. Evidential probabilities are considered to be degrees of belief, and a Bayesian can even use an un-informative prior (also called a non-informative or Jeffreys prior).

In Bayesian probability, the [Jeffreys prior](https://en.wikipedia.org/wiki/Jeffreys_prior), named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space.  The crucial idea behind the Jeffreys prior is the Jeffreys posterior. This posterior aims to reflect as best as possible the information about the parameters brought by the data, in effect  "representing ignorance" about the prior. This is sometimes called the "principle of indifference." Jeffreys prior is proportional to the square root of the determinant of the Fisher information:

$$
p\left(\vec\theta\right) \propto \sqrt{\det \mathcal{I}\left(\vec\theta\right)}.\,
$$

It has the key feature that it is invariant under reparameterization of the parameter vector $\vec\theta.$  

At its essence, the Bayesian can be vague or subjective about an initial guess at a prior probability, and the posterior probability can then be updated data point by data point. A Bayesian defines a "probability" in the same way that many non-statisticians do, namely as an indication of the plausibility or belief of a proposition.

A Frequentist is someone who believes probabilities represent long-run frequencies with which events occur; he or she will have a model (e.g. Gaussian, uniform, etc.) of how the sample population was generated. The observed counts are considered a random sample used to estimate the true parameters of the model.   

It is important to note that most Frequentist methods have a Bayesian equivalent (that is, they give the same results) when there are enough repeated trials. That is, they converge to the same result given enough data.  
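One way to see this convergence is a coin-flip sketch. It assumes a uniform Beta(1, 1) prior on the probability of heads, which is an illustrative choice and not something specified in this lesson; under that assumption the posterior mean after observing `heads` in `flips` tosses is `(heads + 1) / (flips + 2)`. The function names and counts are hypothetical.

```python
from fractions import Fraction

def frequentist_estimate(heads, flips):
    """Observed proportion of heads."""
    return Fraction(heads, flips)

def bayesian_posterior_mean(heads, flips, prior_heads=1, prior_tails=1):
    """Posterior mean of the heads probability under a Beta prior.

    prior_heads and prior_tails are the Beta parameters;
    (1, 1) is the uniform prior assumed in this sketch.
    """
    return Fraction(heads + prior_heads, flips + prior_heads + prior_tails)

# Same 70% heads rate, observed over 10 flips and then over 10,000 flips.
small_gap = abs(bayesian_posterior_mean(7, 10) - frequentist_estimate(7, 10))
large_gap = abs(bayesian_posterior_mean(7000, 10000) - frequentist_estimate(7000, 10000))
print(float(small_gap), float(large_gap))
```

With 10 flips the two estimates differ by $1/30$; with 10,000 flips they differ by $1/25005$, so the influence of the prior fades as data accumulates.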



## Applying Bayes' theorem to iris classification

Can **Bayes' theorem** predict the species of an iris?

In [None]:
from __future__ import print_function
%matplotlib inline
import random
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy as sp
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

In [None]:
iris = sns.load_dataset("iris")
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [None]:
# apply the ceiling function to the numeric columns
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,6.0,4.0,2.0,1.0,setosa
1,5.0,3.0,2.0,1.0,setosa
2,5.0,4.0,2.0,1.0,setosa
3,5.0,4.0,2.0,1.0,setosa
4,5.0,4.0,2.0,1.0,setosa


In [None]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,6.273333,3.46,4.22,1.773333
std,0.889134,0.551143,1.74506,0.696556
min,5.0,2.0,1.0,1.0
25%,6.0,3.0,2.0,1.0
50%,6.0,3.0,5.0,2.0
75%,7.0,4.0,6.0,2.0
max,8.0,5.0,7.0,3.0


In [None]:
iris['sepal_length'].value_counts()

6.0    57
7.0    49
5.0    32
8.0    12
Name: sepal_length, dtype: int64

In [None]:
iris['sepal_width'].value_counts()

3.0    82
4.0    64
5.0     3
2.0     1
Name: sepal_width, dtype: int64

In [None]:
iris['petal_length'].value_counts()

2.0    49
5.0    42
6.0    33
4.0    15
7.0     9
3.0     1
1.0     1
Name: petal_length, dtype: int64

In [None]:
iris['petal_width'].value_counts()

2.0    70
1.0    57
3.0    23
Name: petal_width, dtype: int64

In [None]:
iris['species'].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

## Making a Bayesian prediction

Let's say that I have an **out-of-sample iris** with the following measurements (each a common value in its column): **7, 3, 5, 2**. How might I predict the species?

In [None]:
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
54,7.0,3.0,5.0,2.0,versicolor
58,7.0,3.0,5.0,2.0,versicolor
63,7.0,3.0,5.0,2.0,versicolor
68,7.0,3.0,5.0,2.0,versicolor
72,7.0,3.0,5.0,2.0,versicolor
73,7.0,3.0,5.0,2.0,versicolor
74,7.0,3.0,5.0,2.0,versicolor
75,7.0,3.0,5.0,2.0,versicolor
76,7.0,3.0,5.0,2.0,versicolor
77,7.0,3.0,5.0,2.0,versicolor


In [None]:
# count the species for these observations
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()

versicolor    13
virginica      4
Name: species, dtype: int64

## What is the probability of some particular species, given the measurements 7, 3, 5, and 2?

$$P(species \ | \ 7352)$$

We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:

$$P(setosa \ | \ 7352)$$
$$P(versicolor \ | \ 7352)$$
$$P(virginica \ | \ 7352)$$

### Calculating the conditional probability of versicolor

**Bayes' theorem** gives us a way to calculate these conditional probabilities.

$$P(setosa) = P(versicolor) = P(virginica) = 50/150 = 1/3$$

For **versicolor**:

$$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$

We can calculate each of the terms on the right side of the equation:

$$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$

$$P(versicolor) = \frac {50} {150} = 0.33$$

$$P(7352) = \frac {17} {150} = 0.11$$

Bayes' theorem says:

$$P(versicolor \ | \ 7352) = \frac {0.26 \times 0.33} {0.11} = 0.765$$



In [None]:
((13/50)*(50/150))/(17/150)

0.7647058823529412

### Quiz: calculate P(virginica | 7352) and P(setosa | 7352)


**Calculate $P(virginica \ | \ 7352)$ and $P(setosa \ | \ 7352)$** 


### $P(virginica \ | \ 7352)$

$$P(virginica \ | \ 7352) = \frac {P(7352 \ | \ virginica) \times P(virginica)} {P(7352)}$$

We can calculate each of the terms on the right side of the equation:

$$P(7352 \ | \ virginica) = \frac {4} {50} = 0.08$$

$$P(virginica) = \frac {50} {150} = 0.33$$

$$P(7352) = \frac {17} {150} = 0.11$$

Bayes' theorem says:

$$P(virginica \ | \ 7352) = \frac {0.08 \times 0.33} {0.11} = 0.235$$


### $P(setosa \ | \ 7352)$


$$P(setosa \ | \ 7352) = \frac {P(7352 \ | \ setosa) \times P(setosa)} {P(7352)}$$

We can calculate each of the terms on the right side of the equation:

$$P(7352 \ | \ setosa) = \frac {0} {50} = 0$$

$$P(setosa) = \frac {50} {150} = 0.33$$

$$P(7352) = \frac {17} {150} = 0.11$$

Bayes' theorem says:

$$P(setosa \ | \ 7352) = \frac {0.0 \times 0.33} {0.11} = 0$$

In [None]:
((4/50)*(50/150))/(17/150)

0.23529411764705882

In [None]:
((0/50)*(50/150))/(17/150)

0.0

## Predict that the iris is a versicolor

Bayes' theorem says:

$$P(versicolor \ | \ 7352) = \frac {0.26 \times 0.33} {0.11} = 0.765$$


$$P(virginica \ | \ 7352) = \frac {0.08 \times 0.33} {0.11} = 0.235$$

$$P(setosa \ | \ 7352) = \frac {0 \times 0.33} {0.11} = 0$$

We predict that the iris is a versicolor, given it has the **highest conditional probability**.
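The three calculations can be collected into one short script. The counts are the ones reported by `value_counts()` earlier (13 versicolor and 4 virginica matches, 50 flowers per species); the variable names are hypothetical. Note that this sketch applies Bayes' theorem to the joint counts of all four measurements directly, so it does not yet use the naive independence assumption.

```python
from fractions import Fraction

# Number of training flowers per species with measurements 7, 3, 5, 2.
match_counts = {"setosa": 0, "versicolor": 13, "virginica": 4}
# Number of training flowers per species overall.
class_counts = {"setosa": 50, "versicolor": 50, "virginica": 50}

total = sum(class_counts.values())      # 150 flowers
n_match = sum(match_counts.values())    # 17 flowers match 7, 3, 5, 2
evidence = Fraction(n_match, total)     # P(7352)

posteriors = {}
for species in class_counts:
    likelihood = Fraction(match_counts[species], class_counts[species])  # P(7352 | species)
    prior = Fraction(class_counts[species], total)                       # P(species)
    posteriors[species] = likelihood * prior / evidence

prediction = max(posteriors, key=posteriors.get)
print({k: float(v) for k, v in posteriors.items()}, prediction)
```

The posteriors are $13/17 \approx 0.765$, $4/17 \approx 0.235$, and $0$, and they sum to 1 as expected.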

## Building a Naive Bayes model


## Comparing Multinomial and Gaussian Naive Bayes

scikit-learn documentation: [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)


* Bernoulli Naive Bayes: The Bernoulli model is useful if your feature vectors are binary (i.e., 0s and 1s). One application is text classification with a "bag of words" model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document", respectively.

* Multinomial Naive Bayes: The multinomial naive Bayes model is typically used for discrete counts, that is, the number of times outcome $x_i$ is observed over the $n$ trials. For example, a count of how often a word occurs in the document.

* Gaussian Naive Bayes: Here, we assume that the features follow a normal distribution. Instead of discrete counts, we have continuous features.


Dataset: [Pima Indians Diabetes](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) from the UCI Machine Learning Repository

In [None]:
data = 'data/pima-indians-diabetes.csv'
# col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(data)

In [None]:
# notice that all features are continuous
pima.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
# create X and y
y = pima.Outcome
X = pima.drop('Outcome', axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# testing accuracy of Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_class = mnb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.5416666666666666


In [None]:
# testing accuracy of Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_class = gnb.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

0.7916666666666666


### Discussion: Multinomial and Gaussian Naive Bayes

Note that both the multinomial and Gaussian Naive Bayes models work in the sense that the Python code runs. However, multinomial Naive Bayes is inappropriate for these continuous features and gives poor (near coin-flip) accuracy of about 0.54, while Gaussian Naive Bayes reaches about 0.79.


Last update July 8, 2019