In [2]:
import numpy as np

### 1. Introduction 

### 2. Theory

We can build a classification model from nothing more than Bayes' Theorem if we make a few simplifications first.  Applied directly, Bayes' Theorem requires the joint conditional probability of all input variables given the class label, which accounts for every dependency between the variables and is rarely practical to estimate.  The fundamental assumption in the Naive Bayes model is that the features are **independent** of each other given the class label.  By making this assumption, we can factor the joint conditional probability into a product of independent, per-feature conditional probabilities.  

We can first drop the denominator of the equation, $P(x_{1},...,x_{n})$, since it does not depend on $y$.  The posterior is then proportional to, rather than equal to, the numerator: 

$P(y_{i}|x_{1},...,x_{n}) \propto P(x_{1},...,x_{n}|y_{i}) * P(y_{i})$

Then, the conditional probability of all variables given the class label is changed into separate conditional probabilities of each variable value given the class label.  These are then multiplied together: 

$P(y_{i}|x_{1},...,x_{n}) \propto P(y_{i}) * P(x_{1}|y_{i}) *... * P(x_{n}|y_{i})$

We can do this for each of the class labels and choose the label with the largest probability to be the classification.  This is the maximum a posteriori (MAP) decision rule.  
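
Written compactly, and assuming the class prior $P(y_{i})$ is kept as a factor alongside the per-feature terms, the decision rule can be sketched as follows.  The symbol $\hat{y}$ is notation introduced here for the predicted label: 

$\hat{y} = \arg\max_{y_{i}} P(y_{i}) * \prod_{j=1}^{n} P(x_{j}|y_{i})$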

### 3. Example 

To see an example, let's create 100 instances with two numerical features, each assigned to one of two classes: 

In [1]:
# Create small dataset
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)
print(X.shape, y.shape)
print(X[:5])
print(y[:5])

(100, 2) (100,)
[[-2.98837186  8.82862715]
 [ 5.72293008  3.02697174]
 [-3.05358035  9.12520872]
 [ 5.461939    3.86996267]
 [ 4.86733877  3.28031244]]
[0 1 0 1 1]


We can model the numerical input variables using a Gaussian distribution: 

In [3]:
# Fit a probability distribution to a univariate data sample
from scipy.stats import norm
def fit_distribution(data):
    
    # Estimate parameters
    mu = np.mean(data)
    sigma = np.std(data)
    print(mu, sigma)
    
    # Fit distribution 
    dist = norm(mu, sigma)
    return dist
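
For reference, the object returned by `norm(mu, sigma)` represents the standard Gaussian density below.  Here $\mu$ and $\sigma$ stand for the `mu` and `sigma` estimated from the samples passed to the function, which is assumed to be one feature's values within one class, and $x_{j}$ is a single feature value.  Note that this is a density rather than a probability, so it can exceed 1 when $\sigma$ is small: 

$p(x_{j}|y_{i}) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(x_{j}-\mu)^{2}}{2\sigma^{2}}\right)$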

We are interested in the conditional probability of each input variable given the class label.  This means we need one distribution per input variable per class label: two variables times two classes gives four distributions in total.  First, we split the data by class: 

In [4]:
# Sort data into classes 
Xy0 = X[y == 0]
Xy1 = X[y == 1]
print(Xy0.shape, Xy1.shape)

(50, 2) (50, 2)


Now that we have these two groups, we can use them to calculate the prior probability of a sample belonging to each class.  We know each prior is 50% because we generated a balanced dataset, but in real data the classes are often imbalanced and the priors differ: 

In [6]:
# Calculate priors 
prior0 = len(Xy0) / len(X)
prior1 = len(Xy1) / len(X)
print(prior0, prior1)

0.5 0.5
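
The same counting idea extends to any number of classes and to imbalanced data.  The sketch below is a minimal illustration, not part of the original notebook: `class_priors` is a hypothetical helper and the `labels` list is made up for the example.

```python
from collections import Counter

def class_priors(labels):
    """Estimate each class prior as its relative frequency in the labels."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}

# Hypothetical imbalanced labels: seven samples of class 0, three of class 1
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
priors = class_priors(labels)
print(priors)  # {0: 0.7, 1: 0.3}
```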


Finally, we call the `fit_distribution()` function to generate a probability distribution for each variable and each class label: 

In [7]:
# Create PDFs for y == 0
X1_y0 = fit_distribution(Xy0[:, 0])
X2_y0 = fit_distribution(Xy0[:, 1])

# Create PDFs for y == 1
X1_y1 = fit_distribution(Xy1[:, 0])
X2_y1 = fit_distribution(Xy1[:, 1])

-2.702923013045944 0.832080242600683
8.89011496146507 0.9529082761802956
4.608404434892749 0.8720134236486649
2.1699819200257497 0.9986664144394607


The mean and standard deviation of each distribution are printed, and they differ clearly between the two classes.  We now use the fitted probabilistic model to make a prediction.  The independent conditional probability for each class label is calculated as the prior for the class (50%) multiplied by the conditional probability of the value of each variable:

In [8]:
# Calculate independent conditional probability
def probability(X, prior, dist1, dist2):
    return prior * dist1.pdf(X[0]) * dist2.pdf(X[1])

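We can use this function to score an example against each class.  Because the denominator was dropped, the returned values are unnormalized scores rather than probabilities that sum to one, so only their relative size matters.  The values printed here are also multiplied by 100.  Let's classify one example:  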

In [9]:
# Classify one example
Xsample, ysample = X[0], y[0]
py0 = probability(Xsample, prior0, X1_y0, X2_y0)
py1 = probability(Xsample, prior1, X1_y1, X2_y1)
print('P(y=0 | %s) = %.3f' % (Xsample, py0*100))
print('P(y=1 | %s) = %.3f' % (Xsample, py1*100))
print('Truth: y=%d' % ysample)

P(y=0 | [-2.98837186  8.82862715]) = 9.443
P(y=1 | [-2.98837186  8.82862715]) = 0.000
Truth: y=0
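
The two printed values are `prior * likelihood` scores, so they do not sum to one.  If actual posterior probabilities are wanted, one option is to divide each score by their sum.  The sketch below assumes that approach; `normalize_scores` is a hypothetical helper, and the second score is only an illustrative stand-in because the real value printed as 0.000.

```python
def normalize_scores(scores):
    """Divide each prior * likelihood score by the total so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# 0.09443 is the y=0 score printed above (9.443 / 100).
# 1e-27 is an assumed placeholder for the tiny y=1 score, not a measured value.
scores = [0.09443, 1e-27]
posteriors = normalize_scores(scores)
print(posteriors)
```

With scores this far apart, the normalized probability for y=0 is effectively 1.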


In scikit-learn, there are several implementations of the Naive Bayes model, including `BernoulliNB`, `MultinomialNB`, and `GaussianNB`.  Let's look at an example of the Gaussian form: 

In [11]:
# Example of Gaussian NB
from sklearn.naive_bayes import GaussianNB

# Generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)

# Define the model
model = GaussianNB()

# Fit the model
model.fit(X, y)

# Select a single sample
Xsample, ysample = [X[0]], y[0]

# Make a probabilistic prediction
yhat_prob = model.predict_proba(Xsample)
print(f'Predicted probabilities: {yhat_prob}')

# Make classification prediction
yhat_class = model.predict(Xsample)
print(f'Predicted class: {yhat_class}')
print(f'Truth: {ysample}')

Predicted probabilities: [[1.00000000e+00 7.10993754e-27]]
Predicted class: [0]
Truth: 0


In this case, the predicted probability of the example belonging to y=0 is reported as 1.0, a near certainty, while the probability of y=1 is about 7e-27, essentially zero.  The predicted class label, 0, matches the true label.  
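
As a sanity check, the fitted model's parameters can be compared with estimates computed by hand.  The sketch below assumes the `class_prior_` and `theta_` attributes of a fitted `GaussianNB` (learned priors and per-class feature means) and rebuilds the same dataset so that it runs on its own; `manual_means` is a name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Same dataset and model as in the example above
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)
model = GaussianNB().fit(X, y)

# Per-class feature means computed by hand, one row per class
manual_means = np.array([X[y == label].mean(axis=0) for label in model.classes_])

print(model.class_prior_)  # learned class priors
print(model.theta_)        # learned per-class feature means
print(np.allclose(model.theta_, manual_means))
```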

### 4. Tips When Using Naive Bayes

- If the probability distribution for a variable is complex or unknown, it can be a good idea to use a kernel density estimator (KDE).  This allows us to approximate the distribution directly from the data samples.

- Naive Bayes assumes the variables are independent, even though this is rarely true in the real world.  It often still gives good approximations when the assumption does not hold.  However, the more strongly the variables depend on each other, the worse the model tends to perform. 

- When calculating the independent conditional probability of one example for one class label, we have to multiply many individual probabilities together.  This can be numerically unstable when the probabilities are small, because the product can underflow to zero.  A common remedy is to take logarithms and add the log probabilities instead of multiplying.  This is the 'log trick'. 

- When new data becomes available, we can combine it with the old data to update the estimates of the parameters for each variable's probability distribution.

- The probability distributions will summarize the conditional probability of each input variable value for each class label.  We can use these distributions to randomly sample and create new plausible data instances.  
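
The log trick mentioned above can be sketched in plain Python.  This is a minimal illustration under stated assumptions: `gaussian_log_pdf` and `log_score` are hypothetical helpers, and the `(mu, sigma)` pairs are rounded from the values printed in the example section.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log of the Gaussian density at x, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_score(x, prior, params):
    """Log prior plus the sum of per-feature log densities.

    params is a list of (mu, sigma) pairs, one per feature.
    """
    return math.log(prior) + sum(
        gaussian_log_pdf(xi, mu, sigma) for xi, (mu, sigma) in zip(x, params)
    )

# (mu, sigma) per feature, rounded from the fitted values printed earlier
params_y0 = [(-2.703, 0.832), (8.890, 0.953)]
params_y1 = [(4.608, 0.872), (2.170, 0.999)]
x = [-2.98837186, 8.82862715]

log_scores = [log_score(x, 0.5, params_y0), log_score(x, 0.5, params_y1)]

# The logarithm is monotonic, so the largest log score marks the same class
predicted = max(range(len(log_scores)), key=log_scores.__getitem__)
print(log_scores, predicted)
```

Adding log terms keeps the scores in a comfortable numeric range even with many features, where a direct product of small densities could underflow to zero.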