# Naïve Bayes Recommendation System - Part I: Foundation
 

In this notebook, I'll build Naive Bayes from scratch to deeply understand its mechanics, then demonstrate its implementation using scikit-learn.

## Why This Matters in the Real World

Naive Bayes is widely used as a fast baseline in production systems:
- **Spam Detection**: A classic technique for filtering unwanted email at scale
- **Medical Diagnosis**: Assisting doctors by predicting disease likelihood from symptoms
- **Recommendation Systems**: Powering "customers who bought this also bought..."
- **Text Classification**: Social media platforms identifying toxic content in milliseconds
- **Fraud Detection**: Banks flagging suspicious transactions before money is lost

## The Business Value

Companies prefer Naive Bayes because it's:
- **Fast**: Training is a single counting pass over the data, and each prediction needs only a few multiplications per feature
- **Interpretable**: You can explain exactly why a decision was made
- **Memory Efficient**: No need for massive compute resources
- **Robust**: Works well even with limited training data
- **Transparent**: Stakeholders can audit and trust the model's logic

By building it from scratch, I gain the mathematical intuition needed to confidently deploy this algorithm in production systems where reliability and explainability matter most.


In [1]:
import numpy as np

## Step 1: Organizing Training Data

Before we can learn patterns, we need to structure our data intelligently. This function creates a lookup table that groups samples by their class labels—think of it as creating separate folders for positive and negative examples.

### Why This Matters
Grouping the sample indices by class once means every later step (priors and likelihoods) can select all samples of a class with a single indexing operation instead of rescanning the labels. On large datasets, this keeps training a simple counting pass, which matters in high-volume settings such as credit card fraud detection.


In [2]:
X_train = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0]])

Y_train = ['Y', 'N', 'Y', 'Y']

X_test = np.array([[1, 1, 0]])

In [3]:
def get_label_indices(labels):
    from collections import defaultdict
    label_indices = defaultdict(list)
    for index, label in enumerate(labels):
        label_indices[label].append(index)
    return label_indices

## Step 2: Calculating Prior Probabilities

The "prior" tells us: **before seeing any evidence, what's our best guess?**

### Real-World Application
Imagine a medical diagnosis system:
- 99% of patients with these symptoms have a cold (prior = 0.99)
- 1% have something serious (prior = 0.01)

Even before examining a specific patient, we have a baseline expectation. This is critical in business because:
1. **Risk Management**: We start with a conservative assumption
2. **Resource Allocation**: We know which outcomes are likely vs rare
3. **Decision Making**: We can make informed choices even with limited information

### Business Impact
Insurance companies use priors to set premiums. Knowing the baseline probabilities helps keep pricing in line with actual risk.


In [4]:
label_indices = get_label_indices(Y_train)
print('label_indices:\n', label_indices)


label_indices:
 defaultdict(<class 'list'>, {'Y': [0, 2, 3], 'N': [1]})


In [5]:
def get_prior(label_indices):
    """
    Compute prior based on training samples
    @param label_indices: grouped sample indices by class
    @return: dictionary, with class label as key, corresponding
             prior as the value
    """
    prior = {label: len(indices) for label, indices in
                                     label_indices.items()}
    total_count = sum(prior.values())
    for label in prior:
        prior[label] /= total_count
    return prior
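
As a quick hand check of what `get_prior` computes: the prior of each class is its share of the training samples. The symbols $N_y$ (number of samples in class $y$) and $N$ (total number of samples) are notation introduced here for illustration. For the toy labels `['Y', 'N', 'Y', 'Y']`:

$$
P(y) = \frac{N_y}{N}, \qquad
P(\text{Y}) = \frac{3}{4} = 0.75, \qquad
P(\text{N}) = \frac{1}{4} = 0.25
$$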


## Step 3: Computing Likelihood with Smoothing

Likelihood answers: **"Given a class, what's the probability of observing these features?"**

The `smoothing` parameter is our **regularization technique**—it prevents the model from being overly confident based on limited data.

### Why Smoothing is Critical in Production
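Consider a spam filter:
- A word such as "Viagra" appears in many spam emails but in none of the legitimate emails in the training set
- Without smoothing: P(word | legitimate) = 0, so any email containing that word gets a posterior of exactly zero for the legitimate class, no matter what else it contains
- With smoothing: every count starts from a small positive value, so unseen feature-class combinations get a low but nonzero probability and the remaining evidence still counts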

### Business Impact
Without smoothing, a single feature value that never co-occurred with a class in training can force a hard, overconfident decision in production, which costs customer trust and operational efficiency. Additive (Laplace) smoothing is the standard safeguard for count-based models like this one.
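
One concrete effect of additive smoothing is that a feature never seen with a class no longer gets probability zero. The sketch below illustrates this with made-up counts; the helper `bernoulli_likelihood` and the numbers are assumptions for illustration, not part of the notebook's pipeline.

```python
def bernoulli_likelihood(count, total, smoothing=0):
    """Estimate P(feature = 1 | class) with additive smoothing over two outcomes."""
    return (count + smoothing) / (total + 2 * smoothing)

# Hypothetical counts: a word seen in 0 of 50 legitimate training emails.
unsmoothed = bernoulli_likelihood(0, 50)
smoothed = bernoulli_likelihood(0, 50, smoothing=1)

# Stand-in for the product of all other factors (prior and other features).
other_evidence = 0.9

print('Without smoothing:', unsmoothed, '-> product:', other_evidence * unsmoothed)
print('With smoothing:   ', smoothed, '-> product:', other_evidence * smoothed)
```

Without smoothing, the zero factor wipes out the whole product regardless of the other evidence. With `smoothing=1`, the factor is small but positive, so the other features can still influence the result.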


In [6]:
prior = get_prior(label_indices)
print('Prior:', prior)

Prior: {'Y': 0.75, 'N': 0.25}


In [7]:
def get_likelihood(features, label_indices, smoothing=0):
    """
    Compute likelihood based on training samples
    @param features: matrix of features
    @param label_indices: grouped sample indices by class
    @param smoothing: non-negative number, additive smoothing
                      parameter (0 means no smoothing)
    @return: dictionary, with class as key, corresponding
             conditional probability P(feature|class) vector 
             as value
    """
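    likelihood = {}
    for label, indices in label_indices.items():
        # count how often each feature equals 1 within this class
        likelihood[label] = features[indices, :].sum(axis=0) + smoothing
        total_count = len(indices)
        # each binary feature has 2 possible values, hence 2 * smoothing
        likelihood[label] = likelihood[label] / (total_count + 2 * smoothing)
    return likelihood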

In [8]:
smoothing = 1
likelihood = get_likelihood(X_train, label_indices, smoothing)
print('Likelihood:\n', likelihood)

Likelihood:
 {'Y': array([0.4, 0.6, 0.4]), 'N': array([0.33333333, 0.33333333, 0.66666667])}
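
To see where these numbers come from: with smoothing parameter $\alpha$, `get_likelihood` estimates each conditional probability as shown below. The symbols $n_{j,y}$ (number of class-$y$ training samples whose feature $j$ equals 1) and $N_y$ (number of class-$y$ samples) are notation introduced here for illustration.

$$
P(x_j = 1 \mid y) = \frac{n_{j,y} + \alpha}{N_y + 2\alpha}
$$

For class `Y` (rows 0, 2, and 3 of `X_train`), the per-feature counts are $(1, 2, 1)$ and $N_{\text{Y}} = 3$, so with $\alpha = 1$:

$$
\left(\frac{1+1}{3+2},\ \frac{2+1}{3+2},\ \frac{1+1}{3+2}\right) = (0.4,\ 0.6,\ 0.4)
$$

For class `N` (row 1 only), the counts are $(0, 0, 1)$ and $N_{\text{N}} = 1$:

$$
\left(\frac{0+1}{1+2},\ \frac{0+1}{1+2},\ \frac{1+1}{1+2}\right) = \left(\tfrac{1}{3},\ \tfrac{1}{3},\ \tfrac{2}{3}\right)
$$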


In [9]:
def get_posterior(X, prior, likelihood):
    """
    Compute posterior of testing samples, based on prior and
    likelihood
    @param X: testing samples
    @param prior: dictionary, with class label as key,
                  corresponding prior as the value
    @param likelihood: dictionary, with class label as key,
                       corresponding conditional probability
                       vector as value
    @return: list of dictionaries, one per testing sample, each
             with class label as key and posterior as value
    """
    posteriors = []
    for x in X:
        # posterior is proportional to prior * likelihood
        posterior = prior.copy()
        for label, likelihood_label in likelihood.items():
            for index, bool_value in enumerate(x):
                posterior[label] *= likelihood_label[index] if \
                  bool_value else (1 - likelihood_label[index])
        # normalize so that the posteriors sum up to 1
        # (a product of probabilities is never infinite, so no special case is needed)
        sum_posterior = sum(posterior.values())
        for label in posterior:
            posterior[label] /= sum_posterior
        posteriors.append(posterior.copy())
    return posteriors
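
Written out, the quantity this function computes for a binary sample $x$ is the prior times one likelihood factor per feature, normalized over the classes:

$$
P(y \mid x) = \frac{P(y)\prod_j P(x_j \mid y)}{\sum_{y'} P(y')\prod_j P(x_j \mid y')},
\qquad
P(x_j \mid y) =
\begin{cases}
P(x_j = 1 \mid y) & \text{if } x_j = 1 \\
1 - P(x_j = 1 \mid y) & \text{if } x_j = 0
\end{cases}
$$

As a hand check for the test sample $x = (1, 1, 0)$, using the prior and the smoothed likelihood printed earlier:

$$
\begin{aligned}
P(\text{Y})\prod_j P(x_j \mid \text{Y}) &= 0.75 \times 0.4 \times 0.6 \times (1 - 0.4) = 0.108 \\
P(\text{N})\prod_j P(x_j \mid \text{N}) &= 0.25 \times \tfrac{1}{3} \times \tfrac{1}{3} \times \left(1 - \tfrac{2}{3}\right) \approx 0.00926 \\
P(\text{Y} \mid x) &\approx \frac{0.108}{0.108 + 0.00926} \approx 0.921
\end{aligned}
$$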

    

In [10]:
posterior = get_posterior(X_test, prior, likelihood)
print('Posterior:\n', posterior)

Posterior:
 [{'Y': np.float64(0.9210360075805433), 'N': np.float64(0.07896399241945673)}]
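
Multiplying many probabilities can underflow to zero once there are hundreds of features, so implementations commonly sum logarithms instead. The following is a minimal sketch of that idea, not a replacement for `get_posterior`. The function `log_posterior` and the names `toy_prior` and `toy_likelihood` are introduced here for illustration, and the sketch assumes smoothing was applied so that no probability is exactly 0 or 1.

```python
import math

def log_posterior(x, prior, likelihood):
    """Normalized posterior for one binary sample, computed in log space."""
    log_scores = {}
    for label, probs in likelihood.items():
        score = math.log(prior[label])
        for value, p in zip(x, probs):
            # use p when the feature is 1, and 1 - p when it is 0
            score += math.log(p if value else 1 - p)
        log_scores[label] = score
    # subtract the largest log score before exponentiating to avoid underflow
    max_score = max(log_scores.values())
    total = sum(math.exp(s - max_score) for s in log_scores.values())
    return {label: math.exp(s - max_score) / total
            for label, s in log_scores.items()}

# Values copied from the prior and smoothed likelihood of the toy example
toy_prior = {'Y': 0.75, 'N': 0.25}
toy_likelihood = {'Y': [0.4, 0.6, 0.4], 'N': [1 / 3, 1 / 3, 2 / 3]}

result = log_posterior([1, 1, 0], toy_prior, toy_likelihood)
print(result)
```

On the toy example, this should agree with the posterior above up to floating-point rounding.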


## Implementing Naïve Bayes with scikit-learn

In [11]:
from sklearn.naive_bayes import BernoulliNB

In [13]:
# create a model with smoothing factor alpha=1.0 and priors learned from the data
clf = BernoulliNB(alpha=1.0, fit_prior=True)

In [14]:
# train model
clf.fit(X_train, Y_train)

BernoulliNB(alpha=1.0, force_alpha=True, binarize=0.0, fit_prior=True, class_prior=None)


In [15]:
pred_prob = clf.predict_proba(X_test)
print('[scikit-learn] Predicted probabilities:\n', pred_prob)

[scikit-learn] Predicted probabilities:
 [[0.07896399 0.92103601]]
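
The columns of `predict_proba` follow the order of `clf.classes_`, which explains why the two numbers appear swapped relative to the from-scratch dictionary. The sketch below refits the same toy model and reads back the fitted attributes `classes_`, `class_log_prior_`, and `feature_log_prob_` so they can be compared with the hand-built prior and likelihood. It repeats the training data to stay self-contained.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Same toy data as in the from-scratch section
X_train = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0], [1, 1, 0]])
Y_train = ['Y', 'N', 'Y', 'Y']

clf = BernoulliNB(alpha=1.0, fit_prior=True).fit(X_train, Y_train)

# scikit-learn stores log probabilities, so exponentiate to compare
learned_prior = np.exp(clf.class_log_prior_)
learned_likelihood = np.exp(clf.feature_log_prob_)

print('Class order:', clf.classes_)
print('Learned priors:', learned_prior)
print('Learned P(feature = 1 | class):\n', learned_likelihood)
```

Each row of the learned likelihood corresponds to one entry of `clf.classes_`, so the rows should line up with the `N` and `Y` vectors from `get_likelihood` when `smoothing=1`.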


In [16]:
pred = clf.predict(X_test)
print('[scikit-learn] Prediction:', pred)


[scikit-learn] Prediction: ['Y']
