# Naive Bayes from Scratch – Dummy Implementation

This notebook demonstrates a **Gaussian Naive Bayes classifier built completely from scratch** using only NumPy. Instead of working with a real-world dataset, we use a **synthetic dataset** to focus purely on the core logic of the algorithm. scikit-learn is used only to generate and split the data, score the predictions, and provide a reference implementation.

The goal is to **demystify the mechanics of Naive Bayes**, walking step-by-step through how probabilities are calculated and predictions are made without relying on external machine learning libraries.

**Key Steps:**

1. Generating a synthetic dataset with `make_classification`
2. Calculating **prior probabilities** for each class
3. Estimating **likelihoods** for each feature given the class
4. Applying **Bayes' Theorem** to compute posterior probabilities
5. Making predictions and measuring their accuracy on the test split
6. Comparing results with `sklearn` for validation
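
Steps 2 to 5 can be traced by hand on a dataset small enough to check with pencil and paper. The sketch below is illustrative only: the five one-feature samples, the query value, and the helper name `gaussian_pdf` are assumptions made for easy arithmetic and are not part of the notebook's data. It uses the same Gaussian likelihood and the same log prior plus summed log likelihood rule as the class defined further down.

```python
import numpy as np

# Hypothetical toy data: one feature, two classes (values chosen for easy arithmetic).
x = np.array([[1.0], [2.0], [3.0], [6.0], [8.0]])
y = np.array([0, 0, 0, 1, 1])

classes = np.unique(y)

# Step 2: priors are the class frequencies -> 3/5 and 2/5.
priors = np.array([np.mean(y == c) for c in classes])

# Step 3: per-class mean and population variance of each feature.
# Class 0: mean 2, variance 2/3. Class 1: mean 7, variance 1.
means = np.array([x[y == c].mean(axis=0) for c in classes])
variances = np.array([x[y == c].var(axis=0) for c in classes])


def gaussian_pdf(value, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-(value - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)


# Step 4: log prior + sum of per-feature log likelihoods.
# This is the log posterior up to a constant shared by all classes.
query = np.array([4.0])
log_posteriors = np.array([
    np.log(priors[i]) + np.sum(np.log(gaussian_pdf(query, means[i], variances[i])))
    for i in range(len(classes))
])

# Step 5: predict the class with the largest log posterior.
prediction = classes[np.argmax(log_posteriors)]
print(priors, means.ravel(), variances.ravel(), log_posteriors, prediction)
```

The query value 4 lies closer to the class 0 mean, and class 0 also has the larger prior, so the prediction is class 0.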



In [52]:
import numpy as np

In [53]:
class GaussianNaiveBayes:
    def __init__(self):
        self._prior = None
        self._variance = None
        self._mean = None
        self._classes = None

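    def fit(self, x, y):
        """Estimate per-class feature means, variances and priors from x (n_sample, n_feature) and labels y."""
        n_sample, n_feature = x.shape
        self._classes = np.unique(y)
        n_classes = len(self._classes)
        # One row per class, one column per feature
        self._mean = np.zeros((n_classes, n_feature), dtype=np.float32)
        self._variance = np.zeros((n_classes, n_feature), dtype=np.float32)
        self._prior = np.zeros(n_classes, dtype=np.float32)

        for i, c in enumerate(self._classes):
            # Rows of x that belong to class c
            x_for_class_c = x[y == c]
            self._mean[i, :] = np.mean(x_for_class_c, axis=0)
            # Population variance (ddof=0); no smoothing is applied, so a
            # constant feature would lead to division by zero in likelihood()
            self._variance[i, :] = np.var(x_for_class_c, axis=0)
            # Prior = fraction of training samples in class c
            self._prior[i] = x_for_class_c.shape[0] / float(n_sample)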
    
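    def likelihood(self, class_idx, x):
        """Gaussian density of each feature of sample x under class class_idx."""
        mean = self._mean[class_idx]
        variance = self._variance[class_idx]

        # N(x | mean, variance) = exp(-(x - mean)^2 / (2 * variance)) / sqrt(2 * pi * variance)
        num = np.exp(-(x - mean) ** 2 / (2 * variance))
        denom = np.sqrt(2 * np.pi * variance)
        return num / denom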
    
    def predict(self, x):
        y_pred = [self._classify_sample(sample) for sample in x]
        return np.array(y_pred)
    
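    def _classify_sample(self, x):
        """Return the class with the highest (unnormalized) log posterior for one sample."""
        posteriors = []
        for i, c in enumerate(self._classes):
            log_prior = np.log(self._prior[i])
            # Features are assumed independent given the class, so their log likelihoods add up
            log_likelihood = np.sum(np.log(self.likelihood(i, x)))
            posteriors.append(log_prior + log_likelihood)

        return self._classes[np.argmax(posteriors)]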
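
One numerical detail of `_classify_sample` is worth illustrating: it evaluates the Gaussian density and then takes its logarithm. For a feature value far from the class mean, the exponential can underflow to zero, and the logarithm then becomes `-inf`. A common alternative is to expand the log density analytically so that no exponential is evaluated. The sketch below compares the two approaches. The function names and the feature values are hypothetical and chosen only to trigger the effect.

```python
import numpy as np


def log_gaussian_via_pdf(value, mean, variance):
    """Log density obtained by evaluating the density first, as `likelihood` does."""
    pdf = np.exp(-(value - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)
    with np.errstate(divide="ignore"):  # silence the log(0) warning for this demo
        return np.log(pdf)


def log_gaussian_direct(value, mean, variance):
    """Same quantity, expanded analytically so that no exponential is evaluated."""
    return -0.5 * np.log(2 * np.pi * variance) - (value - mean) ** 2 / (2 * variance)


mean, variance = 0.0, 1.0
near, far = 1.0, 50.0  # hypothetical feature values

# Close to the mean, both versions agree.
print(log_gaussian_via_pdf(near, mean, variance), log_gaussian_direct(near, mean, variance))

# Far from the mean, the density underflows to 0 and its log is -inf,
# while the direct form stays finite.
print(log_gaussian_via_pdf(far, mean, variance), log_gaussian_direct(far, mean, variance))
```

If every class yields `-inf` for some sample, `np.argmax` falls back to the first class, so the direct form is the safer choice when outliers are expected.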
        

In [54]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
import time

In [55]:
# Synthesize a binary classification dataset (500,000 samples, 20 features)
# and hold out 25% of it as a test split
x, y = make_classification(n_samples=500000, n_features=20, n_classes=2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

In [56]:
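# The timer covers fitting, prediction and scoring, not prediction alone.
start = time.perf_counter()
gnb = GaussianNaiveBayes()
# NOTE: the model is fitted on the full dataset (x, y), which includes the test rows,
# so the accuracy below is not a true held-out score; fit on (x_train, y_train) for that.
gnb.fit(x, y)
y_pred = gnb.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
end = time.perf_counter()
print(f'time to predict: {end - start}')
print(f'accuracy: {accuracy}')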

time to predict: 3.5389241999946535
accuracy: 0.877328


In [57]:
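# Same protocol with scikit-learn's GaussianNB: the timer again covers fit, predict and scoring,
# and the model is again fitted on the full dataset (x, y) so the two runs are comparable.
start = time.perf_counter()
gnb_std = GaussianNB()
gnb_std.fit(x, y)
y_pred_std = gnb_std.predict(x_test)
accuracy_std = accuracy_score(y_test, y_pred_std)
end = time.perf_counter()
print(f'time to predict: {end - start}')
print(f'accuracy: {accuracy_std}')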

time to predict: 0.20145789999514818
accuracy: 0.877328
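
Both implementations reach the same accuracy, but the from-scratch version takes about 3.5 s against about 0.2 s. A plausible cause is the per-sample Python loop in `predict`, which calls `_classify_sample` once for every row. The sketch below shows one way to score all samples at once with NumPy broadcasting. It is an assumption-laden illustration, not the notebook's code and not a description of scikit-learn's internals: the function names `fit_gaussian_nb` and `predict_vectorized` and the random two-cluster data are hypothetical.

```python
import numpy as np


def fit_gaussian_nb(x, y):
    """Per-class means, population variances and log priors (same estimates as `fit`)."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    variances = np.array([x[y == c].var(axis=0) for c in classes])
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    return classes, means, variances, log_priors


def predict_vectorized(x, classes, means, variances, log_priors):
    """Score every sample against every class in one broadcasted expression."""
    # x: (n, d) -> (n, 1, d); means and variances: (k, d) -> (1, k, d)
    diff = x[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :]
    )
    # Sum over features, add the log prior: result has shape (n, k)
    joint = log_priors[None, :] + log_likelihood.sum(axis=2)
    return classes[np.argmax(joint, axis=1)]


# Hypothetical, well-separated two-class data for a quick sanity check.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-2.0, scale=1.0, size=(200, 3))
x1 = rng.normal(loc=2.0, scale=1.0, size=(200, 3))
x_demo = np.vstack([x0, x1])
y_demo = np.array([0] * 200 + [1] * 200)

params = fit_gaussian_nb(x_demo, y_demo)
y_demo_pred = predict_vectorized(x_demo, *params)
demo_accuracy = np.mean(y_demo_pred == y_demo)
print(f'accuracy on the demo data: {demo_accuracy}')
```

The intermediate array has shape `(n, k, d)`, so memory grows with samples times classes times features. For very large inputs, looping over the classes while keeping the per-sample work vectorized is a reasonable compromise.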
