# Assignment 0 - Task 2: Naive Bayes Classifier

## Problem Statement
Predict heart disease (`target`) from patient attributes (`age`, `sex`, `chol`, etc.).
We will build the **Naive Bayes** algorithm from scratch, demonstrating the logic step-by-step.

## Approach: Incremental Building
1.  **Data Preparation**: We will process the data one variable at a time.
2.  **The Math**: We will define independent functions for the core formulas (`Prior`, `Likelihood`).
3.  **The Assembly**: We will combine these functions into a `CustomNaiveBayes` class.
4.  **Evaluation**: We will compare our results with `sklearn`.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## 1. Data Preparation
First, we load the dataset.

In [None]:
df = pd.read_csv('heart-dataset.csv')

Let's inspect the raw data to identify text columns.

In [None]:
df.head()

### 1.1 Encoding Target
The `target` column contains strings. Let's check the unique values.

In [None]:
df['target'].unique()

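We will convert 'No Disease' to 0 and any other value (Disease) to 1. Because the rule is a substring match, a missing value would also be encoded as 1, so it is worth confirming that the unique values above contain only the two expected labels.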

In [None]:
df['target'] = df['target'].apply(lambda x: 0 if 'No Disease' in str(x) else 1)

Let's verify the counts.

In [None]:
df['target'].value_counts()
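
The substring rule treats every value that does not contain 'No Disease' as a positive case, including missing values. The sketch below illustrates this on a toy Series. The label strings and the `strict` mapping are assumptions for illustration, not confirmed values from `heart-dataset.csv`.

```python
import numpy as np
import pandas as pd

# Toy labels; the exact strings in the real dataset are an assumption here.
toy = pd.Series(["No Disease", "Disease", np.nan])

# Same rule as the notebook: substring match, everything else becomes 1.
encoded = toy.apply(lambda x: 0 if "No Disease" in str(x) else 1)
print(encoded.tolist())  # the missing value is silently encoded as 1

# Stricter variant: explicit mapping, unknown or missing labels stay NaN.
strict = toy.map({"No Disease": 0, "Disease": 1})
print(int(strict.isna().sum()))  # number of rows that need attention
```

If the strict variant reports zero unmapped rows on the real data, both encodings give the same result.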

### 1.2 Encoding Categorical Features
The `sex` column is also categorical (`Male`, `Female`).

In [None]:
df['sex'].unique()

We map `Male` to 1 and `Female` to 0. Any value not listed in the mapping becomes `NaN`.

In [None]:
df['sex'] = df['sex'].map({'Male': 1, 'Female': 0})

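For this initial implementation, we keep only the numeric columns, because the Gaussian likelihood defined below needs numeric inputs. Any column that still contains text is dropped, and `target` is removed because it is the label.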

In [None]:
X = df.select_dtypes(include=[np.number]).drop(columns=['target'])

Let's verify which features we selected.

In [None]:
X.columns.tolist()
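
To make the effect of `select_dtypes` concrete, here is a minimal sketch on a toy frame. The `chest_pain` column and all values are hypothetical, chosen only to show that a text column is excluded without any warning.

```python
import numpy as np
import pandas as pd

# Toy frame; `chest_pain` is a hypothetical text column.
toy = pd.DataFrame({
    "age": [63, 45],
    "sex": [1, 0],
    "chest_pain": ["typical", "atypical"],
    "target": [1, 0],
})

# Same selection as the notebook: numeric columns only, minus the label.
X_toy = toy.select_dtypes(include=[np.number]).drop(columns=["target"])

# Columns that were silently left out of the feature matrix.
dropped = sorted(set(toy.columns) - set(X_toy.columns) - {"target"})
print(X_toy.columns.tolist(), dropped)
```

Listing the dropped columns is a quick way to see which features the model never gets to use.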

## 2. Component Implementation
We will now implement the core mathematical components of Naive Bayes as independent functions.

### 2.1 The Prior
The prior probability $P(C)$ is the relative frequency of a class in the dataset: the number of samples in that class divided by the total number of samples.

In [None]:
def get_prior(df, target_col, class_val):
    total_samples = len(df)
    class_samples = len(df[df[target_col] == class_val])
    return class_samples / total_samples

**Test:** Check the probability of having the disease in our dataset.

In [None]:
prior_disease = get_prior(df, 'target', 1)
print(f"Prior Probability of Disease: {prior_disease:.4f}")
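
As a sanity check, the priors over all classes should sum to 1. The sketch below uses a compact, equivalent form of `get_prior` on a made-up four-row frame, so the numbers are illustrative only.

```python
import pandas as pd

def get_prior(df, target_col, class_val):
    # Fraction of rows whose label equals class_val.
    return float((df[target_col] == class_val).mean())

# Toy labels: three positive cases and one negative case.
toy = pd.DataFrame({"target": [1, 1, 1, 0]})

priors = {c: get_prior(toy, "target", c) for c in (0, 1)}
print(priors)
print(sum(priors.values()))
```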

### 2.2 Gaussian Likelihood
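For continuous features (like `age`), we assume the values are normally distributed within each class. The likelihood is then the Gaussian probability density evaluated at $x$, with class mean $\mu$ and class variance $\sigma^2$: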
$$ P(x|C) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$

In [None]:
def gaussian_pdf(x, mean, var):
    epsilon = 1e-9  # Small constant added to the variance to avoid division by zero
    coefficient = 1 / np.sqrt(2 * np.pi * (var + epsilon))
    exponent = np.exp(-((x - mean)**2) / (2 * (var + epsilon)))
    return coefficient * exponent

**Test:** Let's verify this with a simple example. If the mean age of sick patients is 55 (with variance 100), a patient aged 60 should have a higher likelihood than a patient aged 20.

In [None]:
test_mean = 55
test_var = 100

prob_60 = gaussian_pdf(60, test_mean, test_var)
prob_20 = gaussian_pdf(20, test_mean, test_var)

print(f"Likelihood of Age 60: {prob_60:.5f}")
print(f"Likelihood of Age 20: {prob_20:.5f}")
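
For values far from the mean, the density can underflow to exactly 0, and its logarithm is then negative infinity. One common workaround is to compute the log-density directly. The function `gaussian_log_pdf` below is a hypothetical helper written for this sketch, not part of the notebook.

```python
import numpy as np

def gaussian_pdf(x, mean, var, epsilon=1e-9):
    # Same formula as the notebook's gaussian_pdf.
    var = var + epsilon
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def gaussian_log_pdf(x, mean, var, epsilon=1e-9):
    # log of the Gaussian density, computed without ever forming the density.
    var = np.asarray(var, dtype=float) + epsilon
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

# In a moderate case the two routes agree.
print(np.log(gaussian_pdf(60, 55, 100)), gaussian_log_pdf(60, 55, 100))

# In an extreme case the density underflows, but the log-density stays finite.
print(gaussian_pdf(1000.0, 0.0, 1.0), gaussian_log_pdf(1000.0, 0.0, 1.0))
```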

## 3. Class Assembly
We will now organize these components into a Python class with an `sklearn`-style `fit`/`predict` interface.

In [None]:
class CustomNaiveBayes:
    def fit(self, X, y):
        # Store unique classes (0, 1)
        self.classes = np.unique(y)
        n_classes = len(self.classes)
        n_features = X.shape[1]
        
        # Initialize arrays to store Mean, Variance, and Priors for each class
        self.means = np.zeros((n_classes, n_features))
        self.vars = np.zeros((n_classes, n_features))
        self.priors = np.zeros(n_classes)
        
        for idx, c in enumerate(self.classes):
            # Subset data for class `c`
            X_c = X[y == c]
            
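            # Calculate statistics
            self.means[idx, :] = X_c.mean(axis=0)
            # ddof=0 gives the population variance, which is what sklearn's
            # GaussianNB uses (pandas defaults to the sample variance, ddof=1)
            self.vars[idx, :] = X_c.var(axis=0, ddof=0)
            self.priors[idx] = X_c.shape[0] / len(X)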
            
    def predict(self, X):
        # Predict for every row in X
        y_pred = [self._predict_single(x) for x in X.values]
        return np.array(y_pred)
    
    def _predict_single(self, x):
        posteriors = []
        
        for idx, c in enumerate(self.classes):
            # 1. Start with Log Prior
            prior = np.log(self.priors[idx])
            
            # 2. Calculate Likelihood for all features using Gaussian PDF
            class_pdf = gaussian_pdf(x, self.means[idx], self.vars[idx])
            
            # 3. Sum the log-likelihoods (a sum of logs equals the log of the product of probabilities)
            posterior = prior + np.sum(np.log(class_pdf))
            posteriors.append(posterior)
            
        # Return the class with the highest (unnormalized) log posterior
        return self.classes[np.argmax(posteriors)]
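
The per-row loop in `predict` computes, for each class $c$, the score $\log P(c) + \sum_j \log P(x_j \mid c)$ and picks the largest. The same rule can be written for all rows at once with NumPy broadcasting. The sketch below is one possible vectorized form under that assumption; `joint_log_likelihood` and the toy parameters are made up for illustration.

```python
import numpy as np

def joint_log_likelihood(X, means, variances, priors, epsilon=1e-9):
    """Return an (n_samples, n_classes) array of log P(c) + sum_j log P(x_j | c)."""
    X = np.asarray(X, dtype=float)
    means = np.asarray(means, dtype=float)
    var = np.asarray(variances, dtype=float) + epsilon
    # Broadcast to shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * np.log(2 * np.pi * var)[None, :, :] - diff ** 2 / (2 * var[None, :, :])
    return np.log(priors)[None, :] + log_pdf.sum(axis=2)

# Toy parameters for two classes and two features (made-up numbers).
means = np.array([[50.0, 200.0], [60.0, 260.0]])
variances = np.array([[80.0, 900.0], [70.0, 1000.0]])
priors = np.array([0.5, 0.5])

X_new = np.array([[48.0, 195.0], [63.0, 270.0]])
scores = joint_log_likelihood(X_new, means, variances, priors)
pred = scores.argmax(axis=1)  # index into the class list, one per row
print(pred)
```

The `argmax` result is a position in the class array, so a full implementation would still map it back through `self.classes`.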

## 4. Training and Evaluation
We will now train the model on 80% of the data and evaluate it on the remaining 20%, which the model never sees during training.

### 4.1 Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4.2 Training Custom Model

In [None]:
model = CustomNaiveBayes()
model.fit(X_train, y_train)

### 4.3 Prediction

In [None]:
predictions = model.predict(X_test)

### 4.4 Results

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
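
To see where the numbers in such a report come from, the sketch below computes accuracy, precision, and recall for the positive class by hand. The label arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Toy ground truth and predictions (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

# Confusion counts for the positive class (1 = disease).
tp = int(np.sum((y_true == 1) & (y_hat == 1)))
tn = int(np.sum((y_true == 0) & (y_hat == 0)))
fp = int(np.sum((y_true == 0) & (y_hat == 1)))
fn = int(np.sum((y_true == 1) & (y_hat == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(accuracy, precision, recall)
```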

## 5. Benchmarking
We verify our implementation against `sklearn`'s `GaussianNB`, trained and evaluated on the same split.

In [None]:
from sklearn.naive_bayes import GaussianNB

sk_model = GaussianNB()
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)

In [None]:
print("Sklearn Accuracy:", accuracy_score(y_test, sk_preds))
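
When a hand-written Gaussian Naive Bayes is compared with a library version, one possible source of small differences is the variance estimator. pandas computes the sample variance (`ddof=1`) by default, while NumPy's `np.var` computes the population variance (`ddof=0`). The toy column below is made up to show the size of the gap on a tiny sample.

```python
import numpy as np
import pandas as pd

# Four made-up values; the gap shrinks as the sample grows.
col = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = col.var()             # pandas default: divides by n - 1
population_var = col.var(ddof=0)   # divides by n
numpy_var = np.var(col.to_numpy()) # NumPy default: divides by n

print(sample_var, population_var, numpy_var)
```

On a dataset with hundreds of rows per class the two estimates are close, so this usually changes predictions only for samples near the decision boundary.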