## Assignment-2 : Naive Bayes
### Name: Utkarsh Sathawane
### Roll: 25CS60R75
### Section: B

## Set-up and Installations

In [None]:
!pip install scikit-learn pandas numpy matplotlib seaborn ucimlrepo

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split


## Spambase Dataset (Odd Roll Numbers)

In [None]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
spambase = fetch_ucirepo(id=94)

x = spambase.data.features
y = spambase.data.targets

print(x.head())
print(y.head())
# Feature names, in column order
allfeature = list(x.columns)

## Pre-Processing

A data cleaning function was defined to ensure all feature data was numeric and to handle any missing values. Upon execution, it was found that 0 rows were dropped due to missing or bad values. The cleaned dataset was then split into training and testing sets using an 80/20 ratio, resulting in 3680 training samples and 921 test samples. The split was stratified to maintain the original class distribution in both sets.
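As a quick arithmetic check on the reported split sizes, the sketch below recomputes them from the total row count. It assumes the cleaned dataset has 4601 rows (3680 + 921, as stated above) and that the test count is rounded up, which is how scikit-learn's train_test_split is understood to handle a fractional test_size; treat that rounding rule as an assumption rather than a guarantee.

```python
import math

# Assumed total: 3680 training + 921 test samples, as reported above.
n_samples = 4601
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("train:", n_train, "test:", n_test)
```

Under these assumptions the counts agree with the shapes printed by the split cell.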

In [None]:
def clean_data(x, y, allfeature):
    x = x.apply(pd.to_numeric, errors='coerce')
    label_name = y.columns[0]
    full_df = pd.concat([x, y], axis=1)
    count = len(full_df)
    full_df.dropna(inplace=True)
    print("Dropped", count - len(full_df), "rows with missing/bad values.")
    X = full_df[allfeature].values
    y = full_df[label_name].values.ravel()
    return X, y

X, y = clean_data(x, y, allfeature)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "y_test shape:", y_test.shape)



## Visualize the data

Initial data visualization was performed. This included plotting a correlation heatmap of all features to observe relationships between them, and a count plot of the target variable ('Class') to visualize the distribution of spam versus non-spam emails in the dataset. The cell below plots the class-conditional distribution of each feature in the training set.

In [None]:
def plotfeature(X_train, y_train, allfeature):
    plot_df = pd.DataFrame(X_train, columns=allfeature)
    plot_df['label'] = y_train
    for feature in allfeature:
        plt.figure(figsize=(8, 5))
        sns.histplot(data=plot_df, x=feature, hue='label', kde=True, stat="density", common_norm=False)
        plt.title('Distribution of "' + feature + '"')
        plt.legend(title='Spam', labels=['1 (Spam)', '0 (Not Spam)'])
        plt.tight_layout()
        plt.show()

plotfeature(X_train, y_train, allfeature)


## Implement Naive Bayes from Scratch    

A NaiveBayesClassifier class was implemented from scratch using Gaussian Naive Bayes principles.

- fit(X, y): This method calculates the per-class mean and variance of each feature and the log-prior of each class (spam/non-spam). A smoothing parameter alpha (default 1.0 in the code below; the tuning step tries values from 0.0001 to 10) is added to every variance to prevent numerical instability from features with zero variance.
- predict(X): This method calculates the log-likelihood of each sample under each class using the Gaussian probability density formula, summed over features. It adds the class log-prior to obtain an unnormalized log-posterior score and predicts the class with the highest score.
- Helper functions were also created: findmetrics to calculate accuracy, precision, recall, and F1-score, and calculate_manual_confusion_matrix to generate a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]].
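To make the quantities computed by fit concrete, here is a minimal single-feature sketch using only the standard library. The toy samples, the value of alpha, and the helper name fit_one_feature are hypothetical and chosen for illustration; the population variance (pvariance) is used because it matches the default behaviour of np.var.

```python
import math
from statistics import fmean, pvariance

# Hypothetical toy data: (feature value, class label).
samples = [(0.0, 0), (0.2, 0), (0.1, 0), (0.8, 1), (1.0, 1)]
alpha = 1e-3  # illustrative smoothing value


def fit_one_feature(samples, alpha):
    """Return per-class log-prior, mean and smoothed variance for one feature."""
    n = len(samples)
    stats = {}
    for label in sorted({lab for _, lab in samples}):
        values = [v for v, lab in samples if lab == label]
        stats[label] = {
            "log_prior": math.log(len(values) / n),
            "mean": fmean(values),
            # Population variance (ddof=0) plus the smoothing term.
            "var": pvariance(values) + alpha,
        }
    return stats


stats = fit_one_feature(samples, alpha)
for label, s in stats.items():
    print(label, s)
```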

In [None]:

class NaiveBayesClassifier:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self._classes = None
        self._log_priors = {}
        self._means = {}
        self._vars = {}

    def fit(self, X, y):
        n, m = X.shape
        self._classes = np.unique(y)
        for c in self._classes:
            X_c = X[y == c]
            self._log_priors[c] = np.log(len(X_c) / n)
            self._means[c] = np.mean(X_c, axis=0)
            self._vars[c] = np.var(X_c, axis=0) + self.alpha

    def predict(self, x):
        ypred = []
        for s in x:
            scores = []
            for c in self._classes:
                lp = self._log_priors[c]
                mu = self._means[c]
                var = self._vars[c]
                num = -((s - mu) ** 2) / (2 * var)
                den = 0.5 * np.log(2 * np.pi * var)
                ll = num - den
                tot = np.sum(ll)
                score = lp + tot
                scores.append(score)
            ypred.append(self._classes[np.argmax(scores)])
        return np.array(ypred)
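The per-sample loop in predict can also be written with NumPy broadcasting, which scores every sample against every class in one pass. The sketch below is an alternative formulation, not the implementation used in this notebook; the function names and the toy parameters are hypothetical, and the variances are assumed to already include the smoothing term.

```python
import numpy as np


def gaussian_nb_scores(X, log_priors, means, variances):
    """Unnormalized log-posterior scores, shape (n_samples, n_classes).

    means and variances have shape (n_classes, n_features); variances are
    assumed to already include the smoothing term alpha.
    """
    X = np.asarray(X, dtype=float)
    # Broadcast to (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_norm = -0.5 * np.log(2 * np.pi * variances)[None, :, :]
    log_lik = log_norm - diff ** 2 / (2 * variances)[None, :, :]
    # Sum over features, then add the class log-priors.
    return log_priors[None, :] + log_lik.sum(axis=2)


def predict_vectorized(X, classes, log_priors, means, variances):
    scores = gaussian_nb_scores(X, log_priors, means, variances)
    return classes[np.argmax(scores, axis=1)]


# Hypothetical two-class, two-feature parameters.
classes = np.array([0, 1])
log_priors = np.log(np.array([0.6, 0.4]))
means = np.array([[0.0, 1.0], [2.0, 3.0]])
variances = np.ones((2, 2))

X_demo = np.array([[0.1, 0.9], [1.9, 3.2]])
scores_demo = gaussian_nb_scores(X_demo, log_priors, means, variances)
preds = predict_vectorized(X_demo, classes, log_priors, means, variances)
print(preds)
```

The intermediate array has shape (n_samples, n_classes, n_features), so this trades memory for speed; for a dataset of this size that is unlikely to be a concern.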



In [None]:

def findmetrics(y_true, y_pred, label=1):
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    for i in range(len(y_true)):
        if y_true[i] == label and y_pred[i] == label:
            TP += 1
        elif y_true[i] != label and y_pred[i] != label:
            TN += 1
        elif y_true[i] != label and y_pred[i] == label:
            FP += 1
        elif y_true[i] == label and y_pred[i] != label:
            FN += 1
    total = TP + TN + FP + FN
    accuracy = (TP + TN) / total if total else 0
    precision = TP / (TP + FP) if (TP + FP) else 0
    recall = TP / (TP + FN) if (TP + FN) else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    return accuracy, precision, recall, f1



# acc, prec, rec, f1 = findmetrics(y_test, y_pred)
# cm = calculate_manual_confusion_matrix(y_test, y_pred)


In [None]:
def calculate_manual_confusion_matrix(y_true, y_pred, label=1):
    TN = 0
    FP = 0
    FN = 0
    TP = 0
    for i in range(len(y_true)):
        if y_true[i] == label and y_pred[i] == label:
            TP += 1
        elif y_true[i] != label and y_pred[i] != label:
            TN += 1
        elif y_true[i] != label and y_pred[i] == label:
            FP += 1
        elif y_true[i] == label and y_pred[i] != label:
            FN += 1
    return np.array([[TN, FP], [FN, TP]])
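A small worked example may help tie the metrics to the [[TN, FP], [FN, TP]] layout used above. The labels below are made up for illustration, and the helper name confusion_and_metrics is hypothetical; it is a compact restatement of the same counting logic, not a replacement for the functions in this notebook.

```python
def confusion_and_metrics(y_true, y_pred, positive=1):
    """Return ([[TN, FP], [FN, TP]], accuracy, precision, recall, f1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0
    return [[tn, fp], [fn, tp]], accuracy, precision, recall, f1


# Hypothetical labels: 4 spam (1) and 6 non-spam (0) emails.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

matrix, acc, prec, rec, f1 = confusion_and_metrics(y_true_demo, y_pred_demo)
print(matrix)
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

In this toy case precision is higher than recall: most predicted spam really is spam, but half of the actual spam is missed.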

In [None]:
def testmodel(alphalist, xtrain, xtest, ytrain, ytest):
    res = []
    for a in alphalist:
        model = NaiveBayesClassifier(alpha=a)
        model.fit(xtrain, ytrain)
        yptrain = model.predict(xtrain)
        yptest = model.predict(xtest)
        tracc, trpre, trrec, trf1 = findmetrics(ytrain, yptrain)
        teacc, tepre, terec, tef1 = findmetrics(ytest, yptest)
        res.append({
            'alpha': a,
            'train_accuracy': tracc,
            'test_accuracy': teacc,
            'train_precision': trpre,
            'test_precision': tepre,
            'train_recall': trrec,
            'test_recall': terec,
            'train_f1': trf1,
            'test_f1': tef1
        })
        print("Alpha used :", a, "| test accuracy:", round(teacc, 4), "| test f1:", round(tef1, 4))
    return pd.DataFrame(res)


alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10]
res_df = testmodel(alpha, X_train, X_test, y_train, y_test)
print("\nhyperparameter tuning complete.")

In [None]:
print(res_df)


## Visualization and Hyperparameter Tuning

Results: The model's performance was tracked across alpha values of 0.0001, 0.001, 0.01, 0.1, 1, and 10. The optimal performance was achieved with alpha = 0.001, yielding a test accuracy of 0.8339 and an F1-score of 0.8189.

Analysis: Performance degraded significantly at higher alphas (1 and 10), which exhibited high precision but very low recall. Plots of accuracy, F1, precision, and recall versus alpha, and confusion matrices for the best model, were generated.


In [None]:
def trainbestmodel(results, xtrain, xtest, ytrain):
    row = results.loc[results['test_accuracy'].idxmax()]
    bestalpha = row['alpha']
    print("\nBest performing alpha (by test accuracy):", bestalpha)
    model = NaiveBayesClassifier(alpha=bestalpha)
    model.fit(xtrain, ytrain)
    ypredtrain = model.predict(xtrain)
    ypredtest = model.predict(xtest)
    return model, ypredtrain, ypredtest, bestalpha

best_model, y_pred_train_best, y_pred_test_best, best_alpha = trainbestmodel(res_df, X_train, X_test, y_train)

In [None]:
def plotconfmatrix(ytrain, ytest, ypredtrain, ypredtest, bestalpha):
    cmtrain = calculate_manual_confusion_matrix(ytrain, ypredtrain)
    cmtest = calculate_manual_confusion_matrix(ytest, ypredtest)
    labels = ['Not Spam (0)', 'Spam (1)']
    plt.figure(figsize=(14, 6))
    plt.subplot(1, 2, 1)
    sns.heatmap(cmtrain, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title('Training set confusion matrix (alpha = ' + str(bestalpha) + ')')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
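    plt.subplot(1, 2, 2)
    sns.heatmap(cmtest, annot=True, fmt='d', cmap='Oranges', xticklabels=labels, yticklabels=labels)
    # The test-set panel was missing its title
    plt.title('Test set confusion matrix (alpha = ' + str(bestalpha) + ')')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.tight_layout()
    plt.show()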



In [None]:
plotconfmatrix(y_train, y_test, y_pred_train_best, y_pred_test_best, best_alpha)

In [None]:
# Plot one metric against alpha (log-scaled x-axis) for the training and test sets
def plotmetric(res_df, metric, title):
    plt.plot(res_df['alpha'], res_df[f'train_{metric}'], 'o-', label='Train')
    plt.plot(res_df['alpha'], res_df[f'test_{metric}'], 'o-', label='Test')
    plt.xscale('log')
    plt.xlabel("Smoothing parameter 'alpha' (log scale)")
    plt.ylabel(title)
    plt.title(title + " vs. smoothing parameter")
    plt.legend()

def plotallmetrics(res_df):
    metrics = [('accuracy', 'Accuracy'), ('precision', 'Precision'),
               ('recall', 'Recall'), ('f1', 'F1-score')]
    plt.figure(figsize=(16, 12))
    for i, (metric, title) in enumerate(metrics, 1):
        plt.subplot(2, 2, i)
        plotmetric(res_df, metric, title)
    plt.suptitle("Model performance vs. hyperparameter 'alpha'", fontsize=18)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()



In [None]:
plotallmetrics(res_df)


#### Explain which value of ‘a’ is most suitable for your dataset and why.
#### Discuss how this smoothing parameter influences the model and helps prevent overfitting.
-----------------------
The best value for ‘a’ is 0.001 because it gives the highest test accuracy among the values tried, meaning the model performs best on the held-out test data. In Gaussian Naive Bayes, this parameter is added to each variance (σ² + a), which avoids division by zero for features with zero variance. It also helps reduce overfitting by slightly broadening the bell curve of each feature. This makes the model less overconfident on the training data and more tolerant of small changes in new or unseen emails.
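The broadening effect of the smoothing term can be seen numerically. The sketch below uses made-up means and variances for a single feature (none of them come from the Spambase results) and reports how the log-likelihood gap between the two classes changes as the smoothing value grows.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Hypothetical per-class (mean, variance) for one feature.
class_params = {0: (0.0, 0.01), 1: (0.5, 0.04)}
x = 0.4  # a sample value that sits closer to the class-1 mean

alphas = [0.0, 0.001, 0.1, 1.0, 10.0]
gaps = []
for a in alphas:
    # Smoothing is applied as variance + a, as in the classifier above.
    ll = {c: gaussian_log_pdf(x, m, v + a) for c, (m, v) in class_params.items()}
    gaps.append(ll[1] - ll[0])
    print("alpha =", a, "| log-likelihood gap (class 1 - class 0):", round(gaps[-1], 4))
```

For these illustrative numbers the gap shrinks steadily as the smoothing value increases: the two bell curves flatten and become harder to tell apart, so a very large value leaves the feature with little say in the decision.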


## Investigating the Independence Assumption

This section tested the "naive" assumption of feature independence.
A highly discriminative feature, 'char_freq_!' (index 51), was selected.
This feature was duplicated one, two, three, and four times, increasing the dataset's feature count from 57 to 61.
The model (using the best alpha of 0.001) was retrained and evaluated at each step.
Findings: The test accuracy did not decrease; it remained stable at 0.8339 for 1 and 2 copies and slightly increased to 0.8350 for 3 and 4 copies.
Analysis: This result illustrates the classifier's naive independence assumption. Because the model treats all features as independent, adding perfectly correlated copies of a strong predictor (char_freq_!) causes it to count that feature's evidence two, three, four, or five times. With one copy the feature's likelihood term is squared, and with four copies it is raised to the 5th power, so its importance is amplified rather than the redundancy being penalized.

In [None]:
# Select one feature that you believe is highly discriminative

# Ans : char_freq_! (index 51) is highly discriminative

def selectfeature(X, allfeature, best_alpha):

    feature_index_to_copy = 51
    feature_name = allfeature[feature_index_to_copy]

    print("Investigating Independence Assumption")
    print(f"Using best alpha: {best_alpha}")
    print(f"Duplicating feature: '{feature_name}' (index {feature_index_to_copy})")

    col = X[:, [feature_index_to_copy]]

    return col, feature_index_to_copy, feature_name

featurecol, feature_index_to_copy, feature_name = selectfeature(X, allfeature, best_alpha)



In [None]:
dup_res = []
xmod = X.copy()
print(f"Original dataset has {xmod.shape[1]} features.")

def evaluatemodel(xdata, ncopies):
    xtrain, xtest, ytrain, ytest = train_test_split(xdata, y, test_size=0.2, random_state=42, stratify=y)
    model = NaiveBayesClassifier(alpha=best_alpha)
    model.fit(xtrain, ytrain)
    ypredtrain = model.predict(xtrain)
    ypredtest = model.predict(xtest)
    trainacc, trainpre, trainrec, trainf1 = findmetrics(ytrain, ypredtrain)
    testacc, testpre, testrec, testf1 = findmetrics(ytest, ypredtest)
    print(f"  Trained with {ncopies} added copies | Total Features: {xdata.shape[1]:<3} | Test Acc: {testacc:.4f}")
    return {
        'n_copies': ncopies,
        'train_accuracy': trainacc, 'test_accuracy': testacc,
        'train_precision': trainpre, 'test_precision': testpre,
        'train_recall': trainrec, 'test_recall': testrec,
        'train_f1': trainf1, 'test_f1': testf1
    }

res0 = evaluatemodel(xmod, 0)
dup_res.append(res0)


In [None]:
# Dataset 1(one copy)

xmod = np.hstack((xmod, featurecol))
res1 = evaluatemodel(xmod, 1)
dup_res.append(res1)

In [None]:
# Dataset 2(two copies)

xmod = np.hstack((xmod, featurecol))
res2 = evaluatemodel(xmod, 2)
dup_res.append(res2)

In [None]:
# Dataset 3(three copies)
xmod = np.hstack((xmod, featurecol))
res3 = evaluatemodel(xmod, 3)
dup_res.append(res3)

In [None]:
# Dataset 4(four copies)
xmod = np.hstack((xmod, featurecol))
res4 = evaluatemodel(xmod, 4)
dup_res.append(res4)

In [None]:
dup_res_df = pd.DataFrame(dup_res)

## Plot the results

In [None]:
metrics_to_plot = [('accuracy', 'Accuracy'), ('f1', 'F1-Score')]

plt.figure(figsize=(14, 6))
for i, (metric, title) in enumerate(metrics_to_plot, 1):
    plt.subplot(1, 2, i)
    plt.plot(dup_res_df['n_copies'], dup_res_df[f'train_{metric}'], 'o-', label=f'Train {title}')
    plt.plot(dup_res_df['n_copies'], dup_res_df[f'test_{metric}'], 'o-', label=f'Test {title}')
    plt.xlabel(f"Number of Added Copies of '{feature_name}'")
    plt.ylabel(title)
    plt.title(f'{title} vs. Duplicated Features')
    plt.xticks(range(5))
    plt.legend()
    plt.grid(True, which="both", ls="--", alpha=0.5)

plt.suptitle("Effect of Violating the Independence Assumption", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## Discussions
#### How does adding duplicate (and thus perfectly correlated) features affect the classifier's performance?
#### Explain this behavior by referencing the Naive Bayes decision rule.
#### What happens mathematically to the likelihood term when you add a copy of a feature?

---------------------------
1.
In general, adding duplicate features can degrade test performance: the model becomes overconfident in the one amplified feature and may generalize worse to new data. In this experiment, however, the effect was small. As reported in the findings above, test accuracy stayed at 0.8339 with 1 and 2 copies and rose slightly to 0.8350 with 3 and 4 copies.

2.
The Naive Bayes decision rule is $\hat{y} = \arg\max_{y} P(y) \prod_{i} P(x_i \mid y)$. The core of this rule is the "naive" assumption that all features ($x_i$) are independent. By adding a duplicate feature, we perfectly violate this assumption. The classifier, being "naive," doesn't know it's a copy and treats it as a new, independent piece of evidence. This "double counting" gives that one feature's "vote" an unfairly large influence on the final decision.

3.
When you add a copy $x_k$ of a feature $x_j$, the likelihood term in the product changes:

- Original: $\cdots \times P(x_j \mid y) \times \cdots$
- With 1 copy: $\cdots \times P(x_j \mid y) \times P(x_k \mid y) \times \cdots$

Since $x_j = x_k$, their probabilities are identical, and the new term becomes: $\cdots \times \left(P(x_j \mid y)\right)^2 \times \cdots$

This squares the feature's contribution. If you add 4 copies, that single feature's probability is raised to the 5th power, $\left(P(x_j \mid y)\right)^5$, massively amplifying its importance and drowning out the evidence from all other features.
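In log space, raising the likelihood to the power k + 1 is the same as multiplying the feature's log-likelihood by k + 1. The sketch below illustrates this with hypothetical numbers (the class parameters, the sample value, and the summed log-likelihood of the other features are all invented for the example) and shows how a prediction can flip once the duplicated feature outweighs the rest.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the Gaussian density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


# Hypothetical (mean, variance) of the duplicated feature x_j per class.
params = {0: (0.1, 0.04), 1: (0.6, 0.09)}
# Hypothetical summed log-likelihood of all other features, per class.
other_log_lik = {0: -10.0, 1: -12.5}
log_prior = {0: math.log(0.5), 1: math.log(0.5)}
x_j = 0.5


def class_scores(n_copies):
    """Score each class when x_j appears once plus n_copies duplicates."""
    out = {}
    for c, (mean, var) in params.items():
        dup_term = gaussian_log_pdf(x_j, mean, var)
        out[c] = log_prior[c] + other_log_lik[c] + (n_copies + 1) * dup_term
    return out


predictions = []
for k in range(5):
    s = class_scores(k)
    predictions.append(max(s, key=s.get))
    print("copies:", k, "| score gap (class 1 - class 0):", round(s[1] - s[0], 3),
          "| predicted:", predictions[-1])
```

In this toy setting the other features favour class 0, but a single extra copy of x_j is enough to tip the decision to class 1. Whether that helps or hurts on real data depends on how reliable the duplicated feature is, which is consistent with the small changes in accuracy observed in the experiment above.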