# Spam Detection using Naive Bayes

This notebook implements a **Bernoulli Naive Bayes** classifier from scratch to distinguish spam emails from non-spam (ham) emails. We will use the [Spambase Data Set](https://archive.ics.uci.edu/ml/datasets/spambase) from the UCI Machine Learning Repository. 📧

The core idea is to calculate the probability of an email being spam given its features, and compare that to the probability of it being not spam. According to Bayes' theorem:

$$ P(\text{Class} | \text{Features}) = \frac{P(\text{Features} | \text{Class}) \cdot P(\text{Class})}{P(\text{Features})} $$

Since we only care about which class has the higher probability, we can ignore the denominator $P(\text{Features})$, which is the same for both classes, and compare the unnormalized posteriors:

$$ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
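
As a small illustration of why the denominator can be dropped, the sketch below compares the two classes with and without normalization. The prior and likelihood values are hypothetical numbers chosen for the example, not estimates from the Spambase data.

```python
# Hypothetical numbers for a single email, chosen only for illustration.
prior = {"ham": 0.6, "spam": 0.4}
likelihood = {"ham": 0.002, "spam": 0.01}  # P(Features | Class)

# Unnormalized posteriors: likelihood * prior.
unnormalized = {c: likelihood[c] * prior[c] for c in prior}

# P(Features) is the sum over classes; dividing by it rescales both scores equally.
evidence = sum(unnormalized.values())
normalized = {c: unnormalized[c] / evidence for c in prior}

# The winning class is the same either way.
best_unnormalized = max(unnormalized, key=unnormalized.get)
best_normalized = max(normalized, key=normalized.get)
print(best_unnormalized, best_normalized)  # spam spam
```

Normalizing changes the scores but not their order, so the prediction is unaffected.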

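To avoid numerical underflow (the product of many small probabilities can become too small to represent in floating point), we work with logarithms, which turn the product into a sum. The logarithm is monotonic, so the comparison between classes is unchanged, and the relation below holds up to an additive constant shared by both classes: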

$$ \log(\text{posterior}) = \log(\text{likelihood}) + \log(\text{prior}) $$
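
A minimal sketch of the underflow problem, assuming a hypothetical email with 400 features that each have a likelihood of 0.01 (these numbers are illustrative only):

```python
import numpy as np

# Hypothetical per-feature probabilities: 400 features, each with likelihood 0.01.
probabilities = np.full(400, 0.01)

# Direct product: 0.01**400 = 1e-800, far below the smallest positive float64.
direct_product = np.prod(probabilities)

# Sum of logs: the same quantity on a log scale stays well within range.
log_sum = np.sum(np.log10(probabilities))

print(direct_product)  # 0.0
print(log_sum)         # approximately -800.0
```

The direct product collapses to zero, so two classes could no longer be compared, while the log-space sum remains an ordinary number.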

## 1. Importing Necessary Libraries

We'll start by importing `numpy` for numerical operations and `train_test_split` from `scikit-learn` to divide our data into training and testing sets.

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split

## 2. Defining the Classifier Functions

Here we define the core functions that will form our Naive Bayes classifier.

In [2]:
def calculate_log_likelihoods_with_naive_bayes(feature_vector, Class):
    """
    Calculates the base-10 log-likelihood of a feature vector for a given class.
    This is log10(P(Features | Class)), summed over features under the naive independence assumption.
    
    Args:
        feature_vector (list): A binarized vector where 1 means a feature is present, 0 otherwise.
        Class (int): The class to calculate the likelihood for (0 for 'ham', 1 for 'spam').
        
    Returns:
        float: The total log-likelihood.
    """
    # Ensure the feature vector has the correct number of features.
    assert len(feature_vector) == num_features
    
    # Initialize log-likelihood to 0.0. We will add to this value.
    log_likelihood = 0.0
    
    # If the class is 'ham' (0)
    if Class == 0:
        # Iterate through each feature and its status (present/absent)
        for feature_index in range(len(feature_vector)):
            # If the feature is present in the email (value is 1)
            if feature_vector[feature_index] == 1:
                # Add the log-likelihood of this feature being PRESENT given class 0
                log_likelihood += np.log10(likelihoods_class_0[feature_index])
            # If the feature is absent from the email (value is 0)
            elif feature_vector[feature_index] == 0:
                # Add the log-likelihood of this feature being ABSENT given class 0
                # This is log(1 - P(feature=1 | class=0))
                log_likelihood += np.log10(1.0 - likelihoods_class_0[feature_index])
    
    # If the class is 'spam' (1)
    elif Class == 1:
        # Iterate through each feature and its status (present/absent)
        for feature_index in range(len(feature_vector)):
            # If the feature is present in the email (value is 1)
            if feature_vector[feature_index] == 1:
                # Add the log-likelihood of this feature being PRESENT given class 1
                log_likelihood += np.log10(likelihoods_class_1[feature_index])
            # If the feature is absent from the email (value is 0)
            elif feature_vector[feature_index] == 0:
                # Add the log-likelihood of this feature being ABSENT given class 1
                # This is log(1 - P(feature=1 | class=1))
                log_likelihood += np.log10(1.0 - likelihoods_class_1[feature_index])
    else:
        raise ValueError("Class takes integer values 0 or 1")
    
    return log_likelihood


def calculate_class_posteriors(feature_vector):
    """
    Calculates the unnormalized log posterior for each class.
    log(P(Class | Features)) equals log(likelihood) + log(prior) up to an additive constant shared by both classes.
    
    Args:
        feature_vector (list): A binarized feature vector.
        
    Returns:
        tuple: A tuple containing the log posterior for class 0 and class 1.
    """
    # Calculate log likelihoods for both classes using the function defined above.
    log_likelihood_class_0 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=0)
    log_likelihood_class_1 = calculate_log_likelihoods_with_naive_bayes(feature_vector, Class=1)

    # Calculate log posteriors by adding the log priors (calculated during training).
    log_posterior_class_0 = log_likelihood_class_0 + log_prior_class_0
    log_posterior_class_1 = log_likelihood_class_1 + log_prior_class_1

    return log_posterior_class_0, log_posterior_class_1


def classify_spam(document_vector):
    """
    Classifies an email as spam or not based on its document vector.
    
    Args:
        document_vector (list): The original feature vector from the dataset (contains frequencies).
        
    Returns:
        int: The predicted class (0 for ham, 1 for spam).
    """
    # For Bernoulli Naive Bayes, we only care about the presence or absence of a feature.
    # We convert the frequency-based vector into a binary vector.
    # Any feature with a frequency > 0.0 is considered 'present' (1).
    feature_vector = [int(element > 0.0) for element in document_vector]
    
    # Calculate the log posteriors for this binarized feature vector.
    log_posterior_class_0, log_posterior_class_1 = calculate_class_posteriors(feature_vector)
    
    # The class with the higher log posterior probability is our prediction.
    if log_posterior_class_0 > log_posterior_class_1:
        return 0  # Predict 'ham'
    else:
        return 1  # Predict 'spam'
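
The per-feature loop in the likelihood function above can also be written as one vectorized expression. The following is a sketch, not the notebook's implementation. The function name `bernoulli_log_likelihood` and the example probabilities are hypothetical, and every probability is assumed to lie strictly between 0 and 1.

```python
import numpy as np

def bernoulli_log_likelihood(binary_vector, feature_probabilities):
    """Log10-likelihood of a 0/1 vector under independent Bernoulli features.

    Assumes every probability lies strictly between 0 and 1.
    """
    x = np.asarray(binary_vector, dtype=float)
    p = np.asarray(feature_probabilities, dtype=float)
    # Present features contribute log(p); absent features contribute log(1 - p).
    return float(np.sum(x * np.log10(p) + (1.0 - x) * np.log10(1.0 - p)))

# Hypothetical probabilities for three features, for illustration only.
example_probabilities = [0.1, 0.5, 0.9]
example_vector = [1, 0, 1]

vectorized = bernoulli_log_likelihood(example_vector, example_probabilities)
looped = np.log10(0.1) + np.log10(1.0 - 0.5) + np.log10(0.9)
print(np.isclose(vectorized, looped))  # True
```

Both forms compute the same sum. The vectorized version avoids the Python-level loop, which matters more as the number of features grows.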

## 3. Performance Evaluation Function

This simple function calculates the accuracy of our model by comparing its predictions to the true labels.

In [3]:
def evaluate_performance(predictions, ground_truth_labels):
    """
    Calculates the accuracy of the classifier.
    
    Args:
        predictions (list): A list of predicted labels (0 or 1).
        ground_truth_labels (list): The list of true labels.
        
    Returns:
        float: The accuracy as a value between 0 and 1.
    """
    # Counter for correct predictions.
    correct_count = 0.0
    
    # Iterate through all predictions.
    for item_index in range(len(predictions)):
        # If the prediction matches the true label, increment the counter.
        if predictions[item_index] == ground_truth_labels[item_index]:
            correct_count += 1.0
            
    # Accuracy is the ratio of correct predictions to the total number of predictions.
    accuracy = correct_count / len(predictions)
    return accuracy
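
For comparison, the same accuracy can be computed with `numpy` in one line. This is a sketch; `accuracy_vectorized` is a hypothetical helper name and the labels are made up for the example.

```python
import numpy as np

def accuracy_vectorized(predictions, ground_truth_labels):
    """Fraction of positions where the prediction equals the true label."""
    predictions = np.asarray(predictions)
    ground_truth_labels = np.asarray(ground_truth_labels)
    # Element-wise comparison gives booleans; their mean is the accuracy.
    return float(np.mean(predictions == ground_truth_labels))

# Three of the four made-up predictions match the labels.
example_accuracy = accuracy_vectorized([1, 0, 1, 1], [1, 0, 0, 1])
print(example_accuracy)  # 0.75
```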

## 4. Main Execution Block

This is where the main logic resides. We'll load the data, process it, train the model, make predictions, and evaluate the results. ⚙️

In [4]:
# Set a fixed file path. IMPORTANT: You must change this to the location of 'spambase.data' on your system.
file_path = "spambase.data"

# --- Data Loading ---
try:
    with open(file_path, 'r') as datafile:
        data = []
        # Read the file line by line.
        for line in datafile:
            # Each line is a comma-separated list of values. We split it and convert to floats.
            line = [float(element) for element in line.strip().split(',')]
            # Append the processed line as a numpy array to our data list.
            data.append(np.asarray(line))
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
    print("Please download 'spambase.data' and update the 'file_path' variable.")
    data = []

if data:
    # --- Data Preparation ---
    # The first 48 columns are word-frequency features; the remaining character-frequency
    # and capital-run-length columns are not used here.
    num_features = 48

    # Extract the feature vectors (first 48 columns) for all emails.
    X = [data[i][:num_features] for i in range(len(data))]

    # Extract the target labels (the last column), converting them to integers (0 or 1).
    y = [int(data[i][-1]) for i in range(len(data))]

    # --- Train-Test Split ---
    # Split the dataset into a training set (25%) and a testing set (75%).
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=42)

    # --- Model Training ---
    # The "training" in Naive Bayes involves calculating prior probabilities and likelihoods from the training data.

    # 1. Calculate Class Priors: P(Class)
    # Separate the training data by class (ham vs. spam).
    X_train_class_0 = [X_train[i] for i in range(len(X_train)) if y_train[i] == 0] # Ham emails
    X_train_class_1 = [X_train[i] for i in range(len(X_train)) if y_train[i] == 1] # Spam emails

    # Count the number of emails in each class.
    num_class_0 = float(len(X_train_class_0))
    num_class_1 = float(len(X_train_class_1))

    # Prior probability is the ratio of emails in a class to the total number of emails.
    prior_probability_class_0 = num_class_0 / (num_class_0 + num_class_1)
    prior_probability_class_1 = num_class_1 / (num_class_0 + num_class_1)

    # Calculate the log of the priors.
    log_prior_class_0 = np.log10(prior_probability_class_0)
    log_prior_class_1 = np.log10(prior_probability_class_1)

    # 2. Calculate Likelihoods: P(Feature | Class)
    # For Bernoulli Naive Bayes, we binarize the feature vectors first.
    X_train_class_0_binary = np.array([np.array(x) > 0.0 for x in X_train_class_0])
    X_train_class_1_binary = np.array([np.array(x) > 0.0 for x in X_train_class_1])

    # The likelihood of a feature being present for a class is the mean of that feature's column in the binarized data.
    # We add a small offset (1e-6) to avoid log(0) if a feature never appears in a class. This is a simple epsilon,
    # not Laplace (add-one) smoothing. The cap keeps log(1 - p) defined if a feature appears in every email of a class.
    likelihoods_class_0 = np.minimum(np.mean(X_train_class_0_binary, axis=0) + 1e-6, 1.0 - 1e-6)
    likelihoods_class_1 = np.minimum(np.mean(X_train_class_1_binary, axis=0) + 1e-6, 1.0 - 1e-6)

    # --- Prediction ---
    # Make predictions on the unseen test set.
    predictions = []
    for email in X_test:
        predictions.append(classify_spam(email))

    # --- Evaluation ---
    # Calculate and print the accuracy of the model.
    accuracy_of_naive_bayes = evaluate_performance(predictions, y_test)
    print(f"The accuracy of the Naive Bayes classifier is: {accuracy_of_naive_bayes:.4f}")

The accuracy of the Naive Bayes classifier is: 0.8705
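
The training step above adds 1e-6 to each estimated probability. For comparison, the usual form of Laplace (add-one) smoothing for Bernoulli features adds pseudo-counts to both the "present" and "absent" outcomes. The sketch below shows that estimate on a toy matrix. The function name, the `alpha` parameter, and the data are assumptions made for illustration, and switching the notebook to this estimate may change the reported accuracy.

```python
import numpy as np

def smoothed_feature_probabilities(binary_matrix, alpha=1.0):
    """Estimate P(feature = 1 | class) with additive (Laplace) smoothing.

    binary_matrix: rows are documents of one class, columns are 0/1 features.
    alpha: pseudo-count added to each outcome (hypothetical parameter name).
    """
    binary_matrix = np.asarray(binary_matrix, dtype=float)
    num_documents = binary_matrix.shape[0]
    feature_counts = binary_matrix.sum(axis=0)
    # Each feature has two outcomes (present/absent), hence 2 * alpha below.
    return (feature_counts + alpha) / (num_documents + 2.0 * alpha)

# Toy class with three documents: feature 0 always present, feature 1 never present.
toy = [[1, 0, 1],
       [1, 0, 0],
       [1, 0, 1]]
toy_probabilities = smoothed_feature_probabilities(toy)
print(toy_probabilities)  # [0.8 0.2 0.6]
```

With this estimate, no probability is exactly 0 or 1, so both `log(p)` and `log(1 - p)` stay finite.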
