# Spam Email Classification Using Gaussian Naive Bayes

Goal: Build a spam email classifier by converting emails into word-count features, training a Gaussian Naive Bayes model on labeled training data, and evaluating its accuracy on test emails to determine how well it distinguishes spam from non-spam messages.

In [7]:
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import pandas as pd


Import the libraries used for this model.


In [8]:
!unzip -q Data.zip



Unzip the data archive first, and discard any extra folder so that it doesn't get read, before loading the emails in Colab or a local notebook.
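Where the `!unzip` shell command is not available, the same step can be sketched with Python's standard `zipfile` module. The helper name `extract_archive` and the `skip_prefixes` parameter are hypothetical, and the `__MACOSX/` default is only an assumption about which extra folder needs to be left out; adjust it to whatever the archive actually contains.

```python
import zipfile

def extract_archive(zip_path, dest_dir=".", skip_prefixes=("__MACOSX/",)):
    """Extract zip_path into dest_dir, leaving out members under skip_prefixes.

    skip_prefixes is an assumed default, not something the dataset documents.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Keep only the members that do not start with an unwanted prefix.
        members = [name for name in archive.namelist()
                   if not name.startswith(skip_prefixes)]
        archive.extractall(dest_dir, members=members)
    return members

# Example call, assuming Data.zip sits next to the notebook:
# extract_archive("Data.zip")
```

Skipping unwanted members at extraction time means there is no folder to delete afterwards.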

In [9]:
TRAIN_DIR = "./train-mails"
TEST_DIR  = "./test-mails"

Set the paths to the training and test email folders.

# Dictionary Creation

Purpose: Scan training emails, clean tokens, and build a dictionary of the 3000 most frequent words that the model will use as features.

In [10]:
def make_Dictionary(root_dir):
  # Collect every whitespace-separated token from all training emails.
  all_words = []
  emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
  for mail in emails:
    with open(mail) as m:
      for line in m:
        all_words += line.split()
  dictionary = Counter(all_words)

  # Drop tokens that are not purely alphabetic or are a single character.
  for item in list(dictionary):
    if not item.isalpha() or len(item) == 1:
      del dictionary[item]

  # Keep the 3000 most frequent words as (word, count) pairs.
  return dictionary.most_common(3000)


This function scans all training emails, counts how often each word appears, removes junk tokens such as symbols, numbers, and single letters, and keeps the 3000 most common words. The result is a fixed dictionary that defines which words the spam detection model pays attention to.
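The filtering and ranking steps can be seen on a handful of tokens. This is a minimal sketch: `build_vocabulary` and the toy token list are hypothetical stand-ins, not part of the notebook, and the vocabulary size is shrunk from 3000 to 2 so the result is easy to read.

```python
from collections import Counter

def build_vocabulary(tokens, size):
    """Return the `size` most common alphabetic tokens longer than one character."""
    counts = Counter(tokens)
    # Same rule as the dictionary step: alphabetic only, more than one letter.
    kept = Counter({word: n for word, n in counts.items()
                    if word.isalpha() and len(word) > 1})
    return kept.most_common(size)

# Hypothetical tokens standing in for the words of a few emails.
toy_tokens = "free offer free money a 100 free offer now !".split()
toy_vocab = build_vocabulary(toy_tokens, 2)
print(toy_vocab)  # [('free', 3), ('offer', 2)]
```

The tokens `a`, `100`, and `!` are dropped before ranking, so they can never take up one of the vocabulary slots.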

# Feature Extraction & Labeling

Purpose: Convert emails into numerical feature vectors (word counts) and assign spam or non-spam labels based on file names.

In [11]:
def extract_features(mail_dir):
  files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
  features_matrix = np.zeros((len(files), 3000))
  train_labels = np.zeros(len(files))
  # Map each dictionary word to its column; relies on the global `dictionary`.
  word_index = {d[0]: i for i, d in enumerate(dictionary)}
  for docID, fil in enumerate(files):
    with open(fil) as fi:
      for line_no, line in enumerate(fi):
        # Only the third line of each file (index 2) is read for features.
        if line_no == 2:
          words = line.split()
          for word in words:
            if word in word_index:
              features_matrix[docID, word_index[word]] = words.count(word)
    # Files whose names start with "spmsg" are spam (1); all others stay 0.
    if os.path.basename(fil).startswith("spmsg"):
      train_labels[docID] = 1
  return features_matrix, train_labels

This function converts each email into a numeric feature vector by counting how often each dictionary word appears, and assigns a label based on whether the file name indicates spam. It produces a feature matrix and a label array used to train and test the spam classifier.
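For a single email, the counting and labeling logic reduces to the sketch below. The helpers `count_vector` and `label_from_name`, the three-word vocabulary, and the file names are all hypothetical; only the `spmsg` prefix rule comes from the notebook.

```python
def count_vector(words, vocabulary):
    """Map one tokenized email to counts aligned with the vocabulary order."""
    index = {word: i for i, (word, _) in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in words:
        if word in index:  # words outside the vocabulary are ignored
            vector[index[word]] += 1
    return vector

def label_from_name(filename):
    """Return 1 (spam) when the file name starts with 'spmsg', else 0."""
    return 1 if filename.startswith("spmsg") else 0

# Hypothetical vocabulary in the (word, count) form the dictionary step returns.
toy_vocab = [("free", 3), ("offer", 2), ("meeting", 1)]
toy_vector = count_vector("free free offer today".split(), toy_vocab)
print(toy_vector)  # [2, 1, 0]
print(label_from_name("spmsg_example.txt"), label_from_name("other_example.txt"))  # 1 0
```

Building the word-to-column lookup once avoids scanning the whole vocabulary for every word in the email.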

# Model Training

Purpose: Build the dictionary and extract the training and test features that the Gaussian Naive Bayes classifier is trained on.

In [12]:
dictionary = make_Dictionary(TRAIN_DIR)

print ("reading and processing emails from TRAIN and TEST folders")
features_matrix, labels = extract_features(TRAIN_DIR)
test_features_matrix, test_labels = extract_features(TEST_DIR)

reading and processing emails from TRAIN and TEST folders


This cell builds a dictionary of the 3000 most frequent words from the training emails, then converts both the training and test emails into numerical feature matrices with corresponding spam or non-spam labels, so the Naive Bayes model can use them for learning and evaluation.

# Prediction & Evaluation

Purpose: Predict spam labels for test emails and evaluate model performance using accuracy.

In [13]:


print("Training Model using Gaussian Naive Bayes algorithm")

model = GaussianNB()
model.fit(features_matrix, labels)

print("Training completed")
print("testing trained model to predict Test Data labels")

predicted_labels = model.predict(test_features_matrix)

accuracy = accuracy_score(test_labels, predicted_labels)

print(
    "Completed classification of the Test Data now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:",
    accuracy
)


Training Model using Gaussian Naive Bayes algorithm
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data now printing Accuracy Score by comparing the Predicted Labels with the Test Labels: 0.9615384615384616


This cell trains a Gaussian Naive Bayes model on the labeled email features, uses it to predict spam on the test emails, and measures performance by computing classification accuracy, which comes to about 0.96 on the test set.
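Accuracy is the share of test emails whose predicted label matches the true label. The sketch below computes it by hand alongside the four confusion counts, which show whether errors are missed spam or wrongly flagged mail. The helper `confusion_counts` and the toy labels are hypothetical and are not the notebook's results.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count true/false positives and negatives, treating 1 as spam."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts

# Hypothetical labels: 1 = spam, 0 = non-spam.
toy_true = [1, 1, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 1]
toy_counts = confusion_counts(toy_true, toy_pred)
# Accuracy = correct predictions / all predictions.
toy_accuracy = (toy_counts["tp"] + toy_counts["tn"]) / len(toy_true)
print(toy_counts)    # {'tp': 1, 'fp': 1, 'tn': 2, 'fn': 1}
print(toy_accuracy)  # 0.6
```

In this toy case the model misses one spam email and flags one legitimate email, which a single accuracy number does not distinguish.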