# __CA02: Spam Email Detection using Naive Bayes Classification Algorithm__
2/4/2025

_In this notebook, we train a model on a set of emails that are each labeled spam or not spam. The training set has 702 emails, equally divided between spam and not spam. We then test the model on 260 emails and compare its predictions with the known classifications to measure accuracy._

In [2]:
import os
print(os.getcwd()) 

C:\Users\ashle\BSAN 6070 Intro to ML


In [4]:
#For my own work, I check where my working directory is and point it at the folder that holds the emails
os.chdir("C:/Users/ashle/OneDrive/Desktop/BSAN6070/CA02") 
print(os.getcwd()) 

C:\Users\ashle\OneDrive\Desktop\BSAN6070\CA02


In [6]:
#I just want to confirm that my working directory is correct
print("Exists:", os.path.exists("./train-mails"))
print("Is Directory:", os.path.isdir("./train-mails"))

Exists: True
Is Directory: True


In [8]:
#Importing necessary libraries

import numpy as np
from collections import Counter
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import nltk
from nltk.corpus import stopwords

#I imported these three classifiers in order to proceed with the Naive Bayes method
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

In [9]:
#Stop words: download the NLTK English stop-word list. Separately, for this exercise we only consider the 3000 most frequent words in the emails.

nltk.download('stopwords')  # Download stopwords
stop_words = set(stopwords.words('english'))  # Here we're choosing the english stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ashle\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
#The following code reads all emails in a directory and extracts and counts word frequencies. In addition, it cleans the data by removing 
    #non-alphabetic words and single-letter words and returns the 3,000 most frequent words.

def make_Dictionary(root_dir):
    all_words = []  # Create an empty list
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]  # Get file names and paths from the emails
    
    for mail in emails:  # This for loop is going to read each email and split them into individual words and append them to the initialized list.
        with open(mail) as m:
            for line in m:
                words = re.sub(r'[^\w\s]', '', line.lower()).split()  # Ensure text is lowercase and punctuation is removed.
                all_words += words

    dictionary = Counter(all_words)  # Create a Counter dictionary where keys are words, and values are their frequencies
    
    # Filter out non-alphabetic words and single-letter words using a dictionary comprehension
    filtered_dict = {word: count for word, count in dictionary.items() if word.isalpha() and len(word) > 1}
    
    # Get the 3000 most common words
    dictionary = Counter(filtered_dict).most_common(3000) # Get back the 3000 most frequently occurring words
    return dictionary
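
The `stop_words` set built earlier is never passed to `make_Dictionary`, which is why a common stop word such as "our" can still show up among the most frequent words. A minimal sketch of how the filter step could also drop stop words is shown here. The helper `filter_counts` and the tiny `demo_stop_words` set are hypothetical stand-ins (the real set would come from NLTK), so treat this as an illustration rather than the notebook's method.

```python
from collections import Counter

# Hypothetical stand-in for the NLTK English stop-word set, kept tiny so the
# sketch runs without downloading anything.
demo_stop_words = {"our", "the", "to"}

def filter_counts(counts, stop_words):
    """Keep alphabetic, multi-letter words that are not stop words."""
    return Counter({
        word: n for word, n in counts.items()
        if word.isalpha() and len(word) > 1 and word not in stop_words
    })

# Toy word counts standing in for the Counter built from the emails.
counts = Counter("send our free report to the list one a 2025 free".split())
filtered = filter_counts(counts, demo_stop_words)

# The most frequent remaining words, as make_Dictionary does with 3000.
top_words = filtered.most_common(3)
print(top_words)
```

In the notebook, the same extra condition (`word not in stop_words`) could be added to the existing dictionary comprehension.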


In [14]:
def extract_features(mail_dir, dictionary):  # with this function, we are converting emails into a numerical representation based on word frequencies
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))  # Initialize the feature matrix: each row is an email, each column is a dictionary word
    train_labels = np.zeros(len(files))  # Labels for training (0 for not spam, 1 for spam)
    docID = 0
    
    word_list = [word[0] for word in dictionary]  # Creating the word list from the dictionary (words only)
    word_to_index = {word: idx for idx, word in enumerate(word_list)}  # Map words to their index
    
    for fil in files:  # Looping through each email
        with open(fil) as fi:  # Open each email
            for i, line in enumerate(fi):
                if i == 2:  # Use only the third line (index 2); the first two lines are assumed to be headers, leaving this line as the message text
                    words = line.split()
                    for word in words:
                        # Check if the word exists in the dictionary and update the features matrix
                        if word in word_to_index:
                            wordID = word_to_index[word]
                            features_matrix[docID, wordID] += 1  # Increment word count for this document
    
        # Assign label based on the filename (if it starts with 'spmsg' it's spam)
        train_labels[docID] = 1 if fil.lower().startswith("spmsg") else 0
        docID += 1
    
    return features_matrix, train_labels


In [16]:
#As per the "Important Note", make sure the paths of your data folders 'train-mails' and 'test-mails' are './train-mails' and './test-mails'.
    #This means your .ipynb file and these folders must be in the SAME FOLDER on your laptop or Google Drive.

TRAIN_DIR = './train-mails'
TEST_DIR = './test-mails'


train_dict = make_Dictionary(TRAIN_DIR) #Dictionary of the 3000 most frequent words in the training emails
test_dict = make_Dictionary(TEST_DIR) #Built only to inspect the test vocabulary; not used for feature extraction


features_matrix, labels = extract_features(TRAIN_DIR, train_dict) #feature extraction from training dataset
test_features_matrix, test_labels = extract_features(TEST_DIR, train_dict) #test features use the training dictionary so the columns match
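
A note on why the training and test features need the same dictionary: each column index stands for one word, so a model trained on one word-to-column mapping misreads vectors built with another. The sketch below uses plain lists and two hypothetical three-word vocabularies to show the same email producing different vectors. The names `count_vector`, `train_vocab`, and `test_vocab` are illustrative only.

```python
def count_vector(text, word_to_index):
    """Count occurrences of each vocabulary word in a whitespace-split text."""
    vec = [0] * len(word_to_index)
    for word in text.split():
        if word in word_to_index:
            vec[word_to_index[word]] += 1
    return vec

# Hypothetical vocabularies: same idea as word_to_index, but tiny.
train_vocab = {"free": 0, "money": 1, "conference": 2}
test_vocab = {"conference": 0, "paper": 1, "free": 2}

email = "free free conference"
shared = count_vector(email, train_vocab)      # columns mean what the model learned
mismatched = count_vector(email, test_vocab)   # column 0 now means "conference"

print("train vocabulary:", shared)
print("test vocabulary: ", mismatched)
```

With the mismatched mapping, the count for "free" lands in the column the model associates with "conference", so the prediction is based on the wrong words.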


In [17]:
print(train_dict[:20])

[('email', 1664), ('order', 1414), ('address', 1299), ('report', 1217), ('mail', 1133), ('language', 1099), ('send', 1080), ('program', 1009), ('our', 991), ('list', 946), ('one', 921), ('subject', 913), ('name', 883), ('receive', 826), ('free', 801), ('money', 797), ('nt', 759), ('work', 756), ('information', 684), ('business', 669)]


In [20]:
print(test_dict[:20])

[('university', 582), ('language', 497), ('email', 452), ('http', 397), ('information', 361), ('subject', 350), ('our', 342), ('address', 336), ('com', 331), ('de', 325), ('conference', 314), ('one', 269), ('order', 248), ('please', 248), ('paper', 237), ('program', 232), ('www', 226), ('include', 224), ('web', 223), ('workshop', 223)]


In [22]:
top_20_trainwords = train_dict[:20]  # Get the top 20 most common words from the dictionary
for word, count in top_20_trainwords:
    print(f"{word}: {count}")

email: 1664
order: 1414
address: 1299
report: 1217
mail: 1133
language: 1099
send: 1080
program: 1009
our: 991
list: 946
one: 921
subject: 913
name: 883
receive: 826
free: 801
money: 797
nt: 759
work: 756
information: 684
business: 669


In [24]:
top_20_testwords = test_dict[:20]  # Get the top 20 most common words from the dictionary
for word, count in top_20_testwords:
    print(f"{word}: {count}")

university: 582
language: 497
email: 452
http: 397
information: 361
subject: 350
our: 342
address: 336
com: 331
de: 325
conference: 314
one: 269
order: 248
please: 248
paper: 237
program: 232
www: 226
include: 224
web: 223
workshop: 223


In [26]:
#To make this code even better, I would further investigate some of these words and why they are appearing.
#For example, "www", "de", "nt" and "com" do not seem to be English words. I want to know why these appear and why they are frequent.
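
One way to start that investigation is to run the cleaning regex from `make_Dictionary` on a few sample strings. The samples below are hypothetical and assume the corpus text is already split into space-separated tokens (for example "n't" and URL pieces standing alone); that is a guess about the data, not something the notebook confirms. "de" may simply be a common word in non-English emails, which is also a guess.

```python
import re

def tokenize(line):
    """Lowercase, drop punctuation, split on whitespace (same regex as make_Dictionary)."""
    return re.sub(r'[^\w\s]', '', line.lower()).split()

samples = [
    "do n't reply",                      # hypothetical pre-tokenized contraction
    "http : / / www . example . com",    # hypothetical pre-tokenized URL
    "www.example.com",                   # an untokenized URL, for contrast
]

tokens = [tokenize(s) for s in samples]
for s, t in zip(samples, tokens):
    print(repr(s), "->", t)
```

If the real emails look like the first two samples, "nt", "www", and "com" would survive the `isalpha()` and length filters as ordinary words, which would explain their high counts.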

## __Training, Predicting, and Evaluating Model__
_Now that we have processed our emails, we can proceed with model training, predicting, and evaluating._

### Training our model using Naive Bayes algorithm

In [28]:
#model = GaussianNB() #got an accuracy of 1?
model = MultinomialNB() #again, got an accuracy of 1?
#model = BernoulliNB()

model.fit(features_matrix, labels)
print("Training completed")
print("Testing trained model to predict Test Data labels...")

Training completed
Testing trained model to predict Test Data labels...


### Predicting labels for the test data based on the training data

In [30]:
predicted_labels = model.predict(test_features_matrix)
print("Completed classification of the Test Data ....")

Completed classification of the Test Data ....


### Evaluating performance (Accuracy)

In [32]:
print("Now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:") 
accuracy = accuracy_score(test_labels, predicted_labels)
print(accuracy)

Now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
1.0


In [34]:
#I keep getting an accuracy of 1.0 regardless of the model, even when looking for overfitting issues...
    #That seems suspicious, so I wanted to take a look at other metrics.

print(classification_report(test_labels, predicted_labels)) 

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       260

    accuracy                           1.00       260
   macro avg       1.00      1.00      1.00       260
weighted avg       1.00      1.00      1.00       260
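
The report lists only class 0.0, with support 260, so the test labels contain a single class. A quick check of the label distribution makes this visible before trusting any accuracy number. The sketch below is a minimal illustration: `label_summary` is a hypothetical helper, and `demo_labels` stands in for `test_labels`.

```python
from collections import Counter

def label_summary(y_true):
    """Return label counts, the majority label, and the accuracy of always guessing it."""
    counts = Counter(float(y) for y in y_true)
    majority_label, majority_count = counts.most_common(1)[0]
    baseline_accuracy = majority_count / len(y_true)
    return dict(counts), majority_label, baseline_accuracy

# Stand-in for test_labels: 260 emails, all labeled 0, as in the report.
demo_labels = [0.0] * 260
counts, majority, baseline = label_summary(demo_labels)
print(counts, majority, baseline)

# For contrast, a balanced set of 130 non-spam and 130 spam labels.
balanced_counts, _, balanced_baseline = label_summary([0.0] * 130 + [1.0] * 130)
print(balanced_counts, balanced_baseline)
```

When the baseline of always guessing the majority label already reaches 1.0, a model accuracy of 1.0 says nothing about spam detection.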



In [370]:
for i in range(10):
    print("True:", test_labels[i], "Predicted:", predicted_labels[i])

True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0
True: 0.0 Predicted: 0.0


### Diagnosing the 1.0 accuracy

The issue I ended up finding was that every message was being labeled "not spam" (0), because none of them met the requirement `if lastToken.startswith("spmsg")`. Looking deeper into what makes a message spam or not, there are many other details in the messages that may contribute to this.

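However, because the labels come only from the file name and every file ended up labeled 0, it makes sense that the predictions match the labels: the model saw a single class during training and predicts it for every test email, which gives an accuracy of 1.0 without detecting any spam. It would be a better model if we were looking into the messages too and trying to decipher whether the content shows it is spam or not.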
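
One likely cause, visible in `extract_features`: each entry of `files` is built with `os.path.join(mail_dir, fi)`, so it begins with the folder name and never with "spmsg". A minimal sketch of a label check on the file name alone is shown below. It assumes spam file names start with "spmsg", as the labeling comment says; the helper `label_from_path` and both example file names are hypothetical.

```python
import os

def label_from_path(path):
    """Return 1 if the file name (not the full path) starts with 'spmsg', else 0."""
    return 1 if os.path.basename(path).lower().startswith("spmsg") else 0

paths = [
    os.path.join("./train-mails", "spmsga1.txt"),   # hypothetical spam file name
    os.path.join("./train-mails", "3-1msg1.txt"),   # hypothetical non-spam file name
]

# Check applied to the full path, as in the notebook: the path starts with "./".
naive = [1 if p.lower().startswith("spmsg") else 0 for p in paths]

# Check applied to the file name only.
fixed = [label_from_path(p) for p in paths]

print("full path:", naive)
print("file name:", fixed)
```

Applying the check to `os.path.basename(fil)` in the label assignment should give both classes in the labels, after which the accuracy and classification report become meaningful.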

Desired output: reading and processing emails from TRAIN and TEST folders
Training Model using Gaussian Naive Bayes algorithm .....
Training completed
testing trained model to predict Test Data labels
Completed classification of the Test Data .... now printing Accuracy Score by comparing the Predicted Labels with the Test Labels:
0.9653846153846154