## Spam Classifier
__Machine Learning & Data Science Assignment by Thöni Andreas__  
2024

---
Click [here](https://github.com/ajayhanssen/spam_filter_ml), to get to the Github repository

---

#### Preprocessing Infrastructure
This section introduces the __EmailPreprocessor__ class, which is able to convert a raw email file into a stemmed and normalized string. It inherits from sklearn's BaseEstimator and TransformerMixin classes, so that it is possible to use the class with sklearn's __make_pipeline()__ later.

In [23]:
from email import policy
from email.parser import BytesParser
from nltk.stem.snowball import SnowballStemmer
import re
from string import punctuation
from sklearn.base import BaseEstimator, TransformerMixin

class EmailPreprocessor(BaseEstimator, TransformerMixin):
    """
    class for processing emails, extracting email parts and applying regex and stemming
    """

    def __init__(self) -> None:
        pass

    def extract_email_from_file(self, file_path : str) -> str:
        """
        Extracts the subject, sender, recipient and body of an email file
        arguments: file_path - path to the email file
        return: string containing the email
        """

        # Open the email file
        with open(file_path, 'rb') as file:
            # Parse the email using the default policy
            email_message = BytesParser(policy=policy.default).parse(file)
        
        # Extract headers, like subject, sender and recipient
        # the .get() method is used to avoid errors if the header is not present (returns failobj instead)
        subject = email_message.get('Subject', '(No Subject)')
        sender = email_message.get('From', '(Unknown Sender)')
        recipient = email_message.get('To', '(Unknown Recipient)')
        
        # Extract body
        # initilaize empty string for storing the body
        body = ""
        if email_message.get_body(preferencelist=('plain', 'html')):
            body_content = email_message.get_body(preferencelist=('plain', 'html'))
            body = body_content.get_content()  # Automatically decodes and returns the content
        
        # return email as string
        return f"{subject} {sender} {recipient} {body.strip()}"

    def stem_and_regex(self, email_string : str) -> str:
        """
        Applies stemming and regex operations to the email
        arguments: email - dictionary with the email parts or string with the email
        return: string with the processed email
        """

        # convert to lowercase
        email_string = email_string.lower()

        ## perform regex operations
        # change email adresses to 'emailaddr'
        email_string = re.sub(r'\b[\w\.-]+@[\w\.-]+\.\w+\b', 'emailaddr', email_string)

        # change urls to 'httpaddr'
        email_string = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email_string)

        # change time to 'time'
        email_string = re.sub(r'\b\d{1,2}:\d{1,2}(:\d{1,2})?\b', 'time', email_string)

        # change date to 'date'
        email_string = re.sub(r'\b\d{1,2}/\d{1,2}/\d{4}\b', 'date', email_string)

        # change dollar to 'dollar'
        email_string = re.sub(r'\$\S+', 'dollar', email_string)

        # change www-URLs to 'wwwaddr'
        email_string = re.sub(r'\bwww\.[^\s]*\b', 'wwwaddr', email_string)

        # change percentages to 'percent'
        email_string = re.sub(r'\b\d+%', 'percent', email_string)

        # change ip to 'ipaddr'
        email_string = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', 'ipaddr', email_string)

        # change numbers to 'number'
        email_string = re.sub(r'\b\d+\b', 'number', email_string)

        # Remove itemization prefixes (bullets, numbered lists, etc.)
        email_string = re.sub(r'^\s*[\d\w][\.\)\-]\s*|[\u2022\u2219\u25CB·]\s*', '', email_string, flags=re.MULTILINE)

        ## other processing tasks
        # remove all kinds of punctuation
        email_string = email_string.translate(str.maketrans('', '', punctuation))

        # remove newlines and tabs
        email_string = email_string.replace('\n', ' ')
        email_string = email_string.replace('\t', ' ')

        # remove multiple spaces
        email_string = re.sub(r'\s+', ' ', email_string)

        ## stemming
        # stem the words using a Snowball stemmer (newer version of Porter stemmer apparently)
        stemmer = SnowballStemmer('english')

        # iterate through the words in the email-string and apply stemming, then join them back together
        email_string = ' '.join([stemmer.stem(word) for word in email_string.split()])

        # return single string containing the processed email
        return email_string

    def process_email(self, email_file_path : str) -> str:
        """
        Process an email file, calls the extract_email_from_file and stem_and_regex functions
        arguments: email_file_path - path to the email file
        return: string with the processed email
        """

        # call the extract_email_from_file function to get the email parts from the file
        email_parts = self.extract_email_from_file(email_file_path)

        # call the stem_and_regex function to process the email parts
        processed = self.stem_and_regex(email_parts)

        # return the processed email
        return processed
    
    ## methods for the transformer (needed to work with sklearn pipelines)
    def fit(self, X, y=None) -> None:
        """
        Fit method for the transformer, does nothing in this case
        """

        return
    
    def transform(self, X : list[str]) -> list[str]:
        """
        Transform method for the transformer, applies the stem_and_regex function to the emails
        arguments: X - list of emails
        return: list of processed emails
        """

        return [self.stem_and_regex(email) for email in X]


---
#### Load and preprocess ham and spam emails
This section introduces a function which can be used to extract email-files from a given directory, and preprocess them.

In [24]:
# if this is set to false, only ham and spam 1 will be used
# if this is set to true, ham and spam 2 will also be used
full_dataset = True

# If this is set to false, only the easy_ham folders will be used
# If this is set to true, the hard_ham directories are also used
include_hard_ham = True

In [25]:
import os


def load_data_from_directories(directory : str, label : str) -> tuple[list[str], list[int]]:
    """
    Load email data from given directories (ham and spam)
    arguments: directory - path to the file directory
               label - 'ham' or 'spam' label
    return: emails - list of email strings
            labels - list of labels (0 for ham, 1 for spam)
    """

    # Create an instance of the EmailPreprocessor class defined above
    preprocessor = EmailPreprocessor()

    # Initialize lists to store the emails and labels
    emails = []
    labels = []

    ## Load ham emails
    # iterate through the files in the ham directory
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)

        # as the open() function sometimes throwed an error caused by encoding issues,
        # these parts are wrapped in a try-except block (happens a few times in the code)
        try:
            # process the email file using the process_email method of the EmailPreprocessor class
            email = preprocessor.process_email(file_path)
            # append the processed email to the emails list
            emails.append(email)

            if label == 'ham':
                # append the label (0 for ham) to the labels list
                labels.append(0)
            elif label == 'spam':
                # append the label (1 for spam) to the labels list
                labels.append(1)
            else:
                # if the label is not 'ham' or 'spam', raise a ValueError
                raise ValueError("Label must be 'ham' or 'spam'")

        except:
            # if an error ocurred while reading from the file, print an error message and proceed to the next file
            print(f"Error processing file {file_path}")

    # return the emails and labels lists
    return emails, labels


## use the function to load the data
# define the paths to the ham and spam directories
easy_ham_1_dir = "./datasets/20030228_easy_ham/easy_ham"
hard_ham_1_dir = "./datasets/20030228_hard_ham/hard_ham"
spam_1_dir = "./datasets/20030228_spam/spam"

easy_ham_2_dir = "./datasets/20030228_easy_ham_2/easy_ham_2"
spam_2_dir = "./datasets/20050311_spam_2/spam_2"

# call the function to load and preprocess the data
emails, labels = load_data_from_directories(easy_ham_1_dir, 'ham')
emails_spam, labels_spam = load_data_from_directories(spam_1_dir, 'spam')

# concatenate the spam emails and labels as well
emails += emails_spam
labels += labels_spam

# if the full_dataset flag is set to True, load the hard_ham directory too
if include_hard_ham:
    emails_hard_ham, labels_hard_ham = load_data_from_directories(hard_ham_1_dir, 'ham')

    # concatenate the hard_ham emails and labels to the easy_ham ones
    emails += emails_hard_ham
    labels += labels_hard_ham

if full_dataset:
    # load the emails from the second set of ham and spam directories
    emails_easy_ham_2, labels_easy_ham_2 = load_data_from_directories(easy_ham_2_dir, 'ham')
    emails_spam_2, labels_spam_2 = load_data_from_directories(spam_2_dir, 'spam')

    # concatenate the easy_ham_2 emails and labels to the easy_ham ones
    emails += emails_easy_ham_2
    labels += labels_easy_ham_2

# print the number of emails loaded
print(f"Number of emails loaded: {len(emails)}")

Error processing file ./datasets/20030228_spam/spam\00217.43b4ef3d9c56cf42be9c37b546a19e78
Error processing file ./datasets/20030228_spam/spam\00319.a99dff9c010e00ec182ed5701556d330
Error processing file ./datasets/20030228_spam/spam\00388.53eae0055e66fcb7194f9cca080fdefe
Error processing file ./datasets/20050311_spam_2/spam_2\00002.9438920e9a55591b18e60d1ed37d992b
Error processing file ./datasets/20050311_spam_2/spam_2\00003.590eff932f8704d8b0fcbe69d023b54d
Error processing file ./datasets/20050311_spam_2/spam_2\00004.bdcc075fa4beb5157b5dd6cd41d8887b
Error processing file ./datasets/20050311_spam_2/spam_2\00005.ed0aba4d386c5e62bc737cf3f0ed9589
Error processing file ./datasets/20050311_spam_2/spam_2\00006.3ca1f399ccda5d897fecb8c57669a283
Error processing file ./datasets/20050311_spam_2/spam_2\00106.09988f439b8547dc90efb1530c02329b
Error processing file ./datasets/20050311_spam_2/spam_2\00108.813fc6306b631b5c58ecfc26caf3a8dc
Error processing file ./datasets/20050311_spam_2/spam_2\00293.

In [26]:
# print the first email, just to check if everything is working
print(emails[0])

re new sequenc window robert elz emailaddr chris garrigu emailaddr date wed number aug number time number from chris garrigu emailaddr messageid emailaddr i cant reproduc this error for me it is veri repeat like everi time without fail this is the debug log of the pick happen time pickit exec pick inbox list lbrace lbrace subject ftp rbrace rbrace numbernumb sequenc mercuri time exec pick inbox list lbrace lbrace subject ftp rbrace rbrace numbernumb sequenc mercuri time ftocpickmsg number hit time mark number hit time tkerror syntax error in express int note if i run the pick command by hand delta pick inbox list lbrace lbrace subject ftp rbrace rbrace numbernumb sequenc mercuri number hit that where the number hit come from obvious the version of nmh im use is delta pick version pick nmhnumbernumbernumb compil on fuchsiacsmuozau at sun mar number time ict number and the relev part of my mhprofil delta mhparam pick seq sel list sinc the pick command work the sequenc actual both of them

---
#### Vectorize the data using a TF-IDF Vectorizer
A TF-IDF Vectorizer is able to convert a collection of text objects into numerical vectors, which made it excellent for this use-case. TF-IDF stands for __Term Frequency-Inverse Document Frequency__ and evaluates how important a certain word is to a single instance in a collection of objects.  
If a words appears often in one instance, it is considered important at first, but if it also appears in many other instances, its importance is decreased.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create a TfidfVectorizer object with the desired parameters
# stop_words='english' uses the built-in list of English stop words and removes them
# max_features=2500 limits the number of features to the 2500 most important words
vectorizer = TfidfVectorizer(stop_words="english", max_features=2000)

# fit the vectorizer to the emails and transform them into a matrix
X = vectorizer.fit_transform(emails).toarray()
y = labels

# X is now a matrix, with rows representing the emails and columns the features
print(X.shape)

(4651, 2000)


---
#### Splitting into training and test set
In this section, the classic split in to the training and test set is conducted. The distribution used is 80% to 20%.

In [28]:
from sklearn.model_selection import train_test_split

# split the data into training and testing sets (80% training, 20% testing, classic 42 random state)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

---
#### Trying different classifiers
In the following sections, diffferent classifiers are used and evaluated using the following metrics:

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

__Naive Bayes model__ (is apparently often used for spam classifiers)

In [30]:
from sklearn.naive_bayes import MultinomialNB

mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

y_pred = mnb_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")

Accuracy: 0.9462943071965628
Precision: 0.7444444444444445
Recall: 0.7127659574468085
F1 Score: 0.7282608695652174


__Support Vector Classifier__

In [31]:
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

y_pred = svc_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")

Accuracy: 0.9849624060150376
Precision: 0.9761904761904762
Recall: 0.8723404255319149
F1 Score: 0.9213483146067416


__Random Forest Classifier__

In [32]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

y_pred = rf_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print(f"F1 Score: {f1_score(y_test, y_pred)}")

Accuracy: 0.9817400644468314
Precision: 1.0
Recall: 0.8191489361702128
F1 Score: 0.9005847953216374


---
#### Transforming new data to use with the models
In this short section, it is shown how the functions and classes implemented before could be used in a pipeline to classify new, incoming emails

In [33]:
from sklearn.pipeline import make_pipeline

# create a pipeline with an EmailPreprocessor and the TfidfVectorizer that was fitted before
pipeline = make_pipeline(EmailPreprocessor(), vectorizer)


# define new emails
some_new_spam = "Hello, I am a Nigerian prince and I would like to offer you a million dollars. Please send me your bank account details."
some_new_ham = "Hi, I was wondering if i could get my Nintendo Switch back. I only lent it to you for a week and it's been a month now."

# transform the new emails using the pipeline
X_new = pipeline.transform([some_new_spam, some_new_ham])

# convert sparse matrix to dense (SVC was trained on dense matrix, TF-IDF outputs a sparse vector, needs to be converted)
X_new = X_new.toarray()

# predict the label of the new emails using the model
labels = svc_model.predict(X_new)

# the model correctly classifies the emails as spam ([1]) and ham ([0])	
print(labels)

[1 0]


---
## Classification results
This section offers a quick overview over the performance of the three classifiers (Multinomial Naive-Bayes, SVC, Random Forest) used on varying datasets.

The length of the feature vector (being controlled by changing the _max_features_ parameter of the TF-IDF Vectorizer) had significant impact on the performance of the model. Using a Trial & Error approach, the value leading to best results was __3000__. Interestingly, on earlier versions of the classifier, using a different random seed, __2000__ yielded better results.

__Ham and Spam 1 only:__ (using a feature vector with a length of 3000)

| __Classifier__     | __Accuracy__ | __Precision__ | __Recall__ | __F1-Score__ |
|--------------------|--------------|---------------|------------|--------------|
| Multinomial NB     | 0.9817       | 0.9782        | 0.9091     | 0.9424       |
| __SVM Classifier__ | __0.9867__   | __1.0__       | __0.9192__ | __0.9579__   |
| Random Forest      | 0.9817       | 0.9782        | 0.9091     | 0.9424       |


__Ham and Spam 1 with hard ham:__ (using a feature vector with a length of 3000)

| __Classifier__     | __Accuracy__ | __Precision__ | __Recall__ | __F1-Score__ |
|--------------------|--------------|---------------|------------|--------------|
| Multinomial NB     | 0.9154       | 0.6667        | 0.8298     | 0.7393       |
| __SVM Classifier__ | __0.9877__   | __0.9778__    | __0.9362__ | __0.9565__   |
| Random Forest      | 0.9769       | 0.9540        | 0.8830     | 0.9171       |


__Full Dataset, including hard ham:__ (using a feature vector with a length of 3000)
| __Classifier__     | __Accuracy__ | __Precision__ | __Recall__ | __F1-Score__ |
|--------------------|--------------|---------------|------------|--------------|
| Multinomial NB     | 0.9474       | 0.7473        | 0.7234     | 0.7351       |
| __SVM Classifier__ | __0.9860__   | __0.9880__    | __0.8723__ | __0.9266__   |
| Random Forest      | 0.9839       | 1.0           | 0.8404     | 0.9132       |

As can be seen in the tables above, a __SVM classifier__ led to the __best__ results in all occations compared to the Multinomial Naive-Bayes and the Random-Forest classifier. The MNB classifier specifically started to struggle tremendously when introduced to the hard ham emails.

---