# DAT565/DIT407 Assignment 3

Author: Group 26 | Wenjun Tian wenjunt@chalmers.se | Yifan Tang yifant@chalmers.se

Date: 2024-11-20

# Problem 1: Spam & Ham

## A. Data exploration

Judging as human readers, the following features let us tell spam apart from ham:

1. Topic and related keywords: Spam emails mostly fall into three groups: commodity promotion, dating/porn website promotion, and outright scams. Commodity promotion emails contain price-related keywords such as "$", "cash", and "cheap"; dating/porn website promotions contain sexual keywords such as "amateur", "wives", and "girls"; scam emails mostly focus on topics like unexpected fortune or job opportunities. By contrast, ham emails cover varied topics, including work-related content, personal communication, and legitimate newsletters.

2. Structure of the `HTML` content: Spam emails are usually generated from a fixed template, so most of them have complex and fancy `HTML` structures with various fonts, colors, and embedded media. Ham emails, on the other hand, are simple and neat in most cases and focus on direct communication.

3. Spam markers: Spam emails tend to be marked as AD by mail server admins and users, while ham emails are not. Moreover, spam emails sometimes explicitly assert "This is NOT spam!" or something similar, a transparent lie that gives them away. Ham emails lack such markers and are usually consistent in terms of content and senders.
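
The three cues above can be turned into rough hand-written checks. The following is a minimal sketch and not part of the classifier used later in this report; the keyword list, the regular expressions, and the function name `describe_email` are illustrative assumptions.

```python
import re

# Hypothetical keyword list, taken from the price-related examples mentioned above
PRICE_KEYWORDS = ("$", "cash", "cheap")


def describe_email(text: str) -> dict:
    """Return a few hand-crafted cues for one raw email (illustrative only)."""
    lowered = text.lower()
    return {
        # Cue 1: number of price-related keyword occurrences
        "price_hits": sum(lowered.count(k) for k in PRICE_KEYWORDS),
        # Cue 2: rough count of opening HTML tags as a proxy for template complexity
        "html_tags": len(re.findall(r"<[a-z][^>]*>", lowered)),
        # Cue 3: explicit denial of being spam
        "denies_spam": bool(re.search(r"this is not spam", lowered)),
    }


sample = '<html><font color="red">CHEAP pills, save $$$!</font> This is NOT spam!</html>'
cues = describe_email(sample)
print(cues)
```

Such cues are only a way to make the manual observations concrete; the models in Problems 3 and 4 learn from word counts instead.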

Furthermore, the reasons that make hard ham emails different from easy ham emails but similar to spam emails are as follows:

1. Similar content to spam: Hard ham emails also promote commodities, companies, etc. This makes them look like spam, especially in their subject lines and promotional wording.

2. Created from templates: Hard ham emails are also created from templates and sent to a large number of people, so their `HTML` structure resembles that of spam.

Though hard ham emails are hard to distinguish from spam, one key feature differentiates the two: most hard ham emails contain the keyword "unsubscribe", meaning that the receivers can opt out of further emails. Spam emails rarely provide such a function.
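
The "unsubscribe" observation can be checked by measuring how often the word appears in each category. The sketch below uses a tiny made-up DataFrame in place of the real emails; the column names `BOW` and `Category` follow the ones used later in this report, and the sample texts are invented for illustration.

```python
import pandas as pd

# Invented sample texts standing in for real emails
toy = pd.DataFrame({
    "BOW": [
        "Weekly deals. Click here to unsubscribe.",
        "Newsletter issue 12. Unsubscribe at any time.",
        "Win cash now!!!",
        "Cheap meds, limited offer",
    ],
    "Category": ["hard_ham", "hard_ham", "spam", "spam"],
})

# Case-insensitive match, then the share of matching emails per category
has_unsubscribe = toy["BOW"].str.contains("unsubscribe", case=False)
rate = has_unsubscribe.groupby(toy["Category"]).mean()
print(rate)
```

Running the same two lines on the full DataFrame would show whether the keyword separates `hard_ham` from `spam` as clearly as manual inspection suggests.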

## B. Data splitting

We split the data into training and test sets with a ratio of 3:1 by calling the `train_test_split()` function from the `sklearn.model_selection` package.
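
A 3:1 split corresponds to `test_size=0.25`, which is also the default of `train_test_split()` when neither `test_size` nor `train_size` is given. A minimal sketch on made-up data (the toy DataFrame and the seed value are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Eight placeholder rows so the 3:1 ratio is easy to verify by eye
toy = pd.DataFrame({
    "BOW": [f"email {i}" for i in range(8)],
    "Category": ["easy_ham", "spam"] * 4,
})

# test_size=0.25 keeps 75% of the rows for training and 25% for testing (3:1)
train, test = train_test_split(toy, test_size=0.25, random_state=0)
print(len(train), len(test))
```

Passing `stratify=toy["Category"]` would additionally keep the class proportions equal in both parts; the split used in this report does not do this.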

# Problem 2: Preprocessing

We read email files from the specified categories (`easy_ham`, `hard_ham`, and `spam`) and store their content in a DataFrame for further analysis.
In `read_email_file`, we try to read the content of an email file using different encodings (`ascii`, `iso-8859-1`, and `utf-8`, in that order) to handle potential encoding issues. If one encoding fails, the function attempts the next one. Because `iso-8859-1` maps every byte value to a character, the second attempt always succeeds, so every file can be read without decoding errors and the `utf-8` branch is never reached in practice.
For each email file, the content is read using `read_email_file`, and a dictionary containing the content (`BOW`) and its category (`Category`) is created.

In [8]:
import os
import pandas as pd

def read_email_file(file_path: str) -> str:
    '''
    Read an email file and return its content.

    Encodings are tried in order: ascii, iso-8859-1, utf-8.
    param {str} file_path: path of the email file
    return {str} content of the file
    '''
    try:
        with open(file_path, 'r', encoding='ascii') as f:
            return f.read()
    except UnicodeDecodeError:
        try:
            # iso-8859-1 maps every byte to a character, so this attempt does not raise UnicodeDecodeError
            with open(file_path, 'r', encoding='iso-8859-1') as f:
                return f.read()
        except UnicodeDecodeError:
            # Kept as a safeguard; not reached in practice because of the note above
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
    
categories = ["easy_ham", "hard_ham", "spam"]

rows = []

# Read every file found under a directory whose path contains a category name
for root, _, files in os.walk("."):
    for category in categories:
        if category in root:
            for f in files:
                rows.append({
                    "BOW": read_email_file(os.path.join(root, f)),
                    "Category": category
                })

df = pd.DataFrame(rows)

df

| | BOW | Category |
|---|---|---|
| 0 | `From rpm-list-admin@freshrpms.net Wed Aug 28 ...` | easy_ham |
| 1 | `From fork-admin@xent.com Tue Oct 8 10:56:40 ...` | easy_ham |
| 2 | `From fork-admin@xent.com Thu Sep 26 11:04:48 ...` | easy_ham |
| 3 | `From garym@canada.com Tue Sep 17 23:29:41 200...` | easy_ham |
| 4 | `From spamassassin-talk-admin@lists.sourceforge...` | easy_ham |
| ... | ... | ... |
| 3297 | `Return-Path: <bounce-lglinux-2534371@sprocket....` | hard_ham |
| 3298 | `Return-Path: <bounce-lghtml-2534368@sprocket.l...` | hard_ham |
| 3299 | `From OneIncomeLiving-bounce@groups.msn.com Mo...` | hard_ham |
| 3300 | `Return-Path: <Online#3.20516.fc-LmYwlaX_4cin49...` | hard_ham |
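
One detail of the fallback order in `read_email_file` is worth checking: `iso-8859-1` assigns a character to every possible byte value, so decoding with it never raises `UnicodeDecodeError`. The sketch below works on raw bytes rather than files; the helper `decode_with_fallback` is a hypothetical name introduced here for illustration and is not part of the notebook.

```python
def decode_with_fallback(raw: bytes, encodings=("ascii", "iso-8859-1", "utf-8")) -> str:
    """Try each encoding in order and return the first successful decoding."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the encodings could decode the input")


# Every byte value from 0 to 255 decodes under iso-8859-1
all_bytes = bytes(range(256))
assert len(all_bytes.decode("iso-8859-1")) == 256

raw = "café".encode("utf-8")
# Same order as read_email_file: ascii fails, iso-8859-1 succeeds
same_order = decode_with_fallback(raw)
# Alternative order that tries utf-8 before iso-8859-1
utf8_first = decode_with_fallback(raw, ("ascii", "utf-8", "iso-8859-1"))
print(repr(same_order), repr(utf8_first))
```

With the original order, UTF-8 text containing non-ASCII characters is read as `iso-8859-1`, so each such character turns into a two-character sequence. Trying `utf-8` first would avoid this, but it would also change the tokens the classifiers see, so the results reported in this document apply to the original order only.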


# Problem 3: Easy Ham

We implement `analyze` to train and evaluate models for classifying spam versus ham emails. The function uses `CountVectorizer` to convert email text into word-count features and `LabelEncoder` to encode the labels (`easy_ham`, `hard_ham`, `spam`) as integers. It then defines `train_and_test`, which fits a Naive Bayes classifier (`BernoulliNB` or `MultinomialNB`) to the training data and evaluates it on the test set.
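
To make the roles of `CountVectorizer` and `LabelEncoder` concrete, here is a minimal sketch on four invented one-line "emails". The texts and labels are assumptions for illustration; only the scikit-learn classes are the ones used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Invented toy data
texts = [
    "cheap cash cheap offer",
    "meeting agenda attached",
    "cash prize offer",
    "lunch meeting tomorrow",
]
labels = ["spam", "easy_ham", "spam", "easy_ham"]

cv = CountVectorizer()
X = cv.fit_transform(texts)   # sparse matrix: one row per text, one column per distinct word
le = LabelEncoder()
y = le.fit_transform(labels)  # classes are sorted alphabetically: easy_ham -> 0, spam -> 1

# "cheap" occurs twice in the first text, so its count there is 2
print(X.shape, X[0, cv.vocabulary_["cheap"]])

# Reuse the fitted vocabulary for unseen text, then classify with both models
new = cv.transform(["cheap cash"])
predictions = {}
for clf in (BernoulliNB(), MultinomialNB()):
    clf.fit(X, y)
    predictions[type(clf).__name__] = le.inverse_transform(clf.predict(new))[0]
print(predictions)
```

`BernoulliNB` binarizes the counts by default, so it only considers whether a word is present, while `MultinomialNB` uses the counts themselves.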

In [10]:
'''
Author: amamiya-yuuko-1225 1913250675@qq.com
Date: 2024-11-19 17:18:58
LastEditors: amamiya-yuuko-1225 1913250675@qq.com
Description: 
'''
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

def analyze(df_train, df_test, ham_type):
    '''
    Train and evaluate BernoulliNB and MultinomialNB on a spam-vs-ham split.

    param {*} df_train: training set with 'BOW' and 'Category' columns
    param {*} df_test: test set with the same columns
    param {*} ham_type: "easy_ham" for q3 or "hard_ham" for q4
    return {*} no return, but print acc, precision, recall, and confusion matrix
    '''
    cv = CountVectorizer()
    le = LabelEncoder()

    # Fit the vocabulary and the label mapping on the training set only, then reuse them on the test set
    X_train, X_test = cv.fit_transform(df_train['BOW']), cv.transform(df_test['BOW'])
    y_train, y_test = le.fit_transform(df_train['Category']), le.transform(df_test['Category'])

    def train_and_test(classifier, name):
        '''
        Train and test using the specified classifier.

        param {*} classifier: BernoulliNB or MultinomialNB
        param {*} name: name of the classifier
        return {*} no return, but print acc, precision, recall, and confusion matrix
        '''
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        # Map the integer labels back to category names so the comparisons below are readable
        y_test_inv = le.inverse_transform(y_test)
        y_pred_inv = le.inverse_transform(y_pred)
        # Spam is the positive class
        tp = ((y_test_inv == 'spam') & (y_pred_inv == 'spam')).sum()
        fp = ((y_test_inv == ham_type) & (y_pred_inv == 'spam')).sum()
        fn = ((y_test_inv == 'spam') & (y_pred_inv == ham_type)).sum()
        tn = ((y_test_inv == ham_type) & (y_pred_inv == ham_type)).sum()
        acc = (tp+tn)/(tp+fp+tn+fn)
        precision = tp/(tp+fp)
        recall = tp/(tp+fn)
        print(f"{name}: accuracy:{acc},precision:{precision},recall:{recall}")
        print(f"confusion matrix:\n tp: {tp}, fn: {fn} \n fp: {fp}, tn: {tn} \n")

    train_and_test(BernoulliNB(), "BernoulliNB")

    train_and_test(MultinomialNB(), "MultinomialNB")


SEED = 1919810

df_train, df_test = train_test_split(df[df['Category'].isin(['easy_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'easy_ham')

BernoulliNB: accuracy:0.9043250327653998,precision:0.9807692307692307,recall:0.4146341463414634
confusion matrix:
 tp: 51, fn: 72 
 fp: 1, tn: 639 

MultinomialNB: accuracy:0.9659239842726082,precision:0.9801980198019802,recall:0.8048780487804879
confusion matrix:
 tp: 99, fn: 24 
 fp: 2, tn: 638 
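
The printed metrics follow directly from the four confusion-matrix counts. As a cross-check, the sketch below recomputes them from the counts reported above; the function name `metrics_from_counts` is introduced here for illustration.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall with spam as the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all emails classified correctly
        "precision": tp / (tp + fp),     # share of predicted spam that is really spam
        "recall": tp / (tp + fn),        # share of real spam that was detected
    }


# Counts taken from the easy_ham output above
bernoulli = metrics_from_counts(tp=51, fp=1, fn=72, tn=639)
multinomial = metrics_from_counts(tp=99, fp=2, fn=24, tn=638)
print(bernoulli)
print(multinomial)
```

Both classifiers rarely flag ham as spam (1 and 2 false positives), so the gap in accuracy comes almost entirely from the false negatives: 72 for BernoulliNB versus 24 for MultinomialNB.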



# Problem 4: Hard Ham

1. Accuracy: Both classifiers are more accurate on `easy_ham` (0.904 for BernoulliNB, 0.966 for MultinomialNB) than on `hard_ham` (0.862 and 0.920). This is reasonable, since `easy_ham` is easier to tell apart from spam, while `hard_ham` often contains promotional content.

2. Precision: On `easy_ham`, precision is about 0.98 for both classifiers, which implies that almost every email predicted as spam really is spam. On `hard_ham`, precision is lower (0.837 and 0.901) because its promotional content is easily confused with spam.

3. Recall: On `easy_ham`, BernoulliNB has a much lower recall (0.415) than MultinomialNB (0.805), meaning that it misses more than half of the spam emails. On `hard_ham`, both classifiers reach a recall of 0.992. Overall, recall is lower on `easy_ham` than on `hard_ham` for both classifiers. One plausible contributor is class balance: spam makes up 123 of the 763 test emails in the `easy_ham` split but 129 of the 188 in the `hard_ham` split, so the class prior favors spam far more in the latter.

4. Confusion matrix: Both classifiers produce more false positives (FP) on `hard_ham` (25 and 14) than on `easy_ham` (1 and 2), indicating that legitimate but promotional emails are often mistaken for spam. On `hard_ham`, both classifiers have only one false negative (FN), meaning that actual spam is almost always identified; on `easy_ham`, by contrast, false negatives are the main source of error (72 and 24).


In [11]:
df_train, df_test = train_test_split(df[df['Category'].isin(['hard_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'hard_ham')

BernoulliNB: accuracy:0.8617021276595744,precision:0.8366013071895425,recall:0.9922480620155039
confusion matrix:
 tp: 128, fn: 1 
 fp: 25, tn: 34 

MultinomialNB: accuracy:0.9202127659574468,precision:0.9014084507042254,recall:0.9922480620155039
confusion matrix:
 tp: 128, fn: 1 
 fp: 14, tn: 45 
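
Because the two test splits have very different class balances, it can help to compare each accuracy with the accuracy of always predicting the larger class. The sketch below derives that baseline from the confusion-matrix counts reported above; the function name `majority_baseline` is introduced here for illustration.

```python
def majority_baseline(tp: int, fp: int, fn: int, tn: int) -> float:
    """Accuracy of always predicting the larger class, from confusion-matrix counts."""
    spam = tp + fn   # actual spam emails in the test set
    ham = fp + tn    # actual ham emails in the test set
    return max(spam, ham) / (spam + ham)


# MultinomialNB counts from the two outputs; the class totals are the same for BernoulliNB
easy = majority_baseline(tp=99, fp=2, fn=24, tn=638)
hard = majority_baseline(tp=128, fp=14, fn=1, tn=45)
print(f"easy_ham split: {easy:.3f}, hard_ham split: {hard:.3f}")
```

The baseline is about 0.839 on the `easy_ham` split (always predict ham) and about 0.686 on the `hard_ham` split (always predict spam). Measured against these, MultinomialNB's 0.920 on `hard_ham` is a larger improvement than its 0.966 on `easy_ham`, even though the raw accuracy is lower.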

