# DAT565/DIT407 Assignment 3

Author: Group 26 | Wenjun Tian wenjunt@chalmers.se | Yifan Tang yifant@chalmers.se

Date: 2024-11-20

# Problem 1: Spam & Ham

## A. Data exploration

Judging as human readers, we find that the following features help tell spam apart from ham:

1. Topic and related keywords: Spam emails mostly focus on commodity promotion, promotion of dating/porn websites, and outright scams. Commodity promotion emails contain price-related keywords such as "$", "cash", and "cheap"; dating/porn promotion emails contain sexual keywords such as "amateur", "wives", and "girls"; scam emails mostly focus on topics like unexpected fortune or job opportunities. By contrast, ham emails cover various topics, including work-related content, personal communication, and legitimate newsletters.

2. Structure of the `HTML` content: Spam emails are usually generated from a fixed template, so most of them have complex, fancy `HTML` structures that can include various fonts, colors, and embedded media. Ham emails, on the other hand, are simple and neat in most cases and focus on direct communication.

3. Spam markers: Spam emails tend to be marked as AD by mail server admins and users, while ham emails are not. Moreover, spam emails sometimes explicitly assert "This is NOT spam!" or something similar, a poor lie that gives them away. Ham emails lack such markers and are usually consistent in terms of content and senders.

Hard ham emails differ from easy ham emails and resemble spam for the following reasons:

1. Content similar to spam: Hard ham emails also promote commodities, companies, etc. This makes them look like spam, especially in their subject lines and promotional wording.

2. Created from templates: Hard ham emails are also created from templates and sent to a large number of people, so their `HTML` structure resembles that of spam.

Though hard ham emails are hard to distinguish from spam, one key feature differentiates the two: most hard ham emails contain the keyword "unsubscribe", meaning that receivers can opt out of further emails. Spam emails rarely provide such an option.
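As a rough illustration of this last observation, the check below flags whether an email text mentions "unsubscribe". The helper name `mentions_unsubscribe` is hypothetical, and the check is only an exploratory heuristic, not part of the classifiers used later in this report.

```python
import re

# Hypothetical helper for exploration only: case-insensitive keyword search.
UNSUBSCRIBE_RE = re.compile(r"unsubscribe", re.IGNORECASE)


def mentions_unsubscribe(text: str) -> bool:
    """Return True if the email text contains 'unsubscribe' in any casing."""
    return UNSUBSCRIBE_RE.search(text) is not None


print(mentions_unsubscribe("Click here to UNSUBSCRIBE from this list"))
print(mentions_unsubscribe("You have won $1,000,000 in cash!"))
```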

## B. Data splitting

We split the data into training and test sets with a ratio of 3:1 by invoking the `train_test_split()` function from the `sklearn.model_selection` module (its default `test_size` of 0.25 gives this ratio).
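The 3:1 ratio can be seen on a small example. This is a minimal sketch on made-up placeholder data (the `emails` and `labels` lists and the seed `0` are assumptions, not the real corpus or the seed used in this report).

```python
from sklearn.model_selection import train_test_split

# Toy data (assumption): eight placeholder "emails" with alternating labels.
emails = [f"email {i}" for i in range(8)]
labels = ["spam", "easy_ham"] * 4

# With neither test_size nor train_size given, 25% of the samples are held out.
X_train, X_test, y_train, y_test = train_test_split(emails, labels, random_state=0)

print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, which is why the report's own code passes a seed as well.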

# Problem 2: Preprocessing

We read email files from the specified categories (`easy_ham`, `hard_ham`, and `spam`) and store their content in a `DataFrame` for further analysis.

In the `read_email_file()` function, we try to read the content of an email file using different encodings (ascii, iso-8859-1, and utf-8, in that order) to handle potential encoding issues. If one encoding fails, the function attempts the next one, so that email files can be read without errors. Note that iso-8859-1 maps every possible byte to a character and therefore never fails, so in practice the utf-8 fallback is not reached.
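The effect of the encoding order can be illustrated on raw bytes instead of files. The sketch below is not the report's function: `decode_with_fallback` is a hypothetical helper, and the alternative order (utf-8 before iso-8859-1) is shown only for comparison.

```python
def decode_with_fallback(raw: bytes, encodings=("ascii", "iso-8859-1", "utf-8")) -> str:
    """Try each encoding in order and return the first successful decoding."""
    last_error = None
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error


raw = "café".encode("utf-8")  # non-ASCII text stored as UTF-8 bytes

# Same order as in the report: ascii fails, iso-8859-1 always succeeds.
same_order = decode_with_fallback(raw)
# Alternative order (assumption, for comparison only): utf-8 gets a chance first.
utf8_first = decode_with_fallback(raw, ("ascii", "utf-8", "iso-8859-1"))

print(repr(same_order), repr(utf8_first))
```

With the report's order, UTF-8 encoded characters are read as two iso-8859-1 characters each. For a bag-of-words model this mostly affects a small number of non-ASCII tokens.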

For each email file, the content is read using `read_email_file()` and a dictionary containing the content (`Content`) and its category (`Category`) is created.

After reading all email files, we convert the list of dictionaries into a `DataFrame` in which each row represents one email file.

In [7]:
import os
import pandas as pd

def read_email_file(file_path: str) -> str:
    """Read an email file and return its content as a string.

    Encodings are tried in order: ascii, iso-8859-1, utf-8.
    """
    # NOTE: iso-8859-1 can decode any byte sequence, so the utf-8 fallback
    # is effectively never reached; the order is kept as in the original run.
    last_error = None
    for encoding in ('ascii', 'iso-8859-1', 'utf-8'):
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError as e:
            last_error = e
    # every encoding failed: re-raise the last decoding error
    raise last_error
    
categories = ["easy_ham", "hard_ham", "spam"]

rows = []

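# Walk the directory tree and read every file found under a category directory.
# A directory counts as belonging to a category if the category name occurs in its path.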
for root, _, files in os.walk("."):
    for category in categories:
        if category in root:
            for f in files:
                rows.append({
                    "Content": read_email_file(os.path.join(root, f)),
                    "Category": category
                })

df = pd.DataFrame(rows)

df

||Content|Category|
| --- | --- | --- |
|0|`Return-Path: <bounce-lgmedia-2534370@sprocket....`|hard_ham|
|1|`Return-Path: <Online#3.19725.55-A8YAgb1NX5rYkd...`|hard_ham|
|2|`Return-Path: <Online#3.19592.a8-JNyKlW9O8FdiLs...`|hard_ham|
|3|`From bounce-neatnettricks-2424157@silver.lyris...`|hard_ham|
|4|`Return-Path: <Online#3.20115.09-rB-TgEkNwY9w6R...`|hard_ham|
|...|...|...|
|3297|`From Alex-09242002-HTML@frugaljoe.330w.com Th...`|spam|
|3298|`From mando@insiq.us Mon Aug 26 15:49:52 2002\...`|spam|
|3299|`Return-Path: ler@lerami.lerctr.org\nDelivery-D...`|spam|
|3300|`From fholland@bigfoot.com Wed Sep 11 19:43:52...`|spam|


# Problem 3: Easy Ham

## 3.1 Code logic
First, we split the data into training and test sets with a ratio of 3:1 by invoking the `train_test_split()` function.

Then, we use function `analyze()` to train and evaluate models for classifying spam versus ham emails. 

In this function, we use `CountVectorizer` to convert email content into bag-of-words vectors (stored as a sparse `CSR` matrix) and `LabelEncoder` to encode the categories (`easy_ham`, `hard_ham`, `spam`) as integers.

Note that we apply `fit_transform()` to the training set and only `transform()` to the test set. The `fit()` function "learns" from the data (e.g., `CountVectorizer.fit()` builds the vocabulary of words), so it is applied only to the training set. The `transform()` function transforms data according to what was learned in `fit()` (e.g., `CountVectorizer.transform()` turns plain text into bag-of-words vectors in CSR form based on that vocabulary), so it is applied to both the training and test sets.
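The difference between `fit_transform()` and `transform()` can be seen on a toy example. The texts and labels below are made up for illustration (an assumption, not taken from the email corpus). The word "lottery" occurs only in the test text, so it is absent from the learned vocabulary and is ignored when the test text is vectorized.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy data (assumption), not taken from the email corpus.
train_texts = ["cheap cash now", "meeting at noon"]
train_labels = ["spam", "easy_ham"]
test_texts = ["cheap lottery now"]  # "lottery" never appears in the training texts

cv = CountVectorizer()
le = LabelEncoder()

X_train = cv.fit_transform(train_texts)   # learns the vocabulary, then vectorizes
X_test = cv.transform(test_texts)         # reuses the learned vocabulary only
y_train = le.fit_transform(train_labels)  # learns the label-to-integer mapping

print(sorted(cv.vocabulary_))              # words learned from the training texts
print(X_train.format, X_test.shape)        # sparse format and test matrix shape
print(X_test.toarray())                    # counts for the known words only
print(list(le.classes_), list(y_train))    # label order and encoded labels
```

Fitting the vectorizer on the test set as well would leak information about test-only words into the model, which is what the train-only `fit()` avoids.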

Then, the function defines `train_and_test()`, which fits a Naive Bayes classifier (`BernoulliNB` or `MultinomialNB`) on the training data and evaluates it on the test set. We `fit()` the classifier on the training set and then use it to `predict()` the categories of the emails in the test set.
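The two classifiers treat word counts differently, which a toy example can show. In this sketch the count matrix and labels are invented (an assumption): `BernoulliNB` binarizes the counts with its default `binarize=0.0`, so only the presence of a word matters, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy count matrix (assumption): two words, one training email per class.
X_train = np.array([[2, 0],   # class 1: first word appears twice
                    [0, 3]])  # class 0: second word appears three times
y_train = np.array([1, 0])

bnb = BernoulliNB().fit(X_train, y_train)
mnb = MultinomialNB().fit(X_train, y_train)

once = np.array([[1, 0]])        # first word appears once
five_times = np.array([[5, 0]])  # first word appears five times

# BernoulliNB only sees presence/absence, so repetition does not change anything.
print(bnb.predict_proba(once), bnb.predict_proba(five_times))
# MultinomialNB uses the counts, so repetition strengthens the evidence.
print(mnb.predict_proba(once), mnb.predict_proba(five_times))
```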

After that, we compare the predicted and actual categories of the test set, computing `accuracy`, `precision`, `recall`, and the confusion matrix. Spam is treated as the positive class.
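The three metrics follow directly from the four confusion-matrix counts. The sketch below uses a hypothetical helper, `summarize`, and plugs in the BernoulliNB counts printed for the `easy_ham`/spam split in this section.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, and recall with spam as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),  # share of predicted spam that is really spam
        "recall": tp / (tp + fn),     # share of actual spam that was caught
    }


# Counts from the BernoulliNB run on the easy_ham/spam split.
metrics = summarize(tp=55, fp=5, fn=68, tn=635)
print(metrics)
```

Like the report's code, this sketch divides by zero if a classifier never predicts spam or the test set contains no spam. That does not happen with the counts reported here.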



In [8]:
'''
Author: amamiya-yuuko-1225 1913250675@qq.com
Date: 2024-11-19 17:18:58
LastEditors: amamiya-yuuko-1225 1913250675@qq.com
Description: 
'''
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

def analyze(df_train, df_test, ham_label):
    """Train and evaluate Naive Bayes classifiers for spam vs. ham.

    df_train: training set
    df_test: test set
    ham_label: "easy_ham" for q3 or "hard_ham" for q4
    Returns nothing; prints accuracy, precision, recall, and the confusion matrix.
    """
    cv = CountVectorizer()
    le = LabelEncoder()

    # fit on the training set only; the test set reuses the learned vocabulary and labels
    X_train, X_test = cv.fit_transform(df_train['Content']), cv.transform(df_test['Content'])
    y_train, y_test = le.fit_transform(df_train['Category']), le.transform(df_test['Category'])

    def train_and_test(classifier, name):
        """Fit the given classifier (BernoulliNB or MultinomialNB) and print its metrics."""
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        y_test_inv = le.inverse_transform(y_test)
        y_pred_inv = le.inverse_transform(y_pred)
        # spam is the positive class, ham the negative class
        tp = ((y_test_inv == 'spam') & (y_pred_inv == 'spam')).sum()
        fp = ((y_test_inv == ham_label) & (y_pred_inv == 'spam')).sum()
        fn = ((y_test_inv == 'spam') & (y_pred_inv == ham_label)).sum()
        tn = ((y_test_inv == ham_label) & (y_pred_inv == ham_label)).sum()
        acc = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        print(f"{name}:\naccuracy:{acc},precision:{precision},recall:{recall}")
        print(f"confusion matrix:\n tp: {tp}, fn: {fn} \n fp: {fp}, tn: {tn} \n")

    train_and_test(BernoulliNB(), "BernoulliNB")

    train_and_test(MultinomialNB(), "MultinomialNB")

# random seed for training and test set split
SEED = 1919810

# split training and test set
df_train, df_test = train_test_split(df[df['Category'].isin(['easy_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'easy_ham')

BernoulliNB:
accuracy:0.9043250327653998,precision:0.9166666666666666,recall:0.44715447154471544
confusion matrix:
 tp: 55, fn: 68 
 fp: 5, tn: 635 

MultinomialNB:
accuracy:0.9672346002621232,precision:0.9803921568627451,recall:0.8130081300813008
confusion matrix:
 tp: 100, fn: 23 
 fp: 2, tn: 638 



## 3.2 Results

Thus, the results are as follows:

### BernoulliNB
accuracy: 0.9043250327653998; precision: 0.9166666666666666; recall: 0.44715447154471544

confusion matrix:

||Pred. pos.|Pred. neg.|Marginal sum|
| --- | --- | --- | --- |
|Actual pos.|55|68|123|
|Actual neg.|5|635|640|
|Marginal sum|60|703|763|

### MultinomialNB
accuracy: 0.9672346002621232; precision: 0.9803921568627451; recall: 0.8130081300813008

confusion matrix:

||Pred. pos.|Pred. neg.|Marginal sum|
| --- | --- | --- | --- |
|Actual pos.|100|23|123|
|Actual neg.|2|638|640|
|Marginal sum|102|661|763|


# Problem 4: Hard Ham
## 4.1 Code logic
The logic is the same as in Problem 3, except that we consider `hard_ham` instead of `easy_ham`.

In [9]:
df_train, df_test = train_test_split(df[df['Category'].isin(['hard_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'hard_ham')

BernoulliNB:
accuracy:0.8882978723404256,precision:0.8642857142857143,recall:0.983739837398374
confusion matrix:
 tp: 121, fn: 2 
 fp: 19, tn: 46 

MultinomialNB:
accuracy:0.9361702127659575,precision:0.9172932330827067,recall:0.991869918699187
confusion matrix:
 tp: 122, fn: 1 
 fp: 11, tn: 54 




## 4.2 Results

Thus, the results are as follows:

### BernoulliNB
accuracy: 0.8882978723404256; precision: 0.8642857142857143; recall: 0.983739837398374

confusion matrix:

||Pred. pos.|Pred. neg.|Marginal sum|
| --- | --- | --- | --- |
|Actual pos.|121|2|123|
|Actual neg.|19|46|65|
|Marginal sum|140|48|188|

### MultinomialNB
accuracy: 0.9361702127659575; precision: 0.9172932330827067; recall: 0.991869918699187

confusion matrix:

||Pred. pos.|Pred. neg.|Marginal sum|
| --- | --- | --- | --- |
|Actual pos.|122|1|123|
|Actual neg.|11|54|65|
|Marginal sum|133|55|188|

## 4.3 Differences between problem 3 and 4

1. Accuracy: Both classifiers achieve higher accuracy on `easy_ham` (0.904 and 0.967) than on `hard_ham` (0.888 and 0.936). This is reasonable, since `easy_ham` is easier to differentiate from spam, while `hard_ham` often contains promotional content.

2. Precision: For `easy_ham`, the precision of both classifiers is quite high (0.917 and 0.980), which implies that most spam predictions are correct. For `hard_ham`, precision is lower (0.864 and 0.917) because its spam-like content leads to more false positives.

3. Recall: Recall is much lower on `easy_ham`, especially for BernoulliNB (0.447, against 0.813 for MultinomialNB), which means that many spam emails are missed. On `hard_ham`, both classifiers reach a much higher recall (0.984 and 0.992). One plausible factor is class balance: spam makes up only 123 of the 763 test emails in the `easy_ham` split but 123 of the 188 test emails in the `hard_ham` split.

4. Confusion matrix: Both classifiers produce more false positives (FP) on `hard_ham` (19 and 11) than on `easy_ham` (5 and 2), even though the `hard_ham` test set contains far fewer ham emails. This indicates that legitimate but promotional emails are often mistaken for spam. On `hard_ham`, both classifiers also have very few false negatives (FN; 2 and 1), meaning that actual spam is almost always identified; on `easy_ham`, FN is considerably higher (68 and 23).
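The comparison above can be put on a common scale by turning the reported counts into rates. This is a small sketch with a hypothetical helper, `rates`; the counts are copied from the outputs of Problems 3 and 4, and the false positive rate and spam share are derived quantities that the original runs did not print.

```python
# (tp, fn, fp, tn) as reported in the outputs of Problems 3 and 4.
results = {
    ("easy_ham", "BernoulliNB"): (55, 68, 5, 635),
    ("easy_ham", "MultinomialNB"): (100, 23, 2, 638),
    ("hard_ham", "BernoulliNB"): (121, 2, 19, 46),
    ("hard_ham", "MultinomialNB"): (122, 1, 11, 54),
}


def rates(tp, fn, fp, tn):
    """Return (false positive rate, share of spam in the test set)."""
    fpr = fp / (fp + tn)                          # ham wrongly flagged as spam
    spam_share = (tp + fn) / (tp + fn + fp + tn)  # how common spam is in the split
    return fpr, spam_share


summary = {key: rates(*counts) for key, counts in results.items()}
for (ham, model), (fpr, share) in summary.items():
    print(f"{ham:8s} {model:13s} FPR={fpr:.3f} spam share={share:.3f}")
```

Looking at rates rather than raw counts matters here because the two test sets differ a lot in size and in how much ham they contain.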