# DAT565/DIT407 Assignment 3

Author: Group 26 | Wenjun Tian wenjunt@chalmers.se | Yifan Tang yifant@chalmers.se

Date: 2024-11-20

# Problem 1: Spam & Ham

## A. Data exploration

When reading the emails manually, we can tell spam apart from ham by the following features:

1. Topic and related keywords: Spam emails mostly fall into three groups: commodity promotion, dating/porn website promotion, and outright scams. Commodity promotions contain price-related keywords such as "$", "cash", and "cheap"; dating/porn website promotions contain sexually suggestive keywords such as "amateur", "wives", and "girls"; scam emails are mostly about "fortune" or a potential job opportunity. By contrast, ham emails cover a wide variety of topics.

2. Structure of the `HTML` content: Spam emails are usually generated from a fixed template, so most of them have a complex and fancy `HTML` structure. Ham emails, on the other hand, are simple and neat in most cases.

3. Spam markers: Spam emails tend to be marked as AD by mail server admins and users, while ham emails are not. Moreover, spam emails sometimes explicitly assert "This is NOT spam!" or something similar, a transparent lie that gives them away.
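The keyword cue from item 1 can be sketched as a simple counter. This is only an illustration: the helper `count_spam_keywords` and the list `SPAM_KEYWORDS` are hypothetical names, the list contains just the examples quoted above, and this is not the classifier used later in the report.

```python
# Keywords quoted in the observations above; illustrative, not exhaustive.
SPAM_KEYWORDS = ["$", "cash", "cheap", "amateur", "wives", "girls", "fortune"]


def count_spam_keywords(text: str) -> int:
    """Count case-insensitive occurrences of the illustrative spam keywords."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in SPAM_KEYWORDS)


print(count_spam_keywords("Earn CASH fast, cheap pills for $5"))
print(count_spam_keywords("Meeting moved to Thursday at 10"))
```

A raw substring count like this also matches keywords inside longer words, so it is at best a rough signal.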

Furthermore, hard ham emails differ from easy ham emails and resemble spam for the following reasons:

1. Content similar to spam: Hard ham emails also promote commodities, companies, etc., which makes them look like spam.

2. Created from templates: Hard ham emails are also created from templates and sent to a large number of people, so their `HTML` structure resembles that of spam.

Although hard ham emails are hard to distinguish from spam, one key feature separates the two: most hard ham emails contain the keyword "unsubscribe", meaning that receivers can opt out of further emails. Spam emails rarely provide such an option.
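The "unsubscribe" cue can be expressed as a one-line check. The function name `offers_unsubscribe` is hypothetical, and the check is a rough heuristic for illustration only, not part of the pipeline in the following problems.

```python
def offers_unsubscribe(text: str) -> bool:
    """Return True if the email text mentions "unsubscribe" (case-insensitive)."""
    return "unsubscribe" in text.lower()


print(offers_unsubscribe("Click here to UNSUBSCRIBE from this newsletter."))
print(offers_unsubscribe("You have won a fortune! Reply now."))
```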

## B. Data splitting

We split the data into training and test sets with a ratio of 3:1 by calling the `train_test_split()` function from the `sklearn.model_selection` module.
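A 3:1 ratio corresponds to `test_size=0.25`, which is also the default of `train_test_split()` when neither `test_size` nor `train_size` is given. The sketch below makes the ratio explicit on hypothetical toy data; the list `emails` and the seed value are placeholders, not the data or seed used in this report.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the email DataFrame.
emails = [f"email {i}" for i in range(8)]

# test_size=0.25 gives a 3:1 train/test split; random_state makes it reproducible.
train, test = train_test_split(emails, test_size=0.25, random_state=0)

print(len(train), len(test))  # 6 2
```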

# Problem 2: Preprocessing



In [67]:
import os
import pandas as pd

def read_email_file(file_path: str) -> str:
    '''
    description: read email file, return the content
    param {str} file_path
    return {str} content of file
    '''
    try:
        with open(file_path, 'r', encoding='ascii') as f:
            return f.read()
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this fallback never
        # raises UnicodeDecodeError; a further UTF-8 fallback would be unreachable.
        with open(file_path, 'r', encoding='iso-8859-1') as f:
            return f.read()
    
categories = ["easy_ham", "hard_ham", "spam"]

rows = []

# Read all email files; a directory belongs to a category if its path contains the category name
for root, _, files in os.walk("."):
    for category in categories:
        if category in root:
            for f in files:
                rows.append({
                    "BOW": read_email_file(os.path.join(root, f)),
                    "Category": category
                })

df = pd.DataFrame(rows)

df

| | BOW | Category |
|---|---|---|
| 0 | `Return-Path: <bounce-lgmedia-2534370@sprocket....` | hard_ham |
| 1 | `Return-Path: <Online#3.19725.55-A8YAgb1NX5rYkd...` | hard_ham |
| 2 | `Return-Path: <Online#3.19592.a8-JNyKlW9O8FdiLs...` | hard_ham |
| 3 | `From bounce-neatnettricks-2424157@silver.lyris...` | hard_ham |
| 4 | `Return-Path: <Online#3.20115.09-rB-TgEkNwY9w6R...` | hard_ham |
| ... | ... | ... |
| 3297 | `From Alex-09242002-HTML@frugaljoe.330w.com Th...` | spam |
| 3298 | `From mando@insiq.us Mon Aug 26 15:49:52 2002\...` | spam |
| 3299 | `Return-Path: ler@lerami.lerctr.org\nDelivery-D...` | spam |
| 3300 | `From fholland@bigfoot.com Wed Sep 11 19:43:52...` | spam |


# Problem 3: Easy Ham

In [68]:
'''
Author: amamiya-yuuko-1225 1913250675@qq.com
Date: 2024-11-19 17:18:58
LastEditors: amamiya-yuuko-1225 1913250675@qq.com
Description: 
'''
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

def analyze(df_train, df_test, ham_type):
    '''
    description: analyze based on "easy_ham" for q3 or "hard_ham" for q4
    param {*} df_train: training set
    param {*} df_test: test set
    param {*} ham_type: "easy_ham" for q3 or "hard_ham" for q4
    return {*} no return, but print acc, precision, recall, and confusion matrix
    '''
    cv = CountVectorizer()
    le = LabelEncoder()

    # Fit the vocabulary and the label mapping on the training set only
    X_train, X_test = cv.fit_transform(df_train['BOW']), cv.transform(df_test['BOW'])
    y_train, y_test = le.fit_transform(df_train['Category']), le.transform(df_test['Category'])

    def train_and_test(classifier, name):
        '''
        description: train and test using the specified classifier
        param {*} classifier: BernoulliNB or MultinomialNB
        param {*} name: name of the classifier
        return {*} no return, but print acc, precision, recall, and confusion matrix
        '''
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        # Map encoded labels back to category names; "spam" is the positive class
        y_test_inv = le.inverse_transform(y_test)
        y_pred_inv = le.inverse_transform(y_pred)
        tp = ((y_test_inv == 'spam') & (y_pred_inv == 'spam')).sum()
        fp = ((y_test_inv == ham_type) & (y_pred_inv == 'spam')).sum()
        fn = ((y_test_inv == 'spam') & (y_pred_inv == ham_type)).sum()
        tn = ((y_test_inv == ham_type) & (y_pred_inv == ham_type)).sum()
        acc = (tp + tn) / (tp + fp + tn + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        print(f"{name}: acc:{acc},precision:{precision},recall:{recall}")
        print(f"confusion matrix:\n tp: {tp}, fn: {fn} \n fp: {fp}, tn: {tn} \n")

    train_and_test(BernoulliNB(), "BernoulliNB")

    train_and_test(MultinomialNB(), "MultinomialNB")


SEED = 1919810

df_train, df_test = train_test_split(df[df['Category'].isin(['easy_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'easy_ham')

BernoulliNB: acc:0.9043250327653998,precision:0.9166666666666666,recall:0.44715447154471544
confusion matrix:
 tp: 55, fn: 68 
 fp: 5, tn: 635 

MultinomialNB: acc:0.9672346002621232,precision:0.9803921568627451,recall:0.8130081300813008
confusion matrix:
 tp: 100, fn: 23 
 fp: 2, tn: 638 
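As a check on how the printed metrics follow from the confusion matrix, the sketch below recomputes them from the BernoulliNB counts above, with spam as the positive class. The helper name `metrics` is hypothetical; the formulas are the same ones used in `train_and_test`.

```python
def metrics(tp: int, fn: int, fp: int, tn: int) -> tuple:
    """Return (accuracy, precision, recall) with spam as the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Counts reported above for BernoulliNB on easy ham vs. spam.
acc, precision, recall = metrics(tp=55, fn=68, fp=5, tn=635)
print(f"acc={acc:.4f}, precision={precision:.4f}, recall={recall:.4f}")
```

The low recall comes directly from the 68 spam emails predicted as ham out of 123 spam emails in the test set.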



# Problem 4: Hard Ham

In [69]:
df_train, df_test = train_test_split(df[df['Category'].isin(['hard_ham', 'spam'])], random_state=SEED)

analyze(df_train, df_test, 'hard_ham')

BernoulliNB: acc:0.8882978723404256,precision:0.8642857142857143,recall:0.983739837398374
confusion matrix:
 tp: 121, fn: 2 
 fp: 19, tn: 46 

MultinomialNB: acc:0.9361702127659575,precision:0.9172932330827067,recall:0.991869918699187
confusion matrix:
 tp: 122, fn: 1 
 fp: 11, tn: 54 
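One way to compare the easy-ham and hard-ham runs is the share of ham emails flagged as spam, fp / (fp + tn). This quantity is not printed by `analyze()`; the sketch below derives it from the confusion matrices reported above, and the names `ham_flagged_rate` and `runs` are hypothetical.

```python
def ham_flagged_rate(fp: int, tn: int) -> float:
    """Share of ham emails in the test set that were predicted as spam."""
    return fp / (fp + tn)


# (fp, tn) pairs taken from the confusion matrices printed in Problems 3 and 4.
runs = {
    "easy_ham / BernoulliNB": (5, 635),
    "easy_ham / MultinomialNB": (2, 638),
    "hard_ham / BernoulliNB": (19, 46),
    "hard_ham / MultinomialNB": (11, 54),
}

for name, (fp, tn) in runs.items():
    print(f"{name}: {ham_flagged_rate(fp, tn):.4f}")
```

On these counts, hard ham is flagged as spam far more often than easy ham for both classifiers, in line with the observation in Problem 1 that hard ham resembles spam.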

