# DAT565 Introduction to Data Science and AI 
## 2023-2024, LP1
## Assignment 4: Spam classification using Naïve Bayes 
This assignment has three obligatory questions. Questions 4-5 are optional and will not be graded.

The exercise takes place in this notebook environment, where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 

*Tips:* 
* You can execute certain Linux shell commands by prefixing the command with a `!`. 
* You can insert Markdown cells and code cells. The first you can use for documenting and explaining your results, the second you can use to write code snippets that execute the tasks required.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
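
As a reminder of what such a classifier computes, the standard Naïve Bayes decision rule can be written as below. The notation is ours for illustration and does not come from the assignment text: $c$ is a class, $w_i$ is the $i$-th word of a message with $n$ words, and the probabilities are estimated from the training sets (scikit-learn applies smoothing by default).

$$
\hat{c} = \arg\max_{c \in \{\text{spam},\,\text{ham}\}} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
$$

The "naïve" part is the assumption that words are conditionally independent given the class, which is what lets the likelihood factor into a sum of per-word terms.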

You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** 

If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. Note that if you are using Windows, you can instead use [7zip](https://www.7-zip.org/download.html) to decompress the data (you will have to modify the cell below).
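
If `wget` and `tar` are not available (for example on Windows), one possible alternative is to do the same steps with Python's standard library. This is a minimal sketch, not part of the official instructions: the helper names `extract_archive` and `download_and_extract` are hypothetical, and the URLs are the ones used in the download cell.

```python
import tarfile
import urllib.request
from pathlib import Path

# Base URL and archive names taken from the download cell of this notebook.
BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]


def extract_archive(archive_path, dest="."):
    """Unpack a .tar.bz2 archive into dest (roughly what `tar -xjf` does)."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest)


def download_and_extract(dest="."):
    """Download each archive (skipping files already present) and unpack it."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for name in ARCHIVES:
        target = dest / name
        if not target.exists():
            urllib.request.urlretrieve(BASE_URL + name, target)
        extract_archive(target, dest)


# Uncomment to run inside the notebook:
# download_and_extract()
```

The call is left commented out so that the cell does not hit the network unless you ask it to.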

**What to submit:** 
* Convert the notebook to a PDF file by compiling it, and submit the PDF file. 
* Make sure all cells are executed so all your code and its results are included. 
* Double-check that the PDF displays correctly before you submit it.

In [2]:
# download and extract the data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2023-09-21 14:30:35--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 2a04:4e42::644, 151.101.2.132
Connecting to spamassassin.apache.org (spamassassin.apache.org)|2a04:4e42::644|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2’


2023-09-21 14:30:35 (15.9 MB/s) - ‘20021010_easy_ham.tar.bz2’ saved [1677144/1677144]

--2023-09-21 14:30:35--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 2a04:4e42::644, 151.101.2.132
Connecting to spamassassin.apache.org (spamassassin.apache.org)|2a04:4e42::644|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2’


2023-09-21 14:30:36 (3.27 MB/s) - ‘20021010_hard_ham.tar.bz2’ sav

The data is now in the following three folders: `easy_ham`, `hard_ham`, and `spam`. You can confirm this via the following command:

In [3]:
!ls -lah

total 7640
drwxr-xr-x    10 yuchuan.dong  staff   320B Sep 21 14:30 .
drwxr-xr-x    11 yuchuan.dong  staff   352B Sep 21 14:24 ..
drwxr-xr-x     3 yuchuan.dong  staff    96B Sep 21 14:25 .ipynb_checkpoints
-rw-r--r--     1 yuchuan.dong  staff   1.6M Jun 29  2004 20021010_easy_ham.tar.bz2
-rw-r--r--     1 yuchuan.dong  staff   997K Dec 16  2004 20021010_hard_ham.tar.bz2
-rw-r--r--     1 yuchuan.dong  staff   1.1M Jun 29  2004 20021010_spam.tar.bz2
-rw-r--r--@    1 yuchuan.dong  staff   8.5K Sep 21 14:29 assignment-4.ipynb
drwx--x--x  2553 yuchuan.dong  staff    80K Oct 10  2002 easy_ham
drwx--x--x   252 yuchuan.dong  staff   7.9K Dec 16  2004 hard_ham
drwxr-xr-x   503 yuchuan.dong  staff    16K Oct 10  2002 spam
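
If shell commands are not available in your environment, a rough Python equivalent of this check is sketched below. The helper name `count_files` is hypothetical; the folder names are the three listed above.

```python
from pathlib import Path


def count_files(folder):
    """Return the number of regular files directly inside folder."""
    return sum(1 for entry in Path(folder).iterdir() if entry.is_file())


for name in ["easy_ham", "hard_ham", "spam"]:
    if Path(name).is_dir():
        print(name, count_files(name))
    else:
        print(name, "not found; re-run the download cell")
```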


### 1. Preprocessing: 
Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text (in the optional part further down, you can experiment with filtering out the headers and footers). 
1.	We don’t want to train and test on the same data (it might help to reflect on **why**, if you don't recall). Split the spam and ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`). Use `easy_ham` for questions 1 and 2.


In [1]:
# write your code here
import os
from sklearn.model_selection import train_test_split

easy_ham_file = "./easy_ham"
hard_ham_file = "./hard_ham"
spam_file = "./spam"

def read_emails(directory):
    """Read every file in `directory` and return the raw texts as a list."""
    emails = []
    # Sort the listing so the split below is reproducible across machines.
    for filename in sorted(os.listdir(directory)):
        # latin-1 maps every byte to a character, so decoding never fails.
        with open(os.path.join(directory, filename), "r", encoding="latin-1") as file:
            emails.append(file.read())
    return emails

EasyHamEmail = read_emails(easy_ham_file)
HardHamEmail = read_emails(hard_ham_file)
SpamEmail = read_emails(spam_file)

# Hold out 20% of each class for testing; the fixed seed makes the split repeatable.
EasyHam_Train, EasyHam_Test = train_test_split(EasyHamEmail, test_size=0.2, random_state=42)
Spam_Train, Spam_Test = train_test_split(SpamEmail, test_size=0.2, random_state=42)

print("Ham Training Set Size:", len(EasyHam_Train))
print("Ham Test Set Size:", len(EasyHam_Test))
print("Spam Training Set Size:", len(Spam_Train))
print("Spam Test Set Size:", len(Spam_Test))

### 2. Write a Python program that: 
1.	Uses the four datasets from Question 1 (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).
2.	Trains a Naïve Bayes classifier (use the [scikit-learn library](https://scikit-learn.org/stable/)) on `hamtrain` and `spamtrain`, classifies the test sets, and reports the True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` ([documentation here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)) to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifiers available in *scikit-learn* ([documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Here, you will test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes
- Bernoulli Naive Bayes.

Please inspect the documentation to ensure input to the classifiers is appropriate before you start coding. You may have to modify your input.
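
To make the expected pipeline concrete, here is a minimal sketch on a handful of made-up sentences rather than the real emails. Everything in it beyond the scikit-learn classes is an assumption for illustration: the toy data, the helper name `spam_rates`, and the choice to treat spam as the positive class (the assignment does not fix which class is "positive"). Passing `binary=True` to `CountVectorizer` is one way to give `BernoulliNB` the presence/absence features it models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data standing in for the four sets from Question 1.
hamtrain = [
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch tomorrow with the team",
]
spamtrain = [
    "win free money now",
    "free prize click now",
    "cheap money offer click here",
]
hamtest = ["review the report before the meeting"]
spamtest = ["click now to win a free prize"]


def spam_rates(model, binary):
    """Fit model on the toy training sets; return TPR and FNR with spam as positive."""
    vectorizer = CountVectorizer(binary=binary)
    X_train = vectorizer.fit_transform(hamtrain + spamtrain)
    y_train = [0] * len(hamtrain) + [1] * len(spamtrain)  # 0 = ham, 1 = spam
    model.fit(X_train, y_train)

    # Reuse the vocabulary learned on the training data; never refit on test data.
    X_test = vectorizer.transform(hamtest + spamtest)
    y_test = [0] * len(hamtest) + [1] * len(spamtest)
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {"tpr": tp / (tp + fn), "fnr": fn / (tp + fn)}


multinomial_rates = spam_rates(MultinomialNB(), binary=False)
bernoulli_rates = spam_rates(BernoulliNB(), binary=True)
print("Multinomial:", multinomial_rates)
print("Bernoulli:  ", bernoulli_rates)
```

With spam as the positive class, the two rates always sum to 1, so reporting the False Positive rate on ham alongside them may give a fuller picture.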

In [None]:
# write your code here

### 3. Run on hard ham:
Run the two models from Question 2 on `spam` versus `hard-ham`, and compare to the `easy-ham` results.

In [None]:
# code to report results here

### 4.	OPTIONAL - NOT MARKED: 
To avoid classification based on common and uninformative words, it is common practice to filter these out. 

**a.** Think about why this may be useful. Show a few examples of too common and too uncommon words. 

**b.** Use the parameters in *scikit-learn*’s `CountVectorizer` to filter out these words. Update the program from Question 2 and run it on `easy-ham` vs `spam` and `hard-ham` vs `spam`. Report your results.
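
As a hedged illustration of how such filtering behaves, the sketch below uses the `max_df` and `min_df` parameters of `CountVectorizer` on four made-up sentences. The thresholds are arbitrary choices for this toy corpus, not recommended values; `stop_words="english"` is another built-in option you could try.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "the" and "is" occur in every document,
# "free" in two of them, and every other word in exactly one.
docs = [
    "the offer is free",
    "the meeting is today",
    "the report is attached",
    "the free prize is here",
]

full = CountVectorizer().fit(docs)
# Drop words found in more than 75% of documents or in fewer than 2 documents.
filtered = CountVectorizer(max_df=0.75, min_df=2).fit(docs)

full_vocab = sorted(full.vocabulary_)
kept_vocab = sorted(filtered.vocabulary_)
dropped = sorted(set(full_vocab) - set(kept_vocab))
print("kept:   ", kept_vocab)
print("dropped:", dropped)
```

Comparing the two vocabularies in this way is also a quick method for listing examples of too common and too uncommon words for part **a**.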

In [None]:
# write your code here

### 5. OPTIONAL - NOT MARKED: Further improving performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions: 
- Do the results improve on those obtained in Questions 3 and 4? 
- What do you expect would happen if your training set consisted mostly of spam messages, while your test set consisted mostly of ham messages, or vice versa? 
- Look at the `fit_prior` parameter. What does this parameter mean? Discuss in what settings it can be helpful (you can also test your hypothesis). 
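
Two small sketches may help as starting points; both rest on stated assumptions. First, `strip_headers` (a hypothetical helper) assumes the header block ends at the first blank line, which holds for well-formed emails but not necessarily for every file in the corpus, and it makes no attempt to remove footers. Second, the toy training set is deliberately imbalanced to show what `fit_prior` changes in `MultinomialNB`: the class priors are either learned from the class frequencies or set to uniform.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def strip_headers(raw_email):
    """Return the text after the first blank line, i.e. drop the header block."""
    normalized = raw_email.replace("\r\n", "\n")
    _, separator, body = normalized.partition("\n\n")
    # If no blank line is found, keep the text unchanged rather than lose it.
    return body if separator else normalized


raw = "From: a@example.com\nSubject: hello\n\nWin a free prize now"
print(strip_headers(raw))

# Imbalanced toy training set: three ham messages (label 0), one spam (label 1).
texts = ["meeting at noon", "see the report", "lunch tomorrow", "free prize now"]
labels = [0, 0, 0, 1]
X = CountVectorizer().fit_transform(texts)

learned = MultinomialNB(fit_prior=True).fit(X, labels)
uniform = MultinomialNB(fit_prior=False).fit(X, labels)

# class_log_prior_ stores log P(class); exponentiate to read the priors directly.
learned_prior = np.exp(learned.class_log_prior_)
uniform_prior = np.exp(uniform.class_log_prior_)
print("fit_prior=True: ", learned_prior)
print("fit_prior=False:", uniform_prior)
```

A learned prior pulls predictions toward whichever class dominated the training set, which is worth keeping in mind when the class balance of the test set differs.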

In [None]:
# write your code here