In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
data = pd.read_csv("emails.csv").to_numpy()
corpus = data[:,0]  # first column: raw mail text
y = data[:,1]       # second column: label (1 = spam, 0 = ham)

In [3]:
vectorizer = CountVectorizer(analyzer = "word", ngram_range = (1,1)) # only consider unigrams
X = vectorizer.fit_transform(corpus).toarray()  # dense count matrix: N mails x M distinct words
N, M = X.shape

In [4]:
freq_words = np.zeros((M, 2))  # per-word counts: column 0 = spam, column 1 = ham
spam_idx, ham_idx = y == 1, y == 0

print("There are a total of %d spam and %d ham mails" % (X[spam_idx].shape[0], X[ham_idx].shape[0]))
print(X[spam_idx].shape, X[ham_idx].shape)

There are a total of 1368 spam and 4360 ham mails
(1368, 37303) (4360, 37303)


In [5]:
spam_num, ham_num = X[spam_idx].sum(axis=0), X[ham_idx].sum(axis=0)
freq_words[:,0] += spam_num
freq_words[:,1] += ham_num

print("Number of words only seen in spam mails:", np.sum(freq_words[:,1] == 0))
print("Number of words only seen in ham mails:", np.sum(freq_words[:,0] == 0))

Number of words only seen in spam mails: 10229
Number of words only seen in ham mails: 18529


Some statistics about the data:
1. There are a total of **37303** distinct words in the dataset and **5728** mails.
2. Of these words, **10229** are seen only in spam mails and **18529** are seen only in ham mails.

In [6]:
total_counts = X.sum(axis=0) # N: total usage of each word
ratios_s = freq_words[:,0] / total_counts # total times in spam / total usage of the word
ratios_h = freq_words[:,1] / total_counts # total times in ham / total usage of the word
idx_s_rats = np.argsort(ratios_s)[::-1]
idx_h_rats = np.argsort(ratios_h)[::-1]
words = vectorizer.get_feature_names_out()


print("#### Words with highest R_s ####")
count = 0
print("Word","R_s","N_s","N", sep="\t")
for i in idx_s_rats:
    if X[:,i].sum() > 100:       
        print(words[i], "%.2f" % ratios_s[i], freq_words[i,0], X[:,i].sum(), sep='\t')
        count += 1
        if count == 10:
            break

print("#### Words with highest R_h ####")
count = 0
print("Word","R_h","N_h","N", sep="\t")
for i in idx_h_rats:
    if ratios_h[i] < 0.99 and X[:,i].sum() > 500:       
        print(words[i], "%.2f" % ratios_h[i], freq_words[i,1], X[:,i].sum(), sep='\t')
        count += 1
        if count == 10:
            break

#### Words with highest R_s ####
Word	R_s	N_s	N
projecthoneypot	1.00	110.0	110
viagra	1.00	174.0	174
stationery	1.00	120.0	120
2005	0.99	374.0	379
engines	0.97	112.0	115
advertisement	0.97	102.0	105
adobe	0.97	462.0	476
jul	0.96	162.0	168
2004	0.95	169.0	177
grants	0.95	110.0	116
#### Words with highest R_h ####
Word	R_h	N_h	N
na	0.99	616.0	623
model	0.99	1287.0	1306
attached	0.98	898.0	912
schedule	0.98	637.0	647
option	0.98	561.0	570
london	0.98	828.0	843
09	0.98	1085.0	1105
john	0.98	1016.0	1035
summer	0.98	617.0	629
08	0.98	1192.0	1216


We compare the words by their spam ratio $R_s$ and ham ratio $R_h$, defined as follows:<br>
<br>
$$\large R_s = N_s / N,\ R_h = N_h / N $$<br>
where:
- $N_s$ is the number of occurrences of the word in spam mails.
- $N_h$ is the number of occurrences of the word in ham mails.
- $N$ is the total number of occurrences of the word.<br>
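
As a small illustration of these definitions, the sketch below computes $R_s$ and $R_h$ on a toy count matrix. The names `X_toy` and `y_toy` and all of the counts are made up for this example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical document-term matrix: rows are mails, columns are words.
X_toy = np.array([
    [2, 0, 1],  # spam mail
    [1, 0, 0],  # spam mail
    [0, 3, 1],  # ham mail
    [0, 1, 2],  # ham mail
])
y_toy = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

n_spam = X_toy[y_toy == 1].sum(axis=0)  # N_s for each word
n_ham = X_toy[y_toy == 0].sum(axis=0)   # N_h for each word
n_total = X_toy.sum(axis=0)             # N for each word

r_s = n_spam / n_total  # values: 1.0, 0.0, 0.25
r_h = n_ham / n_total   # values: 0.0, 1.0, 0.75
print(r_s, r_h)
```

Because every occurrence is in either a spam or a ham mail, $N_s + N_h = N$ and therefore $R_s + R_h = 1$ for every word.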

In the cell above, we print the 10 words with the highest $R_s$ among those with $N > 100$, and the 10 words with the highest $R_h$ among those with $N > 500$ and $R_h < 0.99$. We select three of these words and inspect their statistics:
1. **viagra**: In this dataset, all the mails that include "viagra" are **spam**, since $R_s = 1.0$. Even though $N$ is fairly small (174), we know from prior experience that mails of this type are usually spam.
2. **adobe**: In this dataset, almost all occurrences of "adobe" are in spam mails, with $R_s = 0.97$. Furthermore, because $N = 476$ and $N_s = 462$ are both fairly high, we can conclude that this word provides a useful distinction between the two types of mail.
3. **schedule**: In this dataset, almost all occurrences of "schedule" are in ham mails, with $R_h = 0.98$. This matches prior experience: mails that mention a schedule are usually not spam.

We can conclude that it is feasible to label mails as spam or ham by looking at the words they contain. However, this dataset has a few drawbacks:
1. Some words have a high $R_s$ but a low $N$ (< 100). A ratio estimated from so few occurrences is unreliable and can result in a **biased prediction**.
2. Some words consist only of digits (09, 08, 2005, 2004) and should not tell us much about the type of a mail. However, in this dataset some of them have a high $R_s$ or $R_h$.
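
Both drawbacks can be illustrated with a short sketch. The notebook does not apply either step; they are shown only as possible mitigations. The helpers `smoothed_spam_ratio` and `drop_numeric_tokens` are hypothetical names, the additive smoothing constant `alpha` is an assumption, and the rare word with $N = 3$ is made up. The "viagra" counts come from the output table above.

```python
def smoothed_spam_ratio(n_spam, n_total, alpha=1.0):
    """Additive-smoothed spam ratio: (N_s + alpha) / (N + 2 * alpha)."""
    return (n_spam + alpha) / (n_total + 2 * alpha)


def drop_numeric_tokens(tokens):
    """Keep only tokens that are not made up entirely of digits."""
    return [t for t in tokens if not t.isdigit()]


# Drawback 1: the raw ratio is 1.0 for both words, but smoothing
# separates the frequent word from the rare one.
frequent = smoothed_spam_ratio(174, 174)  # "viagra": stays close to 1
rare = smoothed_spam_ratio(3, 3)          # hypothetical rare word: pulled toward 0.5
print(frequent, rare)

# Drawback 2: digit-only tokens are removed before any ratio is computed.
print(drop_numeric_tokens(["09", "2005", "viagra", "adobe", "08"]))
```

With `alpha = 1`, a word seen three times and only in spam gets a ratio of 0.8 instead of 1.0, so it no longer ranks alongside words that are backed by many more occurrences.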