# Yaseen Haffejee: 1827555 
# Spam Classification Using Naive Bayes

# Importing Libraries 

In [1]:
pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score,recall_score,f1_score
from collections import Counter
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yasee\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\yasee\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

# Helper Functions


In [3]:
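def get_metrics(model_output,true_output):
    # NOTE: scikit-learn defines these metrics as metric(y_true, y_pred).
    # With the predictions passed first, as here, accuracy and F1 are unaffected,
    # but the values printed as precision and recall are exchanged.
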
    accuracy = np.round(accuracy_score(model_output,true_output),3)
    precision = np.round(precision_score(model_output,true_output),3)
    recall = np.round(recall_score(model_output,true_output),3)
    f1 = np.round(f1_score(model_output,true_output),3)

    print(f"The accuracy score is {accuracy}.")
    print(f"The precision score is {precision}.")
    print(f"The recall score is {recall}.")
    print(f"The f1 score is {f1}.")

    return [accuracy,precision,recall,f1]
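
The order of the two arguments matters for precision and recall. The sketch below uses made-up labels and two hypothetical pure-Python helpers, `precision` and `recall`, rather than scikit-learn itself. It shows that calling them with the predictions first exchanges the two values.

```python
def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    return tp / predicted_pos if predicted_pos else 0.0


def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): share of actual positives that were found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return tp / actual_pos if actual_pos else 0.0


# Made-up labels for illustration only: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]

print("precision(y_true, y_pred):", precision(y_true, y_pred))
print("recall(y_true, y_pred):   ", recall(y_true, y_pred))
# Passing the predictions first exchanges the two metrics
print("precision(y_pred, y_true):", precision(y_pred, y_true))
print("recall(y_pred, y_true):   ", recall(y_pred, y_true))
```

Accuracy and F1 are symmetric in their two arguments, so they are not affected by the order.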

# Tasks

## Task 1

### 1. Preprocess the data, tokenize the text, and get a list of words for each sentence.

In [4]:
train = pd.read_csv("./project_1_data/train.csv")

In [5]:
train.head(10)['text']

0    Subject: thank you\r\nami and daren , , , ,\r\...
1    Subject: spot or firm tickets\r\nvance ,\r\nth...
2    Subject: software\r\nmicrosoft windows xp prof...
3    Subject: noms / actual flow for 2 / 27\r\nwe a...
4    Subject: superb so . ftware\r\nyoull discover ...
5    Subject: hpl nomination changes for july 25 an...
6    Subject: dear customer your details have been ...
7    Subject: woww . . 8 o - % off abazis\r\nthe lo...
8    Subject: 3 / 1 / 2000 noms\r\neffective 3 / 1 ...
9    Subject: hl & p\r\ndaren - also , the deal mig...
Name: text, dtype: object

- We can see that every new line is denoted by a '\r\n' combination. We can thus replace these with a space.
- For standardisation, we will convert all emails into lowercase.

In [6]:
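def preprocess(text):
    # Entries that were read as integers cannot be tokenized
    if(type(text) is int):
        return None

    # Convert to lower case
    text = text.lower()

    # Replace every character that is not a letter, digit or newline with a space
    text = re.sub('[^a-zA-Z0-9\n]', ' ', text)
    # Split into a list of words (this also discards the remaining newlines)
    list_of_words = text.split()
    # Reduce each word to its WordNet lemma, e.g. "tickets" -> "ticket"
    lemmatizer = WordNetLemmatizer()
    list_of_words = [lemmatizer.lemmatize(word) for word in list_of_words]
    return list_of_words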
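
To see what the cleaning steps do to a single email, here is a minimal sketch on a shortened sample. The helper name `clean_and_split` is hypothetical, and the lemmatization step is left out so that the example runs without the NLTK data.

```python
import re


def clean_and_split(text):
    """Lower-case, replace non-alphanumeric characters with spaces, split on whitespace."""
    text = text.lower()
    # After lower-casing, only a-z, 0-9 and newlines survive this substitution
    text = re.sub(r"[^a-z0-9\n]", " ", text)
    # split() without arguments splits on any whitespace, including "\n"
    return text.split()


sample = "Subject: thank you\r\nami and daren , , , ,"
tokens = clean_and_split(sample)
print(tokens)
```

The punctuation and the `\r\n` line break disappear, leaving only the word tokens.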

In [7]:
train['filtered_text'] = train['text'].apply(lambda email: preprocess(email))

In [8]:
train.head(10)

Unnamed: 0,text,label,filtered_text
0,"Subject: thank you\r\nami and daren , , , ,\r\...",0,"[subject, thank, you, ami, and, daren, just, w..."
1,"Subject: spot or firm tickets\r\nvance ,\r\nth...",0,"[subject, spot, or, firm, ticket, vance, the, ..."
2,Subject: software\r\nmicrosoft windows xp prof...,1,"[subject, software, microsoft, window, xp, pro..."
3,Subject: noms / actual flow for 2 / 27\r\nwe a...,0,"[subject, noms, actual, flow, for, 2, 27, we, ..."
4,Subject: superb so . ftware\r\nyoull discover ...,1,"[subject, superb, so, ftware, youll, discover,..."
5,Subject: hpl nomination changes for july 25 an...,0,"[subject, hpl, nomination, change, for, july, ..."
6,Subject: dear customer your details have been ...,1,"[subject, dear, customer, your, detail, have, ..."
7,Subject: woww . . 8 o - % off abazis\r\nthe lo...,1,"[subject, woww, 8, o, off, abazis, the, lowest..."
8,Subject: 3 / 1 / 2000 noms\r\neffective 3 / 1 ...,0,"[subject, 3, 1, 2000, noms, effective, 3, 1, 2..."
9,"Subject: hl & p\r\ndaren - also , the deal mig...",0,"[subject, hl, p, daren, also, the, deal, might..."


#### Extracting the vocabulary from the corpus for personal testing

- `vocabulary` is a dictionary sorted in descending order by frequency.
- We can use it to test whether the features extracted later are correct.

In [9]:
vocabulary = []

for email in train.filtered_text:
    vocabulary.extend(email)

## Getting a count of each word in the vocabulary
vocabulary = Counter(vocabulary)
## Sorting the words in the vocabulary by their frequency in descending order
vocabulary = dict(sorted(vocabulary.items(), key=lambda item: item[1],reverse=True))
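
As a side note, `Counter.most_common()` returns the same descending order as the explicit `sorted` call. The sketch below checks this on a toy corpus; the token lists are made up for illustration.

```python
from collections import Counter

# Toy token lists standing in for train.filtered_text
toy_emails = [
    ["free", "offer", "free"],
    ["meeting", "offer"],
    ["free", "meeting", "notes"],
]

counts = Counter(token for email in toy_emails for token in email)

# Explicit sort, as in the cell above
by_sorted = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
# Built-in equivalent: ties keep the order in which the words were first seen
by_most_common = dict(counts.most_common())

top_2 = [word for word, _ in counts.most_common(2)]
print(by_sorted)
print(top_2)
```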

## Task 2
### 2. Train a standard Naive Bayes model. Call this Model1

 - Naive Bayes cannot operate on raw strings; it needs numeric features.
 - The features we extract are the word frequencies of each email.
 - We therefore use `CountVectorizer` to get a sparse matrix of shape (n_samples x n_features).
 - n_features is the number of words used as features, which changes in the subtasks below.
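
To make the shape of the feature matrix concrete, the sketch below builds a small dense count matrix by hand. It is a pure-Python illustration of the idea, not `CountVectorizer`'s implementation. The helper `build_count_matrix` and the toy documents are hypothetical.

```python
from collections import Counter


def build_count_matrix(token_lists):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in document i."""
    vocab = sorted({tok for doc in token_lists for tok in doc})
    column = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for doc in token_lists:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[column[tok]] = n
        matrix.append(row)
    return vocab, matrix


docs = [["free", "offer", "free"], ["meeting", "note"]]
vocab, X_toy = build_count_matrix(docs)
print(vocab)   # one column per distinct word
print(X_toy)   # one row per document
```

Here n_samples is 2 and n_features is 4; the real matrix is stored in sparse form because most entries are zero.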

In [10]:
cv = CountVectorizer(analyzer=lambda x: x)
X = cv.fit_transform(train.filtered_text)

In [11]:
Model1 = MultinomialNB()
Model1.fit(X,train.label)
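
For reference, the decision rule of a multinomial Naive Bayes classifier can be written as follows. Here $x_w$ is the count of word $w$ in the email, $N_{cw}$ is the total count of $w$ over the training emails of class $c$, $N_c = \sum_w N_{cw}$, $|V|$ is the number of features and $\alpha$ is the smoothing parameter (`MultinomialNB()` uses `alpha=1.0` by default).

```latex
\hat{c} = \arg\max_{c \in \{\text{ham},\,\text{spam}\}}
          \left[ \log P(c) + \sum_{w \in V} x_w \log \hat{\theta}_{cw} \right],
\qquad
\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_c + \alpha\,|V|}
```

The smoothing term keeps $\hat{\theta}_{cw}$ strictly positive, so a word that never appears in one class during training does not drive that class's score to $-\infty$.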

## Task 3
### 3. Train 4 additional Naïve Bayes models with the following variations:

#### a. Use only the 10 most frequent words as features. Call this Model2 

In [12]:
## Change the count vectorizer to keep only the 10 most frequent words as features
CV_top_10 = CountVectorizer(analyzer=lambda x: x, max_features=10 )
X_top_10 = CV_top_10.fit_transform(train.filtered_text)
assert X_top_10.shape[1] == 10, "Incorrect Number of features extracted"
assert sorted(CV_top_10.get_feature_names_out()) == sorted(list(vocabulary.keys())[:10]), "Incorrect features are being extracted"
print("Correct Number of features extracted: Passed !")

Correct Number of features extracted: Passed !


In [13]:
Model2 = MultinomialNB()
Model2.fit(X_top_10,train.label)

#### b. Use only the 100 most frequent words as features. Call this Model3

In [14]:
## Change the count vectorizer to keep only the 100 most frequent words as features
CV_top_100 = CountVectorizer(analyzer=lambda x: x, max_features=100 )
X_top_100 = CV_top_100.fit_transform(train.filtered_text)
assert X_top_100.shape[1] == 100, "Incorrect Number of features extracted"
assert sorted(CV_top_100.get_feature_names_out()) == sorted(list(vocabulary.keys())[:100]), "Incorrect features are being extracted"
print("Correct Number of features extracted: Passed !")

Correct Number of features extracted: Passed !


In [15]:
Model3 = MultinomialNB()
Model3.fit(X_top_100,train.label)

#### c. Remove the 100 most frequent words from the features. Call this Model4

- In order to remove the top 100 features, we can simply make sure the vocabulary passed to the vectorizer excludes these words.
- Consequently, the 100 most frequent words are no longer counted as features.

In [16]:
new_vocab_excluding_top_100 = list(vocabulary.keys())[100:]
CV = CountVectorizer(analyzer=lambda x: x,vocabulary=new_vocab_excluding_top_100)
X_remove_top_100 = CV.fit_transform(train.filtered_text)
assert X_remove_top_100.shape[1] == X.shape[1]-100, "Incorrect Number of features extracted"
assert sorted(CV.get_feature_names_out()) == sorted(list(vocabulary.keys())[100:]), "Incorrect features are being extracted"
print("Correct Number of features extracted: Passed !")

Correct Number of features extracted: Passed !


In [17]:
Model4 = MultinomialNB()
Model4.fit(X_remove_top_100,train.label)

#### d. Use only the subject line (see the data) as the feature set. Call this Model5

- We need to extract the subject line from each email.

In [18]:
train.text.head(5)

0    Subject: thank you\r\nami and daren , , , ,\r\...
1    Subject: spot or firm tickets\r\nvance ,\r\nth...
2    Subject: software\r\nmicrosoft windows xp prof...
3    Subject: noms / actual flow for 2 / 27\r\nwe a...
4    Subject: superb so . ftware\r\nyoull discover ...
Name: text, dtype: object

- We can see that the first occurrence of "\r\n" marks the end of the Subject line. We just need to keep everything before it and strip the "Subject: " prefix.

In [19]:
def extract_subject_line(email):
    subject = email.split("\r\n")[0]
    subject = re.sub("Subject: ",'',subject)
    return subject
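
Note that `re.sub("Subject: ", '', subject)` removes every occurrence of "Subject: " in the first line, not only the leading one. If only the prefix should be stripped, a variant along these lines could be used. This is a sketch; `extract_subject_prefix_only` is a hypothetical name and the sample emails are made up.

```python
def extract_subject_prefix_only(email, prefix="Subject: "):
    """Return the first line of the email without its leading 'Subject: ' prefix."""
    # maxsplit=1: stop splitting after the first line break
    first_line = email.split("\r\n", 1)[0]
    if first_line.startswith(prefix):
        first_line = first_line[len(prefix):]
    return first_line


print(extract_subject_prefix_only("Subject: thank you\r\nami and daren , , , ,"))
# A later "Subject: " inside the subject line is left untouched
print(extract_subject_prefix_only("Subject: re: Subject: lunch\r\nsee you at noon"))
```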

In [20]:
train['subject_line'] = train['text'].apply(lambda x: extract_subject_line(x))

In [21]:
train.subject_line.head(5)

0                        thank you
1             spot or firm tickets
2                         software
3    noms / actual flow for 2 / 27
4               superb so . ftware
Name: subject_line, dtype: object

In [22]:
cv_subject = CountVectorizer(analyzer=lambda x:x)
X_Subject_Line = cv_subject.fit_transform(train.subject_line)

In [23]:
Model5 = MultinomialNB()
Model5.fit(X_Subject_Line,train.label)
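
One detail worth checking here: a callable `analyzer` receives each raw document, and the vectorizer counts whatever items it yields. `filtered_text` holds lists of tokens, so the identity analyzer yields words. `subject_line` holds plain strings, and iterating over a Python string yields single characters. The sketch below mimics this with `collections.Counter` only; it is an illustration of the behaviour, not the vectorizer's implementation.

```python
from collections import Counter

# Same shape as the analyzer passed to CountVectorizer above
identity = lambda x: x

subject_as_string = "thank you"
subject_as_tokens = subject_as_string.split()

# Iterating over a string yields characters ...
char_features = Counter(identity(subject_as_string))
# ... while iterating over a token list yields words
word_features = Counter(identity(subject_as_tokens))

print(sorted(char_features))
print(sorted(word_features))
```

If word-level features are intended for the subject line, one option would be to pass `subject_line` through `preprocess` before vectorizing. The metrics reported for Model5 would then need to be recomputed.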

## Task 4
### 4. Evaluate the performance of the first model and all 4 variations using the validation set. 

### a. Calculate the evaluation metrics

In [24]:
validation = pd.read_csv("./project_1_data/val.csv")

validation['filtered_text'] = validation['text'].apply(lambda x: preprocess(x))
validation['subject_line'] = validation['text'].apply(lambda x: extract_subject_line(x))

#### Evaluating Model 1.

In [25]:
X_Validation = cv.transform(validation.filtered_text)

Model1_valid_results = Model1.predict(X_Validation)

Model1_metrics = get_metrics(Model1_valid_results,validation.label)

The accuracy score is 0.973.
The precision score is 0.94.
The recall score is 0.966.
The f1 score is 0.953.


#### Evaluating Model 2.

In [26]:
X_Validation_Top_10 = CV_top_10.transform(validation.filtered_text)

Model2_valid_results = Model2.predict(X_Validation_Top_10)

Model2_metrics = get_metrics(Model2_valid_results,validation.label)

The accuracy score is 0.666.
The precision score is 0.553.
The recall score is 0.439.
The f1 score is 0.49.


#### Evaluating Model 3.

In [27]:
X_Validation_Top_100 = CV_top_100.transform(validation.filtered_text)

Model3_valid_results = Model3.predict(X_Validation_Top_100)

Model3_metrics = get_metrics(Model3_valid_results,validation.label)

The accuracy score is 0.851.
The precision score is 0.84.
The recall score is 0.704.
The f1 score is 0.766.


#### Evaluating Model 4.

In [28]:
X_Validation_Remove_Top_100 = CV.transform(validation.filtered_text)

Model4_valid_results = Model4.predict(X_Validation_Remove_Top_100)

Model4_metrics = get_metrics(Model4_valid_results,validation.label)

The accuracy score is 0.971.
The precision score is 0.93.
The recall score is 0.969.
The f1 score is 0.949.


#### Evaluating Model 5.

In [29]:
X_Validation_Subject_Line = cv_subject.transform(validation.subject_line)

Model5_valid_results = Model5.predict(X_Validation_Subject_Line)

Model5_metrics= get_metrics(Model5_valid_results,validation.label)

The accuracy score is 0.688.
The precision score is 0.733.
The recall score is 0.475.
The f1 score is 0.577.


### b. Compare all 5 models
#### Summarised Validation Results

In [30]:
## Prefix each metrics list with its model name, then build the table in one step
named_metrics = [("Model1",Model1_metrics),("Model2",Model2_metrics),("Model3",Model3_metrics),
                 ("Model4",Model4_metrics),("Model5",Model5_metrics)]
for name, metrics in named_metrics:
    metrics.insert(0,name)
validation_results = pd.DataFrame([metrics for _, metrics in named_metrics],
                                  columns=['Model','Accuracy','Precision','Recall','F1 Score'])
validation_results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Model1,0.973,0.94,0.966,0.953
1,Model2,0.666,0.553,0.439,0.49
2,Model3,0.851,0.84,0.704,0.766
3,Model4,0.971,0.93,0.969,0.949
4,Model5,0.688,0.733,0.475,0.577


- Looking at the performance of all the models, we can see that Model1 and Model4 are the best performers. Since the only difference between these models is the presence of the 100 most frequent words, we can conclude that these words have little influence on the classification: Model4 was trained without them and still achieved similar metrics.
<hr>
- Model2 is the worst performing model. It uses only 10 features, and those features are the 10 most frequent words. A word that occurs frequently in both spam and non-spam emails does little to differentiate between the two classes.
<hr>
- Model3 was trained using the 100 most frequent words as features. With more features, it performs much better than Model2. However, since these features are frequent in all emails, the performance of the model remains limited. The improvement may be driven by the less frequent words among the top 100, since they are likely to discriminate better between spam and non-spam emails.
<hr>
- Model5 was trained solely on the subject line. The model performed relatively poorly, since the subject line of a spam email seldom contains much information that marks it as spam: if the subject line looked suspicious, people would not open the email. This can be seen in the cell below, where I filtered for spam emails and displayed the extracted subject lines.

In [31]:
train[(train['label'] == 1)].subject_line.tail(10)

3074                        hi , great news ciais viagrra
3077                                weekend entertainment
3082            save up to 89 % on ink + no shipping cost
3083                      get ahead in life with the euro
3084    hi paliourg get all pills . everything for you...
3088           better pricing means more savings to you .
3093                       my site links to your site now
3095                  garth howard wanted you to get this
3097                                                     
3100                                    need legal help ?
Name: subject_line, dtype: object

### c. Make a recommendation about which model to use with reasons.

- Even though Model1 and Model4 have extremely similar performance on all the metrics, we opt for Model4 as the best model for the following reasons: <br>
<ol>
    <li> Since the objective is <strong>spam classification</strong>, we prefer models that have a lower chance of making a Type 2 error, i.e. we do not want to classify an email as not spam when it is in actual fact spam. Consequently, we prefer models with a higher recall, and Model4 has the highest recall.</li>
    <li> Model4 was also trained without the 100 most frequent words. This ensures that the model does not rely on these frequent words to make its classifications. Consequently, the presence of these frequently used words has no bearing on the classifications of Model4.</li>
</ol>


## Task 5
### 5. Now, evaluate all the models with the test set

### a. Calculate the evaluation metrics

In [32]:
test = pd.read_csv("./project_1_data/test.csv")

test['filtered_text'] = test['text'].apply(lambda x: preprocess(x))
test['subject_line'] = test['text'].apply(lambda x: extract_subject_line(x))

#### Evaluating Model 1.

In [33]:
X_Test = cv.transform(test.filtered_text)

Model1_test_results = Model1.predict(X_Test)

Model1_test_metrics = get_metrics(Model1_test_results,test.label)

The accuracy score is 0.969.
The precision score is 0.917.
The recall score is 0.975.
The f1 score is 0.945.


#### Evaluating Model 2.

In [34]:
X_Test_Top_10 = CV_top_10.transform(test.filtered_text)

Model2_test_results = Model2.predict(X_Test_Top_10)

Model2_test_metrics = get_metrics(Model2_test_results,test.label)

The accuracy score is 0.665.
The precision score is 0.573.
The recall score is 0.441.
The f1 score is 0.499.


#### Evaluating Model 3.

In [35]:
X_Test_Top_100 = CV_top_100.transform(test.filtered_text)

Model3_test_results = Model3.predict(X_Test_Top_100)

Model3_test_metrics = get_metrics(Model3_test_results,test.label)

The accuracy score is 0.86.
The precision score is 0.87.
The recall score is 0.711.
The f1 score is 0.783.


#### Evaluating Model 4.

In [36]:
X_Test_Remove_Top_100 = CV.transform(test.filtered_text)

Model4_test_results = Model4.predict(X_Test_Remove_Top_100)

Model4_test_metrics = get_metrics(Model4_test_results,test.label)

The accuracy score is 0.971.
The precision score is 0.917.
The recall score is 0.982.
The f1 score is 0.948.


#### Evaluating Model 5.

In [37]:
X_Test_Subject_Line = cv_subject.transform(test.subject_line)

Model5_test_results = Model5.predict(X_Test_Subject_Line)

Model5_test_metrics= get_metrics(Model5_test_results,test.label)

The accuracy score is 0.699.
The precision score is 0.773.
The recall score is 0.488.
The f1 score is 0.599.


### Summarised Test Results

In [38]:
## Prefix each metrics list with its model name, then build the table in one step
named_test_metrics = [("Model1",Model1_test_metrics),("Model2",Model2_test_metrics),("Model3",Model3_test_metrics),
                      ("Model4",Model4_test_metrics),("Model5",Model5_test_metrics)]
for name, metrics in named_test_metrics:
    metrics.insert(0,name)
test_results = pd.DataFrame([metrics for _, metrics in named_test_metrics],
                            columns=['Model','Accuracy','Precision','Recall','F1 Score'])
test_results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1 Score
0,Model1,0.969,0.917,0.975,0.945
1,Model2,0.665,0.573,0.441,0.499
2,Model3,0.86,0.87,0.711,0.783
3,Model4,0.971,0.917,0.982,0.948
4,Model5,0.699,0.773,0.488,0.599


### b. Is your recommendation (in 4c above) still valid? Explain

- Yes, Model4 is still the best performing model on the test set as well. This is because the model does not use the most frequent words to classify an email. Consequently, the presence or absence of these words has no bearing on its predictions, which increases the model's ability to generalise to unseen data.

## Task 6

### 6. Implement at least one more variation to improve the performance of the best model. For instance, use frequent n-grams as features. The improvement should be on at least one of the evaluation metrics. Clearly describe the improved model and explain how it is an improvement.


In [39]:
## Splitting the training data into spam emails and ham (not spam) emails
spam_training_data = train[(train['label'] == 1)]
ham_training_data = train[(train['label'] == 0)]

spam_vocab = []
for spam in spam_training_data.filtered_text:
    spam_vocab.extend(spam)

ham_vocab = []
for ham in ham_training_data.filtered_text:
    ham_vocab.extend(ham)
    
## Creating a vocabulary of all the words used in the spam emails 
spam_vocab = Counter(spam_vocab)
spam_vocab = dict(sorted(spam_vocab.items(), key=lambda item: item[1],reverse=True))

## Creating a vocabulary of all the words used in the ham emails 
ham_vocab = Counter(ham_vocab)
ham_vocab = dict(sorted(ham_vocab.items(), key=lambda item: item[1],reverse=True))

In [40]:
ham_vocab_words = set(ham_vocab.keys())
spam_vocab_words = set(spam_vocab.keys())
# shared_words = ham_vocab_words.intersection(spam_vocab_words)
not_shared_words = ham_vocab_words.difference(spam_vocab_words)

In [41]:
print(f"The total number of words that occur in Ham emails but not in Spam emails is {len(not_shared_words)}.")

The total number of words that occur in Ham emails but not in Spam emails is 9416.
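
Note that `ham_vocab_words.difference(spam_vocab_words)` yields the words that occur only in ham emails, whereas the commented-out `intersection` would yield the shared words. A toy sketch with made-up word sets:

```python
# Made-up word sets for illustration only
ham_words = {"meeting", "nomination", "free", "deal"}
spam_words = {"free", "offer", "deal", "pills"}

shared = ham_words.intersection(spam_words)    # words seen in both classes
ham_only = ham_words.difference(spam_words)    # words seen in ham but never in spam
spam_only = spam_words.difference(ham_words)   # words seen in spam but never in ham

print("shared:   ", sorted(shared))
print("ham only: ", sorted(ham_only))
print("spam only:", sorted(spam_only))
```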


- We can extend the current vocabulary with the words that are unique to ham emails and not yet included in it.

In [42]:
## Finding the words that belong to Ham emails that are not in the vocab used in Model4
new_vocabulary = not_shared_words.difference(set(new_vocab_excluding_top_100))

In [43]:
## Adding the words found above to the vocabulary
updated_vocab = new_vocab_excluding_top_100 + list(new_vocabulary)
assert len(updated_vocab) == (len(new_vocab_excluding_top_100) + len(new_vocabulary)), "Incorrect Number of words. Please fix merge"
print(f"Lists merged correctly !\nThe number of new words in the vocabulary is {len(new_vocabulary)}.")

Lists merged correctly !
The number of new words in the vocabulary is 5.
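
Only a handful of words are added because `vocabulary` was built from every training token: each ham-only word is already in it, so the only ham-only words missing from `new_vocab_excluding_top_100` are those that rank among the 100 most frequent words. The sketch below reproduces this on a made-up corpus, with the 2 most frequent words standing in for the top 100.

```python
from collections import Counter

# Made-up labelled token lists: label 0 = ham, 1 = spam
toy_train = [
    (["deal", "deal", "the", "nomination"], 0),
    (["the", "free", "offer", "the"], 1),
    (["deal", "the", "meeting"], 0),
]
k = 2  # stands in for the 100 most frequent words

counts = Counter(tok for tokens, _ in toy_train for tok in tokens)
ranked = [word for word, _ in counts.most_common()]
top_k, reduced_vocab = ranked[:k], ranked[k:]

ham_words = {tok for tokens, label in toy_train if label == 0 for tok in tokens}
spam_words = {tok for tokens, label in toy_train if label == 1 for tok in tokens}
ham_only = ham_words - spam_words

# Ham-only words that the reduced vocabulary does not contain yet
added = ham_only - set(reduced_vocab)

print("top k:     ", top_k)
print("ham only:  ", sorted(ham_only))
print("re-added:  ", sorted(added))
```

In this toy corpus the re-added word is a frequent word that was dropped with the top `k` and never occurs in spam.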


In [44]:
CV_Final = CountVectorizer(analyzer = lambda x: x, vocabulary =updated_vocab )
X_Final = CV_Final.fit_transform(train.filtered_text)
assert X_Final.shape[1] == len(updated_vocab), "Incorrect Number of features extracted"
assert sorted(CV_Final.get_feature_names_out()) == sorted(updated_vocab), "Incorrect features are being extracted"
print("Correct Number of features extracted: Passed !")

Correct Number of features extracted: Passed !


In [45]:
Model_Final = MultinomialNB()
Model_Final.fit(X_Final,train.label)

### Evaluating improved model on Validation set

In [46]:
X_Validation_Final= CV_Final.transform(validation.filtered_text)

Model_Final_valid_results = Model_Final.predict(X_Validation_Final)

Model_Final_metrics = get_metrics(Model_Final_valid_results,validation.label)

The accuracy score is 0.976.
The precision score is 0.95.
The recall score is 0.966.
The f1 score is 0.958.


#### Comparing validation metrics for the improved model and model 4.


In [47]:
Model_Final_metrics.insert(0,"Improved Model")

## Model4_metrics already carries its "Model4" label from the summary table in 4b
improved_valid = pd.DataFrame([Model4_metrics, Model_Final_metrics],
                              columns = ["Model","Accuracy","Precision","Recall",'F1 '])

improved_valid

Unnamed: 0,Model,Accuracy,Precision,Recall,F1
0,Model4,0.971,0.93,0.969,0.949
1,Improved Model,0.976,0.95,0.966,0.958


### Evaluating improved Model on Test set

In [48]:
X_Test_Final= CV_Final.transform(test.filtered_text)

Model_Final_Test_results = Model_Final.predict(X_Test_Final)

Model_Final_Test_metrics = get_metrics(Model_Final_Test_results,test.label)

The accuracy score is 0.976.
The precision score is 0.933.
The recall score is 0.982.
The f1 score is 0.957.


#### Comparing test metrics for the improved model and model 4.


In [49]:
Model_Final_Test_metrics.insert(0,"Improved Model")

## Model4_test_metrics already carries its "Model4" label from the summarised test results
improved_test = pd.DataFrame([Model4_test_metrics, Model_Final_Test_metrics],
                             columns = ["Model","Accuracy","Precision","Recall",'F1 '])

improved_test

Unnamed: 0,Model,Accuracy,Precision,Recall,F1
0,Model4,0.971,0.917,0.982,0.948
1,Improved Model,0.976,0.933,0.982,0.957


- The improvement made to the feature set was based on the vocabulary.
- To improve the model's ability to detect ham (non-spam) emails, we found the words that occur **uniquely** in ham emails and were not in the vocabulary initially used for Model4. We emphasize uniquely since these words **DO NOT** occur in any spam email in the training data.
- Because these words only occur in ham emails, they are strong differentiators between ham and spam. With them, the model is better at recognising ham emails, so fewer ham emails are flagged as spam, i.e. the number of False Positives decreases. Fewer False Positives give a better **Precision**, which in turn increases the accuracy as well.
- The aforementioned increase in **Precision** and **Accuracy** is achieved on both the validation and the test set.
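
The link between False Positives, precision and accuracy can be made explicit with the standard definitions, taking spam as the positive class (TP, FP, TN and FN are the counts of true positives, false positives, true negatives and false negatives):

```latex
\text{Precision} = \frac{TP}{TP + FP},
\qquad
\text{Recall} = \frac{TP}{TP + FN},
\qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```

If a ham email that used to be flagged as spam is now classified correctly, FP decreases by one and TN increases by one. The precision denominator shrinks and the accuracy numerator grows, while recall is unchanged because neither TP nor FN is affected.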