# Spam classifier

## Loading the data

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

emails = pd.read_csv("email_spam_dataset.csv", delimiter = ",")

print(emails.info())
display(emails)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320 entries, 0 to 319
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   email_text  320 non-null    object
 1   label       320 non-null    object
dtypes: object(2)
memory usage: 5.1+ KB
None


Unnamed: 0,email_text,label
0,Here is the project update you asked for.,ham
1,Limited offer!!! Buy now and get 50% discount.,spam
2,Win cash prizes instantly by replying to this ...,spam
3,Urgent! Your account has been suspended. Verif...,spam
4,"Hi, please find the meeting agenda attached.",ham
...,...,...
315,Here is the project update you asked for.,ham
316,You are selected for a free gift card. Act fast.,spam
317,Win cash prizes instantly by replying to this ...,spam
318,Limited offer!!! Buy now and get 50% discount.,spam


## Explore the data

We'll start by changing the way the labels are coded:
- spam = 1
- ham = 0

In [2]:
# change label encoding
emails["label"] = [1 if i == "spam" else 0 for i in emails["label"]]

print(emails["label"].value_counts(normalize = True))

label
1    0.528125
0    0.471875
Name: proportion, dtype: float64


Spam and non-spam emails appear to be represented almost evenly, with ~53% and ~47% respectively.

## Naive Bayes Classifier

### Data manipulation prior to fitting the model

To measure how the model performs on emails it has not been trained on, we'll start by splitting our data into training and test sets.

In [3]:
x_train, x_test, y_train, y_test = train_test_split(emails["email_text"], 
                                                    emails["label"], 
                                                    test_size = 0.2, 
                                                    random_state = 42)
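
A common variant of the call above is a stratified split, which keeps the spam/ham ratio the same in both parts. The sketch below uses made-up data (`toy_texts` and `toy_labels` are hypothetical, not the email dataset) to show the effect of the `stratify` argument; it is an illustration rather than part of the original analysis.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 messages, half spam (1) and half ham (0).
toy_texts = [f"message {i}" for i in range(20)]
toy_labels = [1] * 10 + [0] * 10

# stratify keeps the spam/ham ratio identical in the training and test splits.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts,
    toy_labels,
    test_size=0.2,
    stratify=toy_labels,
    random_state=42,
)

print(sum(toy_y_test), len(toy_y_test))  # spam count and size of the test split
```

With classes as balanced as the ones in this dataset the difference is likely small, but the argument costs nothing to add.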

### Hyperparameter tuning

#### Grid search

In the context of a Naive Bayes Classifier, two settings deserve special attention:
- The use of n-grams.
    - Strictly speaking this is a setting of the vectorizer rather than of the classifier, but it can still affect the model's accuracy.
- Alpha.
    - The additive smoothing factor applied to the word counts; a value of 0 means no smoothing.

Since the amount of data available is fairly small, we'll evaluate each of the 66 combinations (6 n-gram ranges × 11 alpha values) with 5-fold stratified cross-validation.

In [4]:
# hyperparameters to test
params = {
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    "classifier__alpha": [i/10 for i in range(11)]
}

# set up the pipeline: vectorizer followed by classifier
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB())
])

# grid search with stratified 5-fold cross-validation
skfoldcv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
grid_search_cv = GridSearchCV(pipeline, 
                              params, 
                              scoring = "f1", 
                              cv = skfoldcv, 
                              refit = True, 
                              return_train_score = True,
                              verbose = False)

grid_search_cv.fit(x_train, y_train)

print("Best model according to the grid search: {}".format(grid_search_cv.best_estimator_))
print("Best score according to the grid search: {}".format(grid_search_cv.best_score_))

  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(

Best model according to the grid search: Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('classifier', MultinomialNB(alpha=0.0))])
Best score according to the grid search: 1.0



In [5]:
gs_results = pd.DataFrame(grid_search_cv.cv_results_)
display(gs_results)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__alpha,param_vectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.007693,0.001783,0.005349,0.001072,0.0,"(1, 1)","{'classifier__alpha': 0.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
1,0.007313,0.001245,0.004983,0.001056,0.0,"(1, 2)","{'classifier__alpha': 0.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,0.007375,0.000551,0.004583,0.000618,0.0,"(1, 3)","{'classifier__alpha': 0.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
3,0.006146,0.000572,0.004046,0.000542,0.0,"(2, 2)","{'classifier__alpha': 0.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,0.007625,0.000827,0.004834,0.000447,0.0,"(2, 3)","{'classifier__alpha': 0.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,0.005827,0.000436,0.003699,0.000155,1.0,"(1, 2)","{'classifier__alpha': 1.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
62,0.006836,0.000352,0.004510,0.000546,1.0,"(1, 3)","{'classifier__alpha': 1.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
63,0.005617,0.000568,0.003977,0.000552,1.0,"(2, 2)","{'classifier__alpha': 1.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
64,0.007535,0.001150,0.004523,0.000876,1.0,"(2, 3)","{'classifier__alpha': 1.0, 'vectorizer__ngram_...",1.0,1.0,1.0,...,1.0,0.0,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0


The grid search returned a perfect F1 score with alpha = 0, that is, with no smoothing at all. Every candidate shown in the results table ties at rank 1 with a mean score of 1.0, so this combination was not picked for being better than the others. This could suggest that we're running into one of the following issues:
- There's data leakage and the hyperparameters are being tuned to the test data.
- The model is overfitting the data.
- The dataset is so small that it just happens to be perfectly separable.
    - This would result in the inflated score we're seeing and would generalize poorly to unseen data.

#### Assessing potential issues

We'll start by looking at how the classifier generalizes to the validation set, that is, the x_test split produced by train_test_split, which was held out from the cross-validated grid search.

In [6]:
pred = grid_search_cv.predict(x_test)
accuracy = accuracy_score(y_test, pred)
precision = precision_score(y_test, pred)
recall = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

print("Validation accuracy: {}".format(accuracy))
print("Validation precision: {}".format(precision))
print("Validation recall: {}".format(recall))
print("Validation F1 score: {}".format(f1))

Validation accuracy: 1.0
Validation precision: 1.0
Validation recall: 1.0
Validation F1 score: 1.0


The fact that all performance metrics came back as 1 on the validation set (which was never seen by the grid search) suggests that this dataset is most likely perfectly separable. This is probably due to the extremely low volume of data: there are only 320 rows in the entire dataset, and the messages are short and don't include things such as titles or long paragraphs.<br><br>
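
A perfect score on held-out data is also what one would see if the held-out messages were verbatim copies of training messages, and the data preview does show repeats (rows 0 and 315 contain the same text). A minimal sketch of how that could be checked follows; the helper `share_seen_in_training` is a hypothetical name, and the two short lists merely stand in for `x_train` and `x_test`.

```python
def share_seen_in_training(train_texts, test_texts):
    """Return the fraction of test messages that also appear verbatim in training."""
    seen = set(train_texts)
    test_texts = list(test_texts)
    if not test_texts:
        return 0.0
    return sum(text in seen for text in test_texts) / len(test_texts)


# Stand-in lists built from messages visible in the data preview.
toy_train = [
    "Here is the project update you asked for.",
    "Limited offer!!! Buy now and get 50% discount.",
]
toy_test = [
    "Here is the project update you asked for.",
    "You are selected for a free gift card. Act fast.",
]

toy_share = share_seen_in_training(toy_train, toy_test)
print(toy_share)  # one of the two toy test messages is a repeat
```

Calling it as `share_seen_in_training(x_train, x_test)` would give the share for the real split; a value close to 1 would mean the validation set mostly measures memorization rather than generalization.
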
We can still take this analysis one step further and check which words are separating the classes.

In [7]:
# isolate the classifier and the vectorizer
best_model = grid_search_cv.best_estimator_
vectorizer = best_model.named_steps["vectorizer"]
classifier = best_model.named_steps["classifier"]

# build a DataFrame with the features (words) and their log probabilities per class
feature_probs = pd.DataFrame({
    "words": vectorizer.get_feature_names_out(),
    "non_spam_prob": classifier.feature_log_prob_[0],
    "spam_prob": classifier.feature_log_prob_[1]
})

feature_probs["diff"] = feature_probs["spam_prob"] - feature_probs["non_spam_prob"]
display(feature_probs.sort_values(by = "diff", ascending = False))

Unnamed: 0,words,non_spam_prob,spam_prob,diff
0,50,-inf,-3.719247,inf
1,account,-inf,-4.075922,inf
2,act,-inf,-3.788240,inf
4,and,-inf,-3.719247,inf
13,card,-inf,-3.788240,inf
...,...,...,...,...
57,the,-3.247724,-inf,-inf
56,thank,-3.495560,-inf,-inf
61,up,-3.825802,-inf,-inf
66,well,-3.825802,-inf,-inf


Each value is the log probability of a word given the class, so the difference between a word's spam and non-spam values is a log-likelihood ratio: it tells us how much more (or less) likely that word is to appear in a spam email than in a non-spam one. There are two interesting things to note about these results:
- The words with a high log probability under the spam class and a log probability of minus infinity (that is, a probability of zero) under the non-spam class are meaningful in this context.
    - We can see that what the algorithm marks as highly likely to be spam are things like "account", "card", "buy", "immediately", "selected", "cash", "congratulations", "claim", etc.
- There is practically no overlap between the words that appear in spam and the words that appear in non-spam.
    - This further cements the idea of this particular dataset being perfectly separable.
    - The only words that show some overlap are "to", "are", "here", "your", "for", and "you", which is to be expected since they are common in any kind of message.
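
The `-inf` values follow directly from fitting without smoothing: a word that never occurs in a class has a count of zero, and the log of zero is minus infinity. The sketch below reproduces that arithmetic with NumPy for a single class and three made-up word counts; `smoothed_log_probs` is an illustrative helper that follows the additive smoothing formula, not scikit-learn's own code.

```python
import numpy as np


def smoothed_log_probs(word_counts, alpha):
    """Log P(word | class) with additive smoothing: (count + alpha) / (total + alpha * V)."""
    smoothed = np.asarray(word_counts, dtype=float) + alpha
    # Silence the divide-by-zero warning that log(0) raises when alpha is 0.
    with np.errstate(divide="ignore"):
        return np.log(smoothed) - np.log(smoothed.sum())


counts_in_class = [3, 0, 1]  # hypothetical counts; the second word never occurs

no_smoothing = smoothed_log_probs(counts_in_class, alpha=0.0)
laplace = smoothed_log_probs(counts_in_class, alpha=1.0)

print(no_smoothing)  # the word with a zero count gets -inf
print(laplace)       # every word gets a finite log probability
```

With `alpha = 0`, a single word with a zero count in a class makes the probability of any message containing it zero under that class, which is why the model can separate these emails on individual words alone.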

## Final considerations

The short messages, combined with the small volume of data, lead to a dataset that is perfectly separable but very unlikely to yield a model that generalizes well to unseen data.<br><br>
The Naive Bayes Classifier built in this project is misleadingly accurate, as it obtains perfect performance metrics even on the validation set. Because the model saw so little information during training, it might perform perfectly on this particular set of emails, yet longer, more complex messages with previously unseen words are very likely to throw it off. This is especially evident from the fact that, even with no smoothing, we were able to achieve perfect performance on all metrics in the validation set.<br><br>
With all of this in mind, a model such as this one should not be deployed at a larger scale, as it was not designed to work that way. Unless we gain access to a larger dataset with more vocabulary diversity (keep in mind this model's entire training vocabulary consisted of 71 words) and longer sentences, so that higher-order n-grams can pick up context clues, it would be unwise to trust the predictions of this algorithm.
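
The vocabulary limitation can be illustrated with a tiny stand-in corpus: `CountVectorizer` silently drops words it did not see during fitting, so a message made only of new words becomes an all-zero vector and the classifier can only fall back on the class priors. The two training messages and the unseen message below are made up for the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-message training set: one spam (1), one ham (0).
toy_train = ["Win cash prizes instantly", "Please find the meeting agenda attached"]
toy_labels = [1, 0]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_train)
toy_classifier = MultinomialNB().fit(toy_counts, toy_labels)

# None of these words are in the fitted vocabulary.
unseen = toy_vectorizer.transform(["Quarterly invoice reminder"])
unseen_probs = toy_classifier.predict_proba(unseen)

print(unseen.sum())   # number of recognized word occurrences
print(unseen_probs)   # class probabilities for the unseen message
```

Here no word is recognized, so both classes come out equally likely, matching the 50/50 priors of the toy training set.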