# EE 467 Lab 1: ML Pipeline for Spam Detection

In this lab, we will walk through a typical machine learning pipeline and apply it to a cyber-security problem: building a binary classifier that detects spam emails. As in the previous lab, some code is left out for you to complete. Refer to the API references and previous labs, search online for the usage of libraries and functions, and ask the TA or instructor if you get stuck.

Before working on the code, we will need to install `NLTK` and `scikit-learn` for this lab:

In [1]:
%pip install nltk scikit-learn


And ensure the dataset is extracted from the archive:

In [2]:
# Extract data
!tar -xf emails.tar.xz

Then import the libraries we will use here:

In [3]:
# =============================================================================
# IMPORT REQUIRED LIBRARIES
# =============================================================================
# string   - Python's built-in module for string operations (punctuation list)
# numpy    - Numerical computing (we use 'np' as the standard alias)
# pandas   - Data manipulation and analysis (we use 'pd' as the standard alias)
# nltk     - Natural Language Toolkit for text processing
# =============================================================================

import string

import numpy as np
import pandas as pd

# NLTK (Natural Language Toolkit) - the most popular Python library for NLP
import nltk
from nltk.corpus import stopwords

# Download stop words (common words like "the", "a", "is" that add no meaning)
# These need to be downloaded once before use
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Andulka\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Pre-processing

All machine learning tasks begin with the **pre-processing** step, during which we load the dataset into memory and "clean" the data so that it is suitable for subsequent steps. For the spam email detection task, we will load all emails into memory, tokenize each email into a list of words, and then remove words that are useless for analysis.

All emails are stored in `emails.csv` under the same directory as this notebook. Feel free to open the file, take a look and get familiar with the format of the email dataset, then go back here to load the data.

In [4]:
# =============================================================================
# LOADING THE DATASET
# =============================================================================
# pd.read_csv() reads a CSV file and returns a DataFrame
# A DataFrame is like a spreadsheet - rows are samples, columns are features
# =============================================================================

# Load email dataset into a DataFrame
df = pd.read_csv("emails.csv")

# Preview first 5 rows
print(df.head(5), "\n")

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1 



In [5]:
# Check dataset size and columns
print("Shape:", df.shape)      # (rows, columns)
print("Columns:", df.columns)  # 'text' = email, 'spam' = label (1=spam, 0=ham)

Shape: (5728, 2)
Columns: Index(['text', 'spam'], dtype='object')


In [8]:
## [ TODO 1 ] Remove duplicate rows from the DataFrame
#
# Hint: DataFrames have a method for removing duplicates in-place.
#       After removing, print the shape to verify - expect fewer rows.
#       Look up: pandas DataFrame drop_duplicates documentation
#
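# drop_duplicates() returns a new DataFrame without duplicate rows; df itself is unchanged.
# Note: later cells still use df, so the 33 duplicate rows remain in the data used for training.
set_df = df.drop_duplicates()
print("Shape (new):", set_df.shape)       # (rows, columns)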

#pass

Shape (new): (5695, 2)


In [9]:
# Number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

text    0
spam    0
dtype: int64

After loading the email dataset into memory, we need to remove punctuation and stop words from these emails. Stop words are very common words that carry little meaning on their own (such as a, an, the, ...) and are usually ignored in analysis.

In [10]:
# Text tokenizer: removes punctuation and stop words
def process_text(text):
    """Convert email text to list of meaningful words."""

    # Remove punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)
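    nopunc = ''.join(char for char in text if char not in string.punctuation)

    # Build the stop-word set once per call; a set lookup is much faster
    # than re-reading the stop-word list for every word
    stop_words = set(stopwords.words('english'))

    # Remove stop words ("the", "a", "is", etc.) - case insensitive
    clean_words = [word for word in nopunc.split()
                   if word.lower() not in stop_words]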

    return clean_words

In [11]:
# Preview the result of tokenization
df['text'].head().apply(process_text)

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
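
Punctuation stripping and stop-word filtering can also be sketched with the standard library alone. The snippet below is only an illustration: the name `tokenize` and the tiny `STOP_WORDS` set are assumptions made for this example and stand in for NLTK's full English stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word set (NLTK's real list is much longer)
STOP_WORDS = frozenset({"the", "a", "an", "is", "do", "not", "have"})

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def tokenize(text, stop_words=STOP_WORDS):
    """Strip punctuation, split on whitespace, drop stop words (case-insensitive)."""
    no_punct = text.translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

tokens = tokenize("Subject: do not have money , get software cds")
print(tokens)
```

With this small stop-word set, the result starts with the same tokens as row 4 of the preview above.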

## Feature Extraction

We have obtained semi-structured tokenized email texts in the pre-processing step; however, machine learning algorithms usually operate on fully-structured numerical features. Hence, we need to find a way to convert the email texts to numeric vectors. This process is called **feature extraction**, and is necessary in data mining and analysis tasks where input data is semi-structured or even unstructured. In the following part we will make use of `scikit-learn`, which is a library for classic machine learning and feature extraction.

We will use **token count features** to represent the characteristics of each email. This turns a piece of text into a vector in which each dimension holds the number of occurrences of a particular word. In practice, we process many texts at once and end up with a token count matrix. Below is a simple demo on a toy dataset with only two emails:

In [12]:
# DEMO: Bag-of-Words converts text → word count vectors

message4 = 'hello world hello hello world play'
message5 = 'test test test test one hello'

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer: text → matrix where each column = a word, values = counts
cv = CountVectorizer(analyzer=process_text)
# Pass a flat list of strings: each element is one document
bow4 = cv.fit_transform([message4, message5])

In [13]:
# Vocabulary = unique words (these become column names)
print(cv.get_feature_names_out(), "\n")

# Count matrix: rows = documents, columns = word counts
print(bow4.toarray(), "\n")

['hello' 'one' 'play' 'test' 'world'] 

[[3 0 1 0 2]
 [1 1 0 4 0]] 
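
To make the count matrix less of a black box, here is a minimal hand-rolled sketch of the same idea using only `collections.Counter`. It is not how `CountVectorizer` is implemented; it only reproduces the toy result, assuming whitespace tokenization and an alphabetically sorted vocabulary.

```python
from collections import Counter

docs = ["hello world hello hello world play",
        "test test test test one hello"]

# One Counter per document: token -> number of occurrences
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary: every distinct token across all documents, sorted alphabetically
vocabulary = sorted(set().union(*counts))

# Count matrix: one row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The vocabulary and both rows agree with the `CountVectorizer` output printed above.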



In [14]:
# Sparse format: only stores non-zero values (saves memory)
print(bow4, type(bow4), "\n")

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6 stored elements and shape (2, 5)>
  Coords	Values
  (0, 0)	3
  (0, 4)	2
  (0, 2)	1
  (1, 0)	1
  (1, 3)	4
  (1, 1)	1 <class 'scipy.sparse._csr.csr_matrix'> 



Now let's compute and store the token count matrix for the real data:

## Create bag-of-words matrix for all emails

In this step, you will convert the email **text content** into a **Bag-of-Words (BoW)** representation using `CountVectorizer`.

✅ **Important note:**  
In the in-class demo, we used `CountVectorizer(analyzer=process_text)`, where `process_text` performs custom text processing.  
That approach can be **slow** and may produce **many printed outputs** because the custom analyzer shows intermediate processing steps.

For this lab, we will use a simpler and faster approach by letting `CountVectorizer` handle the tokenization internally, and we will enable English stop-word removal using:

- `stop_words="english"`

➡️ Your task: apply `CountVectorizer(stop_words="english")` on the `text` column and store the result in `messages_bow` as a **sparse matrix**.


In [20]:
## [ TODO 2 ] Create bag-of-words matrix for all emails
#
# Hint: Use CountVectorizer with stop-word removal to fit and transform email text
#       into a Bag-of-Words matrix.
#       Apply it to the 'text' column of df. Store result in 'messages_bow'.
#       Note: Keep it as sparse matrix (don't convert to array).
#
vectorizer = CountVectorizer(stop_words="english")
messages_bow = vectorizer.fit_transform(df['text'])

print(messages_bow, type(messages_bow), "\n")
#pass

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 508419 stored elements and shape (5728, 36996)>
  Coords	Values
  (0, 32145)	1
  (0, 23219)	1
  (0, 18705)	1
  (0, 9986)	1
  (0, 17562)	1
  (0, 21006)	1
  (0, 27817)	1
  (0, 16546)	1
  (0, 27941)	1
  (0, 9223)	3
  (0, 21520)	2
  (0, 32408)	1
  (0, 18103)	1
  (0, 18751)	1
  (0, 15964)	2
  (0, 7986)	1
  (0, 20818)	3
  (0, 32126)	1
  (0, 31776)	1
  (0, 24679)	1
  (0, 35805)	2
  (0, 21296)	2
  (0, 32839)	1
  (0, 12539)	1
  (0, 26937)	2
  :	:
  (5727, 24659)	2
  (5727, 21490)	1
  (5727, 5683)	9
  (5727, 30755)	1
  (5727, 2807)	3
  (5727, 13246)	1
  (5727, 13036)	1
  (5727, 17257)	1
  (5727, 14028)	1
  (5727, 20137)	1
  (5727, 31635)	1
  (5727, 13037)	1
  (5727, 20329)	1
  (5727, 35066)	1
  (5727, 8557)	1
  (5727, 29914)	1
  (5727, 13428)	5
  (5727, 35964)	1
  (5727, 943)	2
  (5727, 2776)	1
  (5727, 30109)	1
  (5727, 17456)	1
  (5727, 33710)	1
  (5727, 10293)	1
  (5727, 11304)	1 <class 'scipy.sparse._csr.csr_matrix'> 
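
The shape and number of stored elements printed above show why the matrix is kept sparse. The back-of-the-envelope sketch below estimates the density and memory footprint; the 4-byte index size is an assumption about how the CSR arrays are stored, so treat the byte counts as rough estimates.

```python
n_rows, n_cols = 5728, 36996   # shape reported above
n_stored = 508419              # stored (non-zero) elements reported above

# Fraction of cells that actually hold a non-zero count
density = n_stored / (n_rows * n_cols)

# Dense storage: one 8-byte int64 per cell
dense_bytes = n_rows * n_cols * 8

# CSR storage: values + column indices + row pointers
# Assumption: 8-byte values and 4-byte (int32) indices
csr_bytes = n_stored * 8 + n_stored * 4 + (n_rows + 1) * 4

print(f"density: {density:.4%}")
print(f"dense:   {dense_bytes / 1e6:.0f} MB")
print(f"CSR:     {csr_bytes / 1e6:.1f} MB")
```

Fewer than 1% of the cells are non-zero, so converting the matrix to a dense array would cost orders of magnitude more memory.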



## Training

Now that we have loaded and pre-processed the email dataset, it's time to **train** a classifier model that does the job. First, we will split the email dataset into an 80% **training set** and a 20% **test set**. Each set will contain sample features as well as the corresponding labels.

In [21]:
from sklearn.model_selection import train_test_split

# Split the data into 80% training (X_train & y_train)
# and 20% testing (X_test & y_test) data sets
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['spam'], test_size = 0.20, random_state = 0)
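
Conceptually, the split shuffles the row indices and cuts them into two parts. The sketch below illustrates this with the standard library; `split_indices` is a hypothetical helper, and its shuffle order will not match the one `train_test_split` produces for `random_state = 0`. Only the resulting set sizes are comparable.

```python
import math
import random

def split_indices(n_samples, test_size=0.20, seed=0):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)   # test size is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5728)
print(len(train_idx), len(test_idx))
```

For 5728 emails this gives 4582 training and 1146 test samples, the same totals that appear in the `support` column of the evaluation reports.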

Then, we train a **logistic regression** classifier on the training set. The model predicts the class of a sample from the probability that the sample is spam, which is computed as follows:

$$
\begin{aligned}
P(Y = 1 \mid X = \mathbf{x}) &= \frac{e^{\mathbf{x}^T \mathbf{b}}}{1+e^{\mathbf{x}^T \mathbf{b}}} \\
P(Y = 0 \mid X = \mathbf{x}) &= 1 - P(Y = 1 \mid X = \mathbf{x})
\end{aligned}
$$

where $\mathbf{x}$ is the feature vector of a sample and $\mathbf{b}$ is a trainable weight vector. During training, we **minimize** the **cross-entropy loss** with respect to $\mathbf{b}$ using an iterative gradient-based optimizer such as **stochastic gradient descent** (scikit-learn's `LogisticRegression` uses the L-BFGS solver by default):

$$
l_{CE} = -(y \log P(Y = 1 \mid X = \mathbf{x}) + (1 - y) \log P(Y = 0 \mid X = \mathbf{x}))
$$
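
The two formulas can be tried out on a toy problem. The sketch below is an illustration only: it uses plain full-batch gradient descent (not the solver scikit-learn uses), no regularization, and a made-up dataset whose columns are hypothetical counts of the words "free" and "meeting" plus a constant bias feature.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, b):
    """P(Y = 1 | X = x) for feature vector x and weight vector b."""
    return sigmoid(sum(xi * bi for xi, bi in zip(x, b)))

def cross_entropy(y, p, eps=1e-12):
    """Cross-entropy loss of one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mean_loss(X, y, b):
    return sum(cross_entropy(yi, predict_proba(xi, b)) for xi, yi in zip(X, y)) / len(X)

def gradient_step(X, y, b, lr=0.1):
    """One full-batch gradient descent step on the mean cross-entropy loss."""
    grad = [0.0] * len(b)
    for xi, yi in zip(X, y):
        err = predict_proba(xi, b) - yi          # derivative of the loss w.r.t. x^T b
        for j, xij in enumerate(xi):
            grad[j] += err * xij / len(X)
    return [bj - lr * gj for bj, gj in zip(b, grad)]

# Toy data (hypothetical): columns = count of "free", count of "meeting", bias
X = [[3, 0, 1], [2, 0, 1], [0, 2, 1], [0, 1, 1]]
y = [1, 1, 0, 0]

b = [0.0, 0.0, 0.0]
loss_before = mean_loss(X, y, b)
for _ in range(200):
    b = gradient_step(X, y, b)
loss_after = mean_loss(X, y, b)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
print("weights:", [round(bj, 2) for bj in b])
```

The loss decreases as training proceeds, and the learned weight is positive for the word that appears only in the spam rows and negative for the word that appears only in the ham rows.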

In [23]:
from sklearn.linear_model import LogisticRegression

## [ TODO 3 ] Create and train a logistic regression classifier
#
# Hint: Instantiate LogisticRegression (use random_state=0 for reproducibility).
#       Then call the appropriate method to train on X_train and y_train.
#       Store the model in a variable called 'classifier'.
#
model = LogisticRegression(random_state=0)
classifier = model.fit(X_train, y_train)
#pass

## Evaluation

Finally, we need to determine how good our classification model is. This is known as **evaluation**. We will use our trained model to make predictions for both training and testing data, and calculate various metrics with the predictions and actual labels.

In [24]:
# Print predictions on training data
# `predict` computes model predictions from input data
print("Training prediction:\n", classifier.predict(X_train), "\n")

# Print the actual labels
print("Training actual:\n", y_train.values, "\n")

Training prediction:
 [0 0 1 ... 0 0 0] 

Training actual:
 [0 0 1 ... 0 0 0] 



There are a number of useful metrics for evaluation of binary classifiers, available through `classification_report`, `confusion_matrix` and `accuracy_score` functions:

* **Confusion Matrix**: a matrix that indicates how many samples are correctly or incorrectly classified. The cell in the $i$-th row and $j$-th column counts the samples that belong to the $i$-th class and are predicted as the $j$-th class. For binary classification, the confusion matrix has only two rows and two columns:

|Actual \ Predicted|Positive           |Negative           |
|------------------|-------------------|-------------------|
|Positive          |True Positive (TP) |False Negative (FN)|
|Negative          |False Positive (FP)|True Negative (TN) |

  Note that scikit-learn's `confusion_matrix` orders rows and columns by label value (0 = ham first, then 1 = spam), so its printed output is laid out as `[[TN, FP], [FN, TP]]`.

* **Accuracy**: proportion of samples that are correctly classified.

$$
Accuracy = \frac{TP+TN}{TP+FP+TN+FN}
$$

* **Precision**: of all positive predictions, how many of them are actually correct?

$$
Precision = \frac{TP}{TP+FP}
$$

* **Recall**: of all actually positive samples, how many of them are predicted correctly?

$$
Recall = \frac{TP}{TP+FN}
$$

* **F1 Score**: the harmonic mean of precision and recall.

$$
F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
$$
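
As a quick check of the definitions above, the sketch below counts TP, FP, TN and FN from two short label lists and evaluates the four formulas. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Made-up labels: 1 = spam (positive), 0 = ham (negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy, precision, recall, F1:", accuracy, precision, recall, f1)
```

This sketch does not guard against division by zero, which occurs when a class is never predicted or never present.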

We first calculate and print various metrics for the training data:

In [25]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predict and evaluate on training data
pred = classifier.predict(X_train)

# `classification_report` outputs classification metrics
# such as precision, recall and F1 score
print(classification_report(y_train, pred))

# `confusion_matrix` outputs how many samples are correctly or incorrectly classified
print('Confusion Matrix: \n', confusion_matrix(y_train, pred), "\n")

# `accuracy_score` computes classification accuracy
print('Accuracy: ', accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3475
           1       1.00      1.00      1.00      1107

    accuracy                           1.00      4582
   macro avg       1.00      1.00      1.00      4582
weighted avg       1.00      1.00      1.00      4582

Confusion Matrix: 
 [[3475    0]
 [   0 1107]] 

Accuracy:  1.0


We now calculate and print the same metrics for the testing data. This measures the ability of the classification model to generalize to similar yet unseen data. The smaller the gap between training and testing performance, the better the model generalizes.

In [None]:
## [ TODO 4 ] Print test predictions and actual labels
#
# Hint: Use the trained classifier to predict on X_test.
#       Print both the predictions and y_test values side by side.
#       Follow the same pattern used for training data in Cell 24.
#
pred_X_test = classifier.predict(X_test)
# `predict` computes model predictions from input data
print("Test prediction:\n", pred_X_test, "\n")

# Print the actual labels
print("Test actual:\n", y_test.values, "\n")

#pass

Test prediction:
 [0 0 1 ... 0 0 1] 

Test actual:
 [0 0 1 ... 0 0 1] 



In [27]:
## [ TODO 5 ] Evaluate classifier on test data
#
# Hint: Follow the same evaluation pattern used for training data previously.
#       Use the three imported metrics functions on X_test/y_test.
#       Expected accuracy should be around 98-99%.
#
pred_X_test = classifier.predict(X_test)
print(classification_report(y_test, pred_X_test))

# `confusion_matrix` outputs how many samples are correctly or incorrectly classified
print('Confusion Matrix: \n', confusion_matrix(y_test, pred_X_test), "\n")

# `accuracy_score` computes classification accuracy
print('Accuracy: ', accuracy_score(y_test, pred_X_test))
#pass

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       885
           1       0.97      0.98      0.98       261

    accuracy                           0.99      1146
   macro avg       0.98      0.99      0.99      1146
weighted avg       0.99      0.99      0.99      1146

Confusion Matrix: 
 [[878   7]
 [  5 256]] 

Accuracy:  0.9895287958115183
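
The rows of the report can be reproduced by hand from the test confusion matrix printed above. The sketch below assumes scikit-learn's layout, where rows are actual labels and columns are predicted labels, both ordered as 0 = ham, 1 = spam; `per_class_metrics` is a hypothetical helper written for this check.

```python
# Test confusion matrix reported above: rows = actual, columns = predicted
cm = [[878, 7],
      [5, 256]]

def per_class_metrics(cm, cls):
    """Precision, recall and F1 when class `cls` is treated as the positive class."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp   # predicted cls, actually other
    fn = sum(cm[cls]) - tp                              # actually cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

ham = per_class_metrics(cm, 0)
spam = per_class_metrics(cm, 1)
macro = [(h + s) / 2 for h, s in zip(ham, spam)]        # unweighted mean over classes
accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))

for name, values in (("ham", ham), ("spam", spam), ("macro avg", macro)):
    print(f"{name:>9}", " ".join(f"{v:.2f}" for v in values))
print(f"accuracy {accuracy:.4f}")
```

Rounded to two decimals, these values agree with the `0`, `1` and `macro avg` rows of the report, and the accuracy matches the printed value.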


## Discussion Question: Why Bag-of-Words (BoW) Still Works (and its Limitations)

In this lab, we used **Bag-of-Words (BoW)** to convert email text into numerical features that a machine learning model can understand.

### A common concern with BoW
In the in-class discussion, we learned that some words appear in **many** documents (examples: *“the”*, *“and”*, *“hello”*, *“thanks”*). These very frequent words can cause two issues:

1. **They do not help distinguish spam vs. ham**  
   If a word appears in almost every email, it does not provide useful information for classification.

2. **Different emails can look similar in feature space**  
   Two different messages may share many common words, which can lead to **similar BoW representations**, even if their meaning is different.
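
Both issues can be made concrete by counting, for each class, how many emails contain a given word (its document frequency). The four emails below are invented for illustration and are not taken from the lab dataset.

```python
from collections import Counter

# Invented toy emails: (text, label) with 1 = spam, 0 = ham
emails = [
    ("hello claim your free prize now", 1),
    ("hello free money waiting for you", 1),
    ("hello meeting moved to monday", 0),
    ("hello please review the monday report", 0),
]

def document_frequency(docs):
    """Count in how many documents each word appears (at most once per document)."""
    freq = Counter()
    for text in docs:
        freq.update(set(text.split()))
    return freq

spam_df = document_frequency(text for text, label in emails if label == 1)
ham_df = document_frequency(text for text, label in emails if label == 0)

for word in ("hello", "free", "monday"):
    print(f"{word:>7}  spam: {spam_df[word]}  ham: {ham_df[word]}")
```

In this toy set, "hello" appears equally often in both classes and so carries no signal, while "free" and "monday" each appear in only one class.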

---

### ✅ Your Task (Short Answer)
Even with the limitations above, BoW often performs surprisingly well for spam detection.

**Why does the Bag-of-Words method still work well in this lab?**  
Write a **short explanation** (2–4 sentences) and include **at least one clear reason** supported by what you observe in the dataset or model behavior.


In [None]:
### Please include your Answer here
"""
The Bag-of-Words method still works well in this lab because stop_words="english" already filters out the most common words,
so the model focuses on the remaining, more meaningful words. In addition, this lab uses logistic regression, which assigns a weight to each feature:
it learns small weights for frequent words that carry little information and large weights for predictive words.
Overall, focusing on statistically less common words makes it easier to pinpoint the deceptive vocabulary that spam emails usually contain.
"""
