# DATA 607 - Summer 2025

## 2025.07.21

### In-class activity - SMS spam filtering

#### The dataset

In this activity, we'll try to flag spam SMS messages based on the text of the message.

The dataset comes from the UCI Machine Learning Repository ([link](https://archive.ics.uci.edu/dataset/228/sms+spam+collection)).

I preprocessed it a bit. Load it from `data/sms_spam.csv`:

In [None]:
import pandas as pd

df = pd.read_csv("../data/sms_spam.csv", dtype={"is_spam": bool})
df.head()

| Unnamed: 0 | is_spam | message |
|---|---|---|
| 0 | False | Go until jurong point, crazy.. Available only ... |
| 1 | False | Ok lar... Joking wif u oni... |
| 2 | True | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | False | U dun say so early hor... U c already then say... |
| 4 | False | Nah I don't think he goes to usf, he lives aro... |


The dataset is unbalanced, with spam messages making up the minority (positive) class.
What proportion of messages in the dataset are spam?
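
As a hint, the mean of a boolean column is the fraction of `True` values. Here is a minimal sketch on a made-up dataframe (the rows in `toy` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe; same column names as above.
toy = pd.DataFrame(
    {
        "is_spam": [False, False, True, False, True],
        "message": ["ok", "see you soon", "WIN a FREE prize", "lol", "Call 0800 now"],
    }
)

# True counts as 1 and False as 0, so the mean is the proportion of spam.
spam_proportion = toy["is_spam"].mean()
print(spam_proportion)
```

The same expression should work on `df` once it's loaded.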

#### Manual feature extraction

If you examine some of the messages in the dataset, you'll notice some patterns that you might exploit for classifying spam messages. 

Add the following features to the dataframe:

- `length`, the length of a message, in characters,

- `num_caps`, the number of capital letters in a message,

- `proportion_caps`, the proportion of capital letters in a message,

- `num_digits`, the number of digits in a message,

- `proportion_digits`, the proportion of digits in a message,

- binary features `contains_<char>` indicating whether each of the following characters occurs in a message:
`@`, `#`, `$`, `*`, `/`, `:`, `-`, `+`, `£`, `(`, `)`, `[`, `]`, `;`, `<`, `>`, `?`
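
One possible way to build these columns with pandas string methods is sketched below. The helper name `add_manual_features` is my own, and I'm assuming that "capital letters" means ASCII `A`-`Z` and that a proportion is a count divided by the message length; adjust if you interpret the features differently.

```python
import pandas as pd

# The characters listed above, one binary feature per character.
SPECIAL_CHARS = "@#$*/:-+£()[];<>?"


def add_manual_features(df, text_col="message"):
    """Return a copy of df with the hand-crafted features added (hypothetical helper)."""
    out = df.copy()
    text = out[text_col]

    out["length"] = text.str.len()
    out["num_caps"] = text.str.count(r"[A-Z]")    # assumption: ASCII capitals only
    out["num_digits"] = text.str.count(r"[0-9]")

    # Avoid dividing by zero if a message happens to be empty.
    denom = out["length"].where(out["length"] > 0, 1)
    out["proportion_caps"] = out["num_caps"] / denom
    out["proportion_digits"] = out["num_digits"] / denom

    for ch in SPECIAL_CHARS:
        # regex=False treats characters like "$" and "(" literally.
        out[f"contains_{ch}"] = text.str.contains(ch, regex=False)
    return out


# Made-up messages, just to exercise the helper.
toy = pd.DataFrame({"message": ["WIN £100 now!", "ok see u @ 5"]})
features = add_manual_features(toy)
print(features[["length", "num_caps", "num_digits", "contains_£", "contains_@"]])
```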

Compute cross-validated accuracy, $F_1$, precision, and recall metrics for `LogisticRegression`, `SGDClassifier`, and `LinearSVC` models fit to the data using the features you extracted.
If you do this using `cross_val_score`, you'll need to loop over the metrics yourself. If you use [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) instead (try it!) you can pass in a list of metrics.
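
Here is a rough sketch of the `cross_validate` route. It runs on synthetic data from `make_classification` as a stand-in for your feature matrix, and the `StandardScaler` step is my own addition (the linear models tend to behave better on scaled features), not something the activity requires.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in for the manual-feature matrix (not the SMS data).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(random_state=0),
    "LinearSVC": LinearSVC(),
}
metrics = ["accuracy", "f1", "precision", "recall"]

mean_scores = {}
for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    # With a list of metrics, the results dict has one "test_<metric>" array per metric.
    cv_results = cross_validate(pipeline, X, y, cv=5, scoring=metrics)
    mean_scores[name] = {m: cv_results[f"test_{m}"].mean() for m in metrics}

for name, scores in mean_scores.items():
    print(name, scores)
```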

How do the models compare across the various metrics?

#### Cross-validated metrics

In applications like spam filtering, misclassifications may have asymmetric costs. The cost of missing an important message because it was classified as spam may be significantly higher than the cost of having to read and delete a spam message. For this reason, it's useful to compute false positive and false negative rates (see below), rather than just the overall misclassification rate.

These rates can be computed from the data in the [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix).


There is also a nice utility, `ConfusionMatrixDisplay`, for displaying confusion matrices with row and column labels. Again, see [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay).

Plot cross-validated confusion matrix displays for the various classifiers listed above. Which ones have the best false positive and false negative rates? Recall that a prediction is a

- **true positive** if the true label is $1$ and the predicted label is $1$,
- **false negative** if the true label is $1$ and the predicted label is $0$,
- **false positive** if the true label is $0$ and the predicted label is $1$,
- **true negative** if the true label is $0$ and the predicted label is $0$.

Let $\text{TP}$, $\text{FN}$, $\text{FP}$, and $\text{TN}$ be the numbers of true positive, false negative, false positive, and true negative predictions, respectively.

The **confusion matrix** of a binary classifier is the matrix
$$
\begin{pmatrix}
\text{TN}&\text{FP}\\
\text{FN}&\text{TP}
\end{pmatrix}.
$$

The confusion matrix can be computed using the compound metric [`confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html), exported from the `sklearn.metrics` module. 
That module also exports a nice utility [`ConfusionMatrixDisplay`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay) for displaying confusion matrices together with row and column labels. For more expository material on confusion matrices, see the [User Guide](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix).


The **true positive rate**, $\text{TPR}$, of a predictor is defined by

$$
\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}}.
$$ 

Similarly, the **false negative rate**, $\text{FNR}$, of a predictor is defined by
$$
\text{FNR} = \frac{\text{FN}}{\text{TP} + \text{FN}}.
$$
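
To make the definitions concrete, here is a small sketch that pulls the four counts out of `confusion_matrix` for some hand-written labels (the arrays `y_true` and `y_pred` are made up for illustration). It also computes the false positive rate, using the standard definition $\text{FP} / (\text{FP} + \text{TN})$.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = spam (positive), 0 = not spam (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# For binary labels the matrix is [[TN, FP], [FN, TP]], so ravel() yields this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate (this is also the recall)
fnr = fn / (tp + fn)  # false negative rate, equal to 1 - TPR
fpr = fp / (fp + tn)  # false positive rate (standard definition)

print(tn, fp, fn, tp)  # 3 1 1 3
print(tpr, fnr, fpr)   # 0.75 0.25 0.25
```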

Because the confusion matrix isn't a score, you can't simply pass it as the `scoring` argument of `cross_val_score` or `cross_validate` to get a cross-validated version. Instead, you need to generate *cross-validated predictions* using [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html), and then pass these to `confusion_matrix` together with the true labels.
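
A minimal sketch of that workflow, again on synthetic data rather than the SMS features (so the numbers mean nothing for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Each sample's prediction comes from a model trained on folds that exclude it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y, y_pred)
print(cm)
```

If matplotlib is installed, something like `ConfusionMatrixDisplay(cm, display_labels=["ham", "spam"]).plot()` should draw the labeled version; the label names here are just my choice.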

Compute and display confusion matrices for the classifiers listed above. Use them to extract the true positive rate and false positive rate of each classifier.



#### Feature extraction with `CountVectorizer`

Compute cross-validated accuracy, $F_1$, precision, and recall metrics for `LogisticRegression`, `SGDClassifier`, `LinearSVC`, and `MultinomialNB` models fit to the data, extracting features using `CountVectorizer`.
How do the models compare across the various metrics?
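
The key design point is to put the vectorizer and the classifier in one pipeline, so that the vocabulary is learned only from the training part of each fold. Here is a sketch with `MultinomialNB` on a tiny made-up corpus (the `messages` and `labels` below are invented for illustration); on the real data you'd pass `df["message"]` and `df["is_spam"]` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus: first six messages are "spam", last six are not.
messages = [
    "win a free prize now", "free entry to win cash", "claim your free cash prize",
    "urgent call now to win", "free ringtone text now", "cash prize claim now",
    "are you coming home", "see you at lunch", "ok talk later",
    "i will call you later", "what time is dinner", "thanks see you soon",
]
labels = [1] * 6 + [0] * 6

# The pipeline takes raw strings; CountVectorizer is refit inside every CV fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

metrics = ["accuracy", "f1", "precision", "recall"]
cv_results = cross_validate(pipeline, messages, labels, cv=3, scoring=metrics)

for m in metrics:
    print(m, cv_results[f"test_{m}"].mean())
```

Swapping in another classifier only changes the second argument of `make_pipeline`.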

Can you improve performance by tuning parameters of the vectorizer or the classifier? Does swapping out `CountVectorizer` with `TfidfVectorizer` improve any metrics?
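
One way to explore these questions is a grid search over the whole pipeline. The sketch below is only a starting point: the step names `vec` and `clf`, the parameter values in the grid, and the toy corpus are all my own choices, and searching on $F_1$ is just one reasonable option.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus: first six messages are "spam", last six are not.
messages = [
    "win a free prize now", "free entry to win cash", "claim your free cash prize",
    "urgent call now to win", "free ringtone text now", "cash prize claim now",
    "are you coming home", "see you at lunch", "ok talk later",
    "i will call you later", "what time is dinner", "thanks see you soon",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])

param_grid = {
    "vec": [CountVectorizer(), TfidfVectorizer()],  # swap the whole vectorizer step
    "vec__ngram_range": [(1, 1), (1, 2)],           # unigrams vs. unigrams + bigrams
    "clf__alpha": [0.1, 1.0],                       # naive Bayes smoothing strength
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
search.fit(messages, labels)

print(search.best_params_)
print(search.best_score_)
```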

#### Extensions

At this point, I expect you to be out of time. If you're not...

- Try using dense embeddings to extract features instead of `CountVectorizer`. Follow the approach I demonstrated in Lecture 5 with the GTE.

- If you've seen ROC curves before (I haven't discussed them in class), use `roc_curve` to find a threshold probability for a logistic regression classifier ensuring a small false positive rate, say $0.02$. What is the corresponding true positive rate? How about accuracy, $F_1$, precision, and recall scores?
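
For the ROC extension, here is one possible approach, sketched on synthetic data: get cross-validated spam probabilities, then pick the last point on the ROC curve whose false positive rate is within budget. Using `cross_val_predict` with `method="predict_proba"` is my assumption about how to get honest probabilities; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic, imbalanced stand-in for the real features and labels.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Column 1 holds the cross-validated probability of the positive class.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)[:, 1]

fpr, tpr, thresholds = roc_curve(y, proba)

max_fpr = 0.02
# fpr is nondecreasing and starts at 0, so the last index within budget
# gives the highest TPR among the points that satisfy the constraint.
idx = np.flatnonzero(fpr <= max_fpr)[-1]
threshold = thresholds[idx]

# Classify as positive when the probability reaches the chosen threshold.
y_pred = proba >= threshold
achieved_fpr = (y_pred & (y == 0)).sum() / (y == 0).sum()

print("threshold:", threshold)
print("FPR:", achieved_fpr, "TPR:", tpr[idx])
```

From `y_pred` you can then compute accuracy, $F_1$, precision, and recall with the usual functions in `sklearn.metrics`.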

