# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys 
if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aml24-master-class.git
        
    # Copy files required to run the code
    !cp -r "aml24-master-class/text_classification/data" "aml24-master-class/text_classification/tools.py" .
    
    # Install packages via pip
    !pip install -r "aml24-master-class/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)

# Text classification :: Overview

### Task 

We want to build a spam detector which, given examples of spam emails (e.g. flagged by users) and examples of regular (non-spam, also called "ham") emails, learns how to flag new, unseen emails as spam or non-spam.

### Data

We will use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. This dataset contains ~6'000 labeled emails with a ~30% spam ratio. If you want to learn more about this dataset, check [this](https://spamassassin.apache.org/old/publiccorpus/). (*Note: Datasets of text are called corpora and samples are called documents.*) 

The dataset has been downloaded for you and is available in the *data* folder.

### Notebook overview

* Load the data
* Text preprocessing
* Data exploration
* Feature extraction
* Build a spam detector
* What did our model learn? Error analysis

# Text classification :: Spam detection


## Load the data

In [None]:
# Load libraries and helper functions
%run tools.py

In [None]:
# Load the data
df = load_data()

Let's check the number of samples per class in the data.

In [None]:
plot_class_frequency(df)

Now, let's have a look at a few rows from the dataset.

***Note:*** The *label* is 0 for *non-spam* and 1 for *spam*.

In [None]:
# If you rerun this cell then you get a different set of samples displayed
df.sample(3)

## Text preprocessing

Good text preprocessing is an essential part of every NLP project!

Our goal is to build a model that distinguishes non-spam from spam. The idea is to "clean" and "standardize" the raw text before feeding it to our machine learning model. We need to keep as many "informative" words as possible while discarding the "uninformative" ones. Removing unnecessary content, i.e. the "noise", from our texts will help improve the accuracy of our models.

## 💡 Observations

- Some items in the text should be removed or normalized to make it readable. Here are some suggestions:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Case conversion (e.g. Dog vs dog, ...)
* English STOPWORDS (e.g. a, is, my, i, all, and, by...)
* ...

- It is likely that the number of occurrences of the above items (HTML tags, URLs, etc) is helpful to distinguish spam from non-spam. Similarly, the length of the emails and the frequency of punctuation marks or upper case letters may also give us clues as to whether we are dealing with spam or not.

The *clean_corpus* function below will take care of the parts raised in the 1st observation. For the ideas from the 2nd observation we will create new features and investigate their effects in the subsection **What about "spammish" signatures?**. 
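The cleaning steps from the first observation can be sketched with a few regular expressions. This is only an illustration and not the actual implementation of *clean_corpus* in `tools.py`: the function name `clean_text`, the patterns and the tiny stop word list are assumptions made for this example.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer)
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by", "the", "to", "of"}


def clean_text(text):
    """Apply the cleaning steps listed above to one raw email text."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)               # e-mail addresses
    text = text.lower()                                # case conversion
    text = re.sub(r"[^a-z\s]", " ", text)              # punctuation marks and digits
    words = [w for w in text.split() if w not in STOPWORDS]  # stop words
    return " ".join(words)                             # single spaces between words


raw = "<p>Click HERE: http://example.com or mail me@example.com by 2002!!!</p>"
print(clean_text(raw))
```

The order of the steps matters in this sketch: URLs and e-mail addresses are removed before punctuation, because stripping the punctuation first would break them into ordinary-looking words.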

In [None]:
df = clean_corpus(df)

print("Data cleaned")

Let's have a look at a few "cleaned" examples.

In [None]:
show_clean_text(df)

## Data Exploration :: What makes spam distinct?

### Frequent words

Which words distinguish spam from non-spam? Can we identify the words in a text that are the most informative about its topic?

Let's find the 10 most frequent words in spam and non-spam and compare them.
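Counting the most frequent words per class needs little more than a counter. The sketch below is not the code behind `plot_most_common_words`; the helper name `most_common_words` and the toy texts are made up for illustration, and it assumes the texts are already cleaned and whitespace-separated.

```python
from collections import Counter


def most_common_words(texts, n=10):
    """Return the n most frequent words over a list of cleaned texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)


# Toy examples (hypothetical, not taken from the corpus)
spam_texts = ["free money click", "click free offer"]
ham_texts = ["linux users wrote", "wrote like linux list"]

print("spam:    ", most_common_words(spam_texts, n=2))
print("non-spam:", most_common_words(ham_texts, n=2))
```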

In [None]:
plot_most_common_words(df=df, N=10)

## 💡 Observations

**Frequent "spammish" words**: 

* free
* email
* click
* business
* money

**Frequent "non-spammish" words**:

* just
* like
* linux
* wrote
* users  

**Occur in both top 10 but could be useful for distinctions**:

* list
* time

**Occur in both top 10 but are unlikely to be useful**:

* people
* mail

In [None]:
plot_most_common_words(df=df, N=10)

## 💡 Observations

As we use more top words we get more overlap between the classes.  
However, words like _email_ or _free_ are still much more frequent in the **spam** class.

<div class="alert alert-success">
    
Let's change `N=10` to `N=20` and compare the outcome.
</div>

### What about "spammish" signatures?

* Does spam contain more HTML tags? 
* Does non-spam contain more URLs and E-mail addresses? 
* Are spam mails longer than non-spam mails? 
* ...

Let's find out!
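Such "signature" features are simple counts computed on the raw (uncleaned) text. The following sketch shows one way to compute them. The function `signature_features`, its feature names and its regular expressions are assumptions for illustration and may differ from what `get_features` in `tools.py` does.

```python
import re
import string


def signature_features(raw_text):
    """Count a few 'spammish' signatures in one raw email text."""
    return {
        "n_html_tags": len(re.findall(r"<[^>]+>", raw_text)),
        "n_urls": len(re.findall(r"(?:https?://|www\.)\S+", raw_text)),
        "n_emails": len(re.findall(r"\S+@\S+", raw_text)),
        "n_chars": len(raw_text),
        "n_uppercase": sum(ch.isupper() for ch in raw_text),
        "n_punctuation": sum(ch in string.punctuation for ch in raw_text),
    }


example = "<b>WIN</b> now at http://example.com! Mail bob@example.com"
features_example = signature_features(example)
print(features_example)
```

Comparing the distribution of each count between the two classes then shows which signatures actually separate spam from non-spam.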

In [None]:
features = get_features(df=df)

## Feature engineering :: Extracting features from text

Computers don't understand natural language. So, how do we represent text?

One of the simplest but effective and commonly used models to represent text for machine learning is the ***Bag of Words*** model ([online documentation](https://en.wikipedia.org/wiki/Bag-of-words_model)). When using this model, we discard most of the structure of the input text (word order, chapters, paragraphs, sentences and formatting) and only count how often each word appears in each text. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag".  

**Example:** Let our toy corpus contain four documents.

$ corpus = ['I\;enjoy\;paragliding.',  $  
$\hspace{2cm}'I\;like\;NLP.',$  
$\hspace{2cm}'I\;like\;deep\;learning.',$  
$\hspace{2cm}'O\;Captain!\;my\;Captain!']$ 

In [None]:
show_bag_of_words_vector()

Bag of Words has converted all documents into numeric vectors. Each column represents a word from the corpus and each row one of the four documents. The value in each cell represents the number of times that word appears in a specific document. For example, the fourth document has the word `captain` occurring twice and the words `my` and `O` occurring once.
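The same table can be built by hand for the toy corpus. This is a minimal sketch and not the code behind `show_bag_of_words_vector`. It assumes that words are lowercased (so `O` appears as `o`), that punctuation is dropped and that the columns are sorted alphabetically.

```python
import re
from collections import Counter

corpus = [
    "I enjoy paragliding.",
    "I like NLP.",
    "I like deep learning.",
    "O Captain! my Captain!",
]


def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


# One column per distinct word in the corpus
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})


def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


matrix = [bag_of_words(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has one entry per vocabulary word, so all documents end up as vectors of the same length regardless of how long the original text was.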

## Build a spam detector

In the previous section, we saw how to perform text preprocessing and feature extraction from text. We are now ready to build our machine learning model for detecting spam. We will use a Logistic Regression classifier.

First, we split the data into two sets: the `train` set and the `test` set. We then use the train set to `fit` our model with 5-fold cross-validation, so the validation sets are created automatically from the train set. The test set is used to `evaluate` the performance of our model. 
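To make the splitting scheme concrete, here is a small sketch that works on sample indices only. It is not the implementation of `train_test_split_` or `fit_model`: the function names, the 25% test fraction and the random seed are assumptions, and the sketch does not stratify by class.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle the sample indices and hold out a fraction as test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, k=5):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


train_idx, test_idx = train_test_split_indices(20)
for training, validation in k_fold_indices(train_idx, k=5):
    print(len(training), "training samples,", len(validation), "validation samples")
```

Every training sample is used for validation exactly once, while the test indices never take part in fitting or validation.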

### Baseline

70.3% of the samples are non-spam. A naive baseline model that always predicts non-spam would therefore reach about 70% accuracy while doing very little.
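The baseline accuracy is simply the share of the most frequent class. A short sketch, using a hypothetical helper `majority_baseline_accuracy` and a toy label list instead of the real data:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Toy labels: 7 non-spam (0) and 3 spam (1)
toy_labels = [0] * 7 + [1] * 3
baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_labels)
print(baseline_label, baseline_accuracy)
```

Any useful model has to beat this number clearly, otherwise it has learned nothing beyond the class frequencies.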

### Spam classification

In [None]:
# Train/test splitting
df_train, df_test = train_test_split_(df)

# Fit model on the train data
model = fit_model(df_train)

# Print predictions on test set
plot_confusion_matrix(df_test, model);

**Confusion matrices**  

Confusion matrices are a nice way of evaluating the performance of classification models. Rows correspond to the true classes and columns to the predicted classes. Entries on the main diagonal of the confusion matrix correspond to correct predictions, while the other cells tell us how many mistakes our model made ([online documentation](https://en.wikipedia.org/wiki/Confusion_matrix)).

* The first row represents non-spam mails: 1'192 were correctly classified as 'non-spam', while 24 (~2.0%) were misclassified as 'spam'.
* The second row represents spam mails: 435 were correctly classified as 'spam', while 15 (~3.3%) were misclassified as 'non-spam'.

Our model did quite well!
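The counts quoted above can be turned into summary numbers with a few lines of arithmetic. The sketch below uses only the four counts from the text. Precision and recall for the spam class are added as commonly used summaries; they are not part of the notebook's own output.

```python
# Rows = true class, columns = predicted class (counts quoted in the text)
confusion = [
    [1192, 24],  # true non-spam: predicted non-spam, predicted spam
    [15, 435],   # true spam:     predicted non-spam, predicted spam
]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all mails classified correctly
ham_error_rate = fp / (tn + fp)   # non-spam mails flagged as spam
spam_error_rate = fn / (fn + tp)  # spam mails that slipped through
precision = tp / (tp + fp)        # share of flagged mails that really are spam
recall = tp / (tp + fn)           # share of spam mails that were flagged

print(f"accuracy:        {accuracy:.3f}")
print(f"ham error rate:  {ham_error_rate:.3f}")
print(f"spam error rate: {spam_error_rate:.3f}")
print(f"spam precision:  {precision:.3f}")
print(f"spam recall:     {recall:.3f}")
```

The accuracy lies well above the 70% baseline, which supports the statement that the model did quite well.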

### What did our model learn from the data?

Our logistic regression model has learned which words are the most indicative of non-spam and which words are the most indicative of spam. The positive coefficients on the right correspond to words that, according to the model, are indicative of spam. The negative coefficients on the left correspond to words that, according to the model, are indicative of non-spam.
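To see why the sign of a coefficient tells us which class a word points to, here is a tiny logistic regression trained on bag-of-words counts. It is a from-scratch sketch using plain stochastic gradient descent without regularization on a made-up four-document corpus. The notebook's `fit_model` is not shown here and most likely differs in its solver and settings.

```python
import math

# Hypothetical toy corpus: (cleaned text, label), with 1 = spam and 0 = non-spam
docs = [
    ("free money click", 1),
    ("click free offer", 1),
    ("linux users wrote", 0),
    ("wrote linux list", 0),
]

vocabulary = sorted({word for text, _ in docs for word in text.split()})


def vectorize(text):
    """Bag-of-words counts in vocabulary order."""
    words = text.split()
    return [words.count(word) for word in vocabulary]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.5, n_steps=200):
    """Fit weights and bias with stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(n_steps):
        for features, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = label - p  # positive pushes towards spam, negative towards non-spam
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


X = [vectorize(text) for text, _ in docs]
y = [label for _, label in docs]
weights, bias = fit_logistic_regression(X, y)

coefficients = dict(zip(vocabulary, weights))
for word, coef in sorted(coefficients.items(), key=lambda item: item[1]):
    print(f"{word:>6}: {coef:+.2f}")
```

Words that only occur in spam documents end up with positive coefficients and words that only occur in non-spam documents with negative ones, which is the pattern the coefficient plot visualizes for the real model.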

In [None]:
visualize_coefficients(model, n_top_features=20)

## 💡 Observations

- According to the model, words such as "date", "wrote", "yahoo", "supplied", ... are strong indicators of non-spam.  

- Words such as "click", "removed", "sightings",  ... indicate spam.

- These results are consistent with our earlier analysis. For example, we had identified "wrote", "said" and "linux" as potential indicators of non-spam earlier. Similarly, "click", "credit", "free", and "money" suggested spam.

### Error analysis :: Where does our model fail?

We will now analyze the misclassified mails to get some insight into where the model failed to make correct predictions. The *error_analysis* function below shows the top features responsible for the model's decision to predict a mail as spam or non-spam.
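For a linear model such as logistic regression, the contribution of a word to the decision is its coefficient multiplied by its count in the mail. The sketch below ranks words by that product. The coefficient values, the intercept and the function `explain` are invented for illustration and do not come from the trained model or from `error_analysis`.

```python
from collections import Counter

# Hypothetical coefficients (positive = spam, negative = non-spam)
coefficients = {"click": 1.8, "free": 1.2, "wrote": -1.5, "linux": -2.0}
intercept = -0.3


def explain(text):
    """Return the decision score and the per-word contributions, largest first."""
    counts = Counter(text.split())
    contributions = {
        word: coefficients[word] * count
        for word, count in counts.items()
        if word in coefficients
    }
    score = intercept + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return score, ranked


score, ranked = explain("click click free linux wrote hello")
print("predicted:", "spam" if score > 0 else "non-spam", f"(score {score:+.2f})")
for word, contribution in ranked:
    print(f"{word:>6}: {contribution:+.2f}")
```

In a misclassified mail, the words at the top of this ranking are the ones that pushed the model towards the wrong class.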

In [None]:
error_analysis(df_test, model, doc_nbr=16)