# Google Colab Setup

In [1]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys 
if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aml24-master-class.git
        
    # Copy files required to run the code
    !cp -r "aml24-master-class/text_classification/data" "aml24-master-class/text_classification/tools.py" .
    
    # Install packages via pip
    !pip install -r "aml24-master-class/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)


# Data

We will use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. This dataset contains ~6'000 labeled emails. If you want to learn more about this dataset, check [this](https://spamassassin.apache.org/old/publiccorpus/). (*Note: Datasets of text are called corpora and samples are called documents.*) 

The dataset has been downloaded for you and is available in the *data* folder.

The dataset has been labelled, i.e. for each email we are told whether it has been designated as spam (e.g. because it was flagged by a user) or whether it is considered an example of regular email (non-spam, also called "ham").

Our goal is to explore and compare various feature spaces and machine learning approaches. We use spam emails purely for demonstration and learning purposes: it is a text-based example that everyone is familiar with, and it allows us to highlight the different stages of developing a machine learning application and the decisions involved along the way.


## Data preparation :: Overview

In this notebook we will explore the dataset, do a first analysis and prepare it for different machine learning tasks.

### Task 

We will process the raw data, clean the text and extract additional features in order to prepare it for further analysis and for building our machine learning models.

### Notebook overview

* Load the data
* Text preprocessing
* Feature extraction
* Store cleaned data


## Load the data

In [1]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py

In [2]:
# Load the data
df_source = load_source_data()

8546 emails loaded
Cleaning data set:
2710 duplicate emails found and removed
4 empty emails found and removed

5832 emails remaining

Number of columns: 2
Columns names:
spam_label, text


In [3]:
# If you rerun this cell multiple times you get different samples displayed each time
# OR you can replace the number 3 with a number of your choice
display(df_source.sample(3))

Unnamed: 0,spam_label,text
845,1,"<html> <head> <title>The Soft2Reg Team</title> </head> <link rel=STYLESHEET type=text/css href=css/main.css> <STYLE type=text/css> .Black11 {FONT-SIZE: 11px; COLOR: black; FONT-FAMILY: Verdana; FONT-WEIGHT: normal; TEXT-DECORATION: none} .Red13 {font-size : 13px; font-family :Verdana, Arial, Helvetica, sans-serif;font-weight : bold;color : Red} </style> <body bgcolor=ffffff> <center> <table border=0 width=600 cellpadding=2 cellspacing=0> <tr> <td width=275> <a href=http://www.soft2reg.com> <img src=http://www.soft2reg.com/images/logo.gif border=0 alt=www.soft2reg ></a> </td> <td align=right width=100%> <hr size=0> </td> </tr> </table> <p> <table border=0 width=600 cellpadding=4 cellspacing=0 bgcolor=6699cc> <tr> <td> <font face=arial size=+1 color=000000> <b>Soft2reg.com -- Service Update</b></font> </td> </tr> </table> <p> <table border=0 width=600 cellpadding=4 cellspacing=0> <tr> <td class=Black11> Hello,<br><br> <br>We apologize for the unsolicited e-mail, but we have noticed that you are using an online credit card processor on your web site that charges you much more than what you should be paying. <br><br> We are Soft2Reg.com and we <b>specialize in low cost online credit card processing for developers of electronic products</b>. Our rates are the lowest in the industry - <span class=Red13>8% flat fee</span> for all online credit card transactions. We provide you with a link for every one of your products that contains your logo and you put it on your web site for seamless credit card processing integration. For that simple, yet efficient service we charge only 8% of your online sales. <br><br> If you would like to see more information about us, please visit <a href=http://www.soft2reg.com>http://www.soft2reg.com</a>. <br><br> <HR noShade SIZE=1> <br>Thank you, <br>The Soft2Reg Team <br>----------------------------------- <br><a href=http://www.soft2reg.com>www.soft2reg.com</a>. </td></tr></table> </center> </body> </html>"
2256,0,"URL: http://diveintomark.org/archives/2002/09/23.html#now_heavily_medicated Date: 2002-09-23T16:11:02-05:00 Trust me when I tell you that heavy medication and RDF do not mix. Here is a list of things I intend to re-read once the fog lifts: - _Phil Ringnalda_: Using FOAF relationships[1] and Just say no to Trackback in index.html[2]. - _Les Orchard_: Per-post comment RSS feed[3]. - _Phil Wainewright_: The bare necessities of RSS[4] and What to do about RDF [5]. The beginning of an RSS 2.0 best practices document. - _Jonathon Delacour_: Trying to score a goal[6]. &#8220;As the best and the brightest focus on the possibilities of FOAF, I turned my attention to yesterday's news: RSS.&#8221; No, RSS will always be today's news. Get it? Today's newzzz... Never mind. - Comments on Ben Hammersley's Friend of a Friend[7]. Various ways to link to a FOAF file from an RSS feed. - _Nicholas Chase_: The Web's future: XHTML 2.0[8]. We're losing backward compatibility, isn't that great? Well, he seems to think so. - mod_cc[9], a module for including copyright information in RDF documents such as RSS 1.0 feeds, and, I hope, FOAF files. - _Shelley Powers_: Who is your audience, and what are you trying to accomplish?[10] Addressing the growing identity crisis on the RSS-DEV mailing list[11]. Also the comments on Shelley's article[12]. - _Ian Hickson_: Pingback 1.0[13]. &#8220;The best thing about this idea is that unlike similar schemes like TrackBack, it is totally transparent to both users.&#8221; - New software helps in building of accessible web sites[14]. A press release for a new edition of LIFT[15], which I have never used. - Forget Mars bars, Twinkies now the deep-fried treat[16]. &#8220;The secret to making a deep-fried Twinkie, he says, is to place it in the fridge first to give it more stability. He then rolls it in flour, covers it with batter ... 
and plunks it into the oil.&#8221; [1] http://philringnalda.com/archives/002324.php [2] http://philringnalda.com/archives/0..."
3915,0,"To update spamassasin, all I need to do is install the new tar.gz file as if it were a new installation? I don't need to stop incoming mail or anything like that? Thanks, Mike -- Michael Clark, Webmaster Center for Democracy and Technology 1634 Eye Street NW, Suite 1100 Washington, DC 20006 voice: 202-637-9800 http://www.cdt.org/ Join our Activist Network! Your participation can make a difference! http://www.cdt.org/join/ ------------------------------------------------------- This sf.net email is sponsored by: Jabber - The world's fastest growing real-time communications platform! Don't just IM. Build it in! http://www.jabber.com/osdn/xim _______________________________________________ Spamassassin-talk mailing list Spamassassin-talk@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/spamassassin-talk"


## Text preprocessing

Good text preprocessing is an essential part of every NLP project. It is the first step in the machine learning pipeline and it is important to get it right. The goal of text preprocessing is to transform the raw text into a format that can be used by machine learning algorithms.

Our overall goal is to build models that can help us distinguish non-spam from spam. 

The examples above have shown us that some samples are quite messy and contain a lot of content that a human does not need in order to understand the text, i.e. they contain "noise". As a first step we will "*clean*" and "*standardize*" the raw text. Our aim is to keep as many "*informative*" words as possible, while discarding the "*uninformative*" ones. Removing the noise from our texts will help to improve the accuracy of our models.

We thus need to identify which parts of the text are acting as "*noise*" in our text and remove it.

## Your Task:

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ What parts of the text do you think are noise?
   
__Q2.__ What should we do with these parts of the text?
</div>


#### Give your answer here:

1.    





2. 







## 💡 Observations

1. There are some items in the text that should be removed or normalized to make it readable. Here are some suggestions:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Upper/lower case differences (e.g. Dog vs dog, ...)
* English STOPWORDS (e.g. a, is, my, i, all, and, by...)
* ...

2. From experience, we know that the number of occurrences of the above items (HTML tags, URLs, etc) can be helpful to distinguish spam from non-spam. Similarly, the length of the emails and the frequency of punctuation marks or upper case letters could also give us clues as to whether we are dealing with spam or not.

The *clean_corpus* function below will take care of the parts raised in the 1st set of observations.

In [5]:
df_cleaned = clean_corpus(df_source)

Number of samples: 5832
Number of columns: 3
Columns names:
spam_label, text, text_cleaned

Number of duplicate cleaned texts found: 279
Number of empty texts found: 27

Email texts cleaned
Number of samples: 5832


In [14]:
# Let's look at some examples.
# You can rerun this cell to get a different sample
show_clean_text(df_cleaned)


Original document:

Best bang for the buck: 390 hp Mustang Cobra @ $35k. GG, recently emerged from the passenger side of
a 2003 with the largest smile. -----Original Message----- From: fork-admin@xent.com [mailto:fork-
admin@xent.com]On Behalf Of Joseph S. Barrera III Sent: Sunday, December 01, 2002 1:40 PM To:
fork@spamassassin.taint.org Subject: Re: Mercedes-Benz G55 On the other end of the spectrum, I just
bought a Honda del Sol for my new commute to San Jose (from San Bruno)... I wonder if it would fit
in the cargo area of a G55. - Joe

Cleaned document:

best bang buck mustang cobra recently emerged passenger largest smile original message behalf joseph
barrera sent sunday december subject mercedes benz spectrum just bought honda commute jose bruno
wonder cargo area
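
The kind of cleaning shown above can be sketched with a few regular expressions. This is a minimal illustration, not the actual `clean_corpus` implementation: the function name `clean_text`, the regular expressions, and the very short stop word list are assumptions made for this example.

```python
import html
import re

# Hypothetical, deliberately tiny stop word list (for illustration only)
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by", "the", "for", "of", "to"}


def clean_text(text):
    """Remove typical noise from a raw email text (simplified sketch)."""
    text = html.unescape(text)                            # decode entities such as &#8220;
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                  # e-mail addresses
    text = re.sub(r"[^A-Za-z]+", " ", text)               # punctuation, digits, symbols
    text = text.lower()                                   # case conversion
    # Splitting and re-joining also collapses multiple whitespaces
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)


raw = "<b>Hello</b>, visit http://www.example.com or mail me@example.com by 2002!"
print(clean_text(raw))
```

The order of the steps matters: URLs and e-mail addresses are removed before punctuation, because otherwise their fragments (e.g. `www`, `com`) would survive as ordinary words.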


## Feature engineering 

## Part 1: Extracting numeric features

We start with the ideas from the 2nd observation and create new features that count different noise components of the text.

In [15]:
num_features_df = extract_numeric_features(df=df_source, with_labels=True)

Number of samples and columns of input: (5832, 2)
Number of columns: 2
Columns names:
spam_label, text

Numeric features extracted
Data size: (5832, 14)
Number of columns: 14
Columns names:
email_counts, html tag_counts, url_counts, Twitter username_counts, hashtag_counts
character_counts, word_counts, unique word_counts, punctuation mark_counts, uppercase word_counts
lowercase word_counts, digit_counts, alphabetic char_counts, spam_label
Numeric features saved to data/num_features.csv
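
To give an idea of what such count features look like, here is a minimal sketch for a single text. It is not the code behind `extract_numeric_features`: the function name `count_features` and the regular expressions are assumptions, and only some of the columns listed above are reproduced.

```python
import re
import string


def count_features(text):
    """Count a few noise components of a raw text (simplified sketch)."""
    words = text.split()
    return {
        "email_counts": len(re.findall(r"\S+@\S+", text)),
        "html tag_counts": len(re.findall(r"<[^>]+>", text)),
        "url_counts": len(re.findall(r"(?:https?://|www\.)\S+", text)),
        "character_counts": len(text),
        "word_counts": len(words),
        "unique word_counts": len(set(words)),
        "punctuation mark_counts": sum(ch in string.punctuation for ch in text),
        "uppercase word_counts": sum(w.isupper() for w in words),
        "digit_counts": sum(ch.isdigit() for ch in text),
    }


sample = "FREE offer!!! Visit http://www.example.com or mail sales@example.com <br>"
features = count_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that these counts are computed on the raw text (`df_source`), not on the cleaned text, since cleaning removes exactly the components we want to count.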


## Feature engineering

## Part 2: Extracting features from text

Computers don't understand natural language and its unstructured form. So, how do we represent text?

### Bag of words

One of the simplest models for representing text for machine learning, and one that was effective and widely used in the early days of NLP, is the ***Bag of Words*** model ([link](https://en.wikipedia.org/wiki/Bag-of-words_model)). When using this model, we discard most of the structure of the input text (word order, chapters, paragraphs, sentences or formatting) and only count how often each word appears in each text. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag". 


**Example:** Let our toy corpus contain four documents.

```python
corpus = [
    'I enjoy paragliding.',
    'I do like NLP.',
    'I like deep learning.',
    'O Captain! my Captain!'
]
```

In [16]:
show_bag_of_words_vector()

| Text | captain | deep | do | enjoy | i | learning | like | my | nlp | o | paragliding |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I enjoy paragliding. | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| I do like NLP. | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| I like deep learning. | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| O Captain! my Captain! | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |


In the table above, each column represents a word from the corpus and each row one of the four documents. The value in each cell represents the number of times that word appears in a specific document. For example, the fourth document has the word `captain` occurring twice and the words `my` and `O` occurring once.

In scikit-learn, the Bag of Words model is implemented by the `CountVectorizer`. It converts each document into a row of numbers, i.e. a numeric vector, hence the name vectorizer.

While this kind of transformation allows machine learning algorithms to process text data effectively, it has a drawback. It treats all words as independent and ignores the context in which they appear. For example, losing information about the order of the words in the text can change the meaning of a sentence. The sentences "I do like NLP", "Do I like NLP" or "NLP like I do" have the same set of words but different meanings. 
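
The counting step and the loss of word order can be reproduced in a few lines of plain Python. This is a sketch for illustration, not the code behind `show_bag_of_words_vector`; the helper names `tokenize` and `bag_of_words` and the simple `\w+` tokenization are assumptions.

```python
import re
from collections import Counter

corpus = [
    "I enjoy paragliding.",
    "I do like NLP.",
    "I like deep learning.",
    "O Captain! my Captain!",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


# The vocabulary is the sorted set of all words in the corpus
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})


def bag_of_words(text):
    """Return the count of each vocabulary word in the given text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


print(vocabulary)
for doc in corpus:
    print(bag_of_words(doc), doc)

# Word order is lost: these sentences get identical vectors
same = bag_of_words("I do like NLP") == bag_of_words("NLP like I do")
print("Identical vectors:", same)
```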

### TF-IDF

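The **Term Frequency–Inverse Document Frequency** (TF-IDF) approach does not recover the word order, but it addresses another weakness of raw counts: words that occur in almost every document receive the same weight as rare, informative ones. TF-IDF instead measures how important a word is for a document relative to a collection of documents (the corpus). 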

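We use the implementation by scikit-learn. It calculates the TF-IDF score as the product of:
- the **term frequency TF**, which is the number of occurrences $f(w, d)$ of the word $w$ in the given document $d$ divided by the total number of words $N(d)$ in that document.   
  So $TF(w, d) = \frac{f(w, d)}{N(d)}$
- and the (smoothed) **inverse document frequency IDF**, which is given by 
$$IDF(w, D) = \log\left(\frac{size(D)+1}{df(w, D)+1}\right)+1$$ 
where $df(w, D)$ is the number of documents in the corpus $D$ that contain the word $w$ and $\log$ is the natural logarithm. Adding `1` in the numerator and denominator avoids a division by zero and keeps the IDF value finite and stable.

Note that scikit-learn's `TfidfVectorizer` by default uses the raw count $f(w, d)$ as term frequency and then rescales each document vector to unit Euclidean length. Because of this final normalization the constant factor $1/N(d)$ cancels, so both definitions lead to the same vectors.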

This way, common words that appear in many documents (small IDF) are given less weight while rare words that appear in only a few documents get a higher weight (high IDF).

In [17]:
show_tfidf_vector()

| Text | captain | deep | do | enjoy | i | learning | like | my | nlp | o | paragliding |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I enjoy paragliding. | 0.0 | 0.0 | 0.0 | 0.644503 | 0.411378 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.644503 |
| I do like NLP. | 0.0 | 0.0 | 0.57458 | 0.0 | 0.366747 | 0.0 | 0.453005 | 0.0 | 0.57458 | 0.0 | 0.0 |
| I like deep learning. | 0.0 | 0.57458 | 0.0 | 0.0 | 0.366747 | 0.57458 | 0.453005 | 0.0 | 0.0 | 0.0 | 0.0 |
| O Captain! my Captain! | 0.816497 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.408248 | 0.0 | 0.408248 | 0.0 |
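
The values in the TF-IDF table can be checked by hand. The sketch below applies the TF and IDF formulas to the toy corpus in plain Python and then rescales each document vector to unit Euclidean length, which is an assumption about how the table was produced (it is the default behaviour of scikit-learn's `TfidfVectorizer`). The helper names `tokenize`, `idf` and `tfidf` are chosen for this example only.

```python
import math
import re
from collections import Counter

corpus = [
    "I enjoy paragliding.",
    "I do like NLP.",
    "I like deep learning.",
    "O Captain! my Captain!",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


docs = [Counter(tokenize(doc)) for doc in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)


def idf(word):
    """Smoothed inverse document frequency (natural logarithm)."""
    df = sum(word in doc for doc in docs)
    return math.log((n_docs + 1) / (df + 1)) + 1


def tfidf(doc):
    """TF * IDF for every vocabulary word, scaled to unit Euclidean length."""
    n_words = sum(doc.values())
    weights = [doc[word] / n_words * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


# Compare with the first and the last row of the table
first = dict(zip(vocabulary, tfidf(docs[0])))
last = dict(zip(vocabulary, tfidf(docs[3])))
print(round(first["enjoy"], 6), round(first["i"], 6))
print(round(last["captain"], 6), round(last["my"], 6))
```

In the first document the word `i` appears in three of the four documents and therefore gets a lower weight than `enjoy` and `paragliding`, which appear only once in the corpus.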


Below you can extract the text features using either the CountVectorizer (`vectorizer="count"`) or the TfidfVectorizer (`vectorizer="tfidf"`). Please note that this process takes a while, so be patient.

For that reason, we have already pre-computed the features using `"tfidf"` and stored them in the `features` folder. You can load them using the command `load_feature_space(features="text")`.

In [18]:
text_features_df = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
text_features_df.shape

TF-IDF Vectorizer selected
Number of columns: 10001
First 5 names:
aalib, aall, aaron, abacha, abandon
Last 5 columns:
zoom, zope, zurich, zyban, spam_label
Text features saved to data/text_features_tfidf.csv


(5832, 10001)

### Embeddings

The Bag of Words and TF-IDF approaches cannot capture the meaning of words or the relationships between them. They also lead to very high-dimensional and sparse representations of the text, which are not very efficient and can lead to overfitting.
To address these limitations, we can use **embeddings** or transformer-based models. Embeddings are dense vector representations of words that are learned from large corpora of text. By representing similar words as similar vectors, they can capture meaning and relationships in a continuous, lower-dimensional vector space.
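
A common way to measure how "similar" two such vectors are is the cosine similarity. The sketch below uses three tiny, hand-made vectors; they are invented for illustration only and are not taken from any real language model or from the embeddings used in this notebook.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.3, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["cheap"], toy_embeddings["inexpensive"])
sim_unrelated = cosine_similarity(toy_embeddings["cheap"], toy_embeddings["meeting"])
print(f"cheap vs inexpensive: {sim_related:.2f}")
print(f"cheap vs meeting:     {sim_unrelated:.2f}")
```

In a Bag of Words representation, `cheap` and `inexpensive` would simply be two unrelated columns. With embeddings, related words can end up close to each other in the vector space.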

We have passed the email texts through a language model to generate the associated embeddings. Since the feature extraction takes some time we have stored these embeddings and made them available for you in the file named `email_embeddings.csv`.

You can load them using the command `load_feature_space(features="embedding")`.

In [19]:
embeddings_df = load_feature_space(features="embedding")

Email embeddings loaded
Data includes labels in the column 'spam_label'
The data set has 5832 rows, 769 columns
