# Google Colab Setup

In [1]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys 
if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aiml2days.git
        
    # Copy files required to run the code
    !cp -r "aiml2days/notebooks/data" "aiml2days/notebooks/data_prep_tools.py" "aiml2days/notebooks/EDA_tools.py" "aiml2days/notebooks/modeling_tools.py" .
    
    # Install packages via pip
    !pip install -r "aiml2days/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)


# Data

We will use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. This dataset contains ~6'000 labeled emails. To learn more about it, see the [public corpus page](https://spamassassin.apache.org/old/publiccorpus/). (*Note: Datasets of text are called corpora and samples are called documents.*)

The dataset has been downloaded for you and is available in the *data* folder.

The dataset has been labelled, i.e. we are told whether an email has been designated as spam, e.g. because it was flagged by a user, or whether it is considered an example of regular email (non-spam, also called "ham").

Our goal is to explore and compare various feature spaces and machine learning approaches. The use of spam emails is just for demonstration and learning purposes: it is a text-based example that everyone is familiar with, and it allows us to highlight the different stages of developing a machine learning application and the decisions involved along the way.


## Data preparation :: Overview

In this notebook we will explore the dataset, do a first analysis and prepare it for different machine learning tasks.

### Task 

We will process the raw data, clean the text and extract additional features in order to prepare it for further analysis and for building our machine learning models.

### Notebook overview

* Load the data
* Text preprocessing
* Feature extraction
* Store cleaned data


## Load the data

In [1]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py

In [2]:
# Load the data
df_source = load_source_data()

8546 emails loaded
Cleaning data set:
2710 duplicate emails found and removed
4 empty emails found and removed

5832 emails remaining

Number of columns: 2
Columns names:
spam_label, text


In [3]:
# If you rerun this cell multiple times you get different samples displayed each time
# OR you can replace the number 3 with a number of your choice
display(df_source.sample(3))

Unnamed: 0,spam_label,text
1780,1,"<META HTTP-EQUIV=3D""Content-Type"" CONTENT=3D""text/html;charset=3Diso-8859-1= ""> <!DOCTYPE HTML PUBLIC ""-//W3C//DTD HTML 4.0 Transitional//EN""> <html xmlns:v=3D""urn:schemas-microsoft-com:vml"" xmlns:o=3D""urn:schemas-microsoft-com:office:office"" xmlns:w=3D""urn:schemas-microsoft-com:office:word"" xmlns=3D""http://www.w3.org/TR/REC-html40""> <head> <meta http-equiv=3DContent-Type content=3D""text/html; charset=3Diso-8859-1= ""> <meta name=3DProgId content=3DWord.Document> <meta name=3DGenerator content=3D""Microsoft Word 9""> <meta name=3DOriginator content=3D""Microsoft Word 9""> <link rel=3DFile-List href=3D""./Alarms,%20New%20Hours_files/filelist.xml""> <title>Alarms, New Hours .htm</title> <!--[if gte mso 9]><xml> <o:DocumentProperties> <o:Author>TGSWEL</o:Author> <o:LastAuthor>TGSWEL</o:LastAuthor> <o:Revision>2</o:Revision> <o:TotalTime>8</o:TotalTime> <o:Created>2002-05-11T16:41:00Z</o:Created> <o:LastSaved>2002-05-11T16:41:00Z</o:LastSaved> <o:Pages>2</o:Pages> <o:Words>415</o:Words> <o:Characters>2370</o:Characters> <o:Company>WCS</o:Company> <o:Lines>19</o:Lines> <o:Paragraphs>4</o:Paragraphs> <o:CharactersWithSpaces>2910</o:CharactersWithSpaces> <o:Version>9.4402</o:Version> </o:DocumentProperties> </xml><![endif]--> <style> <!-- /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {mso-style-parent:""""; margin:0in; margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:12.0pt; font-family:""Times New Roman""; mso-fareast-font-family:""Times New Roman"";} p {font-size:12.0pt; font-family:""Times New Roman""; mso-fareast-font-family:""Times New Roman"";} @page Section1 {size:8.5in 11.0in; margin:1.0in 1.25in 1.0in 1.25in; mso-header-margin:.5in; mso-footer-margin:.5in; mso-paper-source:0;} div.Section1 {page:Section1;} --> </style> <!--[if gte mso 9]><xml> <o:shapedefaults v:ext=3D""edit"" spidmax=3D""1027""/> </xml><![endif]--><!--[if gte mso 9]><xml> <o:shapelayout v:ext=3D""edit""> <o:idmap v:ext=3D""edit"" 
data=3D""1""/> </o:shapelayout></xml><![endif]--> </head> ..."
682,0,"Netscape Valued Customer: Announcing the NEW Netscape 7.0 Browser! Download it now for FREE! For more information and to download now, click here: http://dms-www01.netcenter.com/cgi-bin/gx.cgi/mcp?p=041Lnf041LoE51b9er012000W2TBvFJ Get Netscape 7.0 - the Fastest Netscape Browser Ever! It's easier than ever to upgrade to Netscape 7.0, and keep your bookmarks and browser settings. Upgrade to the NEW Netscape 7.0, and get the most from your time on-line. And Netscape 7.0 is also available on CD for FREE! Click here to order the CD and Guidebook: http://dms-www01.netcenter.com/cgi-bin/gx.cgi/mcp?p=041Lnh041LoE51b9er012000W2TBvFJ The Netscape Browser Team ----------------------------------------------------------- Netscape respects your online time and Internet privacy. If you would prefer not to receive future email messages from Netscape Netbusiness, please click on the following link or simply reply to this email and type ""REMOVE"" in the subject line. PLEASE NOTE: DO NOT CLICK ON THE LINK UNLESS YOU WANT TO UNSUBSCRIBE. http://dms-www01.netcenter.com/cgi-bin/gx.cgi/mcp?p=041LoB3SLb41LoE51b9er012000W2TBvFJ You are subscribed with:[ler@lerami.lerctr.org] (c) 2002 Netscape. All Rights Reserved. Privacy Policy, Terms of Service. http://channels.netscape.com/ns/browsers/download.jsp"
368,1,"Hi Job Seeker, When you create a FREE My Net-Temps account, you have access to the tools and resources for finding your next job more effectively. It only takes a minute. Plus you get news and tips specific to your profession. Everything you need is in your My Net-Temps account! * Post up to 3 resumes with the Build My Resume and Copy & Paste tools * Setup custom search agents to automatically receive job leads * Resume statistics to track your success * Resume, cover letter and thank you letter writing tips * Salary calculator * Career articles and weekly newsletter * Access to over 7500 recruiters that can help your job search Setup your account now at http://www.net-temps.com/careerdev/ The Net-Temps Team www.net-temps.com"


## Text preprocessing

Good text preprocessing is an essential part of every NLP project. It is the first step in the machine learning pipeline and it is important to get it right. The goal of text preprocessing is to transform the raw text into a format that can be used by machine learning algorithms.

Our overall goal is to build models that can help us distinguish non-spam from spam. 

The examples above have shown us that some samples are quite messy and contain a lot of content that a human does not need to understand the text, i.e. they contain "noise". As a first step we will "*clean*" and "*standardize*" the raw text. Our aim is to keep as many "*informative*" words as possible, while discarding the "*uninformative*" ones. Removing the noise from our texts will help to improve the accuracy of our models.

We thus need to identify which parts of the text are acting as "*noise*" and remove them.

## Your Task:

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ What parts of the text do you think are noise?
   
__Q2.__ What should we do with these parts of the text?
</div>


#### Give your answer here:

1.    





2. 







## 💡 Observations

1. Some items in the text should be removed or normalized to make it readable. Here are some suggestions:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Upper/lower case differences (e.g. Dog vs dog, ...)
* English stop words (e.g. a, is, my, i, all, and, by...)
* ...

2. From experience, we know that the number of occurrences of the above items (HTML tags, URLs, etc) can be helpful to distinguish spam from non-spam. Similarly, the length of the emails and the frequency of punctuation marks or upper case letters could also give us clues as to whether we are dealing with spam or not.

The *clean_corpus* function below will take care of the parts raised in the 1st set of observations.

In [4]:
df_cleaned = clean_corpus(df_source)

Number of samples: 5832
Number of columns: 3
Columns names:
spam_label, text, text_cleaned

Number of duplicate cleaned texts found: 279
Number of empty texts found: 27

Email texts cleaned
Number of samples: 5832
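
To make the cleaning steps from the first observation more concrete, here is a minimal sketch using only Python's standard library. It is not the actual `clean_corpus` implementation: the function name `clean_text_sketch`, the regular expressions and the tiny stop word list are assumptions chosen for illustration.

```python
import re

# Tiny illustrative stop word list (assumption; real lists are much longer)
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by", "the", "to", "of"}

def clean_text_sketch(text):
    """Apply the cleaning steps listed above to a single document."""
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # e-mail addresses
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # punctuation and digits
    text = text.lower()                                  # case conversion
    # split() also collapses multiple whitespaces
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)

example = "<b>Hello</b> Adam, visit http://example.com or mail me@example.org by 2002!"
print(clean_text_sketch(example))  # hello adam visit or mail
```

The order of the steps matters in this sketch: URLs and e-mail addresses are removed before punctuation, because stripping punctuation first would break them into ordinary-looking words.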


In [5]:
# Let's look at some examples.
# You can rerun this cell to get a different sample
show_clean_text(df_cleaned)


Original document:

Hello Adam, Thursday, September 05, 2002, 11:33:18 PM, you wrote: ALB> So, you're saying that
product bundling works? Good point. Sometimes I wish I was still in CA. You deserve a good beating
every so often... (anyone else want to do the honors?) ALB> And how is this any different from
"normal" marriage exactly? Other then ALB> that the woman not only gets a man, but one in a country
where both she and ALB> her offspring will have actual opportunities? Oh and the lack of ALB> "de-
feminized, over-sized, self-centered, mercenary-minded" choices? Mmkay. For the nth time Adam, we
don't live in the land of Adam-fantasy. Women actually are allowed to do things productive,
independent and entirely free of their male counterparts. They aren't forced to cook and clean and
merely be sexual vessels. Sometimes, and this will come as a shock to you, no doubt, men and women
even find -love- (which is the crucial distinction between this system) and they marry one another
for t

## Feature engineering 

## Part 1: Extracting numeric features

We start with the ideas from the 2nd observation and create new features that count different noise components of the text.

In [6]:
num_features_df = extract_numeric_features(df=df_source, with_labels=True)

Number of samples and columns of input: (5832, 2)
Number of columns: 2
Columns names:
spam_label, text

Numeric features extracted
Data size: (5832, 14)
Number of columns: 14
Columns names:
email_counts, html tag_counts, url_counts, Twitter username_counts, hashtag_counts
character_counts, word_counts, unique word_counts, punctuation mark_counts, uppercase word_counts
lowercase word_counts, digit_counts, alphabetic char_counts, spam_label
Numeric features saved to data/num_features.csv
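
As a rough illustration of what such count features look like, the sketch below computes a few of them for a single text. It is not the code behind `extract_numeric_features`: the function name `count_features_sketch`, the feature names and the regular expressions are assumptions.

```python
import re
import string

def count_features_sketch(text):
    """Count a few noise components of a raw email text."""
    words = text.split()
    return {
        "html_tag_count": len(re.findall(r"<[^>]+>", text)),
        "url_count": len(re.findall(r"(?:https?://|www\.)\S+", text)),
        "email_count": len(re.findall(r"\S+@\S+", text)),
        "character_count": len(text),
        "word_count": len(words),
        # a word counts as uppercase if all of its letters are uppercase
        "uppercase_word_count": sum(w.isupper() for w in words),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
    }

sample = "FREE offer!!! Visit http://example.com or mail me@example.org <b>NOW</b> 24/7"
features = count_features_sketch(sample)
print(features["url_count"], features["html_tag_count"], features["digit_count"])  # 1 2 3
```

Note that these counts are computed on the raw text, not on the cleaned text, since cleaning removes exactly the components being counted.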


## Feature engineering

## Part 2: Extracting features from text

Computers don't understand natural language and its unstructured form. So, how do we represent text?

### Bag of words

One of the simplest models to represent text for machine learning is the ***Bag of Words*** model ([link](https://en.wikipedia.org/wiki/Bag-of-words_model)), which was effective and commonly used in the early days of NLP. When using this model, we discard most of the structure of the input text (word order, chapters, paragraphs, sentences or formatting) and only count how often each word appears in each text. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag". 


**Example:** Let our toy corpus contain four documents.

```python
corpus = [
    'I enjoy paragliding.',
    'I do like NLP.',
    'I like deep learning.',
    'O Captain! my Captain!'
]
```

In [13]:
show_bag_of_words_vector()

Text,captain,deep,do,enjoy,i,learning,like,my,nlp,o,paragliding
I enjoy paragliding.,0,0,0,1,1,0,0,0,0,0,1
I do like NLP.,0,0,1,0,1,0,1,0,1,0,0
I like deep learning.,0,1,0,0,1,1,1,0,0,0,0
O Captain! my Captain!,2,0,0,0,0,0,0,1,0,1,0


In the table above, each column represents a word from the corpus and each row one of the four documents. The value in each cell is the number of times that word appears in the given document. For example, in the fourth document the word `captain` occurs twice and the words `my` and `o` occur once each.

In scikit-learn, the Bag of Words model is implemented by the `CountVectorizer`. It converts each document into a row of numbers, i.e. a numeric vector, hence the name vectorizer.

While this kind of transformation allows machine learning algorithms to process text data effectively, it has a drawback. It treats all words as independent and ignores the context in which they appear. For example, losing information about the order of the words in the text can change the meaning of a sentence. The sentences "I do like NLP", "Do I like NLP" or "NLP like I do" have the same set of words but different meanings. 
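
The counting itself can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code behind `show_bag_of_words_vector`: the helper names `tokenize` and `bag_of_words` are made up, and the tokenizer deliberately keeps one-letter words such as `i` and `o` so that the vocabulary matches the table above.

```python
import re
from collections import Counter

corpus = [
    'I enjoy paragliding.',
    'I do like NLP.',
    'I like deep learning.',
    'O Captain! my Captain!'
]

def tokenize(text):
    # Lowercase the text and keep runs of word characters (including single letters)
    return re.findall(r"\w+", text.lower())

# The vocabulary is the sorted set of all words in the corpus
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def bag_of_words(text):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

bow_matrix = [bag_of_words(doc) for doc in corpus]
print(vocabulary)
print(bow_matrix[3])  # [2, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# Word order is lost: both sentences map to the same vector
print(bag_of_words("I do like NLP") == bag_of_words("NLP like I do"))  # True
```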

### TF-IDF

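The **Term Frequency–Inverse Document Frequency** approach addresses a different limitation of raw counts, namely that every word is weighted equally regardless of how informative it is. It measures how important a word is for a document relative to a collection of documents (the corpus). Like Bag of Words, it still ignores word order.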

We use the implementation by scikit-learn. The TF-IDF score is the product of:
- the **term frequency TF**, which is the ratio of the frequency of the word $w$ in the given document $d$ divided by the total number of words in the given document,   
  so $TF(w, d) = \frac{f(w, d)}{N(d)}$,
- and the (smoothed) **inverse document frequency IDF**, which is given by 
$$IDF(w, D) = \log\left(\frac{size(D)+1}{df(w, D)+1}\right)+1$$ 
where $df(w, D)$ is the number of documents in the corpus $D$ that contain the word $w$. Adding `1` in the numerator and denominator avoids a division by zero and keeps the IDF value finite.

With its default settings, scikit-learn uses the raw count $f(w, d)$ as the term frequency and then rescales each document vector to unit Euclidean length. Because of this rescaling, dividing by $N(d)$ does not change the final scores.

This way, common words that appear in many documents (small IDF) are given less weight while rare words that appear in only a few documents get a higher weight (high IDF).
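
The formulas above can be checked by hand on the toy corpus. The following sketch is plain Python rather than the scikit-learn code behind `show_tfidf_vector`, and the helper names are assumptions. It applies the TF and IDF definitions as written, with the natural logarithm, and then rescales each document vector to unit Euclidean length, which is what scikit-learn does by default.

```python
import math
import re
from collections import Counter

corpus = [
    'I enjoy paragliding.',
    'I do like NLP.',
    'I like deep learning.',
    'O Captain! my Captain!'
]

def tokenize(text):
    # Lowercase and keep runs of word characters (including single letters)
    return re.findall(r"\w+", text.lower())

docs = [tokenize(doc) for doc in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# df(w, D): number of documents that contain the word w
doc_freq = {w: sum(w in doc for doc in docs) for w in vocabulary}

# Smoothed inverse document frequency
idf = {w: math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1 for w in vocabulary}

def tfidf_vector(doc):
    counts = Counter(doc)
    # TF(w, d) * IDF(w, D) for every vocabulary word
    scores = [counts[w] / len(doc) * idf[w] for w in vocabulary]
    # Rescale to unit Euclidean length (L2 normalization)
    norm = math.sqrt(sum(s * s for s in scores))
    return [s / norm for s in scores]

tfidf_matrix = [tfidf_vector(doc) for doc in docs]
for word, score in zip(vocabulary, tfidf_matrix[0]):
    if score > 0:
        print(f"{word}: {score:.6f}")
```

For the first document this gives a lower score for `i`, which occurs in three of the four documents, than for `enjoy` and `paragliding`, which each occur in only one.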

In [14]:
show_tfidf_vector()

Text,captain,deep,do,enjoy,i,learning,like,my,nlp,o,paragliding
I enjoy paragliding.,0.0,0.0,0.0,0.644503,0.411378,0.0,0.0,0.0,0.0,0.0,0.644503
I do like NLP.,0.0,0.0,0.57458,0.0,0.366747,0.0,0.453005,0.0,0.57458,0.0,0.0
I like deep learning.,0.0,0.57458,0.0,0.0,0.366747,0.57458,0.453005,0.0,0.0,0.0,0.0
O Captain! my Captain!,0.816497,0.0,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.408248,0.0


Below you can extract the text features using either the CountVectorizer (`vectorizer="count"`) or the TfidfVectorizer (`vectorizer="tfidf"`). Please note that this process takes a while, so be patient.

For that reason, we have already pre-computed the features using `"tfidf"` and stored them in the `features` folder. You can load them using the command `load_feature_space(features="text")`.

In [None]:
text_features_df = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
text_features_df.shape

Number of columns: 10001
First 5 names:
aalib, aall, aaron, abacha, abandon
Last 5 columns:
zoom, zope, zurich, zyban, spam_label
Text features saved to data/text_features.csv


(5832, 10001)

### Embeddings

The Bag of Words and TF-IDF approaches cannot capture the meaning of words or the relationships between them. They also lead to very high-dimensional and sparse representations of the text which are not very efficient and can lead to overfitting.
To address these limitations, we can use **embeddings** or transformer-based models. Embeddings are dense vector representations of words that are learned from large corpora of text. By representing similar words as similar vectors, they capture meaning and relationships in a continuous, lower-dimensional vector space.
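
What "similar vectors" means can be illustrated with cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are invented purely for illustration and do not come from any language model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (made-up numbers, for illustration only)
toy_embeddings = {
    "free": [0.9, 0.1, 0.0],
    "gratis": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_embeddings["free"], toy_embeddings["gratis"])
sim_far = cosine_similarity(toy_embeddings["free"], toy_embeddings["meeting"])
print(f"free vs gratis:  {sim_close:.2f}")
print(f"free vs meeting: {sim_far:.2f}")
```

In a Bag of Words representation, `free` and `gratis` would simply be two unrelated columns. In an embedding space, words with related meanings are expected to end up close to each other.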

We have passed the email texts through a language model to generate the associated embeddings. Since the feature extraction takes some time we have stored these embeddings and made them available for you in the file named `email_embeddings.csv`.

You can load them using the command `load_feature_space(features="embedding")`.

In [None]:
embeddings_df = load_feature_space(features="embedding")

Email embeddings loaded
Data includes labels in the column 'spam_label'
The data set has 5832 rows, 768 columns
