# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys

if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aiml2days.git
        
    # Copy files required to run the code
    !cp -r "aiml2days/notebooks/data" "aiml2days/notebooks/data_prep_tools.py" "aiml2days/notebooks/EDA_tools.py" "aiml2days/notebooks/modeling_tools.py" . 
    
    # Install packages via pip
    !pip install -r "aiml2days/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)


# Data

We will use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. This dataset contains ~6'000 labelled emails. If you want to learn more about this dataset, see the [public corpus page](https://spamassassin.apache.org/old/publiccorpus/). (*Note: Datasets of text are called corpora and samples are called documents.*) 

The dataset has been downloaded for you and is available in the *data* folder.

The dataset has been labelled, i.e. we are told whether an email has been designated as spam (e.g. because it was flagged by a user) or whether it is considered an example of regular email (non-spam, also called "ham"). 

Our goal is to explore and compare various feature spaces and machine learning approaches. Spam emails are used here only for demonstration and learning purposes: they are a text-based example that everyone is familiar with, and they allow us to highlight the different stages of developing a machine learning application and the decisions involved along the way.


## Data preparation :: Overview

In this notebook we will explore the dataset, do a first analysis and prepare it for different machine learning tasks.

### Task 

We will process the raw data, clean the text and extract additional features in order to prepare it for further analysis and for building our machine learning models.

### Notebook overview

* Load the data
* Text preprocessing
* Feature extraction
* Store cleaned data


## Load the data

In [1]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py

In [2]:
# Load the data
df_source = load_source_data()

8546 emails loaded
Cleaning data set:
2710 duplicate emails found and removed
4 empty emails found and removed

5832 emails remaining

Number of columns: 2
Columns names:
spam_label, text


In [3]:
# If you rerun this cell multiple times you get different samples displayed each time
# OR you can replace the number 3 with a number of your choice
display(df_source.sample(3))

Unnamed: 0,spam_label,text
5426,0,"RAH quoted: >Indians are not poor because there are too many of them; they are poor >because there are too many regulations and too much government intervention >-- even today, a decade after reforms were begun. India's greatest problems >arise from a political culture guided by socialist instincts on the one >hand and an imbedded legal obligation on the other hand. Nice theory and all, but s/India/France/g and the statements hold just as true, yet France is #12 in the UN's HDI ranking, not #124. >Since all parties must stand for socialism, no party espouses >classical liberalism I'm not convinced that that classical liberalism is a good solution for countries in real difficulty. See Joseph Stiglitz (Nobel for Economics) on the FMI's failed remedies. Of course googling on ""Stiglitz FMI"" only brings up links in Spanish and French. I guess that variety of spin is non grata in many anglo circles. R http://xent.com/mailman/listinfo/fork"
2918,0,"Update on this for anyone that's interested, and because I like closed threads... nothing worse than an infinite while loop, is there? I ended up formatting a floppy on my flatmate's (un-networked) P100 running FAT16 Win95, and mcopied the contents of the bootdisk across. Now I have a FAT16 Win98 install running alongside Slackware, and can play Metal Gear Solid when the mood takes me ;) /Ciaran. On Wednesday 21 August 2002 16:21, Ciaran Johnston wrote: > Dublin said: > > If you copy the files from your disk to the c: partition and mark it as > > active it should work ... > > Yeah, I figured that, but it doesn't seem to ... well, if that's the case > I'll give it another go tonight, maybe come back with some error messages. > > Just to clarify for those who didn't understand me initially - I have a > floppy drive installed, but it doesn't physically work. There's nowhere > handy to pick one up where I am, and I don't fancy waiting a few days for > one to arrive from Peats. > > Thanks for the answers, > Ciaran. > > > You especially need io.sys, command.com and msdos.sys > > > > your cd driver .sys and read the autoexec.bat and config.sys files for > > hints on what you did with your boot floppy <g> > > > > P > > > > On Wed, 2002-08-21 at 14:07, Ciaran Johnston wrote: > >> Hi folks, > >> The situation is this: at home, I have a PC with 2 10Gig HDDs, and no > >> (working) floppy drive. I have been running Linux solely for the last > >> year, but recently got the urge to, among other things, play some of > >> my Windoze games. I normally install the windows partition using a > >> boot floppy which I have conveniently zipped up, but I haven't any way > >> of writing or reading a floppy. > >> So, how do I go about: > >> 1. formatting a C: drive with system files (normally I would use > >> format /s c: from the floppy). > >> 2. Installing the CDROM drivers (my bootdisk (I wrote it many years > >> ago) does this normally). > >> 3. Booting from the partition? 
> >> > >> I..."
5312,0,"This is the Postfix program at host kci.kciLink.com. #################################################################### # THIS IS A WARNING ONLY. YOU DO NOT NEED TO RESEND YOUR MESSAGE. # #################################################################### Your message could not be delivered for 4.0 hours. It will be retried until it is 5.0 days old. For further assistance, please send mail to <postmaster> The Postfix program <khera@kcilink.com>: connect to yertle.kcilink.com[216.194.193.105]: Operation timed out"


## Text preprocessing

Good text preprocessing is an essential part of every NLP project. It is the first step in the machine learning pipeline and it is important to get it right. The goal of text preprocessing is to transform the raw text into a format that can be used by machine learning algorithms.

Our overall goal is to build models that can help us distinguish non-spam from spam. 

The examples above have shown us that some samples are quite messy and contain a lot of content that a human does not need to understand the text, i.e. they contain "noise". As a first step we will "*clean*" and "*standardize*" the raw text. Our aim is to keep as many "*informative*" words as possible, while discarding the "*uninformative*" ones. Removing the noise from our texts will help to improve the accuracy of our models.

We thus need to identify which parts of the text are acting as "*noise*" and remove them.

## Your Task:

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ What parts of the text do you think are noise?
   
__Q2.__ What should we do with these parts of the text?
</div>


## 💡 Observations

Observations from the discussions in the slide presentation:

1. There are some items in the text that should be removed or normalized to make it readable. Here are some suggestions:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Case conversion (e.g. Dog vs dog, ...)
* English STOPWORDS (e.g. a, is, my, i, all, and, by...)
* ...

2. From experience, we know that the number of occurrences of the above items (HTML tags, URLs, etc) can be helpful to distinguish spam from non-spam. Similarly, the length of the emails and the frequency of punctuation marks or upper case letters could also give us clues as to whether we are dealing with spam or not.

The *clean_corpus* function below will take care of the parts raised in the 1st set of observations.

In [4]:
df_cleaned = clean_corpus(df_source)

Number of samples: 5832
Number of columns: 3
Columns names:
spam_label, text, text_cleaned

Number of duplicate cleaned texts found: 279
Number of empty texts found: 27

Email texts cleaned
Number of samples: 5832
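
The internals of `clean_corpus` are not shown in this notebook. As a rough illustration of the cleaning steps listed in the first observation, here is a minimal sketch that uses only regular expressions. The function name `clean_text`, the patterns and the tiny stopword list are assumptions made for demonstration; they are not the helper's actual implementation.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by", "the", "to", "at"}


def clean_text(text: str) -> str:
    """Apply the cleaning steps from the observations to a single document."""
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # e-mail addresses
    text = text.lower()                                  # case conversion
    # Punctuation marks and digits (this also drops non-ASCII letters).
    text = re.sub(r"[^a-z\s]", " ", text)
    # Remove stopwords; split/join also collapses multiple whitespaces.
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


example = "<b>Visit</b> http://example.com NOW!!  Mail me at bob@example.com by 2002."
print(clean_text(example))
```

The order of the steps matters in this sketch: URLs and e-mail addresses are removed before punctuation, because stripping `:`, `/` and `@` first would leave fragments such as `http` or `example com` behind as ordinary words.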


In [5]:
# Let's look at some examples.
# You can rerun this cell to get a different sample
show_clean_text(df_cleaned)


Original document:

SUMMARY: Everything you need to know about every registered US Business is on one single CD-ROM. The
latest version of the USA Business Search CD contains a wealth of information on 18 million business
listings in the US. So whether you're after new business, building or enhancing your own database or
simply want cheap quality leads, USA Business Search CD will pinpoint exactly what you are looking
for. USA Business Search is an easy to use CD-ROM compiled using the most accurate and comprehensive
USA Yellow Pages (11.6 million records) and USA Domain Names (6.2 million records) databases. At
just $429.00, USA Business Search CD is a cost effective way of generating UNLIMITED leads and
spotting opportunities. Unrestricted export capability enables you to import data into your database
application. Where else can you get this for $429.00? For the same data InfoUSA will charge you over
$600,000.00 WHAT IS USA BUSINESS SEARCH CD? USA Business Search CD gives you unlim

## Feature engineering 

## Part 1: Extracting numeric features

We start with the ideas from the 2nd observation and create new features that count different noise components of the text.

In [6]:
num_features_df = extract_numeric_features(df=df_source, with_labels=True)

Number of samples and columns of input: (5832, 2)
Number of columns: 2
Columns names:
spam_label, text

Numeric features extracted
Data size: (5832, 14)
Number of columns: 14
Columns names:
email_counts, html tag_counts, url_counts, Twitter username_counts, hashtag_counts
character_counts, word_counts, unique word_counts, punctuation mark_counts, uppercase word_counts
lowercase word_counts, digit_counts, alphabetic char_counts, spam_label
Numeric features saved to data/num_features.csv
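
How `extract_numeric_features` computes its columns is not shown here. The following minimal sketch illustrates the idea of counting noise components in the raw text. The function `count_features`, its regular expressions and its key names (modelled loosely on the column names above) are assumptions for illustration and may differ from what the helper does.

```python
import re
import string


def count_features(text: str) -> dict:
    """Count a few noise components of a raw (uncleaned) document."""
    words = text.split()
    return {
        "email_counts": len(re.findall(r"\S+@\S+", text)),
        "html_tag_counts": len(re.findall(r"<[^>]+>", text)),
        "url_counts": len(re.findall(r"(?:https?://|www\.)\S+", text)),
        "character_counts": len(text),
        "word_counts": len(words),
        "uppercase_word_counts": sum(word.isupper() for word in words),
        "punctuation_mark_counts": sum(ch in string.punctuation for ch in text),
        "digit_counts": sum(ch.isdigit() for ch in text),
    }


example = "FREE offer!!! <b>Click</b> http://example.com or mail bob@example.com in 2002"
features = count_features(example)
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that these counts are taken from the raw text, not from the cleaned text: the cleaning step removes exactly the components that are being counted here.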


## Feature engineering

## Part 2: Extracting features from text

Computers don't understand natural language and its unstructured form. So, how do we represent text?

### Bag of words

One of the simplest models for representing text for machine learning, and one that was effective and widely used in the early days of NLP, is the ***Bag of Words*** model ([link](https://en.wikipedia.org/wiki/Bag-of-words_model)). When using this model, we discard most of the structure of the input text (word order, chapters, paragraphs, sentences or formatting) and only count how often each word appears in each text. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a "bag". 


**Example:** Let our toy corpus contain four documents.

```python
corpus = [
    'I enjoy paragliding.',
    'I do like NLP.',
    'I like deep learning.',
    'O Captain! my Captain!'
]
```

In [7]:
show_bag_of_words_vector()

Text,captain,deep,do,enjoy,i,learning,like,my,nlp,o,paragliding
I enjoy paragliding.,0,0,0,1,1,0,0,0,0,0,1
I do like NLP.,0,0,1,0,1,0,1,0,1,0,0
I like deep learning.,0,1,0,0,1,1,1,0,0,0,0
O Captain! my Captain!,2,0,0,0,0,0,0,1,0,1,0


In the table above, each column represents a word from the corpus and each row one of the four documents. The value in each cell is the number of times that word appears in the given document. For example, in the fourth document the word `captain` occurs twice and the words `my` and `o` occur once each.

In scikit-learn, the Bag of Words model is implemented by the `CountVectorizer`. It converts each document into a row of numbers, i.e. a numeric vector, hence the name vectorizer.
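
To make the counting concrete, here is a minimal sketch that builds the same count vectors for the toy corpus using only the Python standard library. The `tokenize` helper and its regular expression are assumptions chosen so that single-letter words such as `i` and `o` are kept, as in the table above; this is not the code behind `show_bag_of_words_vector()`.

```python
import re
from collections import Counter

corpus = [
    "I enjoy paragliding.",
    "I do like NLP.",
    "I like deep learning.",
    "O Captain! my Captain!",
]


def tokenize(text: str) -> list:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


# Vocabulary: every distinct word in the corpus, sorted alphabetically.
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})

# One count vector per document, with one entry per vocabulary word.
vectors = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vector in zip(corpus, vectors):
    print(vector, doc)
```

Each vector has as many entries as the vocabulary has words, so most entries are zero. With a real corpus and thousands of distinct words, this sparsity becomes much more pronounced.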

While this kind of transformation allows machine learning algorithms to process text data effectively, it has a drawback. It treats all words as independent and ignores the context in which they appear. For example, losing information about the order of the words in the text can change the meaning of a sentence. The sentences "I do like NLP", "Do I like NLP" or "NLP like I do" have the same set of words but different meanings. 
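
The loss of word order is easy to check. In the following minimal sketch (the helper name `bag_of_words` is an assumption for illustration), the three sentences from the paragraph above are reduced to word counts and then compared.

```python
import re
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count lowercase word occurrences, ignoring punctuation and word order."""
    return Counter(re.findall(r"\w+", sentence.lower()))


sentences = ["I do like NLP", "Do I like NLP", "NLP like I do"]
bags = [bag_of_words(sentence) for sentence in sentences]

# Check whether all three sentences collapse to the same bag of words.
print(all(bag == bags[0] for bag in bags))
```

A model that only sees these bags has no way to tell the statement, the question and the third variant apart.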

### TF-IDF

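The **Term Frequency–Inverse Document Frequency** approach does not recover word order, but it addresses another weakness of raw counts: every occurrence counts equally, no matter how common the word is across the corpus. TF-IDF measures how important a word is for a document relative to a collection of documents (the corpus). 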

We use the implementation by scikit-learn. It calculates the TF-IDF score as the product of:
- The **term frequency TF**, which is the frequency of the word $w$ in the given document $d$ divided by the total number of words in that document.   
  So $TF(w, d) = \frac{f(w, d)}{N(d)}$
- and the (smoothed) **inverse document frequency IDF**, which is given by 
$$IDF(w, D) = \log\left(\frac{size(D)+1}{df(w, D)+1}\right)+1$$ 
where $df(w, D)$ is the number of documents in the corpus $D$ that contain the word $w$. Adding `1` to the numerator and denominator prevents division by zero, and adding `1` to the logarithm ensures that words appearing in every document are not ignored entirely.

By default, scikit-learn additionally rescales each document vector to unit Euclidean length, so the per-document factor $N(d)$ cancels out in the final scores.

This way, common words that appear in many documents (small IDF) are given less weight while rare words that appear in only a few documents get a higher weight (high IDF).

In [8]:
show_tfidf_vector()

Text,captain,deep,do,enjoy,i,learning,like,my,nlp,o,paragliding
I enjoy paragliding.,0.0,0.0,0.0,0.644503,0.411378,0.0,0.0,0.0,0.0,0.0,0.644503
I do like NLP.,0.0,0.0,0.57458,0.0,0.366747,0.0,0.453005,0.0,0.57458,0.0,0.0
I like deep learning.,0.0,0.57458,0.0,0.0,0.366747,0.57458,0.453005,0.0,0.0,0.0,0.0
O Captain! my Captain!,0.816497,0.0,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.408248,0.0
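
The scores in the table above can be reproduced by hand from the TF and IDF formulas. The following minimal sketch uses only the standard library. It assumes the natural logarithm and a final rescaling of each document vector to unit Euclidean length (scikit-learn's default settings), neither of which is stated explicitly in this notebook; the helper names are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "I enjoy paragliding.",
    "I do like NLP.",
    "I like deep learning.",
    "O Captain! my Captain!",
]

# Word counts per document; single-letter words are kept.
docs = [Counter(re.findall(r"\w+", text.lower())) for text in corpus]
vocabulary = sorted({word for counts in docs for word in counts})
n_docs = len(docs)


def idf(word: str) -> float:
    """Smoothed inverse document frequency, using the natural logarithm."""
    df = sum(word in counts for counts in docs)
    return math.log((n_docs + 1) / (df + 1)) + 1


def tfidf_vector(counts: Counter) -> list:
    """TF * IDF for every vocabulary word, rescaled to unit Euclidean length."""
    n_words = sum(counts.values())
    raw = [counts[word] / n_words * idf(word) for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


for text, counts in zip(corpus, docs):
    scores = tfidf_vector(counts)
    nonzero = {w: round(s, 6) for w, s in zip(vocabulary, scores) if s > 0}
    print(text, nonzero)
```

In the first three documents, the word `i` receives a lower score than the other words even though each word occurs exactly once per document. It appears in three of the four documents, so its IDF is the smallest.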


Below you can extract the text features using either the CountVectorizer (`vectorizer="count"`) or the TfidfVectorizer (`vectorizer="tfidf"`). Please note that this process takes a while, so be patient.

For that reason, we have already pre-computed the features using `"tfidf"` and stored them in the `features` folder. You can load them using the command `load_feature_space(features="text")`.

In [9]:
text_features_df = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
text_features_df.shape

TF-IDF Vectorizer selected
Text features saved to 'data/text_features_tfidf.csv'


(5832, 10001)

### Embeddings

The Bag of Words and TF-IDF approaches cannot capture the meaning of words or the relationships between them. They also lead to very high-dimensional and sparse representations of the text, which are inefficient and can lead to overfitting.
To address these limitations, we can use **embeddings** or transformer-based models. Embeddings are dense vector representations of words that are learned from large corpora of text. By representing similar words as similar vectors, they can capture meaning and relationships in a continuous, lower-dimensional vector space.
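
A common way to measure how similar two such vectors are is the cosine similarity. The following minimal sketch illustrates the idea with three hypothetical 3-dimensional vectors. The numbers are made up for demonstration and do not come from any language model; real embeddings have hundreds of dimensions.

```python
import math

# Hypothetical toy "embeddings", invented for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_invoice = cosine_similarity(toy_embeddings["cat"], toy_embeddings["invoice"])
print(f"cat vs dog:     {sim_cat_dog:.3f}")
print(f"cat vs invoice: {sim_cat_invoice:.3f}")
```

In this toy setup the vectors for `cat` and `dog` point in nearly the same direction, while `invoice` points elsewhere. Count-based vectors cannot express this kind of closeness, because every word occupies its own independent column.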

We have passed the email texts through a language model to generate the associated embeddings. Since the feature extraction takes some time we have stored these embeddings and made them available for you in the file named `email_embeddings.csv`.

You can load them using the command `load_feature_space(features="embedding")`.

In [10]:
embeddings_df = load_feature_space(features="embedding")

Email embeddings loaded
Data includes labels in the column 'spam_label'
The data set has 5832 rows, 769 columns
