# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys

if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aiml2days.git
        
    # Copy files required to run the code
    !cp -r "aiml2days/notebooks/data" "aiml2days/notebooks/data_prep_tools.py" "aiml2days/notebooks/EDA_tools.py" "aiml2days/notebooks/modeling_tools.py" . 
    
    # Install packages via pip
    !pip install -r "aiml2days/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)


# Notebook overview

The overall goal of the hands-on session is to explore and compare various feature spaces and machine learning approaches. In this notebook we focus on getting the data ready for this task.

We will:

* Load the raw data
* Explore different samples of raw email and decide on text preprocessing steps
* Preprocess the raw text of the emails
* Extract different features from the data
* Store the different sets of features



# Data

We use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. The dataset is available in the *data* folder. You can find more details about this dataset [here](https://spamassassin.apache.org/old/publiccorpus/).


The dataset contains **~6'000 labeled emails**, i.e. we know which emails are regular emails and which are flagged as spam.




Our goal is to explore and compare various feature spaces and machine learning approaches. Spam emails serve only as a demonstration and learning example: they are a text-based case that everyone is familiar with, and they let us highlight the different stages of developing a machine learning application and the decisions involved along the way.

## Load the data

In [1]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py

In [2]:
# Load the data
df_source = load_source_data()

8546 emails loaded
Cleaning data set:
2710 duplicate emails found and removed
4 empty emails found and removed

5832 emails remaining

Number of columns: 2
Columns names:
spam_label, text


In [3]:
# If you rerun this cell multiple times you get different samples displayed each time
# OR you can replace the number 3 with a number of your choice
display(df_source.sample(3))

Unnamed: 0,spam_label,text
1696,1,"PUBLIC ANNOUNCEMENT: The new domain names are finally available to the general public at discount prices. Now you can register one of the exciting new .BIZ or .INFO domain names, as well as the original .COM and .NET names for just $14.95. These brand new domain extensions were recently approved by ICANN and have the same rights as the original .COM and .NET domain names. The biggest benefit is of-course that the .BIZ and .INFO domain names are currently more available. i.e. it will be much easier to register an attractive and easy-to-remember domain name for the same price. Visit: http://www.affordable-domains.com today for more info. Register your domain name today for just $14.95 at: http://www.affordable-domains.com/ Registration fees include full access to an easy-to-use control panel to manage your domain name in the future. Sincerely, Domain Administrator Affordable Domains To remove your email address from further promotional mailings from this company, click here: http://www.centralremovalservice.com/cgi-bin/domain-remove.cgi 45 9588xeOa0-732ENqv9875eRMa4-664oWoA0829MnbA5-849OhZo1243goAl55"
2007,0,"Hi all I have a prob when trying to install Linux (tried RedHat, Suse) on my laptop. I can start the install but after about 2min, the whole pc just dies. I know it's not a Linux prob and here is what I have encountered: I had the same problem when installing Win on it and eventually sorted it out by disabling the infrared port. I'm guessing this might be same prob although I'm not sure. I am very new to Linux so it's not that easy for me to work it out. I did manage to follow the setup procedure at one stage (using images on disks) and it cuts out either as it's trying to verify what CD-Rom I have or just after (hence my suspicion of the infrared port again). can anyone help ? thanks Gianni ************************************************************************ This e-mail and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed. If you have received this e-mail in error, please notify the EPA postmaster - postmaster@epa.ie The opinions contained within are personal to the sender and do not necessarily reflect the policy of the Environmental Protection Agency. This footnote also confirms that this e-mail message has been swept by MIMEsweeper for the presence of computer viruses. ************************************************************************ -- Irish Linux Users' Group: ilug@linux.ie http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information. List maintainer: listmaster@linux.ie"
3573,0,"URL: http://www.newsisfree.com/click/-1,8622126,215/ Date: 2002-10-07T03:52:50+01:00 *Money:* Pensions experts say tax lures should be used to raise retirement age."


## Text preprocessing

The goal of text preprocessing is to clean up and transform the raw text into a format that can be used by machine learning algorithms.


<div class="alert alert-success">

**In the Warmup Task we asked:**
    
__Q1.__ What parts of the text do we think are noise?
   
</div>


## 💡 Observations

1. Some suggestions discussed in class:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Case conversion (e.g. Dog vs dog, ...)
* English STOPWORDS (e.g. a, is, my, i, all, and, by...)
* ...



The *clean_corpus* function below will take care of the above points:

In [4]:
df_cleaned = clean_corpus(df_source)

Number of samples: 5832
Number of columns: 3
Columns names:
spam_label, text, text_cleaned

Number of duplicate cleaned texts found: 279
Number of empty texts found: 27

Email texts cleaned
Number of samples: 5832
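
To make the cleaning steps from the list above concrete, here is a minimal sketch of how a single email could be cleaned using only the Python standard library. It is not the implementation of *clean_corpus*: the function name `clean_text`, the regular expressions, and the tiny stopword list are assumptions chosen for illustration.

```python
import html
import re

# Hypothetical, deliberately tiny stopword list; a real list is much longer.
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by", "the", "to", "of"}


def clean_text(text: str) -> str:
    """Illustrative cleaning of one email (not the notebook's clean_corpus)."""
    text = html.unescape(text)                          # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text) # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                # remove e-mail addresses
    text = text.lower()                                 # case conversion
    text = re.sub(r"[^a-z\s]", " ", text)               # remove punctuation and digits
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    return " ".join(words)                              # collapses multiple whitespaces


sample = "<p>Hi ALL, visit http://www.example.com or mail me@example.com by 2002!</p>"
print(clean_text(sample))
```

The order of the steps matters in this sketch: URLs and e-mail addresses are removed before punctuation, because stripping punctuation first would break them into ordinary-looking words.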


In [5]:
# Let's look at some examples.
# You can rerun this cell to get a different sample
show_clean_text(df_cleaned)


Original document:

Anarchist 'Scavenger Hunt' Raises D.C. Police Ire Sat Sep 21, 3:37 PM ET WASHINGTON (Reuters) - An
online "anarchist scavenger hunt" proposed for next week's annual meeting of the International
Monetary Fund ( news - web sites) and World Bank ( news - web sites) here has raised the ire of
police, who fear demonstrators could damage property and wreak havoc. Break a McDonald's window, get
300 points. Puncture a Washington D.C. police car tire to win 75 points. Score 400 points for a pie
in the face of a corporate executive or World Bank delegate. D.C. Assistant Police Chief Terrance
Gainer told a congressional hearing on Friday that law authorities were in talks to decide whether
planned protests were, "so deleterious to security efforts that we ought to take proactive action."
Several thousand people are expected to demonstrate outside the IMF and World Bank headquarters next
weekend. The Anti-Capitalist Convergence, a D.C.-based anarchist group, is also planning a

# Feature engineering 

## Part 1: Extracting numeric features

<div class="alert alert-success">

**In the Warmup Task we asked:**
       
__Q2.__ What should we do with these parts of the text?
</div>

The function *extract_numeric_features* counts how often the items in our list above (along with a few general text statistics) occur in each email and stores the counts.

In [6]:
num_features_df = extract_numeric_features(df=df_source, with_labels=True)

Number of samples and columns of input: (5832, 2)
Number of columns: 2
Columns names:
spam_label, text

Numeric features extracted
Data size: (5832, 14)
Number of columns: 14
Columns names:
email_counts, html tag_counts, url_counts, Twitter username_counts, hashtag_counts
character_counts, word_counts, unique word_counts, punctuation mark_counts, uppercase word_counts
lowercase word_counts, digit_counts, alphabetic char_counts, spam_label
Numeric features saved to data/num_features.csv
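
As a rough illustration of what such counts look like for one email, the sketch below computes a handful of them with the standard library. The column names are borrowed from the output above, but the helper `count_features` and the exact definition of each count (for example, what qualifies as a URL or an uppercase word) are assumptions and may differ from what *extract_numeric_features* does.

```python
import re
import string


def count_features(text: str) -> dict:
    """Illustrative numeric features for one email (definitions are assumptions)."""
    words = text.split()
    return {
        "email_counts": len(re.findall(r"\S+@\S+", text)),
        "html tag_counts": len(re.findall(r"<[^>]+>", text)),
        "url_counts": len(re.findall(r"(?:https?://|www\.)\S+", text)),
        "character_counts": len(text),
        "word_counts": len(words),
        "unique word_counts": len(set(words)),
        "punctuation mark_counts": sum(ch in string.punctuation for ch in text),
        "uppercase word_counts": sum(w.isupper() for w in words),
        "digit_counts": sum(ch.isdigit() for ch in text),
    }


sample = "FREE offer!!! Visit http://www.example.com or write to win@example.com by 2002"
features = count_features(sample)
for name, value in features.items():
    print(f"{name}: {value}")
```

Note that these counts are computed on the raw text, not on the cleaned text: the parts removed as noise during cleaning are exactly what is being counted here.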


## Feature engineering

## Part 2: Extracting features from text

To give the texts a structure that the computer can handle, we represent each text as a vector, that is, a _long list of numbers_.

The methods that perform this conversion are called vectorizers. The _extract_text_features_ function offers two options:  
1. `vectorizer="count"`  
This gives the bag-of-words model, which simply counts how often each word appears in each email.
2. `vectorizer="tfidf"`  
TF-IDF stands for **Term Frequency–Inverse Document Frequency**. This approach takes into account how frequent a word is in the email, and how common it is in the entire dataset. It gives us a weighted count of the words.
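
The difference between the two options can be seen on a toy corpus. The sketch below builds both representations from scratch with the standard library; the three documents are invented, and the plain formula `count * log(N / document_frequency)` is one common textbook variant of TF-IDF. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and row normalization by default, so their numbers would differ, and this is not meant to reproduce the output of _extract_text_features_.

```python
import math
from collections import Counter

# Invented toy corpus: each string stands for one cleaned email.
docs = [
    "free money free offer",
    "linux install problem",
    "free linux offer",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Bag of words: one row per document, one column per vocabulary word.
word_counts = [Counter(tokens) for tokens in tokenized]
count_matrix = [[counts[word] for word in vocabulary] for counts in word_counts]

# Inverse document frequency: words that occur in fewer documents get a larger weight.
n_docs = len(docs)
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}

# TF-IDF: the raw count of each word, weighted by its IDF.
tfidf_matrix = [
    [count * idf[word] for count, word in zip(row, vocabulary)]
    for row in count_matrix
]

print(vocabulary)
print(count_matrix[0])
print([round(value, 3) for value in tfidf_matrix[0]])
```

In the first document "free" appears twice and "money" only once, yet "money" receives the higher TF-IDF weight because it occurs in no other document, whereas "free" is also used elsewhere.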

In [7]:
text_features_df = extract_text_features(
    df_cleaned, vectorizer="count", with_labels=True, store=True
)
print("Shape of our feature space:", text_features_df.shape)

display(text_features_df.head(3))

Count Vectorizer selected
Text features saved to 'data/text_features_count.csv'
Shape of our feature space: (5832, 10001)


Unnamed: 0,aalib,aall,aaron,abacha,abandon,abandoned,abidjan,abilities,ability,abiword,...,zimbabwe,zine,zone,zonealarm,zones,zoom,zope,zurich,zyban,spam_label
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [8]:
text_features_df2 = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
print("Shape of our feature space:", text_features_df2.shape)

display(text_features_df2.head(3))

TF-IDF Vectorizer selected
Text features saved to 'data/text_features_tfidf.csv'
Shape of our feature space: (5832, 10001)


Unnamed: 0,aalib,aall,aaron,abacha,abandon,abandoned,abidjan,abilities,ability,abiword,...,zimbabwe,zine,zone,zonealarm,zones,zoom,zope,zurich,zyban,spam_label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1



The Bag of Words and TF-IDF approaches cannot capture the meaning of words or the relationships between them. They also lead to very high-dimensional and sparse representations of the text (here 10,000 word columns per email, most of them zero), which are inefficient to store and process and can lead to overfitting.

### Embeddings
Embeddings are dense vector representations that map similar words to similar vectors, which allows them to capture meaning and relationships.

We have already extracted these features for you by passing the emails through a language model. The features are available in the file named `email_embeddings.csv`.

In [9]:
embeddings_df = load_feature_space(features="embedding")

print("Shape of our feature space:", embeddings_df.shape)

display(embeddings_df.head(3))

Email embeddings loaded
Data includes labels in the column 'spam_label'
The data set has 5832 rows, 769 columns
Shape of our feature space: (5832, 769)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,759,760,761,762,763,764,765,766,767,spam_label
0,-0.012373,-0.018983,-0.015448,0.017502,0.03556,-0.00046,0.033559,-0.01354,-0.019357,0.043482,...,0.044551,0.022994,0.029587,0.011559,-0.008164,-0.017245,-0.009922,-0.034954,-0.039636,1
1,-0.000592,-0.036461,-0.025587,0.017729,0.031857,-0.045625,0.051302,0.025131,-0.002957,0.040964,...,0.036356,0.004606,0.048945,-0.039095,0.036534,-0.025406,-0.004709,-0.006947,-0.029345,1
2,-0.015628,-0.032974,-0.017868,0.030587,0.015972,-0.012683,0.016617,-0.008228,-0.026466,0.017005,...,0.034141,-0.03471,0.015784,-0.043653,0.021216,0.01066,-0.027863,-0.00567,-0.029882,1
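
One common way to check whether two embedding vectors are "similar" is cosine similarity, which compares the directions of the vectors: values close to 1 mean very similar, values near 0 or below mean unrelated or opposed. The sketch below uses three invented 4-dimensional vectors purely for illustration; the real email embeddings have 768 dimensions, and the names `spam_a`, `spam_b`, and `ham` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not taken from email_embeddings.csv.
spam_a = [0.9, 0.1, -0.3, 0.2]
spam_b = [0.8, 0.2, -0.4, 0.1]
ham = [-0.5, 0.7, 0.6, -0.1]

print("spam_a vs spam_b:", round(cosine_similarity(spam_a, spam_b), 3))
print("spam_a vs ham:   ", round(cosine_similarity(spam_a, ham), 3))
```

With these toy values, the two "spam" vectors point in nearly the same direction, while the "ham" vector points away from them.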
