# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys

if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/EPFL_EMBA.git
        
    # Copy files required to run the code
    !cp -r "EPFL_EMBA/notebooks/data" "EPFL_EMBA/notebooks/data_prep_tools.py" "EPFL_EMBA/notebooks/EDA_tools.py" "EPFL_EMBA/notebooks/modeling_tools.py" . 
    
    # Install packages via pip
    !pip install -r "EPFL_EMBA/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)


# Notebook overview

The overall goal of the hands-on session is to explore and compare various feature spaces and machine learning approaches. In this notebook we focus on getting the data ready for this task.

We will:

* Load the raw data
* Explore different samples of raw emails and decide on text preprocessing steps
* Preprocess the raw text of the emails
* Extract different features from the data
* Store the different sets of features



# Data

We use the [SpamAssassin](https://spamassassin.apache.org/) public email corpus. The dataset is available in the *data* folder. You can find more details about this dataset [here](https://spamassassin.apache.org/old/publiccorpus/).


The dataset contains **~6'000 labeled emails**, i.e. we know which emails are regular emails and which are flagged as spam.




Our goal is to explore and compare various feature spaces and machine learning approaches. Spam emails serve only as a demonstration: they are a text-based example that everyone is familiar with, and they let us highlight the different stages of developing a machine learning application and the decisions involved along the way.

## Load the data

In [None]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py

In [None]:
# Load the data
df_source = load_source_data()

In [None]:
# If you rerun this cell multiple times you get different samples displayed each time
# OR you can replace the number 3 with a number of your choice
display(df_source.sample(3))

## Text preprocessing

The goal of text preprocessing is to clean up and transform the raw text into a format that can be used by machine learning algorithms.


<div class="alert alert-success">

**In the Warmup Task we asked:**
    
__Q1.__ What parts of the text do we think are noise?
   
</div>


## 💡 Observations

1. Some suggestions discussed in class:

* HTML tags 
* URLs
* E-mail addresses
* Punctuation marks, digits (e.g. 2002, 1.1, ...)
* Multiple whitespaces
* Case conversion (e.g. Dog vs dog, ...)
* English STOPWORDS (e.g. a, is, my, i, all, and, by...)
* ...



The *clean_corpus* function below will take care of the above points:

In [None]:
df_cleaned = clean_corpus(df_source)

In [None]:
# Let's look at some examples.
# You can rerun this cell to get a different sample
show_clean_text(df_cleaned)
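
For intuition, the noise items listed above can be handled with a few regular expressions. The sketch below is a simplified illustration and not the actual implementation of `clean_corpus`: the function name `clean_text_sketch`, the regex patterns, and the tiny `STOPWORDS` set are assumptions made for this example.

```python
import re

# Assumption: a deliberately tiny stopword set, taken from the examples listed above
STOPWORDS = {"a", "is", "my", "i", "all", "and", "by"}

def clean_text_sketch(text):
    """Illustrative cleaning pipeline (hypothetical, not the notebook's clean_corpus)."""
    text = re.sub(r"<[^>]+>", " ", text)                  # remove HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)   # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                  # remove e-mail addresses
    text = text.lower()                                   # case conversion
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation and digits
    tokens = text.split()                                 # also collapses multiple whitespaces
    return " ".join(t for t in tokens if t not in STOPWORDS)

example = "<p>Visit http://example.com NOW!  Mail me at bob@example.com by 2002.</p>"
print(clean_text_sketch(example))
```

The order of the steps matters: tags, URLs, and addresses are removed before punctuation is stripped, because the patterns rely on characters such as `<`, `/`, and `@` to recognize them.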

# Feature engineering 

## Part 1: Extracting numeric features

<div class="alert alert-success">

**In the Warmup Task we asked:**
       
__Q2.__ What should we do with these parts of the text?
</div>

The function *extract_numeric_features* counts the frequencies of the items in our list above and stores them.

In [None]:
num_features_df = extract_numeric_features(df=df_source, with_labels=True)
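
Counting such items in the raw text can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind `extract_numeric_features`: the feature names, the patterns in `PATTERNS`, and the helper `count_noise_items` are hypothetical.

```python
import re

# Assumption: feature names and patterns chosen for illustration only
PATTERNS = {
    "html_tag_count": r"<[^>]+>",
    "url_count": r"(?:https?://|www\.)\S+",
    "email_count": r"\S+@\S+",
    "digit_count": r"\d",
    "punctuation_count": r"[^\w\s]",
}

def count_noise_items(text):
    """Count how often each pattern occurs in one raw e-mail text."""
    return {name: len(re.findall(pattern, text)) for name, pattern in PATTERNS.items()}

counts = count_noise_items("<b>WIN $100</b> at http://example.com or mail bob@example.com!")
print(counts)
```

Note that the counts are taken on the raw text: once the text has been cleaned, the tags, URLs, and digits are gone and can no longer be counted.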

## Part 2: Extracting features from text

To give the texts a structure that the computer can handle, we represent each text as a vector, that is, a _long list of numbers_.

Methods that do this are called vectorizers. The _extract_text_features_ function offers two options:  
1. `vectorizer="count"`  
This gives the bag-of-words model, which simply counts how often each word appears in each email.
2. `vectorizer="tfidf"`  
TF-IDF stands for **Term Frequency–Inverse Document Frequency**. This approach takes into account how frequent a word is in the email and how common it is across the entire dataset. It gives us a weighted count of the words.

In [None]:
text_features_df = extract_text_features(
    df_cleaned, vectorizer="count", with_labels=True, store=True
)
print("Shape of our feature space:", text_features_df.shape)

display(text_features_df.head(3))

In [None]:
text_features_df2 = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
print("Shape of our feature space:", text_features_df2.shape)

display(text_features_df2.head(3))
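
To make the difference between the two options concrete, here is a small hand-computed sketch on three made-up "emails". It uses the textbook weighting `count * log(N / df)` and only the standard library; it is not the code behind `extract_text_features`, and libraries such as scikit-learn apply smoothing and normalization by default, so their values will differ.

```python
import math
from collections import Counter

# Assumption: three tiny made-up "emails", already cleaned and lower-cased
docs = ["free money free offer", "meeting agenda money", "meeting notes"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of words: raw count of each vocabulary word per document
token_counts = [Counter(tokens) for tokens in tokenized]
counts = [[tc[word] for word in vocab] for tc in token_counts]

# Document frequency: number of documents that contain each word
n_docs = len(docs)
doc_freq = {word: sum(word in tc for tc in token_counts) for word in vocab}

# Textbook TF-IDF: words that occur in many documents receive a smaller weight
tfidf = [
    [count * math.log(n_docs / doc_freq[word]) for count, word in zip(row, vocab)]
    for row in counts
]

print(vocab)
print(counts[0])
print([round(value, 2) for value in tfidf[0]])
```

In the first document, "money" and "offer" both occur once, but "money" also appears in a second document, so its TF-IDF weight is lower than that of "offer".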


The Bag of Words and TF-IDF approaches cannot capture the meaning of words or the relationships between them. They also produce very high-dimensional, sparse representations of the text, which are inefficient and can lead to overfitting.

### Embeddings
Embeddings are denser vector representations that map similar words to similar vectors, which lets them capture meaning and relationships.
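
A common way to measure how similar two such vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they do not come from `email_embeddings.csv` or from any real language model, and real embeddings typically have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumption: hypothetical toy vectors, invented for this example
toy_embeddings = {
    "cheap": [0.9, 0.1, 0.0],
    "inexpensive": [0.8, 0.2, 0.1],
    "meeting": [0.0, 0.3, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["cheap"], toy_embeddings["inexpensive"])
sim_unrelated = cosine_similarity(toy_embeddings["cheap"], toy_embeddings["meeting"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these toy values, the two words with related meaning get a similarity close to 1, while the unrelated pair stays close to 0.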

We have already extracted these features for you by passing the emails through a language model. The features are available in the file named `email_embeddings.csv`.

In [None]:
embeddings_df = load_feature_space(features="embedding")

print("Shape of our feature space:", embeddings_df.shape)

display(embeddings_df.head(3))