# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys

if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aiml2days.git
        
    # Copy files required to run the code
    !cp -r "aiml2days/notebooks/data" "aiml2days/notebooks/data_prep_tools.py" "aiml2days/notebooks/EDA_tools.py" "aiml2days/notebooks/modeling_tools.py" . 
    
    # Install packages via pip
    !pip install -r "aiml2days/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)

In [None]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py
%run modeling_tools.py


# Exploratory Data Analysis (EDA)

In the data preparation notebook we loaded a dataset containing ~6'000 labeled emails (spam and non-spam). We explored the email texts and decided on some procedures to clean the email text and extract both numerical features counting "spammish signatures" and text features. We also extracted embeddings of the original email text from a language model.

### Loading features  

The feature spaces can be loaded with the `load_feature_space()` function.

The `features` parameter specifies which feature set to load. The options are:
* "num": numerical features
* "text": (default) text features
* "num_text": numerical and text features combined
* "embedding": embedding features
  
The `no_labels` parameter controls whether the labels are omitted. The options are:
- `True` to omit the labels
- `False` (default) to load the labels.



### Task 

Explore the different feature spaces.

Disclaimer: For ease of analysis we will analyze the full data set. In practice, when building a supervised learning model such as a spam filter, you create a training set and a test set. You can explore the training set and use the insights to build your model. But you don't explore the test set, as it is used to evaluate the performance of the model on unseen data.
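
As an illustration of the split described above, here is a minimal sketch using scikit-learn's `train_test_split`. The toy `texts` and `labels` are made up for the example and are not part of this notebook's data; the 80/20 ratio is also just an assumption.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the email corpus: 100 documents, 30% spam (made-up numbers).
texts = [f"email {i}" for i in range(100)]
labels = [1] * 30 + [0] * 70

# stratify=labels keeps the spam share the same in both splits.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

print(Counter(train_labels))  # explore only this part
print(Counter(test_labels))   # leave this part untouched until evaluation
```

Stratifying matters when the classes are imbalanced, because a purely random split could leave the test set with too few examples of the rarer class.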

### Notebook overview

* Explore the numeric features
* Explore the text features
* Explore the embeddings



In [None]:
labels = load_labels()

Let's start by exploring the distribution of the labels i.e. how many spam and non-spam emails we have.

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ What do you observe about the frequency of spam and non-spam emails? 

__Q2.__ How could that impact the training of the spam detector?

</div>

In [None]:
plot_class_frequency(labels)
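
If you also want the numbers behind such a plot, class frequencies can be computed directly with pandas. This is a sketch on made-up labels (1 = spam, 0 = non-spam); it does not assume anything about what `load_labels()` returns.

```python
import pandas as pd

# Toy labels (made up): 1 = spam, 0 = non-spam.
toy_labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1], name="spam_label")

counts = toy_labels.value_counts()                # absolute counts per class
shares = toy_labels.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares.round(2))
```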

### Observation

Add your observation here

## Explore the features

### Numeric features
The numeric features are:
- "email_counts"
- "html tag_counts"
- "url_counts"
- "Twitter username_counts"
- "hashtag_counts"
- "character_counts"
- "word_counts"
- "unique word_counts"
- "punctuation mark_counts"
- "uppercase word_counts"
- "lowercase word_counts"
- "digit_counts"
- "alphabetic char_counts"

Note that some features have been log-scaled.
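
The notebook does not say which transform was used for the log-scaling. A common choice for count features is `log1p`, so the sketch below assumes that; the raw counts are made up.

```python
import numpy as np

# Made-up raw counts with a long right tail, as is typical for count features.
raw_counts = np.array([0, 1, 3, 10, 100, 5000])

# log1p(x) = log(1 + x): defined at zero and compresses large values.
log_counts = np.log1p(raw_counts)

print(log_counts.round(2))

# expm1 inverts the transform if the raw counts are needed again.
recovered = np.expm1(log_counts)
```

Keep this in mind when reading the plots: equal distances on a log-scaled axis correspond to equal ratios, not equal differences, in the raw counts.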

#### Load the numeric features:

In [None]:
num_features_df = load_feature_space(features="num")

Let's check the distributions of the numeric features 
- once across the full corpus and
- once by `spam_label`   

and see whether there are signs of differences between spam and non-spam emails.

#### Full corpus:

In [None]:
plot_numeric_features(num_features_df, with_labels=False)

#### By `spam_label`:

<div class="alert alert-success">
<h3>Questions</h3>
    
__Q1.__ Do spam and non-spam emails differ on average across the different numeric features?

* Does spam contain more HTML tags? 
* Does non-spam contain more URLs and email addresses? 
* Are spam emails longer than non-spam emails? 
* ...
   
__Q2.__ Could these features be useful for the distinction of spam and non-spam emails?
</div>

In [None]:
plot_numeric_features(num_features_df, with_labels=True)
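
To complement the plots with numbers, per-class summary statistics can be computed with a pandas `groupby`. The sketch below uses a made-up DataFrame; only the column name `spam_label` is taken from this notebook, and whether `num_features_df` has exactly this layout is an assumption.

```python
import pandas as pd

# Made-up values; only the column name `spam_label` comes from this notebook.
toy_df = pd.DataFrame(
    {
        "url_counts": [0, 1, 0, 4, 6, 5],
        "word_counts": [120, 80, 100, 40, 60, 50],
        "spam_label": [0, 0, 0, 1, 1, 1],
    }
)

# One row per class, with mean and median for every feature.
summary = toy_df.groupby("spam_label").agg(["mean", "median"])
print(summary)
```

Comparing mean and median is useful here because count features are often skewed, and a few extreme emails can dominate the mean.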

### Observation

Add your observation here

### Text features

It is easier to explore the cleaned text as it was before we applied the vectorizers to generate a specific feature space.

#### Load the cleaned text:

In [None]:
df_source = load_source_data(verbose=False)
df_cleaned = clean_corpus(df_source, verbose=False)

<div class="alert alert-success">


<h3>Questions</h3>

Let's explore the cleaned text for the most common words overall. You can change `N` to show more or fewer words.

__Q1.__ Which words do you think are more indicative of spam, and which are more typical of non-spam?

</div>

#### Give your answer here:

1.    











In [None]:
plot_most_common_words(df_cleaned, N=30, with_labels=False)

<div class="alert alert-success">


<h3>Questions</h3>

Now let's explore the most common words in each class. We take the top `N` words from each class and combine them into one set. We then plot the total counts of all these words for both classes. Thus the plot can contain more than `N` words. The words are sorted by their frequency in the non-spam class (blue).

You can change `N` to show more or fewer words.

__Q1.__ Which words do you think are more indicative of spam, and which are more typical of non-spam?

__Q2.__ Are the total counts representative of the importance to each class?


</div>

In [None]:
plot_most_common_tokens(df_cleaned, N=15)
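
The selection logic described above (top `N` per class, merged into one set, sorted by non-spam frequency) can be sketched with `collections.Counter`. The tokenized toy emails are made up, and this is not necessarily how `plot_most_common_tokens()` is implemented.

```python
from collections import Counter

# Made-up tokenized emails; the real corpus is in `df_cleaned`.
spam_docs = [["free", "offer", "click"], ["free", "offer", "money"], ["free", "offer"]]
ham_docs = [["meeting", "tomorrow", "list"], ["list", "code", "free"], ["list", "meeting"]]

N = 2  # number of top words to take from each class

spam_counts = Counter(tok for doc in spam_docs for tok in doc)
ham_counts = Counter(tok for doc in ham_docs for tok in doc)

# Union of both top-N lists: can hold up to 2 * N distinct words.
top_words = {w for w, _ in spam_counts.most_common(N)} | {
    w for w, _ in ham_counts.most_common(N)
}

# Sort by non-spam frequency, as in the plot, and show both counts.
for word in sorted(top_words, key=lambda w: ham_counts[w], reverse=True):
    print(f"{word:10s} non-spam={ham_counts[word]:3d} spam={spam_counts[word]:3d}")
```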

#### Give your answer here:

1.    





2. 







<div class="alert alert-success">


<h3>Questions</h3>

Let's repeat the above counts, but this time we account for the different class sizes. We use relative frequencies by rescaling the counts to 1000 documents per class.

__Q1.__ Which words do you think are more indicative of spam, and which are more typical of non-spam?

__Q2.__ What has changed in terms of words that appear to be good indicators for either class?

__Q3.__ Playing around a bit with `N`, do you think building a model based on text features can yield some good results?

</div>

In [None]:
plot_most_common_tokens(df_cleaned, N=15, per_1000=True)
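
One plausible reading of the per-1000 adjustment is to multiply each class's token counts by `1000 / number_of_documents_in_that_class`. The sketch below assumes exactly that; the helper name `counts_per_1000_docs` and the toy emails are hypothetical and not part of the notebook's tools.

```python
from collections import Counter


def counts_per_1000_docs(docs):
    """Token counts rescaled to a hypothetical class of 1000 documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    scale = 1000 / len(docs)
    return {tok: n * scale for tok, n in counts.items()}


# Made-up, deliberately imbalanced classes: 2 spam vs 8 non-spam emails.
spam_docs = [["free", "offer"], ["free", "click"]]
ham_docs = [["list", "free"]] + [["list", "meeting"]] * 7

spam_rates = counts_per_1000_docs(spam_docs)
ham_rates = counts_per_1000_docs(ham_docs)

# In raw counts "free" appears twice in spam and once in non-spam;
# after rescaling, the gap between the classes becomes much larger.
print(spam_rates["free"], ham_rates["free"])
```

The rescaling keeps the ranking of words within a class unchanged; it only changes how the two classes compare with each other.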

#### Give your answer here:

1.    








2.




### Exploring the embedding space


In [None]:
embeddings_df = load_feature_space(features="embedding")

#### PCA 
We run a PCA on the embedding space to see whether we can reduce its dimensionality.  
The PCA is run on the full data set. We extract how many components are needed to explain a given share of the variance and visualize the results using a scree plot and a table.

We will also visualize the first two components in a scatter plot.


<div class="alert alert-success">


<h3>Questions</h3>

__Q1.__ How much variance is explained by the first two components?

__Q2.__ By how much could the feature space shrink if we wanted to retain 90% of variance?

__Q3.__ Looking at the scatterplot, what insights can you draw from the PCA results regarding class separation? 


</div>

In [None]:
embeddings_pca_df = run_pca(embeddings_df, with_labels=True)
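
For readers who want to see the computation behind the scree plot and table, here is a minimal sketch with scikit-learn's `PCA`. It runs on synthetic data rather than the real embeddings, and it is not necessarily how `run_pca()` is implemented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the embeddings: 200 samples, 20 dimensions,
# generated from only 3 latent directions plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

pca = PCA().fit(X)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(f"first two components: {cumulative[1]:.1%} of variance")
print(f"components for 90%:   {n_components_90} of {X.shape[1]}")

# Coordinates for a 2D scatter plot of the first two components.
X_2d = pca.transform(X)[:, :2]
```

scikit-learn can also pick the number of components directly: `PCA(n_components=0.90, svd_solver="full")` keeps just enough components to explain at least 90% of the variance.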

#### Give your answer here:

1.    





2. 




3.
