# Google Colab Setup

In [None]:
#@title Setup Google Colab by running this cell only once (ignore this if run locally) {display-mode: "form"}
import sys

if 'google.colab' in sys.modules:
    # Clone GitHub repository
    !git clone https://github.com/epfl-exts/aiml2days.git
        
    # Copy files required to run the code
    !cp -r "aiml2days/notebooks/data" "aiml2days/notebooks/data_prep_tools.py" "aiml2days/notebooks/EDA_tools.py" "aiml2days/notebooks/modeling_tools.py" . 
    
    # Install packages via pip
    !pip install -r "aiml2days/colab-requirements.txt"
    
    # Restart Runtime
    import os
    os.kill(os.getpid(), 9)

In [None]:
# Load libraries and helper functions
%run data_prep_tools.py
%run EDA_tools.py
%run modeling_tools.py

# Building a spam detector

## In this notebook

Our aim is to build a simple spam detector. These are the main steps:
* Load a feature set into memory
* Split the data into training and test set (see day 1)
* Train a model
* Evaluate the model
* Analyze misclassified samples to gain further insights

Your **main task** is to rerun the modelling part several times with different settings. You can change the data, the evaluation metric, and more.



## Load the data

For our first run we will use only the numerical features. Don't worry: later it will be your turn to explore what happens when you **load the other feature sets.**

In [None]:
num_features_df = load_feature_space("num", no_labels=False)

Let's check the number of samples per class in the data.

In [None]:
labels = load_labels()
plot_class_frequency(labels)

<div class="alert alert-success">


<h3>Questions</h3>

Suppose you applied a **very naive approach** to the spam detection problem **that uses none of the features**: _You just either classify all emails as "spam" or as "non-spam"._

__Q1.__ How many emails would be classified correctly in each case?  

__Q2.__ Which approach would be more successful?
</div>

We have just established a **baseline** against which we will compare all other models.
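As a minimal sketch of how such a baseline can be computed, the snippet below uses made-up class counts (they are an assumption for illustration, not the counts of our dataset). A classifier that always predicts one class is correct exactly as often as that class occurs.

```python
# Hypothetical class counts, for illustration only (not the real dataset).
class_counts = {"non-spam": 700, "spam": 300}

total = sum(class_counts.values())

# Accuracy of a classifier that always predicts one fixed class:
# it is right for every sample of that class and wrong for all others.
baseline_accuracy = {label: count / total for label, count in class_counts.items()}

for label, acc in baseline_accuracy.items():
    print(f"Always predict {label!r}: accuracy = {acc:.2f}")

# The most frequent class gives the strongest naive baseline.
majority_class = max(class_counts, key=class_counts.get)
print("Best naive baseline: always predict", repr(majority_class))
```

Any model we train should beat the accuracy of the majority-class baseline to be worth using.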

## Model building

Throughout this notebook we will use a **Logistic Regression classifier**. Here is why:
- It is a simple and efficient model for binary classification tasks. 
- It is a good baseline for more complex models. 
- It is fast to train and thus allows us to quickly iterate on our model and try out different settings.
- It is also easy to interpret and allows us to explore where our model makes mistakes.

Below you will:
- build a first simple model.
- tune the main hyperparameter `C` for the model using a cross-validated grid search.
- explore different feature sets and see how they affect the performance of the model.
- explore the effect of different evaluation metrics.
- explore misclassified samples.

### A first trial

#### Training a single model

As a first trial, we will use the `num` feature set with a simple model. The accuracy is defined as the number of correct predictions divided by the total number of predictions.

In [None]:
# Train/test splitting
df_train, df_test = train_test_split_(num_features_df)

# Fit model on the train data
model = fit_model(df_train, C=1)

# Print predictions on test set
plot_confusion_matrix(df_test, model)

<div class="alert alert-success">


<h3>Questions</h3>

__Q1.__ Which numbers tell us the correct predictions for each class?

__Q2.__ Which numbers tell us the failed predictions for each class?

__Q3.__ Which class fared better?
</div>

#### Classification report

The classification report provides us with four different metrics to evaluate the performance of our model: overall accuracy, precision, recall and f1-score. For a reminder, expand the text cell below.

Three metrics per class:  
The **precision** looks at the predictions per class. It divides the number of correct predictions by the number of predictions made for that class. In the confusion matrix the emphasis is per column (vertical).

The **recall** looks at the ground truth per class. It divides the number of samples in that class that were correctly predicted by the number of samples in that class (support). In the confusion matrix the emphasis is per row (horizontal).

The **f1-score** is the harmonic mean of precision and recall.

Overall metric:  
The **accuracy** is the number of all correct predictions divided by the total number of samples for the data set.
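These definitions can be sketched in a few lines of plain Python. The confusion matrix below is hypothetical, and the layout (rows are the true classes, columns are the predicted classes) is an assumption that matches the common scikit-learn convention.

```python
# Hypothetical 2x2 confusion matrix, for illustration only.
# Rows = true class, columns = predicted class.
#             pred non-spam  pred spam
confusion = [[90,            10],   # true non-spam
             [20,            80]]   # true spam


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    correct = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)  # column sum
    actually_k = sum(cm[k])                     # row sum (the support)
    precision = correct / predicted_as_k if predicted_as_k else 0.0
    recall = correct / actually_k if actually_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


n_samples = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / n_samples

spam_precision, spam_recall, spam_f1 = per_class_metrics(confusion, 1)
print(f"spam: precision={spam_precision:.3f} recall={spam_recall:.3f} f1={spam_f1:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Note that precision uses a column sum and recall uses a row sum, which is why the two can differ a lot for the same class.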

In [None]:
# Print classification report for test set
classification_report_(df_test, model)

The column named "Support" gives the total number of samples for each class in the test set.

### A more systematic approach:
#### Fine tuning with grid search and cross-validation

We will use a 5-fold cross-validation (`cv=5`). We also collect all the results from the cross-validation so we can plot them below. The process will automatically choose the best model for us. Finally, we will use the test set to evaluate the performance of our model.
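To illustrate what "5-fold" means, here is a minimal sketch of how the training data can be split into folds. The helper `k_fold_indices` is hypothetical and written only for this illustration; it is not the code used inside `fit_log_reg_model`, and real implementations (for example scikit-learn's `KFold` or `StratifiedKFold`) typically also shuffle or stratify the data.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-overlapping folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Each sample is used for validation exactly once. In a grid search, every candidate value of `C` is trained and scored on each of the five splits, and the scores are averaged per candidate.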

In [None]:
# Train/test splitting
df_train, df_test = train_test_split_(num_features_df)
# Note: with text_features_df the grid search below takes about 45 minutes

# Fit model on the train data
model, cv_results = fit_log_reg_model(df_train)

#### Tracking overfitting

Below we plot the results of the cross-validation. For each value of `C` we want to compare the training scores (blue) against the validation score (orange). The red cross marks the value of the best `C`.

To assess overfitting, we look at the gap between the training and validation curves (for details, expand the text cell below).

If the gap is small, it means that our model is not overfitting and generalizes well to unseen data.  
If the gap is large, it means that our model is overfitting. This indicates that the model has learned irrelevant information like noise that does not reflect the general pattern. In such a case we need to find ways to adjust the model to reduce the gap and improve the performance on the validation set.
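The gap can also be computed directly. The sketch below uses made-up scores, and the tolerance of 0.05 is an arbitrary assumption chosen for illustration, not a rule.

```python
# Hypothetical cross-validation scores, for illustration only.
C_values = [0.01, 0.1, 1, 10, 100]
train_scores = [0.80, 0.86, 0.90, 0.95, 0.99]
val_scores = [0.79, 0.85, 0.88, 0.87, 0.84]

# Gap between training and validation score for each value of C.
gaps = [train - val for train, val in zip(train_scores, val_scores)]

# Arbitrary tolerance (an assumption): flag larger gaps as likely overfitting.
tolerance = 0.05
overfitting_C = [C for C, gap in zip(C_values, gaps) if gap > tolerance]

for C, gap in zip(C_values, gaps):
    print(f"C={C:<6} gap={gap:.2f}")
print("C values with a large gap:", overfitting_C)
```

In this toy example the training score keeps rising with `C` while the validation score drops, so the gap widens for the largest values of `C`.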

In [None]:
viz_cv_results(cv_results, show_table=False, plot_confidence=False, plot_fit_time=False)

<div class="alert alert-success">


<h3>Questions</h3>

__Q1.__ Do we observe overfitting i.e. a large gap between the training and validation curves?

__Q2.__ What happens when C is very small and when it is very large?


</div>

### Model evaluation

In [None]:
# Print classification report for test set
classification_report_(df_test, model)

# Print predictions on test set
plot_confusion_matrix(df_test, model)

### Get more insights into the model

The coefficients of the Logistic Regression model can tell us how much each feature contributes to the overall prediction. The larger the absolute value of a coefficient, the more important the corresponding feature is for the model. 

For the impact on the overall prediction we look at _feature values times coefficients_. This will help us understand the model's behavior better and identify which features are driving the predictions for particular samples.
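As a minimal sketch of "feature values times coefficients", the snippet below computes the contribution of each feature for a single email. The feature names, coefficients, intercept and values are all hypothetical, and we assume that the positive class is spam.

```python
import math

# Hypothetical model parameters and one hypothetical sample (illustration only).
coefficients = {"n_exclamation_marks": 0.8, "n_links": 1.2, "email_length": -0.5}
intercept = -1.0
sample = {"n_exclamation_marks": 3.0, "n_links": 1.0, "email_length": 2.0}

# Contribution of each feature = feature value * coefficient.
contributions = {name: coefficients[name] * sample[name] for name in coefficients}

# Logistic regression sums the contributions and the intercept,
# then squashes the result into a probability with the sigmoid function.
logit = intercept + sum(contributions.values())
probability_spam = 1.0 / (1.0 + math.exp(-logit))

# Rank features by the size of their contribution for this sample.
ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)
for name in ranked:
    print(f"{name:<22} contribution={contributions[name]:+.2f}")
print(f"P(spam) = {probability_spam:.3f}")
```

Here `n_links` has the largest coefficient, but `n_exclamation_marks` drives this particular prediction because its feature value is larger.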

In [None]:
visualize_coefficients(model, df_train, n_top_features=5)

### How sure was the model of its predictions?
The Logistic Regression model provides the probabilities of each class for each sample. This allows us to assess the confidence of the model's predictions.

Low probabilities (close to 0) indicate that the model is very sure that the sample is not spam. High probabilities (close to 1) indicate that the model is very sure that the sample is spam.
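A minimal sketch of how the certainty of wrong predictions can be inspected is shown below. The probabilities and labels are made up, and the decision threshold of 0.5 is an assumption (it is the usual default for a binary classifier).

```python
# Hypothetical predicted spam probabilities and true labels (1 = spam, 0 = non-spam).
probabilities = [0.02, 0.97, 0.55, 0.40, 0.91, 0.08]
true_labels = [0, 1, 0, 1, 1, 1]

threshold = 0.5  # assumed decision threshold
predictions = [int(p >= threshold) for p in probabilities]

# Certainty of a prediction: probability assigned to the predicted class.
misclassified = [
    (i, max(p, 1 - p))
    for i, (p, pred, truth) in enumerate(zip(probabilities, predictions, true_labels))
    if pred != truth
]

for i, certainty in misclassified:
    print(f"sample {i}: wrong prediction made with certainty {certainty:.2f}")
```

In this toy data, samples 2 and 3 are doubtful mistakes (certainty close to 0.5), while sample 5 is a confident mistake.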

In [None]:
plot_prediction_certainties(df_test, model, log_scale=False)

Interpret the top plot carefully if it uses a log scale: the axis values are then not evenly spaced. You can change this with the `log_scale` argument.

<div class="alert alert-success">


<h3>Questions</h3>

__Q1.__ When the model misclassified a sample, is it usually very sure of its prediction, or kind of doubtful?
</div>

### Error analysis: Where does our model fail?

The `error_analysis` function below will show us the top features responsible for the model's wrong decision.

In [None]:
error_analysis(df_test, model, doc_nbr=10, n_top_coeff=5, color_by_coeff_sign=True)

# NOW IT'S YOUR TURN

We have copied the above code blocks again below. You can use them to build your own spam detector now. 

There are a number of things you can adapt:

### Change the feature space

Besides `num_features_df`, three more feature spaces are loaded in the section "Loading the other feature spaces" below. Simply replace `num_features_df` with `text_features_df`, `num_text_features_df`, or `embeddings_df` in the code below to use a different feature space.

Warning: Fine-tuning with cross-validation is slow for the feature spaces that use text features (about 45 minutes).  
The pre-computed output of the grid search with cross-validation can be loaded with the following code: `cv_results = pd.read_csv("text_log_reg_cv_results.csv")`.
    
You can retrain the model using the `fit_model`-function with `C` set to the best `C`-value from `cv_results`.
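One way to find the best `C` in a table of cross-validation results is sketched below. The CSV content and the column names `param_C`, `mean_train_score` and `mean_test_score` are assumptions modelled on scikit-learn's usual naming; check the actual columns of `cv_results` before relying on them.

```python
import csv
import io

# Hypothetical CSV content; the real file's columns and values may differ.
csv_text = """param_C,mean_train_score,mean_test_score
0.01,0.80,0.79
1,0.90,0.88
100,0.99,0.84
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The best C is the one with the highest mean validation score.
best_row = max(rows, key=lambda row: float(row["mean_test_score"]))
best_C = float(best_row["param_C"])

print("best C:", best_C)
```

With a pandas DataFrame the same idea applies: select the row with the highest validation score and read off its `C` value.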
    
### Change the metric used for fine-tuning

You can change the scoring function inside `fit_log_reg_model(df_train)`.  

The current default value is `scoring=None` which will use the accuracy score. 

However, you can also change the scoring function to `"precision"`, `"recall"`, or `"f1"` and check how the results change. 

More options are given [here](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

What happens to the confusion matrix as you vary the metric?
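To see why the metric matters, consider this small sketch with two hypothetical candidate models. The counts are made up for illustration; the point is only that the "best" model depends on the metric used to compare them.

```python
# Hypothetical confusion-matrix counts for two candidate models (illustration only).
candidates = {
    "model_A": {"tn": 95, "fp": 5, "fn": 30, "tp": 70},
    "model_B": {"tn": 80, "fp": 20, "fn": 10, "tp": 90},
}


def score(counts, metric):
    """Compute a metric for the positive class (assumed to be spam)."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    if metric == "accuracy":
        return (tp + tn) / (tp + tn + fp + fn)
    if metric == "precision":
        return tp / (tp + fp)
    if metric == "recall":
        return tp / (tp + fn)
    raise ValueError(f"unknown metric: {metric}")


# Pick the winner separately for each metric.
winners = {
    metric: max(candidates, key=lambda name: score(candidates[name], metric))
    for metric in ("accuracy", "precision", "recall")
}
print(winners)
```

Selecting by precision favours the model with few false positives (non-spam flagged as spam), while selecting by recall favours the model with few false negatives (spam that slips through).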

### Explore other settings

You will likely have to change some of the other parameters in the visualisations, etc. to make them more interpretable.


## Loading the other feature spaces

### Working locally on your machine
If you generated the text-based feature spaces in the notebook `data_preparation.ipynb`, you will be able to load them directly from the `data` folder.


In [None]:
# Load the remaining feature spaces
embeddings_df = load_feature_space("embedding", no_labels=False)
text_features_df = load_feature_space("text", no_labels=False)
num_text_features_df = load_feature_space("num_text", no_labels=False)

### If working with Colab - else ignore

If you are working on Google Colab, you will need to rerun the feature generation for text (code below) because the different notebooks don't share the same instance of the data folder.

In [None]:
# Generate text features (this can take around 1:15 min)
df_source = load_source_data()
df_cleaned = clean_corpus(df_source)

text_features_df = extract_text_features(
    df_cleaned, vectorizer="tfidf", with_labels=True, store=True
)
# You can switch to vectorizer="count" later if you want to try

text_features_df = load_feature_space("text", no_labels=False)
num_text_features_df = load_feature_space("num_text", no_labels=False)

# Time to play

#### To make things easier: Change your settings here and then run the cell below


In [None]:

feature_space = num_features_df
# options are (variable names, not strings):
# num_features_df
# text_features_df
# num_text_features_df
# embeddings_df


C = 1
# Some options to try are: 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000


scoring = None
# options include: 'accuracy', 'f1', 'precision', 'recall'

### Single model

In [None]:
# Train/test splitting
df_train, df_test = train_test_split_(feature_space)

# Fit model on the train data
model = fit_model(df_train, C=C)

# Print classification report for test set
classification_report_(df_test, model)

# Print predictions on test set
plot_confusion_matrix(df_test, model)

### Fine tuning with grid search and cross-validation

In [None]:
# Train/test splitting
df_train, df_test = train_test_split_(feature_space)

#### Note for using text features
With text features this can take a while to run, so you can skip this first cell and load the pre-computed `cv_results` in the next cell.

In [None]:
# Fit model on the train data
model, cv_results = fit_log_reg_model(df_train, scoring=scoring)

#### Load fine tuning results for text features

In [None]:
# For text features only
cv_results = pd.read_csv("text_log_reg_cv_results.csv")
display(cv_results)

In [None]:
# Fit model with the best C on the train data
best_C = C  # placeholder: replace C with the best C value found in cv_results

model = fit_model(df_train, C=best_C)

### Tracking overfitting for all cases

In [None]:
viz_cv_results(cv_results, show_table=False, plot_confidence=False, plot_fit_time=False)

### Model evaluation

In [None]:
# Print classification report for test set
classification_report_(df_test, model)

# Print predictions on test set
plot_confusion_matrix(df_test, model)

### Get more insights into the model

In [None]:
visualize_coefficients(model, df_train, n_top_features=10)

### How sure was the model of its predictions?


In [None]:
plot_prediction_certainties(df_test, model, log_scale=True)

### Error analysis: Where does our model fail?

In [None]:
error_analysis(df_test, model, doc_nbr=10, n_top_coeff=15, color_by_coeff_sign=True)

# Looking for more

### Why not use Gemini as a coding partner to explore some of your own ideas?