<a href="https://colab.research.google.com/github/sweetpand/Algorithms/blob/master/Tutorial_on_SPAM_detection_using_fastai_ULMFiT_Part_2_Classification_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial on SPAM detection using fastai ULMFiT - Part 2: Classification Model

tl;dr: This post is about how to create a classification model using a pre-trained and fine-tuned **language model**, all from the great `fastai` library.

This post is the continuation of [Tutorial on SPAM detection using fastai ULMFiT - Part 1: Language Model](https://drive.google.com/drive/u/0/folders/13uo91qC4cUFPepeRCg5XXoBCFqg3Q2Mn).  

We are going to quickly replicate everything from Part 1. 
And yes, it is less expensive than loading the trained **language model**.

In [0]:
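# Installing torch_nightly and fastai
# Note: this notebook uses the fastai v1 API (fastai.text, TextLMDataBunch);
# if pip pulls a newer major version, pin it with "fastai<2".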
!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu92/torch_nightly.html  gwpy &> /dev/null
!pip install fastai  gwpy &> /dev/null

In [0]:
# import libraries

from fastai import * 
import pandas as pd
import numpy as np
from functools import partial
import io
import os
from fastai.text import *
from sklearn.model_selection import train_test_split

Download the SPAM data from the UCI repository:

In [0]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

df1 = pd.read_csv('SMSSpamCollection', sep='\t',  header=None, names=['target', 'text'])

# split data into training and validation set
df_trn, df_val = train_test_split(df1, stratify = df1['target'], test_size = 0.3, random_state = 999)
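
To see what `stratify` buys us, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the 87/13 label mix is made up for illustration. The idea is simply to split each class separately so both parts keep a similar spam ratio.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=999):
    """Return (train_idx, test_idx), splitting each label group separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # same share taken from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels (not the real data set)
labels = ["ham"] * 87 + ["spam"] * 13
train_idx, test_idx = stratified_split(labels)

train_share = Counter(labels[i] for i in train_idx)
test_share = Counter(labels[i] for i in test_idx)
print("train:", dict(train_share))
print("test: ", dict(test_share))
```

Without stratification, a small validation set could end up with very few spam messages by chance, which would make the later metrics noisy.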

Now we replicate the creation of the language model with the same parameters as Part 1:

In [0]:
# Language model data
data_lm = TextLMDataBunch.from_df(train_df = df_trn, valid_df = df_val, path = "")

lang_mod = language_model_learner(data_lm,  arch = AWD_LSTM, pretrained = True, drop_mult=1.)

lang_mod.fit_one_cycle(4, max_lr= 5e-02)
lang_mod.freeze_to(-1)
lang_mod.fit_one_cycle(3, slice(1e-2/(2.6**4), 1e-2))
lang_mod.freeze_to(-2)
lang_mod.fit_one_cycle(3, slice(3e-3/(2.6**4), 1e-3))
lang_mod.unfreeze()
lang_mod.fit_one_cycle(3, slice(3e-3/(2.6**4), 1e-3))

lang_mod.save_encoder('my_awsome_encoder')
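
The `slice(lr/(2.6**4), lr)` arguments above ask for discriminative learning rates: earlier layer groups get smaller rates than later ones. The sketch below is plain Python, not fastai code, and assumes the rates are spaced geometrically between the two ends of the slice. `spread_learning_rates` and the choice of five layer groups are hypothetical.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Geometrically spaced learning rates, one per layer group."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

n_groups = 5  # assumption for illustration; depends on the model split
lrs = spread_learning_rates(1e-2 / 2.6 ** 4, 1e-2, n_groups)

for i, lr in enumerate(lrs):
    print(f"layer group {i}: lr = {lr:.2e}")
```

With five groups, each group's rate is 2.6 times the previous one, which matches the ULMFiT recipe of dividing the learning rate by 2.6 for each lower layer.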

### POST STARTS HERE! The Classification Model ⚡


Same as before, we create a data bunch with the information needed for the classification.

Note the `vocab` parameter comes from the data used in the language model.

#### Data for *Classification Model*

In [0]:
# Classifier model data
data_clas = TextClasDataBunch.from_df(path = "", train_df = df_trn,  valid_df = df_val, vocab=data_lm.train_ds.vocab, bs=32)
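
Why does the classifier need the language model's `vocab`? The encoder learned one embedding per token id, so the same word must map to the same id in both models. Here is a toy sketch of the problem, in plain Python. `build_vocab` and `numericalize` are hypothetical helpers; only the `xxunk` unknown-token name is borrowed from fastai.

```python
def build_vocab(texts, specials=("xxunk",)):
    """Map each token to an integer id, in order of first appearance."""
    itos = list(specials)
    for text in texts:
        for tok in text.lower().split():
            if tok not in itos:
                itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(text, stoi):
    """Turn a text into token ids, falling back to the unknown token."""
    return [stoi.get(tok, stoi["xxunk"]) for tok in text.lower().split()]

# Same tokens, seen in a different order -> different ids
lm_vocab = build_vocab(["free entry call now", "did you call"])
other_vocab = build_vocab(["did you call", "free entry call now"])

ids_shared = numericalize("free entry call", lm_vocab)
ids_other = numericalize("free entry call", other_vocab)
print(ids_shared, ids_other)
```

If the classifier built its own vocabulary, the ids fed to the loaded encoder would point at the wrong embedding rows, and the fine-tuning from the language model would be wasted.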

Check the batch data; now we have `text` + `target` columns:

In [0]:
data_clas.show_batch()

Let's create the classifier:

In [0]:
learn_classifier = text_classifier_learner(data_clas, drop_mult=0.7, arch = AWD_LSTM)

Next, we load the encoder of the language model we fine-tuned in Part 1 (and replicated above) into the classification model.

In [0]:
learn_classifier.load_encoder('my_awsome_encoder')

### Training the classification model

![Training a deep learning](https://blog.datascienceheroes.com/content/images/2019/12/tweaking-NN.gif)

📌 Similar to what we did with the language model, the steps are:


1. Find the best learning rate (LR)
2. Adjust the last layer with the `fit_one_cycle` function
3. Unfreeze all the layers
4. Find the new LR again

In [0]:
learn_classifier.lr_find()
learn_classifier.recorder.plot(suggestion=True)

In [0]:
learn_classifier.fit_one_cycle(5, max_lr=1e-2, moms=(0.8,0.7))
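
For intuition on what `fit_one_cycle` does with `max_lr` and `moms`, here is a rough sketch of a one-cycle schedule in plain Python: the learning rate warms up to `max_lr` and then decays, while momentum moves in the opposite direction. This is not fastai's code; `pct_start=0.3`, `div_factor=25` and the final divisor are what I believe the v1 defaults to be, so treat them as assumptions.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle(step, total_steps, max_lr=1e-2, moms=(0.8, 0.7),
              pct_start=0.3, div_factor=25.0):
    """Return (learning rate, momentum) for a given training step."""
    warm = int(total_steps * pct_start)
    if step < warm:
        # Phase 1: LR goes up, momentum goes down
        pct = step / warm
        return (cos_anneal(max_lr / div_factor, max_lr, pct),
                cos_anneal(moms[0], moms[1], pct))
    # Phase 2: LR decays to a very small value, momentum goes back up
    pct = (step - warm) / (total_steps - warm)
    return (cos_anneal(max_lr, max_lr / (div_factor * 1e4), pct),
            cos_anneal(moms[1], moms[0], pct))

schedule = [one_cycle(s, 100) for s in range(100)]
for s in (0, 15, 30, 65, 99):
    lr, mom = schedule[s]
    print(f"step {s:3d}: lr = {lr:.5f}, momentum = {mom:.3f}")
```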

In [0]:
learn_classifier.recorder.plot_losses()

Now we unfreeze one more layer, and then we find the new LR:

In [0]:
learn_classifier.freeze_to(-2)

learn_classifier.lr_find()
learn_classifier.recorder.plot(suggestion=True)

Hmmm, depending on the run, the suggested minimum-gradient point might not be the ideal one. The objective is to find an LR just before the loss starts to diverge.
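
To make that rule of thumb concrete, here is a small sketch on a made-up loss curve. It compares two common heuristics: the LR where the loss falls fastest, and the LR at the minimum loss divided by 10. `suggest_lr` is a hypothetical helper and the numbers are invented, not taken from this notebook's run.

```python
def suggest_lr(lrs, losses):
    """Return (LR at steepest loss decrease, LR at minimum loss / 10)."""
    # Central differences on the interior points of the curve
    grads = [(losses[i + 1] - losses[i - 1]) / 2
             for i in range(1, len(losses) - 1)]
    steepest = 1 + min(range(len(grads)), key=grads.__getitem__)
    lowest = min(range(len(losses)), key=losses.__getitem__)
    return lrs[steepest], lrs[lowest] / 10

# Hypothetical LR range test: loss drops, bottoms out, then diverges
lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [0.70, 0.69, 0.65, 0.50, 0.35, 0.40, 2.50]

steepest_lr, tenth_of_min_lr = suggest_lr(lrs, losses)
print(steepest_lr, tenth_of_min_lr)
```

On this toy curve both heuristics land well before the point where the loss blows up. On a noisy real curve they can disagree, which is why it is worth looking at the plot instead of trusting the suggestion blindly.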

Note: After some testing, it was not possible to improve on the previous performance.

### Playing with the classifier! 

![Testing the algorithm](https://media.giphy.com/media/l2R0duZtUJWZjN2ko/giphy.gif)

Trying a non-spam text:

In [0]:
learn_classifier.predict('did you buy the groceries for dinner? :)')

We got the prediction label `ham`, the prediction value `0`, and the tensor of probabilities returned by the softmax function, roughly `[0.998, 0.002]`.

Where 0.998 = 99.8% is the likelihood of non-spam given the text, and 0.002 = 0.2% is the likelihood of spam.
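
As a reminder of where those probabilities come from, here is a minimal softmax sketch in plain Python. The logits are made up so that the result lands near the values above; they are not the model's real outputs.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["ham", "spam"]
logits = [3.1, -3.1]  # hypothetical raw scores for one message

probs = softmax(logits)
pred_idx = max(range(len(probs)), key=probs.__getitem__)
print(classes[pred_idx], pred_idx, [round(p, 3) for p in probs])
```

The predicted label is simply the class with the highest probability, which is why the label, the index and the tensor always agree.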

Now let's try a **suspicious** spam text. 
The following text is similar to one found in the training data, but slightly different.

In [0]:
learn_classifier.predict('Free entry call back now')

Now the classification is what we expected: an 82% chance of being **spam** 🕵️‍♀️

Homework! Try to write some phrases using some of the words that appear in SPAM messages.

####  A side note about the exploratory data analysis



It's interesting what comes out of a quick inspection of the SPAM data:

1. Lots of messages are written in capital letters.
2. Lots of messages include telephone numbers for replying to the SMS.

In line with this, we can test the same message as before and check that adding a number increases the likelihood of that same text message being spam:

In [0]:
learn_classifier.predict('Free entry call back now 0393029442')

The SPAM likelihood increased from 82% to 92%! 

### Validating the Classification Model

Getting the predictions from the validation data, ordered, so we can use them later.

In [0]:
valid_preds, valid_label=learn_classifier.get_preds(ds_type=DatasetType.Valid, ordered=True)
valid_preds.shape

Unexpectedly, if we do the same for the training data, Google Colab crashes, which is why the cell below is commented out.

In [0]:
#train_preds, train_label=learn_classifier.get_preds(ds_type=DatasetType.Train, ordered=True)
#train_preds.shape

_Does anyone know why?_

#### Setting the threshold for SPAM data

First we check the average predicted probability (a proxy for the prior) of each category:


In [0]:
preds=valid_preds.numpy()
print(np.mean(preds[:,0]))
print(np.mean(preds[:,1]))

89% is non-spam.

11% is spam.

These ratios are an important starting point for setting the minimum threshold above which we flag a message as spam. 

The `predict` function assigns the label with the highest probability, which for two classes is equivalent to a threshold of 0.5. 
This is not optimized for the classification task.


#### Testing the threshold

In order to be conservative and reduce the false negative rate, that is, spam that slips through unflagged (so common in this type of anomaly-detection project), the threshold value for the SPAM category will be `0.05`. The price to pay is a higher false positive rate.

Everything above 0.05 will be flagged as spam.
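
Here is a minimal sketch, with made-up scores and labels, of how moving the threshold shifts errors between false positives (ham flagged as spam) and false negatives (spam that gets through). `flag_spam` and `count_errors` are hypothetical helpers, not fastai functions.

```python
def flag_spam(scores, threshold):
    """Flag a message as spam when its spam score exceeds the threshold."""
    return [s > threshold for s in scores]

def count_errors(is_spam, flagged):
    """Return (false positives, false negatives)."""
    fp = sum(1 for truth, f in zip(is_spam, flagged) if f and not truth)
    fn = sum(1 for truth, f in zip(is_spam, flagged) if truth and not f)
    return fp, fn

# Hypothetical spam scores and true labels for eight messages
scores = [0.01, 0.03, 0.08, 0.20, 0.30, 0.45, 0.70, 0.95]
is_spam = [False, False, False, False, True, True, True, True]

for threshold in (0.5, 0.05):
    fp, fn = count_errors(is_spam, flag_spam(scores, threshold))
    print(f"threshold {threshold}: false positives = {fp}, false negatives = {fn}")
```

On this toy data, lowering the threshold from 0.5 to 0.05 catches every spam message but flags two legitimate ones, so the right value depends on which mistake is more costly.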


In [0]:
val_target=preds[:,1]>0.05

We build the final data frame to test the results:



In [0]:
df_val_pred=pd.DataFrame({'text':df_val.text, 'target':df_val.target, 'pred_target':val_target, 'spam_score':preds[:,1]})

#### Confusion Matrix

In [0]:
pd.crosstab(df_val_pred.target, df_val_pred.pred_target)

Not so bad!
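
To put numbers on a confusion matrix like this one, the usual summary metrics can be computed by hand. The counts below are invented for illustration and are not the results of this notebook; `summarize` is a hypothetical helper.

```python
def summarize(tn, fp, fn, tp):
    """Precision, recall and false positive rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "false_positive_rate": fpr}

# Hypothetical counts: rows = true class, columns = predicted class
metrics = summarize(tn=90, fp=10, fn=2, tp=18)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall tells us how much spam we catch, while precision tells us how many of the flagged messages really are spam.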

**Sanity check**: check the score against some cases

In [0]:
df_val_pred.sort_values(['target','pred_target'])

#### ROC curve

The go-to evaluation method for a two-class target is the ROC curve (especially useful with imbalanced data sets):

In [0]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot

df_val_pred['target_binary'] = np.where(df_val_pred['target'].str.contains('spam'), 1, 0)

lr_probs = df_val_pred.spam_score

# calculate scores
lr_auc = roc_auc_score(df_val_pred.target_binary, lr_probs)
lr_fpr, lr_tpr, _ = roc_curve(df_val_pred.target_binary, lr_probs)

In [0]:
# plot the roc curve for the model
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='ULMFiT')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

Area Under the ROC Curve (AUC):

In [0]:
lr_auc

Although it seems too good to be true, the AUC is 0.99.
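
One way to read the AUC: it is the probability that a randomly picked spam message gets a higher score than a randomly picked ham message. The sketch below computes it by comparing every spam/ham pair on made-up data. `auc_by_pairs` is a hypothetical helper, fine for toy data but too slow for large sets, where `roc_auc_score` is the tool to use.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = spam) and spam scores
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.02, 0.10, 0.40, 0.05, 0.35, 0.80, 0.95]

auc = auc_by_pairs(y_true, scores)
print(round(auc, 3))
```

Here 11 of the 12 spam/ham pairs are ordered correctly. Note that the AUC does not depend on the threshold we picked, only on how well the scores rank the messages.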

![alt text](https://blog.datascienceheroes.com/content/images/2019/12/not-bad-154.png)

### Summing-up!


The fastai library provides an intuitive and easy-to-use interface to create a text classification model (among others), which takes advantage of the fine-tuned ULMFiT language model we saw in Part 1.

Transfer learning techniques in NLP quickly give us a proven semantic base (a model pretrained on a large corpus of Wikipedia articles), which we can then adapt to our domain data just by running the `fit_one_cycle` function. An incredible job by the fastai team!

### Continue learning

Definitely check the official documentation, which is well written and full of examples: [Efficient multi-lingual language model fine-tuning](https://nlp.fast.ai/)

An example related to tweet classification: [Transfer Learning in NLP for Tweet Stance Classification](https://towardsdatascience.com/transfer-learning-in-nlp-for-tweet-stance-classification-8ab014da8dde?gi=451c25762288)

A more technical article: [Understanding building blocks of ULMFIT](https://medium.com/mlreview/understanding-building-blocks-of-ulmfit-818d3775325b)


---

Any questions or suggestions? Leave them in the blog post comment section 📩


### Get in touch!  🌎

Find me at: [linkedin](https://www.linkedin.com/in/pcasas/) & [twitter](https://twitter.com/pabloc_ds) 

📗 [Data science Live Book](https://livebook.datascienceheroes.com) 