## Basic classifier that uses the DistilBERT model to predict whether a payload is an SQL injection (sqli), cross-site scripting (xss), or benign
##### DistilBERT is a smaller and faster version of BERT (Bidirectional Encoder Representations from Transformers) that is 40% lighter while retaining 97% of BERT's language understanding ability. More on DistilBERT at [HuggingFace](https://huggingface.co/docs/transformers/model_doc/distilbert)

### To begin, let's install and import some packages


In [None]:
# Quote the version specifier so the shell does not treat ">" as an output redirect
!pip3 install "scikit-learn>=1.0.0"
!pip3 install ktrain matplotlib tensorflow numpy
import matplotlib
import os
import numpy as np
%reload_ext autoreload
%autoreload 2
%matplotlib inline
# Order GPUs by PCI bus ID and expose only the first one
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

#### Some more imports. We use the ktrain wrapper to simplify model operations and to take advantage of features such as simplified data set preprocessing, learning rate finding, and "autofit", which helps keep the model from overfitting. More details on ktrain here: [ktrain on GitHub](https://github.com/amaiya/ktrain)

In [None]:
import ktrain
from ktrain import text

#### Let's print the list of text classifiers available in ktrain. They range from relatively simple models like fasttext or bigru, which have only 7-10 layers, to more sophisticated deep models like BERT

In [None]:
text.print_text_classifiers()

### Here, we will load our data set. 
##### We have a CSV file ``trainlist_22k.csv`` containing HTTP paths that are labeled according to their association with cross-site scripting (xss) and SQL injection (sqli). "Regular" traffic belongs to a "benign" class. The data set is loaded with the ```texts_from_csv``` method, which accepts labels either as one-hot-encoded columns or as a single column of class names; here we pass the single ``type`` column. Since *val_filepath* is None, 10% of the data will automatically be used as a validation set.
##### In our set we have 1 feature (payload) and 1 label (type) that contains 3 classes:
 - xss
 - sqli
 - benign
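
To make the layout concrete, here is a minimal sketch of what such a file might look like and how the single ``type`` column maps to one-hot labels. The sample rows are hypothetical, not taken from ``trainlist_22k.csv``, and the alphabetical class order is an assumption made for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical rows mimicking the assumed layout: a `payload` column and a `type` column.
SAMPLE_CSV = """payload,type
/index.html?id=42,benign
/search?q=<script>alert(1)</script>,xss
/item?id=1' OR '1'='1,sqli
/about,benign
"""

def class_counts(csv_text, label_column="type"):
    """Count how many rows carry each class label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)

def one_hot(label, classes):
    """Encode one class name as a one-hot list over the given class order."""
    return [1 if label == c else 0 for c in classes]

counts = class_counts(SAMPLE_CSV)
classes = sorted(counts)  # assumed class order: alphabetical
print(counts)
print(classes, one_hot("sqli", classes))
```

Checking the class counts this way before training is a quick way to spot a heavily imbalanced data set.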

##### We will be using the DistilBERT model, so the preprocessing mode is set to ``distilbert``

In [None]:
DATA_PATH = 'trainlist_22k.csv'
NUM_WORDS = 50000
MAXLEN = 200
trn, val, preproc = text.texts_from_csv(DATA_PATH,
                      'payload',
                      label_columns = ["type"],
                      val_filepath=None, # if None, 10% of data will be used for validation
                      max_features=NUM_WORDS, maxlen=MAXLEN,
                      ngram_range=1,
                      preprocess_mode='distilbert')
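
The automatic 10% hold-out mentioned above can be pictured with the following sketch. It is a simplified illustration in plain Python, not ktrain's internal code; the helper name ``train_val_split``, the seed, and the placeholder rows are all assumptions.

```python
import random

def train_val_split(rows, val_fraction=0.1, seed=42):
    """Shuffle a copy of rows and hold out val_fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

rows = [f"payload_{i}" for i in range(1000)]  # placeholder rows, not real data
train_rows, val_rows = train_val_split(rows)
print(len(train_rows), len(val_rows))
```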

### Let's create the learner instance that uses the ```Distilbert``` model. We will keep the model structure unchanged

In [None]:
# Only needed if the experimental layers below are enabled
from tensorflow.keras.layers import Dense, GaussianDropout

def get_model():
    # Build the pre-trained DistilBERT classifier for our data
    model = text.text_classifier('distilbert', trn, preproc=preproc)
    # Experimental extra layers, left commented out so the model structure stays unchanged
    #model.add(Dense(3, activation='sigmoid'))
    #model.add(GaussianDropout(0.2))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = get_model()
learner = ktrain.get_learner(model, train_data=trn, val_data=val)
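
The ``compile`` call above selects binary cross-entropy as the loss. As a rough illustration of how that number is computed for one sample, the sketch below averages the per-class terms in plain Python. It assumes the model outputs are probabilities in [0, 1]; the target and the two prediction vectors are hypothetical values, not results from this model.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the outputs of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 1, 0]               # one-hot target (hypothetical class order)
confident = [0.05, 0.90, 0.05]   # hypothetical confident, correct prediction
unsure = [0.30, 0.40, 0.30]      # hypothetical hesitant prediction

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(round(loss_confident, 4), round(loss_unsure, 4))
```

The confident prediction yields a much smaller loss than the hesitant one, which is the behavior the training loop tries to encourage.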

#### Here is what our model looks like. Many of its layers are pre-trained, which lets us leverage transfer learning.
##### Source code for modeling_tf_distilbert can be found at [HuggingFace Transformers](https://huggingface.co/transformers/v2.3.0/_modules/transformers/modeling_tf_distilbert.html)

In [None]:
learner.print_layers()


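#### We need to ensure that the majority of the pre-trained weights are not re-trained. In ktrain, ``freeze(n)`` freezes the first ``n`` layers shown by ``print_layers`` and leaves the remaining layers trainable, so we freeze the pre-trained part with the following command: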

In [None]:
learner.freeze(1)

#### The optimal learning rate for this model can be found using the **lr_find** function; however, it takes at least **20 minutes** on this CPU-only VM. The optimal rate was found to be 3e-5, so there is no need to spend time on this now. **If you still want to proceed**, uncomment the command in the cell below and run it

In [None]:
#learner.lr_find(show_plot=True, max_epochs=2)

### Now let's train the model using the optimal learning rate. Accuracy improves after 4-5 epochs; however, to save time, we will run the cycle for only 2 epochs. That should give us ~93% accuracy and an observed loss (binary crossentropy) of ~0.09

In [None]:
learner.fit_onecycle(3e-5, 2)
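
For intuition about what a one-cycle policy does with the learning rate, here is a simplified sketch: the rate ramps up linearly to the maximum (3e-5 here) and then back down. This is an illustration under stated assumptions, not ktrain's actual implementation; the function name ``one_cycle_lr`` and the ``start_div`` parameter are hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Triangular schedule: ramp linearly up to max_lr, then back down."""
    base_lr = max_lr / start_div  # assumed starting rate, a tenth of the maximum
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # rising half of the cycle
    else:
        frac = (total_steps - step) / half  # falling half of the cycle
    return base_lr + (max_lr - base_lr) * frac

total = 10
schedule = [one_cycle_lr(s, total, 3e-5) for s in range(total + 1)]
print([f"{lr:.1e}" for lr in schedule])
```

The small rates at the start and end keep updates gentle while the weights settle, and the peak in the middle is where most of the progress is made.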

##### The autofit function can help train the model without overfitting it. **Do not run** it unless you are willing to spend days (or perhaps weeks) on training

In [None]:
#learner.autofit(3e-5)

#### Alright, let's save our predictor so we can use it to perform inferences outside of the Jupyter notebook

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('/home/jupyter/detector_model_lab')
print('MODEL SAVED')

### It's time for some fun! First, get a predictor instance that uses the model we just saved

In [None]:
predictor = ktrain.load_predictor('/home/jupyter/detector_model_lab')
new_model = ktrain.get_predictor(predictor.model, predictor.preproc)

#### Let's see if it can catch an XSS payload

In [None]:
# Use a name other than `text` so the ktrain `text` module imported earlier is not shadowed
payload = '<applet onkeydown="alert(1)" contenteditable>test</applet>'
result = new_model.predict(payload)
print(result)

#### Now we can run more serious testing outside of the notebook
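
As a starting point for such testing, here is a minimal sketch of a batch-scoring helper. It accepts any callable that maps a list of payloads to a list of labels. ``stub_predict`` is a hypothetical keyword-based stand-in so the sketch runs on its own; it is not the DistilBERT model and its rules are purely illustrative.

```python
from collections import Counter

def score_payloads(predict, payloads):
    """Run predict over payloads; return (payload, label) pairs and label counts."""
    labels = predict(payloads)
    pairs = list(zip(payloads, labels))
    return pairs, Counter(labels)

def stub_predict(payloads):
    """Hypothetical stand-in for a trained predictor; NOT the DistilBERT model."""
    labels = []
    for p in payloads:
        lowered = p.lower()
        if "<script" in lowered or "alert(" in lowered:
            labels.append("xss")
        elif " or " in lowered or "union select" in lowered:
            labels.append("sqli")
        else:
            labels.append("benign")
    return labels

payloads = [
    "/index.html",
    "/search?q=<script>alert(1)</script>",
    "/item?id=1' OR '1'='1",
]
pairs, counts = score_payloads(stub_predict, payloads)
for payload, label in pairs:
    print(f"{label:7s} {payload}")
```

Outside the notebook, the ``predict`` method of the predictor returned by ``ktrain.load_predictor`` could presumably be passed in place of ``stub_predict``, assuming it is given a list of strings and returns one label per string.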