## Basic classifier that uses fastText to detect SQL injection and cross-site scripting attacks

First, let's install and import the required packages.


In [None]:
!pip3 install "scikit-learn>=1.0.0"
!pip3 install ktrain matplotlib tensorflow numpy
import matplotlib
import os
import numpy as np
%reload_ext autoreload
%autoreload 2
%matplotlib inline
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

Some more imports. We use the ktrain wrapper to simplify model operations and to take advantage of features such as ```autofit``` and learning-rate adjustment.

In [None]:
import ktrain
from ktrain import text

Here, we will classify each payload as "sqli", "xss", or "benign". The dataset is provided as a CSV file (download the file ```SQLiV3a_14k.csv```). Keep in mind that the data was obtained from public sources and is therefore not fully reliable: some requests may be mislabeled, so false positives are likely. We will load the data using the ```texts_from_csv``` method, which assumes the label_columns are already one-hot-encoded in the spreadsheet. Since *val_filepath* is None, 10% of the data will automatically be used as a validation set.
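
For readers unfamiliar with the term, one-hot encoding turns each class name into a vector that contains a single 1. The following is a minimal pure-Python sketch of the idea, not what ktrain runs internally. The helper name `one_hot` and the sample labels are made up for illustration, and the alphabetical class order is an assumption.

```python
# Illustrative only: `one_hot` and the sample labels are hypothetical.
def one_hot(label_values):
    """Return (classes, vectors); each vector has a single 1 at the class index."""
    classes = sorted(set(label_values))            # assumed alphabetical class order
    index = {name: i for i, name in enumerate(classes)}
    vectors = [[1 if index[value] == i else 0 for i in range(len(classes))]
               for value in label_values]
    return classes, vectors

sample_labels = ["sqli", "benign", "xss", "sqli"]
classes, vectors = one_hot(sample_labels)
for value, vec in zip(sample_labels, vectors):
    print(value, vec)
```

With three classes, every row of the label matrix has length 3, which is why the output layer added later in this notebook has three units.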


In [None]:
DATA_PATH = 'SQLiV3a_14k.csv'
NUM_WORDS = 50000
MAXLEN = 200
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,
                      'payload',
                      label_columns = ["type"],
                      val_filepath=None, # if None, 10% of data will be used for validation
                      max_features=NUM_WORDS, maxlen=MAXLEN,
                      ngram_range=1,
                      preprocess_mode='standard')
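
To make the roles of ```max_features``` and ```maxlen``` more concrete, here is a rough sketch of what such preprocessing typically does: keep only the most frequent tokens, map them to integer ids, and pad or truncate every sequence to a fixed length. This is a simplification under stated assumptions, not ktrain's actual implementation: whitespace tokenization, left-padding with 0, and id 1 for unknown tokens are all choices made for this example, and `build_vocab` and `encode` are hypothetical helpers.

```python
from collections import Counter

def build_vocab(samples, max_features):
    """Map the `max_features` most frequent tokens to ids starting at 2.
    Id 0 is reserved for padding and id 1 for unknown tokens (assumption)."""
    counts = Counter(tok for s in samples for tok in s.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def encode(sample, vocab, maxlen):
    """Convert one string to exactly `maxlen` ids (maxlen must be > 0)."""
    ids = [vocab.get(tok, 1) for tok in sample.lower().split()]
    ids = ids[-maxlen:]                        # keep the last `maxlen` tokens
    return [0] * (maxlen - len(ids)) + ids     # left-pad short sequences

samples = ["select * from users", "select name from users where id = 1"]
vocab = build_vocab(samples, max_features=3)
encoded = encode("select id from users", vocab, maxlen=6)
print(vocab)
print(encoded)
```

A small ```maxlen``` silently drops part of long payloads, so it is worth checking the length distribution of the dataset before lowering it.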

Create a learner instance that uses the ```fasttext``` model (https://fasttext.cc) with pre-trained word vectors.

In [None]:
model = text.text_classifier('fasttext', (x_train, y_train), 
                             preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))

Let's print the model's layers to better understand its structure and depth. We can extend the model with additional layers or use it as is.

In [None]:
learner.print_layers()


Let's add some layers to our model:

In [None]:
from tensorflow.keras.layers import Dense, GaussianDropout

def get_model():
    # Start from ktrain's fasttext classifier
    model = text.text_classifier('fasttext', (x_train, y_train),
                                 preproc=preproc)
    # Extra regularization, then one output unit per class (benign, sqli, xss)
    model.add(GaussianDropout(0.2))
    model.add(Dense(3, activation='sigmoid'))
    # Recompile so the added layers are part of the training graph
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = get_model()
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))
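
The added output layer uses a ```sigmoid``` activation with ```binary_crossentropy```, so each of the three scores is computed independently and the scores need not sum to 1. The sketch below contrasts this with ```softmax``` on the same inputs. The logit values are made up for illustration, and the helper functions are plain-Python stand-ins rather than the Keras implementations.

```python
import math

def sigmoid(logits):
    """Independent score per class; the scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Scores that compete with each other and always sum to 1."""
    top = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0, 0.5]    # hypothetical raw scores for benign, sqli, xss
sig = sigmoid(logits)
soft = softmax(logits)
print("sigmoid:", [round(p, 3) for p in sig], "sum =", round(sum(sig), 3))
print("softmax:", [round(p, 3) for p in soft], "sum =", round(sum(soft), 3))
```

Both activations rank the classes in the same order here, but only the softmax scores can be read as a probability distribution over the three classes. This matters when interpreting the score printed at the end of the notebook.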

This is our new model:

In [None]:
learner.print_layers()

Now let's find a good learning rate for this model using ktrain's ```lr_find()``` method and plot the result with ```lr_plot()```.

In [None]:
learner.lr_find()
learner.lr_plot()
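
When reading the plot, a common heuristic is to pick a learning rate in the region where the loss is still falling steeply, well before it starts to diverge. The sketch below applies that heuristic to made-up numbers. `steepest_descent_lr` is a hypothetical helper written for this example, not part of ktrain, and the loss values are invented.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at the start of the steepest loss drop,
    measured per unit of log10(learning rate). Returns None if the loss
    never decreases."""
    best_lr, best_slope = None, 0.0
    for i in range(len(lrs) - 1):
        step = math.log10(lrs[i + 1]) - math.log10(lrs[i])
        slope = (losses[i + 1] - losses[i]) / step
        if slope < best_slope:
            best_lr, best_slope = lrs[i], slope
    return best_lr

# Hypothetical range-test results: loss falls, flattens, then diverges
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [1.10, 1.08, 0.95, 0.60, 0.55, 2.50]
suggested = steepest_descent_lr(lrs, losses)
print("suggested learning rate:", suggested)
```

Treat the result as a starting point only; the plot itself remains the reference when choosing the value passed to training.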

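Train the model using the learning rate chosen from the plot (adjust the argument as needed according to the graph). Before training, ```learner.freeze(6)``` freezes the first 6 layers so that only the remaining layers are updated.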

In [None]:
learner.freeze(6)

In [None]:
learner.autofit(0.04)
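
ktrain's documentation describes ```autofit``` as using a triangular learning rate policy by default. The exact shape ktrain uses is not reproduced here; the sketch below shows a generic triangular schedule that rises linearly to a peak and falls back again. The function `triangular_lr`, the cycle length, and the choice of a base rate of one tenth of the peak are all assumptions for illustration.

```python
def triangular_lr(step, cycle_len, base_lr, max_lr):
    """Learning rate at `step` for a triangular cycle of `cycle_len` steps."""
    half = cycle_len / 2
    pos = step % cycle_len
    # Fraction of the way to the peak: rises on the first half, falls on the second
    frac = pos / half if pos <= half else (cycle_len - pos) / half
    return base_lr + (max_lr - base_lr) * frac

# Hypothetical settings: peak at 0.04 (the value passed to autofit above)
schedule = [round(triangular_lr(s, cycle_len=8, base_lr=0.004, max_lr=0.04), 4)
            for s in range(9)]
print(schedule)
```

The value passed to ```autofit``` plays the role of the peak rate in this picture, which is why it should come from the steep part of the learning-rate plot.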

It's time for some fun! First, wrap the trained model in a predictor instance and save it to disk.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('detector_model_fasttext_my')
print('MODEL SAVED')

Let's see if it can catch an SQLi or XSS payload. We load the saved preprocessor and model files directly.

In [None]:
import pickle
from tensorflow.keras.models import load_model

# Load the saved preprocessor and model files
with open('detector_model_fasttext_my/tf_model.preproc', 'rb') as f:
    features = pickle.load(f)
new_model = load_model('detector_model_fasttext_my/tf_model.h5')
# Class names; the order must match the order used during training
labels = ['benign', 'sqli', 'xss']

In [None]:
# Named `payload` rather than `text` to avoid shadowing the ktrain `text` module
payload = '<applet onkeydown="alert(1)" contenteditable>test</applet>'
preproc_text = features.preprocess([payload])
result = new_model.predict(preproc_text)
print(result)
# Pick the class with the highest score and report it as a percentage
label = labels[result[0].argmax(axis=0)]
score = '{:.2f}'.format(np.max(result[0]) * 100)
print('LABEL :', label, 'SCORE :', score)
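
Because the training data may contain mislabeled requests, it can help to treat low-scoring predictions with caution rather than always reporting the top class. The following sketch wraps the decoding step in a small helper that works on a plain list of scores. `decode_prediction`, the `"uncertain"` label, and the 0.5 threshold are hypothetical choices for illustration, not part of ktrain or Keras.

```python
def decode_prediction(scores, class_names, min_score=0.5):
    """Return (label, percent) for the top score. The label becomes
    'uncertain' when the top score is below `min_score`."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    percent = round(scores[best] * 100, 2)
    label = class_names[best] if scores[best] >= min_score else "uncertain"
    return label, percent

class_names = ["benign", "sqli", "xss"]
confident = decode_prediction([0.02, 0.05, 0.97], class_names)
doubtful = decode_prediction([0.40, 0.35, 0.30], class_names)
print(confident)
print(doubtful)
```

To use it with the model output above, pass a single row converted to a list, for example ```result[0].tolist()```.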