## Basic classifier that uses a Bidirectional Gated Recurrent Unit (BiGRU) to predict whether a payload is an SQL injection

Let's install the dependencies and set up the notebook environment here.


In [1]:
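# quote the version specifier so the shell does not treat ">" as an output redirect
!pip3 install "scikit-learn>=1.0.0"
# pickle is part of the Python standard library, so it is not installed with pip
!pip3 install ktrain matplotlib tensorflow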
import matplotlib
import os
%reload_ext autoreload
%autoreload 2
%matplotlib inline
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"



Some more imports. We are using the ktrain wrapper to simplify model operations and to take advantage of conveniences such as ```autofit```.

In [2]:
import ktrain
from ktrain import text

Here, we will classify each payload as "malicious sqli" or "benign". The data set is provided as a CSV file (i.e., download the file ```SQLiV3a_.csv```). We will load the data using the ```texts_from_csv``` method, which assumes the label columns (here ```sqli``` and ```xss```) are already one-hot-encoded in the spreadsheet. Because *val_filepath* is set in the cell below, that file is used as the validation set; if it were None, 10% of the data would automatically be held out for validation.
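
To make the one-hot layout concrete, here is a minimal sketch that writes a tiny CSV with one 0/1 column per label. The two sample payloads and the helper name `one_hot_rows` are made up for illustration; only the column names `payload`, `sqli`, and `xss` come from this notebook.

```python
import csv
import io

# Hypothetical raw rows (payload, label); illustrative values, not taken from the data set.
raw_rows = [
    ("1' OR '1'='1", "sqli"),
    ("<script>alert(1)</script>", "xss"),
]
label_columns = ["sqli", "xss"]


def one_hot_rows(rows, columns):
    """Yield one dict per row with a 0/1 column for every label."""
    for payload, label in rows:
        encoded = {"payload": payload}
        for column in columns:
            encoded[column] = 1 if label == column else 0
        yield encoded


# Write the encoded rows to an in-memory CSV so the layout can be inspected.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["payload"] + label_columns)
writer.writeheader()
writer.writerows(one_hot_rows(raw_rows, label_columns))
csv_text = buffer.getvalue()
print(csv_text)
```

Each row carries a 1 in the column of its own label and a 0 elsewhere, which is the shape ```texts_from_csv``` expects for ```label_columns```.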


In [3]:
DATA_PATH = 'SQLiV3a_8k.csv'
NUM_WORDS = 50000
MAXLEN = 100
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_csv(DATA_PATH,
                      'payload',
                      label_columns = ["sqli","xss"],
                      val_filepath='SQLiV3a_4k.csv', # if None, 10% of data will be used for validation
                      max_features=NUM_WORDS, maxlen=MAXLEN,
                      ngram_range=1,
                      preprocess_mode='standard')

detected encoding: UTF-8-SIG (if wrong, set manually)
['sqli', 'xss']
   sqli  xss
0     1    0
1     1    0
2     1    0
3     1    0
4     1    0
['sqli', 'xss']
   sqli  xss
0     0    0
1     0    0
2     0    0
3     0    0
4     0    0
language: en
Word Counts: 10936
Nrows: 13605
13605 train sequences
train sequence lengths:
	mean : 9
	95percentile : 26
	99percentile : 42
x_train shape: (13605,100)
y_train shape: (13605, 2)
Is Multi-Label? False
5003 test sequences
test sequence lengths:
	mean : 7
	95percentile : 28
	99percentile : 41
x_test shape: (5003,100)
y_test shape: (5003, 2)
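
The shapes above show every sequence stretched to ```MAXLEN``` columns even though the mean length is about 9 tokens. The sketch below shows the idea in plain Python. It assumes Keras-style defaults (zeros added on the left, and the last ```maxlen``` ids kept when a sequence is too long); ktrain's internal behavior may differ, and the helper name `pad_sequence` and the token ids are hypothetical.

```python
def pad_sequence(token_ids, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen ids if too long."""
    if len(token_ids) >= maxlen:
        return token_ids[-maxlen:]
    return [pad_value] * (maxlen - len(token_ids)) + token_ids


MAXLEN = 100

# A short payload (3 hypothetical token ids) is padded up to MAXLEN.
short = pad_sequence([12, 7, 431], MAXLEN)

# A long payload (150 hypothetical token ids) is truncated down to MAXLEN.
long_seq = pad_sequence(list(range(1, 151)), MAXLEN)

print(len(short), short[-3:])
print(len(long_seq), long_seq[0], long_seq[-1])
```

With a 99th-percentile length of around 42 tokens, a ```MAXLEN``` of 100 means very few payloads get truncated.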


Create a learner instance that uses the ```bigru``` model

In [None]:
model = text.text_classifier('bigru', (x_train, y_train), 
                             preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), val_data=(x_test, y_test))

Is Multi-Label? False
compiling word ID features...
maxlen is 100
word vectors will be loaded from: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
processing pretrained word vectors...
loading pretrained word vectors...this may take a few moments...


Now let's find the optimal learning rate for this model

In [None]:
learner.lr_find()
learner.lr_plot()
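
A common way to read the plot is to pick a learning rate in the region where the loss is still falling steeply, well before it starts to climb. The sketch below is one simple heuristic for that, not ktrain's own method: it takes the steepest downward segment of loss against log10(learning rate) and returns that segment's geometric midpoint. The function name `suggest_lr` and the sample numbers are hypothetical.

```python
import math


def suggest_lr(lrs, losses):
    """Return the geometric midpoint of the segment where loss drops fastest.

    The slope is measured per unit of log10(lr). Returns None if the loss
    never decreases between neighbouring points.
    """
    best_lr, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        step = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / step
        if slope < best_slope:
            best_slope = slope
            best_lr = math.sqrt(lrs[i - 1] * lrs[i])
    return best_lr


# Hypothetical points read off a learning-rate plot.
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [0.69, 0.68, 0.55, 0.30, 0.45, 2.0]

suggested = suggest_lr(lrs, losses)
print(suggested)
```

Treat the result as a starting point and still check it against the graph.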

Train the model using the optimal learning rate (adjust the argument as needed according to the graph)

In [None]:
learner.autofit(0.01)

Now let's evaluate the model on the validation data set

In [None]:
learner.evaluate()
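
If you want to work out precision, recall, and F1 for one class by hand from a confusion matrix, the sketch below shows the arithmetic. It assumes the scikit-learn convention (rows are true classes, columns are predicted classes); the matrix values and the function name `per_class_metrics` are hypothetical, not results from this model.

```python
def per_class_metrics(cm, class_index):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[class_index][class_index]
    fp = sum(row[class_index] for row in cm) - tp  # predicted as this class, but wrong
    fn = sum(cm[class_index]) - tp                 # truly this class, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = [
    [90, 10],
    [5, 95],
]

precision, recall, f1 = per_class_metrics(cm, 0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a payload detector, recall on the malicious class is usually the number to watch, since a missed attack costs more than a false alarm.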

Here is what our model looks like. These are the layers that make up this ```GRU``` model

In [None]:
learner.print_layers()


In [None]:
learner.freeze()

It's time for some fun! First, get a predictor instance that uses the model we just trained, and save it to disk

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('sqli_xss_detector')
print('MODEL SAVED')

Let's see if it can catch an SQLi or XSS payload

In [None]:
import pickle
import numpy as np
from tensorflow.keras.models import load_model
# load the saved preprocessor and model files
with open('sqli_xss_detector/tf_model.preproc', 'rb') as f:
    features = pickle.load(f)
new_model = load_model('sqli_xss_detector/tf_model.h5')
# the order of these labels must match the order of the model's output classes
labels = ['benign', 'malicious']

In [None]:
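# named "payload" so it does not shadow the ktrain "text" module imported earlier
payload = ''  # put the payload string to classify here
preproc_text = features.preprocess([payload])
result = new_model.predict(preproc_text)
label = labels[result[0].argmax(axis=0)]
score = '{:.2f}'.format(round(np.max(result[0]), 2) * 100)
print('LABEL :', label, 'SCORE :', score)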
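
The decoding step in the last cell can be wrapped in a small helper so it is easy to reuse and check. This is a minimal sketch in plain Python: the function name `decode_prediction` and the sample probabilities are hypothetical, and the label list is assumed to be in the same order as the model's output classes.

```python
def decode_prediction(probabilities, labels):
    """Return (label, score_percent) for the highest-probability class."""
    if len(probabilities) != len(labels):
        raise ValueError("need exactly one label per output probability")
    # index of the largest probability, equivalent to argmax
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index], round(probabilities[best_index] * 100, 2)


# Hypothetical model output for a single payload.
label, score = decode_prediction([0.03, 0.97], ["benign", "malicious"])
print(f"LABEL : {label} SCORE : {score:.2f}")
```

The length check catches the case where the label list and the model's output layer disagree on the number of classes.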