## ML experimentation
#### Current approach:
- Selection of 3 different encoders (`EncoderCountVectorizer`, 2x BERT embeddings)
- All encoders use CodeBERT as a tokenizer
- line, prev_line, next_line & func_vul are considered
- MLPClassifier, not optimized yet


#### Open issues:
- Try different ML models
- Include joern-features (maybe, but unlikely due to resource constraints)
- Hyperparameter tuning
- ...

_References_:
- https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
- https://github.com/microsoft/CodeBERT

In [1]:
import pandas as pd
import numpy as np
import math
import torch
from tqdm import tqdm
import itertools
from line_encoders import EncoderCountVectorizer, EncoderTFIDFVectorizer, EncoderBERTVectorConcat, EncoderBERTStringConcat

In [2]:
# DO NOT forget 'keep_default_na=False' --> otherwise some NaN values in read data
df = pd.read_csv('./big-vul_dataset/line_sample_20p_balanced_train.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)

# DO NOT forget 'keep_default_na=False' --> otherwise some NaN values in read data
df_functions = pd.read_csv('./big-vul_dataset/functions_only_all.csv', usecols=['func_id', 'target'], skipinitialspace=True, low_memory=True, keep_default_na=False)
df = pd.merge(df, df_functions, on='func_id', how='inner')
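Why `keep_default_na=False` matters here: by default pandas parses strings such as `NULL`, `NA` or an empty field as missing values, and a source line can legitimately consist of exactly such a token. A minimal sketch with made-up rows (not taken from the Big-Vul data):

```python
import io
import pandas as pd

# Hypothetical three-row sample; the 'line' values are invented for illustration.
csv_text = "func_id,line\n1,NULL\n2,return x;\n3,\n"

df_default = pd.read_csv(io.StringIO(csv_text))
df_keep = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# With the defaults, 'NULL' and the empty field are both parsed as NaN.
n_missing_default = int(df_default["line"].isna().sum())
# With keep_default_na=False they stay ordinary strings ('NULL' and '').
n_missing_keep = int(df_keep["line"].isna().sum())
lines_keep = df_keep["line"].tolist()

print(n_missing_default, n_missing_keep, lines_keep)
```

With the default settings such rows would reach the encoder as `NaN` floats instead of strings.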

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,func_id,line,vul,prev_line,next_line,target
0,0,178329,s->method = TLSv1_1_server_method();,1,else if (s->version == TLS1_1_VERSION),else if (s->version == TLS1_VERSION),1
1,1,188459,"long long element_start,",1,"long long size_,",long long element_size) :,1
2,2742,188459,"m_element_start(element_start),",1,"m_size(size_),","m_element_size(element_size),",1
3,3473,188459,"long long start,",1,"Segment* pSegment,","long long size_,",1
4,6263,188459,"m_element_size(element_size),",1,"m_element_start(element_start),","m_entries(0),",1


## Data preparation section

- Selection of 3 different data encoders
- Create train and test dataset
- Split into batches

### Please select _encoder_ below!
- `encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vocab/vec_vocabulary.pkl")`
- `encoder = EncoderBERTVectorConcat()`
- `encoder = EncoderBERTStringConcat()`

In [4]:
### Use a standard vectorization approach by creating a feature matrix.
### Loses semantic and structural information, but it is FAST (>> anything BERT embedding related)
### Pre-configured to use 750 features 
encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vocab/vec_vocabulary.pkl")
#encoder = EncoderTFIDFVectorizer(df_data=df)
#encoder = EncoderCountVectorizer(df_data=df)

### Most precise since the embeddings of all 3 features are preserved (i.e. generates a 2304-element vector)
### On average 2x slower than EncoderBERTStringConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

#encoder = EncoderBERTVectorConcat(avg_embeddings=False)

### Less precise since all 3 features are concatenated before embedding creation (i.e. generates a 768-element vector)
### On average 2x faster than EncoderBERTVectorConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

#encoder = EncoderBERTStringConcat()
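The difference between the two BERT-based strategies can be sketched with NumPy alone. `fake_embed` below is a hypothetical stand-in for the CodeBERT forward pass (it returns random token vectors of width 768). The real encoders live in `line_encoders` and may differ in detail, for example in the order or separator used when joining the strings.

```python
import numpy as np

HIDDEN = 768  # embedding width implied by the 768 / 2304 sizes mentioned above

def fake_embed(text, rng):
    """Hypothetical stand-in for CodeBERT: one random vector per whitespace token."""
    n_tokens = max(1, len(text.split()))
    return rng.normal(size=(n_tokens, HIDDEN))

def pool(token_vectors, avg_embeddings):
    # avg_embeddings=False: keep the first token vector; True: average all of them.
    return token_vectors.mean(axis=0) if avg_embeddings else token_vectors[0]

def vector_concat(line, prev_line, next_line, rng, avg_embeddings=False):
    # Embed each feature separately, then concatenate: 3 * 768 = 2304 values.
    parts = [pool(fake_embed(t, rng), avg_embeddings) for t in (line, prev_line, next_line)]
    return np.concatenate(parts)

def string_concat(line, prev_line, next_line, rng, avg_embeddings=False):
    # Join the strings first (assumed order), then embed once: 768 values.
    joined = " ".join((prev_line, line, next_line))
    return pool(fake_embed(joined, rng), avg_embeddings)

rng = np.random.default_rng(0)
v = vector_concat("int x = 5;", "void f() {", "}", rng)
s = string_concat("int x = 5;", "void f() {", "}", rng)
print(v.shape, s.shape)  # (2304,) (768,)
```

The vector variant needs three forward passes per sample instead of one, which fits the roughly 2x runtime difference noted in the comments above.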



In [None]:
test = encoder.encode(["int x = 5", "int x = 5"], ["function(){", "function(){"], ["}", "}"], [0,1])
test

In [5]:
from sklearn.model_selection import train_test_split

# split into training and validation sets
line_tr, line_test, prev_line_tr, prev_line_test, next_line_tr, next_line_test, func_vul_tr, func_vul_test, y_tr, y_test = \
    train_test_split(df['line'], df['prev_line'], df['next_line'], df['target'], df['vul'], test_size=0.2, random_state=42)

In [6]:
# define how big a batch of entries should be (depending on RAM)
batch_size = 100

# number of batches (stored in 'epochs') is calculated based on the batch_size
epochs = math.ceil(len(line_tr)/batch_size)

# split each training series into a list of batches (one entry per batch)
batchesLine = np.array_split(line_tr, epochs)
batchesPrevLine = np.array_split(prev_line_tr, epochs)
batchesNextLine = np.array_split(next_line_tr, epochs)
batchesFuncVul = np.array_split(func_vul_tr, epochs)
batchesY = np.array_split(y_tr, epochs)
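A quick check of the batching arithmetic: `np.array_split` with `ceil(n / batch_size)` sections yields near-equal batches that never exceed `batch_size`. The value called `epochs` in the cell above is therefore the number of batches in a single pass over the data. The array below is synthetic and its length is chosen only for illustration.

```python
import math
import numpy as np

batch_size = 100
rows = np.arange(1050)  # synthetic stand-in for line_tr

n_batches = math.ceil(len(rows) / batch_size)
batches = np.array_split(rows, n_batches)
sizes = [len(b) for b in batches]

print(n_batches, min(sizes), max(sizes))  # 11 95 96
```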

## ML Model

- Using train_test_split for creating subsets of data
- Training MLPClassifier with training data
- Evaluating model with test data
- (Include RandomForestClassifier as one alternative option for model)


In [7]:
# import and initialisation of generic MLPClassifier
# note: learning_rate='adaptive' only takes effect with solver='sgd'; it is ignored with 'adam'
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(activation='relu', alpha=0.05, hidden_layer_sizes=(1500,750), learning_rate='adaptive',solver='adam', shuffle=True)

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, func_vul, Y_batch = \
        batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesFuncVul[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), \
                                  next_line_batch.tolist(), func_vul.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))

100%|████████████████████| 114/114 [01:19<00:00,  1.44it/s]
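The loop above makes a single pass over the data, so every sample is seen once. If that underfits, one option is to wrap the batch loop in an outer loop over real epochs and reshuffle between passes. A minimal sketch on synthetic data: the feature matrix, the labels and the choice of five passes are invented for illustration, and `SGDClassifier` is used only because it trains quickly.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
# Synthetic, linearly separable toy data standing in for encoded batches.
X = rng.normal(size=(600, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.unique(y)   # must be passed on the first partial_fit call
n_batches = 6
n_passes = 5             # hypothetical number of real epochs

for _ in range(n_passes):
    order = rng.permutation(len(X))            # reshuffle between passes
    for idx in np.array_split(order, n_batches):
        clf.partial_fit(X[idx], y[idx], classes=classes)

train_acc = clf.score(X, y)
print(f"training accuracy after {n_passes} passes: {train_acc:.2f}")
```

With a BERT encoder the encoding step dominates the runtime, so caching the encoded batches between passes would probably be necessary.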


In [None]:
# RandomForestClassifier (Currently WIP)
# Does not provide a partial_fit() method, therefore workaround by increasing tree count each epoch

TREE_INCREASE_EACH_EPOCH = 10

# import and initialisation of RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, warm_start=True)

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch, encode it, and fit the newly added trees on it
    line_batch, prev_line_batch, next_line_batch, func_vul, Y_batch = \
        batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesFuncVul[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), \
                                  next_line_batch.tolist(), func_vul.tolist())
    
    clf.fit(encodedBatch, Y_batch)
    # increase tree count each epoch
    clf.set_params(n_estimators=len(clf.estimators_)+TREE_INCREASE_EACH_EPOCH)
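How the `warm_start` workaround behaves can be sketched on synthetic data. Each call to `fit` keeps the existing trees and trains only the newly requested ones, so every group of trees has seen exactly one batch. The data and the tree counts below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))      # synthetic features
y = (X[:, 0] > 0).astype(int)      # synthetic labels

TREES_PER_BATCH = 10
clf = RandomForestClassifier(n_estimators=TREES_PER_BATCH, warm_start=True, random_state=0)

tree_counts = []
for X_batch, y_batch in zip(np.array_split(X, 3), np.array_split(y, 3)):
    clf.fit(X_batch, y_batch)      # only the newly added trees are trained on this batch
    tree_counts.append(len(clf.estimators_))
    clf.set_params(n_estimators=clf.n_estimators + TREES_PER_BATCH)

print(tree_counts)  # [10, 20, 30]
```

Unlike `partial_fit`, no tree is ever updated with later data, so the forest is closer to an ensemble of per-batch models than to one model trained on everything.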

In [None]:
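# Naive Bayes approach
# note: MultinomialNB requires non-negative features, so it fits the count/TF-IDF encoders
# but not the BERT embeddings (which contain negative values)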

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, func_vul, Y_batch = \
        batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesFuncVul[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), \
                                  next_line_batch.tolist(), func_vul.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))



In [None]:
# Linear classifier trained with SGD (default loss 'hinge', i.e. a linear SVM)
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, func_vul, Y_batch = \
        batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesFuncVul[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), \
                                  next_line_batch.tolist(), func_vul.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))



In [None]:
X_test_encoded = encoder.encode(line_test.tolist(), prev_line_test.tolist(), \
                                next_line_test.tolist(), func_vul_test.tolist())

print("Accuracy of prediction: " , clf.score(X_test_encoded, y_test))
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))

## Model persistence
Selectively execute when needed!

In [None]:
from joblib import dump
# Store model
dump(clf, 'models/mlp_Count_withFuncVul.model')

In [None]:
from joblib import load
# Load model
clf = load('models/mlp_Count_withFuncVul.model') 

## Hyperparameter tuning

<span style="color:red">Currently not _really_ supported due to missing batch support. <br>i.e. only works for small data samples and _EncoderCountVectorizer_</span> 

In [None]:
mlp = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# TODO: add batch support to allow for larger data and even alternative encoders

encoded_X_tr = encoder.encode(line_tr.tolist(), prev_line_tr.tolist(), next_line_tr.tolist(), func_vul_tr.tolist())

from sklearn.model_selection import GridSearchCV

gridclf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
gridclf.fit(encoded_X_tr, y_tr)


# Best parameter set
print('Best parameters found:\n', gridclf.best_params_)

# All results
means = gridclf.cv_results_['mean_test_score']
stds = gridclf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gridclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
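One possible way around the missing batch support is a manual search: iterate over `ParameterGrid`, train each candidate with `partial_fit` on batches, and score it on a held-out split. The sketch below uses synthetic data, a hypothetical two-parameter grid and `SGDClassifier` for speed. It also replaces the 3-fold cross-validation of `GridSearchCV` with a single validation split, which is cheaper but noisier.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 10))               # synthetic features
y = (X[:, 0] - X[:, 1] > 0).astype(int)      # synthetic labels
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Hypothetical grid; swap in the parameters of the model under test.
grid = ParameterGrid({"alpha": [1e-4, 1e-2], "penalty": ["l2", "l1"]})
classes = np.unique(y_tr)

results = []
for params in grid:
    clf = SGDClassifier(random_state=0, **params)
    # In the notebook, each batch would be encoded here before partial_fit.
    for idx in np.array_split(np.arange(len(X_tr)), 6):
        clf.partial_fit(X_tr[idx], y_tr[idx], classes=classes)
    results.append((clf.score(X_val, y_val), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, round(best_score, 3))
```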



In [None]:
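from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# encode the full split in one go (no batching, so small samples only)
X_train = encoder.encode(line_tr.tolist(), prev_line_tr.tolist(), next_line_tr.tolist(), func_vul_tr.tolist())
X_test = encoder.encode(line_test.tolist(), prev_line_test.tolist(), next_line_test.tolist(), func_vul_test.tolist())

rndclf = RandomForestClassifier(max_depth=2, random_state=0)
rndclf.fit(X_train, y_tr)
print("Accuracy of prediction: ", rndclf.score(X_test, y_test))

y_pred = rndclf.predict(X_test)
print(classification_report(y_test, y_pred))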



## Validation section

- Read unbalanced dataset
- Encode using selected encoder
- Evaluate model

In [10]:
# Load data for validation of the whole process

# DO NOT forget 'keep_default_na=False' --> otherwise some NaN values in read data
val_df = pd.read_csv('./big-vul_dataset/line_sample_val.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)
df_functions = pd.read_csv('./big-vul_dataset/functions_only_all.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)
val_df = pd.merge(val_df, df_functions, on='func_id', how='inner')
val_df = val_df.reset_index()
val_df = val_df.sample(2000)

In [11]:
from joblib import load
# Load function model
clf_func = load('models/full_function.model')

from function_encoders import FuncEncoderCountVectorizer
encoder_func = FuncEncoderCountVectorizer(vocabulary_path="vocab/func_vocab.pkl")

line_val, prev_line_val, next_line_val, func_vul, func_code, y_val = \
    val_df['line'], val_df['prev_line'], val_df['next_line'], val_df['target'], val_df['processed_func'], val_df['vul']

from sklearn.metrics import classification_report
func_vul_pred = clf_func.predict(encoder_func.encode(func_code.tolist()))
print('Function-only prediction results:')
print(classification_report(func_vul, func_vul_pred))

Token indices sequence length is longer than the specified maximum sequence length for this model (859 > 512). Running this sequence through the model will result in indexing errors


Function-only prediction results:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      1708
           1       1.00      0.69      0.81       292

    accuracy                           0.95      2000
   macro avg       0.97      0.84      0.89      2000
weighted avg       0.96      0.95      0.95      2000



In [12]:
X_val_encoded = encoder.encode(line_val.tolist(), prev_line_val.tolist(), \
                               next_line_val.tolist(), func_vul_pred.tolist())

print("Accuracy of prediction: " , clf.score(X_val_encoded, y_val))
from sklearn.metrics import classification_report
y_val_pred = clf.predict(X_val_encoded)
print(classification_report(y_val, y_val_pred))

Accuracy of prediction:  0.902
              precision    recall  f1-score   support

           0       1.00      0.91      0.95      1981
           1       0.06      0.58      0.10        19

    accuracy                           0.90      2000
   macro avg       0.53      0.74      0.52      2000
weighted avg       0.99      0.90      0.94      2000
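The report above can be turned back into a confusion matrix, which makes the low precision on class 1 easier to read. The sketch uses only the printed figures (accuracy, supports, recall of class 1) and assumes the rounded recall identifies the true-positive count.

```python
# Figures copied from the report above.
n_total, accuracy = 2000, 0.902
support_neg, support_pos = 1981, 19
recall_pos = 0.58

tp = round(recall_pos * support_pos)        # vulnerable lines found
fn = support_pos - tp                       # vulnerable lines missed
errors = round((1 - accuracy) * n_total)    # misclassified lines in total
fp = errors - fn                            # false alarms
tn = support_neg - fp                       # correctly cleared lines

precision_pos = tp / (tp + fp)
print(tp, fn, fp, tn, round(precision_pos, 3))  # 11 8 188 1793 0.055
```

So about 11 of the 19 vulnerable lines are found, at the cost of roughly 188 false alarms among 1981 clean lines. On this unbalanced sample the 0.90 accuracy is therefore less informative than the per-class precision and recall.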



In [None]:
for line, prediction, label in zip(line_val, y_val_pred, y_val):
    if prediction != label:
        print(line, 'has been classified as', prediction, 'and should be', label)