## ML experimentation
#### Current approach:
- Use CodeBERT to tokenize all lines
- Find the 250 most frequently occurring tokens and save them as the vocabulary
- Every dataset (training, test, validation) is vectorized based on the vocabulary
- Include matrix of line & prev_line as features
- MLPClassifier, not optimized yet
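
The vocabulary and vectorization steps listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: a plain whitespace split stands in for the CodeBERT tokenizer, and the helper names `tokenize`, `build_vocabulary` and `vectorize` are hypothetical.

```python
from collections import Counter

def tokenize(line):
    # Stand-in tokenizer (assumption): the notebook uses the CodeBERT tokenizer instead
    return line.split()

def build_vocabulary(lines, size=250):
    # Count every token over all lines and keep the `size` most frequent ones
    counts = Counter(tok for line in lines for tok in tokenize(line))
    return [tok for tok, _ in counts.most_common(size)]

def vectorize(line, vocabulary):
    # One column per vocabulary token, holding that token's count in the line
    counts = Counter(tokenize(line))
    return [counts[tok] for tok in vocabulary]

train_lines = ["if ( x ) {", "return x ;", "if ( y ) {"]
vocab = build_vocabulary(train_lines, size=4)
row = vectorize("if ( x ) { x }", vocab)
```

Because the vocabulary is built once from the training lines, every dataset vectorized with it ends up with the same columns. Tokens outside the vocabulary are simply ignored.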


#### Open issues:
- Create embeddings with CodeBERT (?)
- Include next_line
- Include joern-features
- Hyperparameter tuning
- ...

_References_:
- https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
- https://github.com/microsoft/CodeBERT

In [25]:
import pandas as pd
import numpy as np
import math
from transformers import AutoTokenizer, AutoModel
import torch
from tqdm import tqdm
import itertools 

In [2]:
# Init CodeBERT
tokenizer_bert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model_bert = AutoModel.from_pretrained("microsoft/codebert-base")

In [3]:
# keep_default_na=False is required: otherwise pandas reads source lines such as 'NULL' or empty strings as NaN
df = pd.read_csv('./big-vul_dataset/line_sample_20p_balanced_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)

In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,func_id,line,vul,prev_line,next_line
0,0,185342,bool activatable = activatable_ && !hit_test_b...,1,surface_->GetHitTestBounds() + surface_origin....,if (activatable != CanActivate()) {
1,1,188415,return E_BUFFER_NOT_FULL;,1,len = 1;,}
2,2,185284,defaultQuirksStyle = RuleSet::create().leakPtr();,1,defaultPrintStyle = RuleSet::create().leakPtr();,}
3,3,179389,struct ext4_extent *ex1 = NULL;,1,"struct ext4_extent *ex, newex, orig_ex;",struct ext4_extent *ex2 = NULL;
4,4,185548,if (toHTMLElement(this)->highestAncestor() != ...,1,ASSERT(m_form);,setForm(0);


## Data preparation section

- Create the vocabulary for the matrix
- All datasets will have the contents of this vocabulary as columns 
- Create the matrix for train and test dataset
- TODO: next_line, other features

In [29]:
# Condense the token embeddings of one input into a single vector using element-wise averaging
def condenseEmbeddings(context_embeddings):
    # context_embeddings has shape (1, num_tokens, 768); index 0 drops the batch dimension
    np_array = context_embeddings.detach().numpy()
    avg_embedding = np.mean(np_array[0], axis=0)
    return avg_embedding
    

# Return ONE vector per input string
# This vector is created by element-wise averaging of all token vectors of a string 
def bert_encode(list_of_strings):
    embeddings = []
    for s in list_of_strings:
        # tokenize
        code_tokens=tokenizer_bert.tokenize(s)
        # add special tokens
        tokens=[tokenizer_bert.cls_token]+code_tokens+[tokenizer_bert.sep_token]
        # convert to IDs
        tokens_ids=tokenizer_bert.convert_tokens_to_ids(tokens)
        # create embedding (inference only, so gradient tracking is disabled)
        with torch.no_grad():
            context_embeddings=model_bert(torch.tensor(tokens_ids)[None,:])[0]
        # condense embedding to a single vector of fixed size
        embeddings.append(condenseEmbeddings(context_embeddings))
 
    return embeddings

def encode(list_lines, list_prev_lines, list_next_lines):
    # use the function arguments, not the global batch variables of the training loop
    encoded_lines = bert_encode(list_lines)
    encoded_prev_lines = bert_encode(list_prev_lines)
    encoded_next_lines = bert_encode(list_next_lines)
    
    # concatenate the three 768-element vectors into one 2304-element vector per sample
    return [np.concatenate((l, p, n)) for l, p, n in zip(encoded_lines, encoded_prev_lines, encoded_next_lines)]
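
The shapes involved in the pooling and concatenation above can be checked without loading the model. The sketch below uses random arrays as stand-ins for the CodeBERT output (an assumption made purely for illustration), and `mean_pool` is a hypothetical name for the averaging step.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pool(token_embeddings):
    # token_embeddings: (1, num_tokens, 768), as returned for a single input string
    return token_embeddings[0].mean(axis=0)

# Stand-in model outputs for line, prev_line and next_line with different token counts
line_vec = mean_pool(rng.normal(size=(1, 12, 768)))
prev_vec = mean_pool(rng.normal(size=(1, 5, 768)))
next_vec = mean_pool(rng.normal(size=(1, 9, 768)))

features = np.concatenate((line_vec, prev_vec, next_vec))
print(line_vec.shape, features.shape)  # (768,) (2304,)
```

Averaging over the token axis makes the vector length independent of the number of tokens, which is why lines of different length yield feature rows of equal width.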
    

In [30]:
from sklearn.model_selection import train_test_split

# split into training and test sets (90% / 10%)
line_tr, line_test, prev_line_tr, prev_line_test, next_line_tr, next_line_test, y_tr, y_test = \
    train_test_split(df['line'], df['prev_line'], df['next_line'], df['vul'], test_size=0.1, random_state=42)

In [31]:
# define how big a batch of entries should be (depending on RAM)
batch_size = 100

# number of batches, calculated from the batch_size (despite the name 'epochs', this is one pass over the data)
epochs = math.ceil(len(line_tr)/batch_size)

# split each series (lines, prev/next lines, labels) into the same number of batches
batchesLine = np.array_split(line_tr, epochs)
batchesPrevLine = np.array_split(prev_line_tr, epochs)
batchesNextLine = np.array_split(next_line_tr, epochs)
batchesY = np.array_split(y_tr, epochs)
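
How `np.array_split` distributes the rows is easy to verify on a small example. The sketch below applies the same `batch_size` logic to a toy array of 250 entries (hypothetical data, standing in for the training lines).

```python
import math
import numpy as np

batch_size = 100
samples = np.arange(250)  # toy stand-in for the training lines

n_batches = math.ceil(len(samples) / batch_size)
batches = np.array_split(samples, n_batches)

print(n_batches, [len(b) for b in batches])  # 3 [84, 83, 83]
```

The pieces differ in size by at most one and none exceeds `batch_size`. Since every series is split into the same number of pieces, batch `i` of the lines stays aligned with batch `i` of the labels.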

## ML Model

- Using train_test_split for creating subsets of data
- Training MLPClassifier with training data
- Evaluating model with test data
- (Include RandomForestClassifier as one alternative option for model)


In [None]:
# import and initialisation of generic MLPClassifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(activation='relu', alpha=0.05, hidden_layer_sizes=(100,), learning_rate='constant',solver='adam')

# iterate over the batches ('epochs' holds the number of batches, not training epochs)
for i in tqdm(range(epochs)):
    # take one batch, encode it, and partial_fit the model on it
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to 2304 element vector 
    encodedBatch = encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))

  0%|                              | 0/148 [00:00<?, ?it/s]

In [None]:
X_test_encoded = encode(line_test.tolist(), prev_line_test.tolist(), next_line_test.tolist())

print("Accuracy of prediction: " , clf.score(X_test_encoded, y_test))
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))

## Hyperparameter tuning

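Currently not supported: the grid search below expects precomputed `X_train` and `y_train` matrices, which the current batch-wise CodeBERT encoding with prev/next line features does not produce.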

In [17]:
mlp = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}


from sklearn.model_selection import GridSearchCV

gridclf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
gridclf.fit(X_train, y_train)


# Best parameter set
print('Best parameters found:\n', gridclf.best_params_)

# All results
means = gridclf.cv_results_['mean_test_score']
stds = gridclf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gridclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))



Best parameters found:
 {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.643 (+/-0.013) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.628 (+/-0.007) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'adam'}
0.639 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.622 (+/-0.016) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.641 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.624 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'adam'



In [102]:
from sklearn.ensemble import RandomForestClassifier
rndclf = RandomForestClassifier(max_depth=2, random_state=0)
rndclf.fit(X_train, y_train)
rndclf.score(X_test, y_test)

y_pred = rndclf.predict(X_test)
print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.61      0.63      0.62       843
           1       0.59      0.58      0.58       795

    accuracy                           0.60      1638
   macro avg       0.60      0.60      0.60      1638
weighted avg       0.60      0.60      0.60      1638



## Validation section

- Read unbalanced dataset
- Create the matrix with the tokens defined in the vocabulary at the beginning
- Evaluate model
- TODO: include prev_line in this section

In [88]:
# keep_default_na=False is required: otherwise pandas reads source lines such as 'NULL' or empty strings as NaN
val_df = pd.read_csv('./big-vul_dataset/line_sample_10p_original_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)

Unnamed: 0,vul,"""",""");",""",",#,%,&,(,"(""",(&,...,Ġsize,Ġsizeof,Ġthe,Ġto,Ġwe,Ġx,Ġy,Ġ{,Ġ|,Ġ||
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,0


In [89]:
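# use the unbalanced validation dataframe (val_df), not the balanced training dataframe (df)
line_val, prev_line_val, next_line_val, y_val = val_df['line'], val_df['prev_line'], val_df['next_line'], val_df['vul']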

X_val_encoded = encode(line_val.tolist(), prev_line_val.tolist(), next_line_val.tolist())
print("Accuracy of prediction: " , clf.score(X_val_encoded, y_val))
y_val_pred = clf.predict(X_val_encoded)
print(classification_report(y_val, y_val_pred))

Accuracy of prediction:  0.6584071111744483
              precision    recall  f1-score   support

           0       1.00      0.66      0.79    500599
           1       0.02      0.68      0.04      4634

    accuracy                           0.66    505233
   macro avg       0.51      0.67      0.41    505233
weighted avg       0.99      0.66      0.79    505233
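
The low precision for class 1 follows directly from the class imbalance. As a rough check, the counts can be reconstructed from the rounded recall and support values in the report above, so the result is approximate.

```python
# Support and (rounded) recall values taken from the classification report
support_neg, support_pos = 500599, 4634
recall_neg, recall_pos = 0.66, 0.68

true_pos = recall_pos * support_pos          # vulnerable lines flagged correctly
false_pos = (1 - recall_neg) * support_neg   # clean lines flagged as vulnerable

precision_pos = true_pos / (true_pos + false_pos)
print(round(precision_pos, 2))  # 0.02
```

With fewer than 1% vulnerable lines, a 34% false-positive rate on the clean lines produces roughly 54 false alarms per correctly flagged line. This is why accuracy and weighted averages say little on this dataset.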

