## ML experimentation
#### Current approach:
- Selection of 3 different encoders (TokenVectorizer, 2x BERT embeddings)
- All encoders use the CodeBERT tokenizer
- Each sample consists of the `line` itself plus its `prev_line` and `next_line` as context
- MLPClassifier, not optimized yet
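
The notebook only loads triples that are already prepared, so the following is a minimal sketch of how `(line, prev_line, next_line)` samples could be derived from a function's source. The helper name `context_triples` and the use of an empty string for a missing neighbour are assumptions for illustration, not the dataset's actual preprocessing.

```python
def context_triples(function_source):
    """Return (line, prev_line, next_line) for every non-empty line (hypothetical helper)."""
    # Strip indentation and drop blank lines before pairing neighbours.
    lines = [ln.strip() for ln in function_source.splitlines() if ln.strip()]
    triples = []
    for i, line in enumerate(lines):
        prev_line = lines[i - 1] if i > 0 else ""               # assumption: "" at the start
        next_line = lines[i + 1] if i + 1 < len(lines) else ""  # assumption: "" at the end
        triples.append((line, prev_line, next_line))
    return triples


source = """
int add(int a, int b) {
    int r = a + b;
    return r;
}
"""
triples = context_triples(source)
for triple in triples:
    print(triple)
```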


#### Open issues:
- Try different ML models
- Include joern-features (maybe, but unlikely due to resource constraints)
- Hyperparameter tuning
- ...

_References_:
- https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
- https://github.com/microsoft/CodeBERT

In [13]:
import pandas as pd
import numpy as np
import math
import torch
from tqdm import tqdm
import itertools
from line_encoders import EncoderCountVectorizer, EncoderTFIDFVectorizer, EncoderBERTVectorConcat, EncoderBERTStringConcat

In [14]:
# DO NOT forget 'keep_default_na=False' --> otherwise code lines such as '', 'NULL' or 'NA' are read as NaN
df = pd.read_csv('./big-vul_dataset/line_sample_20p_10_90_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)
#df = pd.read_csv('./big-vul_dataset/line_sample_all_balanced_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)
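
A small sketch of why `keep_default_na=False` matters for source-code data: with pandas defaults, cells whose text is exactly `NULL` or `nan` are parsed as missing values. The in-memory CSV below is made up for illustration.

```python
import io

import pandas as pd

# Made-up sample: two of the three "line" cells are tokens pandas treats as NA by default.
csv_text = "line,vul\nNULL,0\nnan,1\nreturn NULL;,1\n"

default_df = pd.read_csv(io.StringIO(csv_text))
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

missing_default = int(default_df["line"].isna().sum())
missing_kept = int(kept_df["line"].isna().sum())
print("NaN cells with defaults:", missing_default)
print("NaN cells with keep_default_na=False:", missing_kept)
```

With the defaults, the encoders would receive float NaN values instead of strings for such lines.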

In [15]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | func_id | line | vul | prev_line | next_line |
|---|---|---|---|---|---|---|
| 0 | 0 | 180319 | `pkt_len = parse_netscreen_rec_hdr(phdr, line, ...` | 1 | `}` | `cap_dst, err, err_info);` |
| 1 | 1 | 188617 | `const int b = b1 + (((b2 - b1) * xoff + 8) >> 4);` | 1 | `const int a = a1 + (((a2 - a1) * xoff + 8) >> 4);` | `const int r = a + (((b - a) * yoff + 8) >> 4);` |
| 2 | 2 | 186397 | `FetchHistogramsFromChildProcesses();` | 1 | `RunUntilInputProcessed(GetWidgetHost());` | `const std::string scroll_types[] = {"ScrollBeg...` |
| 3 | 3 | 183791 | `EXPECT_CALL(*gl_, BindBuffer(GL_ARRAY_BUFFER, 0))` | 1 | `.RetiresOnSaturation();` | `.Times(1)` |
| 4 | 4 | 180475 | `child = new_item;` | 1 | `new_item->prev = child;` | `if ( ! ( value = skip( parse_string( child, sk...` |


## Data preparation section

- Selection of 3 different data encoders
- Create train and test dataset
- Split into batches

### Please select _encoder_ below!
- `encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vec_vocabulary.pkl")`
- `encoder = EncoderBERTVectorConcat()`
- `encoder = EncoderBERTStringConcat()`

In [16]:
### Use a standard vectorization approach by creating a feature matrix.
### Loses semantic and structural information, but is much faster than either BERT-based encoder
### Pre-configured to use 750 features
#encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vec_vocabulary.pkl")
encoder = EncoderTFIDFVectorizer(df_data=df)
#encoder = EncoderCountVectorizer(df_data=df)

### Most precise since the embeddings of all 3 features are preserved (i.e. generates a 2304-element vector)
### On average 2x slower than EncoderBERTStringConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

# encoder = EncoderBERTVectorConcat(avg_embeddings=False)

### Less precise since all 3 features are concatenated before embedding creation (i.e. generates a 768-element vector)
### On average 2x faster than EncoderBERTVectorConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

# encoder = EncoderBERTStringConcat()
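
The 2304 vs. 768 figures in the comments above can be illustrated with a shape-only sketch. Random arrays stand in for the per-token embeddings a BERT-style model would return, and the `pool` helper is a hypothetical reading of the `avg_embeddings` flag, not the code in `line_encoders`.

```python
import numpy as np

HIDDEN = 768  # embedding width, matching the 768/2304 figures in the comments above
rng = np.random.default_rng(0)


def pool(token_embeddings, avg_embeddings):
    """Reduce (num_tokens, HIDDEN) to (HIDDEN,): mean over tokens or the first token."""
    if avg_embeddings:
        return token_embeddings.mean(axis=0)
    return token_embeddings[0]


# Stand-ins for model outputs; the token counts are arbitrary.
line_tok = rng.normal(size=(6, HIDDEN))
prev_tok = rng.normal(size=(4, HIDDEN))
next_tok = rng.normal(size=(3, HIDDEN))

# Vector concat: embed each input separately, then concatenate the three vectors.
vector_concat = np.concatenate([pool(t, False) for t in (line_tok, prev_tok, next_tok)])

# String concat: the three strings are joined first, so only one sequence is embedded.
joint_tok = rng.normal(size=(13, HIDDEN))
string_concat = pool(joint_tok, True)

print(vector_concat.shape, string_concat.shape)
```

The vector-concat variant runs the model three times per sample, which would account for it being slower than the string-concat variant.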

In [17]:
test = encoder.encode(["int x = 5"], ["function(){"], ["}"])
test

Unnamed: 0,!,"""",""")",""");",""",",""";",#,%,&,',...,Ġv_next,Ġvalue_next,Ġvoid_next,Ġw_next,Ġx_next,Ġy_next,Ġ{_next,Ġ|_next,Ġ||_next,Ġ~_next
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
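
The `_next` suffix on the column names in the output above suggests one block of token features per input. The toy function below sketches that layout under two assumptions: a whitespace split replaces the CodeBERT tokenizer, and the previous-line block uses a `_prev` suffix. It is not the implementation in `line_encoders`.

```python
from collections import Counter


def toy_encode(lines, prev_lines, next_lines, vocabulary):
    """Hypothetical stand-in for encoder.encode(): one count block per input string."""
    rows = []
    for line, prev_line, next_line in zip(lines, prev_lines, next_lines):
        row = {}
        for text, suffix in ((line, ""), (prev_line, "_prev"), (next_line, "_next")):
            counts = Counter(text.split())  # whitespace split, NOT the CodeBERT tokenizer
            for token in vocabulary:
                row[token + suffix] = counts[token]
        rows.append(row)
    return rows


vocab = ["int", "x", "=", "}"]
features = toy_encode(["int x = 5"], ["function(){"], ["}"], vocab)
print(features[0])
```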


In [18]:
from sklearn.model_selection import train_test_split

# split into training and validation sets
line_tr, line_test, prev_line_tr, prev_line_test, next_line_tr, next_line_test, y_tr, y_test = \
    train_test_split(df['line'], df['prev_line'], df['next_line'], df['vul'], test_size=0.2, random_state=42)

In [19]:
# define how big a batch of entries should be (depending on RAM)
batch_size = 100

# number of batches (stored as 'epochs') is calculated based on the batch_size
epochs = math.ceil(len(line_tr)/batch_size)

# split the dataframes (X_tr, y_tr) into an array of dataframes (number of epochs)
batchesLine = np.array_split(line_tr, epochs)
batchesPrevLine = np.array_split(prev_line_tr, epochs)
batchesNextLine = np.array_split(next_line_tr, epochs)
batchesY = np.array_split(y_tr, epochs)
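
A short sketch of how the batching above behaves on a small stand-in array. Note that the value the notebook calls `epochs` counts batches within a single pass over the training data, and that `np.array_split` produces nearly equal chunks rather than chunks of exactly `batch_size`.

```python
import math

import numpy as np

batch_size = 100
samples = np.arange(250)  # stand-in for line_tr

num_batches = math.ceil(len(samples) / batch_size)
batches = np.array_split(samples, num_batches)
sizes = [len(batch) for batch in batches]
print(num_batches, sizes)  # 3 [84, 83, 83]
```

`batch_size` therefore acts as an upper bound on the chunk size, not as the exact size of every batch.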

## ML Model

- Using train_test_split for creating subsets of data
- Training MLPClassifier with training data
- Evaluating model with test data
- (Include RandomForestClassifier as one alternative option for model)


In [20]:
# import and initialisation of generic MLPClassifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(activation='relu', alpha=0.05, hidden_layer_sizes=(1500,750), learning_rate='adaptive',solver='adam', shuffle=True)

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))

100%|██████████| 721/721 [03:51<00:00,  3.11it/s]


In [49]:
# RandomForestClassifier (Currently WIP)
# Does not provide a partial_fit() method, therefore workaround by increasing tree count each epoch

TREE_INCREASE_EACH_EPOCH = 10

# import and initialisation of RandomForestClassifier (warm_start=True keeps already fitted trees)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, warm_start=True)

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch, encode it and fit the newly added trees on it
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.fit(encodedBatch, Y_batch)
    # increase tree count each epoch
    clf.set_params(n_estimators=len(clf.estimators_)+TREE_INCREASE_EACH_EPOCH)

100%|██████████| 132/132 [00:05<00:00, 25.88it/s]
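
The warm-start workaround can be reproduced on synthetic data to see how the forest grows. This is a sketch with made-up batches and a smaller tree count; with `warm_start=True`, trees that are already fitted are kept, so each tree only ever sees the batch it was grown on, which differs from true incremental learning via `partial_fit()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

TREES_PER_BATCH = 10
rng = np.random.default_rng(0)


def make_batch(n=60):
    """Synthetic two-class batch; both classes are present in every batch."""
    y = np.arange(n) % 2
    X = rng.normal(size=(n, 5)) + y[:, None]
    return X, y


clf = RandomForestClassifier(n_estimators=TREES_PER_BATCH, warm_start=True, random_state=0)
tree_counts = []
for i in range(3):
    X_batch, y_batch = make_batch()
    if i > 0:
        # Grow the forest first, so the additional trees are trained on this batch.
        clf.set_params(n_estimators=len(clf.estimators_) + TREES_PER_BATCH)
    clf.fit(X_batch, y_batch)
    tree_counts.append(len(clf.estimators_))

print(tree_counts)  # [10, 20, 30]
```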


In [17]:
# Naive Bayes approach

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))



100%|██████████| 148/148 [00:14<00:00, 10.42it/s]


In [23]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()

# iterate over the number of epochs
for i in tqdm(range(epochs)):
    # take a batch and process it and partial_fit the model to the batch
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))



100%|██████████| 148/148 [00:09<00:00, 15.48it/s]
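
The `partial_fit()` cells above repeat the same loop, so it could be factored into one helper. The sketch below assumes only that the classifier offers `partial_fit(X, y, classes=...)` and the encoder offers `encode(lines, prev_lines, next_lines)`. The helper name and both stub classes are hypothetical and exist only to make the example runnable without the real encoders.

```python
import numpy as np


def fit_in_batches(clf, encoder, line_batches, prev_batches, next_batches, y_batches, classes):
    """Hypothetical helper: encode each batch and pass it to clf.partial_fit()."""
    for lines, prevs, nexts, y in zip(line_batches, prev_batches, next_batches, y_batches):
        X = encoder.encode(list(lines), list(prevs), list(nexts))
        clf.partial_fit(X, y, classes=classes)
    return clf


class LengthEncoder:
    """Stub encoder: three string-length features per sample."""

    def encode(self, lines, prev_lines, next_lines):
        return np.array([[len(a), len(b), len(c)]
                         for a, b, c in zip(lines, prev_lines, next_lines)])


class CountingClassifier:
    """Stub estimator that only records what partial_fit() received."""

    def __init__(self):
        self.calls = 0
        self.samples_seen = 0

    def partial_fit(self, X, y, classes=None):
        self.calls += 1
        self.samples_seen += len(X)
        return self


demo_clf = fit_in_batches(
    CountingClassifier(), LengthEncoder(),
    line_batches=[["a;", "b;"], ["c;"]],
    prev_batches=[["{", "a;"], ["b;"]],
    next_batches=[["b;", "c;"], ["}"]],
    y_batches=[[0, 1], [0]],
    classes=np.array([0, 1]),
)
print(demo_clf.calls, demo_clf.samples_seen)  # 2 3
```

In the notebook, the same call would take `batchesLine`, `batchesPrevLine`, `batchesNextLine`, `batchesY` and `np.unique(y_tr)` as arguments.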


In [21]:
X_test_encoded = encoder.encode(line_test.tolist(), prev_line_test.tolist(), next_line_test.tolist())

print("Accuracy of prediction: " , clf.score(X_test_encoded, y_test))
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))

Accuracy of prediction:  0.9169071936056838
              precision    recall  f1-score   support

           0       0.92      1.00      0.96     16414
           1       0.76      0.10      0.17      1602

    accuracy                           0.92     18016
   macro avg       0.84      0.55      0.56     18016
weighted avg       0.90      0.92      0.89     18016
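
Because only about 9% of the test lines are labelled vulnerable, the accuracy figure is easier to interpret next to a majority-class baseline. The sketch below uses only the support counts and the accuracy from the report above; the helper name is hypothetical.

```python
def majority_baseline_accuracy(support_per_class):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(support_per_class.values())
    return max(support_per_class.values()) / total


test_support = {0: 16414, 1: 1602}  # support column of the report above
model_accuracy = 0.9169             # accuracy printed above, rounded

baseline = majority_baseline_accuracy(test_support)
print(f"majority-class baseline: {baseline:.4f}")
print(f"model accuracy:          {model_accuracy:.4f}")
```

The baseline is about 0.911, so the reported accuracy is only slightly above it. The recall of 0.10 for class 1 is the more informative number here.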



## Model persistence
Selectively execute when needed!

In [None]:
from joblib import dump
# Store model
dump(clf, 'models/mlp_BERTVectorConcat_firstEmbedding.model')

In [None]:
from joblib import load
# Load model
clf = load('mlp.model') 
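
The store and load cells above use different file names, so a model is only restored if the same path is passed to both calls. A minimal round-trip sketch follows, assuming a single path variable is shared. The `save_model` helper, the placeholder object and the file name are hypothetical, and a temporary directory is used so nothing is written to the project.

```python
import tempfile
from pathlib import Path

from joblib import dump, load


def save_model(model, path):
    """Hypothetical helper: create the target directory if needed, then store the model."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    dump(model, str(path))
    return path


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "models" / "example.model"  # hypothetical file name
    placeholder_model = {"kind": "placeholder", "weights": [0.1, 0.2]}  # stands in for clf
    save_model(placeholder_model, model_path)
    restored = load(str(model_path))  # same path for dump and load

print(restored == placeholder_model)
```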

## Hyperparameter tuning

<span style="color:red">Currently not _really_ supported due to missing batch support. <br>i.e. only works for small data samples and _EncoderCountVectorizer_</span> 

In [None]:
mlp = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

# TODO: add batch support to allow for larger data and even alternative encoders

encoded_X_tr = encoder.encode(line_tr.tolist(), prev_line_tr.tolist(), next_line_tr.tolist())

from sklearn.model_selection import GridSearchCV

gridclf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
gridclf.fit(encoded_X_tr, y_tr)


# Best parameter set
print('Best parameters found:\n', gridclf.best_params_)

# All results
means = gridclf.cv_results_['mean_test_score']
stds = gridclf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gridclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
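
To estimate the cost of the grid search above before running it, the number of model fits can be counted from the parameter grid. This sketch copies `parameter_space` and `cv=3` from the cell above and relies on `GridSearchCV` trying every combination once per fold.

```python
import itertools

parameter_space = {
    'hidden_layer_sizes': [(50, 50, 50), (50, 100, 50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}
cv = 3

# Every combination of one value per parameter is one candidate model.
combinations = list(itertools.product(*parameter_space.values()))
total_fits = len(combinations) * cv  # plus one final refit on the full data (refit=True default)

print(len(combinations), "candidates,", total_fits, "cross-validation fits")
```

Each of these fits trains an `MLPClassifier` on the fully encoded training set, which is consistent with the note that the search currently only works for small samples.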



In [None]:
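from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Shallow random forest on the encoded data
# (reuses encoded_X_tr from the grid-search cell and X_test_encoded from the evaluation cell)
rndclf = RandomForestClassifier(max_depth=2, random_state=0)
rndclf.fit(encoded_X_tr, y_tr)
print(rndclf.score(X_test_encoded, y_test))

y_pred = rndclf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))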



## Validation section

- Read unbalanced dataset
- Encode using selected encoder
- Evaluate model

In [22]:
# DO NOT forget 'keep_default_na=False' --> otherwise code lines such as '', 'NULL' or 'NA' are read as NaN
val_df = pd.read_csv('./big-vul_dataset/line_sample_1p_original_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)

In [23]:
line_val, prev_line_val, next_line_val, y_val = val_df['line'], val_df['prev_line'], val_df['next_line'], val_df['vul']

X_val_encoded = encoder.encode(line_val.tolist(), prev_line_val.tolist(), next_line_val.tolist())
print("Accuracy of prediction: " , clf.score(X_val_encoded, y_val))
y_val_pred = clf.predict(X_val_encoded)
print(classification_report(y_val, y_val_pred))

Accuracy of prediction:  0.9887035584850475
              precision    recall  f1-score   support

           0       0.99      1.00      0.99     46774
           1       0.21      0.11      0.15       409

    accuracy                           0.99     47183
   macro avg       0.60      0.55      0.57     47183
weighted avg       0.99      0.99      0.99     47183



In [24]:
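# print every misclassified validation line ('code_line' avoids shadowing the built-in input())
for code_line, prediction, label in zip(line_val, y_val_pred, y_val):
    if prediction != label:
        print(code_line, 'has been classified as ', prediction, 'and should be ', label)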

content::BrowserContext::GetStoragePartitionForSite(profile, site)-> has been classified as  0 and should be  1
return equalIgnoringCase(getAttribute(attributeName), "true"); has been classified as  0 and should be  1
#else has been classified as  0 and should be  1
{ has been classified as  0 and should be  1
if (tab_permissions && has been classified as  0 and should be  1
if(context->curY >= p->height) { has been classified as  0 and should be  1
case 16: has been classified as  0 and should be  1
valuelen, has been classified as  0 and should be  1
} has been classified as  0 and should be  1
/* XXX What should be done if we fail here? * has been classified as  0 and should be  1
const int b1 = ref[(w + 1) * (y + 1) + x + 0]; has been classified as  0 and should be  1
if (*rsize >= 74 && has been classified as  0 and should be  1
pCluster = m_pSegment->GetNext(pCluster); has been classified as  0 and should be  1
default: has been classified as  0 and should be  1
channel_->rendere