## ML experimentation
#### Current approach:
- Choice of 3 different encoders (`EncoderCountVectorizer` and two BERT-embedding variants)
- All encoders use CodeBERT as a tokenizer
- line, prev_line & next_line are considered
- MLPClassifier, not optimized yet


#### Open issues:
- Try different ML models
- Include Joern features (possibly; unlikely due to resource constraints)
- Hyperparameter tuning
- ...

_References_:
- https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
- https://github.com/microsoft/CodeBERT

In [1]:
import pandas as pd
import numpy as np
import math
import torch
from tqdm import tqdm
import itertools
from line_encoders import EncoderCountVectorizer, EncoderBERTVectorConcat, EncoderBERTStringConcat

In [2]:
# DO NOT forget 'keep_default_na=False': otherwise pandas parses source lines such as 'NULL' or 'NA' as NaN
df = pd.read_csv('./big-vul_dataset/line_sample_20p_balanced_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)
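
The effect of `keep_default_na=False` can be checked on a tiny in-memory CSV. This is a minimal sketch; the two-row sample is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the first source line is the literal C token NULL.
csv_text = "line,vul\nNULL,1\nreturn 0;,0\n"

# Default behaviour: 'NULL' is on pandas' list of default NA strings.
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the token is kept as an ordinary string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["line"].isna().tolist())
print(kept_df["line"].tolist())
```

Since the encoders receive plain Python lists of strings, a NaN in the `line` column would otherwise reach them as a float.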

In [3]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | func_id | line | vul | prev_line | next_line |
|---|---|---|---|---|---|---|
| 0 | 0 | 185342 | `bool activatable = activatable_ && !hit_test_b...` | 1 | `surface_->GetHitTestBounds() + surface_origin....` | `if (activatable != CanActivate()) {` |
| 1 | 1 | 188415 | `return E_BUFFER_NOT_FULL;` | 1 | `len = 1;` | `}` |
| 2 | 2 | 185284 | `defaultQuirksStyle = RuleSet::create().leakPtr();` | 1 | `defaultPrintStyle = RuleSet::create().leakPtr();` | `}` |
| 3 | 3 | 179389 | `struct ext4_extent *ex1 = NULL;` | 1 | `struct ext4_extent *ex, newex, orig_ex;` | `struct ext4_extent *ex2 = NULL;` |
| 4 | 4 | 185548 | `if (toHTMLElement(this)->highestAncestor() != ...` | 1 | `ASSERT(m_form);` | `setForm(0);` |


## Data preparation section

- Selection of 3 different data encoders
- Create train and test dataset
- Split into batches

### Please select _encoder_ below!
- `encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vec_vocabulary.pkl")`
- `encoder = EncoderBERTVectorConcat()`
- `encoder = EncoderBERTStringConcat()`

In [4]:
### Use a standard vectorization approach by creating a feature matrix.
### Loses semantic and structural information, but it is FAST (much faster than either BERT-embedding encoder)
### Pre-configured to use 750 features 
# encoder = EncoderCountVectorizer(df_data=df, vocabulary_path="vec_vocabulary.pkl")

### Most precise since the embeddings of all 3 features are preserved (i.e. generates a 2304-element vector)
### On average 2x slower than EncoderBERTStringConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

# encoder = EncoderBERTVectorConcat()

### Less precise since all 3 features are concatenated before embedding creation (i.e. generates a 768-element vector)
### On average 2x faster than EncoderBERTVectorConcat
### Set avg_embeddings=False to take the first embedding vector, or True to average all embedding vectors

encoder = EncoderBERTStringConcat(avg_embeddings=False)
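
The difference between the two BERT encoders and the `avg_embeddings` switch can be illustrated with shapes alone. The sketch below does not call CodeBERT: `fake_token_embeddings` is a hypothetical stand-in that returns random 768-dimensional vectors, and the concatenation order (previous line, line, next line) is an assumption rather than the behaviour of `line_encoders`.

```python
import numpy as np

HIDDEN_SIZE = 768  # embedding width implied by the 768 / 2304 figures above


def fake_token_embeddings(text: str) -> np.ndarray:
    """Hypothetical stand-in for a BERT forward pass.

    Returns one random vector per whitespace token plus one leading vector.
    """
    n_tokens = len(text.split()) + 1
    rng = np.random.default_rng(n_tokens)
    return rng.normal(size=(n_tokens, HIDDEN_SIZE))


def pool(token_embeddings: np.ndarray, avg_embeddings: bool) -> np.ndarray:
    # avg_embeddings=True: mean over all token vectors; False: first vector only.
    if avg_embeddings:
        return token_embeddings.mean(axis=0)
    return token_embeddings[0]


def vector_concat(line, prev_line, next_line, avg_embeddings=False):
    # Embed each feature separately, then concatenate the three pooled vectors.
    parts = [pool(fake_token_embeddings(t), avg_embeddings)
             for t in (prev_line, line, next_line)]
    return np.concatenate(parts)


def string_concat(line, prev_line, next_line, avg_embeddings=False):
    # Join the three strings first, then embed once.
    joined = " ".join((prev_line, line, next_line))
    return pool(fake_token_embeddings(joined), avg_embeddings)


v_concat = vector_concat("len = 1;", "int len;", "return len;")
s_concat = string_concat("len = 1;", "int len;", "return len;", avg_embeddings=True)
print(v_concat.shape, s_concat.shape)
```

Vector concatenation needs three forward passes per sample and string concatenation needs one, which fits the speed difference noted in the comments above.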

In [5]:
from sklearn.model_selection import train_test_split

# split into training and validation sets
line_tr, line_test, prev_line_tr, prev_line_test, next_line_tr, next_line_test, y_tr, y_test = \
    train_test_split(df['line'], df['prev_line'], df['next_line'], df['vul'], test_size=0.1, random_state=42)
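
Passing several sequences to one `train_test_split` call shuffles them with the same permutation, so each `line` stays paired with its `prev_line` and label. A minimal check on made-up data (all values below are hypothetical):

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the numeric suffix ties the entries of one row together.
lines = [f"line_{i}" for i in range(20)]
prev_lines = [f"prev_{i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

line_tr, line_te, prev_tr, prev_te, y_tr, y_te = train_test_split(
    lines, prev_lines, labels, test_size=0.25, random_state=42
)

# Every test row should still carry the same suffix in both columns.
aligned = all(l.split("_")[1] == p.split("_")[1] for l, p in zip(line_te, prev_te))
print(len(line_tr), len(line_te), aligned)
```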

In [6]:
# define how big a batch of entries should be (depending on RAM)
batch_size = 100

# number of batches (stored in `epochs`) is calculated from the batch_size
epochs = math.ceil(len(line_tr)/batch_size)

# split the training series (lines and y_tr) into arrays of batches (one entry per batch)
batchesLine = np.array_split(line_tr, epochs)
batchesPrevLine = np.array_split(prev_line_tr, epochs)
batchesNextLine = np.array_split(next_line_tr, epochs)
batchesY = np.array_split(y_tr, epochs)
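
Note that `np.array_split` produces near-equal parts rather than parts of exactly `batch_size` rows. A small sketch with made-up sizes shows the resulting batch lengths:

```python
import math
import numpy as np

n_rows, batch_size = 250, 100  # hypothetical sizes for illustration

# Same formula as above: enough batches that none exceeds batch_size.
n_batches = math.ceil(n_rows / batch_size)

# array_split spreads the remainder over the leading batches.
batches = np.array_split(np.arange(n_rows), n_batches)
sizes = [len(b) for b in batches]
print(n_batches, sizes)  # 3 [84, 83, 83]
```

So `batch_size` acts as an upper bound on memory use per batch, not as the exact batch length.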

## ML Model

- Using train_test_split for creating subsets of data
- Training MLPClassifier with training data
- Evaluating model with test data
- (RandomForestClassifier is included as an alternative model, currently WIP)


In [7]:
# import and initialisation of generic MLPClassifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(activation='relu', alpha=0.05, hidden_layer_sizes=(100,), learning_rate='constant',solver='adam')

# iterate over the batches (a single pass over the training data)
for i in tqdm(range(epochs)):
    # take a batch, encode it, and partial_fit the model on it
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.partial_fit(encodedBatch, Y_batch, classes=np.unique(y_tr))

100%|████████████████████| 148/148 [33:26<00:00, 13.56s/it]
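
The batch-wise training pattern can be tried without the encoder. The sketch below uses random features in place of encoded lines (an assumption for illustration). Each `partial_fit` call performs a single iteration over the batch it receives, and `classes` must list every label up front because a single batch may not contain all of them.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for encoded lines: 300 samples with 8 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 0).astype(int)

n_batches = 3
X_batches = np.array_split(X, n_batches)
y_batches = np.array_split(y, n_batches)

clf = MLPClassifier(hidden_layer_sizes=(16,), random_state=0)
all_classes = np.unique(y)  # full label set, needed on the first partial_fit call

for X_batch, y_batch in zip(X_batches, y_batches):
    # One incremental update per batch; earlier batches are not revisited.
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

preds = clf.predict(X)
print(clf.classes_, preds.shape)
```

Because the loop visits every batch once, the model sees the training data for one pass only; wrapping the loop in an outer loop would give additional passes.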


In [None]:
# RandomForestClassifier (currently WIP)
# It does not provide a partial_fit() method; as a workaround, warm_start=True is used and the tree count is increased after each batch

TREE_INCREASE_EACH_EPOCH = 10

# import and initialisation of a generic RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, warm_start=True)

# iterate over the batches
for i in tqdm(range(epochs)):
    # take a batch, encode it, and fit additional trees on it
    line_batch, prev_line_batch, next_line_batch, Y_batch = batchesLine[i], batchesPrevLine[i], batchesNextLine[i], batchesY[i]
    
    # encode to vector 
    encodedBatch = encoder.encode(line_batch.tolist(), prev_line_batch.tolist(), next_line_batch.tolist())
    
    clf.fit(encodedBatch, Y_batch)
    # increase tree count each epoch
    clf.set_params(n_estimators=len(clf.estimators_)+TREE_INCREASE_EACH_EPOCH)
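
With `warm_start=True`, a later `fit` call keeps the trees already built and trains only the newly requested ones on the data passed to that call. The following is a minimal sketch of the workaround on random features (a made-up stand-in for encoded lines); `TREES_PER_BATCH` is a hypothetical name.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

TREES_PER_BATCH = 10  # hypothetical constant for this sketch

# Hypothetical stand-in for encoded lines, split into 3 batches.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] > 0).astype(int)
X_batches = np.array_split(X, 3)
y_batches = np.array_split(y, 3)

clf = RandomForestClassifier(n_estimators=TREES_PER_BATCH, warm_start=True, random_state=0)

for i, (X_batch, y_batch) in enumerate(zip(X_batches, y_batches)):
    if i > 0:
        # Request more trees before fitting; the existing ones are left unchanged.
        clf.set_params(n_estimators=clf.n_estimators + TREES_PER_BATCH)
    clf.fit(X_batch, y_batch)

print(len(clf.estimators_))  # 30
```

Each tree has therefore seen exactly one batch, which differs from a forest trained on the full dataset, and every batch should contain all classes.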

In [9]:
X_test_encoded = encoder.encode(line_test.tolist(), prev_line_test.tolist(), next_line_test.tolist())

print("Accuracy of prediction: " , clf.score(X_test_encoded, y_test))
from sklearn.metrics import classification_report
y_pred = clf.predict(X_test_encoded)
print(classification_report(y_test, y_pred))

Accuracy of prediction:  0.63003663003663
              precision    recall  f1-score   support

           0       0.62      0.62      0.62       803
           1       0.64      0.64      0.64       835

    accuracy                           0.63      1638
   macro avg       0.63      0.63      0.63      1638
weighted avg       0.63      0.63      0.63      1638
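
For reference, the per-class figures in such a report follow from the counts of true positives, false positives, and false negatives. A small sketch in plain Python on made-up labels (`binary_report` is a hypothetical helper, not part of scikit-learn):

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]
report = binary_report(y_true_demo, y_pred_demo)
print(report)
```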



## Model persistence
Selectively execute when needed!

In [8]:
from joblib import dump
# Store model
dump(clf, 'models/mlp_BERTStringConcat_firstEmbedding.model')

['models/mlp_BERTStringConcat_firstEmbedding.model']

In [None]:
from joblib import load
# Load model (adjust the path to the file written by dump)
clf = load('mlp.model')
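
A quick round trip can confirm that a stored model reproduces the predictions of the original. This is a minimal sketch on random features; the file name `demo.model` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for encoded lines.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

clf_demo = MLPClassifier(hidden_layer_sizes=(8,), random_state=0)
clf_demo.partial_fit(X, y, classes=np.unique(y))

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo.model")  # hypothetical file name
    dump(clf_demo, model_path)       # serialise the fitted estimator
    restored = load(model_path)      # read it back from the same path

same_predictions = bool(np.array_equal(clf_demo.predict(X), restored.predict(X)))
print(same_predictions)
```

The encoder is not part of the stored object, so a loaded model has to be paired with the same encoder configuration that was used during training.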

## Hyperparameter tuning

<span style="color:red">Currently not supported due to BERT & prev/next line inclusion.</span>

In [None]:
mlp = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}


from sklearn.model_selection import GridSearchCV

gridclf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3)
gridclf.fit(X_train, y_train)


# Best parameter set
print('Best parameters found:\n', gridclf.best_params_)

# All results
means = gridclf.cv_results_['mean_test_score']
stds = gridclf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, gridclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
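
One possible way to make the grid search usable again would be to encode the training data once and run `GridSearchCV` on the resulting feature matrix. The sketch below assumes that the encoded matrix fits in memory and uses random features in its place; `X_train_encoded` and the reduced grid are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in for the output of encoder.encode(...) on the training split.
rng = np.random.default_rng(0)
X_train_encoded = rng.normal(size=(120, 8))
y_train_demo = (X_train_encoded[:, 0] > 0).astype(int)

# Deliberately small grid so the sketch runs quickly.
param_grid = {
    "hidden_layer_sizes": [(8,), (16,)],
    "alpha": [0.0001, 0.05],
}

search = GridSearchCV(
    MLPClassifier(max_iter=50, random_state=0), param_grid, cv=3, n_jobs=1
)
search.fit(X_train_encoded, y_train_demo)
print(search.best_params_)
```

Encoding once up front means the expensive BERT pass is not repeated for every parameter combination and fold.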



In [None]:
from sklearn.ensemble import RandomForestClassifier
rndclf = RandomForestClassifier(max_depth=2, random_state=0)
rndclf.fit(X_train, y_train)
rndclf.score(X_test, y_test)

y_pred = rndclf.predict(X_test)
print(classification_report(y_test, y_pred))



## Validation section

- Read unbalanced dataset
- Encode using selected encoder
- Evaluate model

In [None]:
# DO NOT forget 'keep_default_na=False': otherwise pandas parses source lines such as 'NULL' or 'NA' as NaN
val_df = pd.read_csv('./big-vul_dataset/line_sample_10p_original_ratio.csv', skipinitialspace=True, low_memory=True, keep_default_na=False)

In [None]:
# use the validation dataframe (val_df), not the training dataframe (df)
line_val, prev_line_val, next_line_val, y_val = val_df['line'], val_df['prev_line'], val_df['next_line'], val_df['vul']

X_val_encoded = encoder.encode(line_val.tolist(), prev_line_val.tolist(), next_line_val.tolist())
print("Accuracy of prediction: " , clf.score(X_val_encoded, y_val))
y_val_pred = clf.predict(X_val_encoded)
print(classification_report(y_val, y_val_pred))
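
On an unbalanced dataset, accuracy alone can be misleading, which is why the per-class report matters here. A small sketch with a made-up 5% share of vulnerable lines (the ratio is an assumption, not the ratio of the validation file):

```python
# Hypothetical validation labels: 5 vulnerable lines out of 100.
y_val_demo = [1] * 5 + [0] * 95

# A trivial baseline that never flags a line as vulnerable.
y_pred_demo = [0] * 100

accuracy = sum(t == p for t, p in zip(y_val_demo, y_pred_demo)) / len(y_val_demo)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_val_demo, y_pred_demo))
recall_vulnerable = true_positives / sum(y_val_demo)

print(accuracy, recall_vulnerable)  # 0.95 0.0
```

The baseline reaches high accuracy while finding no vulnerable line at all, so recall and precision for class 1 are the more informative figures in this section.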