# Bug Detection

This notebook presents the baseline model for the bug detection project: the simplest model we can build to approach the problem. The main idea is to predict, for each token of code, whether it is buggy or not. Rather than a binary label, however, we use multi-class labels, where each buggy token is assigned the actual error class that it belongs to.

## Imports

In [1]:
import pickle
import codenet

import numpy as np

from gensim.models import FastText

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Data

In this section we load the data generated from the CodeNet subset we explored previously. The training data consists of code tokens as input and their respective error classes as targets. From the data exploration section we know that an error can be caused by a sequence of tokens, or by a whole instruction. To label each token, we create an array of labels with one entry per token, initialized to "Accepted". Then, for each error, every token that falls inside its range is marked with that error class. In the end, the data is a sequence of (token, error class) pairs, and we want the model to learn which token corresponds to which error class. The limitation of this method is that it ignores context: each token is treated as independent of all the others.

In [2]:
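def convert_X_y(X, y):
    """Expand each sample's error ranges into one label per token."""
    target_y = []

    for x, errs in zip(X, y):
        # Object dtype so labels are not truncated to the width of the token dtype.
        new_y = np.full(len(x), "Accepted", dtype=object)
        for i1, i2, err in errs:
            # Label the half-open range [i1, i2); an empty range still marks token i1.
            new_y[i1:max(i1 + 1, i2)] = err
        target_y.append(new_y)

    return X, target_y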

with open(codenet.detection_X_y_v2_path, 'rb') as f:
    X, y = pickle.load(f)
    X, y = convert_X_y(X, y)
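
To make the labeling step concrete, here is a minimal sketch on a made-up token sequence. The tokens, the error range, and the helper name `label_tokens` are assumptions for illustration only; they are not taken from the CodeNet data. The sketch assumes, as the slicing in `convert_X_y` suggests, that each error is a `(start, end, error_class)` triple with a half-open range.

```python
import numpy as np

def label_tokens(tokens, errs, default="Accepted"):
    """Return one label per token (hypothetical helper mirroring convert_X_y)."""
    # Object dtype keeps full-length label strings regardless of token width.
    labels = np.full(len(tokens), default, dtype=object)
    for start, end, err in errs:
        # Half-open range [start, end); an empty range still marks one token.
        labels[start:max(start + 1, end)] = err
    return labels

# Made-up example: the subscript expression is marked as the error range.
tokens = np.array(["print", "(", "mylist", "[", "123", "]", ")"])
errs = [(2, 6, "IndexError")]

labels = label_tokens(tokens, errs)
for tok, lab in zip(tokens, labels):
    print(f"{tok:8} {lab}")
```

Tokens outside every error range keep the default "Accepted" label, so the output has exactly one label per input token.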

Here we split the data into training and testing subsets, with 80% of the samples used for training and the rest for testing. Then we use FastText to turn each token into a float vector of size 16. I chose FastText because it also handles out-of-vocabulary words, such as variable names that are unique to a specific submission. Finally, we transform the input data from sequences of tokens into a single array containing the vector for each token. The resulting data is a sequence of (vector, label) pairs.

In [93]:
tok_train, tok_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [4]:
sentences = [x.tolist() for x in tok_train]

vectorizer = FastText(sentences=sentences, min_count=0, vector_size=16, seed=42)

In [5]:
tok_train = np.concatenate(tok_train)
tok_test = np.concatenate(tok_test)

X_train = vectorizer.wv[tok_train]
X_test = vectorizer.wv[tok_test]
y_train = np.concatenate(y_train)
y_test = np.concatenate(y_test)
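
The reason FastText can produce a vector for an unseen token is that it works with character n-grams rather than whole words only. The sketch below illustrates that idea in plain NumPy. It is a simplified stand-in, not gensim's implementation: the bucket table is random instead of learned, the hash function (`zlib.crc32`) and the bucket count are arbitrary choices, and the names `char_ngrams` and `subword_vector` are hypothetical.

```python
import zlib
import numpy as np

def char_ngrams(token, min_n=3, max_n=6):
    """Character n-grams of the token wrapped in boundary markers."""
    wrapped = f"<{token}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

NUM_BUCKETS, DIM = 1000, 16  # illustrative sizes; DIM matches vector_size=16
rng = np.random.default_rng(42)
# Random stand-in for the n-gram vectors that training would learn.
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM))

def subword_vector(token):
    """Average the bucket vectors of all n-grams of the token."""
    ids = [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(token)]
    return bucket_vectors[ids].mean(axis=0)

print(char_ngrams("my"))
vec = subword_vector("my_unseen_variable")
print(vec.shape)
```

Any non-empty token has at least one n-gram under this scheme, so a variable name that never appeared in training still maps to a vector, and tokens that share substrings share some of their n-grams.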

## Model

The baseline classifier we are going to use is a decision tree classifier.
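
Because the classifier sees each token vector in isolation, two occurrences of the same token always receive the same prediction, whatever surrounds them. The sketch below shows this on toy data; the 2-dimensional embeddings and the labels are made up for illustration (the notebook itself uses 16-dimensional FastText vectors).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 2-d embeddings, one fixed vector per token.
emb = {"x": [0.1, 0.9], "[": [0.8, 0.2], "5": [0.4, 0.4]}

# "[" occurs three times: twice in correct code, once inside an error range.
toy_tokens = ["x", "[", "5", "[", "["]
toy_labels = ["Accepted", "Accepted", "Accepted", "IndexError", "Accepted"]

toy_X = np.array([emb[t] for t in toy_tokens])
toy_clf = DecisionTreeClassifier(random_state=42).fit(toy_X, toy_labels)

# Identical features cannot be separated, so every "[" gets one prediction.
bracket_pred = toy_clf.predict(np.array([emb["["], emb["["]]))
print(bracket_pred)
```

The tree cannot split samples with identical features, so the leaf holding "[" falls back to its majority class. This is the context limitation described in the Data section, seen from the model's side.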

In [9]:
clf = DecisionTreeClassifier(random_state=42)

In [10]:
clf.fit(X_train, y_train)

DecisionTreeClassifier(random_state=42)

In [11]:
y_pred = clf.predict(X_test)

In [12]:
print(f"Accuracy {accuracy_score(y_test, y_pred)}")
print(f"Precision {precision_score(y_test, y_pred, average='weighted', zero_division=0)}")
print(f"Recall {recall_score(y_test, y_pred, average='weighted', zero_division=0)}")
print(f"F1 {f1_score(y_test, y_pred, average='weighted', zero_division=0)}")

Accuracy 0.8728067375886525
Precision 0.7871550503899936
Recall 0.8728067375886525
F1 0.8151476436521569
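
The weighted recall above is identical to the accuracy. That is expected: with `average='weighted'`, each class's recall is weighted by its share of the true labels, which sums to the overall fraction of correct predictions. The sketch below reproduces this on toy labels, which are made up and are not the project's data, and adds a macro average for comparison, since weighted averages can hide how rare classes are handled.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy, imbalanced labels: most tokens are "Accepted".
y_true = ["Accepted"] * 8 + ["IndexError"] * 2
# A degenerate predictor that always answers "Accepted".
y_hat = ["Accepted"] * 10

acc = accuracy_score(y_true, y_hat)
weighted_recall = recall_score(y_true, y_hat, average="weighted", zero_division=0)
macro_recall = recall_score(y_true, y_hat, average="macro", zero_division=0)
weighted_precision = precision_score(y_true, y_hat, average="weighted", zero_division=0)

print(f"Accuracy           {acc:.2f}")
print(f"Weighted recall    {weighted_recall:.2f}")
print(f"Macro recall       {macro_recall:.2f}")
print(f"Weighted precision {weighted_precision:.2f}")
```

In this toy case the macro recall is much lower than the weighted recall because the rare class is never predicted. Whether the same holds for the baseline would need to be checked, for example with per-class scores on `y_test` and `y_pred`.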


## Inference

In [124]:
bug = """mylist = [1, 2, 3, double]

print(mylist[123])
"""

token_df = codenet.run_pythontokenizer_str(bug)
tokens = token_df["text"].values
tokens = vectorizer.wv[tokens]

prediction = clf.predict(tokens)
token_err_df = codenet.prediction2err(token_df, prediction)

display(token_err_df)
codenet.exec_python_str(bug)