# Bug Detection Pipeline

In this notebook we walk through an example of using the pre-trained models for bug detection. The pipeline produces both a natural-language error description and a buggy/not-buggy label for each token in the source code.

## Imports

In [1]:
import os
import json
import torch

import numpy as np
import pandas as pd

from transformers import RobertaTokenizerFast, T5ForConditionalGeneration, RobertaForTokenClassification
from difflib import SequenceMatcher
from tqdm.notebook import tqdm
from IPython.display import HTML, display

pd.set_option('display.max_columns', None)

codenet_root = '../../input/generated/'

os.environ["WANDB_DISABLED"] = "true"

## Loading Data

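In this section we load the test examples and compute a ground-truth character mask for each one by diffing the original source against the changed source, so that the predictions can be compared against it.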

In [2]:
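def generate_char_mask(original_src, changed_src):
    # Diff the two sources character by character and keep only the edits.
    s = SequenceMatcher(None, original_src, changed_src)
    opcodes = [x for x in s.get_opcodes() if x[0] != "equal"]

    # One label per character of the original source: 1 = touched by an edit.
    original_labels = np.zeros(len(original_src), dtype=np.int32)
    for _, i1, i2, _, _ in opcodes:
        # A pure insertion has i1 == i2, so mark at least one character.
        original_labels[i1: max(i1+1, i2)] = 1

    return original_labels.tolist()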

with open(codenet_root + 'codenetpy_test.json') as f:
    data = json.load(f)["data"]

codenetpy_df = pd.DataFrame(data)

original_src = codenetpy_df['original_src'].tolist()
error_class_extra = codenetpy_df['error_class_extra'].tolist()
changed_src = codenetpy_df['changed_src'].tolist()

true_labels = [generate_char_mask(o, c) for (o, c) in zip(original_src, changed_src)]
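
To make the ground-truth computation concrete, the following is a minimal standalone sketch of the same idea on a toy pair of strings. The helper name `char_mask` and the two example strings are illustrative assumptions, not part of the dataset; the sketch uses plain lists instead of NumPy so it runs with the standard library only.

```python
from difflib import SequenceMatcher

def char_mask(original, changed):
    """Hypothetical helper: flag characters of `original` touched by any edit."""
    mask = [0] * len(original)
    for tag, i1, i2, _, _ in SequenceMatcher(None, original, changed).get_opcodes():
        if tag == "equal":
            continue
        # A pure insertion has i1 == i2; flag one character so the edit stays visible.
        end = min(max(i1 + 1, i2), len(original))
        for k in range(i1, end):
            mask[k] = 1
    return mask

# Toy example (made up for illustration): one character differs.
original = "x = 1\nprint(y)\n"
changed = "x = 1\nprint(x)\n"
mask = char_mask(original, changed)

# Print the characters that were flagged as changed.
print([original[k] for k, m in enumerate(mask) if m == 1])
```

Only the replaced character is flagged here; deletions flag every removed character, and insertions flag the single character at the insertion point.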

## Loading Models

In this section we load the two pre-trained models: one that generates the error description and one that labels each token as buggy or not.

In [3]:
tokenizer_ed = RobertaTokenizerFast.from_pretrained("alexjercan/codet5-base-buggy-error-description")
model_ed = T5ForConditionalGeneration.from_pretrained("alexjercan/codet5-base-buggy-error-description")

tokenizer_tc = RobertaTokenizerFast.from_pretrained("alexjercan/codebert-base-buggy-token-classification")
model_tc = RobertaForTokenClassification.from_pretrained("alexjercan/codebert-base-buggy-token-classification")

## Inference

The `predict_error_description` function uses the pre-trained model to take source code as input and output an error description in natural language. It also works on a batch, i.e. a list of sources.

The `predict_token_class` function takes the error message and the source code as input and outputs a label of 0 or 1 for each character. Labels are reported per character rather than per token because this makes visualization easier: each token label is expanded to cover all the characters that form that token.
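
The expansion from token labels to character labels can be sketched in isolation as follows. This is a simplified illustration, assuming the character span of each token is already known; the function name `expand_to_chars`, the hand-written spans, and the labels are hypothetical and do not come from the real tokenizer or model.

```python
def expand_to_chars(text_len, spans, token_labels):
    """Hypothetical helper: spread each token label over the token's characters."""
    labels = [0] * text_len
    for (start, end), label in zip(spans, token_labels):
        if start == end:
            continue  # empty span, nothing to mark
        for k in range(start, end):
            # OR keeps a character flagged if any token covering it is flagged.
            labels[k] |= label
    return labels

source = "print(y)"
# Assumed tokenization for illustration: "print", "(", "y", ")".
spans = [(0, 5), (5, 6), (6, 7), (7, 8)]
token_labels = [0, 0, 1, 0]  # made-up prediction: only "y" is buggy

char_labels = expand_to_chars(len(source), spans, token_labels)
print(char_labels)  # [0, 0, 0, 0, 0, 0, 1, 0]
```

In the notebook itself the spans are obtained from the fast tokenizer's `token_to_word` and `word_to_chars` mappings rather than written by hand.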

In [4]:
def predict_error_description(tokenizer, model, source):
    tokenized_inputs = tokenizer(source, padding=True, truncation=True, return_tensors="pt").to(model.device)
    tokenized_labels = model.generate(**tokenized_inputs).cpu().detach().numpy()
    
    return tokenizer.batch_decode(tokenized_labels, skip_special_tokens=True)

def predict_token_class(tokenizer, model, error, source):
    if not isinstance(source, list):
        source = [source]
        error = [error]
    
    tokenized_inputs = tokenizer(text=error, text_pair=source, padding=True, truncation=True, return_tensors="pt").to(model.device)
    # Inference only: skip gradient tracking to save memory.
    with torch.no_grad():
        logits = model(**tokenized_inputs)['logits']
    tokenized_labels = np.argmax(logits.cpu().numpy(), 2)
    
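    all_labels = []
    for i in range(tokenized_labels.shape[0]):
        # One label per character of the i-th source string.
        labels = np.zeros(len(source[i]), dtype=np.int64)
        for j, label in enumerate(tokenized_labels[i]):
            # Skip special tokens, padding, and tokens of the error message (sequence 0).
            if tokenized_inputs.token_to_sequence(i, j) != 1:
                continue

            word_id = tokenized_inputs.token_to_word(i, j)
            cs = tokenized_inputs.word_to_chars(i, word_id, sequence_index=1)
            if cs.start == cs.end:
                continue
            # Flag every character of the word if any of its tokens is predicted buggy.
            labels[cs.start:cs.end] |= label

        all_labels.append(labels.tolist())

    return all_labels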

def predict(tokenizer_ed, model_ed, tokenizer_tc, model_tc, source):
    error = predict_error_description(tokenizer_ed, model_ed, source)
    labels = predict_token_class(tokenizer_tc, model_tc, error, source)
    
    return error, labels

from html import escape

def color_source(source_code, mask, color='red'):
    text = ""
    for i, char in enumerate(source_code):
        norm_color = 'black'
        if char == ' ':
            char = "•"
            norm_color = 'lightgrey'
        if char == '\n':
            char = "↵\n"
            norm_color = 'lightgrey'
        # Escape <, > and & so the source is shown literally instead of being parsed as HTML.
        text += f'<span style="color:{color if mask[i] == 1 else norm_color};">{escape(char)}</span>'
    return "<pre>" + text + "</pre>"

def display_example(source_code, error_class_extra, true_error_class_extra, mask, true_mask):
    display(HTML("<h4>Source code with the predicted buggy characters in red:</h4>"))
    display(HTML(color_source(source_code, mask, color='red')))

    display(HTML("<h4>Source code with the ground-truth buggy characters in blue:</h4>"))
    display(HTML(color_source(source_code, true_mask, color='blue')))

    display(HTML("<h4>Predicted error description:</h4>"))
    display(HTML(f"<pre>{escape(str(error_class_extra))}</pre>"))

    display(HTML("<h4>Ground-truth error description:</h4>"))
    display(HTML(f"<pre>{escape(str(true_error_class_extra))}</pre>"))

In [5]:
for i in range(10):
    display(HTML(f"<h2>Example {i}</h2>"))

    error, labels = predict(tokenizer_ed, model_ed, tokenizer_tc, model_tc, [original_src[i]])

    display_example(original_src[i], error[0], error_class_extra[i], labels[0], true_labels[i])
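
The visual comparison above could be complemented with a numeric one. The sketch below shows one possible way to do that, using character-level precision, recall, and F1 between a predicted mask and a ground-truth mask. The function name `char_level_scores` and the two toy masks are assumptions made for illustration; this is not a metric reported by the notebook.

```python
def char_level_scores(pred_mask, true_mask):
    """Hypothetical helper: precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred_mask, true_mask) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred_mask, true_mask) if p == 0 and t == 1)

    # Guard against division by zero when a mask contains no positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy masks (made up): one hit, one false alarm, one miss.
pred = [0, 1, 1, 0, 0]
true = [0, 1, 0, 0, 1]
precision, recall, f1 = char_level_scores(pred, true)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

With the notebook's variables, the same function could be called as `char_level_scores(labels[0], true_labels[i])` inside the loop.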