# OpenBugger
This notebook is a self-contained demo of the OpenBugger package, which automatically injects bugs into Python code using LibCST.

The plan for this notebook is to develop all the components needed for a pipeline that,
starting from a Python script, extracts its Concrete Syntax Tree and uses it to apply a sequence of reversible syntactic code mutations, automatically generating training data for debugging language models.

1. Tools to ensure that the sequence of modifications is consistent and does not overwrite previously introduced mutations.
2. A Bugger class that stores the local context of multiple bugs that can be applied in a chain.
3. An InverseTransformer class that can reverse the transformation of any other transformer sharing the same Bugger context.
4. Five of the planned example LibCST transformers, each implementing a different logical bug.
5. Use of the InverseTransformer to generate accurate debugging instructions that serve as training data for Large Language Models.
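
To make the last step concrete, here is a minimal sketch of how a saved mutation could be turned into a debugging instruction, modeled on the message format that `bugger_example` prints later in this notebook. The `BugRecord` layout and the `describe_fix` and `apply_fix` helpers are hypothetical and are not part of OpenBugger; the sketch only handles bugs that sit on a single line.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BugRecord:
    """Hypothetical record of one mutation (not the OpenBugger scratchpad format)."""
    bug_id: str
    start: Tuple[int, int]  # (line, column); lines are 1-based, columns 0-based
    end: Tuple[int, int]    # the end column is exclusive
    original_code: str
    bugged_code: str


def describe_fix(record: BugRecord) -> str:
    """Render a record as a natural-language debugging instruction."""
    return (
        f"The following Node has a bug of type {record.bug_id} "
        f"starting at line {record.start[0]}, column {record.start[1]} "
        f"and ending at line {record.end[0]}, column {record.end[1]}.\n"
        f"The bug can be fixed by substituting the bugged code-string "
        f"<{record.bugged_code}> with the following code-string "
        f"<{record.original_code}>"
    )


def apply_fix(source: str, record: BugRecord) -> str:
    """Undo a single-line bug by splicing the original code back in."""
    if record.start[0] != record.end[0]:
        raise ValueError("this sketch only handles single-line bugs")
    lines = source.split("\n")
    row = record.start[0] - 1
    line = lines[row]
    lines[row] = line[: record.start[1]] + record.original_code + line[record.end[1]:]
    return "\n".join(lines)


record = BugRecord(
    bug_id="VariableNameTypoTransformer-1322",
    start=(1, 0),
    end=(1, 12),
    original_code="var1 = 5",
    bugged_code="var1var1 = 5",
)
tainted = 'var1var1 = 5\nvar2 = "hello"\n'
print(describe_fix(record))
fixed = apply_fix(tainted, record)
print(fixed)
```

Pairing the tainted code with the instruction text and the fixed code gives one training example per bug.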

## LibCST

In [None]:
import uuid
from copy import deepcopy
from typing import List

import libcst as cst
import libcst.matchers as m
from libcst import Equal, GreaterThanEqual, GreaterThan, CSTNode, Module
from libcst.codemod import Codemod, CodemodContext, ContextAwareTransformer, ContextAwareVisitor
from libcst.metadata import BatchableMetadataProvider, CodePosition, CodeRange, MetadataWrapper, PositionProvider

Useful LibCST docs:

https://libcst.readthedocs.io/en/latest/metadata.html#position-metadata

https://libcst.readthedocs.io/en/latest/_modules/libcst/metadata/position_provider.html#PositionProvider

https://libcst.readthedocs.io/en/latest/_modules/libcst/metadata/position_provider.html#WhitespaceInclusivePositionProvidingCodegenState


https://libcst.readthedocs.io/en/latest/parser.html

## PositionContextUpdater and is_modified
Because we want to apply potentially random changes to the code, we need some simple helper functions. The is_modified function uses the metadata that transformers save after modifying a node to prevent any other transformer from applying further modifications to that node, its parent, or its children.
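
The parent/child check can be pictured as a containment test between code ranges. The sketch below is an assumption about the general idea, not the implementation in `openbugger.context`: `Span`, `contains` and `conflicts` are hypothetical names, and positions are plain `(line, column)` tuples rather than LibCST `CodeRange` objects.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pos = Tuple[int, int]  # (line, column)


@dataclass(frozen=True)
class Span:
    start: Pos
    end: Pos


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies fully inside outer (equal spans count as contained)."""
    # Tuples compare lexicographically, so (line, column) ordering works directly.
    return outer.start <= inner.start and inner.end <= outer.end


def conflicts(candidate: Span, modified: List[Span]) -> bool:
    """True if candidate is, contains, or is contained in an already modified span."""
    return any(contains(m, candidate) or contains(candidate, m) for m in modified)


modified = [Span((1, 6), (1, 15))]        # e.g. a comparison that was already mutated
child = Span((1, 10), (1, 15))            # a subscript inside that comparison
parent = Span((1, 0), (2, 12))            # the enclosing while statement
sibling = Span((3, 0), (3, 4))            # an unrelated statement

print(conflicts(child, modified), conflicts(parent, modified), conflicts(sibling, modified))
```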

Since each node modification might add or remove code, we also use the PositionContextUpdater to keep the context scratchpad consistent after each mutation.
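
To illustrate why the saved positions need updating, here is a small sketch of shifting stored ranges after an edit that changes the length of a line. This is not how PositionContextUpdater is implemented; `shift_spans` is a hypothetical helper that assumes the edit stays on one line, and positions are plain `(line, column)` tuples.

```python
from typing import List, Tuple

Pos = Tuple[int, int]   # (line, column)
Span = Tuple[Pos, Pos]  # (start, end)


def shift_spans(spans: List[Span], edit_end: Pos, delta_cols: int) -> List[Span]:
    """Shift saved spans after a same-line edit whose text grew by delta_cols.

    edit_end is the end of the replaced text in the old coordinates. Only
    positions on the edited line, at or after edit_end, need to move.
    """
    edit_line, edit_col = edit_end

    def shift(pos: Pos) -> Pos:
        line, col = pos
        if line == edit_line and col >= edit_col:
            return (line, col + delta_cols)
        return pos

    return [(shift(start), shift(end)) for start, end in spans]


# Replacing `x` at (1, 6)-(1, 7) with `xx` makes line 1 one column longer.
saved = [((1, 10), (1, 15)), ((2, 4), (2, 12))]
updated = shift_spans(saved, edit_end=(1, 7), delta_cols=1)
print(updated)
```

An edit that inserts or removes whole lines would need an analogous shift of the line component.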

In [None]:
from openbugger.context import is_modified, save_modified, is_parent_CodeRange, is_child_CodeRange, is_equal_Coderange, PositionContextUpdater

## Bugger and InverseTransformer

In [None]:
from openbugger.bugger import Bugger, TestTransformer, bugger_example, InverseTransformer

### Example Debugging Output Using the Test Transformer

In [None]:
transformers = [TestTransformer]
# Get the script as a string; it should contain a while loop and span multiple lines
script = "while x < y[10]:\n\tprint(x)\n\tx = x[2] + 1+y[2:10]"
bugger_example(transformers,script)


# Example Bugs 

In this section of the notebook we develop the first 5 of the example bugs listed below as LibCST ContextAwareTransformers. Each transformer uses the is_modified method to check whether a node was already targeted by a mutation; if it was not, the transformer applies the mutation and records it in the scratchpad with the save_modified method.

The bugs are:

1. incorrect_comparison_operator - Done
2. comparison_swap - Done
3. forgetting_to_update_variable - Done
4. infinite_loop - Done
5. off_by_k_index - Done
6. incorrect_return_value
7. incorrect_boolean_operator
8. using_wrong_type_of_loop
9. using_loop_variable_outside_loop
10. using_variable_before_assignment
11. using_wrong_variable_scope
12. incorrect_use_of_exception_handling
13. incorrect_function_call

The current approach only allows the scratchpad context to be passed to the transformers at initialization. For bugs that take additional inputs, such as which operator to swap, we therefore use generator functions that return a configured transformer. These transformers apply the bug to ALL the target instances they find; later in the notebook we will derive wrappers to control the number of bugs or to target a specific code range.

## Logical Bugs

In [None]:
from openbugger.bugs.logical import gen_ComparisonTargetTransfomer, ComparisonSwapTransformer

### Incorrect Comparison Operator
This bug takes two LibCST comparison operators as input and replaces every instance of the first with the second.

In [None]:
transformers = [gen_ComparisonTargetTransfomer('==','!=')]
# Get the script as a string
script = "x == 1 + 2 == 3 + 2 != 3 + 4 > 3"

bugger_example(transformers,script)

### Comparison Swap

In [None]:
transformers = [ComparisonSwapTransformer]
# Get the script as a string
script = "x == 1 + 2 == 3 + 2 != 3 + 4 > 3 "
bugger_example(transformers,script)

## Control Flow Bugs

In [None]:
from openbugger.bugs.controlflow import ForgettingToUpdateVariableTransformer, InfiniteWhileTransformer,gen_OffByKIndexTransformer

### ForgettingToUpdateVariable

In [None]:
transformers = [ForgettingToUpdateVariableTransformer]
# Get the script as a string
script = "while x == 1 + 2 == 3 + 2 != 3 + 4 > 3 : \n  y = 1 + 2"
bugger_example(transformers,script)

### Infinite While Loop

In [None]:
transformers = [InfiniteWhileTransformer]
# Get the script as a string
script = "while x == 1 + 2 == 3 + 2 != 3 + 4 > 3: \n  y = 1 + 2"
bugger_example(transformers,script)

### OffByKIndex 

In [None]:
transformers = [gen_OffByKIndexTransformer(1)]
# Get the script as a string
script = "while x == 1 + 2 == 3 + 2 != 3 + 4 > 3: \n  y = 1 + 2\nx[1:2]\nx[1]"
bugger_example(transformers,script)

### Incorrect Exception Handling
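Catching the wrong exceptions, or incorrect use of the `try/except/finally` block. The transformer below removes the exception type from each handler, turning it into a bare `except:`.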

In [None]:
class IncorrectExceptionHandlerTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider,)
    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"
    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        return self.transform_module(tree)
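    def leave_ExceptHandler(self, original_node: cst.ExceptHandler, updated_node: cst.ExceptHandler) -> cst.ExceptHandler:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node,meta_pos,self.context)
        # Skip handlers that are already bare: there is no exception type to remove
        if not already_modified and original_node.type is not None:
            # Drop the exception type, any `as name` (which requires a type) and the
            # space after `except`, so that the handler renders as a bare `except:`
            updated_node = original_node.with_changes(
                type=None, name=None, whitespace_after_except=cst.SimpleWhitespace(""))
            save_modified(self.context,meta_pos,original_node,updated_node,self.id)
        return updated_node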

In [None]:
script= """try:
    x = 1 / 0
except ZeroDivisionError:
    print('Zero division error!')
"""
expected_output= """try:
    x = 1 / 0
except:
    print('Zero division error!')
"""

In [None]:
transformers = [IncorrectExceptionHandlerTransformer]
bugger_example(transformers,script)

### Missing Argument Transformer
Calling a function with too few arguments: the transformer below drops the last argument of each call.

In [None]:
class MissingArgumentTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider,)
    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"
    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        self.reverse = reverse
        return self.transform_module(tree)
    def leave_Call(self, original_node: cst.Call, updated_node: cst.Call) -> cst.Call:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node,meta_pos,self.context)
        if not already_modified and original_node.args:
            updated_node = original_node.with_changes(args=original_node.args[:-1])
            save_modified(self.context,meta_pos,original_node,updated_node,self.id)
        return updated_node


In [None]:
script= """print(str(123))"""
expected_output = """print(str())"""
transformers = [MissingArgumentTransformer]
bugger_example(transformers,script)


### ReturningEarly
Returning from a function before all the necessary computations have been made. The transformer below replaces each `return <value>` with a bare `return`.

In [None]:
class ReturningEarlyTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider,)
    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"
    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        self.reverse = reverse
        return self.transform_module(tree)
    def leave_Return(self, original_node: cst.Return, updated_node: cst.Return) -> cst.Return:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node,meta_pos,self.context)
        if not already_modified:
            updated_node = cst.Return(value=None)
            save_modified(self.context,meta_pos,original_node,updated_node,self.id)
        return updated_node


In [None]:
script= """def add(x, y):
    return x + y
    print('This will not be printed.')
"""
expected_output = """def add(x, y):
    return
    print('This will not be printed.')
"""
transformers = [ReturningEarlyTransformer]
bugger_example(transformers,script)

## Data Related Bugs

### Incorrect variable initialization
Variables are initialized with incorrect values.

In [None]:
import random

DEFAULT_VALUES = {
    int: [1, 2, 3, 4, 5],
    str: ['foo', 'bar', 'baz'],
    list: [[1, 2, 3], ['a', 'b', 'c'], []],
    dict: [{'key': 'value'}, {}, {'num': 1, 'bool': False}],
    bool: [True, False],
    type(None): [None]  # keyed by NoneType, which is the type leave_Assign looks up
}

from libcst import matchers
from libcst.metadata import ParentNodeProvider


class IncorrectVariableInitializationTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider, ParentNodeProvider)

    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"

    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        return self.transform_module(tree)

    def leave_Assign(self, original_node: cst.Assign, updated_node: cst.Assign) -> cst.Assign:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node,meta_pos,self.context)
        if not already_modified:
            if matchers.matches(updated_node.value, matchers.SimpleString()):
                old_value = updated_node.value.value
                old_value_type = str
            elif matchers.matches(updated_node.value, matchers.Integer()):
                old_value = int(updated_node.value.value)
                old_value_type = int
            elif matchers.matches(updated_node.value, matchers.Float()):
                old_value = float(updated_node.value.value)
                old_value_type = float
            elif matchers.matches(updated_node.value, matchers.Name()):
                old_value = updated_node.value.value
                if old_value == "True" or old_value == "False":
                    old_value_type = bool
                elif old_value == "None":
                    old_value_type = type(None)
                else:
                    return updated_node
            else:
                return updated_node

            new_value = random.choice(DEFAULT_VALUES.get(old_value_type, [0]))
            
            if old_value_type == str:
                new_value = f'"{new_value}"'

            updated_node = original_node.with_changes(value=cst.parse_expression(str(new_value)))
            save_modified(self.context,meta_pos,original_node,updated_node,self.id)
        return updated_node
    
    def leave_List(self, original_node: cst.List, updated_node: cst.List) -> cst.List:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node,meta_pos,self.context)
        
        parent_node = self.get_metadata(ParentNodeProvider, original_node)
        if isinstance(parent_node, cst.Assign):
            if not already_modified and len(updated_node.elements) > 0:
                idx = random.randint(0, len(updated_node.elements)-1)
                new_value = random.choice(DEFAULT_VALUES.get(int, [0]))
                updated_elements = list(updated_node.elements)
                updated_elements[idx] = updated_elements[idx].with_changes(value=cst.Integer(str(new_value)))
                updated_node = updated_node.with_changes(elements=tuple(updated_elements))
                save_modified(self.context,meta_pos,original_node,updated_node,self.id)
        return updated_node
    
    def leave_Dict(self, original_node: cst.Dict, updated_node: cst.Dict) -> cst.Dict:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node, meta_pos, self.context)

        parent_node = self.get_metadata(ParentNodeProvider, original_node)
        if isinstance(parent_node, cst.Assign):
            if not already_modified and len(updated_node.elements) > 0:
                idx = random.randint(0, len(updated_node.elements) - 1)
                old_element = updated_node.elements[idx]

                if isinstance(old_element, cst.DictElement):
                    new_value = random.choice(DEFAULT_VALUES.get(str, ["foo", "bar", "baz"]))
                    updated_element = old_element.with_changes(value=cst.SimpleString(f'"{new_value}"'))

                    updated_elements = list(updated_node.elements)
                    updated_elements[idx] = updated_element
                    updated_node = updated_node.with_changes(elements=tuple(updated_elements))
                    save_modified(self.context, meta_pos, original_node, updated_node, self.id)
        return updated_node




In [None]:
script = """var1 = 5
var2 = "hello"
var3 = [1, 2, 3]
var4 = {"key": "value"}
var5 = True
var6 = None
"""
possible_output = """var1 = 2
var2 = "bar"
var3 = [1, 4, 3]
var4 = {"key": "foo"}
var5 = False
var6 = None
"""
transformers = [IncorrectVariableInitializationTransformer]
bugger_example(transformers,script)

### Variable Name Typo
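Mistyping the names of variables, causing them to be treated as new, uninitialized variables. The transformer below renames each assignment target by appending the name of a variable it has already seen.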

In [26]:
from libcst import matchers as m

class VariableNameTypoTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider,)

    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"
        self.seen_variables = set()

    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        return self.transform_module(tree)

    def leave_Assign(self, original_node: cst.Assign, updated_node: cst.Assign) -> cst.Assign:
        meta_pos = self.get_metadata(PositionProvider, original_node)
        already_modified = is_modified(original_node, meta_pos, self.context)

        if not already_modified:
            targets = []
            for target in original_node.targets:
                if isinstance(target.target, cst.Name):
                    var_name = target.target.value
                    self.seen_variables.add(var_name)
                    extra_character = random.choice(list(self.seen_variables)) if self.seen_variables else ''
                    new_name = cst.Name(var_name + extra_character)
                    target = target.with_changes(target=new_name)
                targets.append(target)
            updated_node = original_node.with_changes(targets=targets)
            save_modified(self.context, meta_pos, original_node, updated_node, self.id)
        return updated_node


In [27]:
script = """var1 = 5
var2 = "hello"
var1 = var2
"""
possible_output = """var1var1 = 5
var2var1 = "hello"
var1var1 = var2
"""
transformers = [VariableNameTypoTransformer]
bugger_example(transformers,script)

original_code
var1 = 5
var2 = "hello"
var1 = var2

tainted_code
var1var1 = 5
var2var1 = "hello"
var1var1 = var2

The result of deep_equals between the concrete syntax tree of the original and the bugged code is False
Checking for bugs...
The following Node has a bug of type VariableNameTypoTransformer-1322 starting at line 1, column 0 and ending at line 1, column 12.
The bug can be fixed by substituting the bugged code-string <var1var1 = 5> with the following code-string <var1 = 5>
The following Node has a bug of type VariableNameTypoTransformer-1322 starting at line 2, column 0 and ending at line 2, column 18.
The bug can be fixed by substituting the bugged code-string <var2var1 = "hello"> with the following code-string <var2 = "hello">
The following Node has a bug of type VariableNameTypoTransformer-1322 starting at line 3, column 0 and ending at line 3, column 15.
The bug can be fixed by substituting the bugged code-string <var1var1 = var2> with the following code-string <var1 = var

### Mutable Default Arguments
Using mutable types (like lists or dictionaries) as default function arguments.

In [42]:
from typing import Optional
class MutableDefaultArgumentTransformer(ContextAwareTransformer):
    METADATA_DEPENDENCIES = (PositionProvider,)

    def __init__(self, context: CodemodContext):
        super().__init__(context)
        self.id = f"{self.__class__.__name__}-{uuid.uuid4().hex[:4]}"
        self.used_variables = set()
        self.first_pass = True

    def visit_Name(self, node: cst.Name) -> Optional[bool]:
        # During the first pass, record all the variable names used in the module
        if self.first_pass:
            self.used_variables.add(node.value)
        return None

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.FunctionDef:
        # Only modify function definitions (with an indented body) during the second pass
        if self.first_pass or not isinstance(updated_node.body, cst.IndentedBlock):
            return updated_node

        # Pick the last parameter that has no default yet. Giving a default to an
        # earlier one would leave a default-less parameter after a defaulted one,
        # which is not valid Python.
        params = list(updated_node.params.params)
        candidates = [
            i for i, param in enumerate(params)
            if param.default is None and param.name.value in self.used_variables
        ]
        if not candidates:
            return updated_node
        idx = candidates[-1]
        variable_to_replace = params[idx].name.value

        # Give the chosen parameter a mutable default, keeping all other parameters as they are
        params[idx] = params[idx].with_changes(default=cst.List([]))
        new_params = updated_node.params.with_changes(params=params)

        # Build a statement that mutates the default and insert it before the last statement
        assignment = cst.parse_statement(f"{variable_to_replace}.append(1)")
        body = list(updated_node.body.body)
        new_body = updated_node.body.with_changes(body=body[:-1] + [assignment] + body[-1:])

        # NOTE: this mutation is not recorded with save_modified, so the
        # InverseTransformer has nothing to revert for this bug
        return updated_node.with_changes(params=new_params, body=new_body)




    def mutate(self, tree: cst.Module, reverse: bool = False) -> cst.Module:
        # First pass: collect variable names
        self.first_pass = True
        self.transform_module(tree)
        # Second pass: introduce bugs
        self.first_pass = False
        return self.transform_module(tree)


In [44]:
script = """def func1(arg1, arg2):
    return arg1 + arg2"""
expected_output = """def func1(arg1, arg2 = []):
    arg2.append(1)
    return arg1 + arg2"""
transformers = [MutableDefaultArgumentTransformer]
bugger_example(transformers,script)

original_code
def func1(arg1, arg2,arg3):
    return arg1 + arg2
tainted_code
def func1(arg1, arg2,arg3 = []):
    arg3.append(1)
    return arg1 + arg2
The result of deep_equals between the concrete syntax tree of the original and the bugged code is False
Checking for bugs...
Debugging...
clean_code
def func1(arg1, arg2,arg3 = []):
    arg3.append(1)
    return arg1 + arg2
Checking if the debugged code is equal to the original code..
The result of deep_equals between the concrete syntax tree of the original and debugged code is False


## Chaining Multiple Bugs

Bugs can be chained in a sequence and are executed in order. A bug does not modify a node that has already been modified, or whose parent or children have been modified. As an example, we apply InfiniteWhileTransformer, OffByKIndexTransformer and ComparisonTargetTransfomer to the script:

`while x == 1 + 2 == 3 + 2 != 3 + 4 > z[4]: \n  y = 1 + 2\nx[1:2]\nx[1]\nx[1]==3`

ComparisonTarget and OffByKIndex only modify targets that lie outside the while statement already targeted by InfiniteWhileTransformer.

In [None]:
transformers = [InfiniteWhileTransformer,gen_OffByKIndexTransformer(1),gen_ComparisonTargetTransfomer("==",">=")]
# Get the script as a string
script = "while x == 1 + 2 == 3 + 2 != 3 + 4 > z[4]: \n  y = 1 + 2\nx[1:2]\nx[1]\nx[1]==3"
bugger_example(transformers,script)