## Test case generation with Large Language Models

In this series of exercises, we will investigate the use of LLMs to generate test cases.

### Step 1: Our reference code

Unlike the previous exercise on code generation, where we had valid test cases, this time we assume that we have valid solutions for the given software requirements. Our task now is to generate test cases for valid code.



In [6]:
#the same code is saved in the python script function_01.py

original_function="""def racer_disqualified(times, winner_times, n_penalties, penalties):
    \"""
    Determines if a racer is disqualified based on their times, penalties, and winner times.

    Parameters:
        times (list of int): List of the racer's times for three events.
        winner_times (list of int): List of winner times for the same three events.
        n_penalties (int): Number of penalties the racer incurred.
        penalties (list of int): List of penalty values.

    Returns:
        bool: True if the racer is disqualified, False otherwise.

    Raises:
        ValueError: If inputs do not meet the required types or constraints.
    \"""
    # Input validation
    if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):
        raise ValueError("times must be a list of three integers.")

    if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):
        raise ValueError("winner_times must be a list of three integers.")

    if not isinstance(n_penalties, int):
        raise ValueError("n_penalties must be an integer.")

    if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):
        raise ValueError("penalties must be a list of integers.")

    if n_penalties != len(penalties):
        raise ValueError("n_penalties must match the length of the penalties list.")

    disqualified = False
    tot_penalties = 0

    # Calculate total penalties and check for any excessive penalty
    for penalty in penalties:
        tot_penalties += penalty
        if penalty > 100:
            disqualified = True

    # Check for disqualification based on total penalties or number of penalties
    if tot_penalties > 100 or n_penalties > 5:
        disqualified = True

    # Check if any time exceeds 1.5 times the corresponding winner time
    for i in range(3):
        max_time = winner_times[i] * 1.5
        if times[i] > max_time:
            disqualified = True

    return disqualified"""


file_path = "function_01.py"

with open(file_path, 'w') as file:
    file.write(original_function)



def racer_disqualified(times, winner_times, n_penalties, penalties):
    """
    Determines if a racer is disqualified based on their times, penalties, and winner times.

    Parameters:
        times (list of int): List of the racer's times for three events.
        winner_times (list of int): List of winner times for the same three events.
        n_penalties (int): Number of penalties the racer incurred.
        penalties (list of int): List of penalty values.

    Returns:
        bool: True if the racer is disqualified, False otherwise.

    Raises:
        ValueError: If inputs do not meet the required types or constraints.
    """
    # Input validation
    if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):
        raise ValueError("times must be a list of three integers.")

    if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):
        raise ValueError("winner_times must be a list of three integers.")

    if not isinstance(n_penalties, int):
        raise ValueError("n_penalties must be an integer.")

    if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):
        raise ValueError("penalties must be a list of integers.")

    if n_penalties != len(penalties):
        raise ValueError("n_penalties must match the length of the penalties list.")

    disqualified = False
    tot_penalties = 0

    # Calculate total penalties and check for any excessive penalty
    for penalty in penalties:
        tot_penalties += penalty
        if penalty > 100:
            disqualified = True

    # Check for disqualification based on total penalties or number of penalties
    if tot_penalties > 100 or n_penalties > 5:
        disqualified = True

    # Check if any time exceeds 1.5 times the corresponding winner time
    for i in range(3):
        max_time = winner_times[i] * 1.5
        if times[i] > max_time:
            disqualified = True

    return disqualified

### Step 2: Define some pytest test cases

We now set up an environment to run test cases and measure their coverage. To start, we define a couple of test cases with the pytest library.

In [None]:
import pytest
import ipytest

#TODO 
#define test cases with pytest to run with ipytest below


def run_tests():
    ipytest.run('-vv')  

# Running the tests with ipytests
run_tests()
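
A possible starting point for the TODO above is sketched below. To keep the block runnable on its own, it uses a condensed stand-in for `racer_disqualified` (same disqualification rules, validation reduced to the length check); in the notebook, that definition would be dropped in favor of the reference function. The test names and input values are illustrative choices, not part of the original exercise.

```python
def racer_disqualified(times, winner_times, n_penalties, penalties):
    # Condensed stand-in (assumption): same rules as the reference function,
    # with input validation reduced to the n_penalties length check.
    if n_penalties != len(penalties):
        raise ValueError("n_penalties must match the length of the penalties list.")
    too_slow = any(t > w * 1.5 for t, w in zip(times, winner_times))
    too_penalized = (
        any(p > 100 for p in penalties) or sum(penalties) > 100 or n_penalties > 5
    )
    return too_slow or too_penalized


WINNER = [100, 100, 100]  # illustrative winner times


def test_clean_run_is_not_disqualified():
    assert racer_disqualified([120, 130, 140], WINNER, 0, []) is False


def test_time_at_limit_is_allowed_but_above_is_not():
    # The limit is 1.5 * 100 = 150 and the check is strict (>), so 150 passes.
    assert racer_disqualified([150, 100, 100], WINNER, 0, []) is False
    assert racer_disqualified([151, 100, 100], WINNER, 0, []) is True


def test_penalty_total_boundary():
    assert racer_disqualified([100, 100, 100], WINNER, 2, [50, 50]) is False
    assert racer_disqualified([100, 100, 100], WINNER, 2, [50, 51]) is True


def test_single_excessive_penalty_disqualifies_even_with_low_total():
    # The total is 51, but one penalty alone exceeds 100.
    assert racer_disqualified([100, 100, 100], WINNER, 2, [101, -50]) is True


def test_number_of_penalties_boundary():
    assert racer_disqualified([100, 100, 100], WINNER, 5, [1] * 5) is False
    assert racer_disqualified([100, 100, 100], WINNER, 6, [1] * 6) is True


def test_mismatched_penalty_count_raises():
    # pytest.raises(ValueError) is the usual idiom; try/except is used here
    # only to keep the block free of imports.
    try:
        racer_disqualified([100, 100, 100], WINNER, 1, [])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Each test targets one rule at its boundary, which tends to help later when mutants change a single operator or constant.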



### Step 3: Computing the pass rate

The first objective of our analysis is computing the pass rate of the test cases.

The pass rate for a test suite is defined as the ratio between the passing test cases and all the test cases executed.

Notice that this ratio is computed in the same way as Functional Correctness, where generated code is compared against an existing test suite, but there is a subtle difference in what we are measuring:
- when we compute functional correctness, we have a correct test suite, and we are verifying whether the code complies with the requirements by executing the test cases.
- when we compute the pass rate, we have correct code, and we are verifying whether the test cases comply with the requirements by executing them against the code.

For now, we are defining the test cases manually: we make sure that the pass rate is 100%.

In [None]:
import pytest
import io
import sys
import subprocess
import re



#TODO
# Run the pytest command and capture the output
result = ""

#TODO
# Extract test results from the pytest output
# parse the results to find passed, failed and errors
# (hint: you can use the code from previous lab)


errors, failures, passes = (0,0,0)


print(f"# Passed: {passes}")
print(f"# Failed: {failures}")
print(f"# Errors: {errors}")

#compute the pass rate of the test cases
pass_rate = 0
print(f"Pass Rate: {pass_rate}")
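
One way to fill in the TODOs above is sketched here. The helper names (`run_pytest`, `parse_pytest_summary`, `compute_pass_rate`) are hypothetical, and the sketch assumes that the last non-empty line of the pytest output is the usual summary line (for example `1 failed, 2 passed in 0.03s`). It also assumes that errors count as executed tests in the denominator.

```python
import re
import subprocess
import sys


def run_pytest(test_path):
    """Run pytest on test_path in a subprocess and return its captured stdout."""
    completed = subprocess.run(
        [sys.executable, "-m", "pytest", test_path],
        capture_output=True,
        text=True,
    )
    return completed.stdout


def parse_pytest_summary(output):
    """Return (passed, failed, errors) read from the last non-empty output line."""
    lines = [line for line in output.splitlines() if line.strip()]
    summary = lines[-1] if lines else ""

    def count(label_pattern):
        match = re.search(rf"(\d+) {label_pattern}\b", summary)
        return int(match.group(1)) if match else 0

    return count("passed"), count("failed"), count("errors?")


def compute_pass_rate(passed, failed, errors):
    """Passing tests divided by all executed tests (0.0 if nothing ran)."""
    total = passed + failed + errors
    return passed / total if total else 0.0
```

In the notebook, this could be used as `passes, failures, errors = parse_pytest_summary(run_pytest("test_function_01.py"))`, where the test file name is an assumption.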

### Step 4: Compute the coverage

To compute the coverage of a test suite over a function or a set of functions, we can use the coverage library. Installing pytest-cov also installs coverage as a dependency:

pip install pytest-cov

Once the coverage module is installed, coverage can be measured with the following command-line instructions:
- coverage run -m pytest test_function_name
- coverage report -m

In this code section, use subprocess calls to run these two commands and capture their output in variables.

In [None]:
#TODO
# Run the pytest coverage run command
result = ""

#TODO
# Run the pytest coverage report command
result2 = ""

#TODO
#define code to extract the coverage from the coverage report
#1) find the line where the function is defined (the line will report the name of the file)
#2) extract the coverage

coverage = 0.0

print(f"Coverage: {coverage}%")
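
The two extraction steps listed in the TODO can be sketched as follows. The helper names are hypothetical, and the parser assumes the default tabular layout of `coverage report -m`, where each row starts with a file path and contains a `Cover` value ending in `%`.

```python
import re
import subprocess
from pathlib import Path


def run_coverage(test_path):
    """Run the two coverage commands and return the text of the report."""
    subprocess.run(
        ["coverage", "run", "-m", "pytest", test_path],
        capture_output=True,
        text=True,
    )
    report = subprocess.run(
        ["coverage", "report", "-m"], capture_output=True, text=True
    )
    return report.stdout


def extract_coverage(report, filename):
    """Return the Cover percentage of the row for filename, or None if absent."""
    for line in report.splitlines():
        parts = line.split()
        # Compare the base name so that test_function_01.py does not
        # match when we are looking for function_01.py.
        if parts and Path(parts[0]).name == filename:
            match = re.search(r"(\d+(?:\.\d+)?)%", line)
            if match:
                return float(match.group(1))
    return None
```

Returning `None` rather than `0.0` when the file is missing from the report makes it easier to tell "not measured" apart from "no lines covered".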

### Step 5: Introducing mutations

To try out mutation testing, produce a set of variants of the function by changing operators and values. Save all these variants in a list of mutants by modifying the text of the function as in the example below.

Remember to introduce a single mutation in each mutated version of the function.

**Note**: several tools exist to automate mutation. You can refer to the libraries mutatest and mutpy to automatically generate mutants for code tested with pytest. In this example, we will introduce mutations manually.

In [None]:

#in this mutant, the check "if penalty > 100" is changed to "if penalty < 100"

mutant1 = """def racer_disqualified(times, winner_times, n_penalties, penalties):
    \"""
    Determines if a racer is disqualified based on their times, penalties, and winner times.

    Parameters:
        times (list of int): List of the racer's times for three events.
        winner_times (list of int): List of winner times for the same three events.
        n_penalties (int): Number of penalties the racer incurred.
        penalties (list of int): List of penalty values.

    Returns:
        bool: True if the racer is disqualified, False otherwise.

    Raises:
        ValueError: If inputs do not meet the required types or constraints.
    \"""
    # Input validation
    if not (isinstance(times, list) and len(times) == 3 and all(isinstance(t, int) for t in times)):
        raise ValueError("times must be a list of three integers.")

    if not (isinstance(winner_times, list) and len(winner_times) == 3 and all(isinstance(wt, int) for wt in winner_times)):
        raise ValueError("winner_times must be a list of three integers.")

    if not isinstance(n_penalties, int):
        raise ValueError("n_penalties must be an integer.")

    if not (isinstance(penalties, list) and all(isinstance(p, int) for p in penalties)):
        raise ValueError("penalties must be a list of integers.")

    if n_penalties != len(penalties):
        raise ValueError("n_penalties must match the length of the penalties list.")

    disqualified = False
    tot_penalties = 0

    # Calculate total penalties and check for any excessive penalty
    for penalty in penalties:
        tot_penalties += penalty
        if penalty < 100:
            disqualified = True

    # Check for disqualification based on total penalties or number of penalties
    if tot_penalties > 100 or n_penalties > 5:
        disqualified = True

    # Check if any time exceeds 1.5 times the corresponding winner time
    for i in range(3):
        max_time = winner_times[i] * 1.5
        if times[i] > max_time:
            disqualified = True

    return disqualified"""


#TODO
#define additional mutations and combine them in a mutant list
mutant2 = ""
mutant3 = ""
mutant4 = ""
mutant5 = "" 
mutants = []
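
Instead of copying the whole function text for every mutant, one option is to derive each mutant from the original source with a single string replacement. The sketch below uses a hypothetical helper, `make_mutant`, and a short stand-in source string so that it runs on its own. The helper refuses to proceed unless the fragment occurs exactly once, which guards against introducing more than one mutation by accident.

```python
def make_mutant(source, original_fragment, mutated_fragment):
    """Return source with one fragment replaced; fail unless it occurs exactly once."""
    occurrences = source.count(original_fragment)
    if occurrences != 1:
        raise ValueError(
            f"expected exactly one occurrence of {original_fragment!r}, "
            f"found {occurrences}"
        )
    return source.replace(original_fragment, mutated_fragment)


# Stand-in source (assumption); in the notebook this would be original_function.
sample_source = (
    "def over_limit(tot_penalties, n_penalties):\n"
    "    return tot_penalties > 100 or n_penalties > 5\n"
)

sample_mutants = [
    # relational operator changed
    make_mutant(sample_source, "tot_penalties > 100", "tot_penalties >= 100"),
    # constant changed
    make_mutant(sample_source, "n_penalties > 5", "n_penalties > 6"),
    # logical operator changed
    make_mutant(sample_source, " or ", " and "),
]
```

Applied to the notebook, a call such as `make_mutant(original_function, "n_penalties > 5", "n_penalties >= 5")` would produce one candidate for `mutant2`.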



### Step 6: Calculating Mutation Score

Now cycle over the list of mutants. For every mutant, overwrite the file function_01.py with the mutant and re-execute the test cases. For each mutant, one of the following outcomes applies:
- Mutant killed: one or more test cases failed
- Mutant survived: all test cases passed

At the end of the iteration over mutants, compute the mutation score:
- Mutation score = killed mutants / total number of mutants

In [None]:
#define the path where to save the mutants

file_path = "function_01.py"


#initialize killed mutants and survived mutants
killed_mutants = 0
survived_mutants = 0

# Iterate over the list of mutants 

for mutant in mutants:
    
    #TODO
    #overwrite the file with the function with each mutant

    
    #TODO
    #run the test cases and collect the number of passed tests


    #TODO
    # Extract test results from the pytest output
    # parse the results to find passed, failed and errors
    # (hint: you can use the code from previous lab)


    #TODO
    #update the number of survived or killed mutants

    pass

#TODO
#compute the mutation score
mutation_score = 0.0

print(f"Mutation score: {round(mutation_score*100, 2)}%")
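
The loop above can be sketched as a reusable function. This is a minimal version under stated assumptions: `run_mutation_analysis`, `compute_mutation_score`, and the `tests_pass` callable are hypothetical names, and the score follows the usual definition of mutation score as the fraction of killed mutants. The `finally` clause restores the original source so that later steps do not run against a leftover mutant.

```python
from pathlib import Path


def run_mutation_analysis(mutants, original_source, file_path, tests_pass):
    """Write each mutant to file_path, run the tests, return (killed, survived).

    tests_pass is a callable returning True when every test case passed.
    """
    target = Path(file_path)
    killed = survived = 0
    try:
        for mutant in mutants:
            target.write_text(mutant)
            if tests_pass():
                survived += 1  # no test noticed the change
            else:
                killed += 1  # at least one test failed
    finally:
        # Always put the reference implementation back.
        target.write_text(original_source)
    return killed, survived


def compute_mutation_score(killed, survived):
    """Fraction of mutants killed by the test suite (0.0 if there are none)."""
    total = killed + survived
    return killed / total if total else 0.0
```

In the notebook, `tests_pass` could be something like `lambda: subprocess.run([sys.executable, "-m", "pytest", "test_function_01.py"]).returncode == 0`, since pytest exits with code 0 only when all tests pass; the test file name is an assumption.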





### Step 7: Generating tests with LLMs

As in the previous lab, we will consider at least two alternatives for test case generation:
- a model from HuggingFace, e.g., CodeLLAMA
- a chat engine, e.g., ChatGPT or Qwen2.5

With each engine, we will generate a new test file (e.g., test_function_01_gpt.py and test_function_01_llama.py) and repeat the pass rate, coverage, and mutation analyses previously performed with the manually defined test cases.

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import subprocess




#TODO 
#obtain test code by using LLAMA or other chat engines, save the results in different files


#TODO
#append the paths of the generated test files to the list
test_files = []


#TODO
#traverse all the test files saved in the list
for test_file in test_files :


    print()
    print()
    print("Doing:", test_file)

    #TODO
    #restore the original function in function_01.py after mutation analysis is performed


    #TODO
    #compute pass rate for the given test file




    number_of_tests = 0
    pass_rate = 0.0
    print(f"Number of tests: {number_of_tests}")
    print(f"Pass Rate: {round(pass_rate*100, 2)}")

    #TODO
    #compute coverage for the given test file




    coverage = 0.0
    print(f"Coverage: {coverage}%")
    

    #re-execute the mutation analysis

    survived_mutants = 0
    killed_mutants = 0

    for mutant in mutants:
    
        #TODO
        #overwrite the file with the function with each mutant
        pass

    #TODO
    #compute the mutation score
    mutation_score = 0.0

    print(f"Mutation score: {round(mutation_score*100, 2)}%")


    pass

