##### All work in this assignment is copyright © Mohammad Ehsan Shahmi Chowdhury

### The following are some important imports required for this assignment:

In [14]:
from datasets import load_dataset
from google import genai
from google.genai import types
import pandas as pd
import numpy as np
import coverage as cv
import mutmut as mut
import os
import subprocess
from pathlib import Path
import sys
import unittest
import io

# **PART 1:**

## We start by installing and loading the dependencies one by one.

## [1] First we download and connect with the **HumanEval database** here.

In [2]:
!pip install -e human-eval

Obtaining file:///C:/Users/Md.%20Ehsan%20Shahmi/OneDrive%20-%20York%20University/PhD_YorkU/Source_codes/EECS6466_Assignment_Ehsan/human-eval

  DEPRECATION: Legacy editable install of human-eval==1.0 from file:///C:/Users/Md.%20Ehsan%20Shahmi/OneDrive%20-%20York%20University/PhD_YorkU/Source_codes/EECS6466_Assignment_Ehsan/human-eval (setup.py develop) is deprecated. pip 25.3 will enforce this behaviour change. A possible replacement is to add a pyproject.toml or enable --use-pep517, and use setuptools >= 64. If the resulting installation is not behaving as expected, try using --config-settings editable_mode=compat. Please consult the setuptools documentation for more information. Discussion can be found at https://github.com/pypa/pip/issues/11457



  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Installing collected packages: human-eval
  Running setup.py develop for human-eval
Successfully installed human-eval-1.0


> The **human-eval** repository was cloned beforehand from [https://github.com/openai/human-eval](https://github.com/openai/human-eval). Running the command above installs it in editable mode, which makes it available to this Jupyter notebook for **Part 1** of the assignment.

#### Now we load our dataset as two types of object and print each one to confirm that it loaded correctly.

> 1. Dataset Dictionary structure - to see only the first row.
> 2. Pandas object - to see the full dataset.

In [2]:
datasetHumanEval_Dict = load_dataset("openai_humaneval")
print(type (datasetHumanEval_Dict))                        # Here the dataset is loaded as a Dataset Dictionary structure
print ()
print(datasetHumanEval_Dict['test'][0])                    # This is to view only the first row of the dataset

<class 'datasets.dataset_dict.DatasetDict'>

{'task_id': 'HumanEval/0', 'prompt': 'from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    """ Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    """\n', 'canonical_solution': '    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n', 'test': "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 

#### The above is the **Dataset Dictionary structure**, with the following header values:
>```
>DatasetDict({
>   test: Dataset({
>      features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point'],
>      num_rows: 164
>  })
>})
>```
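To make the role of each field concrete, here is a minimal sketch of how one record could be turned into a runnable check: the `prompt` and `canonical_solution` are joined into a function, the `test` field defines `check(candidate)`, and `entry_point` names the function to pass in. The `record` below is a hand-written stand-in with the same five keys, not a row from the dataset, and `run_record` is a hypothetical helper name.

```python
# Hypothetical stand-in record with the same five fields as a HumanEval row.
record = {
    "task_id": "Example/0",
    "prompt": 'def add(a: int, b: int) -> int:\n    """Return a + b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n    assert candidate(-1, 1) == 0\n",
    "entry_point": "add",
}


def run_record(rec: dict) -> bool:
    """Join prompt + solution + test, then call check() on the entry point."""
    program = rec["prompt"] + rec["canonical_solution"] + "\n" + rec["test"]
    namespace: dict = {}
    try:
        exec(program, namespace)  # defines the function and check()
        namespace["check"](namespace[rec["entry_point"]])
    except Exception:
        return False  # any failed assertion or runtime error counts as a failure
    return True


print(run_record(record))
```

Because `exec` runs arbitrary code, a helper like this should only be pointed at trusted inputs or run inside a sandbox.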

In [3]:
# Here the dataset is loaded as a pandas object, to operate easily
dfHumanEval_pandas = pd.read_parquet("hf://datasets/openai/openai_humaneval/openai_humaneval/test-00000-of-00001.parquet")
print (type (dfHumanEval_pandas))
print ()
# This is to view all the values of the dataset
print (dfHumanEval_pandas)

<class 'pandas.core.frame.DataFrame'>

           task_id                                             prompt  \
0      HumanEval/0  from typing import List\n\n\ndef has_close_ele...   
1      HumanEval/1  from typing import List\n\n\ndef separate_pare...   
2      HumanEval/2  \n\ndef truncate_number(number: float) -> floa...   
3      HumanEval/3  from typing import List\n\n\ndef below_zero(op...   
4      HumanEval/4  from typing import List\n\n\ndef mean_absolute...   
..             ...                                                ...   
159  HumanEval/159  \ndef eat(number, need, remaining):\n    """\n...   
160  HumanEval/160  \ndef do_algebra(operator, operand):\n    """\...   
161  HumanEval/161  \ndef solve(s):\n    """You are given a string...   
162  HumanEval/162  \ndef string_to_md5(text):\n    """\n    Given...   
163  HumanEval/163  \ndef generate_integers(a, b):\n    """\n    G...   

                                    canonical_solution  \
0        for idx, elem in 

#### Now we have two different objects loaded with the dataset:
> 1. '**datasetHumanEval_Dict**' - dictionary
> 2. '**dfHumanEval_pandas**' - pandas dataframe

## [2] Secondly, we install the third-party **coverage** package, documented [here](https://coverage.readthedocs.io/en/7.10.7/).

> Using `pip install coverage` in the command prompt, this package is installed.

> We test the `coverage` package by verifying its version.

In [4]:
!coverage --version

Coverage.py, version 7.10.7 with C extension
Full documentation is at https://coverage.readthedocs.io/en/7.10.7


> We run the `coverage` package on a simple Python file and then report on it to confirm that it works.
> The `-m` argument also shows the line numbers that were missed.
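The contents of `coverageVerficationModule.py` are not reproduced in this notebook. As an illustration only, a hypothetical module of the same kind is sketched below: two tests exercise two of the three branches of a function, so the remaining `return` is the kind of line that `coverage report -m` lists under `Missing`. The names `classify` and `TestClassify` are assumptions.

```python
import unittest


def classify(n: int) -> str:
    """Return a label for n; the negative branch is deliberately untested."""
    if n < 0:
        return "negative"  # never executed by the tests below
    if n == 0:
        return "zero"
    return "positive"


class TestClassify(unittest.TestCase):
    def test_zero(self):
        self.assertEqual(classify(0), "zero")

    def test_positive(self):
        self.assertEqual(classify(5), "positive")
```

Saving this as a file and running it with `coverage run -m unittest <file>` followed by `coverage report -m <file>` should flag the untested `return` line.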

In [7]:
!coverage run -m unittest coverageVerficationModule.py    # to run the test suite and validate the coverage package        
!coverage report -m coverageVerficationModule.py          # to report the results found from the run in line 1

..
----------------------------------------------------------------------
Ran 2 tests in 0.001s

OK


Name                           Stmts   Miss  Cover   Missing
------------------------------------------------------------
coverageVerficationModule.py      13      1    92%   7
------------------------------------------------------------
TOTAL                             13      1    92%


## [3] Thirdly, we install the **mutmut** package, used for the mutation score, from PyPI [here](https://pypi.org/project/mutmut/#files)

> Using `pip install mutmut` in the command prompt, this package is installed.

In [13]:
!pip install mutmut



> We test the `mutmut` package by verifying its version.

In [64]:
!pip show mutmut                              # The following is the version of mutmut we will be working with.

Name: mutmut
Version: 2.4.4
Summary: mutation testing for Python 3
Home-page: https://github.com/boxed/mutmut
Author: Anders Hovmöller
Author-email: boxed@killingar.net
License: BSD
Location: C:\Users\Md. Ehsan Shahmi\anaconda3\Lib\site-packages
Requires: click, glob2, junit-xml, parso, pony, toml
Required-by: 


## [4] Finally, we install the **Gemini** SDK (API documented [here](https://ai.google.dev/api)) to use the LLM in this assignment.

> Using `pip install google-genai` in the command prompt, the package is installed.

> Then we test the `google-genai` package by verifying its version below.

In [6]:
!pip show google-genai

Name: google-genai
Version: 1.41.0
Summary: GenAI Python SDK
Home-page: https://github.com/googleapis/python-genai
Author: 
Author-email: Google LLC <googleapis-packages@google.com>
License: Apache-2.0
Location: C:\Users\Md. Ehsan Shahmi\anaconda3\Lib\site-packages
Requires: anyio, google-auth, httpx, pydantic, requests, tenacity, typing-extensions, websockets
Required-by: 


> We also check that the import works with a simple manual prompt.
> We will use our free-tier API key for this entire assignment.

In [None]:
# Set your API key (replace the placeholder with your own key)
os.environ['GEMINI_API_KEY'] = 'public_something_key'

# Confirm the key was set without printing it (optional)
print('GEMINI_API_KEY' in os.environ)

# The client gets the API key from the environment variable `GEMINI_API_KEY`, which is set above.
client = genai.Client()
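Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key was exported in the shell before Jupyter started: read it from the environment, fail early if it is missing, and show only a masked form in cell output. The helper names `load_api_key` and `mask` are hypothetical.

```python
import os
from typing import Mapping, Optional


def load_api_key(env: Optional[Mapping[str, str]] = None,
                 name: str = "GEMINI_API_KEY") -> str:
    """Return the key stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook.")
    return key


def mask(key: str, visible: int = 4) -> str:
    """Show only the last few characters so the key never appears in cell output."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


# Made-up value used only to demonstrate the masking.
print(mask("example-key-1234"))
```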

In [5]:
response = client.models.generate_content(
    model="gemini-2.5-flash", contents="Explain who you are.", 
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)) # Disables thinking thus reducing tokens for this testing here.
)
print(response.text)

I am a large language model, trained by Google.


> #### Here ends our setting up the environment, which is the 1st instruction within Part 1.
## Now we commence with the 2nd instruction in the Part 1 of our assignment. 

> We display the pandas DataFrame, into which the dataset was loaded earlier, to see its structure.

In [6]:
dfHumanEval_pandas

| | task_id | prompt | canonical_solution | test | entry_point |
|---|---|---|---|---|---|
| 0 | HumanEval/0 | from typing import List\n\n\ndef has_close_ele... | for idx, elem in enumerate(numbers):\n ... | \n\nMETADATA = {\n 'author': 'jt',\n 'da... | has_close_elements |
| 1 | HumanEval/1 | from typing import List\n\n\ndef separate_pare... | result = []\n current_string = []\n ... | \n\nMETADATA = {\n 'author': 'jt',\n 'da... | separate_paren_groups |
| 2 | HumanEval/2 | \n\ndef truncate_number(number: float) -> floa... | return number % 1.0\n | \n\nMETADATA = {\n 'author': 'jt',\n 'da... | truncate_number |
| 3 | HumanEval/3 | from typing import List\n\n\ndef below_zero(op... | balance = 0\n\n for op in operations:\n... | \n\nMETADATA = {\n 'author': 'jt',\n 'da... | below_zero |
| 4 | HumanEval/4 | from typing import List\n\n\ndef mean_absolute... | mean = sum(numbers) / len(numbers)\n re... | \n\nMETADATA = {\n 'author': 'jt',\n 'da... | mean_absolute_deviation |
| ... | ... | ... | ... | ... | ... |
| 159 | HumanEval/159 | \ndef eat(number, need, remaining):\n """\n... | if(need <= remaining):\n return [ n... | def check(candidate):\n\n # Check some simp... | eat |
| 160 | HumanEval/160 | \ndef do_algebra(operator, operand):\n """\... | expression = str(operand[0])\n for oprt... | def check(candidate):\n\n # Check some simp... | do_algebra |
| 161 | HumanEval/161 | \ndef solve(s):\n """You are given a string... | flg = 0\n idx = 0\n new_str = list(s... | def check(candidate):\n\n # Check some simp... | solve |
| 162 | HumanEval/162 | \ndef string_to_md5(text):\n """\n Given... | import hashlib\n return hashlib.md5(tex... | def check(candidate):\n\n # Check some simp... | string_to_md5 |


> We provide the prompt of each problem in the dataset to the LLM, which creates 10 test cases for each prompt.

> For some prompts, however, the LLM might also create a dummy canonical solution before the test cases, which we have to ignore.

> We store all 10 test cases, as a single test suite, in a separate Python file for each problem.

> The Python file is named **prompt_canonicalx.py**, where x represents the problem index in the dataset.

> This Python file will be used for the evaluation in the **3rd instruction in Part 1** of the assignment. Before storing it, we instruct the LLM to place the ground truth solution above the test suite. Then we write the entire code to the Python file.
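The appending step does not strictly need a second LLM call. As an alternative sketch, not the approach used in this notebook, the module could be assembled locally by concatenating the dataset's `prompt`, its `canonical_solution`, and the generated test suite, which keeps the ground truth byte-for-byte intact. The helper name `build_module`, the file name `prompt_canonical_demo.py`, and the sample strings are hypothetical.

```python
import tempfile
import unittest
from pathlib import Path


def build_module(prompt: str, canonical_solution: str, test_suite: str) -> str:
    """Place the ground-truth function above the generated unittest suite."""
    return f"{prompt}{canonical_solution}\n\n{test_suite}\n"


# Hypothetical inputs standing in for one dataset row and one LLM response.
prompt = 'def double(x):\n    """Return twice x."""\n'
canonical_solution = "    return 2 * x\n"
test_suite = (
    "import unittest\n\n"
    "class TestDouble(unittest.TestCase):\n"
    "    def test_small(self):\n"
    "        self.assertEqual(double(2), 4)\n"
)

with tempfile.TemporaryDirectory() as tmp:
    # Write the assembled module, then load and run it the same way the notebook does.
    Path(tmp, "prompt_canonical_demo.py").write_text(
        build_module(prompt, canonical_solution, test_suite), encoding="utf-8"
    )
    suite = unittest.TestLoader().discover(tmp, pattern="prompt_canonical_demo.py")
    result = unittest.TestResult()
    suite.run(result)

print(result.testsRun, result.wasSuccessful())
```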

In [None]:
file_iterator = 0
output_dir = Path("prompt_canonical_testSuite")           # Folder that holds one .py file per problem
output_dir.mkdir(exist_ok=True)

for row in dfHumanEval_pandas.itertuples(index=False):    # The loop to iterate over the entire dataset

    prompt = row[1]                                       # The prompt is stored in the variable "prompt"

    # The LLM is provided with this along with a small instruction prompt
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=["Using Python standard unittest format, create 10 test cases as a complete test suite for the following prompt. Do not create the solution function of the prompt in any way. ", prompt]
    )
    test_suite_text = response.text                       # The generated test suite, passed to the second call below

    canonical = row[2]                                    # The canonical ground truth value is stored in variable "canonical"

    # The LLM is again told to append the ground truth
    # with its generated test suite to complete the python file
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[test_suite_text, "Take the following ground truth solution function and append it above the test suite generated by you. ", canonical],
        config=types.GenerateContentConfig(thinking_config=types.ThinkingConfig(thinking_budget=0))
    )

    # We do a few string operations so that the evaluations can be run.
    # The '#' comments out the opening fence line (e.g. ```python); the replace then removes the fence markers.
    response_text = response.text or ""
    response_text = '#' + response_text
    response_text = response_text.replace("```", "")

    # We now store the entire test suite with ground truths in python files for each problem
    # Thus when loop completes, there will be 164 .py files to run the evaluations on them
    file_name = output_dir / f"test_prompt_canonical{file_iterator}.py"
    with open(file_name, 'w', encoding='utf-8') as file:
        file.write(response_text)                         # Write the cleaned text, not the raw response

    file_iterator += 1
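Prefixing `#` and deleting the backticks works when the response is a single fenced block, but any prose the LLM writes before or after the fence would end up in the `.py` file. A stricter alternative, sketched here under the assumption that the response uses Markdown code fences, keeps only the fenced contents. The names `FENCE_RE` and `extract_code` are hypothetical.

```python
import re

# Matches ```python ... ``` (or a bare ``` fence) and captures what is inside.
FENCE_RE = re.compile(r"```(?:python|py)?[ \t]*\n(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_code(response_text: str) -> str:
    """Return the joined contents of all fenced blocks, or the text itself if none."""
    blocks = FENCE_RE.findall(response_text)
    if not blocks:
        return response_text.strip() + "\n"
    return "\n".join(block.strip("\n") for block in blocks) + "\n"


# Made-up response text, for illustration only.
sample = "Here is the suite:\n```python\nimport unittest\n\nx = 1\n```\nHope this helps."
print(extract_code(sample))
```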

> #### Here ends the 2nd instruction within Part 1.
## Now we commence with the 3rd instruction in the Part 1 of our assignment. 

> Here we deal with evaluations using the instructed metrics.

> We have our test suites with the ground truth values above them from the dataset. All are stored in a separate python file for each problem, so that each problem and its corresponding test suite can be run separately.

> The following command runs the test suite of one arbitrarily **selected problem (problem number 6)**.

In [7]:
!python -m unittest prompt_canonical_testSuite/prompt_canonical6.py -v

test_complex_single_group_with_branches (prompt_canonical_testSuite.prompt_canonical6.TestParseNestedParens.test_complex_single_group_with_branches)
Test a single complex group with branching parentheses. ... ok
test_docstring_example (prompt_canonical_testSuite.prompt_canonical6.TestParseNestedParens.test_docstring_example)
Test the example provided in the function's docstring. ... ok
test_empty_string (prompt_canonical_testSuite.prompt_canonical6.TestParseNestedParens.test_empty_string)
Test with an empty input string. ... ok
test_more_complex_and_longer_input (prompt_canonical_testSuite.prompt_canonical6.TestParseNestedParens.test_more_complex_and_longer_input)
Test a longer input string with varied and complex groups. ... FAIL
test_multiple_groups_all_level_one (prompt_canonical_testSuite.prompt_canonical6.TestParseNestedParens.test_multiple_groups_all_level_one)
Test multiple groups, each with one level of nesting. ... ok
test_multiple_groups_mixed_simple (prompt_canonical_testSui

> Here we run the test suites of the entire dataset together, just to see whether this works.

In [13]:
# We show the directory where my test Suites are located
test_dir = os.path.join(os.getcwd(), 'prompt_canonical_testSuite')

# We create a TestLoader instance
loader = unittest.TestLoader()

# We use discover function to give the name pattern of the test Suite files in the specified directory
suite = loader.discover(start_dir=test_dir, pattern='prompt_canonical*.py')

# We use io object to get the values of the result.
stream = io.StringIO()

# The runner.run() method returns a TestResult object.
runner = unittest.TextTestRunner(stream=stream, verbosity=2)
result = runner.run(suite)

# We calculate the counts from the above object
succeeded_tests = result.testsRun - len(result.failures) - len(result.errors)
failed_tests = len(result.failures)
error_tests = len(result.errors)

print("\n--- Test Summary ---")
print(f"Succeeded: {succeeded_tests}")
print(f"Failed: {failed_tests}")
print(f"Errors: {error_tests}")

# The original output from the test result is viewed here.
print("\n--- Full Test Output ---")
print(stream.getvalue())


--- Test Summary ---
Succeeded: 1588
Failed: 47
Errors: 3

--- Full Test Output ---
test_01_no_close_elements_from_docstring (prompt_canonical0.TestHasCloseElements.test_01_no_close_elements_from_docstring)
Test case from the docstring: numbers are clearly far apart. ... ok
test_02_close_elements_from_docstring (prompt_canonical0.TestHasCloseElements.test_02_close_elements_from_docstring)
Test case from the docstring: a pair of numbers is closer than the threshold. ... ok
test_03_empty_list (prompt_canonical0.TestHasCloseElements.test_03_empty_list)
Test with an empty list of numbers. ... ok
test_04_single_element_list (prompt_canonical0.TestHasCloseElements.test_04_single_element_list)
Test with a list containing only one number. ... ok
test_05_two_elements_not_close (prompt_canonical0.TestHasCloseElements.test_05_two_elements_not_close)
Test with exactly two elements that are clearly not closer than the threshold. ... ok
test_06_two_elements_are_close (prompt_canonical0.TestHasClose

> Now we calculate the **validity rate of each problem**. Then we calculate the **average validity rate** of the whole dataset.

#### The average validity rate of the whole dataset is **97.01%** using our gemini-2.5-flash LLM.
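The average reported here is a mean of per-problem rates, which is not the same as the pooled rate over all tests: the pooled rate weights every test equally, while the per-problem mean weights every problem equally. A minimal sketch of both follows; the pooled figure uses the counts from the whole-dataset run earlier in this notebook, while the `per_problem` list is hypothetical, as is the helper name `validity_rate`.

```python
def validity_rate(passed: int, total: int) -> float:
    """Percentage of tests that pass; 0.0 when a suite has no tests."""
    return 100.0 * passed / total if total else 0.0


# Pooled rate from the whole-dataset run: 1588 succeeded, 47 failed, 3 errors.
pooled = validity_rate(1588, 1588 + 47 + 3)
print(round(pooled, 2))  # 96.95

# Hypothetical per-problem (passed, total) counts, for illustration only.
per_problem = [(10, 10), (9, 10), (4, 5)]
macro = sum(validity_rate(p, t) for p, t in per_problem) / len(per_problem)
pooled_small = validity_rate(sum(p for p, _ in per_problem),
                             sum(t for _, t in per_problem))
print(round(macro, 2), round(pooled_small, 2))  # 90.0 92.0
```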

In [38]:
# We show the directory where my test Suites are located
test_dir = os.path.join(os.getcwd(), 'prompt_canonical_testSuite')

number_of_problems = 164
total_validity = 0                                                                 # To calculate the average validity of the dataset
for prob_number in range(number_of_problems):                                    # We use this loop to get validity rate of each problem
    print (f"\n\nStarting a new test Suite for a new problem: {prob_number}")
    # We create a TestLoader instance
    loader = unittest.TestLoader()
    # We load the test suite of this problem by its module name from the specified directory
    file_name = f'prompt_canonical{prob_number}.py'
    if test_dir not in sys.path:                                                   # Add the directory only once
        sys.path.append(test_dir)
    module_name = os.path.basename(file_name).removesuffix('.py')
    suite = loader.loadTestsFromNames([module_name])
    
    # We use io object to get the values of the result.
    stream = io.StringIO()
    
    # The runner.run() method returns a TestResult object.
    runner = unittest.TextTestRunner(stream=stream, verbosity=2)
    result = runner.run(suite)
    
    # We calculate the counts from the above object
    succeeded_tests = result.testsRun - len(result.failures) - len(result.errors)
    failed_tests = len(result.failures)
    error_tests = len(result.errors)
    
    print("--- Test Summary ---")
    print(f"Succeeded: {succeeded_tests}")
    print(f"Failed: {failed_tests}")
    print(f"Errors: {error_tests}")
    
    # Calculating validity of individual problem here and reporting (0 if no test was run).
    validity = succeeded_tests / result.testsRun * 100 if result.testsRun else 0.0
    print(f"Validity Rate: {validity}%")
    total_validity += validity
    
    # The original output from the test suite is viewed here.
    print("\n--- Full TestSuite Output for this problem:---")
    print(stream.getvalue())

avg_validity = round(total_validity / number_of_problems, 2)
print (f"\n\nAverage validity rate for the entire Human-Eval dataset: {avg_validity}%\n")



Starting a new test Suite for a new problem: 0
--- Test Summary ---
Succeeded: 10
Failed: 0
Errors: 0
Validity Rate: 100.0%

--- Full TestSuite Output for this problem:---
test_01_no_close_elements_from_docstring (prompt_canonical0.TestHasCloseElements.test_01_no_close_elements_from_docstring)
Test case from the docstring: numbers are clearly far apart. ... ok
test_02_close_elements_from_docstring (prompt_canonical0.TestHasCloseElements.test_02_close_elements_from_docstring)
Test case from the docstring: a pair of numbers is closer than the threshold. ... ok
test_03_empty_list (prompt_canonical0.TestHasCloseElements.test_03_empty_list)
Test with an empty list of numbers. ... ok
test_04_single_element_list (prompt_canonical0.TestHasCloseElements.test_04_single_element_list)
Test with a list containing only one number. ... ok
test_05_two_elements_not_close (prompt_canonical0.TestHasCloseElements.test_05_two_elements_not_close)
Test with exactly two elements that are clearly not closer 

> Now we evaluate **coverage** in the same way, over the entire dataset.

> The coverage percentage of each problem appears in the `Cover` column. The **average coverage value** is given at the bottom.

> The lines missed by each test suite are listed in the `Missing` column for each problem.

#### The average coverage of the whole dataset is **95.0%** using our gemini-2.5-flash LLM.
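The per-file `Cover` values can be averaged in two ways: a plain mean over files, or a statement-weighted figure, which is what the `TOTAL` row of `coverage report` shows. A minimal sketch that parses report text and computes both is given below. The three sample rows are copied from the report in this notebook; the names `ROW_RE` and `summarize` are hypothetical.

```python
import re

# Three rows copied from the report in this notebook (used only as sample input).
REPORT = r"""
prompt_canonical_testSuite\test_prompt_canonical0.py        54      1    98%   141
prompt_canonical_testSuite\test_prompt_canonical1.py        77      5    94%   231-236
prompt_canonical_testSuite\test_prompt_canonical2.py        26      1    96%   92
"""

# Captures: file name, Stmts, Miss, Cover (without the % sign).
ROW_RE = re.compile(r"^(\S+\.py)\s+(\d+)\s+(\d+)\s+(\d+)%", re.MULTILINE)


def summarize(report: str) -> tuple[float, float]:
    """Return (mean of per-file Cover, statement-weighted coverage) in percent."""
    rows = [(int(s), int(m), int(c)) for _, s, m, c in ROW_RE.findall(report)]
    if not rows:
        raise ValueError("no coverage rows found")
    mean_cover = sum(c for _, _, c in rows) / len(rows)
    stmts = sum(s for s, _, _ in rows)
    missed = sum(m for _, m, _ in rows)
    weighted = 100.0 * (stmts - missed) / stmts if stmts else 100.0
    return mean_cover, weighted


mean_cover, weighted = summarize(REPORT)
print(round(mean_cover, 2), round(weighted, 2))  # 96.0 95.54
```

Because each file holds both the solution and its tests, these figures include the test code itself, not only the function under test.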

In [62]:
!coverage run --source=prompt_canonical_testSuite/ -m unittest discover
!coverage report -m

......................................................F.............F............................................................................................F..F..F........................................................F......F............F................................................F.........F....F.................F........................................................F...................................................................F.....FFF.......................F...............................................................................................................................................................................................................................F..........................E.......F..................F.................F..........................F.F...F................................................................................................................................................FF.F.....................F.......F.........

Name                                                     Stmts   Miss  Cover   Missing
--------------------------------------------------------------------------------------
prompt_canonical_testSuite\__init__.py                       0      0   100%
prompt_canonical_testSuite\test_prompt_canonical0.py        54      1    98%   141
prompt_canonical_testSuite\test_prompt_canonical1.py        77      5    94%   231-236
prompt_canonical_testSuite\test_prompt_canonical2.py        26      1    96%   92
prompt_canonical_testSuite\test_prompt_canonical3.py        42      1    98%   95
prompt_canonical_testSuite\test_prompt_canonical4.py        59      1    98%   161
prompt_canonical_testSuite\test_prompt_canonical5.py        64      1    98%   101
prompt_canonical_testSuite\test_prompt_canonical6.py        36      1    97%   94
prompt_canonical_testSuite\test_prompt_canonical7.py        67      1    99%   97
prompt_canonical_testSuite\test_prompt_canonical8.py        32      1    97%   81
pro

In [66]:
!mutmut run --path-to-mutate

Usage: mutmut run [OPTIONS] [ARGUMENT]
Try 'mutmut run -h' for help.

Error: No such option: --path-to-mutate (Possible options: --paths-to-exclude, --paths-to-mutate, --post-mutation)
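The error message lists `--paths-to-mutate` (plural) among the valid options, and that option expects a path as its value. A minimal sketch of building such a command is shown below, assuming mutmut 2.x and assuming its `--runner` option is used to replace the default pytest-based test command with `unittest`. The helper name `mutmut_command`, the target file, and the runner string are hypothetical and must be adapted to the real layout.

```python
import shlex


def mutmut_command(path_to_mutate: str, runner: str) -> list[str]:
    """Build a `mutmut run` argument list (option names assume mutmut 2.x)."""
    return ["mutmut", "run", "--paths-to-mutate", path_to_mutate, "--runner", runner]


# Hypothetical target file and runner; adjust both before running.
cmd = mutmut_command(
    "prompt_canonical_testSuite/prompt_canonical0.py",
    "python -m unittest prompt_canonical_testSuite/prompt_canonical0.py",
)
print(shlex.join(cmd))
# To execute it from Python: subprocess.run(cmd, check=False)
```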
