## Introduction

This Jupyter notebook evaluates the consistency of the ML test evaluation performed by ChatGPT, based on the research paper (Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023)). \
It serves two purposes: comparing application performance before and after a checklist modification, and comparing application performance when the model settings change.

### Libraries

In [1]:
import sys
sys.path.append("../test_creation/")

In [2]:
from analyze import TestEvaluator

import itertools
import json
import pandas as pd

## Data

Please specify the `test_functions_directory` below to load the ML test code base for the evaluation.\
The loaded test functions will be further split.

In [3]:
test_functions_directory = '../../../lightfm/tests'

## Parameters

Please specify the parameters to be evaluated, e.g. the checklist and the corresponding model settings.

In [1]:
models = dict()

In [4]:
# temperatures = [0.1]
# models = ['gpt-3.5-turbo']
# roles = ['a senior machine learning engineer who specializes in performing Machine Learning system testing']

In [5]:
# human_message = """
# Your task is to answer each question in the checklist using only the provided test functions. Do not disclose your work for this step.
# Then, decide the completion score in a fraction format based on your answers.
# Desired JSON format:
# {
#     "Checklist Evaluation":
#         "ID": 
#         "Requirement": -||-
#         "Evaluation": Satisfied/Partially Satisfied/Not Satisfied
#     "Completeness Score": 
#         "Number of satisfied requirements":
#         "Number of partially satisfied requirements":
#         "Number of not satisfied requirements":
#         "Number of requirements":
# }
# """

In [6]:
# user_prompts = [human_message]

In [7]:
checklist_real_before = './checklist_demo1.yaml'
# # Before Prompt Engineering
# '''
# 2.1: Verify the function for loading data files load the file if the files exists with the right format, and doesn't load the file if it doesn't exist, and that it returns the expected results.
# 2.2: Verify the functions for saving data and figures can write as expected. They should check the if the write operation is successfully carried out, and the content is in an expected format.
# 3.1: Ensure that all data files are non-empty and contain the necessary data to proceed with the analysis or processing tasks.
# 3.2: Check that the data to be ingested is in the format expected by the processing algorithms (e.g., Is the CSV loaded as a `pd.DataFrame`? Is the image file loaded as a `np.array`, or a `PIL.Image`?) and that their structure matches the expected schema, any present.
# '''

In [8]:
checklist_real_after = './checklist_demo2.yaml'
# # After Prompt Engineering
# """
# 2.1: Ensure that data-loading functions correctly load files when they exist and match the expected format, handle non-existent files appropriately, and return the expected results.
# 2.2: Verify that functions for saving data and figures perform write operations correctly, checking that the operation succeeds and the content matches the expected format.
# 3.1: Ensure all data files are non-empty and contain the necessary data required for further analysis or processing tasks.
# 3.2: Verify that the data to be ingested matches the format expected by processing algorithms (like pd.DataFrame for CSVs or np.array for images) and adheres to the expected schema.
# """

In [9]:
checklists = [checklist_real_after] # checklist_real_before, 

In [10]:
params = [
    {
        'Param_Set_ID': i,
        'checklist': item[0],
        # 'temperature': item[1],
        # 'model': item[2],
        # 'role': item[3],
        # 'user_prompt': item[4],
    } for i, item in enumerate(itertools.product(
        checklists,
        # temperatures, 
        # models,
        # roles,
        # user_prompts,
    ))
]

In [11]:
pd.DataFrame(params)

|  | Param_Set_ID | checklist |
|---|---|---|
| 0 | 0 | ./checklist_demo2.yaml |
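To see how the grid grows once the commented-out dimensions are enabled, here is a minimal sketch of the same `itertools.product` pattern with several dimensions. The helper `build_param_grid` and the temperature and model values are illustrative assumptions, not settings used in this notebook.

```python
import itertools

# Hypothetical values for illustration only; the notebook currently varies just the checklist.
demo_checklists = ['./checklist_demo1.yaml', './checklist_demo2.yaml']
demo_temperatures = [0.1, 0.7]
demo_models = ['gpt-3.5-turbo']


def build_param_grid(**dimensions):
    """Return one dict per combination of the given dimensions, tagged with a Param_Set_ID."""
    names = list(dimensions)
    return [
        {'Param_Set_ID': i, **dict(zip(names, combo))}
        for i, combo in enumerate(itertools.product(*dimensions.values()))
    ]


params_demo = build_param_grid(
    checklist=demo_checklists,
    temperature=demo_temperatures,
    model=demo_models,
)
print(len(params_demo))  # 2 checklists x 2 temperatures x 1 model = 4
```

Each added dimension multiplies the number of parameter sets, and every set is run `num_test_runs` times, so the number of API calls grows quickly.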


## API Running

Combine the data, prompts, and parameters, send them to the OpenAI API for the test runs, and collect the responses.

In [12]:
num_test_runs = 2

In [13]:
# def extract_json(response, start='{', end='}'):
#     start_idx = response.index(start)
#     end_idx = response[::-1].index(end)
#     if end_idx == 0:
#         string = response[start_idx:]
#     else:
#         string = response[start_idx:-end_idx]
#     return json.loads(string)
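The commented-out `extract_json` above slices from the first `{` to the last `}` and passes the result to `json.loads`, which raises an "Extra data" error whenever a valid JSON value is followed by more content (for example, two objects in one response). A possible alternative, sketched here under the assumption that only the first JSON object in the response is wanted, uses `json.JSONDecoder.raw_decode`, which stops at the end of the first complete value. The name `extract_first_json` is hypothetical.

```python
import json


def extract_first_json(response, start='{'):
    """Parse the first JSON object found in `response`, ignoring any text around it."""
    start_idx = response.index(start)  # raises ValueError if no '{' is present
    # raw_decode returns (parsed_value, index_where_parsing_stopped)
    obj, _end = json.JSONDecoder().raw_decode(response[start_idx:])
    return obj


print(extract_first_json('Answer: {"ID": "2.1", "Score": 0} {"ID": "2.2"} done'))
```

Whether dropping the trailing content is acceptable depends on the response format; if several objects are expected, they would need to be decoded in a loop instead.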

In [15]:
results = []
for param in params:
    # evaluation test run
    evaluator = TestEvaluator(test_functions_directory)
    evaluator.load_checklist(param['checklist'])
    
    for i in range(num_test_runs):
        result = dict()
        evaluator.evaluate()

        result['report'] = evaluator.evaluation_result
        result['Param_Set_ID'] = param['Param_Set_ID']
        result['Test_No'] = i+1
        results.append(result)

  0%|                                                                                                       | 0/6 [00:00<?, ?it/s]

../../../lightfm/tests/test_fast_functions.py
# splits: 6


 17%|███████████████▊                                                                               | 1/6 [00:06<00:32,  6.50s/it]

../../../lightfm/tests/test_movielens.py
# splits: 6


 33%|███████████████████████████████▋                                                               | 2/6 [00:46<01:43, 25.94s/it]

../../../lightfm/tests/test_datasets.py
# splits: 6


 50%|███████████████████████████████████████████████▌                                               | 3/6 [00:54<00:54, 18.17s/it]

../../../lightfm/tests/test_cross_validation.py
# splits: 6


 67%|███████████████████████████████████████████████████████████████▎                               | 4/6 [01:04<00:29, 14.85s/it]

../../../lightfm/tests/test_evaluation.py
# splits: 6


 83%|███████████████████████████████████████████████████████████████████████████████▏               | 5/6 [01:14<00:12, 12.84s/it]

../../../lightfm/tests/test_data.py
# splits: 6


100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [01:21<00:00, 13.54s/it]
  0%|                                                                                                       | 0/6 [00:00<?, ?it/s]

../../../lightfm/tests/test_fast_functions.py
# splits: 6


  0%|                                                                                                       | 0/6 [00:08<?, ?it/s]


JSONDecodeError: Extra data: line 1 column 22 (char 21)
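A single malformed response aborts the whole loop, as the error above shows. One way to make the runs more resilient is to retry a failed call a bounded number of times and record the errors. This is a minimal sketch, not the notebook's method: `run_with_retries` is a hypothetical helper, and the demo uses `json.loads` on canned strings as a stand-in for the evaluation call.

```python
import json


def run_with_retries(run_once, max_attempts=3):
    """Call `run_once()` until it succeeds or `max_attempts` is reached.

    Returns (result, errors); result is None if every attempt failed.
    """
    errors = []
    for _ in range(max_attempts):
        try:
            return run_once(), errors
        except json.JSONDecodeError as exc:
            errors.append(str(exc))
    return None, errors


# Stand-in for an evaluation call that fails once, then returns a parsed report.
canned_responses = iter(['{"ID": "2.1"} extra', '{"ID": "2.1"}'])
result, errors = run_with_retries(lambda: json.loads(next(canned_responses)))
print(result, len(errors))
```

In the loop above, `run_once` could wrap `evaluator.evaluate()` and return `evaluator.evaluation_result`, assuming the evaluator can safely be called again after a failed attempt; the notebook does not confirm this.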

In [20]:
results[0]['report']

[{'file': '../../../lightfm/tests/test_fast_functions.py',
  'report': [{'ID': '2.1',
    'Title': 'Ensure Data File Loads as Expected',
    'Requirement': 'Ensure that data-loading functions correctly load files when they exist and match the expected format, handle non-existent files appropriately, and return the expected results.',
    'Observation': 'The test functions do not directly involve loading data files. They operate on a sparse matrix created from a numpy array.',
    'Functions': ['test_in_positives'],
    'Evaluation': 'Not Satisfied',
    'Score': 0,
    'file': '../../../lightfm/tests/test_fast_functions.py'},
   {'ID': '2.2',
    'Title': 'Ensure Saving Data/Figures Function Works as Expected',
    'Requirement': 'Verify that functions for saving data and figures perform write operations correctly, checking that the operation succeeds and the content matches the expected format.',
    'Observation': 'The test functions do not involve saving data or figures.',
    'Func

In [275]:
for i in range(len(results)):
    results[i]['report'] = extract_json(results[i]['report'])

In [298]:
# results[2]['report']

In [277]:
results_df = pd.DataFrame(results)
results_df = pd.concat([results_df, results_df['report'].apply(pd.Series)], axis=1)
results_df = results_df.drop(columns='report')
results_df

|  | Param_Set_ID | Test_No | Checklist Evaluation | Completeness Score |
|---|---|---|---|---|
| 0 | 0 | 1 | [{'ID': '2.1', 'Requirement': 'Verify the func... | {'Number of satisfied requirements': 3, 'Numbe... |
| 1 | 0 | 2 | [{'ID': '2.1', 'Requirement': 'Verify the func... | {'Number of satisfied requirements': 3, 'Numbe... |
| 2 | 0 | 3 | [{'ID': '2.1', 'Requirement': 'Verify the func... | {'Number of satisfied requirements': 3, 'Numbe... |
| 3 | 0 | 4 | [{'ID': '2.1', 'Requirement': 'Verify the func... | {'Number of satisfied requirements': 3, 'Numbe... |
| 4 | 0 | 5 | [{'ID': '2.1', 'Requirement': 'Verify the func... | {'Number of satisfied requirements': 2, 'Numbe... |


## Result & Evaluation

The evaluation will be based on 2 metrics calculated from the response:
- Completeness Score distribution: the distribution of the `num_test_runs` completeness scores for each set of parameters
- Consistency Score: for each set of parameters, the proportion of `checklist` items whose evaluation result remains the same across all `num_test_runs` runs
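The two metrics can be illustrated on toy data without pandas. The sketch below assumes, as in the completeness calculation later in this notebook, that only "Satisfied" items count toward the completeness score; the function names and the sample ratings are illustrative, not notebook output.

```python
def completeness_score(evaluations):
    """Fraction of checklist items rated 'Satisfied' in a single run."""
    return sum(e == 'Satisfied' for e in evaluations.values()) / len(evaluations)


def consistency_score(runs):
    """Fraction of checklist items whose rating is identical in every run."""
    item_ids = runs[0].keys()
    consistent = [len({run[item] for run in runs}) == 1 for item in item_ids]
    return sum(consistent) / len(consistent)


# Toy data: three runs over four checklist items.
toy_runs = [
    {'2.1': 'Satisfied', '2.2': 'Partially Satisfied', '3.1': 'Satisfied', '3.2': 'Not Satisfied'},
    {'2.1': 'Satisfied', '2.2': 'Not Satisfied', '3.1': 'Satisfied', '3.2': 'Not Satisfied'},
    {'2.1': 'Satisfied', '2.2': 'Satisfied', '3.1': 'Satisfied', '3.2': 'Not Satisfied'},
]

print([completeness_score(run) for run in toy_runs])  # [0.5, 0.5, 0.75]
print(consistency_score(toy_runs))  # 0.75 (only item 2.2 varies between runs)
```

Note that an item can be consistent without being satisfied: item 3.2 is rated "Not Satisfied" in every run, so it counts toward consistency but not toward completeness.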

In [296]:
completeness_score_df = results_df.drop(columns='Checklist Evaluation')
completeness_score_df = pd.concat([completeness_score_df, completeness_score_df['Completeness Score'].apply(pd.Series)], axis=1)
completeness_score_df['completeness_score'] = completeness_score_df['Number of satisfied requirements'] / completeness_score_df['Number of requirements']
completeness_score_df = completeness_score_df.pivot(index='Test_No', columns='Param_Set_ID', values='completeness_score')
# completeness_score_df.reset_index()
# completeness_score_df.columns.name = 'completeness_score'

In [297]:
completeness_score_df.T

| Param_Set_ID (rows) / Test_No (columns) | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 0.75 | 0.75 | 0.75 | 0.75 | 0.5 |


In [304]:
consistency_df = results_df.drop(columns='Completeness Score')
consistency_df = consistency_df.explode('Checklist Evaluation')
consistency_df = pd.concat([consistency_df, consistency_df['Checklist Evaluation'].apply(pd.Series)], axis=1)
consistency_df = consistency_df.pivot(index=['Param_Set_ID', 'ID'], columns=['Test_No'], values=['Evaluation'])['Evaluation']
consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(1)
# consistency_df = consistency_df.reset_index().drop(columns=['Param_Set_ID'])
consistency_df.columns.name = None

In [305]:
consistency_df

| Param_Set_ID | ID | 1 | 2 | 3 | 4 | 5 | consistency |
|---|---|---|---|---|---|---|---|
| 0 | 2.1 | Satisfied | Satisfied | Satisfied | Satisfied | Satisfied | True |
| 0 | 2.2 | Partially Satisfied | Not Satisfied | Not Satisfied | Satisfied | Not Satisfied | False |
| 0 | 3.1 | Satisfied | Satisfied | Satisfied | Satisfied | Satisfied | True |
| 0 | 3.2 | Partially Satisfied | Partially Satisfied | Partially Satisfied | Partially Satisfied | Partially Satisfied | True |


In [303]:
# consistency_df = consistency_df.reset_index().rename(columns={"ID": "Checklist_ID"})
# consistency_df = consistency_df.groupby(['Param_Set_ID'])[['consistency']].mean()
# consistency_df