## Introduction

This Jupyter notebook evaluates the consistency of the ML test evaluation performed by ChatGPT based on the research paper by Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023). \
It is used to compare application performance before and after a checklist modification, and before and after a change in model settings.

### Libraries

In [1]:
import sys
sys.path.append("../test_creation/")

In [42]:
from analyze import TestEvaluator

import pandas as pd

## Data

Set `test_functions_directory` below to the directory that contains the ML test code base to evaluate.\
The loaded test functions are then split further.

In [3]:
test_functions_directory = '../../../lightfm/tests'

## Parameters

Please specify the parameters (e.g. the checklist) and the corresponding models to be evaluated.

In [25]:
models = []

In [5]:
# temperatures = [0.1]
# models = ['gpt-3.5-turbo']

In [26]:
checklist_directory = '../../checklist/checklist_demo.yaml'

In [27]:
name = 'checklist_demo_1'
evaluator = TestEvaluator(test_functions_directory)
evaluator.load_checklist(checklist_directory)
models.append({'name': name, 'model': evaluator})

In [28]:
name = 'checklist_demo_2'
evaluator = TestEvaluator(test_functions_directory)
evaluator.load_checklist(checklist_directory)
models.append({'name': name, 'model': evaluator})

In [29]:
models

[{'name': 'checklist_demo_1', 'model': <analyze.TestEvaluator at 0x15a9f2c90>},
 {'name': 'checklist_demo_2', 'model': <analyze.TestEvaluator at 0x15a9f2c60>}]

In [30]:
pd.DataFrame(models)

,name,model
0,checklist_demo_1,<analyze.TestEvaluator object at 0x15a9f2c90>
1,checklist_demo_2,<analyze.TestEvaluator object at 0x15a9f2c60>


## API Running

Combine the data, prompts and parameters, send them to the OpenAI API for the test runs, and fetch the responses.

In [32]:
num_test_runs = 2

In [None]:
# def extract_json(response, start='{', end='}'):
#     start_idx = response.index(start)
#     end_idx = response[::-1].index(end)
#     if end_idx == 0:
#         string = response[start_idx:]
#     else:
#         string = response[start_idx:-end_idx]
#     return json.loads(string)

In [33]:
results = []
for item in models:
    model = item['model']
    for i in range(num_test_runs):
        # Re-run the evaluation so that each test run produces its own response.
        model.evaluate()

        # Record the score and the full report for this (model, run) pair.
        results.append({
            'score': model.get_completeness_score(score_format='number'),
            'report': model.evaluation_report,
            'model_name': item['name'],
            'test_no': i + 1,
        })

100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:50<00:00,  8.46s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:43<00:00,  7.19s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:47<00:00,  7.92s/it]
100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:49<00:00,  8.26s/it]


In [None]:
# results

In [34]:
results_df = pd.DataFrame(results)
results_df

,score,report,model_name,test_no
0,1.0,...,checklist_demo_1,1
1,1.0,...,checklist_demo_1,2
2,1.0,...,checklist_demo_2,1
3,1.0,...,checklist_demo_2,2


## Result & Evaluation

The evaluation is based on two metrics calculated from the responses:
- Completeness Score distribution: the distribution of the `num_test_runs` completeness scores for each set of parameters
- Consistency Score: out of all `checklist` items, the proportion whose results remain consistent across the `num_test_runs` runs, for each set of parameters
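
As a rough illustration of the Consistency Score described above, the sketch below computes it in plain Python. The function name `consistency_score` and the toy run data are hypothetical and are not part of `analyze.TestEvaluator`; the sketch assumes each run can be reduced to a mapping from checklist item ID to its `is_Satisfied` value.

```python
def consistency_score(runs):
    """Proportion of checklist items whose result is identical in every run.

    `runs` is a list of dicts mapping checklist item ID -> is_Satisfied value,
    one dict per test run (a hypothetical structure used for illustration).
    """
    if not runs:
        raise ValueError("at least one run is required")
    first = runs[0]
    item_ids = list(first)
    if not item_ids:
        raise ValueError("runs must contain at least one checklist item")
    # Count the items for which every run agrees with the first run.
    consistent = sum(
        all(run.get(item_id) == first[item_id] for run in runs)
        for item_id in item_ids
    )
    return consistent / len(item_ids)


# Toy data: two runs over four checklist items; item "2.1" flips between runs.
toy_runs = [
    {"1.1": 1.0, "1.2": 1.0, "2.1": 1.0, "5.1": 1.0},
    {"1.1": 1.0, "1.2": 1.0, "2.1": 0.0, "5.1": 1.0},
]
toy_score = consistency_score(toy_runs)  # 3 of the 4 items agree
```

A single run is trivially consistent with itself, so the score only becomes informative when `num_test_runs` is at least 2.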

In [35]:
completeness_score_df = results_df.drop(columns='report')
completeness_score_df = completeness_score_df.pivot(index='model_name', columns='test_no', values='score')

In [36]:
completeness_score_df

test_no,1,2
model_name,,
checklist_demo_1,1.0,1.0
checklist_demo_2,1.0,1.0


In [51]:
# import matplotlib
# completeness_score_df.plot(kind='box')

In [37]:
# Stack the per-run reports into one long table, tagged with run number and model name.
reports = []
for i in results_df.index:
    result = results_df.iloc[i]['report'].reset_index()
    result['test_no'] = results_df.iloc[i]['test_no']
    result['model_name'] = results_df.iloc[i]['model_name']
    reports.append(result)
consistency_df = pd.concat(reports, axis=0, ignore_index=True)
# One row per (model, checklist item), one column per test run.
consistency_df = consistency_df.pivot(index=['model_name', 'ID'], columns=['test_no'], values=['is_Satisfied'])
consistency_df.columns = consistency_df.columns.droplevel(level=0)
consistency_df.columns.name = None
# An item is consistent when every run agrees with the first run.
consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(axis=1)

In [38]:
consistency_df

model_name,ID,1,2,consistency
checklist_demo_1,1.1,1.0,1.0,True
checklist_demo_1,1.2,1.0,1.0,True
checklist_demo_1,2.1,1.0,1.0,True
checklist_demo_1,5.1,1.0,1.0,True
checklist_demo_2,1.1,1.0,1.0,True
checklist_demo_2,1.2,1.0,1.0,True
checklist_demo_2,2.1,1.0,1.0,True
checklist_demo_2,5.1,1.0,1.0,True


In [20]:
# consistency_df.groupby(['model_name']).agg({'consistency': 'mean'})
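
To see what the per-model aggregation commented out above would produce when runs disagree, here is a small sketch on toy data. The frame `toy` and its values are made up for illustration and only mimic the layout of `consistency_df` (index levels `model_name` and `ID`, one column per test run); they are not results from the evaluator.

```python
import pandas as pd

# Toy table shaped like consistency_df: one row per (model, checklist item),
# one column per test run. The values are illustrative only.
toy = pd.DataFrame(
    {
        "model_name": ["a", "a", "b", "b"],
        "ID": ["1.1", "1.2", "1.1", "1.2"],
        1: [1.0, 0.0, 1.0, 1.0],
        2: [1.0, 1.0, 1.0, 1.0],
    }
).set_index(["model_name", "ID"])

# An item is consistent when every run agrees with the first run.
toy["consistency"] = toy.eq(toy.iloc[:, 0], axis=0).all(axis=1)

# Consistency Score per model: the share of consistent checklist items.
scores = toy.groupby(level="model_name")["consistency"].mean()
```

In this toy case model `a` has one of two items flip between runs, so its score is 0.5, while model `b` scores 1.0.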