## Introduction

This Jupyter notebook evaluates the consistency of the ML test evaluation performed by ChatGPT, which is based on the research paper (Alexander, R., Katz, L., Moore, C., & Schwartz, Z. (2023)). \
Use it to compare the application's performance before and after a change (e.g. a checklist modification or a change to the model settings).

### Libraries

In [1]:
import sys
sys.path.append("../test_creation/")

In [42]:
from analyze import TestEvaluator

import pandas as pd

## Inputs

Specify `test_functions_directory` below to load the ML test code base, then set the parameters (e.g. the checklist) and the corresponding models for evaluation.

In [25]:
models = []

In [3]:
test_functions_directory = '../../../lightfm/tests'

In [5]:
# temperatures = [0.1]
# models = ['gpt-3.5-turbo']

In [26]:
checklist_directory = '../../checklist/checklist_demo.yaml'

In [27]:
name = 'checklist_demo_1'
evaluator = TestEvaluator(test_functions_directory)
evaluator.load_checklist(checklist_directory)
models.append({'name': name, 'model': evaluator})

In [28]:
name = 'checklist_demo_2'
evaluator = TestEvaluator(test_functions_directory)
evaluator.load_checklist(checklist_directory)
models.append({'name': name, 'model': evaluator})

In [29]:
models

In [30]:
pd.DataFrame(models)

## API Running

Combine the data, prompts and parameters, send them to the OpenAI API for the test runs, and fetch the responses.

In [79]:
# # Clone the model to make sure that all the test runs are independent.
# import copy
# model_temp = copy.copy(models[0]['model'])
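
The commented-out cell above clones the model with `copy.copy`, which is a shallow copy. The sketch below shows why that may not be enough to keep test runs independent: nested mutable state is still shared after a shallow copy, while `copy.deepcopy` isolates it. `FakeEvaluator` is a hypothetical stand-in; whether `TestEvaluator` holds state like this, or can be deep-copied at all, is an assumption the notebook does not confirm.

```python
import copy


class FakeEvaluator:
    """Hypothetical stand-in for TestEvaluator; holds mutable per-run state."""

    def __init__(self):
        self.history = []  # nested mutable state, shared by shallow copies

    def evaluate(self):
        self.history.append("run")


base = FakeEvaluator()

# copy.copy duplicates only the outer object; `history` is still the same list.
shallow = copy.copy(base)
shallow.evaluate()
shallow_leaked = len(base.history)  # the run is visible on `base` too

# copy.deepcopy also duplicates nested objects, so the clone is isolated.
base.history.clear()
deep = copy.deepcopy(base)
deep.evaluate()
deep_leaked = len(base.history)  # `base` is untouched

print(shallow_leaked, deep_leaked)
```

If the evaluator does accumulate state between calls to `evaluate()`, a deep copy per run is the safer choice.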

In [69]:
class ConsistencyEvaluator:
    def __init__(self):
        self.evaluation_reports = None

    def evaluate(self, models, num_test_runs=2, verbose=False):
        """
        Run each initialized TestEvaluator model `num_test_runs` times and collect the results.
        models = [{'name': 'model_no1', 'model': {{model object}}}, ...]
        """
        results = []
        for item in models:
            if verbose:
                print(f"Model: {item['name']}")
                
            for test_no in range(num_test_runs):
                if verbose:
                    print(f'Test Run No.: {test_no+1}')
                
                result = dict()
                model = item['model']
                model.evaluate()
        
                result['score'] = model.get_completeness_score(score_format='number')
                result['report'] = model.evaluation_report
                result['model_name'] = item['name']
                result['test_no'] = test_no+1
                results.append(result)
        self.evaluation_reports = pd.DataFrame(results)
        return

    def get_completeness_score_dist(self):
        """
        Obtain the distribution of the Test Completeness scores
        """
        completeness_score_df = self.evaluation_reports.drop(columns='report')
        completeness_score_df = completeness_score_df.pivot(index='model_name', columns='test_no', values='score')
        return completeness_score_df

    def get_consistency_dist(self):
        """
        Obtain the distribution of the consistency per checklist item
        """
        frames = []
        for row in self.evaluation_reports.itertuples(index=False):
            # Tag each per-run report with its run number and model name.
            result = row.report.reset_index()
            result['test_no'] = row.test_no
            result['model_name'] = row.model_name
            frames.append(result)
        # Concatenate once instead of growing a DataFrame inside the loop.
        consistency_df = pd.concat(frames, axis=0, ignore_index=True)
        consistency_df = consistency_df.pivot(index=['model_name', 'ID'], columns=['test_no'], values=['is_Satisfied'])
        consistency_df.columns = consistency_df.columns.droplevel(level=0)
        # An item is consistent when every run matches the first run.
        consistency_df['consistency'] = consistency_df.eq(consistency_df.iloc[:, 0], axis=0).all(axis=1)
        return consistency_df

In [66]:
consistency_evaluator = ConsistencyEvaluator()
consistency_evaluator.evaluate(models, num_test_runs=5, verbose=True)

## Result & Evaluation

The evaluation is based on two metrics calculated from the responses:
- Completeness Score distribution: The distribution of the `num_test_runs` completeness scores for each set of parameters
- Consistency Score: Out of all `checklist` items, the proportion whose results remain consistent across the `num_test_runs` runs for each set of parameters
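
To make the Consistency Score concrete, here is a minimal sketch in plain Python, without pandas, so the arithmetic is visible. The item IDs and result values are made up for illustration, and `consistency_score` is a hypothetical helper, not part of the notebook's `ConsistencyEvaluator`.

```python
# Toy results: model name -> checklist item ID -> result value per run.
# All names and values below are made up for illustration.
runs = {
    "checklist_demo_1": {
        "2.1": [1.0, 1.0, 1.0],
        "3.2": [0.5, 1.0, 0.5],
        "4.2": [0.0, 0.0, 0.0],
        "5.3": [0.5, 0.5, 0.5],
    },
}


def consistency_score(items):
    """Fraction of checklist items whose result is identical in every run."""
    consistent = [len(set(values)) == 1 for values in items.values()]
    return sum(consistent) / len(consistent)


scores = {name: consistency_score(items) for name, items in runs.items()}
print(scores)
```

In this toy data, three of the four items give the same result in every run, so the score for `checklist_demo_1` is 0.75.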

In [67]:
consistency_evaluator.get_completeness_score_dist()

In [51]:
# import matplotlib
# completeness_score_df.plot(kind='box')
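
As a text-only alternative to the commented-out box plot, the spread of completeness scores per model can be summarized with the standard library. This is a sketch on made-up scores; the dictionary layout is an assumption for illustration and is not the DataFrame returned by `get_completeness_score_dist()`.

```python
import statistics

# Made-up completeness scores: model name -> one score per test run.
completeness = {
    "checklist_demo_1": [0.5, 0.5, 0.75, 0.5, 0.5],
    "checklist_demo_2": [0.25, 0.75, 0.5, 0.5, 0.25],
}

# Summarize the spread of each model's scores across its runs.
summary = {
    name: {
        "min": min(scores),
        "median": statistics.median(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }
    for name, scores in completeness.items()
}

for name, stats in summary.items():
    print(name, stats)
```

A smaller range indicates that the completeness score is more stable across runs for that set of parameters.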

In [68]:
consistency_evaluator.get_consistency_dist()

In [20]:
# consistency_df.groupby(['model_name']).agg({'consistency': 'mean'})