# Analysis of the Results

For a given set of tools, test cases are generated automatically. This notebook analyzes the quality and diversity of the generated questions, and checks whether they actually cover the requested combinations of tools.

Note that GPT may generate `functions.<function_name>` instead of just the function name; the analysis accounts for that by stripping the prefix before comparing names.
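As a minimal sketch of that normalization (the helper name `normalize_tool_name` and the tool name `get_weather` are hypothetical and only used for illustration), the prefix can be stripped only when it appears at the start of the name:

```python
def normalize_tool_name(name: str) -> str:
    """Strip a leading 'functions.' namespace, if present."""
    prefix = "functions."
    return name[len(prefix):] if name.startswith(prefix) else name


# hypothetical tool names, for illustration only
print(normalize_tool_name("functions.get_weather"))
print(normalize_tool_name("get_weather"))
```

The cells in this notebook use `str.replace('functions.', '')` instead, which behaves the same as long as `functions.` never appears in the middle of a tool name.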

## Test case generation

We asked the LLMs to generate test cases that cover all of their tools. We would like 2 tests per tool (individually) and 2 tests that require exactly 2 of the tools to be answered correctly. Let us verify whether the generated tests meet these requirements.
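To make the target concrete, here is a small sketch that enumerates the requested test targets. It assumes the "2 tests that require exactly 2 tools" requirement applies to every unordered pair of tools, which is how the coverage columns are built later in this notebook; the function name and the tool names are hypothetical.

```python
from itertools import combinations


def expected_test_targets(tools, per_tool=2, per_pair=2):
    """Map each single tool and each unordered pair to its requested test count."""
    targets = {(t,): per_tool for t in tools}
    for pair in combinations(tools, 2):
        targets[pair] = per_pair
    return targets


# hypothetical tool names, for illustration only
example_tools = ["tool_a", "tool_b", "tool_c"]
targets = expected_test_targets(example_tools)
print(len(targets), sum(targets.values()))  # 6 targets, 12 requested tests
```

Under this assumption, `n` tools yield `n + n(n-1)/2` targets, so the number of requested tests grows quadratically with the number of tools.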

For each test generation strategy, we want to check:

- Number of test cases generated per tool (individually);
- Number of test cases generated per tool (considering pairs);
- How often the model generated questions that used the tool requested;
- Manually, how many of the questions really need the tools planned.

### Set up and read files
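The loading code in this section expects file names of the form `<gen_strategy>_test_cases_<model>.json`. A minimal sketch of that parsing step, using a hypothetical helper `parse_test_file_name` and a made-up model name, could look like this:

```python
def parse_test_file_name(file_name: str):
    """Split '<gen_strategy>_test_cases_<model>.json' into (gen_strategy, model)."""
    suffix = ".json"
    stem = file_name[:-len(suffix)] if file_name.endswith(suffix) else file_name
    strategy, sep, model = stem.partition("_test_cases_")
    if not sep:
        # the marker is missing, so this is not a generated test file
        raise ValueError(f"unexpected file name: {file_name}")
    return strategy, model


# 'some-model' is a placeholder, not a real model identifier
print(parse_test_file_name("only_selected_test_cases_some-model.json"))
```

Raising on unexpected names makes stray `.json` files in the working directory fail loudly instead of producing a confusing `IndexError`.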

In [None]:
import os
import json
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
test_gen_strategies = [
    'use_all',
    'only_selected',
    'selected_with_dummies',
]

In [None]:
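# only pick up generated test files; other .json files would break the file name parsing below
test_files = sorted(x for x in os.listdir() if x.endswith('.json') and '_test_cases_' in x)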

In [None]:
def read_generated_tests():
    """Load every test file into one dataframe, tagged with strategy and model."""
    frames = []
    for cur_file in test_files:
        with open(cur_file, 'r') as f:
            contents = json.load(f)
        # file names look like <gen_strategy>_test_cases_<model>.json
        file_info = cur_file.split('_test_cases_')
        df = pd.DataFrame(contents)
        df['gen_strategy'] = file_info[0]
        df['model'] = file_info[1].replace('.json', '')
        frames.append(df)
    # keep the original behaviour of returning None when there are no files
    return pd.concat(frames) if frames else None


df_tests = read_generated_tests()
df_tests

In [None]:
# compute all tools that were tested (single-tool entries only);
# sorted so that the column order is stable across runs
all_tools = sorted(
    x for x in set(df_tests.expected_tool_to_gen_test.dropna()) if ',' not in x
)
all_tools

### Number of test cases generated per tool (individually)
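The cells in this subsection count tests per tool and then derive a coverage score: the fraction of tools with at least one generated test. As a toy illustration in plain Python (the function name, tool names, and counts are hypothetical, not results from the experiments), the metric works like this:

```python
def coverage_report(counts_per_tool):
    """Return (coverage, missing_tools) for one model/strategy combination.

    counts_per_tool maps each tool name to the number of generated tests.
    """
    missing = [tool for tool, n in counts_per_tool.items() if n == 0]
    covered = len(counts_per_tool) - len(missing)
    return covered / len(counts_per_tool), missing


# hypothetical counts, for illustration only
full = coverage_report({"tool_a": 2, "tool_b": 2, "tool_c": 1})
partial = coverage_report({"tool_a": 1, "tool_b": 0, "tool_c": 0})
print(full)
print(partial)
```

Note that coverage only checks whether a tool appears at all; it does not distinguish one test from the two requested.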

In [None]:
# build a dataframe with boolean flags for each tool
df_tests_per_tool = df_tests.copy()
for t in all_tools:
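    # exact name match after stripping the 'functions.' prefix;
    # a substring check would also match tools whose names contain `t`
    df_tests_per_tool[t] = df_tests_per_tool.appropriate_tools.map(
        lambda z: t in [str(n).replace('functions.', '') for n in (z if isinstance(z, list) else [z])]
    )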

cols = ['model', 'gen_strategy'] + all_tools
df_tests_per_tool[cols].groupby(by=['model', 'gen_strategy']).sum()

In [None]:
df_tests_per_tool[cols].groupby(
    by=['model', 'gen_strategy']
).sum().plot.barh(figsize=(15,15), title='Number of test cases generated per tool/strategy')

In [None]:
df_agg = df_tests_per_tool[cols].groupby(by=['model', 'gen_strategy']).sum()
df_agg['coverage'] = np.sum(df_agg[all_tools].values > 0, axis=1) / len(all_tools)
df_agg = df_agg.sort_values(by='coverage', ascending=False)
df_agg[['coverage']].plot.barh()

In [None]:
# missing coverage
all_missing = []
for idx, r in df_agg[all_tools].iterrows():
    missing_tools = [x for x in all_tools if r[x] == 0]
    all_missing.append(missing_tools)
df_agg['missing_tools'] = all_missing
df_agg[df_agg.coverage < 1][['coverage', 'missing_tools']]

### Number of test cases generated per tool (considering pairs)
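Each test is assigned to a column named after a single tool or a comma-joined pair of tools. As a sketch of that mapping (the helper `combination_key` and the tool names are hypothetical), a tool list can be translated directly into its column label, or `None` when it does not correspond to any valid column:

```python
def combination_key(tool_list, known_tools):
    """Return the column label for a tool list, or None if it fits no column."""
    prefix = "functions."
    names = [n[len(prefix):] if n.startswith(prefix) else n for n in tool_list]
    if not 1 <= len(names) <= 2:
        return None  # empty lists and triples or larger have no column
    if len(set(names)) != len(names):
        return None  # the same tool listed twice is not a pair
    if any(n not in known_tools for n in names):
        return None  # invented tool names never match
    # order the names as in known_tools so that [b, a] and [a, b] share a label
    return ",".join(sorted(names, key=known_tools.index))


# hypothetical tool names, for illustration only
known = ["tool_a", "tool_b", "tool_c"]
print(combination_key(["functions.tool_b", "tool_a"], known))
print(combination_key(["tool_a", "tool_a"], known))
```

This is an alternative formulation for illustration; the notebook itself builds one boolean flag per column instead.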

In [None]:
def detect_invented_tools(tool_list):
    """ Checks if a list has invented tools.
    Returns the names of the invented tools
    """
    invented_tools = []
    for x in tool_list:
        adjusted_name = x.replace('functions.', '')
        if adjusted_name not in all_tools:
            invented_tools.append(x)
    return ','.join(invented_tools)


def belongs_to_col(tool_list, col_tool_names):
    """ Checks if a given tool list belongs to a column
    The tool list must contain exactly the tools of the column,
    with no duplicates and no extra tools
    """
    if len(tool_list) > 2:
        return False

    if len(tool_list) == 2 and tool_list[0] == tool_list[1]:
        return False

    # compare exact names: a substring check would confuse tools with overlapping names,
    # and counting matches alone would let [a, b] count toward the single-tool column [a]
    names = [t.replace('functions.', '') for t in tool_list]
    return len(names) == len(col_tool_names) and set(names) == set(col_tool_names)

In [None]:
# build a dataframe with boolean flags for tool combinations
from itertools import combinations

# single tools first, then every unordered pair of distinct tools
all_tools_and_pairs = [[x] for x in all_tools]
all_tools_and_pairs += [list(pair) for pair in combinations(all_tools, 2)]

df_pair_tests_per_tool = df_tests.copy()
df_pair_tests_per_tool['invented_tools'] = df_pair_tests_per_tool.appropriate_tools.map(detect_invented_tools)
tool_cols = [','.join(t) for t in all_tools_and_pairs]
for t in all_tools_and_pairs:
    df_pair_tests_per_tool[','.join(t)] = df_pair_tests_per_tool.appropriate_tools.map(lambda z: belongs_to_col(z, t))

In [None]:
# df_agg.columns

In [None]:
cols = ['model', 'gen_strategy'] + tool_cols
df_pair_tests_per_tool[cols].groupby(by=['model', 'gen_strategy']).sum()

In [None]:
df_agg = df_pair_tests_per_tool[cols].groupby(by=['model', 'gen_strategy']).sum()
df_agg['coverage'] = np.sum(df_agg[tool_cols].values > 0, axis=1) / len(tool_cols)
df_agg = df_agg.sort_values(by='coverage', ascending=False)
df_agg[['coverage']].plot.barh()

In [None]:
# missing coverage and invented tools
from IPython.display import display, Math, Latex

all_missing = []
for idx, r in df_agg[tool_cols].iterrows():
    missing_tools = [x for x in tool_cols if r[x] == 0]
    all_missing.append(missing_tools)
df_agg['missing_tools'] = all_missing

# note: `coverage <= 1` keeps every row, so the table lists all model/strategy combinations
latex_tbl = df_agg[df_agg.coverage <= 1][['coverage']].to_latex(float_format="{:.2f}".format)
df_agg[df_agg.coverage <= 1][['coverage', 'missing_tools']]

In [None]:
# export for paper
# print(latex_tbl.replace('_', '\\_').replace(' - Anthropic', '').replace(' - OpenAI', '').replace(' - Bedrock', ''))

In [None]:
df_pair_tests_per_tool[df_pair_tests_per_tool['invented_tools'] != ''].groupby(['model', 'gen_strategy']).count()['question']

In [None]:
# [detect_invented_tools(x) for x in df_tests.appropriate_tools[20:40]]

In [None]:
# set([detect_invented_tools(x) for x in df_tests.appropriate_tools])

# Evaluation

Make sure to test the following cases:

- No tools were planned;
- Tool names were made up;
- The correct tools were planned;
- Only one of the correct tools was planned.
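The cases above can be sketched as a small classifier that compares a planned tool list against the expected one. This is only an illustration under stated assumptions: the function name `classify_plan`, the category labels, and the tool names are all hypothetical, and the extra `wrong_tools` label covers plans made of known tools that do not fit any of the listed cases.

```python
def classify_plan(planned, expected, known_tools):
    """Assign a planned tool list to one of the evaluation cases."""
    prefix = "functions."
    planned_names = {p[len(prefix):] if p.startswith(prefix) else p for p in planned}
    expected_names = set(expected)

    if not planned_names:
        return "no_tools_planned"
    if planned_names - set(known_tools):
        return "invented_tools"
    if planned_names == expected_names:
        return "correct"
    if planned_names < expected_names:
        # a non-empty proper subset: only some of the correct tools were planned
        return "partially_correct"
    return "wrong_tools"


# hypothetical tool names, for illustration only
known = ["tool_a", "tool_b", "tool_c"]
print(classify_plan([], ["tool_a"], known))
print(classify_plan(["functions.tool_a"], ["tool_a", "tool_b"], known))
```

Checking for invented names before checking correctness means a plan that mixes a correct tool with a made-up one is reported as `invented_tools`.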