# NOTE: 
**This demo notebook was used to generate the Test Suites and Run results recorded in the directory. Please feel free to use this as a reference when creating your own Test Suites and Test Runs.**

**Running this notebook directly will result in errors. If you want to run the notebook as is, do one of the following:**
1. Change the name of the Test Suite
2. Delete the results from the directory
3. Set the `BENCH_FILE_DIR` environment variable (which defaults to `./bench`) to a different directory. To do this, uncomment the cell below and fill in the path:

In [1]:
# Uncomment the two lines below to change the default `BENCH_FILE_DIR`
# import os
# os.environ['BENCH_FILE_DIR'] = 'FILL ME IN'
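
One way to fill in the placeholder above without touching the recorded results is to point `BENCH_FILE_DIR` at a fresh temporary directory. This is a minimal sketch: the `bench_demo_` prefix is a hypothetical name chosen for illustration, and it assumes the variable is set before the rest of the notebook runs, as in the cell above.

```python
import os
import tempfile

# Create an empty scratch directory so new results cannot collide
# with the results already recorded for this Test Suite.
# The "bench_demo_" prefix is a hypothetical name for illustration.
bench_dir = tempfile.mkdtemp(prefix="bench_demo_")

# Same assignment as the commented-out cell, with a concrete path.
os.environ["BENCH_FILE_DIR"] = bench_dir

print(os.environ["BENCH_FILE_DIR"])
```

A temporary directory is not cleaned up automatically by `mkdtemp`, so delete it yourself once you no longer need the results.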

# ArthurBench: Evaluating answers to coding questions with BERTScore distance from golden answers

In [2]:
import pandas as pd
from arthur_bench.run.testsuite import TestSuite

In [3]:
import arthur_bench

arthur_bench.__file__

'/Users/reese/Projects/arthur-bench/arthur_bench/__init__.py'

In [4]:
stack_df = pd.read_csv('stackoverflow_qa/stack_overflow_golden.csv')
stack_prompt0_df = pd.read_csv('stackoverflow_qa/gpt35_base.csv')
stack_prompt1_df = pd.read_csv('stackoverflow_qa/gpt35_engineered.csv')

In [5]:
# Keep only the first 100 space-separated words of each golden answer
stack_df['reference_output'] = stack_df['reference_output'].apply(lambda x: " ".join(x.split(" ")[:100]))
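
The truncation applied to `reference_output` can be tried in isolation. The sketch below wraps the same expression in a helper; the name `truncate_words` and the sample strings are hypothetical and only serve the illustration.

```python
def truncate_words(text: str, max_words: int = 100) -> str:
    """Keep only the first `max_words` space-separated tokens of `text`."""
    return " ".join(text.split(" ")[:max_words])


# Hypothetical sample answer, truncated to three words for readability.
sample = "use a list comprehension to filter the values"
short = truncate_words(sample, max_words=3)
print(short)

# Text shorter than the limit is returned unchanged.
unchanged = truncate_words("a b")
print(unchanged)

# Splitting on a single space counts the empty string between two
# consecutive spaces as a token.
tokens = "a  b".split(" ")
print(tokens)
```

Because the split is on a single space character, newlines stay inside tokens and repeated spaces produce empty tokens that count toward the limit. Calling `split()` with no argument would instead split on any run of whitespace, which changes both the count and the joined result.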

# Make a test suite

In [6]:
my_test_suite = TestSuite(
    'stack_dist_to_golden', 
    "bertscore",
    reference_data=stack_df, 
    input_column='input', 
    reference_column='reference_output')
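
The suite above reads the `input` and `reference_output` columns, and the runs below read `candidate_output`. A quick check that a CSV has the expected headers can catch a mismatch before a run starts. This is a sketch using only the standard library; the helper `missing_columns` and the sample CSV row are hypothetical, not part of `arthur_bench`.

```python
import csv
import io


def missing_columns(csv_text: str, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [name for name in required if name not in present]


# Hypothetical one-row file with the same headers as the golden data.
golden_csv = (
    "input,reference_output\n"
    "How do I reverse a list?,Use reversed() or slicing.\n"
)

ok = missing_columns(golden_csv, ["input", "reference_output"])
bad = missing_columns(golden_csv, ["candidate_output"])
print(ok)
print(bad)
```

With real files, pass the file contents (or adapt the helper to take an open file handle) and compare against the column names given to `TestSuite` and `run`.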

# Run the test

In [7]:
from time import time

In [10]:
start_time = time()

run0 = my_test_suite.run(
    'prompt39842',
    candidate_data=stack_prompt0_df, 
    candidate_column='candidate_output',
    batch_size=1
)

print(f"execution time: {time() - start_time:.3f} seconds")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of the model checkpoint at microsoft/deberta-v3-base were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.dense.bias']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical 

execution time: 9.156 seconds
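
If several runs need timing, the start/stop pattern above can be wrapped in a small context manager. This is a sketch, not part of `arthur_bench`: the helper `timed` is hypothetical, and it uses `time.perf_counter`, a monotonic clock intended for measuring intervals, in place of `time.time`.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        # Stored even if the block raises, so failed runs are timed too.
        results[label] = perf_counter() - start


timings = {}
with timed("demo", timings):
    sleep(0.01)  # stand-in for a call such as my_test_suite.run(...)

print(f"execution time: {timings['demo']:.3f} seconds")
```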





# Run the test on the engineered prompt and compare score distributions

In [None]:
run1 = my_test_suite.run(
    'prompt1',
    candidate_data=stack_prompt1_df, 
    candidate_column='candidate_output',
)

In [None]:
import matplotlib.pyplot as plt

# Overlay the per-test-case BERTScore distributions of the two runs
fig, ax = plt.subplots(figsize=(10,5))
ax.set_title('BERTScore from candidate answers to golden answers')
ax.hist([s.score for s in run0.test_case_outputs], label='basic prompt', alpha=0.5)
ax.hist([s.score for s in run1.test_case_outputs], label='improved prompt', alpha=0.5)
ax.set_xlabel('BERTScore')
ax.set_ylabel('Number of test cases')
ax.legend()
plt.tight_layout()
plt.show()
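
A numeric summary can complement the histogram when the two distributions overlap heavily. The sketch below uses made-up score lists purely for illustration; in the notebook they would come from `[s.score for s in run0.test_case_outputs]` and the equivalent for `run1`. The helper `summarize` is hypothetical.

```python
from statistics import mean, median


def summarize(scores: list[float]) -> dict:
    """Return basic descriptive statistics for a list of scores."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical scores, NOT results from the runs above.
basic_scores = [0.52, 0.61, 0.58, 0.49]
improved_scores = [0.63, 0.66, 0.59, 0.71]

basic_summary = summarize(basic_scores)
improved_summary = summarize(improved_scores)

for name, summary in [("basic prompt", basic_summary),
                      ("improved prompt", improved_summary)]:
    print(name, {k: round(v, 3) for k, v in summary.items()})
```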