# NOTE: 
**This demo notebook was used to generate the Test Suites and Test Run results recorded in this directory. Feel free to use it as a reference when creating your own Test Suites and Test Runs.**

**Running this notebook directly will result in errors. If you want to run the notebook as is, do one of the following:**
1. Change the name of the Test Suite 
2. Delete the results from the directory
3. Set the `BENCH_FILE_DIR` environment variable (which defaults to `./bench`) to a different directory. To do this, uncomment the cell below:

In [1]:
# Uncomment the two lines below to change the default `BENCH_FILE_DIR`
# import os
# os.environ['BENCH_FILE_DIR'] = 'FILL ME IN'
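One way to avoid colliding with the recorded results is to point `BENCH_FILE_DIR` at a fresh temporary directory before any test suite is created. The following is a minimal sketch using only the standard library. The `bench_dir` variable name is illustrative, and the sketch assumes, as the note above states, that the variable falls back to `./bench` when unset.

```python
import os
import tempfile

# Create a throwaway directory so this run does not collide with recorded results.
bench_dir = tempfile.mkdtemp(prefix="bench_")

# Set the variable before creating any test suites.
os.environ["BENCH_FILE_DIR"] = bench_dir

# Mirror the documented default: fall back to ./bench when the variable is unset.
print(os.environ.get("BENCH_FILE_DIR", "./bench"))
```

A temporary directory is convenient for experiments. For results you want to keep, a named project directory is probably the better choice.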

# ArthurBench: Evaluating Summaries by LLMs Choosing Better Responses

In this notebook, we evaluate the quality of three sets of candidate summaries for news articles against reference summaries generated by gpt-3.5-turbo. The three candidate sets are:
- paraphrases of the GPT-generated summaries
- summaries generated by an open-source model trained to summarize books
- intentionally corrupted summaries

We use bench to score whether each candidate summary is better than, worse than, or the same quality as the reference ChatGPT summary, and to highlight common failure modes of the open-source model when it is applied to a new summarization domain.

In [3]:
import pandas as pd

from arthur_bench.run.testsuite import TestSuite


In [None]:
pd.set_option('display.max_colwidth', None)

We have prepared a dataset of input news articles, reference summaries, and candidate summaries.

In [None]:
summary_data = pd.read_csv('news_summary/example_summaries.csv', index_col=0)

In [None]:
summary_data.head(5)

# Make a test suite

A bench test suite consists of the inputs to the task and the target outputs. Here, we instantiate a new test suite named `news_summary` from our data frame, indicating that the inputs are in the data frame column `input_text` and the reference outputs are in the column `gpt3mapreduce`.

This test suite uses the scoring method `summary_quality` to evaluate future runs.

In [None]:
my_test_suite = TestSuite(
    'news_summary',      # test suite name
    'summary_quality',   # scoring method
    reference_data=summary_data,
    input_column='input_text',
    reference_column='gpt3mapreduce',
)

# Run the test

Below, we create three test runs, one for each of the candidate summaries. 

In [None]:
my_test_run = my_test_suite.run(
    "longt5books", 
    candidate_data=summary_data, 
    candidate_column='longt5books'
)

In [None]:
print('Evaluator preferred gpt3mapreduce over longt5books:', int(len(my_test_run.test_case_outputs) - sum([case.score for case in my_test_run.test_case_outputs])), 'out of', len(my_test_run.test_case_outputs))


# Compare against other summary A/B tests

In [None]:
gpt3_vs_rephrase = my_test_suite.run(
    "rephrase", 
    candidate_data=summary_data, 
    candidate_column='chatgpt_rephrase_gpt3'
)

In [None]:
print('Evaluator preferred gpt3mapreduce over rephrases of gpt3mapreduce:', int(len(gpt3_vs_rephrase.test_case_outputs) - sum([case.score for case in gpt3_vs_rephrase.test_case_outputs])), 'out of', len(gpt3_vs_rephrase.test_case_outputs))


In [None]:
gpt3_vs_corrupt = my_test_suite.run(
    "corrupt", 
    candidate_data=summary_data, 
    candidate_column='chatgpt_corrupt_longt5',
)

In [None]:
print('Evaluator preferred gpt3mapreduce over corrupted longt5 summaries:', int(len(gpt3_vs_corrupt.test_case_outputs) - sum([case.score for case in gpt3_vs_corrupt.test_case_outputs])), 'out of', len(gpt3_vs_corrupt.test_case_outputs))
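The three print cells above repeat the same arithmetic, so it may be clearer to factor it into a small helper. The sketch below is not part of bench. `count_reference_preferred` is a hypothetical name, the example scores are made up, and it assumes each score is 1 when the candidate was preferred and 0 when the reference was preferred, which is what the subtraction in the cells above implies.

```python
from typing import Iterable, Tuple


def count_reference_preferred(scores: Iterable[float]) -> Tuple[int, int]:
    """Return (times the reference was preferred, total test cases).

    Assumes each score is 1 when the candidate was preferred and 0 otherwise.
    """
    scores = list(scores)
    total = len(scores)
    return total - int(sum(scores)), total


# Hypothetical scores for five test cases (not real results).
example_scores = [0.0, 1.0, 0.0, 0.0, 1.0]
preferred, total = count_reference_preferred(example_scores)
print(f"Evaluator preferred the reference in {preferred} out of {total} cases")
```

With a real run, the scores would come from `[case.score for case in my_test_run.test_case_outputs]`, as in the cells above.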


# Explore test results
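To read through individual cases, it can help to split the test cases by which side the evaluator picked. The sketch below uses made-up stand-in data rather than real run output, and it assumes a score of 0 means the reference was chosen and a score of 1 means the candidate was chosen.

```python
# Hypothetical stand-ins for per-case evaluator scores and candidate summaries.
scores = [0.0, 1.0, 0.0]
candidates = ["candidate summary A", "candidate summary B", "candidate summary C"]

# Indices of the test cases won by each side.
reference_won = [i for i, score in enumerate(scores) if score == 0]
candidate_won = [i for i, score in enumerate(scores) if score == 1]

# Print the candidate summaries that lost to the reference, for manual review.
for i in reference_won:
    print(i, candidates[i])
```

The numbers in parentheses in the observations below appear to be test case indices of this kind.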

# Observations about the test run results where reference was chosen over candidate

## 1. Typos and Hallucinations

### Giuliani (1) 
"Noelle Frank Dunphy was hired as the head of business development at **Giulinius's** new office...After one week into her job, **Giviani** flew to New York with Dunphy in order to get permission to stay in an apartment with him..."

### NBA playoffs (9) 
"The Boston 76ers and the Philadelphia 78ers face off in a conference finals game at the Garden on Sunday, May 14..."



## 2. Not translating to this new context 

### Book-style framing (2, 3, 8, 11, 24, 31, 38)

"In this chapter, we get a detailed look at some of the best potential players to enter the June draft and how they will fare in the event they are selected..."

"In this chapter, we get a brief summary of what's going on with the Bieber family. We learn that Justin and Hailey are engaged and planning to have kids soon. They got married in July of last year, and they have a little wedding in October of next year"

"The title of this chapter is "A Florida man living beneath the ocean won't revive even after breaking a record" 

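Book-style framing like the examples above is easy to flag automatically with a keyword check. The following is a rough sketch, not part of bench. The phrase list and the `uses_book_framing` name are assumptions chosen to match the quoted examples, and a real check would likely need a broader list.

```python
import re

# Phrases that suggest the model is summarizing a book chapter (illustrative list).
BOOK_PHRASES = re.compile(
    r"\b(in this chapter|the title of this chapter)\b", re.IGNORECASE
)


def uses_book_framing(summary: str) -> bool:
    """Return True if the summary contains one of the book-style phrases."""
    return BOOK_PHRASES.search(summary) is not None


examples = [
    "In this chapter, we get a detailed look at some of the best potential players",
    "The Goodyear 400 is a big event in the spring.",
]
flags = [uses_book_framing(s) for s in examples]
print(flags)
```

A check like this only catches the surface symptom. It does not measure whether the summary is otherwise accurate.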
## 3. Missing the main point

### NASCAR race (13)

#### Title: NASCAR results: William Byron wins Throwback race at Darlington ahead of Kevin Harvick, Chase Elliott\n\nGoodyear 400 final results...

"The Goodyear 400 is a big event in the spring. It's one of the most famous races in the world, and it gets even bigger on Sunday afternoon as the field goes for a run through the state's largest race track. There's lots of good racing going on, including some classic car shows like the Dodge Grand Car Classic and the Darlington Speedway. Some newcomers will be making their first appearance, like Chase Elliott and Josh Berry. They'll all be hoping to make it to the top of the heap."

# Observations about the test run results where candidate was chosen over reference

## 1. The summary prompt itself, present in the candidate, fooled the evaluator

Unintentional prompt hacking (5): "This is a very brief summary of the main point of the text. It captures all the important details in the text, and doesnt concentrate too much on tiny details. The deal with Activision has been approved by the European Union, but Britain's competition authority has already veto it..."