# NOTE: 
**This demo notebook was used to generate the Test Suites and Run results recorded in the directory. Please feel free to use this as a reference when creating your own Test Suites and Test Runs.**

**Running this notebook directly will result in errors. To run it, do one of the following:**
1. Change the name of the Test Suite.
2. Delete the results from the directory.
3. Set the `BENCH_FILE_DIR` environment variable (which defaults to `./bench`) to a different directory. To do this, uncomment the cell below.

In [1]:
# Uncomment the two lines below to change the default `BENCH_FILE_DIR`.
# import os
# os.environ['BENCH_FILE_DIR'] = 'FILL ME IN'
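
As an alternative to a hard-coded path, the environment variable can point at a fresh scratch directory, so that reruns never collide with existing results. This is a minimal sketch: the helper name `use_scratch_bench_dir` is hypothetical, and it assumes the variable is set before the Test Suite is created, as in the cell above.

```python
import os
import tempfile


def use_scratch_bench_dir(prefix="bench_"):
    """Create a new empty directory and point BENCH_FILE_DIR at it."""
    path = tempfile.mkdtemp(prefix=prefix)  # unique directory on every call
    os.environ["BENCH_FILE_DIR"] = path
    return path


scratch_dir = use_scratch_bench_dir()
print(scratch_dir)
```

Results written to a scratch directory are not cleaned up automatically, so delete the directory once you no longer need them.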

# Bench: Evaluating QA correctness 

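This notebook evaluates model responses on a synthetic question-answering dataset. Each question asks about a fictional clothing product. The dataset already contains answers from three open-source models (`roberta-base-squad2`, `tinyroberta-squad2`, and `minilm-squad2`); responses from ChatGPT, Cohere, and Claude are generated later in the notebook.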

In [1]:
import pandas as pd

from arthur_bench.run.testsuite import TestSuite

In [62]:
clothing_df = pd.read_csv('clothing_qa/clothing_inventory_qa.csv', index_col=0)

In [63]:
clothing_df

Unnamed: 0,source_text,question,item,human_answer,roberta-base-squad2,tinyroberta-squad2,minilm-squad2
0,Introducing the new Plumanoix athletic shirt -...,What type of fabric is the Plumanoix athletic ...,shirt,moisture-wicking breathable fabric,moisture-wicking and breathable,moisture-wicking and breathable materials,moisture-wicking and breathable
1,Introducing the latest addition to the Clothim...,What is the fabric of the Athletic Shirt 2.0 f...,shirt,high quality moisture-wicking stretchy fabric,"high-quality, moisture-wicking","high-quality, moisture-wicking fabric",moisture-wicking
2,Introducing the latest addition to the Shimmer...,What is the fabric of the Shimmer Threads Athl...,shirt,breathable quick drying moisture wicking machi...,moisture-wicking fabric,moisture-wicking fabric,moisture-wicking
3,Introducing the latest addition to the ZorbyX ...,What is the material used in making the ZorbyX...,shirt,Breathable material,"high-quality, breathable materials","high-quality, breathable materials",breathable materials that wick away sweat
4,Introducing the latest addition to Zephyr Thre...,What kind of materials is the Zephyr Threads S...,shirt,soft cotton,sustainable,sustainable,high-quality cotton
5,Introducing Plumanoix's newest addition to the...,What feature do the Plumanoix Performance Pant...,pants,moisture-wicking technology,moisture-wicking technology,moisture-wicking technology,moisture-wicking technology
6,Introducing the latest addition to the Clothim...,What type of closure do the Performance Pants ...,pants,Elastic waistband and drawstring closure,drawstring,drawstring,drawstring
7,Introducing a new luxury item in the pants cat...,What are the available colors of the Shimmerin...,pants,champagne and rose gold,champagne and rose gold,champagne and rose gold,champagne and rose gold
8,Introducing the latest addition to the ZorbyX ...,What materials are these ZorbyX fashionista pa...,pants,does not say - only mentions the finest qualit...,finest quality materials,finest quality materials,premium fabrics
9,Introducing the latest addition to the Zephyr ...,What is the fabric blend used to make the Zeph...,pants,blend of cotton and elastane,cotton and elastane,maximum comfort and durability,cotton and elastane


# Make a test suite

In [4]:
my_test_suite = TestSuite(
    'clothing_qa',    # name of the Test Suite
    'qa_correctness', # scoring method
    reference_data=clothing_df,
    input_column='question',
    reference_column='human_answer')

# Run the tests

In [44]:
import openai
import cohere
from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

# Uncomment and fill in your API keys. `cohere_fn` and `anthropic_claude`
# raise a NameError until `co` and `anthropic` are defined.
# co = cohere.Client('<YOUR COHERE API KEY HERE>')
# anthropic = Anthropic(api_key='<YOUR ANTHROPIC API KEY HERE>')


def chatgpt(input_text):
    # Uses the legacy `openai.ChatCompletion` interface (openai < 1.0).
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input_text}
        ]
    )['choices'][0]['message']['content']


def cohere_fn(input_text):
    # Return the text of the first generation.
    return co.generate(prompt=input_text)[0].text


def anthropic_claude(input_text):
    return anthropic.completions.create(
        model="claude-2",
        max_tokens_to_sample=300,
        prompt=f"{HUMAN_PROMPT} {input_text}{AI_PROMPT}",
    ).completion

In [42]:
QA_prompt = """
Provide an answer to a question based on the following context: <context>
Question: <question>
Answer:"""

In [66]:
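def get_responses(model_fn):
    # Assumes the test cases are in the same order as the rows of clothing_df.
    responses = []
    for i, test_case in enumerate(my_test_suite.suite.test_cases):
        # Look up the context by position rather than by index label.
        filled_prompt = QA_prompt.replace(
            "<context>", clothing_df.source_text.iloc[i]).replace(
            "<question>", test_case.input)
        responses.append(model_fn(filled_prompt))
    return responses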
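
The prompt-filling step can be checked without any API calls by swapping in a stub model. The sketch below reuses the template from the cell above; `fill_prompt`, `stub_model`, and the example context and question are hypothetical and are not part of the dataset.

```python
QA_PROMPT = """
Provide an answer to a question based on the following context: <context>
Question: <question>
Answer:"""


def fill_prompt(context, question, template=QA_PROMPT):
    """Substitute the context and question into the prompt template."""
    return template.replace("<context>", context).replace("<question>", question)


def stub_model(prompt):
    """Stand-in for a model function: reports the prompt length instead of calling an API."""
    return f"stub answer for a {len(prompt)}-character prompt"


# Made-up example rows, used only to exercise the functions.
contexts = ["The Example Shirt is made of soft cotton."]
questions = ["What is the Example Shirt made of?"]

prompts = [fill_prompt(c, q) for c, q in zip(contexts, questions)]
responses = [stub_model(p) for p in prompts]
print(prompts[0])
```

Any function that takes a prompt string and returns a string can be passed where `stub_model` is used, which is the same contract `get_responses` expects from `model_fn`.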

In [74]:
chatgpt_responses = get_responses(chatgpt)

In [75]:
cohere_responses = get_responses(cohere_fn)

In [76]:
claude_responses = get_responses(anthropic_claude)

In [70]:
clothing_df['chatgpt'] = chatgpt_responses
clothing_df['cohere'] = cohere_responses
clothing_df['claude'] = claude_responses
clothing_df

Unnamed: 0,source_text,question,item,human_answer,roberta-base-squad2,tinyroberta-squad2,minilm-squad2,chatgpt,cohere,claude
0,Introducing the new Plumanoix athletic shirt -...,What type of fabric is the Plumanoix athletic ...,shirt,moisture-wicking breathable fabric,moisture-wicking and breathable,moisture-wicking and breathable materials,moisture-wicking and breathable,The Plumanoix athletic shirt is made from a bl...,The Plumanoix athletic shirt is made from a b...,"Based on the context provided, the Plumanoix ..."
1,Introducing the latest addition to the Clothim...,What is the fabric of the Athletic Shirt 2.0 f...,shirt,high quality moisture-wicking stretchy fabric,"high-quality, moisture-wicking","high-quality, moisture-wicking fabric",moisture-wicking,The fabric of the Athletic Shirt 2.0 from Clot...,The Athletic Shirt 2.0 from Clothimus Univers...,"Based on the context provided, the fabric of ..."
2,Introducing the latest addition to the Shimmer...,What is the fabric of the Shimmer Threads Athl...,shirt,breathable quick drying moisture wicking machi...,moisture-wicking fabric,moisture-wicking fabric,moisture-wicking,The Shimmer Threads Athletic Shirt is made fro...,The fabric of the Shimmer Threads Athletic Sh...,The Shimmer Threads Athletic Shirt is made fr...
3,Introducing the latest addition to the ZorbyX ...,What is the material used in making the ZorbyX...,shirt,Breathable material,"high-quality, breathable materials","high-quality, breathable materials",breathable materials that wick away sweat,The ZorbyX Fashionista Athletic Shirt is made ...,The ZorbyX Fashionista Athletic Shirt is made...,"Based on the provided context, the material u..."
4,Introducing the latest addition to Zephyr Thre...,What kind of materials is the Zephyr Threads S...,shirt,soft cotton,sustainable,sustainable,high-quality cotton,The Zephyr Threads Shirt is made from high-qua...,The Zephyr Threads Shirt is made from high-qu...,"Based on the context provided, the Zephyr Thr..."
5,Introducing Plumanoix's newest addition to the...,What feature do the Plumanoix Performance Pant...,pants,moisture-wicking technology,moisture-wicking technology,moisture-wicking technology,moisture-wicking technology,The Plumanoix Performance Pants are equipped w...,The Plumanoix Performance Pants are equipped ...,The Plumanoix Performance Pants have moisture...
6,Introducing the latest addition to the Clothim...,What type of closure do the Performance Pants ...,pants,Elastic waistband and drawstring closure,drawstring,drawstring,drawstring,The Performance Pants feature an elastic waist...,The elastic waistband and drawstring closure.,The Performance Pants feature an elastic wais...
7,Introducing a new luxury item in the pants cat...,What are the available colors of the Shimmerin...,pants,champagne and rose gold,champagne and rose gold,champagne and rose gold,champagne and rose gold,The available colors of the Shimmering Satin P...,The Shimmering Satin Pants from Shimmer Threa...,"Based on the given context, the Shimmering Sa..."
8,Introducing the latest addition to the ZorbyX ...,What materials are these ZorbyX fashionista pa...,pants,does not say - only mentions the finest qualit...,finest quality materials,finest quality materials,premium fabrics,These ZorbyX fashionista pants are made from a...,These ZorbyX fashionista pants are made with ...,Unfortunately the context does not specify th...
9,Introducing the latest addition to the Zephyr ...,What is the fabric blend used to make the Zeph...,pants,blend of cotton and elastane,cotton and elastane,maximum comfort and durability,cotton and elastane,The Zephyr Threads Casual Pants are made with ...,The Zephyr Threads Casual Pants are crafted w...,"Based on the context provided, the Zephyr Thr..."


In [71]:
chatgpt_run = my_test_suite.run(
    'chatgpt-run',
    candidate_data=clothing_df, 
    candidate_column='chatgpt',
    context_column='source_text'
)

100%|███████████████████████████████████████████| 40/40 [00:17<00:00,  2.32it/s]


In [72]:
cohere_run = my_test_suite.run(
    'cohere-run',
    candidate_data=clothing_df, 
    candidate_column='cohere',
    context_column='source_text'
)

  0%|                                                    | 0/40 [00:00<?, ?it/s]
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
100%|███████████████████████████████████████████| 40/40 [00:47<00:00,  1.18s/it]


In [73]:
claude_run = my_test_suite.run(
    'claude-run',
    candidate_data=clothing_df, 
    candidate_column='claude',
    context_column='source_text'
)

100%|███████████████████████████████████████████| 40/40 [00:15<00:00,  2.54it/s]


In [None]:
roberta_run = my_test_suite.run(
    'roberta-run',
    candidate_data=clothing_df, 
    candidate_column='roberta-base-squad2',
    context_column='source_text'
)

In [7]:
tinyroberta_run = my_test_suite.run(
    'tinyroberta-run',
    candidate_data=clothing_df, 
    candidate_column='tinyroberta-squad2',
    context_column='source_text'
)

In [None]:
minilm_run = my_test_suite.run(
    'minilm-run',
    candidate_data=clothing_df, 
    candidate_column='minilm-squad2',
    context_column='source_text'
)
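
The six run cells above differ only in the run name and candidate column, so they could be folded into a loop. This is a sketch, not the notebook's method: `run_all` and `RecordingSuite` are hypothetical, the `"<column>-run"` naming scheme is an assumption (the cells above pick names by hand), and the only Test Suite call it relies on is `run` with the keyword arguments already used in this notebook.

```python
def run_all(suite, candidate_df, candidate_columns, context_column="source_text"):
    """Call suite.run once per candidate column and collect the results by column."""
    runs = {}
    for column in candidate_columns:
        runs[column] = suite.run(
            f"{column}-run",  # assumed naming scheme, e.g. "chatgpt-run"
            candidate_data=candidate_df,
            candidate_column=column,
            context_column=context_column,
        )
    return runs


class RecordingSuite:
    """Stand-in for a Test Suite that only records the arguments it receives."""

    def __init__(self):
        self.calls = []

    def run(self, run_name, candidate_data, candidate_column, context_column):
        self.calls.append((run_name, candidate_column, context_column))
        return run_name


# Exercise the loop with the stub; pass my_test_suite and clothing_df for real runs.
stub = RecordingSuite()
runs = run_all(stub, candidate_df=None,
               candidate_columns=["chatgpt", "cohere", "claude"])
print(stub.calls)
```

With the real Test Suite, run names that already exist in the directory would still trigger the errors described in the note at the top of this notebook.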