# Unsupervised model evaluation

In this notebook we'll use Anthropic's Claude 3 Sonnet model to evaluate summaries produced by two smaller models, Llama-2 Chat 13B and Mixtral 8x7B.

## Dataset

We'll use the cnn_dailymail dataset (version 3.0.0). To save time, we'll process only five randomly chosen articles from the train split.

In [1]:
from datasets import load_dataset

dataset = load_dataset('cnn_dailymail', '3.0.0')



In [2]:
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [3]:
dataset['train']['article'][0]

'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart. Details o

In [4]:
import numpy as np
num_to_eval = 5
eval_idxs = np.random.randint(low=0, high=len(dataset['train']), size=num_to_eval)
eval_idxs

array([ 72301, 183223, 249639,  44596,  76148])
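
The cell above draws indices with `np.random.randint`, which samples with replacement and is unseeded here, so a rerun picks different articles and could in principle pick the same article twice. Below is a minimal sketch of a reproducible alternative, assuming the standard-library `random` module is acceptable in place of NumPy. The helper name `sample_indices` and the seed value are hypothetical and not part of the original notebook.

```python
import random

def sample_indices(n_rows, n_samples, seed=0):
    """Draw n_samples distinct row indices from range(n_rows), reproducibly."""
    rng = random.Random(seed)  # a private generator, so global random state is untouched
    return rng.sample(range(n_rows), n_samples)  # sampling without replacement

# 287113 is the train split size reported above; seed=0 is an arbitrary choice.
eval_idxs_demo = sample_indices(287113, 5, seed=0)
```

Because the result is a list of plain Python integers, it can be used directly to index dataset rows.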

In [5]:
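# Index the row first, then the field: dataset['train']['article'] can load
# the entire article column on every iteration.
docs_to_summarize = [dataset['train'][int(i)]['article'] for i in eval_idxs]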

In [6]:
len(docs_to_summarize)

5

## Get summaries 

In [12]:
import boto3
import json
import time

In [8]:
mixtral_model_id = 'mistral.mixtral-8x7b-instruct-v0:1'
claude_model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'
llama_model_id = 'meta.llama2-13b-chat-v1'

In [9]:
bedrock_runtime_client = boto3.client('bedrock-runtime')

### Llama-2 13B

In [17]:
def call_llama(prompt):
    instruction = f"<s>[INST] Write a short summary of this article: {prompt} [/INST]"
    body = {
        "prompt": instruction,
        "max_gen_len": 500,
        "temperature": 0.5,
    }

    response = bedrock_runtime_client.invoke_model(
        modelId=llama_model_id, body=json.dumps(body)
    )

    response_body = json.loads(response["body"].read())
    completion = response_body["generation"]

    return completion

In [18]:
llama_sums = []
for doc in docs_to_summarize:
    response = call_llama(doc)
    llama_sums.append(response)
    time.sleep(2)
    

In [19]:
llama_sums[0]

'  Sure! Here\'s a short summary of the article:\n\nThe article discusses the issue of prescription pill addiction in Utah, which has seen a 400% increase in overdose deaths over the past decade. Despite the state\'s reputation for being healthy and clean, the article reveals that the high rates of addiction are affecting people from all walks of life, including those in the Mormon community. The article highlights the story of Shannon, a young mother who left Salt Lake City to escape the "epidemic" of addiction in the state. The article also notes that the Church of Jesus Christ of Latter-Day Saints, which is the predominant religion in Utah, has acknowledged the issue and is working to address it. The article concludes by highlighting the bravery of those who have shared their stories of addiction and recovery.'

### Mixtral

In [11]:
def call_mixtral(prompt):
    instruction = f"<s>[INST] Write a short summary of this article: {prompt} [/INST]"
    body = {
        "prompt": instruction,
        "max_tokens": 500,
        "temperature": 0.5,
    }

    response = bedrock_runtime_client.invoke_model(
        modelId=mixtral_model_id, body=json.dumps(body)
    )

    response_body = json.loads(response["body"].read())
    outputs = response_body.get("outputs")

    completions = [output["text"] for output in outputs]

    return completions

In [13]:
mixtral_sums = []
for doc in docs_to_summarize:
    response = call_mixtral(doc)
    mixtral_sums.append(response[0])
    time.sleep(2)
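
Both summarization loops above pause for a fixed two seconds between calls, which slows the happy path and still fails if a request is throttled. A minimal sketch of retrying with exponential backoff instead follows. The helper `call_with_backoff` and its parameters are hypothetical; in this notebook one would probably pass `botocore.exceptions.ClientError` as `retry_on`, but which errors are worth retrying is an assumption to check against your own account's behavior.

```python
import time

def call_with_backoff(fn, *args, retries=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(*args); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fn(*args)
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... with the defaults
```

With this helper, a loop body could read `call_with_backoff(call_llama, doc)` in place of the call followed by `time.sleep(2)`.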

In [14]:
mixtral_sums[0]

" The article discusses the prescription pill addiction epidemic in Utah, which has the eighth highest number of prescription drug overdose deaths in the United States. Despite the state's reputation for a healthy lifestyle and the influence of the Mormon religion, which discourages the use of harmful substances, the abuse of prescription pills is a growing problem. The Mormon church hierarchy has acknowledged the issue and allowed access to those dealing with addiction for a CNN report. The report features individuals, including a young mother named Shannon who moved from Salt Lake City to Los Angeles to escape the epidemic, sharing their stories of addiction and recovery."

## Evaluation

In [20]:
def call_claude_3(prompt):

    body = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    }
                ]
            }
        ],
        "max_tokens": 1000,
        "anthropic_version": "bedrock-2023-05-31",
        "temperature": 0.5
    }

    response = bedrock_runtime_client.invoke_model(
        modelId=claude_model_id, body=json.dumps(body)
    )

    response_body = json.loads(response["body"].read())

    return response_body['content'][0]['text']

In [23]:
prompt_eng_base = '''You will be given a summary of a news article. Your task is to evaluate the summary in four dimensions: accuracy, coherence, factuality, and completeness. Provide a score of 1-5 in each dimension, with 5 being the best score.

<original_discussion>
[DISCUSSION]
</original_discussion>

<summary>
[SUMMARY]
</summary>

Evaluation form (scores only):

- Coherence: 
- Accuracy:
- Factuality:
- Completeness:
'''

In [22]:
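def make_prompt(summary, article, prompt_eng_base):
    """Fill the evaluation template with one article and its summary."""
    # Normalize double quotes to single quotes in both inserted texts.
    summary = summary.replace("\"", "'")
    article = article.replace("\"", "'")
    prompt = prompt_eng_base.replace('[DISCUSSION]', article)
    prompt = prompt.replace('[SUMMARY]', summary)
    return prompt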
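
`make_prompt` fills `[DISCUSSION]` first and `[SUMMARY]` second, so an article that happens to contain the literal text `[SUMMARY]` would have that text overwritten by the second replacement. This is unlikely with news articles, but a single-pass substitution avoids it. The sketch below is one way to do that; `fill_template` is a hypothetical helper, not part of the original notebook.

```python
import re

def fill_template(template, values):
    """Replace each [KEY] placeholder in one pass, so inserted text is never re-scanned."""
    # Build an alternation such as \[DISCUSSION\]|\[SUMMARY\] from the dict keys.
    pattern = re.compile("|".join(re.escape(f"[{key}]") for key in values))
    # m.group(0) is the matched placeholder, e.g. "[SUMMARY]"; strip the brackets to look it up.
    return pattern.sub(lambda m: values[m.group(0)[1:-1]], template)
```

A call would look like `fill_template(prompt_eng_base, {"DISCUSSION": doc, "SUMMARY": llama_sum})`.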

### Let's look at one result

In [24]:
llama_sum = llama_sums[0]
mixtral_sum = mixtral_sums[0]
doc = docs_to_summarize[0]
p_llama = make_prompt(llama_sum, doc, prompt_eng_base)
p_mixtral = make_prompt(mixtral_sum, doc, prompt_eng_base)

In [26]:
print(call_claude_3(p_llama))

Coherence: 5
Accuracy: 5
Factuality: 5
Completeness: 4

The summary is well-written, coherent, and accurately captures the key points discussed in the original text. It maintains factual integrity by correctly representing the statistics, anecdotes, and details mentioned in the article. However, in terms of completeness, the summary could have included a brief mention of the role of the Mormon church hierarchy in addressing the issue, as highlighted in the original discussion. Overall, the summary effectively conveys the central theme and major points covered in the article.


In [27]:
print(call_claude_3(p_mixtral))

Evaluation form (scores only):

- Coherence: 5
- Accuracy: 5
- Factuality: 5
- Completeness: 4

The summary is coherent, well-structured, and easy to follow. It accurately captures the key points discussed in the original text, including Utah's high ranking in prescription drug overdose deaths, the influence of the Mormon religion and its teachings on healthy living, the acknowledgment of the issue by the Mormon church hierarchy, and the personal stories of individuals like Shannon who struggled with addiction. The summary presents factual information that aligns with the details provided in the original discussion. However, in terms of completeness, the summary could have included a few additional details, such as the specific statistic that pill-related deaths in Utah have increased by 400% over the past decade, and the mention of the CNN crew spending 12 days in Utah to report on the issue.
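
The two replies above use slightly different layouts (one with leading dashes, one without), and both add prose despite the "scores only" instruction. To compare the models across all five articles, the scores need to be pulled out of the free text. Below is a minimal sketch of doing that with a regular expression. `parse_scores` and `mean_scores` are hypothetical helpers, and the pattern assumes replies keep the `Dimension: N` shape seen above.

```python
import re

DIMENSIONS = ("Coherence", "Accuracy", "Factuality", "Completeness")

def parse_scores(reply):
    """Extract {dimension: score} from lines such as '- Coherence: 5' or 'Coherence: 5'."""
    scores = {}
    for dim in DIMENSIONS:
        # Anchor at line start so prose mentioning a dimension name is ignored.
        match = re.search(rf"^\s*-?\s*{dim}\s*:\s*([1-5])\b", reply, flags=re.MULTILINE)
        if match:
            scores[dim] = int(match.group(1))
    return scores

def mean_scores(replies):
    """Average each dimension over the replies in which it could be parsed."""
    parsed = [parse_scores(reply) for reply in replies]
    means = {}
    for dim in DIMENSIONS:
        values = [p[dim] for p in parsed if dim in p]
        if values:
            means[dim] = sum(values) / len(values)
    return means
```

Collecting `call_claude_3(...)` replies for each model into a list and passing it to `mean_scores` would give one average per dimension per model. With only five samples, these averages are a rough indication at best.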
