# Summarization Scoring using AWS Bedrock

In this example notebook, we compare how different models in AWS Bedrock perform on a summarization task using **Arthur Bench**.

The overall summarization comparison is set up as a **Bench TestSuite**, and models are compared head-to-head, one pair per
**Bench TestRun**. Bench lets you view these comparisons in its User Interface and also exposes statistics from the TestSuite
and TestRuns themselves for further analysis. For example, in the notebook below we provide a means for using the Elo rating
algorithm to determine which model performs best at this summarization task.

The task is to summarize 49 news articles, and the comparison is scored using ChatGPT 3.5 Turbo (see `summary_quality.py`).
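As a rough illustration of how Elo ratings can be derived from head-to-head results, here is a minimal sketch. It assumes each comparison can be reduced to a score of 1.0 (first model preferred), 0.5 (tie), or 0.0 (second model preferred). The model names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions, not values taken from Bench.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rank_models(matches, initial: float = 1000.0, k: float = 32.0):
    """Replay (model_a, model_b, score_a) tuples and return models sorted by rating."""
    ratings = {}
    for model_a, model_b, score_a in matches:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical outcomes, not real results from this notebook.
example_matches = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_c", 0.5),
    ("model_b", "model_c", 0.0),
]
ranking = rank_models(example_matches)
```

Because the update depends on the order in which comparisons are replayed, it can be worth shuffling the matches and averaging over several passes when the number of comparisons is small.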

## 1. Set up the AWS Bedrock client

In [None]:
"""
Authentication is handled using the AWS_PROFILE environment variable. See the AWS Boto3 documentation and the provided
utility library for connecting to Bedrock for additional information.
"""
from bedrock_client import client

bedrock_runtime = client.get_bedrock_client(region="us-east-1")
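If the `bedrock_client` utility is not available, a runtime client can be created directly with Boto3. The following is a minimal sketch, not the utility library's implementation. The function name `make_bedrock_runtime_client` is hypothetical, and it assumes `boto3` is installed and that the named profile has access to Bedrock in the chosen region.

```python
import os


def make_bedrock_runtime_client(region="us-east-1", profile=None):
    """Create a Bedrock runtime client from an AWS profile (hypothetical helper)."""
    import boto3  # imported lazily so the module loads even without boto3 installed

    # Fall back to AWS_PROFILE; a value of None lets Boto3 use its default credential chain.
    profile = profile or os.environ.get("AWS_PROFILE")
    session = boto3.Session(profile_name=profile, region_name=region)
    return session.client("bedrock-runtime")
```

The returned client exposes the `invoke_model` call used in the cells below.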

## 2. Load the data and prepare it for inferencing

In [None]:
import csv

articles = []

with open("data/news_summary/example_summaries.csv", "r", newline="") as f:
    dr = csv.DictReader(f)
    for row in dr: 
        articles.append(row["input_text"])

len(articles)

## 3. Generate inferences (summaries) for the articles

In [None]:
import json

prompt = """\
Summarize the following news document down to its most important points in less than 250 words.
{}
"""

def generate_summary_from_llama(model_id, article): 
    body = json.dumps({"prompt": prompt.format(article)})
    modelId = model_id
    accept = "application/json"
    contentType = "application/json"
    
    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("generation")


def generate_summary_from_titan(model_id, article):
    body = json.dumps({"inputText": prompt.format(article)})
    modelId = model_id
    accept = "application/json"
    contentType = "application/json"
    
    response = bedrock_runtime.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())
    return response_body.get("results")[0].get("outputText")
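The two functions above differ only in the request key and in how the response is parsed, so they can be folded into a single table-driven helper. The sketch below is one possible arrangement, not part of the original notebook. `FAMILIES`, `generate_summary`, and `FakeBedrockClient` are hypothetical names, and the fake client exists only so the example runs without AWS access.

```python
import io
import json

PROMPT = (
    "Summarize the following news document down to its most important points "
    "in less than 250 words.\n{}\n"
)

# Request key and response parser per model family, mirroring the two functions above.
FAMILIES = {
    "meta": ("prompt", lambda body: body.get("generation")),
    "amazon": ("inputText", lambda body: body.get("results")[0].get("outputText")),
}


def generate_summary(client, model_id, article):
    """Invoke a Bedrock model and return its summary text."""
    family = model_id.split(".", 1)[0]
    if family not in FAMILIES:
        raise ValueError(f"Unsupported model family for {model_id!r}")
    request_key, parse = FAMILIES[family]
    response = client.invoke_model(
        body=json.dumps({request_key: PROMPT.format(article)}),
        modelId=model_id,
        accept="application/json",
        contentType="application/json",
    )
    return parse(json.loads(response.get("body").read()))


class FakeBedrockClient:
    """Hypothetical stand-in that returns canned responses in the two shapes used above."""

    def invoke_model(self, body, modelId, accept, contentType):
        request = json.loads(body)
        if "prompt" in request:
            payload = {"generation": "llama summary"}
        else:
            payload = {"results": [{"outputText": "titan summary"}]}
        return {"body": io.BytesIO(json.dumps(payload).encode("utf-8"))}


fake = FakeBedrockClient()
llama_out = generate_summary(fake, "meta.llama2-13b-chat-v1", "some article")
titan_out = generate_summary(fake, "amazon.titan-text-lite-v1", "some article")
```

Supporting another model family then means adding one entry to `FAMILIES` rather than another near-identical function.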

In [None]:
models = (
    "meta.llama2-13b-chat-v1",
    "meta.llama2-70b-chat-v1",
    "amazon.titan-text-lite-v1"
)

summaries = {}
summaries["input"] = articles

for model in models:
    summaries.setdefault(model, [])
    for i, article in enumerate(articles): 
        try:
            if "meta" in model:
                summary = generate_summary_from_llama(model, article)
            elif "amazon" in model:
                summary = generate_summary_from_titan(model, article)
            else:
                print(f"Unable to determine what {model} is")
                continue
                
            summaries[model].append(summary)
        except Exception as exc:
            # Store an empty summary so list positions stay aligned with `articles`
            print(f"Couldn't generate summary for article {i} using model {model}: {exc}")
            summaries[model].append("")
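Because failed generations are stored as empty strings, it may be worth checking how many summaries are missing before scoring them. A minimal sketch follows. The helper name `count_failed_summaries` and the example data are hypothetical.

```python
def count_failed_summaries(summaries, input_key="input"):
    """Return {model: number of empty or missing summaries}, skipping the input column."""
    return {
        model: sum(1 for summary in outputs if not summary)
        for model, outputs in summaries.items()
        if model != input_key
    }


# Hypothetical data in the same shape as the `summaries` dict built above.
example = {
    "input": ["article 1", "article 2", "article 3"],
    "model_x": ["summary 1", "", "summary 3"],
    "model_y": ["summary 1", "summary 2", "summary 3"],
}
failed = count_failed_summaries(example)
```

A model with many empty entries would be penalized for API failures rather than for summary quality, so those articles could be retried or excluded before running the comparison.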


### 3.5 Save (and load) inferences to a pickle file

In [None]:
import pickle

with open("summaries.pkl", "wb") as f:
    pickle.dump(summaries, f)

In [None]:
import pickle
with open('summaries.pkl', 'rb') as f:
    summaries = pickle.load(f)

## 4. Set up a Bench TestSuite and run Bench TestRuns

At this point, make sure you've set the `BENCH_FILE_DIR` environment variable. 

See the [Quickstart Guide](https://bench.readthedocs.io/en/latest/quickstart.html#view-results-in-local-ui) for additional information.

In [None]:
from arthur_bench.run.testsuite import TestSuite
import os 

os.environ["OPENAI_API_KEY"] = "REPLACE_ME"


suite = TestSuite(
    "News Summarization Using Bedrock", 
    'summary_quality',
    input_text_list=summaries["input"],
    reference_output_list=summaries["amazon.titan-text-lite-v1"]  # Use Titan's summarization as our "reference"
)

In [None]:
# Compare titan's summarization against llama 13b
run_llama_13b_compare = suite.run(
    run_name="titan_vs_llama_13b",
    candidate_output_list=summaries["meta.llama2-13b-chat-v1"]
)

# Compare titan's summarization against llama 70b
run_llama_70b_compare = suite.run(
    run_name="titan_vs_llama_70b",
    candidate_output_list=summaries["meta.llama2-70b-chat-v1"]
)

## 5. Open the Bench UI and view your results

From your command line, run the `bench` command, then open the URL printed in your terminal window to load the UI.