In [None]:
import re
import time

import evaluate
import kscope
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Getting Started

Documentation on how to interact with the large models is available [here](https://kaleidoscope-sdk.readthedocs.io/en/latest/). The SDK is hosted on GitHub [here](https://github.com/VectorInstitute/kaleidoscope-sdk), and the underlying service code is [here](https://github.com/VectorInstitute/kaleidoscope).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

In [None]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=6001)

Show all supported models

In [None]:
client.models

Show all model instances that are currently active

In [None]:
client.model_instances

To start, we obtain a handle to a model. In this example, let's use the OPT-175B model.

In [None]:
model = client.load_model("OPT-175B")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)
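
The polling loop above waits indefinitely if the model never reaches the "ACTIVE" state. A minimal sketch of a bounded wait follows, assuming only that the state can be read repeatedly. The helper `wait_for_state`, its parameters, and the stand-in "PENDING" state are hypothetical and are not part of the kscope SDK.

```python
import time


def wait_for_state(get_state, target="ACTIVE", timeout_s=600.0, poll_s=1.0):
    """Poll get_state() until it returns target, or raise after timeout_s seconds.

    Hypothetical helper: get_state is any zero-argument callable returning a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == target:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"State was still {state!r} after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in for a real model handle: "PENDING" is a made-up placeholder state.
fake_states = iter(["PENDING", "PENDING", "ACTIVE"])
final_state = wait_for_state(lambda: next(fake_states), poll_s=0.0)
print(final_state)
```

With the handle from the cell above, the call would look like `wait_for_state(lambda: model.state)`.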

Next, we configure how the model generates text by setting a few important parameters. For a discussion of the configuration parameters see: `src/reference_implementations/prompting_vector_llms/CONFIG_README.md`

In [None]:
long_generation_config = {"max_tokens": 128, "top_k": 4, "top_p": 1.0, "rep_penalty": 1.2, "temperature": 0.5}
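
To build intuition for `top_k` and `temperature`, here is a toy sketch of how these two settings are commonly applied to next-token scores. It assumes the usual interpretation (keep the `top_k` highest-scoring tokens, divide their scores by the temperature, then apply a softmax). The function name and the toy logits are illustrative only, and the service may implement sampling differently; the CONFIG_README remains the reference for exact semantics.

```python
import math


def top_k_temperature_probs(logits, top_k, temperature):
    """Return sampling probabilities after top-k filtering and temperature scaling."""
    # Indices of the top_k largest logits; every other token gets probability 0.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = {i: logits[i] / temperature for i in kept}
    # Subtract the maximum before exponentiating for numerical stability.
    max_scaled = max(scaled.values())
    exps = {i: math.exp(value - max_scaled) for i, value in scaled.items()}
    total = sum(exps.values())
    probs = [0.0] * len(logits)
    for i, value in exps.items():
        probs[i] = value / total
    return probs


toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]  # made-up scores for six tokens
probs_t05 = top_k_temperature_probs(toy_logits, top_k=4, temperature=0.5)
probs_t10 = top_k_temperature_probs(toy_logits, top_k=4, temperature=1.0)
print([round(p, 3) for p in probs_t05])
print([round(p, 3) for p in probs_t10])
```

In this sketch, the lower temperature concentrates more probability on the highest-scoring token, which is why a temperature of 0.5 tends to give less varied output than 1.0.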

Let's try a basic prompt for factual information.

__Note__ that if you run the cell multiple times, you'll get different responses due to sampling.

In [None]:
generation = model.generate("What is the capital of Canada?", long_generation_config)
# Extract the text from the returned generation
generation.generation["text"]

In [None]:
def post_process_generations(generation_text: str) -> str:
    # This simply attempts to extract the first three "sentences" within a generated string
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return " ".join(split_text)
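
To make the post-processing behavior concrete, the sketch below repeats the function from the cell above (so the example runs on its own) and applies it to two made-up strings. Text with no terminating `.`, `!`, or `?` yields an empty string, and any trailing fragment without terminal punctuation is dropped.

```python
import re


def post_process_generations(generation_text: str) -> str:
    # Same logic as the notebook cell: keep the first three "sentences".
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return " ".join(split_text)


# Made-up generation text for illustration.
example_text = " Ottawa is the capital. It is in Ontario! Is it cold? Yes it is. Trailing words"
trimmed = post_process_generations(example_text)
no_punctuation = post_process_generations("no terminal punctuation here")
print(trimmed)
print(repr(no_punctuation))
```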

### Basic Prompts

Now let's create a basic prompt template that we can reuse for multiple text inputs. This will be an instruction prompt with an unconstrained answer space as we're going to try to get OPT to summarize texts. We'll try several different templates and examine performance for each. Note that this section simply considers "manual" or "human-level" inspection to determine the quality of the summary. At the bottom of this notebook, we consider measuring the quality of two prompts on a sample of the CNN Daily Mail task using a ROUGE-1 Score.

In [None]:
prompt_template_summary_1 = "Summarize the preceding text."
prompt_template_summary_2 = "Short Summary:"
prompt_template_summary_3 = "TLDR;"

In [None]:
with open("resources/news_summary_datasets/examples_news.txt", "r") as file:
    news_stories = file.readlines()

In [None]:
prompts_with_template_1 = [f"{news_story} {prompt_template_summary_1}" for news_story in news_stories]
prompts_with_template_2 = [f"{news_story} {prompt_template_summary_2}" for news_story in news_stories]
prompts_with_template_3 = [f"{news_story} {prompt_template_summary_3}" for news_story in news_stories]

In these examples, we use the prompt structures

* (text) Summarize the preceding text.
* (text) Short Summary:
* (text) TLDR;
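
One detail worth keeping in mind: `readlines()` keeps each line's trailing newline, so the prompts built above place a line break between the story and the template. A minimal sketch of a prompt builder that strips surrounding whitespace and skips blank lines is shown below. The helper `build_prompts` and the example stories are hypothetical and not part of the notebook.

```python
def build_prompts(stories, template):
    """Append template to each non-empty story after stripping whitespace."""
    prompts = []
    for story in stories:
        cleaned = story.strip()
        if cleaned:  # skip blank lines, which readlines() returns as "\n"
            prompts.append(f"{cleaned} {template}")
    return prompts


# Made-up stories mimicking what readlines() returns.
example_stories = [
    "Heavy rain is expected this weekend.\n",
    "\n",
    "The council approved the budget.\n",
]
example_prompts = build_prompts(example_stories, "TLDR;")
print(example_prompts)
```

If blank lines are skipped this way, the list of original stories should be filtered identically before pairing stories with summaries via `zip`.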

In [None]:
generation_1 = model.generate(prompts_with_template_1, long_generation_config)
print(f"Prompt: {prompt_template_summary_1}")
for summary, original_story in zip(generation_1.generation["text"], news_stories):
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

In [None]:
generation_2 = model.generate(prompts_with_template_2, long_generation_config)
for summary, original_story in zip(generation_2.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_2}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

In [None]:
generation_3 = model.generate(prompts_with_template_3, long_generation_config)
for summary, original_story in zip(generation_3.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_3}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Story 2 is about the possibility of severe flooding in California and an evacuation order being issued. Let's see whether the three summaries capture that, and which prompt worked best.

In [None]:
print(f"{prompt_template_summary_1}|| {post_process_generations(generation_1.generation['text'][1])}")
print("====================================================================================")
print(f"{prompt_template_summary_2}|| {post_process_generations(generation_2.generation['text'][1])}")
print("====================================================================================")
print(f"{prompt_template_summary_3}|| {post_process_generations(generation_3.generation['text'][1])}")

### Can we improve the results by providing additional context to our instructions?

In this example, we prompt the model to provide a summary and try to do so in a compact way. We still post-process the text to grab the first three sentences, but hopefully the model tries to pack more information into those first sentences.

In [None]:
prompt_template_summary_4 = "Summarize the text in as few words as possible:"
prompts_with_template_4 = [f"{news_story} {prompt_template_summary_4}" for news_story in news_stories]
generation_4 = model.generate(prompts_with_template_4, long_generation_config)
for summary, original_story in zip(generation_4.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_4}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

OPT, and generative models in general, have been reported to perform better when not prompted with "declarative" instructions or direct interrogatives (see the [OPT Paper](https://arxiv.org/abs/2205.01068)). As a first step away from declarative instructions, let's ask for the summary as a question!

In [None]:
prompt_template_summary_5 = "How would you briefly summarize the text?"
prompts_with_template_5 = [f"{news_story} {prompt_template_summary_5}" for news_story in news_stories]
generation_5 = model.generate(prompts_with_template_5, long_generation_config)
for summary, original_story in zip(generation_5.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_5}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Rephrasing the question will likely induce different summaries and possibly improve the results.

In [None]:
prompt_template_summary_6 = "Briefly, what is this story about?"
prompts_with_template_6 = [f"{news_story} {prompt_template_summary_6}" for news_story in news_stories]
generation_6 = model.generate(prompts_with_template_6, long_generation_config)
for summary, original_story in zip(generation_6.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_6}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

As a final example, rather than asking a question, we put the task in a context that might be more natural for a generative model. That is, we ask it to "sum up" the article with a natural phrase prefix to be completed in a "conversational" way.

In [None]:
prompt_template_summary_7 = "In short,"
prompts_with_template_7 = [f"{news_story} {prompt_template_summary_7}" for news_story in news_stories]
generation_7 = model.generate(prompts_with_template_7, long_generation_config)
for summary, original_story in zip(generation_7.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_7}")
    # Keep only the first three sentences (split on ".", "!" or "?")
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

### Measuring Performance on CNN Daily Mail

In [None]:
dataset = load_dataset("cnn_dailymail", "3.0.0")

We load the data from the CNN Daily Mail test set, the ROUGE metric scorer from Hugging Face, and a tokenizer from OPT. The tokenizer is used to truncate the text so that it fits into the OPT model context. We truncate the text to 1023 tokens, so that it is of length 1024 when the start-of-sentence token (`<s>`) is added.

__NOTE__: All OPT models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

In [None]:
opt_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
dataloader = DataLoader(dataset["test"], shuffle=False, batch_size=10)
rouge = evaluate.load("rouge")
prompt_template_summary_1 = "How would you briefly summarize the text?"
prompt_template_summary_2 = "In short,"

In [None]:
def truncate_article_text(article_text: str, tokenizer: AutoTokenizer, max_sequence_length: int = 1023) -> str:
    tokenized_article = tokenizer.encode(article_text, truncation=True, max_length=max_sequence_length)
    return tokenizer.decode(tokenized_article, skip_special_tokens=True)
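
The length arithmetic described above (1023 tokens plus one start-of-sentence token gives 1024) can be sketched without loading a tokenizer. The example below uses a plain list of made-up tokens. `truncate_with_bos` is a hypothetical helper, and the real tokenizer's handling of special tokens may differ from this simplification.

```python
def truncate_with_bos(tokens, max_sequence_length=1023, bos_token="<s>"):
    """Keep at most max_sequence_length tokens, then prepend the start token."""
    truncated = tokens[:max_sequence_length]
    return [bos_token] + truncated


# Made-up token list standing in for a long tokenized article.
toy_tokens = [f"tok{i}" for i in range(3000)]
model_input = truncate_with_bos(toy_tokens)
short_input = truncate_with_bos(["hello", "world"])
print(len(model_input), model_input[0], model_input[-1])
```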

We'll try two different prompts from the examples above and consider how well each does in terms of ROUGE score against reference summaries on the CNN Daily Mail task, which is a common summarization benchmark. You can see a discussion of this dataset here: [CNN Daily Mail](https://huggingface.co/datasets/cnn_dailymail).

__Note__: On a big model, like OPT-175B, this process will likely take a bit of time, given the length of the articles and the fact that we are asking for 100 summaries.

Running the first prompt structure
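
For intuition about what ROUGE-1 measures, here is a simplified sketch that computes a unigram-overlap F1 score using whitespace tokenization. It is an illustration under that assumption only: the Hugging Face `rouge` metric used in this notebook applies its own tokenization and aggregation, so its numbers may differ. The function `rouge1_f1` and the example sentences are made up for this sketch.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (lowercased, whitespace-tokenized)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Five of the six words overlap, so precision = recall = F1 = 5/6.
toy_score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(toy_score, 4))
```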

(text) How would you briefly summarize the text?

In [None]:
# Running the first prompt type
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text} {prompt_template_summary_1}" for article_text in truncated_articles]
    summaries = model.generate(prompts, long_generation_config).generation["text"]
    # Keep only the first three sentences (split on ".", "!" or "?")
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE-1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Running the second prompt structure

(text) In short,

In [None]:
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text} {prompt_template_summary_2}" for article_text in truncated_articles]
    summaries = model.generate(prompts, long_generation_config).generation["text"]
    # Keep only the first three sentences (split on ".", "!" or "?")
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE-1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

The second prompt, as measured by ROUGE-1 scores, appears to produce higher-quality summaries than the first prompt. This is likely due to its structure: a phrase to be completed fits the "generative" training setting a bit better than a point-blank question.