In [1]:
import re
import time

import evaluate
import kscope
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from utils import split_prompts_into_batches

### Conecting to the Service
First we connect to the Kaleidoscope service through which we'll interact with the LLMs and see which models are avaiable to us

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=6001)

Show all model instances that are currently active

In [4]:
client.model_instances

[{'id': 'f6e55e38-2c01-4def-aaf9-d4531f9a2598',
  'name': 'OPT-175B',
  'state': 'ACTIVE'},
 {'id': '4213e687-fdaf-4046-a03c-ca59b484ad14',
  'name': 'OPT-6.7B',
  'state': 'ACTIVE'}]

To start, we obtain a handle to a model. In this example, let's use the OPT-175B model.

In [5]:
model = client.load_model("OPT-175B")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)

In [6]:
long_generation_config = {"max_tokens": 128, "top_k": 4, "top_p": 1.0, "rep_penalty": 1.2, "temperature": 0.5}

In [16]:
def post_process_generations(generation_text: str) -> str:
    # This simply attempts to extract the first three "sentences" within a generated string
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return "\n".join(split_text)

### Basic Prompts

Now let's create a basic prompt template that we can reuse for multiple text inputs. This will be an instruction prompt with an unconstrained answer space as we're going to try to get OPT to summarize texts. We'll try several different templates and examine performance for each. Note that this section simply considers "manual" or "human-level" inspection to determine the quality of the summary. At the bottom of this notebook, we consider measuring the quality of two prompts on a sample of the CNN Daily Mail task using a ROUGE 1 Score.

In [17]:
prompt_template_summary_1 = "Summarize the preceding text."
prompt_template_summary_2 = "TLDR;"

In [18]:
with open("news_summary_datasets/examples_news.txt", "r") as file:
    news_stories = [line.strip() for line in file.readlines()]

In [19]:
prompts_with_template_1 = [f"{news_story}\n{prompt_template_summary_1}" for news_story in news_stories]
prompts_with_template_2 = [f"{news_story}\n{prompt_template_summary_2}" for news_story in news_stories]

In these examples, we use the prompt structures

* (text) Summarize the preceding text.
* (text) TLDR;

In [21]:
print(f"{prompts_with_template_1[0]}\n")
print(prompts_with_template_2[0])

Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN. In many of those cases, Russia has then flown the equipment to Iran to dismantle and analyze, likely so the Iranian military can attempt to make their own version of the weapons, sources said. Russia believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine, the sources said. US officials don’t believe that the issue is widesprea

In [12]:
generation_1 = model.generate(prompts_with_template_1, long_generation_config)
print(f"Prompt: {prompt_template_summary_1}")
for summary, original_story in zip(generation_1.generation["text"], news_stories):
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the preceding text.
Original Length: 1262, Summary Length: 613
The Pentagon has confirmed that Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN.

Original Length: 1181, Summary Length: 552

Original Length: 1260, Summary Length: 446
The request is the latest in a string of efforts by states to restrict the rights of transgender youth, including in sports, to use the bathroom of their choice, and to receive gender-affirming medical care. The Supre

In [14]:
generation_2 = model.generate(prompts_with_template_2, long_generation_config)
for summary, original_story in zip(generation_2.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_2}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: TLDR;
Original Length: 1262, Summary Length: 581
Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN.

Prompt: TLDR;
Original Length: 1181, Summary Length: 138
I'm glad I'm not in California right now. I'm glad I don't live in California, too. But I'm not sure what this has to do with the article.

Prompt: TLDR;
Original Length: 1260, Summary Length: 371
West Virginia wants to prevent trans women and girls from playing sports. The state is appealing a ruling 

Story 2 is about the possibility of severe flooding in California and an evacuation order being issued. Let's see what we get that from the three summaries and maybe which worked better.

In [15]:
print(f"{prompt_template_summary_1}\n{post_process_generations(generation_1.generation['text'][1])}")
print("====================================================================================\n")
print(f"{prompt_template_summary_2}\n {post_process_generations(generation_2.generation['text'][1])}")
print("====================================================================================\n")

Short Summary:|| This is a dangerous situation. Flooding is imminent and may have already occurred. Do not attempt to travel unless it is absolutely necessary to do so.
TLDR;|| I'm glad I'm not in California right now. I'm glad I don't live in California, too. But I'm not sure what this has to do with the article.


### Can we improve the results by providing additional context to our instructions?

In this example, we prompt the model to provide a summary and try to do so in a compact way. We still post-process the text to grab the first three sentences, but hopefully the model tries to pack more information into those first sentences

In [16]:
prompt_template_summary_3 = "Summarize the text in as few words as possible:"
prompts_with_template_3 = [f"{news_story}\n{prompt_template_summary_3}" for news_story in news_stories]
generation_3 = model.generate(prompts_with_template_3, long_generation_config)
for summary, original_story in zip(generation_3.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_3}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the text in as few words as possible:
Original Length: 1262, Summary Length: 46
Russia is supplying Iran with US-made weapons.

Prompt: Summarize the text in as few words as possible:
Original Length: 1181, Summary Length: 141
A. A powerful storm is likely to deliver severe rainfall and cause widespread flooding across the central and northern parts of the state. B.

Prompt: Summarize the text in as few words as possible:
Original Length: 1260, Summary Length: 410
The West Virginia law prohibits transgender women and girls from playing in any public school sports that are designated for girls and women. It allows for some transgender boys and men to play in sports designated for boys and men. The law also states that “any student athlete who is unable to prove her biological sex” through a blood test may participate in sports for the gender with which she identifies.



OPT, and generative models in general, have been reported to perform better when not prompted with "declarative" instructions or direct interogatives (See the [OPT Paper](https://arxiv.org/abs/2205.01068)). As such, let's ask for the summary as a question!

In [18]:
prompt_template_summary_4 = "Briefly, what is this story about?"
prompts_with_template_4 = [f"{news_story}\n{prompt_template_summary_4}" for news_story in news_stories]
generation_4 = model.generate(prompts_with_template_4, long_generation_config)
for summary, original_story in zip(generation_4.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_4}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Briefly, what is this story about?
Original Length: 1262, Summary Length: 582
Russian has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN.

Prompt: Briefly, what is this story about?
Original Length: 1181, Summary Length: 199
A wildfire that started in the hills of Los Angeles quickly spread to more than 1,000 acres and forced evacuations in the area. The blaze, dubbed the La Tuna fire, was reported just before 3:30 p. m.

Prompt: Briefly, what is this story about?


As a final example, rather than asking a question, we putting the task in a context that might be more natural for a generative model. That is, we ask it to "sum up" the article with a natural phrase prefix to be completed in a "conversational" way.

In [19]:
prompt_template_summary_5 = "In short,"
prompts_with_template_5 = [f"{news_story} {prompt_template_summary_5}" for news_story in news_stories]
generation_5 = model.generate(prompts_with_template_5, long_generation_config)
for summary, original_story in zip(generation_5.generation["text"], news_stories):
    print(f"Prompt: {prompt_template_summary_5}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: In short,
Original Length: 1262, Summary Length: 219
the US is fighting a proxy war against Russia in Syria, and Russia is fighting a proxy war against the US in Ukraine. It is a proxy war of a proxy war. The US is also fighting a proxy war against Russia in Central Asia.

Prompt: In short,
Original Length: 1181, Summary Length: 254
California is in a state of emergency as a result of the largest storm in its history. “It’s a big storm, but it’s not going to be a huge storm,” said Jan Null, a meteorologist with Golden Gate Weather Services. “It’s going to be a very manageable storm.

Prompt: In short,
Original Length: 1260, Summary Length: 374
the state is asking the Supreme Court to step in and allow it to enforce the law while the appeals process plays out. The Supreme Court is currently shorthanded, with only eight justices, and it’s unclear if the court will take up the case. Justice Samuel Alito, who handles emergency requests from the 4th Circuit, could act on his own to 

### Measuring Performance on CNN Daily Mail

In [22]:
dataset = load_dataset("cnn_dailymail", "3.0.0")

Found cached dataset cnn_dailymail (/Users/david/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)


  0%|          | 0/3 [00:00<?, ?it/s]

We load the data from the CNN Daily Mail Test set, the ROUGE metric scorer from Hugging Face, and a Tokenizer from OPT. The tokenizer is used to truncate the text such that it fits nicely into the OPT model context. We truncate the text to 1023, so that it is of length 1024 when the start-of-sentence token (`<s>`) is added.

__NOTE__: All OPT models, regardless of size, used the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

In [23]:
opt_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
dataloader = DataLoader(dataset["test"], shuffle=False, batch_size=10)
rouge = evaluate.load("rouge")
prompt_template_summary_1 = "How would you briefly summarize the text?"
prompt_template_summary_2 = "In short,"

In [24]:
def truncate_article_text(article_text: str, tokenizer: AutoTokenizer, max_sequence_length: int = 1023) -> str:
    tokenized_article = tokenizer.encode(article_text, truncation=True, max_length=max_sequence_length)
    return tokenizer.decode(tokenized_article, skip_special_tokens=True)

We'll try two different prompts from the examples above and consider how well they each do in terms of rouge score against reference summaries on the CNN Daily Mail task, which is a common summarization benchmark. You can see a discussion of this dataset here: [CNN Daily Mail](https://huggingface.co/datasets/cnn_dailymail). 

__Note__: On a big model, like OPT-175, this process will likely take a bit of time, given the length of the articles and the fact that we are asking for 100 summaries.

Running the First prompt structure

(text) How would you briefly summarize the text?

In [23]:
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text}\n{prompt_template_summary_1}" for article_text in truncated_articles]
    prompt_batches = split_prompts_into_batches(prompts, batch_size=1)
    for prompt_batch in prompt_batches:
        summaries = model.generate(prompt_batch, long_generation_config).generation["text"]
        # Let's just take the first 3 sentences, split by periods
        summaries = [post_process_generations(summary) for summary in summaries]
        # References for the metric need to be in the form of list of lists
        # (ROUGE can admit multiple references per prediction)
        highlights = [[highlight] for highlight in batch["highlights"]]
        results = rouge.compute(
            predictions=summaries,
            references=highlights,
            rouge_types=["rouge1"],
        )
        batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE 1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processing Batch: 1
Processing Batch: 2
Processing Batch: 3
Processing Batch: 4
Processing Batch: 5
Processing Batch: 6
Processing Batch: 7
Processing Batch: 8
Processing Batch: 9
Processing Batch: 10
Final Rouge Score: 0.12556041350754338


Running the second prompt structure

(text) In short,

In [24]:
max_batches = 10
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    print(f"Processing Batch: {batch_number}")
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text}\n{prompt_template_summary_2}" for article_text in truncated_articles]
    prompt_batches = split_prompts_into_batches(prompts, batch_size=1)
    for prompt_batch in prompt_batches:
        summaries = model.generate(prompt_batch, long_generation_config).generation["text"]
        # Let's just take the first 3 sentences, split by periods
        summaries = [post_process_generations(summary) for summary in summaries]
        # References for the metric need to be in the form of list of lists
        # (ROUGE can admit multiple references per prediction)
        highlights = [[highlight] for highlight in batch["highlights"]]
        results = rouge.compute(
            predictions=summaries,
            references=highlights,
            rouge_types=["rouge1"],
        )
        batch_rouge_scores.append(results["rouge1"])
# Average all the ROUGE 1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processing Batch: 1
Processing Batch: 2
Processing Batch: 3
Processing Batch: 4
Processing Batch: 5
Processing Batch: 6
Processing Batch: 7
Processing Batch: 8
Processing Batch: 9
Processing Batch: 10
Final Rouge Score: 0.20367009239505895


The second prompt, as measured by ROUGE 1 scores, appears to produce summaries of higher quality than the first prompt. This is likely due to the way it is structured. It fits into the "generative" training setting a bit better than asking a point blank question.