In [1]:
import re
import time

import evaluate
import kscope
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

### Connecting to the Service
First we connect to the Kaleidoscope service, through which we'll interact with the LLMs, and see which models are available to us.

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all model instances that are currently active

In [3]:
client.model_instances

[]

To start, we obtain a handle to a model. In this example, the cell below loads the `falcon-7b` model.

In [4]:
model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)
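
The polling loop above waits indefinitely if the model never becomes active. A minimal sketch of the same idea with a timeout is shown below. It is not part of the kscope API: `wait_until_active`, `FakeModel`, and the `"LAUNCHING"` state name are hypothetical and only stand in for a real model handle that exposes a `state` attribute.

```python
import time


class FakeModel:
    """Hypothetical stand-in for a model handle; reports ACTIVE after a few polls."""

    def __init__(self, polls_until_active: int) -> None:
        self._remaining = polls_until_active

    @property
    def state(self) -> str:
        # "LAUNCHING" is an assumed placeholder name for any non-active state
        if self._remaining > 0:
            self._remaining -= 1
            return "LAUNCHING"
        return "ACTIVE"


def wait_until_active(model, timeout_s: float = 600.0, poll_interval_s: float = 1.0) -> bool:
    """Poll model.state until it is "ACTIVE"; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while model.state != "ACTIVE":
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True


demo_ready = wait_until_active(FakeModel(polls_until_active=3), timeout_s=5.0, poll_interval_s=0.01)
print(demo_ready)
```

With a real handle, the call would look like `wait_until_active(model)`, assuming the handle keeps reporting its status through `model.state` as in the cell above.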

In [5]:
long_generation_config = {"max_tokens": 128, "top_k": 4, "top_p": 1.0, "temperature": 0.5}

In [6]:
def post_process_generations(generation_text: str) -> str:
    # This simply attempts to extract the first three "sentences" within a generated string
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return "\n".join(split_text)
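
To see what this post-processing does in practice, the sketch below repeats the function and runs it on a made-up string (not model output). It shows two behaviours that matter when reading the summaries later in the notebook: an abbreviation such as "v." counts as a sentence end, and trailing text without a terminator is dropped.

```python
import re


def post_process_generations(generation_text: str) -> str:
    # Same logic as the cell above: keep the first three "sentences",
    # where a sentence is any run of characters ending in ".", "!" or "?"
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return "\n".join(split_text)


# Illustrative input only; the final fragment has no terminating punctuation
sample = "The justices will hear Bostock v. Clayton County on April 28. A ruling is expected later. More to follow"
processed = post_process_generations(sample)
print(processed)
```

The abbreviation splits the first real sentence in two, so only two full sentences survive the three-"sentence" cut.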

### Basic Prompts

Now let's create a basic prompt template that we can reuse for multiple text inputs. This will be an instruction prompt with an unconstrained answer space, as we're going to try to get the model to summarize texts. We'll try several different templates and examine performance for each. Note that this section relies only on "manual" or "human-level" inspection to determine the quality of the summary. At the bottom of this notebook, we measure the quality of two prompts on a sample of the CNN Daily Mail task using a ROUGE-1 score.

In [7]:
prompt_template_summary_1 = "Summarize the preceding text."
prompt_template_summary_2 = "TLDR;"

In [8]:
with open("news_summary_datasets/examples_news.txt", "r") as file:
    news_stories = [line.strip() for line in file.readlines()]

In [9]:
prompts_with_template_1 = [f"{news_story} {prompt_template_summary_1}" for news_story in news_stories]
prompts_with_template_2 = [f"{news_story} {prompt_template_summary_2}" for news_story in news_stories]

In these examples, we use the prompt structures

* (text) Summarize the preceding text.
* (text) TLDR;

In [10]:
print(f"{prompts_with_template_1[0]}\n")
print(prompts_with_template_2[0])

Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN. In many of those cases, Russia has then flown the equipment to Iran to dismantle and analyze, likely so the Iranian military can attempt to make their own version of the weapons, sources said. Russia believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine, the sources said. US officials don’t believe that the issue is widesprea

In [11]:
summaries_1 = [
    model.generate(prompt, long_generation_config).generation["sequences"][0] for prompt in prompts_with_template_1
]
print(f"Prompt: {prompt_template_summary_1}")
for summary, original_story in zip(summaries_1, news_stories):
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the preceding text.
Original Length: 1261, Summary Length: 423
The US has been providing Ukraine with weapons and equipment since the beginning of the war, and the Ukrainian military has been reporting to the US any losses of US-provided equipment to Russian forces.
The US believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine.
The US doesn’t believe that the issue is widespread or systematic.

Original Length: 1180, Summary Length: 151
m.
Friday.

Original Length: 1259, Summary Length: 243
The state’s request comes as the Supreme Court is considering a case that could have a major impact on the rights of transgender people.
The justices are scheduled to hear arguments in the case, Bostock v.
Clayton County, Georgia, on April 28.



1) The summary is fairly good, though a bit unclear.
2) The second captures the flash flood warning but hallucinates that it is for San Francisco. It is not.
3) The first sentence is a bit vague, but is relevant. The second sentence is a hallucination.

In [12]:
summaries_2 = [
    model.generate(prompt, long_generation_config).generation["sequences"][0] for prompt in prompts_with_template_2
]
print(f"Prompt: {prompt_template_summary_2}")
for summary, original_story in zip(summaries_2, news_stories):
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: TLDR;
Original Length: 1261, Summary Length: 345
Russia is sending captured US weapons to Iran to reverse engineer.
The US is concerned that Iran will use the captured weapons to attack US forces in the Middle East.
Source: Russia is sending captured US weapons to Iran to reverse engineer – CNNThe US is concerned that Iran will use the captured weapons to attack US forces in the Middle East.

Original Length: 1180, Summary Length: 248
California is in the middle of a massive storm that could cause widespread flooding and mudslides.
The post California braces for ‘historic’ storm appeared first on TheGrio.
The post California braces for ‘historic’ storm appeared first on TheGrio.

Original Length: 1259, Summary Length: 382
The state is asking the Supreme Court to allow it to enforce a law that prohibits transgender women and girls from participating in public school sports.
The law was temporarily blocked by a federal judge earlier this year.
The state is asking the Supreme Cour

1) The first sentence is a good gist. The second sentence is a hallucination. The third is an incorrect citation.
2) The first is a bit vague, but true given the story. The rest is not useful and likely hallucinated.
3) This is a good summary.

### Can we improve the results by providing additional context to our instructions?

In this example, we prompt the model to provide a summary and to do so in a compact way. We still post-process the text to grab the first three sentences, but hopefully the model tries to pack more information into those first sentences.

In [14]:
prompt_template_summary_3 = "Summarize the text in as few words as possible:"
prompts_with_template_3 = [f"{news_story} {prompt_template_summary_3}" for news_story in news_stories]
summaries_3 = [
    model.generate(prompt, long_generation_config).generation["sequences"][0] for prompt in prompts_with_template_3
]
for summary, original_story in zip(summaries_3, news_stories):
    print(f"Prompt: {prompt_template_summary_3}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Summarize the text in as few words as possible:
Original Length: 1261, Summary Length: 460
“The US believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine.
”The US believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine.
The US believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine.

Prompt: Summarize the text in as few words as possible:
Original Length: 1180, Summary Length: 537
“The most dangerous amount of rain could impact nearly 70,000 people along the central California coast, stretching from Salinas southward to San Luis Obispo and including parts of Ventura and Monterey counties.

Prompt: Summarize the text in as few words as possible:
Original Length: 1259, Summary Length: 336
The state is asking the Supr

Ignoring the repetition:
1) The first is relevant to the article, but not necessarily a great summary
2) Again, relevant to the article, but not a great summary.
3) A decent summary
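
Rather than ignoring the repetition by eye, the post-processing step could drop repeated sentences before keeping the first three. The sketch below is one hypothetical way to do that; `drop_repeated_sentences` is not used elsewhere in this notebook, and the input string is made up for illustration.

```python
import re


def drop_repeated_sentences(generation_text: str, max_sentences: int = 3) -> str:
    """Keep the first `max_sentences` distinct sentences, in order of first appearance."""
    sentences = [s.strip() for s in re.findall(r".*?[.!\?]", generation_text)]
    seen = set()
    unique_sentences = []
    for sentence in sentences:
        # Normalize lightly so that stray quotes or casing do not hide a repeat
        key = sentence.strip("“”\"' ").lower()
        if key and key not in seen:
            seen.add(key)
            unique_sentences.append(sentence)
    return "\n".join(unique_sentences[:max_sentences])


# Illustrative input only, not model output
repeated = "The state is asking the court to act. The state is asking the court to act. A judge blocked the law."
deduplicated = drop_repeated_sentences(repeated)
print(deduplicated)
```

This only removes exact repeats after light normalization; paraphrased repetition would still pass through.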

OPT, and generative models in general, have been reported to perform better when not prompted with "declarative" instructions or direct interrogatives (see the [OPT Paper](https://arxiv.org/abs/2205.01068)). As such, let's ask for the summary as a question!

In [15]:
prompt_template_summary_4 = "Briefly, what is this story about?"
prompts_with_template_4 = [f"{news_story}\n{prompt_template_summary_4}" for news_story in news_stories]
summaries_4 = [
    model.generate(prompt, long_generation_config).generation["sequences"][0] for prompt in prompts_with_template_4
]
for summary, original_story in zip(summaries_4, news_stories):
    print(f"Prompt: {prompt_template_summary_4}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: Briefly, what is this story about?
Original Length: 1261, Summary Length: 277
The story is about the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sent to Iran.
What is the significance of this story?
The story is about the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sent to Iran.

Prompt: Briefly, what is this story about?
Original Length: 1180, Summary Length: 154
The story is about the California weather.
What is the main point of this story?
The main point of this story is that California is getting a lot of rain.

Prompt: Briefly, what is this story about?
Original Length: 1259, Summary Length: 361
The West Virginia law, which was enacted in 2021, prohibits transgender women and girls from participating in public school sports.
The law was challenged by a transgender student athlete who sued the state, and a district court temporarily blocked the law three months after it was enacted.
But earlier thi

Ignoring the repetition:
1) More of a preview than a summary
2) Very vague
3) A good summary

As a final example, rather than asking a question, we put the task in a context that might be more natural for a generative model. That is, we ask it to "sum up" the article with a natural phrase prefix to be completed in a "conversational" way.

In [16]:
prompt_template_summary_5 = "In short,"
prompts_with_template_5 = [f"{news_story} {prompt_template_summary_5}" for news_story in news_stories]
summaries_5 = [
    model.generate(prompt, long_generation_config).generation["sequences"][0] for prompt in prompts_with_template_5
]
for summary, original_story in zip(summaries_5, news_stories):
    print(f"Prompt: {prompt_template_summary_5}")
    # Let's just take the first 3 sentences, split by periods
    summary = post_process_generations(summary)
    print(f"Original Length: {len(original_story)}, Summary Length: {len(summary)}")
    print(summary)
    print("====================================================================================")
    print("")

Prompt: In short,
Original Length: 1261, Summary Length: 411
the US is concerned that Russia is trying to use captured Western weapons to help Iran reverse-engineer the systems, the sources said.
The US has been providing Ukraine with Javelin anti-tank missiles, Stinger anti-aircraft missiles and other weapons since the beginning of the war.
The US has also provided Ukraine with a variety of other weapons, including small arms, ammunition, drones, and other equipment.

Prompt: In short,
Original Length: 1180, Summary Length: 279
the storm is expected to bring “life-threatening” flooding to the region, according to the National Weather Service.
“The combination of heavy rain and melting snow will result in widespread and potentially historic flooding,” the NWS said.
“This is a life-threatening situation.

Prompt: In short,
Original Length: 1259, Summary Length: 385
the state is asking the Supreme Court to allow the law to go into effect while the case is pending.
The state’s request com

1) A pretty good summary, but missing a few major points.
2) A good summary, except that the location of interest is not mentioned.
3) Sort of a good summary, but the final point is not true; the story is about *West* Virginia transgender athletes participating in sports.

### Measuring Performance on CNN Daily Mail

In [17]:
dataset = load_dataset("cnn_dailymail", "3.0.0")

Found cached dataset cnn_dailymail (/Users/david/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/1b3c71476f6d152c31c1730e83ccb08bcf23e348233f4fcc11e182248e6bf7de)


  0%|          | 0/3 [00:00<?, ?it/s]

We load the data from the CNN Daily Mail test set, the ROUGE metric scorer from Hugging Face, and a tokenizer from OPT. The tokenizer is used to truncate the text so that it fits into the OPT model context. We truncate the text to 1023 tokens, so that it is of length 1024 when the start-of-sentence token (`<s>`) is added.

__NOTE__: All OPT models, regardless of size, use the same tokenizer. However, if you want to use a different type of model, a different tokenizer may be needed.

In [18]:
opt_tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
dataloader = DataLoader(dataset["test"], shuffle=False, batch_size=1)
rouge = evaluate.load("rouge")
prompt_template_summary_1 = "How would you briefly summarize the text?"
prompt_template_summary_2 = "In short,"

In [19]:
def truncate_article_text(article_text: str, tokenizer: AutoTokenizer, max_sequence_length: int = 1023) -> str:
    tokenized_article = tokenizer.encode(article_text, truncation=True, max_length=max_sequence_length)
    return tokenizer.decode(tokenized_article, skip_special_tokens=True)
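
The budget arithmetic behind the 1023-token limit (one slot reserved for the start-of-sentence token in a 1024-token context) can be illustrated with a toy example. This sketch deliberately does not use the OPT tokenizer: it treats whitespace-separated words as tokens, and `truncate_to_budget`, `context_length`, and `reserved_tokens` are hypothetical names chosen for illustration.

```python
def truncate_to_budget(text: str, context_length: int = 1024, reserved_tokens: int = 1) -> str:
    """Toy truncation: keep at most (context_length - reserved_tokens) whitespace tokens."""
    budget = context_length - reserved_tokens
    tokens = text.split()
    return " ".join(tokens[:budget])


# A made-up "article" of 2000 whitespace tokens
long_article = " ".join(f"w{i}" for i in range(2000))
truncated = truncate_to_budget(long_article)
print(len(truncated.split()))
```

A real subword tokenizer generally produces more tokens than there are words, so the same text would be cut earlier than in this toy version.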

We'll try two different prompts from the examples above and consider how well they each do in terms of rouge score against reference summaries on the CNN Daily Mail task, which is a common summarization benchmark. You can see a discussion of this dataset here: [CNN Daily Mail](https://huggingface.co/datasets/cnn_dailymail). 

__Note__: On a big model, like OPT-175B, this process will likely take a bit of time, given the length of the articles and the fact that we are asking for 50 summaries.
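
For intuition about what a ROUGE-1 score measures, here is a simplified sketch of the F1 variant: the overlap of single words (unigrams) between a generated summary and a reference. It is an assumption-laden illustration, not the Hugging Face implementation, which applies its own tokenization and options; `rouge1_f1` is a hypothetical helper and the two sentences are made up.

```python
import re
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between prediction and reference."""
    pred_tokens = re.findall(r"[a-z0-9]+", prediction.lower())
    ref_tokens = re.findall(r"[a-z0-9]+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each word's matches to its count in the other text
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative strings only: 4 shared words, 6 predicted words, 7 reference words
score = rouge1_f1("the storm will bring heavy rain", "heavy rain is expected from the storm")
print(round(score, 3))
```

Because only word overlap is counted, a summary can score well while being poorly ordered or factually wrong, which is why the manual inspection earlier in the notebook is still useful.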

Running the first prompt structure

(text) How would you briefly summarize the text?

In [20]:
max_batches = 50
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text}\n{prompt_template_summary_1}" for article_text in truncated_articles]
    summaries = model.generate(prompts, long_generation_config).generation["sequences"]
    # Let's just take the first 3 sentences, split by periods
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
    print(f"Processed Batch: {batch_number}")
# Average all the ROUGE 1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processed Batch: 1
Processed Batch: 2
Processed Batch: 3
Processed Batch: 4
Processed Batch: 5
Processed Batch: 6
Processed Batch: 7
Processed Batch: 8
Processed Batch: 9
Processed Batch: 10
Processed Batch: 11
Processed Batch: 12
Processed Batch: 13
Processed Batch: 14
Processed Batch: 15
Processed Batch: 16
Processed Batch: 17
Processed Batch: 18
Processed Batch: 19
Processed Batch: 20
Processed Batch: 21
Processed Batch: 22
Processed Batch: 23
Processed Batch: 24
Processed Batch: 25
Processed Batch: 26
Processed Batch: 27
Processed Batch: 28
Processed Batch: 29
Processed Batch: 30
Processed Batch: 31
Processed Batch: 32
Processed Batch: 33
Processed Batch: 34
Processed Batch: 35
Processed Batch: 36
Processed Batch: 37
Processed Batch: 38
Processed Batch: 39
Processed Batch: 40
Processed Batch: 41
Processed Batch: 42
Processed Batch: 43
Processed Batch: 44
Processed Batch: 45
Processed Batch: 46
Processed Batch: 47
Processed Batch: 48
Processed Batch: 49
Processed Batch: 50
Final Rou

Running the second prompt structure

(text) In short,

In [21]:
max_batches = 50
batch_rouge_scores = []
for batch_number, batch in enumerate(dataloader, 1):
    if batch_number > max_batches:
        break
    truncated_articles = [truncate_article_text(text, opt_tokenizer) for text in batch["article"]]
    prompts = [f"{article_text}\n{prompt_template_summary_2}" for article_text in truncated_articles]
    summaries = model.generate(prompts, long_generation_config).generation["sequences"]
    # Let's just take the first 3 sentences, split by periods
    summaries = [post_process_generations(summary) for summary in summaries]
    # References for the metric need to be in the form of list of lists
    # (ROUGE can admit multiple references per prediction)
    highlights = [[highlight] for highlight in batch["highlights"]]
    results = rouge.compute(
        predictions=summaries,
        references=highlights,
        rouge_types=["rouge1"],
    )
    batch_rouge_scores.append(results["rouge1"])
    print(f"Processed Batch: {batch_number}")
# Average all the ROUGE 1 scores together for the final one
print(f"Final Rouge Score: {sum(batch_rouge_scores)/len(batch_rouge_scores)}")

Processed Batch: 1
Processed Batch: 2
Processed Batch: 3
Processed Batch: 4
Processed Batch: 5
Processed Batch: 6
Processed Batch: 7
Processed Batch: 8
Processed Batch: 9
Processed Batch: 10
Processed Batch: 11
Processed Batch: 12
Processed Batch: 13
Processed Batch: 14
Processed Batch: 15
Processed Batch: 16
Processed Batch: 17
Processed Batch: 18
Processed Batch: 19
Processed Batch: 20
Processed Batch: 21
Processed Batch: 22
Processed Batch: 23
Processed Batch: 24
Processed Batch: 25
Processed Batch: 26
Processed Batch: 27
Processed Batch: 28
Processed Batch: 29
Processed Batch: 30
Processed Batch: 31
Processed Batch: 32
Processed Batch: 33
Processed Batch: 34
Processed Batch: 35
Processed Batch: 36
Processed Batch: 37
Processed Batch: 38
Processed Batch: 39
Processed Batch: 40
Processed Batch: 41
Processed Batch: 42
Processed Batch: 43
Processed Batch: 44
Processed Batch: 45
Processed Batch: 46
Processed Batch: 47
Processed Batch: 48
Processed Batch: 49
Processed Batch: 50
Final Rou

The second prompt, as measured by ROUGE-1 scores, appears to produce summaries of higher quality than the first prompt. This is likely due to the way it is structured: it fits into the "generative" training setting a bit better than asking a point-blank question.