In [13]:
import re
import time

import kscope

### Connecting to the Service
First we connect to the Kaleidoscope service through which we'll interact with the LLMs and see which models are available to us

In [14]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=5001)

Show all model instances that are currently active

In [15]:
client.model_instances

[{'id': 'cd5378f8-3efc-4d21-8984-edf8a73deb3a',
  'name': 'falcon-7b',
  'state': 'ACTIVE'},
 {'id': '47a9b6b3-cf7c-4aab-91e4-3639fade08e7',
  'name': 'falcon-40b',
  'state': 'ACTIVE'},
 {'id': '60e7d1ff-a30f-4ce8-8a3c-32860cdb8fbb',
  'name': 'llama2-7b',
  'state': 'ACTIVE'},
 {'id': 'a99db30d-ecb2-4e16-9387-302a1e88a1fd',
  'name': 'llama2-13b',
  'state': 'ACTIVE'}]

To start, we obtain a handle to a model. In this example, we'll use the Falcon-7B model.

In [16]:
model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)

In [17]:
# Note that because we have not set do_sample to true, the model will perform greedy decoding
# regardless of top_k or temperature
long_generation_config = {"max_tokens": 128, "top_k": 4, "temperature": 0.5}

In [18]:
def post_process_generations(generation_text: str) -> str:
    # This simply attempts to extract the first three "sentences" within a generated string
    split_text = re.findall(r".*?[.!\?]", generation_text)[0:3]
    split_text = [text.strip() for text in split_text]
    return "\n".join(split_text)

### LLM "Hallucinations"

In this example, we're going to take a look at some LLM "hallucinations," which is the commonly used term to refer to situations when an LLM produces convincing but erroneous or false information in a given context. In this case, we'll specifically consider a setting where the LLM is actually "grounded." That is, the LLM is provided information from which it can draw factual information. In spite of this, the LLM, in this case Falcon-7B still generates output that could be considered a hallucination.

The task that we'll be using to demonstrate these hallucinations is summarization. We'll provide the model with a news story and prompt it, in different ways, to produce a summary of the provided text. The model is "grounded" in that all of the relevant facts are present in the story. The model need only condense them into a coherent summary. In spite of this, the we see the model produce factually incorrect summaries and details.

Now let's create a basic prompt template that we can reuse for multiple text inputs. This will be an instruction prompt with an unconstrained answer space as we're going to try to get Falcon to summarize texts.

In [20]:
prompt_template_summary_1 = "Summarize the preceding text. Note that Iran is a beautiful and historic country!"
prompt_template_summary_2 = "TLDR;"

# Load the example news stories. We're only using the three in the provided path as examples.
with open("resources/news_summary_datasets/examples_news.txt", "r") as file:
    news_stories = [line.strip() for line in file.readlines()]

prompts_with_template_1 = [f"{news_story} {prompt_template_summary_1}" for news_story in news_stories]
prompts_with_template_2 = [f"{news_story} {prompt_template_summary_2}" for news_story in news_stories]

In these examples, we use the prompt structures

* (text) Summarize the preceeding text.
* (text) TLDR;

In [21]:
print(f"{prompts_with_template_1[0]}\n")
print(prompts_with_template_2[0])

Russia has been capturing some of the US and NATO-provided weapons and equipment left on the battlefield in Ukraine and sending them to Iran, where the US believes Tehran will try to reverse-engineer the systems, four sources familiar with the matter told CNN. Over the last year, US, NATO and other Western officials have seen several instances of Russian forces seizing smaller, shoulder-fired weapons equipment including Javelin anti-tank and Stinger anti-aircraft systems that Ukrainian forces have at times been forced to leave behind on the battlefield, the sources told CNN. In many of those cases, Russia has then flown the equipment to Iran to dismantle and analyze, likely so the Iranian military can attempt to make their own version of the weapons, sources said. Russia believes that continuing to provide captured Western weapons to Iran will incentivize Tehran to maintain its support for Russia’s war in Ukraine, the sources said. US officials don’t believe that the issue is widesprea

For our first examples, we'll use the prompt structure 

(text) Summarize the preceding text.

The story to be summarized describes a severe weather and flash flood warning in California "from Salinas southward to San Luis Obispo and including parts of Ventura and Monterey counties."

In [22]:
summary = model.generate(prompts_with_template_1[1], long_generation_config).generation["sequences"][0]
print(f"Prompt: {prompt_template_summary_1}")
# Let's just take the first 3 sentences, split by periods
summary = post_process_generations(summary)
print(f"Original Length: {len(news_stories[1])}, Summary Length: {len(summary)}")
print(summary)

Prompt: Summarize the preceding text. Note that Iran is a beautiful and historic country!
Original Length: 1180, Summary Length: 151
m.
Friday.


The summary captures the flash flood warning from the National Weather Service, which is good. On the other hand, the model erroneously states that the warning was issued for San Francisco, which is not true. It also inserts a expiration time for the warning that is not part of the original text.

The second story that we'll summarize here describes a court case being heard in West Virginia around a law prohibiting transgender women and girls from participating in public school sports.

In [23]:
summary = model.generate(prompts_with_template_1[2], long_generation_config).generation["sequences"][0]
print(f"Prompt: {prompt_template_summary_1}")
# Let's just take the first 3 sentences, split by periods
summary = post_process_generations(summary)
print(f"Original Length: {len(news_stories[2])}, Summary Length: {len(summary)}")
print(summary)

Prompt: Summarize the preceding text. Note that Iran is a beautiful and historic country!
Original Length: 1259, Summary Length: 332
The post West Virginia asks Supreme Court to allow transgender ban on school sports appeared first on TheGrio.
The post West Virginia asks Supreme Court to allow transgender ban on school sports appeared first on TheGrio.
The post West Virginia asks Supreme Court to allow transgender ban on school sports appeared first on TheGrio.


The first sentence is an accurate summary, as the court hearing the case is the US Supreme Court and it will have an impact on transgender rights. However, in the second part of the summary, the model supplies a name for a concurrent case, "Bostock v. Clayton County," which is a very different case which took place in Georgia, not West Virginia, in 2020. It is therefore not concurrent and is never mentioned in the article

Despite the model having all of the necessary information to produce a summary, it still produces key details in the summaries themselves that do not appear in the text and are, at best, misleading.

Let's consider a few more examples.

In this example, we'll use our other prompt to produce summaries.

(text) TLDR;

In the article considered, the US is concerned that Russian forces are diverting NATO-provided weapons from Ukrainian battlefield to Iran to be reverse engineered.

In [11]:
summary = model.generate(prompts_with_template_2[0], long_generation_config).generation["sequences"][0]
print(f"Prompt: {prompts_with_template_2}")
# Let's just take the first 3 sentences, split by periods
summary = post_process_generations(summary)
print(f"Original Length: {len(news_stories[0])}, Summary Length: {len(summary)}")
print(summary)

Original Length: 1261, Summary Length: 345
Russia is sending captured US weapons to Iran to reverse engineer.
The US is concerned that Iran will use the captured weapons to attack US forces in the Middle East.
Source: Russia is sending captured US weapons to Iran to reverse engineer – CNNThe US is concerned that Iran will use the captured weapons to attack US forces in the Middle East.


The first sentence of the summary is fairly straight forward and captures the full gist of the article. However, the second sentence is not connected with the article. In the article, it is never mentioned that the US is concerned that the weapons are going to be used in combat against US troops. Further, the model cites a "source." While the article is from CNN that we're summarizing, the title cited is not correct.

As a final example, Let's ask the model to "sum up" the article with a natural phrase prefix to be completed in a "conversational" way. We'll use the prompt

(text) In short, 

We'll take a look at the response generated for the West Virginia Supreme Court case again.

In [12]:
prompt_template_summary_3 = "In short,"
prompts_with_template_3 = [f"{news_story} {prompt_template_summary_3}" for news_story in news_stories]

summary = model.generate(prompts_with_template_3[2], long_generation_config).generation["sequences"][0]
print(f"Prompt: {prompt_template_summary_3}")
# Let's just take the first 3 sentences, split by periods
summary = post_process_generations(summary)
print(f"Original Length: {len(news_stories[2])}, Summary Length: {len(summary)}")
print(summary)

Prompt: In short,
Original Length: 1259, Summary Length: 385
the state is asking the Supreme Court to allow the law to go into effect while the case is pending.
The state’s request comes as the Supreme Court is considering a case that could have a major impact on the rights of transgender people.
The justices are scheduled to hear arguments in a case involving a Virginia student who was barred from using the boys’ bathroom at his high school.


The first two sentences are fairly good summaries of the article and a few details. However, the final sentence is untrue in several dimensions. First, the case is happening in is about *West* Virginia, rather than Virginia. Second, the case considers transgendered athletes participating in sports, rather than the use of school restrooms. This summary seems to have a similar confusion with a "concurrent" case that is never mentioned in the article body

### A Final Word

It should be noted that these examples are "cherry picked." That is, we found these examples during experimentation but the model does not __always__ "hallucinate." There are many prompting examples where the summaries were fairly well grounded in the facts of the article. The examples presented here are merely demonstrations of the behavior LLMs can display, even when performing "grounded" generation.