In [1]:
import re
import time
from collections import Counter
from typing import List

import kscope
from tqdm import tqdm

### Connecting to the Service
First, we connect to the Kaleidoscope service, through which we'll interact with the LLMs, and see which models are available to us.

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all model instances that are currently active.

In [3]:
client.model_instances

[{'id': 'e0c9bcf8-0495-4975-a394-5ea8167ee5b7',
  'name': 'llama2-7b',
  'state': 'ACTIVE'},
 {'id': '011f033b-261a-4dfe-b504-09330f61d83f',
  'name': 'llama2-7b_chat',
  'state': 'ACTIVE'},
 {'id': 'ba9f4677-c765-48f4-86d1-5a795ca68dba',
  'name': 'falcon-7b',
  'state': 'ACTIVE'},
 {'id': 'c8b99d4a-7972-4f92-98ab-4b8108619782',
  'name': 'falcon-40b',
  'state': 'ACTIVE'}]

To start, we obtain a handle to a model. In this example, let's use the Falcon-7B model.

In [4]:
model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)
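
The loop above polls indefinitely if the model never becomes active. A minimal sketch of the same wait with a timeout is shown below. It assumes only that the handle exposes a `state` attribute, as used above; `wait_until_active`, `timeout_s`, and `poll_interval_s` are hypothetical names and not part of kscope.

```python
import time


def wait_until_active(model, timeout_s: float = 600.0, poll_interval_s: float = 1.0) -> None:
    """Block until model.state == "ACTIVE", or raise TimeoutError.

    Assumes only that `model` exposes a `state` attribute (hypothetical helper).
    """
    deadline = time.monotonic() + timeout_s
    while model.state != "ACTIVE":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Model still in state {model.state!r} after {timeout_s} seconds")
        time.sleep(poll_interval_s)
```

With the handle from the cell above, this would be called as `wait_until_active(model)`.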

In [5]:
small_generation_config = {"max_tokens": 20, "top_k": 1}
# top_k of 40 and temperature of 0.7 coincide with the PaLM-540B settings from the original paper
moderate_generation_config = {"max_tokens": 100, "top_k": 40, "temperature": 0.7, "do_sample": True}
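
To make the difference between the two configurations concrete, here is a toy sketch of top-k sampling with temperature over a hand-written set of scores. It assumes `top_k` and `temperature` carry their usual meaning; it is an illustration only, not how kscope or the model backend implements decoding. The function `sample_top_k` and the `toy_logits` values are made up for this example.

```python
import math
import random
from typing import Dict, Optional


def sample_top_k(logits: Dict[str, float], top_k: int, temperature: float = 1.0,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token from the top_k highest-scoring tokens after temperature scaling."""
    rng = rng or random.Random()
    # Keep only the top_k tokens with the largest logits.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax weights over the survivors (shifted by the max for stability).
    max_logit = top[0][1]
    weights = [math.exp((logit - max_logit) / temperature) for _, logit in top]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"48": 2.0, "42": 1.5, "9": 0.5, "3": -1.0}  # made-up scores
print(sample_top_k(toy_logits, top_k=1))  # top_k=1 always picks the highest-scoring token: 48
print(sample_top_k(toy_logits, top_k=40, temperature=0.7))  # can differ from call to call
```

With `top_k=1` the choice is deterministic, which is why the small configuration gives repeatable output, while the moderate configuration can produce a different reasoning trace on each call.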

Let's ask the model a simple question to start.

In [6]:
generation = model.generate("What is the capital of Canada?", small_generation_config)
# Extract the text from the returned generation
print(generation.generation["sequences"][0])


Ottawa is the capital of Canada.
What is the capital of Canada?
Ottawa


# Self-Consistency in Chain of Thought

The self-consistent chain-of-thought (CoT) prompting method was originally proposed in ["Self-Consistency Improves Chain Of Thought Reasoning In Language Models"](https://arxiv.org/pdf/2203.11171.pdf). The approach uses the stochasticity of an LLM's decoding process (through sampling) to generate distinct reasoning traces and extract a consensus from the resulting collection of answers. The method uses few-shot chain-of-thought prompting to elicit coherent reasoning traces.

We'll start by prompting Falcon-7B, the model loaded above, to solve a word problem through self-consistent CoT as an example.

First, let's see what happens if we try to solve the word problem with a zero-shot prompt.

In [7]:
question = (
    "Sam had 15 socks. If he threw away 3 old ones that he didn't like and bought 36 new ones, "
    "how many socks would he have?"
)
zero_shot_prompt = f"Q: {question}\nA: The answer is"

print(zero_shot_prompt)

Q: Sam had 15 socks. If he threw away 3 old ones that he didn't like and bought 36 new ones, how many socks would he have?
A: The answer is


In [8]:
generation_example = model.generate(zero_shot_prompt, generation_config=small_generation_config)
print(generation_example.generation["sequences"][0])

 42.
Q: Sam had 15 socks. If he threw away 3 old ones


The correct answer to this word problem is 48 (15 - 3 + 36). Zero-shot prompting produced 42, which is incorrect.
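
The arithmetic can be checked directly. The raw completion also runs on into a new "Q:" after the answer, so a small helper can trim it to the first answer. This is a sketch only: `truncate_at_next_question` is a hypothetical helper, and `raw_output` is copied from the output shown above.

```python
import re

# Arithmetic for the sock problem: start with 15, discard 3, add 36.
expected_answer = 15 - 3 + 36
print(expected_answer)  # 48


def truncate_at_next_question(text: str) -> str:
    """Return the text up to (not including) the first line that starts a new 'Q:'."""
    return re.split(r"\n\s*Q:", text, maxsplit=1)[0].strip()


raw_output = " 42.\nQ: Sam had 15 socks. If he threw away 3 old ones"
print(truncate_at_next_question(raw_output))  # 42.
```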

Now we'll construct a 5-shot CoT prompt, which we'll use to produce multiple reasoning traces and answers to our question.

In [9]:
few_shot_cot_examples = (
    "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, "
    "there will be 21 trees. How many trees did the grove workers plant today?\nA: We start with 15 trees. Later we "
    "have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 "
    "trees. The answer is 6\n\nQ: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in "
    "the parking lot?\nA: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. "
    "The answer is 5.\n\nQ: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they "
    "have left in total?\nA: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + "
    "42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39."
    "\n\nQ: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did "
    "Jason give to Denny?\nA: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. "
    "The number of lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.\n\n"
    "Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\nA: She bought 5 "
    "bagels for $3 each. This means she spent 5 * $3 = $15 on the bagels. She had $23 in beginning, so now she has "
    "$23 - $15 = $8. The answer is 8."
)
few_shot_cot_prompt = f"{few_shot_cot_examples}\n\nQ: {question}"
print(few_shot_cot_prompt)

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted. So, they must have planted 21 - 15 = 6 trees. The answer is 6

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74 chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to
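
The prompt above is written as one long string. As a sketch of an alternative, the same layout can be assembled from (question, worked answer) pairs, which makes it easier to swap examples in and out. `build_few_shot_prompt` and `toy_examples` are hypothetical names introduced here; the example pair is taken from the prompt above.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Join (question, worked answer) pairs and append the unanswered target question."""
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n\n".join(shots + [f"Q: {question}"])


toy_examples = [
    (
        "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
        "There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. "
        "The answer is 5.",
    ),
]
target_question = (
    "Sam had 15 socks. If he threw away 3 old ones that he didn't like and bought 36 new ones, "
    "how many socks would he have?"
)
print(build_few_shot_prompt(toy_examples, target_question))
```

The prompt ends with the bare target question, so the model is left to write the "A:" line and its reasoning itself, as in the generations that follow.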

Now that we have our few-shot examples and prompt set up, we generate 5 reasoning traces and answers to examine their diversity.

In [10]:
n_traces_to_generate = 5
for i in range(n_traces_to_generate):
    generation_example = model.generate(few_shot_cot_prompt, generation_config=moderate_generation_config)
    generated_text = generation_example.generation["sequences"][0]
    print(f"Generation {i+1}:\n{generated_text}")

Generation 1:

A: Sam had 15 socks. He threw away 3 of the old ones and bought 36 new ones. So he has 15 - 3 + 36 = 48 socks now. The answer is 48.

Q: The total number of students in a school is 240. If 30% of the students are boys, what percent of the students are girls?
A: The total number of students in the school is 240.
Generation 2:

A: Sam had 15 socks. If he threw away 3 old ones and bought 36 new ones, he would have 15 - 3 - 36 = 48 socks. The answer is 48.

Q: The square is 2 inches. If the side is cut in half, the square becomes 1 inch. What number of inches did the square have originally?
A: The square is 2 inches. So if the side is
Generation 3:

A: Sam had 15 socks. He threw away 3 old socks and bought 36 new socks. That means he has 15 - 3 = 12 socks left. If he throws away 3 socks, he will have 12 - 3 = 9 socks left. The answer is 9.

Q: What shape is a square with an area of 12 cm?
A: A square has four equal sides. The area of
Generation 4:

A: Sam had 15 socks. If he

Each of the reasoning traces is slightly different. While the model does not reach the correct answer in every trace, the correct answer is present in some of the generations. The self-consistency paper describes several ways to aggregate the answers (what the authors term "marginalizing out the reasoning"). However, the simplest approach, and in their results one of the best-performing, is simple voting. So let's try that out.

In [11]:
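def extract_answer_from_response(response: str) -> str:
    # Pull the integer that follows the first "The answer is" in the generated text.
    match = re.search(r"The answer is (\d+)", response)
    if match:
        return match.group(1)
    # No final answer found (e.g. the trace was cut off); fall back to a placeholder vote of "0.0".
    print(f"Failed to match in response: {response}")
    return "0.0"


def compute_majority_response(answers: List[float]) -> float:
    # most_common(1) returns [(answer, count)] for the most frequent answer.
    return Counter(answers).most_common(1)[0][0]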
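
Note that the "0.0" fallback above counts as a vote, so many failed extractions could in principle win the majority. A hedged alternative is to drop unparsed traces before voting. The helpers `extract_answer_or_none` and `majority_vote` are hypothetical variants, not part of the original notebook, and `toy_responses` is hand-written for illustration.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer_or_none(response: str) -> Optional[float]:
    """Return the first number following 'The answer is', or None if absent."""
    match = re.search(r"The answer is \$?(-?\d+(?:\.\d+)?)", response)
    return float(match.group(1)) if match else None


def majority_vote(answers: List[Optional[float]]) -> Optional[float]:
    """Return the most common non-None answer, or None if nothing was parsed."""
    parsed = [answer for answer in answers if answer is not None]
    if not parsed:
        return None
    return Counter(parsed).most_common(1)[0][0]


toy_responses = [
    "A: 15 - 3 + 36 = 48. The answer is 48.",
    "A: 15 - 3 = 12, then 12 - 3 = 9. The answer is 9.",
    "A: Sam had 15 socks. If he",  # truncated trace with no final answer
    "A: He has 15 - 3 + 36 = 48 socks. The answer is 48.",
]
print(majority_vote([extract_answer_or_none(r) for r in toy_responses]))  # 48.0
```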

In [12]:
n_traces_to_generate = 20
responses = []
for i in tqdm(range(n_traces_to_generate)):
    generation_example = model.generate(few_shot_cot_prompt, generation_config=moderate_generation_config)
    generated_text = generation_example.generation["sequences"][0]
    responses.append(generated_text)

answers = [float(extract_answer_from_response(response)) for response in responses]

100%|██████████| 20/20 [04:29<00:00, 13.48s/it]


In [13]:
print(f"The answer, as determined by self-consistent prompting is: {compute_majority_response(answers)}")

The answer, as determined by self-consistent prompting is: 48.0
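
Beyond the winning answer, the share of traces that agree with it gives a rough sense of how decisive the vote was. The sketch below uses a hypothetical helper, `vote_summary`, and illustrative values in `toy_answers` that are not taken from the run above.

```python
from collections import Counter
from typing import List, Tuple


def vote_summary(answers: List[float]) -> Tuple[float, float]:
    """Return the majority answer and the fraction of traces that voted for it."""
    if not answers:
        raise ValueError("answers must not be empty")
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)


toy_answers = [48.0, 48.0, 9.0, 48.0, 0.0]  # illustrative values, not model output
winner, agreement = vote_summary(toy_answers)
print(f"Majority answer: {winner} ({agreement:.0%} of traces)")  # Majority answer: 48.0 (60% of traces)
```

In the notebook, the same function could be applied to the `answers` list computed above.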
