In [1]:
import json
import re
import time
from typing import List, Optional

import evaluate
import kscope
from tqdm import tqdm
from utils import split_prompts_into_batches

# Getting Started

Documentation on how to interact with the large models is available [here](https://kaleidoscope-sdk.readthedocs.io/en/latest/). The SDK's GitHub repository is [here](https://github.com/VectorInstitute/kaleidoscope-sdk), and the underlying code is [here](https://github.com/VectorInstitute/kaleidoscope).

First, we connect to the service through which we'll interact with the LLMs and see which models are available to us.

In [2]:
# Establish a client connection to the kscope service
client = kscope.Client(gateway_host="llm.cluster.local", gateway_port=3001)

Show all supported models

In [3]:
client.models

['gpt2',
 'llama2-7b',
 'llama2-7b_chat',
 'llama2-13b',
 'llama2-13b_chat',
 'llama2-70b',
 'llama2-70b_chat',
 'falcon-7b',
 'falcon-40b',
 'sdxl-turbo']

Show all model instances that are currently active

In [4]:
client.model_instances

[{'id': 'b56efa65-0477-43fc-9028-62f78983e86f',
  'name': 'falcon-40b',
  'state': 'LAUNCHING'},
 {'id': 'fa78651c-0876-4e60-b016-22998efc6377',
  'name': 'llama2-70b',
  'state': 'LAUNCHING'},
 {'id': 'a8675b3a-4898-438e-a772-ddc70bd1357e',
  'name': 'llama2-7b',
  'state': 'ACTIVE'},
 {'id': '815ca1d4-d4df-4f0b-8c00-41a0fb82a3c0',
  'name': 'falcon-7b',
  'state': 'ACTIVE'}]

To start, we obtain a handle to a model. In this example, let's use the Falcon 7B model.

In [5]:
model = client.load_model("falcon-7b")
# If this model is not actively running, it will get launched in the background.
# In this case, wait until it moves into an "ACTIVE" state before proceeding.
while model.state != "ACTIVE":
    time.sleep(1)
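
The polling loop above waits indefinitely if the model never reaches the `ACTIVE` state. As a minimal sketch, a bounded wait could look like the following. The `wait_until` helper, its parameter names, and its default timeout are hypothetical and are not part of the kscope SDK.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout_s: float = 600.0, poll_s: float = 1.0) -> None:
    """Poll `predicate` until it returns True, or raise TimeoutError once `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout_s} seconds")
        time.sleep(poll_s)


# Assumed usage with the model handle from the cell above:
# wait_until(lambda: model.state == "ACTIVE")
```

A timeout of this kind makes a stalled launch visible as an error instead of a notebook cell that never finishes.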

Next, we configure how the model generates text by setting a few important parameters. For a discussion of the configuration parameters, see `src/reference_implementations/prompting_vector_llms/CONFIG_README.md`.

In [6]:
long_generation_config = {"max_tokens": 75, "top_k": 4, "top_p": 1.0, "temperature": 0.5}
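
To build intuition for these parameters, here is a small sketch of how `temperature`, `top_k`, and `top_p` are commonly combined to reshape a next-token distribution. It follows the usual interpretation (temperature scaling, then top-k truncation, then nucleus truncation) and is an assumption about typical behavior, not a description of how the kscope backend implements them. The function name and the example logits are hypothetical.

```python
import math
from typing import List


def filtered_distribution(logits: List[float], temperature: float, top_k: int, top_p: float) -> List[float]:
    """Return the sampling distribution after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1 sharpen the distribution.
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix of those tokens whose cumulative mass reaches top_p.
    kept = []
    cumulative = 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index] / top_k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


example_logits = [2.0, 1.0, 0.5, 0.1, -1.0, -3.0]  # hypothetical values
print(filtered_distribution(example_logits, temperature=0.5, top_k=4, top_p=1.0))
```

With `top_k=4` and `top_p=1.0`, as in the configuration above, only the four most likely tokens can be sampled, and the nucleus step removes nothing further.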

We're going to try out some few-shot and zero-shot translation from French to English. We take a sample from the very large WMT14 translation dataset, considering French->English translation only.

In [7]:
french_texts = []
english_texts = []
with open("resources/translation_dataset/wmt14_sample.json") as file:
    data = json.load(file)["dataset"]
    for french_english_pair in data:
        french_text = french_english_pair["fr"]
        french_texts.append(french_text)
        english_text = french_english_pair["en"]
        english_texts.append(english_text)
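
The loading code above implies a file layout with a top-level `"dataset"` key holding a list of objects, each with `"fr"` and `"en"` fields. The sketch below parses a tiny in-memory sample with that shape. The two-entry sample and the `load_pairs` helper are illustrative assumptions, not the contents of `wmt14_sample.json`.

```python
import json
from typing import List, Tuple

# Hypothetical two-entry sample mirroring the structure the loading code expects.
SAMPLE_JSON = """
{"dataset": [
  {"fr": "Reprise de la session", "en": "Resumption of the session"},
  {"fr": "J'aime mon chien.", "en": "I love my dog."}
]}
"""


def load_pairs(raw_json: str) -> Tuple[List[str], List[str]]:
    """Split a list of {"fr": ..., "en": ...} records into parallel French and English lists."""
    data = json.loads(raw_json)["dataset"]
    french = [pair["fr"] for pair in data]
    english = [pair["en"] for pair in data]
    return french, english


french_sample, english_sample = load_pairs(SAMPLE_JSON)
print(list(zip(french_sample, english_sample)))
```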

### Zero-Shot Prompt Structure

For our zero-shot prompt example, we'll start by using the same format as the original [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf). That is:

Q: What is the {target language} translation of {source text}\nA:

In [8]:
def construct_zero_shot_prompt(french_text: str) -> str:
    return f"Q: What is the English translation of {french_text}\nA:"

In [9]:
zero_shot_prompts = []
for french_text in french_texts:
    zero_shot_prompt = construct_zero_shot_prompt(french_text)
    zero_shot_prompts.append(zero_shot_prompt)

In [10]:
print(f"Example Prompt\n{zero_shot_prompts[0]}")

Example Prompt
Q: What is the English translation of Reprise de la session
A:


### Few-Shot Prompt Structure

To speed up inference, we use only 10-shot prompts for our translation task. The original [GPT-3 paper](https://arxiv.org/pdf/2005.14165.pdf), like many other papers, uses a much larger 64-shot prompt to achieve its reported performance. The few-shot prompt format that we use is distinctly different from the zero-shot format. We borrow the GPT-3 structure of:

{source text} = {target text}\n\n

but add an instruction at the beginning:

"Translate the following sentences from French to English.\n\n"

To create the demonstrations, we take the first `n_examples` from the dataset sample, and use the rest for testing.

In [11]:
def construct_few_shot_prompt_structure(french_text: str, english_text: Optional[str] = None) -> str:
    if english_text:
        return f"{french_text} = {english_text}\n\n"
    else:
        return f"{french_text} = "

In [12]:
def construct_few_shot_prompt(french_text: str, demonstrations: str) -> str:
    return f"{demonstrations}{construct_few_shot_prompt_structure(french_text)}"

In [13]:
n_examples = 10
demonstrations = []
# Create the demonstrations for translation
for french_text, english_text in zip(french_texts[0:n_examples], english_texts[0:n_examples]):
    demonstrations.append(construct_few_shot_prompt_structure(french_text, english_text))

demonstration_str = "".join(demonstrations)
demonstration_str = f"Translate the following sentences from French to English.\n\n{demonstration_str}"

In [14]:
# Note that we're only taking the remaining 90 examples from the test set, since we used the first 10 for few-shot
# examples
few_shot_prompts = [
    construct_few_shot_prompt(french_text, demonstration_str) for french_text in french_texts[n_examples:]
]

In [15]:
print(f"Example Prompt\n{few_shot_prompts[0]}")

Example Prompt
Translate the following sentences from French to English.

Reprise de la session = Resumption of the session

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. = I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. = Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

Vous avez souhaité un débat à ce sujet dans les proch

### Testing Some Generations

Let's give each a try with some basic French sentences.

In [16]:
example_1 = "J'aime mon chien."  # I love my dog.
example_2 = "Il y a des gens partout."  # There are people everywhere.
example_3 = (
    "Jusqu'à présent, l'hiver a été étrange à Toronto."  # It has been a very weird winter in Toronto thus far.
)

In [17]:
def parse_model_response(response: str) -> str:
    # find the first sentence
    sentences = re.findall(r".*?[.!\?]", response)
    first_sentence = sentences[0] if len(sentences) > 0 else response
    # strip leading and trailing characters other than alphanumerics, spaces, and basic punctuation
    lead_strip = re.sub(r"^[^A-Za-z0-9 _\.,!\"\'\?]+", "", first_sentence.strip())
    return re.sub(r"[^A-Za-z0-9 _\.,!\"\'\?]+$", "", lead_strip)
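
To make the parser's behavior concrete, the sketch below repeats the function from the cell above and runs it on a few hand-written responses. The sample strings are hypothetical, and the name "Dupont" is invented for illustration. The last case shows a limitation: an abbreviation such as "M." is treated as the end of the sentence.

```python
import re


def parse_model_response(response: str) -> str:
    # Same logic as the cell above, repeated so this example runs on its own.
    sentences = re.findall(r".*?[.!\?]", response)
    first_sentence = sentences[0] if len(sentences) > 0 else response
    lead_strip = re.sub(r"^[^A-Za-z0-9 _\.,!\"\'\?]+", "", first_sentence.strip())
    return re.sub(r"[^A-Za-z0-9 _\.,!\"\'\?]+$", "", lead_strip)


# Only the first sentence is kept; anything the model generates afterwards is dropped.
print(parse_model_response(" I love my dog.\nQ: What is the English translation of"))
# -> I love my dog.

# A leading newline is skipped because "." in the pattern does not match newlines.
print(parse_model_response("\nThere are people everywhere. More text follows."))
# -> There are people everywhere.

# With no sentence terminator, the whole response is returned.
print(parse_model_response("Resumption of the session"))
# -> Resumption of the session

# Limitation: the period in an abbreviation ends the "sentence" early.
print(parse_model_response(" Madame la Présidente, je voudrais répondre à la question de M. Dupont."))
# -> Madame la Présidente, je voudrais répondre à la question de M.
```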

#### Zero-Shot Examples

We keep only the first sentence of the response because each prompt targets the translation of a single sentence.

In [18]:
# Place examples in the zero shot template
zero_shot_prompt = construct_zero_shot_prompt(example_1)
generation = model.generate(zero_shot_prompt, long_generation_config)
print(f"Prompt\n{zero_shot_prompt}")
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Q: What is the English translation of J'aime mon chien.
A:
Parsed Response
I love my dog.


In [19]:
# Place examples in the zero shot template
zero_shot_prompt = construct_zero_shot_prompt(example_2)
generation = model.generate(zero_shot_prompt, long_generation_config)
print(f"Prompt\n{zero_shot_prompt}")
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Q: What is the English translation of Il y a des gens partout.
A:
Parsed Response
There are people everywhere.


In [20]:
# Place examples in the zero shot template
zero_shot_prompt = construct_zero_shot_prompt(example_3)
generation = model.generate(zero_shot_prompt, long_generation_config)
print(f"Prompt\n{zero_shot_prompt}")
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Q: What is the English translation of Jusqu'à présent, l'hiver a été étrange à Toronto.
A:
Parsed Response
Until now, winter has been strange in Toronto.


Despite the model not being trained explicitly on a significant amount of multilingual text, zero-shot prompting works quite well for these examples.

#### Few-Shot Examples

Again, we keep only the first sentence of the response because each prompt targets the translation of a single sentence.

In [21]:
# Place examples in the few shot template
few_shot_prompt = construct_few_shot_prompt(example_1, demonstration_str)
print(f"Prompt\n{few_shot_prompt}")
generation = model.generate(few_shot_prompt, long_generation_config)
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Translate the following sentences from French to English.

Reprise de la session = Resumption of the session

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. = I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. = Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

Vous avez souhaité un débat à ce sujet dans les prochains jou

In [22]:
# Place examples in the few shot template
few_shot_prompt = construct_few_shot_prompt(example_2, demonstration_str)
print(f"Prompt\n{few_shot_prompt}")
generation = model.generate(few_shot_prompt, long_generation_config)
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Translate the following sentences from French to English.

Reprise de la session = Resumption of the session

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. = I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. = Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

Vous avez souhaité un débat à ce sujet dans les prochains jou

In [23]:
# Place examples in the few shot template
few_shot_prompt = construct_few_shot_prompt(example_3, demonstration_str)
print(f"Prompt\n{few_shot_prompt}")
generation = model.generate(few_shot_prompt, long_generation_config)
# Grab the first sentence output.
print("Parsed Response")
print(parse_model_response(generation.generation["sequences"][0]))

Prompt
Translate the following sentences from French to English.

Reprise de la session = Resumption of the session

Je déclare reprise la session du Parlement européen qui avait été interrompue le vendredi 17 décembre dernier et je vous renouvelle tous mes vux en espérant que vous avez passé de bonnes vacances. = I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.

Comme vous avez pu le constater, le grand "bogue de l'an 2000" ne s'est pas produit. En revanche, les citoyens d'un certain nombre de nos pays ont été victimes de catastrophes naturelles qui ont vraiment été terribles. = Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.

Vous avez souhaité un débat à ce sujet dans les prochains jou

### Measuring BLEU Scores on the Sampled Dataset

In [24]:
bleu_metric = evaluate.load("bleu")

### Zero-shot results for full translation sample test set

In [25]:
# Split prompts into batches for memory management.
zero_shot_translations = []
zero_shot_batches = split_prompts_into_batches(zero_shot_prompts, batch_size=1)
for zero_shot_batch in tqdm(zero_shot_batches):
    generations = model.generate(zero_shot_batch, long_generation_config)
    for single_generation in generations.generation["sequences"]:
        generation_text = parse_model_response(single_generation)
        zero_shot_translations.append(generation_text)

100%|██████████| 100/100 [08:34<00:00,  5.14s/it]
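
The `split_prompts_into_batches` helper is imported from `utils`, and its implementation is not shown in this notebook. As a minimal sketch, assuming it simply chunks a list of prompts into consecutive groups of at most `batch_size`, an equivalent could look like this. The name `split_into_batches` is hypothetical and chosen to avoid implying it is the actual helper.

```python
from typing import List


def split_into_batches(prompts: List[str], batch_size: int) -> List[List[str]]:
    """Chunk `prompts` into consecutive batches; the last batch may be smaller."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [prompts[start : start + batch_size] for start in range(0, len(prompts), batch_size)]


print(split_into_batches(["p1", "p2", "p3", "p4", "p5"], batch_size=2))
```

With `batch_size=1`, as used above, each prompt becomes its own batch, which keeps memory use low at the cost of one request per prompt.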


In [26]:
def convert_references_for_bleu(references: List[str]) -> List[List[str]]:
    # The bleu metric requires inputs to be stored as lists of lists. So we encapsulate each reference in a list
    return [[reference] for reference in references]

In [27]:
bleu_metric.compute(predictions=zero_shot_translations, references=convert_references_for_bleu(english_texts))

{'bleu': 0.07443099041449071,
 'precisions': [0.30645944633317146,
  0.13067891781521185,
  0.07423345884884347,
  0.046617396247868106],
 'brevity_penalty': 0.6859970294287353,
 'length_ratio': 0.7262786596119929,
 'translation_length': 2059,
 'reference_length': 2835}
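
The components in the output above are enough to recompute the final score by hand. The sketch below assumes the standard BLEU definition with uniform weights over the four n-gram precisions, where the brevity penalty is `exp(1 - reference_length / translation_length)` whenever the translations are shorter than the references. The helper name `bleu_from_components` is hypothetical.

```python
import math
from typing import List


def bleu_from_components(precisions: List[float], translation_length: int, reference_length: int) -> float:
    """Combine n-gram precisions and lengths into a BLEU score (uniform weights)."""
    if min(precisions) <= 0.0:
        return 0.0
    # Brevity penalty: no penalty when translations are at least as long as the references.
    if translation_length >= reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_length / translation_length)
    # Geometric mean of the n-gram precisions.
    geometric_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty * geometric_mean


# Values copied from the zero-shot output above.
zero_shot_precisions = [0.30645944633317146, 0.13067891781521185, 0.07423345884884347, 0.046617396247868106]
print(bleu_from_components(zero_shot_precisions, translation_length=2059, reference_length=2835))
```

The result agrees with the reported `bleu` value of about 0.074. The brevity penalty of about 0.69 shows that short outputs account for a sizeable part of the low zero-shot score, on top of the low n-gram precisions.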

### Few-shot example results for full translation sample test set

__NOTE__: This takes quite a while to run due to the sequence length associated with 10-shot prompts.

In [28]:
# Split prompts into batches for memory management.
few_shot_translations = []
few_shot_batches = split_prompts_into_batches(few_shot_prompts, batch_size=1)
for few_shot_batch in tqdm(few_shot_batches):
    generations = model.generate(few_shot_batch, long_generation_config)
    for single_generation in generations.generation["sequences"]:
        generation_text = parse_model_response(single_generation)
        few_shot_translations.append(generation_text)

100%|██████████| 90/90 [18:20<00:00, 12.23s/it]


In [29]:
# Note that we're only taking the remaining 90 examples from the test set, since we used the first 10 for few-shot
# examples
bleu_metric.compute(
    predictions=few_shot_translations, references=convert_references_for_bleu(english_texts[n_examples:])
)

{'bleu': 0.2745064853905504,
 'precisions': [0.5810978126289723,
  0.347621088726961,
  0.23207126948775056,
  0.16411682892906815],
 'brevity_penalty': 0.9268728194309799,
 'length_ratio': 0.9294207901802839,
 'translation_length': 2423,
 'reference_length': 2607}

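The few-shot prompt does a much better job at translation, at least as measured by BLEU: roughly 0.27 versus 0.07 for the zero-shot prompt. Note that the two scores are not computed on identical sets, since the zero-shot run covers all 100 examples while the few-shot run covers the 90 examples not used as demonstrations. Let's briefly consider why the gap is so large.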

In [34]:
zero_shot_translations[-10:]

['Je ne suis pas au courant de ce que vous dites.',
 "Je voudrais donc répéter, Madame la Présidente, que, pour notre part, nous avons discuté du programme d'action pour les cinq prochaines années et que nous sommes prêts à venir présenter le programme pour les cinq prochaines années quand le Parlement le décidera, y compris cette semaine, si telle est sa décision, et le programme pour",
 'I propose that we vote on the request of the Socialist Group for the Commission to adopt its strategic objectives.',
 'The President.',
 'I would like to make a statement on the question of the day, Wednesday, on the matter of the capital gains tax.',
 "The English translation of Le groupe PPE-DE demande de retirer ce point de l'ordre du jour is: The Group of the European People's Party (Christian Democrats) asks to withdraw this item from the agenda.",
 'There is no English translation for this.',
 'Madame la Présidente, je voudrais répondre à la question de M.',
 "Je ne sais pas si cette informatio

In [36]:
few_shot_translations[-10:]

["I should add, Madam President, that, on the Commission' s side, we are ready and willing to hold this debate when you see fit, that we were ready to develop it this week, as had been decided at the outset, based on the fact that it was presented the day before in a speech to the groups in Parliament.",
 'Mr President, I should like to repeat, Madam President, that, for our part, we have discussed the programme for the next five years and that we are ready to come and present the programme for the next five years when the House decides to do so, including this week, if that is its decision, and the programme for the year 2000, next',
 "I propose that we vote on the request of the Socialist Group for the reinsertion of the Commission' s statement on its strategic objectives.",
 'Je vous remercie.',
 'On to the next point of order, Madam President, I have a further proposal concerning the oral question on the capital tax.',
 'The PPE-DE Group is asking that this item be taken off the ag

Just by inspecting the last ten translations from each setting, we can see where at least a portion of the disparity comes from. The zero-shot prompts often yield decent translations, but the model does not always produce an English translation immediately, or at all: several outputs simply repeat or continue the French text. The few-shot translations, on the other hand, are almost all in English and of high fidelity, although at least one of those shown ('Je vous remercie.') remains in French.
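
To put a rough number on this observation, one could count how many outputs still look French. The sketch below uses a crude word-list heuristic. The hint-word sets and the `looks_untranslated` name are assumptions made for illustration, and a proper language-identification tool would be more reliable.

```python
import re
from typing import List

# Small, hand-picked sets of frequent function words (illustrative only).
FRENCH_HINTS = {"je", "ne", "pas", "le", "la", "les", "de", "des", "du", "que", "vous", "nous", "et", "est", "une", "pour", "ce", "cette"}
ENGLISH_HINTS = {"the", "of", "and", "is", "that", "to", "we", "i", "for", "this", "in", "on", "are", "be"}


def looks_untranslated(text: str) -> bool:
    """Return True when French hint words outnumber English hint words in `text`."""
    tokens = re.findall(r"[a-zà-ÿ']+", text.lower())
    french_hits = sum(token in FRENCH_HINTS for token in tokens)
    english_hits = sum(token in ENGLISH_HINTS for token in tokens)
    return french_hits > english_hits


def count_untranslated(translations: List[str]) -> int:
    return sum(looks_untranslated(text) for text in translations)


# A few outputs copied from the lists above.
sample_outputs = [
    "Je ne suis pas au courant de ce que vous dites.",
    "The President.",
    "There is no English translation for this.",
    "Je vous remercie.",
]
print(count_untranslated(sample_outputs), "of", len(sample_outputs), "outputs look untranslated")
```

Applied to `zero_shot_translations` and `few_shot_translations`, a count like this would give a quick, approximate comparison of how often each prompt style fails to produce English at all.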