# INFOMNLG - Practical Take-home task

In this assignment, the goal is to perform some simple experiments with a pretrained, instruction-tuned generative model. To do this task you will need:

1. Basic coding knowledge with Python. If you have used the Huggingface transformers library before, then this will be useful. But it is not necessary.
2. A good understanding of the basic task of data-to-text generation, which has been discussed in class at various points.

## Things you need in advance
The task centres around the [WebNLG Challenge](https://synalp.gitlabpages.inria.fr/webnlg-challenge/) (also discussed in class at various points). Take some time to familiarise yourself with this task.

You will be using the English version of WebNLG and you will only need the `dev` and `test` sets. You will not be using the training data.

* Download the WebNLG v2.1 dev and test sets from [here](https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v2.1/json?ref_type=heads).

We will also be using a pretrained, instruction tuned model called Phi-4, which was developed by Microsoft. We will use a small version of this model (bigger versions are quite compute-expensive).

* Take a look at the [Phi-4 model description](https://huggingface.co/microsoft/phi-4) on huggingface.

# Basic housekeeping

We first need to install a few libraries. We'll be using the `transformers` library from Huggingface.

The class `AutoModelForCausalLM` allowed you to load a model's weights and use  it directly for generation. `AutoTokenizer` ensures that you load the right tokenizer for this model. It's also  useful to put things in a pipeline to run them smoothly.


In [1]:
%%capture
!pip install transformers>=4.40.1 accelerate>=0.27.2

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

## Model preparation

The code below will instantiate the pretrained model and will also create a pipeline to perform generation based on a prompt.

In this exercise, you will be designing prompts and passing them to the pipeline, getting back the generated text and evaluating it against the reference outputs.

In [None]:
#Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

# Create a pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

Device set to use cuda


# Data Handling

The webnlg data is in `json` format and takes the form of a list of dictionaries, where each dictionary is a single instance. There are three important pieces of information you will need about each instance:

1. A **tripleset**, which is the set of RDF triples that constitute the input. In the data you downloaded, you need to use the field `modifiedtripleset` (you can ignore the other one). Note that triplesets consist of multiple triples, each with a subject, a property and an object. For example:

```
 {
 "object": "United_States",
 "property": "country",
 "subject": "Alpharetta,_Georgia"
},

```
The above basically says that Alpharetta, Georgia is in the country United States.

2. A set of **lexicalisations**, which are actual texts written by people, to render the information in the RDF triples. Note that there are usually multiple reference texts for a single tripleset.

3. A **category**, which gives an indication of the general topic (e.g. cities, people etc).

## Task 1: Zero-shot generation

The first task is to generate text from RDF triples in zero-shot mode. By *zero-shot*, we mean a setting in which the model is not trained, and is also not given any examples. Rather, it  is expected to generate text directly from the input, based on a simple instruction. Here are the steps you need to execute:

1. Load the data in `test` (**NB: not dev!**)
2. For each test instance, create a prompt that includes the RDF triples as input data.
3. Pass the prompt with the input data to your generation pipeline and retrieve the result.

## Some useful tips
* Try out a few examples and think about a suitable format for the input prompt. The triples in the dataset are represented as dictionaries. You can transform this into a different format, for example in the form of a string.
* Think of the prompt itself. Do you need to give instructions about the format of the input data? Do you need to give the model a "persona" (see below)?
* You can make your processing more efficient by batching multiple instructions in a single call to the model. See the example below, and especially what we do with `messages`.

Below is an example of how you can create a prompt and run it  in your pipeline to generate. You can use and modify this example for the WebNLG case.

### What to report
Reproduce your prompt for one of the WebNLG instances in your answer to this task in the space provided.

In [None]:
# This is an example of a prompt (not with WebNLG data)

#We can create a persona for the model, and then give it an instruction.
persona = "You are a fluent speaker of Dutch who has been employed by the Utrecht City Council."
instruction = "Your task is to generate a slogan for the city of Utrecht. This will be used to attract Dutch visitors from other cities."

#organise the prompt and specify that it's coming from the user.
# Note how we can pass on multiple messages, each taking the form of a dictionary
messages = [{"role": "user", "content": persona + instruction}]

#To see the format this will take once passed to the model, we can inspect how the tokenizer formats it by applying a template for model instructions.
print(tokenizer.apply_chat_template(messages, tokenize=False))

#now we generate the output:
output = pipe(messages)
print(output[0]["generated_text"])



<|user|>
You are a fluent speaker of Dutch who has been employed by the Utrecht City Council.Your task is to generate a slogan for the city of Utrecht. This will be used to attract Dutch visitors from other cities.<|end|>
<|endoftext|>


The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48


 "Utrecht: De Stad van Historie en Innovatie - Verbinden van Verleden en Toekomst."

This slogan translates to "Utrecht: The City of History and Innovation - Connecting Past and Future." It highlights the city's rich history and its commitment to innovation, while also emphasizing the connection between the past and the future. This slogan aims to attract Dutch visitors from other cities by showcasing Utrecht's unique blend of tradition and modernity.


### First try

In [None]:
import json

# Load WebNLG test dataset
with open("webnlg_release_v2.1_test.json", "r") as f:
    data = json.load(f)

# Now access the entries
entries = data["entries"]

# Function to format RDF triples
def format_triples(triples):
    return "; ".join([f"{t['subject']} - {t['property']} - {t['object']}" for t in triples])

# Example instance from dataset
example_instance = entries[0]["1"]  # Access the first entry in the "entries" list
triples_text = format_triples(example_instance["modifiedtripleset"])

# Creating the prompt
persona = "You are a knowledgeable AI that converts structured data into fluent English text."
instruction = f"Generate a natural language description for the following RDF triples: {triples_text}"

messages = [{"role": "user", "content": persona + " " + instruction}]

# Generate output
output = pipe(messages)
print("Generated Text:", output[0]["generated_text"])

Generated Text:  The Abilene Regional Airport serves the city of Abilene, which is located in Texas.


In [None]:
print(triples_text)

Abilene_Regional_Airport - cityServed - Abilene,_Texas


### Second try

In [None]:
import json

# Load the WebNLG test data
with open("webnlg_release_v2.1_test.json", "r") as f:
    test_data = json.load(f)

def create_prompt(triples):
    # Convert triples to a readable string format
    triples_str = "\n".join([f"{triple['subject']} | {triple['property']} | {triple['object']}" for triple in triples])

    # Create the prompt
    prompt = f"""You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following 'subject | property | object' triples:
{triples_str}

Please generate a concise and accurate description of the data provided above.
"""
    return prompt

# Access the list of instances
entries = test_data["entries"]

# Generate text for the first 5 instances
for entry in entries[:1]:  # Limiting to the first 5 instances for demonstration
    # Each entry is a dictionary with a single key (e.g., "1", "2")
    instance_key = list(entry.keys())[0]  # Get the key (e.g., "1")
    instance = entry[instance_key]  # Access the actual instance data

    # Extract the modified triples
    triples = instance["modifiedtripleset"]

    # Create the prompt
    prompt = create_prompt(triples)

    # Organize the prompt as a message
    messages = [{"role": "user", "content": prompt}]

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    print(f"Prompt:\n{prompt}\n")
    print(f"Generated Text:\n{generated_text}\n")
    print("-" * 50)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Prompt:
You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following 'subject | property | object' triples:
Abilene_Regional_Airport | cityServed | Abilene,_Texas

Please generate a concise and accurate description of the data provided above.


Generated Text:
 The Abilene Regional Airport serves the city of Abilene, Texas.

--------------------------------------------------


In [None]:
import json

# Load the WebNLG test data
with open("webnlg_release_v2.1_test.json", "r") as f:
    test_data = json.load(f)

def create_prompt(triples):
    # Convert triples to a readable string format
    triples_str = "\n".join([f"{triple['subject']} | {triple['property']} | {triple['object']}" for triple in triples])

    # Create the prompt
    prompt = f"""You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following RDF triples:
{triples_str}

Please generate a concise and accurate description of the data provided above.
"""
    return prompt

# Access the list of instances
entries = test_data["entries"]

# Generate text for the first 5 instances
for entry in entries[:5]:  # Limiting to the first 5 instances for demonstration
    # Each entry is a dictionary with a single key (e.g., "1", "2")
    instance_key = list(entry.keys())[0]  # Get the key (e.g., "1")
    instance = entry[instance_key]  # Access the actual instance data

    # Extract the modified triples
    triples = instance["modifiedtripleset"]

    # Create the prompt
    prompt = create_prompt(triples)

    # Organize the prompt as a message
    messages = [{"role": "user", "content": prompt}]

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    print(f"Prompt:\n{prompt}\n")
    print(f"Generated Text:\n{generated_text}\n")
    print("-" * 50)

Prompt:
You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following RDF triples:
Abilene_Regional_Airport | cityServed | Abilene,_Texas

Please generate a concise and accurate description of the data provided above.


Generated Text:
 The Abilene Regional Airport serves the city of Abilene, Texas.

--------------------------------------------------
Prompt:
You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following RDF triples:
Adolfo_Suárez_Madrid–Barajas_Airport | location | "Madrid, Paracuellos de Jarama, San Sebastián de los Reyes and Alcobendas"

Please generate a concise and accurate description of the data provided above.


Generated Text:
 Adolfo Suárez Madrid–Barajas Airport is located in Madrid, specifica

## Evaluation

You now want to evaluate the model output. The most straightforward way to do this is to compare the generated text to the text in the reference lexicalisations that are available in the dataset.

For this task, we will use `bleu`, a widely-used metric that computes a precision score based on n-grams. The idea is that generated texts are scored higher if they contain n-grams which are also in the reference texts. You can read more about this metric [here](https://aclanthology.org/P02-1040/).

We will use a standard library to compute the `bleu` score, called `sacrebleu`. You will find the documentation and examples here. **NB: Remember that you have multiple reference texts for each instance. BLEU is perfectly suited to comparing one generated output to multiple references. (In fact, it works better if you have more than one reference to compare to.)

### What to report
In the answer sheet, fill out the first cell of the results table with the bleu score obtained by your model.*italicised text*

In [None]:
pip install sacrebleu

Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-none-any.whl.metadata (8.6 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Downloading sacrebleu-2.5.1-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.1/104.1 kB[0m [31m7.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Downloading portalocker-3.1.1-py3-none-any.whl (19 kB)
Installing collected packages: portalocker, colorama, sacrebleu
Successfully installed colorama-0.4.6 portalocker-3.1.1 sacrebleu-2.5.1


In [None]:
from sacrebleu import corpus_bleu
import json

# Load the WebNLG test data
with open("webnlg_release_v2.1_test.json", "r") as f:
    test_data = json.load(f)

def create_prompt(triples):
    # Convert triples to a readable string format
    triples_str = "\n".join([f"{triple['subject']} | {triple['property']} | {triple['object']}" for triple in triples])

    # Create the prompt
    prompt = f"""You are a fluent English speaker who has been employed to generate natural language descriptions from structured data.
Your task is to generate a natural language description based on the following RDF triples:
{triples_str}

Please generate a concise and accurate description of the data provided above.
"""
    return prompt

# Lists to store generated texts and reference texts
generated_texts = []
reference_texts = []

i = 1

# Process each instance in the test data
for entry in test_data["entries"]:
    print("Entry: ", i)
    i = i + 1

    # Get the instance key and data
    instance_key = list(entry.keys())[0]
    instance = entry[instance_key]

    # Extract the modified triples
    triples = instance["modifiedtripleset"]

    # Create the prompt
    prompt = create_prompt(triples)

    # Organize the prompt as a message
    messages = [{"role": "user", "content": prompt}]

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    # Append the generated text to the list
    generated_texts.append(generated_text)

    # Extract the reference lexicalisations
    references = [lex["lex"] for lex in instance["lexicalisations"]]

    # Append the references to the list
    reference_texts.append(references)

# Compute the BLEU score
bleu_score = corpus_bleu(generated_texts, reference_texts)
print(f"BLEU Score: {bleu_score.score:.2f}")

Entry:  1
Entry:  2
Entry:  3
Entry:  4
Entry:  5
Entry:  6
Entry:  7


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Entry:  8
Entry:  9
Entry:  10
Entry:  11
Entry:  12
Entry:  13
Entry:  14
Entry:  15
Entry:  16
Entry:  17
Entry:  18
Entry:  19
Entry:  20
Entry:  21
Entry:  22
Entry:  23
Entry:  24
Entry:  25
Entry:  26
Entry:  27
Entry:  28
Entry:  29
Entry:  30
Entry:  31
Entry:  32
Entry:  33
Entry:  34
Entry:  35
Entry:  36
Entry:  37
Entry:  38
Entry:  39
Entry:  40
Entry:  41
Entry:  42
Entry:  43
Entry:  44
Entry:  45
Entry:  46
Entry:  47
Entry:  48
Entry:  49
Entry:  50
Entry:  51
Entry:  52
Entry:  53
Entry:  54
Entry:  55
Entry:  56
Entry:  57
Entry:  58
Entry:  59
Entry:  60
Entry:  61
Entry:  62
Entry:  63
Entry:  64
Entry:  65
Entry:  66
Entry:  67
Entry:  68
Entry:  69
Entry:  70
Entry:  71
Entry:  72
Entry:  73
Entry:  74
Entry:  75
Entry:  76
Entry:  77
Entry:  78
Entry:  79
Entry:  80
Entry:  81
Entry:  82
Entry:  83
Entry:  84
Entry:  85
Entry:  86
Entry:  87
Entry:  88
Entry:  89
Entry:  90
Entry:  91
Entry:  92
Entry:  93
Entry:  94
Entry:  95
Entry:  96
Entry:  97
Entry:  98
E

In [None]:
import json

# Save generated_texts_zero_shot to a JSON file
with open("generated_texts_zero_shot.json", "w") as f:
    json.dump(generated_texts, f)

## Task 2: Few-shot generation

Few-shot generation refers to the case where, before giving the model an instruction to generate, we first give it one or more examples.

As an example, consider the prompt below, which is the same as the example above, but now supplies an example.


In [None]:
# This is an example of a one-shot prompt (not with WebNLG data)

#To simplify, here we combine the persona and instruction.
instruction_1 = "You are a fluent speaker of Dutch who has been employed by the Utrecht City Council. Generate a slogan for the city of Utrecht to attract tourists."

instruction_2 = "You are a fluent speaker of Dutch who has been employed by the Utrecht University. Generate a slogan for the university to attract Dutch students."

#organise the prompt and specify that it's coming from the user.
# Note how we can pass on multiple messages, each taking the form of a dictionary
messages = [
  {"role": "user", "content": instruction_1},
  {"role": "assistant", "content": "Het weer in Utrecht is prachtig."},
  {"role": "user", "content": instruction_2}
  ]

## What you need to do
For this task, you need to re-run the generation over the entire test set, but this time, for each instance, you need to include an example input/output pair, before the input for which you actually want to generate. You choose the input/output example from the dev set (**NB: not the test set!**)

You will do this in two ways:
1. Using a random example from the dev set as the example in the prompt;
2. Using an example from the dev set which has the same category as the test example (e.g. they are both about cities)

Finally, evaluate the output in each case, once again using the bleu score.

### What to report
Complete the results table in the answer sheet and answer the questions about the results.



In [None]:
import json

# Load WebNLG dev and test datasets
with open("webnlg_release_v2.1_dev.json", "r") as f:
    dev_data = json.load(f)

with open("webnlg_release_v2.1_test.json", "r") as f:
    test_data = json.load(f)

In [None]:
import random

def create_few_shot_prompt_random(test_instance, dev_data):
    # Select a random example from the dev set
    random_entry = random.choice(dev_data["entries"])
    random_instance_key = list(random_entry.keys())[0]
    random_instance = random_entry[random_instance_key]

    # Extract the random example's triples and lexicalisations
    random_triples = random_instance["modifiedtripleset"]
    random_reference = random_instance["lexicalisations"][0]["lex"]  # Use the first reference

    # Create the prompt for the random example
    random_prompt = create_prompt(random_triples)

    # Create the prompt for the test instance
    test_triples = test_instance["modifiedtripleset"]
    test_prompt = create_prompt(test_triples)

    # Combine the random example and test instance into a few-shot prompt
    messages = [
        {"role": "user", "content": random_prompt},
        {"role": "assistant", "content": random_reference},
        {"role": "user", "content": test_prompt}
    ]

    return messages

In [None]:
def create_few_shot_prompt_same_category(test_instance, dev_data):
    test_category = test_instance["category"]

    # Find an example from the dev set with the same category
    same_category_examples = []
    for entry in dev_data["entries"]:
        instance_key = list(entry.keys())[0]
        instance = entry[instance_key]
        if instance["category"] == test_category:
            same_category_examples.append(instance)
            break  # Stop after finding the first matching example

    # Select the first matching example (if any)
    if same_category_examples:
        example_instance = same_category_examples[0]
        example_triples = example_instance["modifiedtripleset"]
        example_reference = example_instance["lexicalisations"][0]["lex"]  # Use the first reference
    else:
        # If no matching category, fall back to a random example
        example_instance = random.choice(dev_data["entries"])
        example_instance_key = list(example_instance.keys())[0]
        example_instance = example_instance[example_instance_key]
        example_triples = example_instance["modifiedtripleset"]
        example_reference = example_instance["lexicalisations"][0]["lex"]

    # Create the prompt for the example
    example_prompt = create_prompt(example_triples)

    # Create the prompt for the test instance
    test_triples = test_instance["modifiedtripleset"]
    test_prompt = create_prompt(test_triples)

    # Combine the example and test instance into a few-shot prompt
    messages = [
        {"role": "user", "content": example_prompt},
        {"role": "assistant", "content": example_reference},
        {"role": "user", "content": test_prompt}
    ]

    return messages

In [None]:
from sacrebleu import corpus_bleu

# Lists to store generated texts and reference texts
generated_texts_random = []
reference_texts = []

i = 1

# Process each instance in the test data
# for entry in test_data["entries"][:5]:
for entry in test_data["entries"]:
    print("Entry: ", i)
    i = i + 1

    # Get the instance key and data
    instance_key = list(entry.keys())[0]
    instance = entry[instance_key]

    # Create the few-shot prompt with a random example
    messages = create_few_shot_prompt_random(instance, dev_data)

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    # Append the generated text to the list
    generated_texts_random.append(generated_text)

    # Extract the reference lexicalisations
    references = [lex["lex"] for lex in instance["lexicalisations"]]
    reference_texts.append(references)

# Compute the BLEU score for the random example approach
bleu_score_random = corpus_bleu(generated_texts_random, reference_texts)
print(f"BLEU Score (Random Example): {bleu_score_random.score:.2f}")

Entry:  1
Entry:  2
Entry:  3
Entry:  4
Entry:  5
Entry:  6
Entry:  7
Entry:  8
Entry:  9


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Entry:  10
Entry:  11
Entry:  12
Entry:  13
Entry:  14
Entry:  15
Entry:  16
Entry:  17
Entry:  18
Entry:  19
Entry:  20
Entry:  21
Entry:  22
Entry:  23
Entry:  24
Entry:  25
Entry:  26
Entry:  27
Entry:  28
Entry:  29
Entry:  30
Entry:  31
Entry:  32
Entry:  33
Entry:  34
Entry:  35
Entry:  36
Entry:  37
Entry:  38
Entry:  39
Entry:  40
Entry:  41
Entry:  42
Entry:  43
Entry:  44
Entry:  45
Entry:  46
Entry:  47
Entry:  48
Entry:  49
Entry:  50
Entry:  51
Entry:  52
Entry:  53
Entry:  54
Entry:  55
Entry:  56
Entry:  57
Entry:  58
Entry:  59
Entry:  60
Entry:  61
Entry:  62
Entry:  63
Entry:  64
Entry:  65
Entry:  66
Entry:  67
Entry:  68
Entry:  69
Entry:  70
Entry:  71
Entry:  72
Entry:  73
Entry:  74
Entry:  75
Entry:  76
Entry:  77
Entry:  78
Entry:  79
Entry:  80
Entry:  81
Entry:  82
Entry:  83
Entry:  84
Entry:  85
Entry:  86
Entry:  87
Entry:  88
Entry:  89
Entry:  90
Entry:  91
Entry:  92
Entry:  93
Entry:  94
Entry:  95
Entry:  96
Entry:  97
Entry:  98
Entry:  99
Entry:  10

NameError: name 'corpus_bleu' is not defined

In [None]:
from sacrebleu import corpus_bleu

# Compute the BLEU score for the random example approach
bleu_score_random = corpus_bleu(generated_texts_random, reference_texts)
print(f"BLEU Score (Random Example): {bleu_score_random.score:.2f}")

BLEU Score (Random Example): 60.43


In [None]:
from sacrebleu import corpus_bleu

# Lists to store generated texts and reference texts
generated_texts_same_category = []
reference_texts = []

i = 1
# Process each instance in the test data
for entry in test_data["entries"]:
    print("Entry: ", i)

    i = i + 1

    # Get the instance key and data
    instance_key = list(entry.keys())[0]
    instance = entry[instance_key]

    # Create the few-shot prompt with a same-category example
    messages = create_few_shot_prompt_same_category(instance, dev_data)

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    # Append the generated text to the list
    generated_texts_same_category.append(generated_text)

    # Extract the reference lexicalisations
    references = [lex["lex"] for lex in instance["lexicalisations"]]
    reference_texts.append(references)

# Compute the BLEU score for the same-category example approach
bleu_score_same_category = corpus_bleu(generated_texts_same_category, reference_texts)
print(f"BLEU Score (Same-Category Example): {bleu_score_same_category.score:.2f}")

Entry:  1
Entry:  2
Entry:  3
Entry:  4
Entry:  5
Entry:  6
Entry:  7
Entry:  8
Entry:  9
Entry:  10
Entry:  11
Entry:  12
Entry:  13
Entry:  14
Entry:  15
Entry:  16
Entry:  17
Entry:  18
Entry:  19
Entry:  20
Entry:  21
Entry:  22
Entry:  23
Entry:  24
Entry:  25
Entry:  26
Entry:  27
Entry:  28
Entry:  29
Entry:  30
Entry:  31
Entry:  32
Entry:  33
Entry:  34
Entry:  35
Entry:  36
Entry:  37
Entry:  38
Entry:  39
Entry:  40
Entry:  41
Entry:  42
Entry:  43
Entry:  44
Entry:  45
Entry:  46
Entry:  47
Entry:  48
Entry:  49
Entry:  50
Entry:  51
Entry:  52
Entry:  53
Entry:  54
Entry:  55
Entry:  56
Entry:  57
Entry:  58
Entry:  59
Entry:  60
Entry:  61
Entry:  62
Entry:  63
Entry:  64
Entry:  65
Entry:  66
Entry:  67
Entry:  68
Entry:  69
Entry:  70
Entry:  71
Entry:  72
Entry:  73
Entry:  74
Entry:  75
Entry:  76
Entry:  77
Entry:  78
Entry:  79
Entry:  80
Entry:  81
Entry:  82
Entry:  83
Entry:  84
Entry:  85
Entry:  86
Entry:  87
Entry:  88
Entry:  89
Entry:  90
Entry:  91
Entry:  

### Different method for same category

In [None]:
from collections import defaultdict

def preprocess_dev_data(dev_data):
    category_to_examples = defaultdict(list)

    for entry in dev_data["entries"]:
        instance_key = list(entry.keys())[0]
        instance = entry[instance_key]

        category = instance["category"]
        triples = instance["modifiedtripleset"]
        lexicalisation = instance["lexicalisations"][0]["lex"]  # Use the first lexicalisation

        category_to_examples[category].append((triples, lexicalisation))

    return category_to_examples

In [None]:
category_to_examples = preprocess_dev_data(dev_data)

In [None]:
category_to_examples

defaultdict(list,
            {'Airport': [([{'object': 'Jacob_Bundsgaard',
                 'property': 'leaderName',
                 'subject': 'Aarhus'}],
               'The leader of Aarhus is Jacob Bundsgaard.'),
              ([{'object': '2702.0',
                 'property': 'runwayLength',
                 'subject': 'Aarhus_Airport'}],
               "Aarhus Airport's runway length is 2702.0."),
              ([{'object': '507',
                 'property': 'elevationAboveTheSeaLevel_(in_metres)',
                 'subject': 'Adirondack_Regional_Airport'}],
               'Adirondack Regional Airport is 507 metres above sea level.'),
              ([{'object': 'Harrietstown,_New_York',
                 'property': 'location',
                 'subject': 'Adirondack_Regional_Airport'}],
               'Adirondack Regional airport is located at Harrietstown, New York.'),
              ([{'object': 'San_Sebastián_de_los_Reyes',
                 'property': 'location',
        

In [None]:
import random

def create_few_shot_prompt_same_category(test_instance, category_to_examples):
    test_category = test_instance["category"]

    print("Category: ", test_category)

    # Check if there are examples for the test category
    if test_category in category_to_examples:
        # Randomly select an example from the same category
        example_triples, example_reference = random.choice(category_to_examples[test_category])
        print("Same category example: ", example_reference)
    else:
        # If no examples for the category, fall back to a random example from any category
        all_examples = [example for examples in category_to_examples.values() for example in examples]
        example_triples, example_reference = random.choice(all_examples)
        print("Random category example: ", example_reference)

    # Create the prompt for the example
    example_prompt = create_prompt(example_triples)

    # Create the prompt for the test instance
    test_triples = test_instance["modifiedtripleset"]
    test_prompt = create_prompt(test_triples)

    # Combine the example and test instance into a few-shot prompt
    messages = [
        {"role": "user", "content": example_prompt},
        {"role": "assistant", "content": example_reference},
        {"role": "user", "content": test_prompt}
    ]

    return messages

In [None]:
from sacrebleu import corpus_bleu

# Lists to store generated texts and reference texts
generated_texts_same_category = []
reference_texts = []

i = 1
# Process each instance in the test data
for entry in test_data["entries"]:
    print("Entry: ", i)
    i = i + 1

    # Get the instance key and data
    instance_key = list(entry.keys())[0]
    instance = entry[instance_key]

    # Create the few-shot prompt with a same-category example
    messages = create_few_shot_prompt_same_category(instance, category_to_examples)

    # Generate the output
    output = pipe(messages)
    generated_text = output[0]["generated_text"]

    # Append the generated text to the list
    generated_texts_same_category.append(generated_text)

    # Extract the reference lexicalisations
    references = [lex["lex"] for lex in instance["lexicalisations"]]
    reference_texts.append(references)

# Compute the BLEU score for the same-category example approach
bleu_score_same_category = corpus_bleu(generated_texts_same_category, reference_texts)
print(f"BLEU Score (Same-Category Example): {bleu_score_same_category.score:.2f}")

Entry:  1
Category:  Airport
Same category example:  Agra airport is in Uttar Pradesh, which is part of Awadh and Bundelkhand, is located 167.94 metres above sea level.
Entry:  2
Category:  Airport
Same category example:  The 3rd runway at Ardmore Airport (New Zealand) is made of Poaceae which is of the order Poales and belongs to the division of flowering plants.
Entry:  3
Category:  Airport
Same category example:  Al Asad Airbase is 618 ft above sea level and is located in Al Anbar Province, Iraq.
Entry:  4
Category:  Airport
Same category example:  The leader of Aarhus is Jacob Bundsgaard.
Entry:  5
Category:  Airport
Same category example:  At 159 meters above sea level, Angola International Airport is located at Ícolo e Bengo, part of Luanda Province, in Angola.
Entry:  6
Category:  Airport
Same category example:  Al Asad Airbase is operated by the United States Air Force who deploy the Lockheed AC-130 as an attack aircraft and the McDonnell Douglas F-15 Eagle as a fighter aircraf

In [None]:
import json

# Save generated_texts_same_category to a JSON file
with open("generated_texts_few_shot_category.json", "w") as f:
    json.dump(generated_texts_same_category, f)

# Bonus task (Optional. You get 10 extra points if you do this)
We have evaluated our output purely based on n-gram overlap with human-written text.

Another useful dimension to measure is variation, i.e. how much a model's text is repetitive, or makes use of a variety of words and expressions.

Can you think of a metric to evaluate variation in model output? It should be able to quantify the variation, returning a value (just as bleu does) that allows us to compare different models and settings.

Describe your method in the answer sheet and report your results for zero-shot and one-shot settings.

In [None]:
# Lists to store generated texts
generated_texts_zero_shot = []
generated_texts_few_shot_random = []
generated_texts_few_shot_category = []

# Generate text for zero-shot and few-shot approaches
for entry in test_data["entries"][:3]:
    instance_key = list(entry.keys())[0]
    instance = entry[instance_key]

    # Zero-shot generation
    zero_shot_prompt = create_prompt(instance["modifiedtripleset"])
    zero_shot_output = pipe([{"role": "user", "content": zero_shot_prompt}])
    generated_texts_zero_shot.append(zero_shot_output[0]["generated_text"])

    # Few-shot generation (random example)
    few_shot_prompt_random = create_few_shot_prompt_random(instance, dev_data)
    few_shot_output_random = pipe(few_shot_prompt_random)
    generated_texts_few_shot_random.append(few_shot_output_random[0]["generated_text"])

    # Few-shot generation (category example)
    few_shot_prompt_category = create_few_shot_prompt_same_category(instance, dev_data)
    few_shot_output_category = pipe(few_shot_prompt_category)
    generated_texts_few_shot_category.append(few_shot_output_category[0]["generated_text"])

In [None]:
print(generated_texts_zero_shot)
print(generated_texts_few_shot_random)
print(generated_texts_few_shot_category)

[' The Abilene Regional Airport serves the city of Abilene, Texas.', ' Adolfo_Suárez_Madrid–Barajas_Airport is located in Madrid, specifically in the areas of Paracuellos de Jarama, San Sebastián de los Reyes, and Alcobendas.', ' The runway at Adolfo Suárez Madrid–Barajas Airport is named "18L/36R".']
[' Abilene Regional Airport serves the city of Abilene, Texas.', ' Adolfo Suárez Madrid–Barajas Airport is located in Madrid, specifically in the areas of Paracuellos de Jarama, San Sebastián de los Reyes, and Alcobendas.', ' Adolfo Suárez Madrid–Barajas Airport has a runway named "18L/36R".']
[' The Abilene Regional Airport serves the city of Abilene, Texas.', ' Adolfo_Suárez_Madrid–Barajas_Airport is located in Madrid, specifically in the areas of Paracuellos de Jarama, San Sebastián de los Reyes, and Alcobendas.', ' The runway at Adolfo Suárez Madrid–Barajas Airport is named "18L/36R."']


In [None]:
import json

# Load generated_texts_zero_shot from the JSON file
with open("generated_texts_zero_shot.json", "r") as f:
    generated_texts_zero_shot = json.load(f)

In [None]:
import json

# Load generated_texts_few_shot_random from the JSON file
with open("generated_texts_few_shot_random.json", "r") as f:
    generated_texts_few_shot_random = json.load(f)

In [None]:
import json

# Load generated_texts_few_shot_category from the JSON file
with open("generated_texts_few_shot_category.json", "r") as f:
    generated_texts_few_shot_category = json.load(f)

In [None]:
from sacrebleu import corpus_bleu

def compute_self_bleu(generated_texts):
    self_bleu_scores = []
    for i in range(len(generated_texts)):
        print(i)
        hypothesis = generated_texts[i]
        references = generated_texts[:i] + generated_texts[i+1:]  # Exclude the hypothesis itself
        self_bleu_scores.append(corpus_bleu([hypothesis], [references]).score)
    return sum(self_bleu_scores) / len(self_bleu_scores)

In [None]:
# Compute Self-BLEU for zero-shot and few-shot approaches
self_bleu_zero_shot = compute_self_bleu(generated_texts_zero_shot)
self_bleu_few_shot_random = compute_self_bleu(generated_texts_few_shot_random)
self_bleu_few_shot_category = compute_self_bleu(generated_texts_few_shot_category)

print(f"Self-BLEU (Zero-Shot): {self_bleu_zero_shot:.2f}")
print(f"Self-BLEU (Few-Shot-Random): {self_bleu_few_shot_random:.2f}")
print(f"Self-BLEU (Few-Shot-Category): {self_bleu_few_shot_category:.2f}")

Self-BLEU (Zero-Shot): 2.15
Self-BLEU (Few-Shot-Random): 2.25
Self-BLEU (Few-Shot-Category): 2.15


In [None]:
# Compute Self-BLEU
self_bleu_zero_shot = compute_self_bleu(generated_texts_zero_shot)

print(f"Self-BLEU (Zero-Shot): {self_bleu_zero_shot:.2f}")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [None]:
# Compute Self-BLEU
self_bleu_few_shot_random = compute_self_bleu(generated_texts_few_shot_random)

print(f"Self-BLEU (Few-Shot-Random): {self_bleu_few_shot_random:.2f}")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [None]:
# Compute Self-BLEU
self_bleu_few_shot_category = compute_self_bleu(generated_texts_few_shot_category)

print(f"Self-BLEU (Few-Shot-Category): {self_bleu_few_shot_category:.2f}")

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

Interpretation
- A lower Self-BLEU score indicates higher diversity in the generated text.
- A higher Self-BLEU score indicates more repetitive text.