# Deep Learning for NLP - Exercise 05
In this exercise, we will use Llama-2 to learn about and experiment with various prompting strategies to solve classification tasks with text generation models.

___
General hints:
* Have a look at the imports below when solving the tasks
* Use the given modules and all submodules of the imports, but don't import anything else!
    * For instance, you can use other functions under the `torch` or `nn` namespace, but don't import e.g. PyTorch Lightning, etc.
* It is recommended to install all packages from the provided environment file
* Feel free to test your code between sub-tasks of the exercise sheet, so that you can spot mistakes early (wrong shapes, impossible numbers, NaNs, ...)
* Just keep in mind that your final submission must comply with the provided initial format of this file

Submission guidelines:
* Make sure that the code runs on the package versions from the provided environment file
* Do not add or change any imports (also don't change the naming of imports, e.g. `torch.nn.functional as f`)
* Remove your personal, additional code testings and experiments throughout the notebook
* Do not change the class, function or naming structure as we will run tests on the given names
* Additionally export this notebook as a `.py` file, and submit **both** the executed `.ipynb` notebook with plots in it **and** the `.py` file
* **Deviation from the above guidelines will result in partial or full loss of points**

In [8]:
# !pip install datasets==2.13.1
# !pip3 install transformers==4.33.2
# !pip3 install optimum==1.13.2
# !pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

* The [13-billion](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf) parameter, instruction-finetuned version of [Llama-2](https://arxiv.org/abs/2307.09288) will serve as our base model
* However, we will use a quantized version of the model since loading the original model (even in 16-bit floating point precision) would require approximately 26GB of GPU vRAM
* Because of this limitation, people started to distill and quantize models into low-precision versions [[1]](https://www.youtube.com/watch?v=2ETNONas068), [[2]](https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34). This dramatically reduces the memory required for inference, while the performance (depending on quantization technique and strength) still stays above that of the lower-parameter, full-precision version
* In our case, we will use the [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) version from Hugging Face, which takes up "only" around 8-10GB of vRAM, making it possible to use it on the T4 GPUs of Google Colab.
* This technique allows us to load and use a full 13 billion parameter model for inference
    * For comparison, the BERT model from exercises 3 and 4 had around 110 *million* parameters
* Since the model file already requires 8GB of disk space and you need the current versions of transformers, optimum, and auto-gptq (as indicated above), it is generally recommended to complete this exercise on Colab
    * Of course, you can still do it locally if you have access to a suitable GPU, just be aware that you need the above packages with the current versions (as also described [on this page](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ))
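
The memory figures above can be checked with a quick back-of-the-envelope calculation. This is only a rough sketch: the helper `weight_memory_gb` is a hypothetical name, it assumes exactly 13 billion parameters, and it counts weights only, which is presumably why the real 4-bit checkpoint needs somewhat more than the raw number.

```python
# Back-of-the-envelope weight memory for a 13B-parameter model.
# Only the weights are counted; activations, the KV cache, and quantization
# metadata (scales, zero points) add overhead on top of these numbers.
NUM_PARAMS = 13e9  # assumption: exactly 13 billion parameters


def weight_memory_gb(num_params, bits_per_param):
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```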

# Task 1: Prompting

In [1]:
import re
import json
import random
from collections import defaultdict, Counter
from tqdm import tqdm, trange

import torch

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

* Start by loading the [gptq-4bit-32g-actorder_True](https://huggingface.co/TheBloke/Llama-2-13B-chat-GPTQ) version from Hugging Face
* Use the `from_pretrained` method with the options `TheBloke/Llama-2-13B-chat-GPTQ` as the general model path, `'gptq-4bit-32g-actorder_True'` as the `revision`, use `device_map='auto'` and `trust_remote_code=False`
* Then, load the corresponding tokenizer from the same general model path (no `revision` path required)

In [2]:
torch.cuda.empty_cache()

model_name_or_path = 'TheBloke/Llama-2-13B-chat-GPTQ'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
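# Load the 4-bit GPTQ branch requested above; device_map='auto' places the weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    revision='gptq-4bit-32g-actorder_True',
    device_map='auto',
    trust_remote_code=False,
)
# The tokenizer is shared across revisions, so no revision is needed here
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)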

CUDA extension not installed.
CUDA extension not installed.
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


* As previously, define the device

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Task 1.1: Changing the system prompt

* The system prompt serves as a guiding instruction for the LLM
* It helps define the context, style, and focus of all subsequently generated text
* For instance, you can put the model in different 'roles', meaning that it will answer in the style of a pirate, rapper, poet, teacher etc.
* Choosing the 'right' system prompt for your task is crucial, as the task below will highlight
* The default system prompt in Llama-2 (and also in ChatGPT) is, according to the [Llama-2 paper](https://arxiv.org/abs/2307.09288):
> You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.\
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don’t know the answer to a question, please don’t share false information.
* To embed instructions and system prompts, the training procedure of Llama-2 introduced some new tokens, which follow the HTML style
    * `[INST]` and  `[/INST]` are used to mark the beginning and end of an instruction
    * `<<SYS>>\n` and `\n<</SYS>>\n\n` are used to mark the beginning and end of the system prompt itself
* Write a helper function `create_prompt` that takes in an `instruction`, `system_prompt`, and the start and end tokens of both the instruction and system prompt
    * It should return the concatenation of an opening instruction prompt token, followed by the opening system prompt token, followed by the system prompt itself, then closed off by the ending system prompt token, followed by the instruction itself, and closed off by the end of instruction token.
    * Save the default system prompt from above in a variable and use it as the default system prompt in the function

In [4]:
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'  # the closing tag needs the slash: <</SYS>>

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. \
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. \
Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. \
If you don’t know the answer to a question, please don’t share false information.\
"""

def create_prompt(instruction, system_prompt=DEFAULT_SYSTEM_PROMPT, b_inst=B_INST, e_inst=E_INST, b_sys=B_SYS, e_sys=E_SYS):
    return f"{b_inst}{b_sys}{system_prompt}{e_sys}{instruction}{e_inst}"

* In principle, using these instruction fine-tuned models works the same way as the decoding strategies from the last exercise
* However, we generally want to remove the input prompt itself from the output, since these finetuned models don't just complete our initial prompt but actually follow our instruction
* Therefore, write a small helper function `format_output` that takes in the raw model outputs, the prompt, and the tokenizer
    * The function first decodes the generated indices using the tokenizer
        * Enable the option `skip_special_tokens=True` in the decoding function
    * Remove the input prompt string
    * Finally, remove all additional spaces in the beginning and end
    * Return the cleaned string
    * Hint: If you want a concrete example, play around with the tokenizer and model to generate some outputs. The tokenization $\rightarrow$ generation $\rightarrow$ decoding procedure is the same as in the last exercise

In [7]:
def format_output(raw_out, prompt, tokenizer):
    decoded = tokenizer.decode(raw_out, skip_special_tokens=True)
    return decoded.replace(prompt, '').strip()

* Then, write a `generate` function that lets us reuse the same decoding and printing process, but flexibly change the instruction and system prompt
* The function takes in the `instruction`, `system_prompt`, `model`, `tokenizer`, and `device`
    * `system_prompt`, `model`, `tokenizer`, and `device` can be initialized as default with the respective default system prompt, model, tokenizer, and device
* Use the `create_prompt` function to form the prompt using the given `instruction` and `system_prompt`
* Tokenize the prompt to torch tensors and move them to the respective `device`
* Run them through the model to get the output indices
    * In this case, we use [top-p nucleus sampling](https://huggingface.co/blog/how-to-generate#top-p-nucleus-sampling) with the hyperparameters `temperature=0.7`, `do_sample=True`, `top_p=0.95`, `top_k=40`, and limit `max_new_tokens=512`
* Then, we format the output using the `format_output` function to get the cleaned string from the returned indices
* Return the cleaned string

In [17]:
def generate(instruction, system_prompt=DEFAULT_SYSTEM_PROMPT, model=model, tokenizer=tokenizer, device=device):
    prompt = create_prompt(instruction, system_prompt)
    tokenized = tokenizer(prompt, return_tensors='pt', truncation=True).to(device)
    raw_out = model.generate(
        **tokenized,
        temperature=0.7,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        max_new_tokens=512
    )

    output = format_output(raw_out[0], prompt, tokenizer)
    return output

* Now, invent an instruction for the model, use the `generate` function with the default system prompt and output your results
    * Always print the returned strings (when the output needs to be visible, such as in this case), so that newline and similar symbols are actually rendered

In [19]:
inst = "Explain the concept of LLMs to a high school student."
output = generate(inst)
print(output)

KeyboardInterrupt: 

* Secondly, repeat the generation process while sticking to your instruction from before, but change the system prompt to something very different
    * By "very different", we mean that the model should act as e.g. a car salesman, YouTube influencer, rapper, magician, angry uncle, etc. instead of the default nice-and-helpful assistant
    * You can either invent your own system prompt, or take inspiration from one of the below links
    * https://github.com/mustvlad/ChatGPT-System-Prompts#prompts-by-category
    * https://github.com/f/awesome-chatgpt-prompts#prompts

In [None]:
# TODO

print() # TODO

* Compare the results and briefly discuss how the system prompt has influenced the model's output

___
Student answers here:
___

## Task 1.2: Zero-Shot Inference

* The [first GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) model showed that language modeling can produce successful results at downstream task applications, e.g. for applications in natural language inference.
* [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) used the same architecture as GPT, but scaled it to 1.5B parameters and trained on 40GB of web text data instead of the previous 4GB book corpus
* A key ability observed in GPT-2 was the *zero-shot learning* ability for multiple tasks, i.e. the model could complete tasks *without* examples and *without* gradient updates
* By simply prompting the model in the right way, it could perform tasks that previously required supervised models, e.g. classification
* For instance, in the paper they simply prompt the model with a news article, followed by the string "TL;DR:", and the model was able to summarize the news article
* As the model wasn't specifically trained for summarization, this ability is named *zero-shot inference*, since we evaluate the model's performance at the "zero-th shot", i.e. before it was trained to do the task

* In the following, we will make use of the system prompt to perform a zero-shot, multi-class classification task on the [DAIR-AI Emotion Dataset](https://huggingface.co/datasets/dair-ai/emotion) from the last two weeks
* Load the test split
    * select only the first 100 samples (since it still takes some time to pass each sample through 13B parameters)
    * add a new column `index` with the index number of the selected samples to get a permanent index for each sample across experiments

In [None]:
emotion_dataset = # TODO
test_samples = # TODO

### Task 1.2.1: Zero-shot classification

* We will instruct the model to generate a label for each tweet
* To achieve that, we derive a new system prompt that instructs the model to:
    * read each sample
    * output only a single word: the class as a string

In [None]:
CLASSIFICATION_PROMPT = # TODO

* Sometimes, the model does not perfectly follow instructions, but still fulfills the task
    * e.g. the model might add phrases like `"Sure! Here is the result:"` to the output, despite us instructing otherwise
    * However, after such phrases, it might still add the correct output
    * Therefore, we will have some mercy with the model, as it has "only" 13B parameters and was quantized to about a quarter of its original numerical precision (16-bit to 4-bit), so we accept some errors
* To deal with that, write a function `get_label_from_string` that checks whether *exactly one* of the labels appears in the output string
    * The function takes as input the `output_str`, as well as the six `possible_labels` as a list of strings
    * It returns the label if exactly one label was found, and `False` otherwise
    * Make sure that your check is case sensitive
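
One possible way to implement the "exactly one label" check is sketched below. It is an illustration, not the reference solution: the name `find_single_label` and the word-boundary matching are assumptions, while the six label names are the classes of the DAIR-AI Emotion dataset.

```python
import re

# The six class names of the DAIR-AI Emotion dataset.
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def find_single_label(output_str, possible_labels=EMOTION_LABELS):
    """Return the label if exactly one distinct label occurs, else False."""
    found = [
        label for label in possible_labels
        # \b keeps "joy" from matching inside "enjoy"; re.IGNORECASE is not
        # set, so the check stays case sensitive.
        if re.search(rf"\b{re.escape(label)}\b", output_str)
    ]
    return found[0] if len(found) == 1 else False


print(find_single_label("Sure! Here is the result: joy"))
print(find_single_label("I enjoy this, maybe anger or fear"))
```

A plain substring test (`label in output_str`) would also satisfy the task description; the word boundaries are a design choice that avoids counting labels hidden inside longer words.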

In [None]:
DATASET_LABELS = # TODO

def get_label_from_string(output_str, possible_labels=DATASET_LABELS):
    # TODO

    return # TODO

* Create a `lbl2idx` dictionary to map the strings to classification indices of the labels
* Write a function `get_classification`
    * It takes as input a `pred_label`, the `true_lbl`, and the `lbl2idx` helper dictionary
    * Compare the predicted label to the label of the dataset
    * Return `1` if the correct label has been predicted, else `0`

In [None]:
lbl2idx = # TODO

def get_classification(pred_label, true_label, lbl2idx):
    # TODO

    return # TODO

* Iterate through all samples
    * Embed the iterator in `tqdm` to visualize the current progress
    * When generating new output, the text of each sample should be used as the `instruction` of the `generate` function, along with the new classification system prompt
    * Put together all previously created functions and fill in the necessary logic to perform the classification task
    * When evaluating `get_label_from_string`, add logic to repeat the generation up to 2 additional times if the model did not produce a valid string
* If, after three attempts, still no valid label could be found, save the output string in a separate list and print some of these samples after the task
    * Discuss what went wrong in these cases
* Report the accuracy between the correct classifications and:
    * all possible test samples
    * the number of strings containing a label
* Briefly comment on the results. At which step(s) in the process does the model make mistakes? Remember that our supervised baseline from tutorial 3 was between 93%-96%.
* HINT: Due to the inference time, it might be helpful to complete all functions and testing on only a small subset of the 100 samples
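
The retry logic and the two accuracy definitions can be prototyped without the model. The sketch below is only an assumption about how one might structure it: `classify_with_retries` is a hypothetical helper, the lambdas stand in for `generate` and `get_label_from_string`, and the counts at the end are made up.

```python
def classify_with_retries(text, generate_fn, extract_fn, max_attempts=3):
    """Call generate_fn up to max_attempts times until extract_fn finds a label."""
    output = ""
    for _ in range(max_attempts):
        output = generate_fn(text)
        label = extract_fn(output)
        if label:
            return label, output
    return False, output  # keep the last raw output for later inspection


# Stand-in generator: fails once, then answers; it replaces the real model call.
replies = iter(["I cannot decide.", "joy"])
label, raw = classify_with_retries(
    "i feel great today",
    generate_fn=lambda text: next(replies),
    extract_fn=lambda out: out if out in {"joy", "sadness"} else False,
)
print(label)

# Two accuracies: over all samples, and over samples with a valid label only.
num_samples, num_valid, num_correct = 100, 80, 60  # made-up counts
print(num_correct / num_samples, num_correct / num_valid)
```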

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

___
Student answers here:
___

### Task 1.2.2: Batch classification with structured in- and output

* Now, we will perform the same task again but instruct the model to
    * read a structured JSON input containing batched samples with their respective index
    * produce another JSON output containing the batch of indices with their classification labels
* First, derive a new system prompt that instructs the model to:
    * read the JSON structure containing the batch
    * classify each sample
    * create a JSON output that links the indices of each sample to their classification
    * each index (the keys) should be integers
    * each class should be the labels as strings

In [None]:
STRUCTURED_CLASSIFICATION_PROMPT = # TODO

* As before, the model might fail in producing the desired output
    * It might preface the correct answer with some additional text, or it may say something after the JSON structure is generated
    * It might also not produce a parseable JSON
    * Or it might produce a parseable JSON but fail to produce valid indices and/or labels
    * In the next functions, we check for all those possibilities
* Write a function `get_parseable_json_from_string` to extract a parseable JSON structure from the model's `output_str` string
    * The function should scan the output string and check whether it contains a parseable JSON
    * If a parseable JSON structure has been found, return it, otherwise `False`
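
A minimal sketch of such a scan, assuming the model emits a single JSON object somewhere in its answer: `json.JSONDecoder.raw_decode` parses one JSON value starting at a given position and ignores trailing text. The function name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(output_str):
    """Return the first parseable JSON object in output_str, else False."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(output_str):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever text follows it.
            parsed, _ = decoder.raw_decode(output_str, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace, try the next one
        if isinstance(parsed, dict):
            return parsed
    return False


print(extract_first_json_object('Sure! Here it is: {"0": "joy", "1": "anger"} Hope this helps!'))
print(extract_first_json_object("No JSON here {oops"))
```

Note that an empty object `{}` is parseable but falsy, so callers of this sketch should compare the result with `is False` rather than relying on truthiness.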

In [None]:
def get_parseable_json_from_string(output_str):
    # TODO

    return # TODO

* Adapt the earlier created `get_label_from_string` function to a batched version: `parse_labels_from_batch`
    * It takes as input the `parseable_json` we returned from the `get_parseable_json_from_string` function, and the `possible_labels`
    * Parse the batch and check that all generated labels actually exist
    * Make sure that the check is case sensitive
    * Return a list of tuples which contain for each sample:
        * the index as type `int`
        * `True` or `False`, depending on whether or not the label exists
        * and the label itself
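
As a rough illustration of the expected return format, the following sketch validates the labels of one parsed batch. The name `check_batch_labels` is hypothetical, and silently skipping keys that cannot be converted to `int` is an assumption, not a requirement from the task.

```python
EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def check_batch_labels(parsed_json, possible_labels=EMOTION_LABELS):
    """Return (index, label_is_valid, label) tuples for one parsed batch."""
    results = []
    for raw_index, label in parsed_json.items():
        try:
            index = int(raw_index)
        except (TypeError, ValueError):
            continue  # assumption: skip keys that are not integers
        # Plain membership test, so "Joy" does not count as "joy".
        results.append((index, label in possible_labels, label))
    return results


print(check_batch_labels({"3": "joy", "4": "Joy", "x": "fear"}))
```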

In [None]:
def parse_labels_from_batch(parseable_json, possible_labels=DATASET_LABELS):
    # TODO

    return # TODO

* Write a function `has_correct_indices` to assess whether the model was able to (re-)produce the correct indices of the input
    * The function takes as input the `parseable_json`, and a list of `true_indices`
    * Transform the type `str` indices of the JSON to `int` for easier comparison and subsequent use after returning
    * Return a list of tuples which contain for each sample:
        * the type `int` index
        * `True` or `False`, depending on whether the original index was (re-)produced
    * It is not important whether the order is kept in the reproduced indices, only that all indices are present and correct
    * If the (re-)produced indices contain more indices, pair those with `False`
    * If the (re-)produced indices contain fewer indices, get the missing indices from the `true_indices` and pair those with `False`
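
The three cases (correct, additional, and missing indices) map naturally onto set operations. The sketch below is one way to express that, under the assumption that non-integer keys are ignored; `check_batch_indices` is a hypothetical name.

```python
def check_batch_indices(parsed_json, true_indices):
    """Return sorted (index, index_is_expected) tuples for one parsed batch."""
    expected = set(true_indices)
    produced = set()
    for raw_index in parsed_json:
        try:
            produced.add(int(raw_index))
        except (TypeError, ValueError):
            continue  # assumption: keys that are not integers are ignored
    # Produced indices are True only if they were part of the input batch.
    results = [(index, index in expected) for index in produced]
    # Expected indices the model dropped are reported as False.
    results += [(index, False) for index in expected - produced]
    return sorted(results)


print(check_batch_indices({"0": "joy", "2": "fear", "7": "love"}, [0, 1, 2]))
```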

In [None]:
def has_correct_indices(parseable_json, true_indices):
    # TODO

    return # TODO

* Now we need to track the indices of our classification output
* The `update_index_dicts` function takes in the `output_str`, `true_indices`, a `correct_index_dict`, and a `incorrect_index_dict`
    * The dictionaries will be created in the next step (you can scroll below to see, they will be `defaultdict(list)` and `defaultdict(int)`)
    * The `correct_index_dict` maps the valid indices to a list of label predictions (there will be multiple predictions for each index)
    * The `incorrect_index_dict` maps each invalid index to the number of tries which have been done to produce a valid index or label
* As the first step, we pass the `output_str` through the `get_parseable_json_from_string` function
    * If no parseable JSON structure was found, we iterate through all `true_indices` of the batch and increase their count by one in the `incorrect_index_dict`
    * Afterwards, we return both `correct_index_dict` and `incorrect_index_dict` and end the function here
* If a successful JSON structure was found, check the validity of labels and indices with `parse_labels_from_batch` and `has_correct_indices`
    * Sort both outputs in ascending order of their respective indices to guarantee non-random ordering
* Now we iterate through the outputs of the index and label parsing functions
    * If the indices match in both pairs and the index is correct (as checked by its parsing function), and if the label is correct (as checked by its parsing function), we append the label to the list of its index in the `correct_index_dict`
    * If the above condition is not met, we increase the counter in the `incorrect_index_dict`
* We return the `correct_index_dict` and `incorrect_index_dict`
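
The bookkeeping step can be tried in isolation. The sketch below is a simplification: the hypothetical helper `merge_checks` receives the already computed label and index checks instead of the raw `output_str`, and it pairs them via a lookup by index rather than by iterating over two sorted lists. The example checks are hand-written.

```python
from collections import defaultdict


def merge_checks(label_checks, index_checks, correct_index_dict, incorrect_index_dict):
    """Merge the label and index checks of one batch into the tracking dicts."""
    labels_by_index = {idx: (ok, label) for idx, ok, label in label_checks}
    for idx, index_ok in sorted(index_checks):
        label_ok, label = labels_by_index.get(idx, (False, None))
        if index_ok and label_ok:
            correct_index_dict[idx].append(label)  # one more valid vote
        else:
            incorrect_index_dict[idx] += 1  # one more failed attempt
    return correct_index_dict, incorrect_index_dict


correct, incorrect = defaultdict(list), defaultdict(int)
# Hand-written checks for a batch with true indices 0..2: index 1 is missing
# from the output, index 7 was invented, and index 2 has an invalid label.
label_checks = [(0, True, "joy"), (2, False, "happy"), (7, True, "fear")]
index_checks = [(0, True), (1, False), (2, True), (7, False)]
merge_checks(label_checks, index_checks, correct, incorrect)
print(dict(correct), dict(incorrect))
```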

In [None]:
def update_index_dicts(output_str, true_indices, correct_index_dict, incorrect_index_dict):
    # TODO

    return # TODO

* Now we put everything together
* Iterate through all samples in batches of 10
    * Embed the iterator in `tqdm` to visualize the current progress
    * When generating new output, the text of each sample should be used as the `instruction` of the `generate` function, along with the new classification system prompt
    * Put together all previously created functions and fill in the necessary logic to perform the classification task
    * HINT: The returned `correct_index_dict` and `incorrect_index_dict` of one loop will serve as the input for the `update_index_dicts` function in the next loop
    * Here, we additionally store `all_outputs`, which should save tuples of a batch of indices with their generated JSON output
        * After the classification task, print 2 batches and discuss their outputs. What did the model do well? What didn't work?
* HINT: Due to the inference time, it might be helpful to complete all functions and testing on only a small subset of the 100 samples

In [None]:
BATCH_SIZE = # TODO
NUM_VOTES = # TODO

# TODO

* After the loop, we now have 2 outputs:
    * A dictionary containing the valid indices, each with a list of at most 3 entries (for 3 label predictions)
    * A dictionary containing the invalid indices with the number of tries performed to produce a valid index and label prediction
* Create a function `get_incomplete_indices`, which takes as input the `correct_index_dict`, the `true_indices`, and the `num_votes=NUM_VOTES` parameter
    * It checks whether we already have 3 valid predictions per index
    * If an index is missing entirely or if there are fewer than 3 predicted labels, store that index in a list
    * Return that list of incomplete indices
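
A compact sketch of this check, with the hypothetical name `find_incomplete_indices` and a hand-written vote dictionary:

```python
def find_incomplete_indices(correct_index_dict, true_indices, num_votes=3):
    """Return all indices that have fewer than num_votes valid predictions."""
    return [
        index for index in true_indices
        # .get avoids creating empty entries when a defaultdict is passed in
        if len(correct_index_dict.get(index, [])) < num_votes
    ]


votes = {0: ["joy", "joy", "love"], 1: ["fear"]}
print(find_incomplete_indices(votes, true_indices=[0, 1, 2]))
```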

In [None]:
def get_incomplete_indices(correct_index_dict, true_indices, num_votes=NUM_VOTES):
    # TODO

    return # TODO

* Until now, we have always tried to get predictions for each sample while keeping the ordering of samples inside batches equal
* Now, take all invalid indices (returned from the `get_incomplete_indices` function), pair them together in batches, and try to fill up the vote count dictionaries to 3 predictions per label
    * Create new batches of size 10 from the invalid indices
        * For the last batch, simply take the remaining samples, even if they don't add up to 10
    * Repeat the steps performed above (you can keep using the `correct_index_dict`)
        * Classify the invalid indices for an additional 5 tries OR until they have 3 valid predictions
            * In the end, you should have at maximum a counter of 8 for the invalid indices
        * Run it
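
Re-batching the incomplete indices, including a shorter last batch, can be done with simple slicing. `make_batches` is a hypothetical helper name, and the 23 indices are just example data.

```python
def make_batches(items, batch_size=10):
    """Split items into consecutive batches; the last one may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


incomplete = list(range(23))  # example: 23 indices still need predictions
batches = make_batches(incomplete, batch_size=10)
print([len(batch) for batch in batches])
```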

In [None]:
# TODO

* Now, we just have to extract the majority class vote per sample and calculate the accuracy
* Create a function `get_majority_vote`, which takes in the `correct_index_dict`
* Iterate through the dictionary and extract the labels based on the following logic:
    * If only one valid vote exists, take it
    * If all votes are distinct, randomly sample one (using the imported random package, do not hardcode a choice of your own)
    * In all other cases, take the predicted label with the most votes
* Return a dictionary mapping the valid indices to their final prediction class
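
The voting rules can be expressed with `Counter` and `random.choice`. The following is a sketch under the assumption that indices with an empty vote list are simply left out of the result; `pick_majority_labels` is a hypothetical name.

```python
import random
from collections import Counter


def pick_majority_labels(correct_index_dict):
    """Map each index to its final label according to the voting rules."""
    final = {}
    for index, votes in correct_index_dict.items():
        if not votes:
            continue  # assumption: indices without any valid vote are skipped
        counts = Counter(votes)
        if len(votes) == 1:
            final[index] = votes[0]  # only one valid vote exists
        elif len(counts) == len(votes):
            final[index] = random.choice(votes)  # all votes are distinct
        else:
            final[index] = counts.most_common(1)[0][0]  # most frequent label
    return final


example_votes = {0: ["joy"], 1: ["joy", "fear", "joy"], 2: ["love", "anger", "fear"]}
result = pick_majority_labels(example_votes)
print(result[0], result[1])
```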

In [None]:
def get_majority_vote(correct_index_dict):
    # TODO

    return # TODO

* Compare the majority votes to the ground truth class label of the dataset
    * Since the model very likely still could not produce valid indices and/or classes for some samples, make sure to compare the correct indices
* Report two accuracy metrics:
    * The overall accuracy based on all possible test samples
    * The accuracy given a valid index-label sample
* Briefly discuss these results and compare them to the accuracy numbers we obtained in task 1.2.1
* Also, remember to print the 2 batches that we saved in `all_outputs` and discuss what the model did well and what didn't work.

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

___
Student answers here:
___

## Task 1.3: Few-Shot Learning

* [GPT-3](https://arxiv.org/abs/2005.14165) was introduced by a paper titled "Language Models are Few-Shot Learners"
* It showed that with increasing model size (from 1.5 billion parameters to 175 billion parameters) and increasing pre-training data size (from 40GB to >600 GB), "few-shot" learning abilities were observable in LLMs
* When the LLM is given a new task with some examples of the task (i.e. "few-shots") before asking the question, it can learn this task
* Again, no gradients are updated
* As a result, this learning is also called "in-context learning", underscoring that the task is learned "in context" without updating the parameters
* For instance, a prompt that makes use of 3-shot learning to teach a model to translate English to French would look like:
```
Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese => .........
```

* In the following, we will work with the [FewRel](https://huggingface.co/datasets/few_rel) dataset
* It contains textual descriptions of relations
* The goal is to classify whether a relation exists between two entities mentioned in the text
* The dataset includes various sets of relations, making it suitable for few-shot learning

* Start by loading the validation split of the dataset, specifically the `split='val_wiki'`
* Select the first 10 samples as train split and the next 50 samples as validation samples
    * We choose these splits to minimize the distribution shift for this exercise because we only validate on 50 samples due to runtime constraints

In [None]:
fewrel_train = # TODO
fewrel_val = # TODO

* Get familiar with the layout of the dataset
* A sample contains
    * `relation`: a string feature containing the PID of the relation
    * `tokens`: a list of string features containing tokens for the text
    * `head`: a dictionary containing:
        * `text`: a string feature representing the head entity
        * `type`: a string feature representing the type of the head entity
        * `indices`: a list containing lists of token indices
    * `tail`: a dictionary containing:
        * `text`: a string feature representing the tail entity
        * `type`: a string feature representing the type of the tail entity
        * `indices`: a list containing lists of token indices
    * `names`: a list of string features containing relation names

* In this exercise, we will provide 3-shot examples of `tokens` along with their `names` and `head[text]` as well as `tail[text]` to extract the `head`- and `tail`-texts of new token sequences.
* To achieve this, we will use a random selection mechanism of few-shot samples from the train split and evaluate on the 50 samples from the validation split
* First, we need to design our system prompt template
    * Here, the system prompt should contain
        * the task description
        * the context as a string (i.e. not the list of strings from above) given by the `tokens`
        * the descriptions from the `names` list. The first list item is the name of the relation, the second list item the description of the relation
        * the entities, as given by `head[text]` and `tail[text]`
    * The prompt template should reserve the space and positions for all 3 sampled examples
    * Furthermore, for subsequent parsing and text generation speed, it will be helpful to add instructions to only return the next, incomplete example
    * Experiment with the template and some samples until you find a suitable version
    * Embed this structure in a function `get_few_shot_sys_prompt`
        * It takes the `train_set` as input and the `num_shots`
        * Inside the function, sample `num_shots` indices or samples from the train set
        * Fill in the `num_shots` train samples' context, descriptions, head, and tail text inside the earlier created template
    * Return the complete prompt string
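
To make the template idea concrete, here is one possible layout, offered as a sketch rather than the intended solution. The function name `build_few_shot_prompt`, the wording of the task description, and the three toy samples are all made up; only the field names (`tokens`, `names`, `head`, `tail`) follow the dataset layout described above.

```python
import random


def build_few_shot_prompt(train_set, num_shots=3):
    """Build a system prompt containing num_shots randomly sampled examples."""
    shots = random.sample(list(train_set), num_shots)
    parts = [
        "Extract the head and tail entity of the described relation from the "
        "context. Answer only for the last, incomplete example and keep the format."
    ]
    for shot in shots:
        parts.append(
            f"context: {' '.join(shot['tokens'])}\n"
            f"relation: {shot['names'][0]}\n"
            f"description: {shot['names'][1]}\n"
            f"head entity: {shot['head']['text']}\n"
            f"tail entity: {shot['tail']['text']}"
        )
    return "\n\n".join(parts)


# Made-up toy samples that only mimic the FewRel field layout.
toy_train = [
    {"tokens": ["Paris", "is", "the", "capital", "of", "France", "."],
     "names": ["capital of", "place of which the subject is the capital"],
     "head": {"text": "Paris"}, "tail": {"text": "France"}},
    {"tokens": ["The", "Rhine", "flows", "into", "the", "North", "Sea", "."],
     "names": ["mouth of the watercourse", "body of water a river drains into"],
     "head": {"text": "Rhine"}, "tail": {"text": "North Sea"}},
    {"tokens": ["Hamlet", "was", "written", "by", "Shakespeare", "."],
     "names": ["author", "main creator of a written work"],
     "head": {"text": "Hamlet"}, "tail": {"text": "Shakespeare"}},
]
prompt = build_few_shot_prompt(toy_train, num_shots=3)
print(prompt)
```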

In [None]:
def get_few_shot_sys_prompt(train_set, num_shots=3):
    # TODO

    return # TODO

* Secondly, we need to prepare our `instruction` for the to-be-completed samples from the validation set
* Create a function `get_few_shot_instruction`
    * It takes in a sample of the validation set (as we will later iterate through the validation dataset)
    * The function body is structured similar to one of the examples from the system prompt template
    * It consists of:
        * the instruction to fill up the blanks below (optional, depending on how you ended the system prompt)
        * the "context" and the "names" descriptions
    * Instead of completing the `head` and `tail`, leave those positions empty, e.g. in the style of
    ```
    head entity:
    tail entity:
    ```
    * Return the instruction string

In [None]:
def get_few_shot_instruction(val_sample):
    # TODO

    return # TODO

* Thirdly, we need a function `get_valid_relations` to again check the validity of the output
* Model this function in the style of `get_parseable_json_from_string`, which also takes as input the `output_str`
    * But adjust it to the individual structure you chose to output the `head` and `tail`
        * i.e. whether you chose to output head and tail in separate lines, with or without a preceding `:`, etc.
        * parse and clean the output as needed
    * return a tuple `(head, tail)` if parsing was successful, else `False`
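
If the chosen output format is `head entity: ...` and `tail entity: ...` on separate lines (an assumption, since the format is up to you), the parsing could look roughly like this. `parse_head_tail` is a hypothetical name.

```python
import re


def parse_head_tail(output_str):
    """Return (head, tail) if both entity lines are present and non-empty, else False."""
    # `.` does not match newlines, so each pattern captures the rest of its line.
    head = re.search(r"head entity:[ \t]*(.+)", output_str, flags=re.IGNORECASE)
    tail = re.search(r"tail entity:[ \t]*(.+)", output_str, flags=re.IGNORECASE)
    if not head or not tail:
        return False
    head_text, tail_text = head.group(1).strip(), tail.group(1).strip()
    if not head_text or not tail_text:
        return False
    return head_text, tail_text


print(parse_head_tail("Sure!\nhead entity: Paris\ntail entity: France\n"))
print(parse_head_tail("head entity:\ntail entity: France"))
```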

In [None]:
def get_valid_relations(output_str):
    # TODO

    return # TODO

* Then, write a function `get_relation_classification`
    * It takes in the tuple of extracted entities, as well as a tuple of ground-truth labels' head and tail entities
    * Compare both lower-cased head and tail strings
    * If both match, return 2
    * If one matches, return 1
    * If none match, return 0
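
Since Python treats `True` as `1` in sums, the score can be computed in a single expression. A small sketch with the hypothetical name `score_entities`; stripping whitespace before comparing is an added assumption.

```python
def score_entities(extracted, labels):
    """Count how many of (head, tail) match the ground truth, ignoring case."""
    return sum(
        pred.strip().lower() == true.strip().lower()
        for pred, true in zip(extracted, labels)
    )


print(score_entities(("paris", "Germany"), ("Paris", "France")))
```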

In [None]:
def get_relation_classification(extracted, labels):
    # TODO

    return # TODO

* Finally, iterate through the 50 validation set samples
    * Embed the iterator in `tqdm` to visualize the current progress
    * Put together all previously created functions and fill in the necessary logic to perform the classification task
        * Separately save the samples in a list when `get_valid_relations` returned `False`
    * Use the validation sample for the `instruction`, and the sampled 3-shot examples in the system prompt position of the `generate` function
    * As before, if we receive invalid output, retry up to 2 additional times
* Report the accuracy
    * against all validation samples
    * against validation samples with valid head and tail outputs
    * make sure that each 'sample' accounts for 2 relations in your accuracy calculation
* Briefly comment on the results. At which step(s) in the process does the model make mistakes?
* Show examples, if they exist, of the invalidly created outputs of the model and discuss what the model did wrong.
* Discuss 2 examples of the validly created outputs of the model where they don't match the label relations.
    
* HINT: Due to the inference time, it might be helpful to complete all functions and testing on only a small subset of the 50 samples

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

___
Student answers here:
___