# Homework Prompt Design

In the last lab we saw that there are different ways to get to the goal of the homework is to apply DoE to guage the effectiveness of different approaches. I.e.,  evaluate different algorithms/approaches that perform classification, and leveraging. **Design of Experiments**. 


This homework consists the following 4 main parts, 1 facultative exercise to get to know a useful templating language, and 1 bonus exercise. Note the course staff reserves the right to provide corrections to this notebook and/or corresponding code.

**N.B.**, this homework is both about using different techniques, and applying DoE. Its purpose is *not* to obtain a State-of-the-Art result, but rather to get to know different methods, understand their respective merrits, and applying them properly.

## Submitting the Homework to Ilias
**N.B.** To submit this homework, you must render this notebook as a PDF, run the following command in the commandline. Make sure to test this command;

```bash
jupyter-nbconvert --to pdfviahtml  homework-reference.ipynb --TagRemovePreprocessor.remove_input_tags='{"hide-cell","hide-student-submission"}'  --TagRemovePreprocessor.remove_all_outputs_tags='{"remove-output"}'         
```

Before submitting make sure your notebook adheres to the following:

1. None of the cells that are tagged as `keep-output` or `hide-cell` are deleted, these are key for the review of your code.
2. You have verified that your submission PDF contains all your complete answers, note that;
   * cells annotated with `hide-cell` will have their input removed,
   * cells annotated with `remove-output` will have their output removed,
   * cells annotated with `hide-student-submission` will have their input removed, e.g., this cell
   
3. Any cells you have added are either: properly annotated with `keep-output` or `hide-cell`, or are manually cleaned.

> ⚠️ The course staff reserves the right to withold awarding (partial) points to any of the (sub)exercises if your submitted PDF and notebook do not adhere to these requirements.


---

# Homework 2 Submission

| **Detail**      | **Description**      |
|-----------------|----------------------|
| **Name**        | Bastien Dimitri Jossen |
| **Student No.** | 20-113-684           |
| **Year**        | 2024                 |
| **Course**      | MSGAI                |


# Before we get started

This notebook seems long, but *most of the code* provides a starting point for the objective of this homework; *basic prompt-design and DoE*.
Read each exercise carefully, you might find some hints here and there in the provided code!

## Homework Overview

This homework consists of the following three parts, each consisting of some implementation, and design of experiment. We provide skeleton code to perform the experiments, but you may wish to deviate from it. We recommend doing the exercises in the provided order.

1. Zero-shot / Instruction based prediction.
2. Few-shot / Example based prediction.
3. Fine-Tuning / Learning based prediction.

Each exercise consists of;
1. A minor implementation of the main concept (see above, except for the `Fine-Tuning / Learning exercise`).
2. Design-of-Experiments. We provide a (mostly filled out) example in Exercise 3 that you may wish to use in Exercises 1 and 2.
3. Analysis of the DoE results, using ANOVA analysis, herein you need to check the model assumptions.

Additionally, there is ONE bonus exercise (2.1.3), worth a maximum of $10$ points, which we recommend tackling last.
3. (Bonus) Classification based prediction / Anything you want. Note, contact the TA before starting this BONUS. This BONUS will be of max. 10 points instead of the Semantic Few-Shot Prompting bonus in exercise 2. You can use the results / insights from this also in your project work.

**N.B.** You can get a maximum of $60$ points in total, **with an additional maximum of $10$ bonus points**.

---


In [21]:
# Install dependencies (same as the env file, so you may wish to skip this if running locally / with persistent conda environment)
%pip install python>=3.10,<4.0.0
%pip install nbconvert==6.5.4        
%pip install lxml_html_clean==0.3.1  
%pip install notebook-as-pdf==0.5.0  
%pip install bitsandbytes~=0.42.0
%pip install configparser~=7.1.0
%pip install datasets>=3.0.1,<4.0.0
%pip install flake8-import-order~=0.18.2
%pip install fqdn~=1.5.1
%pip install isoduration~=20.11.0
%pip install jinja2schema~=0.1.4
%pip install jsonpointer~=3.0.0
%pip install jupyter~=1.1.1,<2.0.0
%pip install peft>=0.13.2,<1.0.0
%pip install pretty-jupyter==2.0.7
%pip install protobuf~=5.28.2,<6.0.0
%pip install pyDOE3~=1.0.4
%pip install researchpy~=0.3.6
%pip install seaborn~=0.13.2
%pip install sentence-transformers~=3.2.0
%pip install sentencepiece~=0.2.0,<1.0.0
%pip install tabulate~=0.9.0
%pip install uri-template==1.3.0
%pip install webcolors==24.8.0

zsh:1: 3.10, not found
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
zsh:1: 3.0.1, not found
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
zsh:1: no such file or directory: 2.0.0
Note: you may need to restart the kernel to use updated packages.
zsh:1: 0.13.2, not found
Note: you may need to rest

In [61]:
# Imports used in most of the exercises
import contextlib
import io
import json

import textwrap
import time
import unittest
import warnings
from collections import defaultdict
from importlib import metadata
from itertools import chain
from os import PathLike
from functools import partial
from typing import Dict, List, Tuple, Union, Dict, Any
from typing import Optional, Type
from unittest import TextTestRunner, defaultTestLoader

import datasets
import jinja2
import jinja2schema
import peft
import torch
import transformers
from IPython.display import HTML, Markdown, display
from tqdm.auto import tqdm
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers import T5TokenizerFast

from pathlib import Path

In [62]:
def get_available_device() -> Tuple[torch.device, str]:
    """Helper method to find best possible hardware to run
    Returns:
        torch.device used to run experiments.
        str representation of backend.
    """
    # Check if CUDA is available
    if torch.cuda.is_available():
        return torch.device("cuda"), "cuda"

    # Check if ROCm is available
    if torch.version.hip is not None and torch.backends.mps.is_available():
        return torch.device("rocm"), "rocm"

    # Check if MPS (Apple Silicon) is available
    if torch.backends.mps.is_available():
        return torch.device('cpu'), "mps"

    # Fall back to CPU
    return torch.device("cpu"), "cpu"


def display_dataset_description(name: str, dataset: datasets.DatasetDict) -> None:
    """Helper method to display information about splits in the dataset.
    
    Args:
        name (str): Dataset name that was loaded. 
        dataset (datasets.DatasetDict): Dataset dict with different splits that were loaded 

    Returns:
        None
    """
    split_info = []
    for k, ds in dataset.items():
        split_info.append(f"<tr><td><strong>{k.capitalize()} Samples:</strong></td><td>{len(ds)}</td></tr>")
    html_content = f"""
    <h2>Dataset info</h2>
    <table>
        <tr><td><strong>Dataset Name:</strong></td><td>{name}</td></tr>
        {"<br>".join(split_info)}
    </table>
    """

    # Display the output in the notebook
    display(HTML(html_content))

def get_installed_version(package_name) -> str:
    with warnings.catch_warnings():
        # Supress warnings from packages that have missing attributes that metadata will complain about.
        warnings.simplefilter("ignore")
        distribution = metadata.Distribution()
        try:
            return distribution.from_name(package_name).version
        except metadata.PackageNotFoundError:
            return "Not installed"


def display_configuration() -> None:
    # Check device info
    device, backend = get_available_device()

    # Torch version
    torch_version = torch.__version__

    # HuggingFace Transformers version
    transformers_ver = transformers.__version__

    # BitsAndBytes version (if available)
    bitsandbytes_version = get_installed_version("bitsandbytes")

    # Check for GPU-specific details if CUDA or ROCm is available
    if device.type == "cuda":
        cuda_device_count = torch.cuda.device_count()
        cuda_device_name = torch.cuda.get_device_name(0)
        cuda_version = torch.version.cuda
    elif device.type == "rocm":
        cuda_device_count = torch.cuda.device_count()
        cuda_device_name = torch.cuda.get_device_name(0)
        cuda_version = torch.version.hip
    else:
        cuda_device_count = 0
        cuda_device_name = "N/A"
        cuda_version = "N/A"

    # Prepare HTML formatted output for better display in a notebook
    html_content = f"""
    <h2>System Configuration</h2>
    <table>
        <tr><td><strong>PyTorch version:</strong></td><td>{torch_version}</td></tr>
        <tr><td><strong>Device:</strong></td><td>{device} (Backend: {backend})</td></tr>
        <tr><td><strong>CUDA/ROCm version:</strong></td><td>{cuda_version}</td></tr>
        <tr><td><strong>GPU count:</strong></td><td>{cuda_device_count}</td></tr>
        <tr><td><strong>GPU name:</strong></td><td>{cuda_device_name}</td></tr>
        <tr><td><strong>Hugging Face Transformers version:</strong></td><td>{transformers_ver}</td></tr>
        <tr><td><strong>BitsAndBytes version:</strong></td><td>{bitsandbytes_version}</td></tr>
    </table>
    """

    # Display the output in the notebook
    display(HTML(html_content))


# Call the display_configuration() function in your Jupyter notebook to show the configuration
display_configuration()

0,1
PyTorch version:,2.2.2
Device:,cpu (Backend: mps)
CUDA/ROCm version:,
GPU count:,0
GPU name:,
Hugging Face Transformers version:,4.45.1
BitsAndBytes version:,0.42.0


## 0. Preparation

In order to prepare, we will load the model and dataset that we will be using, namely the `standfordnlp/imbd` sentiment dataset, and the `google/flan-T5-small` model.

You likely only need to run these setup cells once before running your code, but you might want to use the functions we provide here for certain DoE variables concerning:

* Precision (`torch.float16`, `torch.float32`, `torch.bfloat16`)
* Quantization (E.g., `bits_and_bytes_config != None`)
* Device (E.g., `cpu` and `cuda`)



In [63]:
def get_model(
        model_name: Union[str, PathLike],
        model_type: Type[transformers.GenerationMixin] = T5ForConditionalGeneration,
        torch_dtype: torch.dtype = torch.float16,
        device=torch.device("cpu"),
        bits_and_bytes_config: Optional[transformers.BitsAndBytesConfig] = None
) -> Tuple[transformers.PreTrainedModel, transformers.PreTrainedTokenizer, Union[transformers.PreTrainedTokenizerFast, transformers.PreTrainedTokenizer]]:
    """Example method to instantiate a model and get a model with optional quantization (using bitsandbytes).
    
    Args:
        model_name (str): Model name (huggingface name), or relative/absolute path to a pretrained model.
        model_type (Type[transformers.PreTrainedModel]): Type of pretrained model, used to instantiate the model you wan to load.
        torch_dtype (torch.dtype, torch.float16): Precision to load the model with. See also the BitsAndBytes documentation.
        device (torch.device, 'cpu'): Device to run the model on.
        bits_and_bytes_config (BitsAndBytesConfig, optional): Configuration for bitsandbytes model quantization / mixed-precisions (consider this one of your factors)
            N.B. for fine-tuning, make sure the optimizer you want to use is available for the defined precision.
        
    Returns:
        transformers.PreTrainedModel: Model instance with provided configuraiton.
        transformers.PreTrainedTokenizer: Tokenizer instance with provided configuration.
        transformers.PreTrainedTokenizerFast: Fast tokenizer if avilable, othersiwe a normal python based optimizer
    
    Notes:
        For using the BitsAndBytes quantization configuration, an Nvidia GPU is required. For this you might want to make 
        use of the Google Collab L4 / K40 GPUs (free-tier).    
        
    """
    
    model: transformers.PreTrainedModel = model_type.from_pretrained(
        pretrained_model_name_or_path=model_name,
        quantization_config=bits_and_bytes_config,
        device_map=device,
        torch_dtype=torch_dtype,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_name,
    )
    fast_tokenizer = transformers.AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_name,
        use_fast=True
    )
     
    return model, tokenizer, fast_tokenizer

def get_dataset(
        data_name: str,
        splits: List[str]
) -> Tuple[datasets.Dataset, ...]:
    """Helper method to load huggingface dataset.
    
    Args:
        data_name (str): Dataset name to load from huggingface. 
        splits (List[str]): List of splits to load and return. 

    Returns:
        Tuple containing the dataset splits.
    """
    # Load dataset, and assign splits to variables
    dataset: datasets.DatasetDict = datasets.load_dataset(data_name)
    return tuple(dataset[split] for split in splits)

def simple_truncate_text(row, max_length=50, tokenizer: transformers.PreTrainedTokenizerFast = None):
    """Example of a simple truncation method text, based on token count.
    
    You might want to perform 'smarter' truncation / summarization as a level, instead of simply cutting of after `max_length` tokens.
    
    Examples:
        You might want to partially-apply the function, to provide a different tokenizer:
        ```python3
        from functools import partial
        some_other_tokenizer = transformers.AutoTokenizer.from_pretrained('your_fave_tokenizer')
        partial_simple_truncate = partial(simple_truncate_text, tokenizer=some_other_tokenizer)
        ```
    Args:
        row (datasets....): Single instance or row of dataset.
    
    Keyword Args:
        max_length (int, 150): the maximum length of text to be processed. Defaults to 150.
        tokenizer (transformers.PreTrainedTokenizer, `fast_tokenizer`): the tokenizer to use. Defaults to `fast_tokenizer`.
    
    Notes:
        This function requires all cells above to be run.
    """
    token_representation = tokenizer.batch_encode_plus(row['text'], max_length=max_length, truncation=True)['input_ids']
    text_representation = tokenizer.batch_decode(token_representation, skip_special_tokens=True)
    row['text'] = text_representation
    return row


In [64]:
# Create tokenizer for flan family
family: str = "google/flan-t5"
# For the Lab we will use a small model, just to provide some insight into usability.
model: str = f"small"
model_name: str = f"{family}-{model}"

tokenizer: T5Tokenizer
fast_tokenizer: T5TokenizerFast
model: T5ForConditionalGeneration

# NOTE, you might need to change this for different model Families
#   as T5 family specifically is a encoder-decoder whereas most text gen. models are
#   of type AutoModelForCausalLM.
model_type: Type[transformers.GenerationMixin] = transformers.AutoModelForSeq2SeqLM
# model_type: Type[transformers.GenerationMixin] = transformers.AutoModelForCausalLM
device, backend = get_available_device()
model, tokenizer, fast_tokenizer = get_model(
    model_name=model_name,
    model_type=model_type,
    torch_dtype=torch.bfloat16,
    device=device,
)
# Set the model to Evaluation to prevent creating a computational graph
model.eval()



T5ForConditionalGeneration(
  (shared): Embedding(32128, 512)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 512)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=512, out_features=384, bias=False)
              (k): Linear(in_features=512, out_features=384, bias=False)
              (v): Linear(in_features=512, out_features=384, bias=False)
              (o): Linear(in_features=384, out_features=512, bias=False)
              (relative_attention_bias): Embedding(32, 6)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseGatedActDense(
              (wi_0): Linear(in_features=512, out_features=1024, bias=False)
              (wi_1): Linear(in_features=512, out_features=1024, bias=False)
              (wo): 

In [65]:
data_name: str = 'stanfordnlp/imdb'
splits = ['train', 'test', 'unsupervised']
train_set, test_set, *_ = get_dataset(data_name, splits=splits)
text, label = f"{train_set[1239]['text'][:40]}...", train_set[0]['label']
display(
    Markdown(
f"""
| Text  | Label   |
|:-----:|:-------:|
|{text} | {label} |
"""
    )
)


| Text  | Label   |
|:-----:|:-------:|
|i completely agree with jamrom4.. this w... | 0 |


## (Optional) Becoming a Jinja Ninja!

As a starting point for data-manipulation, here are some exercises to get used with Jinja! We recommend looking into Jinja templating, with variables and for loops.
As you will see, Jinja is a very flexible templating engine, that allows you to wrangle the IMDb dataset that we will use into the correct format for your experiments.

In the following exercises you can see how you can:

1. Render parameters in a Jinja Template
2. Render lists in a Jinja Template
3. Render `zip`ped list in a Jinja Template

> N.B. use the test methods to see what is expected / the expected return statement


In [66]:
example_template = jinja2.Template(
    textwrap.dedent(
        """\
        Hello my name is: {{ MY_NAME }}
        """
    )
)
print(example_template.render(MY_NAME="Jinja"))

# Implement a template that uses variables `course` `professor` and `ta`
# Would render `I follow MSGAIs 2024/2025 taught by Prof. L. Y. Chen, and can contact Ir. J. M. Galjaard for questions.`
VAR_TEMPLATE = textwrap.dedent(
    # YOUR CODE GOES HERE
    """\
    I follow {{course}} taught by {{professor}}, and can contact {{ta}} for questions.
    """
    # END OF YOUR CODE
)
variables_template = jinja2.Template(
    VAR_TEMPLATE   
)
variables = jinja2schema.infer(VAR_TEMPLATE)
assert set(variables.keys()) == {'course', 'professor', 'ta'}, 'Not all variables are used'

# As example
print(variables_template.render(course ='MSGAIs 2024/2025', professor='L. Y. Chen', ta='J. M. Galjaard'))


Hello my name is: Jinja
I follow MSGAIs 2024/2025 taught by L. Y. Chen, and can contact J. M. Galjaard for questions.


In [67]:
# TODO: Implement a template that uses a variable `exercises` that contains a list of strings.
#  it should render as a Markdown list
# HINT: use a jinja for-loop
list_expected = """I need to implement:
 * Basic prompting
 * Few-shot Learning
 * Fine-Tuning
 * Bonus"""
LIST_TEMPLATE = textwrap.dedent(
    """\
    I need to implement:{% for exercise in exercises %}
    * {{ exercise }}{% endfor %}
    """
)
list_template = jinja2.Template(
    LIST_TEMPLATE
)
variables = jinja2schema.infer(LIST_TEMPLATE)
assert 'exercises' in set(variables.keys()), 'Exercise variables is not used!'
print(list_template.render(exercises=['Basic prompting', 'Few-shot Learning', 'Fine-Tuning', 'Bonus']))

I need to implement:
* Basic prompting
* Few-shot Learning
* Fine-Tuning
* Bonus


In [68]:
# TODO: Implement a template that uses a variable `points_exercises` that contains a list of tuples.
# HINT: use a jinja for-loop and variable unrolling
zip_expected = """I need to implement:
* (20) Basic prompting
* (20) Few-shot Learning
* (20) Fine-Tuning
* (10) Bonus"""

ZIP_TEMPLATE = textwrap.dedent(
    # YOUR CODE GOES HERE
    """\
    I need to implement:{% for points, exercise in points_exercises%}
    * ({{points}}) {{exercise}} {% endfor %}
    """
    # END OF YOUR CODE
)
zip_template = jinja2.Template(
    ZIP_TEMPLATE
)
variables = jinja2schema.infer(ZIP_TEMPLATE)
assert set(variables.keys()) == {'points_exercises'}, 'Exercise variables is not used!'
print(zip_template.render(points_exercises=[(20, 'Basic prompting'), ( 20, 'Few-shot Learning'), (20, 'Fine-Tuning'), (10, 'Bonus')]))

I need to implement:
* (20) Basic prompting 
* (20) Few-shot Learning 
* (20) Fine-Tuning 
* (10) Bonus 


In [69]:
# Do not edit the following code.
class TestJinjaNinja(unittest.TestCase):
    exercises = ['Basic prompting', 'Few-shot Learning', 'Fine-Tuning', 'Bonus']
    points = [20, 20, 20, 10]
    def test_1_variable(self):
        
        check_against = "I follow MSGAIs 2024/2025 taught by Prof. L. Y. Chen, and can contact Ir. J. M. Galjaard for questions."
        course = 'MSGAIs 2024/2025'
        professor = 'Prof. L. Y. Chen'
        ta = 'Ir. J. M. Galjaard'
        result = variables_template.render(course=course, professor=professor, ta=ta)
        self.assertEqual(result, check_against)
    
    def test_2_list_template(self):
        check_against = textwrap.dedent("""\
        I need to implement:
        * Basic prompting
        * Few-shot Learning
        * Fine-Tuning
        * Bonus""")
        result = list_template.render(exercises=self.exercises)
        self.assertEqual(result, check_against)
        
    
    def test_3_list_zipped(self):
        check_against = textwrap.dedent("""\
        I need to implement:
        * (20) Basic prompting
        * (20) Few-shot Learning
        * (20) Fine-Tuning
        * (10) Bonus""")
        
        result = zip_template.render(points_exercises=list(zip(self.points, self.exercises)))
        self.assertEqual(result, check_against)
        
f = io.StringIO()
with contextlib.redirect_stderr(f):
    display(Markdown("### Exercise 0.1 Optional exercise result"))
    TextTestRunner(verbosity=-1).run(defaultTestLoader.loadTestsFromTestCase(TestJinjaNinja))
    display(Markdown('---'))
    print(f"\033[91m{f.getvalue()}")

### Exercise 0.1 Optional exercise result

---

FAIL: test_3_list_zipped (__main__.TestJinjaNinja.test_3_list_zipped)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/folders/v9/n514v4yd4hn8vqwrpj4cdt8c0000gn/T/ipykernel_38549/1811146518.py", line 34, in test_3_list_zipped
    self.assertEqual(result, check_against)
AssertionError: 'I ne[35 chars]pting \n* (20) Few-shot Learning \n* (20) Fine[19 chars]nus ' != 'I ne[35 chars]pting\n* (20) Few-shot Learning\n* (20) Fine-T[15 chars]onus'
  I need to implement:
- * (20) Basic prompting 
?                       -
+ * (20) Basic prompting
- * (20) Few-shot Learning 
?                         -
+ * (20) Few-shot Learning
- * (20) Fine-Tuning 
?                   -
+ * (20) Fine-Tuning
- * (10) Bonus ?             -
+ * (10) Bonus

----------------------------------------------------------------------
Ran 3 tests in 0.001s

FAILED (failures=1)



# Exercise 1: Prompt-based Evaluation (20 points total) 
Instead of fine-tuning a model specific to a problem, we can use the language model's capability to follow instructions to perform a specific task. In all these tasks, we will make use of the IMDB movie review sentiment dataset. Throughout this, and following exercises, we will be 'asking' the model to predict the sentiment (Positive or Negative). 

A naive idea, is ask the model simply: ``Has the following a Positive or Negative sentiment?''.
 
In this exercise, you will;

1. **Exercise 1.1**     (5 points) implement two 'Zero-Shot' prompts 'templates', that prompt the model to decide upon the semtiment without additional information
2. **Exercise 1.2:**    (8 points) Perform DoE with different system- and/or hyper-parameters during generation, to evaluate how they impact the models performance (accuracy).
3. **Exercise 1.3:**    (7 points) Analyse the result of your DoE experiments, usign ANOVA.


The goal here is to evaluate the impact of different hyper-parameters and/or system-parameters on the classification accuracy of the model.

> ❗One of the levels in your DoE, will be the input representation, i.e., a `simple_prompt` and a more contextual `detailed_prompt`. You will implement these Zero-Shot prompts. The simple prompt should be a mere short question, whereas the detailed prompt should give additional context, e.g., about the domain / task that is performed.


> *N.B.* to guide you through the exericse, we annotate things you will need to implement. In the lab we will provide some example on how to tackle this.

 ```python
 # YOUR CODE GOES HERE!

 # END OF YOUR CODE!
 ```

**⚠ FAIR Warning:**

> YOU should make sure to store results to disk or other persisten storage, i.e. by writing to a file or saving a model. For example when you want to run with different models you should make sure that data is not accidentally overwritten!


## Exercise: 1.1 Prompt-Design (5 points)

First, we ensure that we can represent the data to the model with our designed prompt, for this, you will need to implement the following behavior;

1. A simple (yes/no)-like question for the prompt in `get_simple_prompt_template`. (2 points (left / right)).
    * This should ask for a `positive`/`Positive` or `negative`/`Negative` as answer, i.e., asking to classify the sentiment of text.
2. A detailed (contextual) question for the prompt in `get_detailed_prompt_template`. (3 points (left / right)).
    * This should ask for a `positive`/`Positive` or `negative`/`Negative` as answer, i.e., asking to classify the sentiment of text.
    * The question should provide additional context regarding the task that is performed (e.g., sentiment analysis, type of task that is peformed, etc.).


**N.B**, we don't recommend using a library like `langchain` to do the homework, as they can become restrictive in the specifics that you want to use. You can opt to use it, but the course does not provide support on additional optional frameworks.

**N.B.** we do recommend using Jinja to create templates for prompts. This allows to quickly transform input for your experiments for your execution of DoE.

Additionally, make sure to use the appropriate `textwrap.dedent` option, if you use triple-quoted (multi-line) `str`ings! Otherwise, you will add (unintenional) whitespace `char`s!

> If you are unsure how to do this, see the Preparation exercise above, as they provide some hints.


In [70]:
def get_simple_prompt_template(
        side: str = 'left',
) -> jinja2.Template:
    """Implements a simple Template retrieval function, that takes as argument `review` and renders as a simple prompt.
    Keyword Args:
        side (str, 'left'): Position at which to add the question for the prompt.
        
    Returns:
        jinja.Template that can render an argument `review`, consisting of a string represention of a review.
    """
    # TODO: Implement a simple zero-shot style yes/no style QA Template.    
    match side:
        case 'left':
            # TODO: Implement question first, then `review`

            PROMPT_TEMPLATE = textwrap.dedent(
                # YOUR CODE GOES HERE
                """\
                Positive/Negative?
                {{review}}
                """
                # END OF YOUR CODE
            )
        case 'right':
            # TODO: Implement `review` first, then question
            PROMPT_TEMPLATE = textwrap.dedent(
                # YOUR CODE GOES HERE
                """\
                {{review}}
                Positive/Negative?
                """
                # END OF YOUR CODE
            )
    assert set(jinja2schema.infer(PROMPT_TEMPLATE).keys()) == {'review'}, "Your template does not use the `review` argument."
    return jinja2.Template(PROMPT_TEMPLATE)



simple_template_l = get_simple_prompt_template(side='left')

simple_template_r = get_simple_prompt_template(side='right')

display(
    Markdown(
        simple_template_l.render(review='Review would go here...').replace('\n', '<br>')
    )
)

Positive/Negative?<br>Review would go here...

In [71]:
# RUN EVALUATION
# Don't change the code below.

def nl_to_br(inp, br: str='<br>'):
    return inp.replace('\n', br)

example_review = f"{train_set[1203]['text'][:142]}..."
simple_prompt_l = nl_to_br(simple_template_l.render(review='Review would go here...'))
simple_example_l = nl_to_br(simple_template_l.render(review=example_review))

simple_prompt_r = nl_to_br(simple_template_r.render(review='Review would go here...'))
simple_example_r = nl_to_br(simple_template_r.render(review=example_review))

display(
    Markdown('### Exericse 1.1.1 Result'),
    HTML(
        textwrap.dedent(
            f"""\
            <table style="border-collapse: collapse; width: 100%;">
                <tr>
                    <th style="text-align: left; border: 1px solid black;">My simple prompt (left)</th>
                    <th style="text-align: left; border: 1px solid black;">My simple prompt (right)</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{simple_prompt_l}</td>
                    <td style="text-align: left; border: 1px solid black;">{simple_prompt_r}</td>
                </tr>
                <tr>
                    <th style="text-align: left; border: 1px solid black;">Example</th>
                    <th style="text-align: left; border: 1px solid black;">Example</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{simple_example_l}</td>
                    <td style="text-align: left; border: 1px solid black;">{simple_example_r}</td>
                </tr>
            </table>"""
        )
    )
)

### Exericse 1.1.1 Result

My simple prompt (left),My simple prompt (right)
Positive/Negative? Review would go here...,Review would go here... Positive/Negative?
Example,Example
"Positive/Negative? Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning....","Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning.... Positive/Negative?"


In [72]:
def get_detailed_prompt_template(
        side='left',
) -> jinja2.Template:
    """Implements a detailed contextual Template retrieval function, that takes as argument `review` and renders a detailed prompt
    with contextual information.
    
    Keyword Args:
        side (str, 'left'): Position at which to add the question for the prompt.
        
    Returns:
        Template that can render an argument `review`, consisting of a string representation of a review.
    """
    match side:
        case 'left':
            # TODO: Implement Question-first, Context second template.
            PROMPT_TEMPLATE = textwrap.dedent(
                # YOUR CODE GOES HERE
                """
                This is a movie review about someones experience and emotions towards the movie. Positive/Negative?
                {{review}}
                """
                # END OF YOUR CODE
            )
        case 'right':
            # TODO: Implement Context-first, Question-second template.
            PROMPT_TEMPLATE = textwrap.dedent(
                # YOUR CODE GOES HERE
                """
                {{review}}
                This was a movie review about someones experience and emotions towards the movie. Positive/Negative?
                """
                # END OF YOUR CODE
            )
    assert set(jinja2schema.infer(PROMPT_TEMPLATE).keys()) == {'review'}, "Your template does not use the `review` argument."
    return jinja2.Template(PROMPT_TEMPLATE)

detailed_template_l = get_detailed_prompt_template(
    side='left',
)
detailed_template_r = get_detailed_prompt_template(
    side='right',
)

display(
    Markdown('### Exericse 1.1.2 Result'),
    Markdown(
        textwrap.dedent(
        f"""\
        | **My simple prompt (left)** | **My simple prompt (right)** |
        |-----------------------------|------------------------------|
        | {simple_prompt_l}           | {simple_prompt_r}            |
        | **Example**                 | **Example**                  |
        | {simple_example_l}          | {simple_example_r}           |
        """
        )
    )
)

### Exericse 1.1.2 Result

| **My simple prompt (left)** | **My simple prompt (right)** |
|-----------------------------|------------------------------|
| Positive/Negative?<br>Review would go here...           | Review would go here...<br>Positive/Negative?            |
| **Example**                 | **Example**                  |
| Positive/Negative?<br>Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning....          | Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning....<br>Positive/Negative?           |


In [73]:
# RUN EVALUATION
# Don't change the code below.
example_review = f"{train_set[1203]['text'][:142]}..."

detailed_prompt_l = nl_to_br(detailed_template_l.render(review='Review would go here...'))
detailed_example_l = nl_to_br(detailed_template_l.render(review=example_review))

detailed_prompt_r = nl_to_br(detailed_template_r.render(review='Review would go here...'))
detailed_example_r = nl_to_br(detailed_template_r.render(review=example_review))

display(
    Markdown('### Exericse 1.1.2 Result'),
    HTML(
        textwrap.dedent(
            f"""\
            <table style="border-collapse: collapse; width: 100%;">
                <tr>
                    <th style="text-align: left; border: 1px solid black;">My simple prompt (left)</th>
                    <th style="text-align: left; border: 1px solid black;">My simple prompt (right)</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{detailed_prompt_l}</td>
                    <td style="text-align: left; border: 1px solid black;">{detailed_prompt_r}</td>
                </tr>
                <tr>
                    <th style="text-align: left; border: 1px solid black;">Example</th>
                    <th style="text-align: left; border: 1px solid black;">Example</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{detailed_example_l}</td>
                    <td style="text-align: left; border: 1px solid black;">{detailed_example_r}</td>
                </tr>
            </table>"""
        )
    )
)

### Exericse 1.1.2 Result

My simple prompt (left),My simple prompt (right)
This is a movie review about someones experience and emotions towards the movie. Positive/Negative? Review would go here...,Review would go here... This was a movie review about someones experience and emotions towards the movie. Positive/Negative?
Example,Example
"This is a movie review about someones experience and emotions towards the movie. Positive/Negative? Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning....","Wow. Rarely have I felt the need to comment on movies lately, but this one especially is begging for a beatdown. Let's start at the beginning.... This was a movie review about someones experience and emotions towards the movie. Positive/Negative?"


In [74]:
# Create the truncated eval set.
MAX_50_TOKENS = 50
truncate_to_50_tokens = partial(simple_truncate_text,  max_length=MAX_50_TOKENS, tokenizer=fast_tokenizer)

q1_eval_set = (
    test_set
    .map(truncate_to_50_tokens, batched=True)
)

truncated_example_text, label = nl_to_br(q1_eval_set[0]['text']), q1_eval_set[0]['label']
display(
    Markdown(
    """## Example of truncated data
Do you see how the text is abruptly terminated after `I tried to like this, I`?
"""),
    Markdown(textwrap.dedent(
            f"""\
            > | **Truncated Review**          | **Label** |
            |-------------------------------|-----------|
            | {truncated_example_text}      | {label}   |
            """
        )
    ),
    Markdown('---')
)

## Example of truncated data
Do you see how the text is abruptly terminated after `I tried to like this, I`?


> | **Truncated Review**          | **Label** |
|-------------------------------|-----------|
| I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I      | 0   |


---

## Exercise 1.2: Design of Experiments (8 points)

In this and the following exercise, we are interested in quantifing the effect of different configurations on the zero-shot performance of the model, you will need to select at-least 3 system- and/or hyper-parameters, with each having atleast two or more (2+) levels. Recall that for the first hyper-parameter should use the (`simple` or `detailed`) prompt.

Furthermore, we suggest using one or more from the following parameters in your DoE:

  * The structure of each prompt (i.e., `left` and `right`)
  * Model size, for example (`T5-flan-small`, `T5-flan-base`, `T5-flan-large`, etc.). (only recommended with GPU)
  * Numerical precision (`torch.float16`, `torch.float32`, `torch.bfloat16`). Make sure your hardware / `PyTorch` version supports this!
  * Quantization (only recommended with GPU with `BitsAndBytes` packages).
  * Structured decoding (requires implememtantation).


In short, you will need to perform;

1. (8 points) Design of Experiments in code;
    * Selection of criteria
    * Type of factorial experiment
    * Creation of experimental configuration
    * Run your experiments.
        * Depending on your chosen variables in DoE, you might need to make some minor adaptations to our provided code.

> For your convenience, we have split first DoE part,  and the Design of Experiments (which you have to implement), and the ANOVA analysis into 2 cells. We strongly recommend writing data to disk/persistent storage and loading it in the next cell to make sure you can easily re-run evaluation upon restarting the notebook.


In [80]:
from transformers import AutoTokenizer
def run_q1_evaluation(
        dataloader: torch.utils.data.DataLoader,
        model: transformers.PreTrainedModel,
        generation_config: transformers.GenerationConfig,
        *args,
        **kwargs
) -> Tuple[List[List[str]], List[List[int]]]:
    """Helper function to run evaluation (e.g. under different evaluations).
    
    Notes:
        You likely don't need to make any changes, as likely most of your levels are;
         * system-parameters,
         * generation-parameters,
         * different ways of pre-processing the review data.
    
    Args:
        dataloader (torch.utils.data.DataLoader): Dataloader containing the evaluation set. 
        model (transformers.PreTrainedModel): Pre-Trained model to be evaluated.
        generation_config: Generation configuraiton, that may contain some of your hyper-parameters for DoE.
        *args: Any additional positional args you want to add.
    
    Keyword Args:
        **kwargs: Any additional keyword args you want to add.

    Returns:
        List of list containing the `str`ing representation of the models predicition.
        List of list containing the `int`eger representation of the ground-truth label.
    """
    print(generation_config)
    prediction_list, label_list = [], []
    for idx, batch in (pbar := tqdm(enumerate(dataloader), leave=False, total=len(dataloader))):
        pbar.set_description(f'Batch {idx}')
        input_ids, attention_mask, label = batch['input_ids'].to(device), batch['attention_mask'].to(device), batch[
            'label'].to(device)
        # YOUR CODE GOES HERE
        tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-small')
        bos_token_id = tokenizer.bos_token_id
        generation_config = transformers.GenerationConfig(bos_token_id=bos_token_id)
        # END OF YOUR CODE
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            max_new_tokens=5,
        )
        prediction = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        prediction_list.append(prediction)
        label_list.append(label.cpu().tolist())
    return prediction_list, label_list


def get_q1_sets(
        dataset: datasets.Dataset
) -> Tuple[datasets.Dataset, datasets.Dataset]:
    """Helper method to create the low and high level datasets with the `simple` and `detailed` prompt.
    
    Notes:
        You likely don't need to edit this code, but feel free to extend this code, in case you want to
        evaluate more different levels
    
    Args:
        dataset (datasets.Dataset): Dataset to be mapped to a simple and detailed representation dataset.
        
    Returns:
        Dataset with text mapped using the `simple_template`.
        Dataset with text mapped using the `detailed_template`.
    """
    # 1. Prepare the simple set (low level)
    simple_set = (
        dataset
        .map(
            lambda batch: fast_tokenizer.batch_encode_plus(
                [simple_template.render(review=row) for row in batch['text']],
                truncation=False,
                padding=True,
            ),
            batched=True,
        )
    )
    # Map to input expected by the model.
    simple_set.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    # 2. Prepare the detailed set (high level)
    detailed_set = (
        dataset
        .map(
            lambda batch: fast_tokenizer.batch_encode_plus(
                [detailed_template.render(review=row) for row in batch['text']],
                truncation=False,
                padding=True,
            ),
            batched=True,
        )
    )
    # Map to input expected by the model
    detailed_set.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    
    return simple_set, detailed_set

### Exericse 1.2.1 Design of Experiments
Define your Design of Experiment configurations in the list `EXPERIMENT_CONFIGURATIONS`, you can use this list to store experiment configurations for the different levels.

In [84]:
# TODO: Implement your experimental design here! Decide on hyper-parameters, levels,
#  and type of factorial experiment you want to do.

EXPERIMENT_CONFIGURATIONS: List[Dict[Any, Any]] = [
    None
]
# YOUR CODE GOES HERE
DoE_1 = {"side": "left",
         "batch_size": 12,
         "dtype": torch.float16}
DoE_2 = {"side": "right",
         "batch_size": 12,
         "dtype": torch.float16}
DoE_3 = {"side": "left",
         "batch_size": 8,
         "dtype": torch.float32}
DoE_4 = {"side": "right",
        "batch_size": 8,
        "dtype": torch.bfloat16}

EXPERIMENT_CONFIGURATIONS = [DoE_1, DoE_2, DoE_3, DoE_4]
EXPERIMENT_CONFIGURATIONS = [{"side": "left", "batch_size":100}]
# END OF YOUR CODE


In [85]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [86]:
# ONLY SET THIS TO True IFF YOU NEED TO RE-RUN EXPERIMENTS, AS IT WILL
#  OVERWRITE YOUR RESULTS.
ALLOW_OVERWRITING_RESULTS = True
"""
# If you want to experiment with the side of the prompt, you will need to make
#  some changes here.
"""


for configuration in (exp_bar := tqdm(EXPERIMENT_CONFIGURATIONS, leave=True)):
    print(get_detailed_prompt_template(side=configuration["side"]))
    # EXAMPLE CODE HERE
    # Determine the maximum lenght given the length of your prompts. 
    # N.B. You might want to use this, but as we propose a pre-tokenized
    
    simple_template = get_simple_prompt_template(side=configuration["side"])
    detailed_template = get_detailed_prompt_template(side=configuration["side"])
    
    # Prepare datasets
    simple_set, detailed_set = get_q1_sets(q1_eval_set)
    
    simple_overhead = len(tokenizer(simple_template.render(), add_special_tokens=False)['input_ids'])
    detailed_overhead = len(tokenizer(detailed_template.render(), add_special_tokens=False)['input_ids'])
    
    """
    # You might need to subsample the dataset to 1000, if your hardware is too slow. Make sure to report
    #   if and how you sub-sample
    simple_set, detailed_set = simple_set.sample(...), detailed_set.sample(...)
    """
    for split, dataset in (split_bar := tqdm(zip(['simple', 'detailed'], [simple_set, detailed_set]), leave=False, total=2)):
        
        q_data_loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=configuration['batch_size'],  # Feel free to lower / higher this
            shuffle=False,  # Shuffling not needed during evaluation
            num_workers=2,
            prefetch_factor=10,
        )
        
        # print(model.torch_dtype)
        # model.torch_dtype = configuration['dtype']
        # print(model.torch_dtype)
        
        begin_time = time.time()
        prediction_list, label_list = run_q1_evaluation(
            q_data_loader,  # This you should probably not change
            model,  # You might need to change / load a different model for model-parameter
            configuration,  # You might need to update some kwargs int the generation config for your exp.
        )
        end_time = time.time()
        # Create a flat version to work with.
        prediction_list, labels_list = list(chain(*prediction_list)), list(chain(*label_list))
        # TODO: Store your results in a way such that you can load it later!
        # YOUR CODE GOES HERE
        
        # END OF YOUR CODE
        
        # Map the output if we don't recognize it to 
        label_lut = defaultdict(lambda: -1, {'Positive': 1, 'Negative': 0})
        
        predictions = list(map(lambda x: label_lut[x.split(' ')[0].lower()], prediction_list))
        
        accuracy = sum(map(lambda x: x[0] == x[1], zip(predictions, labels_list))) / len(predictions)
        unknown =  sum(map(lambda x: x[0] == -1, zip(predictions, labels_list))) / len(predictions)

        print(f"Accuracy ({split}): {accuracy}, Unknown: {unknown}")
        experiment_description = f"Experiment with configuration: {configuration}"
        save_path_experiment = f"results/{split}_{configuration['name']}.json"
        
        # Write file to disk
        save_path = Path(save_path_experiment)
        if not save_path.parent.exists():
            # Recursively create directory
            save_path.parent.mkdir(exist_ok=True, parents=True)
        if save_path.is_file() and not ALLOW_OVERWRITING_RESULTS:
            print("YOU ARE TRYING TO OVERWRITE AN EXISTING EXPERIMENT FILE!")
            raise Exception("Cannot overwrite existing experiment file without `ALLOW_OVERWRITING_RESULTS` flag set.")
        
        with open(save_path, 'w') as f:
            # TODO: You might want to save some additional results.
            json.dump({
                "description": experiment_description,
                "accuracy": accuracy,
                "unknown": unknown,
                "begin_time": begin_time,
                "end_time": end_time,
            }, f)


  0%|          | 0/1 [00:00<?, ?it/s]

<Template memory:33d8d0750>


  0%|          | 0/2 [00:00<?, ?it/s]

{'side': 'left', 'batch_size': 100}


  0%|          | 0/250 [00:00<?, ?it/s]

KeyboardInterrupt: 

### Exercise 1.3 Report on DoE (7 points)

In [None]:
# TODO: Code for your evaluation of results and write a small report on the 




*TODO: Write your report here, using appropriate tables, and or $math$, to support your claim.*

Make sure to clearly state (among others):

1. Which hyper-parameters you are testing 
2. Which levels you are testing for each experiment
3. How many repetitions you use
4. Which design of experiment you use: full-factorial / fractional-factorial

---

# Exercise 2: Learning Through Examples 'In-Context Learning' (20 points total)

Instead of asking the model its decision at face-value, in this exercise we will provide the model with a view examples. Although the jury is out on why this exactly works, the idea is that the examples allow to 'prime' the model, to understand the task better that it is going to perform.

In short, this exercise consists of the following parts;
 1. **Exercise 2.1.1:**  (2 points) Design and implementation a few-shot template in (`get_few_shot_prompt_template`).
 2. **Exercise 2.1.2:**  (3 points) Implementation of Few-shot dataloader with independently randomly drawn context.
 3. **Exercise 2.1.3:**  (BONUS 10 points) few-shot dataloader with independently drawn semantic context.
 4. **Exercise 2.2:**:   (8 points) perform Design of Experiments.
 5. **Exercise 2.3:**    (7 points) Analyise and write-up the DoE results.

**N.B.** we recommend using Jinja to create templates for prompts. This allows to quickly transform the IMDB samples `text` `str`ings to Few-Shot samples, to be used in your DoE. Additionally, make sure to use `textwrap.dedent` to wrap around triple-quoted (multi-line) `str`ings! Otherwise, you will add (unintentional) whitespace `char`s!

> If you find performing 2 and 3 difficult, you can also hard-code some review, and choose an additional system or hyper-parameter!


## 2.1.2 Creating a Few-Shot Template (2 points)
First design a template that allows to render a varying number of few shot examples.

Your template should take as arguments

* `question_answer_pairs` of type `List[Tuple[str, str]]`, i.e., a list of tuples containing a review and a stringified sentiment.
* `review` of type `str` that contains the review the model should classify.

Your template should render the text in a way that provides the model with examples (`question_answer_pairs`), and then provides the `review` to be classified by the model. You can use your insights from the previous exercise.

In [None]:
def get_few_shot_prompt_template() -> jinja2.Template:
    """Function to get a few-shot template to render render Few-Shot prompts in a `dataset.map` function.
    
    Notes:
        The prompt-template uses a variable `question_answer_pair` and `review` as input.
        
    Examples:
        ```
        template = get_few_shot_prompt_template()
        template.render(question_answer_pair=[('Wow I like this movie', 'Positive'), ('I like the Sequels better...', 'Negative')], review='I like this movie')
        ```

    Returns:
        jinja2.Template that can be rendcered 
    """

    # TODO: Implement a few-shot style evaluation prompt, the prompt should use a 
    #  variable `question_answer_pair`, consisting of a list of Tuples of Reviews and (textual) Labels.
    PROMPT_TEMPLATE = textwrap.dedent(
        # YOUR CODE GOES HERE
        ...
        # END OF YOUR CODE
    )
    assert set(jinja2schema.infer(PROMPT_TEMPLATE).keys()) == {'question_answer_pairs', 'review'}
    template = jinja2.Template(PROMPT_TEMPLATE)
    return template

simple_few_shot_template = get_few_shot_prompt_template()

empty_pairs, empty_review = [('', ''), ('', '')], ''
empty_template_result = simple_few_shot_template.render(question_answer_pairs=empty_pairs, review=empty_review)
example_pairs, example_review = ([('I like the movie', 'Positive'), ('I dislike the movie', 'Negative')], 
                                 'Event the prequels were far better than this!')
example_template_result = simple_few_shot_template.render(question_answer_pairs=example_pairs,review=example_review)

display(
    Markdown('**Your few-shot prompt looks like this.**'),
    Markdown(
        textwrap.dedent(
            f"""\
            {nl_to_br(empty_template_result)}
            """
        )
    ),
    Markdown('**As an example, your few-shot prompt looks like this.**'),
    Markdown(
        textwrap.dedent(
            f"""\
            {nl_to_br(example_template_result)}
            """
        )
    )
)


In [None]:
# Do not edit this cell
display(
    Markdown('### Exercise 2.1.1 output'),
    Markdown('**Few-Shot prompt looks like this.**'),
    HTML(
    textwrap.dedent(
            f"""\
            <table style="border-collapse: collapse; width: 100%;">
                <tr>
                    <th style="text-align: left; border: 1px solid black;">My few-shot prompt (empty)</th>
                    <th style="text-align: left; border: 1px solid black;">My few-shot prompt (example)</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{nl_to_br(empty_template_result)}</td>
                    <td style="text-align: left; border: 1px solid black;">{nl_to_br(example_template_result)}</td>
                </tr>
            </table>"""
        )
    )
)

## 2.1.2 Create a Few-Shot Dataset (3 points)

Next you will complete the implementation to create a Few-Shot `Dataset` that contains the pre-processed few-shot examples, rendered with your template from `2.1.2`. Herein, we will use a `shots` parameter that dictates the size of the context that is provided to the model. Make sure that the shots are randomly drawn for each exercise, but if you find this difficult, hard-coding a set of positive and negative examples is OK as well for 1 out of 3 points.

Within this exercise, points are awarded for: 

* Creating a Dataset with a configurable number of shots (1 point)
* Configurable number of randomly drawn shots (2 point)


> Note, here you can already set one of the level, by making the `K` of shots configurable, you can also think about the ratio of Positive / Negative.

> If you want, and your resources allow for it, you might want to combine the Few-Shot idea with your prompt-based approach, you can use that as a variable, and choose one additional hyper- and/or system-parameter.


In [None]:
import random

def draw_batched_random_shots(
    batch: Dict[str, List[Any]],
    positive_dataset: datasets.Dataset = None,
    negative_dataset: datasets.Dataset = None,
    template: transformers.PreTrainedTokenizer = None,
    shuffle=False,
    shots=4,
) -> Dict[str, List[Any]]:
    """Method to implement drawing random shots of data.
    Args:
        batch (Dict[str, List[str]]): Batch of data to convert to in-context example dataset. 
        dataset (datasets.Dataset): Dataset to use for drawing random shots. 
        tokenizer (transformers.PreTrainedTokenizer): the tokenizer to use to convert shot to... 
        shots (int, 4): Number of shots to sample, defaults to 4. 

    Returns:
        Transformed representation of a batch of samples with the `text` representation updated.
    """
    batch_texts = batch['text']
    """
    Recall that  
    text_labels = 'Positive' if label == 0 else 'Negative'
    """
   
    # These Lists you need to construct.
    result: List[str] = []
    positive_shots: List[List[str]] = []
    negative_shots: List[List[str]] = []
    
    # TODO: Implement code to create random contexts of positive and/or negative reviews.
    # Hint: use the positive_dataset and negative_dataset
    # Hint: dataset can be shuffled, and `take`n from.
    # Hint: if you find this difficult, or as additional level you can also hard-code these lists
    
    # YOUR CODE GOES HERE
    ...
    # END OF YOUR CODE
    
    # Merge your sampled or hard-coded shots into a rendered string.
    for random_positives, random_negatives, review in zip(positive_shots, negative_shots, batch_texts):
        context = random_positives + random_negatives
        if shuffle:
            random.shuffle(context)
        random.shuffle(context)
        result.append(
            template.render(
                question_answer_pairs=context,
                review=review
            )
        )
    batch['text'] = result
    return batch

def get_simple_few_shot_dataset(
        train_set: datasets.Dataset,
        test_set: datasets.Dataset,
        *sample_args,
        **sample_kwargs 
) -> datasets.Dataset:
    """Function to get a few-shot dataloader that loads random examples from the correst split.
    
    Args:
        train_set (): 
        test_set (): 
        shots (int, 4): Number of shots to draw, defaults to 4. 

    Returns:
        
    """
    positive_set = train_set.filter(
        lambda sample: sample['label'] == 1, batched=False
    )
    negative_set = train_set.filter(
        lambda sample: sample['label'] == 0, batched=False
    )
    partial_draw_random_shots = partial(draw_batched_random_shots, positive_dataset=positive_set, negative_dataset=negative_set, **sample_kwargs)
    return_set = (
        test_set
        .map(partial_draw_random_shots, batched=True, num_proc=1) # Map to stringified representation
    )
    return return_set


truncated_train_set = (
    train_set
    .map(truncate_to_50_tokens, batched=True)
)
truncated_test_set = (
    test_set
    .map(truncate_to_50_tokens, batched=True)
)

simple_dataset = get_simple_few_shot_dataset(
    truncated_train_set,
    truncated_test_set,
    template = get_few_shot_prompt_template(),
    shots=4
)

display(
    Markdown("**As an example, here is how your data looks like**"),
    Markdown(
        textwrap.dedent(
            f"""\
            {nl_to_br(simple_dataset[0]['text'])}
            """
        )
    )
)

In [None]:
# Do not edit this code
import random
index_1, index_2 = random.randint(0, 2500), random.randint(0, 2500)
sample_1 = nl_to_br(simple_dataset.shuffle()[index_1]['text'])
sample_2 = nl_to_br(simple_dataset.shuffle()[index_2]['text'])

display(
    HTML(
        textwrap.dedent(
            f"""\
            <table style="border-collapse: collapse; width: 100%;">
                <tr>
                    <th style="text-align: left; border: 1px solid black;">Sample ({index_1})</th>
                    <th style="text-align: left; border: 1px solid black;">Sample ({index_2})</th>
                </tr>
                <tr>
                    <td style="text-align: left; border: 1px solid black;">{sample_1}</td>
                    <td style="text-align: left; border: 1px solid black;">{sample_2}</td>
                </tr>
            </table>"""
        )
    )
    
)

### (BONUS) Exercise 2.1.3 I would like additional context please (BONUS 10)

Instead of randomly sampling datapoints to create a Few-Shot context. However, maybe we can do better. An example of this, is to create a more semanticly relevant content, that provide more relevant information for the model to make a decision.

For this bonus exericse the recipe is (roughly) as follows:

 1. Creating a semantic embeddings of samples to create a context from (an embedding model).
 2. Creating a vector database to lookup examples.
 3. (Pre-compute) set of example to use (I.e. vector lookup).
 4. Render the template (similar as before)

We have provided some skeleton code to get started, but TAs cannot provide any assistant for this exericse (unless our template contains an error :))

> N.B. that this will take some compute power, so you might want to save the (stringified) dataset that you allow to continue.


In [None]:
from typing import Dict
from functools import partial
from pathlib import Path
import pickle
import sentence_transformers
from sentence_transformers import SentenceTransformer

embedding_model = 'multi-qa-MiniLM-L6-cos-v1'
# We advice to use a small model from Sentence transformer, but feel free to use somethign
# completey different, or use this as an additonal level!
embedding_model = SentenceTransformer(embedding_model)

# TODO: Complete and upate the functions to perfrom semantic search.
# As a hint: Look at the imports adn see how they can be used.

def create_semantic_db(
        embedding_model: SentenceTransformer,
        train_set: datasets.Dataset,
        test_set: datasets.Dataset
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:  
    """Function to create a sematnic database.
    
    Args:
        embedding_model (): 
        train_set (): 
        test_set (): 

    Returns:

    """
    # Load embeddings if they already exist.
    data = None
    if (embedding_path := Path('embeddings.pkl')).exists():
        with open(embedding_path, 'rb') as f:
            data = pickle.load(f)
    
    if data is None:
        # YOUR CODE GOES HERE
        ...
        # END OF YOUR CODE
    
    with open('embeddings.pkl', "wb") as fOut:
        data = {
            'sentences': sentences,
            'embeddings': embeddings
        }
        pickle.dump(data, fOut, protocol=pickle.HIGHEST_PROTOCOL)
    positive_embeddings, negative_embedding = embeddings[:len(embeddings)//2], embeddings[len(embeddings)//2:]
    return embeddings, positive_embeddings, negative_embedding

def find_batched_semantic_search(
        batch: Dict[str, List[str]],
        tokenizer: transformers.PreTrainedTokenizer,
        corpus_embeddings: torch.Tensor,
        negative_corpus_embeddings: torch.Tensor,
        positive_corpus_embeddings: torch.Tensor,
        shots=4
) -> Dict[str, List[str]]:
    
    # You will have to create a list of rendered string with the found context.
    results: List[str] = []
    # TODO: Implement a batched `semantic_search` to find relevant items.
    
    # 1. Perform semantic_search for each item in the batch
    # YOUR CODE GOES HERE
    ...
    # END OF YOUR CODE
    # 2. Select relevant samples for each sample in the batch
    # YOUR CODE GOES HERE
    ...
    # END OF YOUR CODE
    # 3. Collate the found results for the batch
    # YOUR CODE GOES HERE
    ...
    # END OF YOUR CODE
    # 4. Render the results using a template.
    # YOUR CODE GOES HERE
    ...
    # END OF YOUR CODE

    batch['text'] = results
    return batch

def get_contextual_drawn_few_shot_dataset(
        train_set: datasets.Dataset,
        test_set: datasets.Dataset,
        shots: int = 4,
        *args,
        **kwargs
) -> datasets.Dataset:
    """Function to get a few-shot dataloader based on context."""
    partial_semantic_search = partial(find_batched_semantic_search, train_set=train_set, test_set=test_set, shots=shots)
    return_set = (
        test_set
        .map(partial_semantic_search, batched=True, num_proc=1) # Map to stringified representation
    )
    return_set.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
    return return_set

# Step 1: Get embeddings
corpus_embedding, positive_embedding, negative_embedding = create_semantic_db(
    embedding_model=embedding_model,
    train_set=train_set,
    test_set=test_set
)

SHOTS = 4   # YOU MIGHT WANT TO CHANGE THIS IF YOU USE SHOTS AS A VARIABLE
q2_complex_set = get_contextual_drawn_few_shot_dataset(
    train_set=train_set,
    test_set=test_set,
    shots=SHOTS
    
)
display(
    # TODO: Implement showing an example that shows that it works
)


## Exercise 2.2: Perform DoE (8 points)

In this and the following exercise, we are interested in quantifing the effect of different configurations on the Few-Shot performance of the model, you will need to select at-least 3 system- and/or hyper-parameters, with each having at least two or more (2+) levels. Recall that you may wish to use the shots hyper-parameters (e.g. $\texttt{shots} \in  [2, 4, 6 ]$).

Furthermore, we suggest using one or more from the following parameters in your DoE:

  * Model size, for example (`T5-flan-small`, `T5-flan-base`, `T5-flan-large`, etc.). (only recommended with GPU)
  * Numerical precision (`torch.float16`, `torch.float32`, `torch.bfloat16`). Make sure your hardware / `PyTorch` version supports this!
  * Quantization (only recommended with GPU with `BitsAndBytes` packages).
  * Structured decoding (requires implememtantation).


In short, you will need to perform;

1. (8 points) Design of Experiments in code;
    * Selection of critaria.
    * Type of factorial experiment.
    * Creation of experimental configuration.
    * Run your experiments.
        * Depending on your chosen variables in DoE, you might need to make some minor adaptations to our provided code.

> For your convenience, we have split first DoE part,  and the Design of Experiments (which you have to implement), and the ANOVA analysis into 2 cells. We strongly recommend writing data to disk/persistent storage and loading it in the next cell to make sure you can easily re-run evaluation upon restarting the notebook.


In [None]:
def run_q2_evaluation(
        dataloader: torch.utils.data.DataLoader,
        model: transformers.PreTrainedModel,
        generation_config,
        *args,
        **kwargs
) -> Tuple[List[List[str]], List[List[int]]]:
    """Helper function to run evaluation (e.g. under different evaluations.
    
    Args:
        dataloader: 
        model: 
        generation_config:
        *args: Any additional positional args you want to add.
    
    Keyword Args:
        **kwargs: Any additional keyword args you want to add.

    Returns:
    """
    prediction_list, label_list = [], []
    for idx, batch in (pbar := tqdm(enumerate(dataloader), leave=False, total=len(dataloader))):
        pbar.set_description(f'Batch {idx}')
        input_ids, attention_mask, label = batch['input_ids'].to(device), batch['attention_mask'].to(device), batch['label'].to(device)
        
        # TODO: You might need to implement something for your experiment here!

        # YOUR CODE GOES HERE
        ...
        # END OF YOUR CODE
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
        )
        prediction = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        prediction_list.append(prediction)
        label_list.append(label.cpu().tolist())
    return prediction_list, label_list

### Exericse 2.2.1 Design of Experiments
Define your Design of Experiment configurations in the list `EXPERIMENT_CONFIGURATIONS`, you can use this list to store experiment configurations for the different levels.

In [None]:
# TODO: Implement your experimental design here! Decide on hyper-parameters, levels, see Exercise 3 for how to set-this up
#  and type of factorial experiment you want to do.

EXPERIMENT_CONFIGURATIONS: List[Dict[Any, Any]] = [
    None
]
# YOUR CODE GOES HERE
...
# END OF YOUR CODE



In [None]:
# Perform inference. See the cells of Exercise 1 and 3 as a starting point


### Exercise 1.3 Report on DoE (7 points)

In [None]:
# TODO: Perform your calculations for DoE HERE.

# 1. Load data

# 2. Create model and fit

# 3. Check assumptions


> TODO: Write your report here, using appropriate tables, and or $math$, to support your claim.

Make sure to clearly state (among others):

 1. Which hyper-parameters you are testing.
 2. Which levels you are testing for each experiment.
 3. How many repetitions you use.
 4. Which design of experiment you use: full-factorial / fractional-factorial.
 5. Whether the assumptions of the model hold.

---

# Exercise 3: Fine-Tuning Based Classification (20 points total)

Lastly, we will perform a fine-tuning based approach, where we will update the model weights in order to 'learn' reply with the clasification of the sentiment of the sentence.

**N.B.** We provide most of the code here, as there are multiple non-trivial implementation details. However, the run-time is likely quitea bit longer, so make sure tostart in time.


Here we would like to advise to;

1. Carefully choose **which hyper-parameters** you want to evaluate, before diving into the implementation, make sure to check that you can reasonably run these experiment within reasonable time.
   1. We strongly recommend using a LORA based approach, and focus on; different `target_modules`, `rank`, `alpha`, `drop_out`, and `epochs`.
   2. Prefer low values for levels over higher, e.g., a level for epochs can be `1`, or for `steps=100`.
   3. You can also try to fine-tune the model, and see whether the fine-tuned model is still capable to perform.
   4. If your hardware / pytorch version allows, we also strongly recommend using `bitsandbytes` to further quantize the model, which will speed-up your experiments considerably.
2. Preferably run with replication, i.e., at-least a `REPLICATION` of `2`, but if time does not permit for this, a single run is OK as well.
3. Look into check-pointing, and recovery, and how much disk-space you need for your experiments.
4. Check that you save models to recoverable paths, i.e., you don't overwrite models you train.


## Exercise 3.1: Perform DoE (10 points)

First, you will need to complete the following code to Design your experiments.

> Note, running training will take some time, so make sure to get started early!


In [None]:
from peft import LoraConfig, get_peft_model, TaskType


def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

def tokenize_function(
        batch,
        prefix='Is the following Positive or Negative?\n',
        post_fix='\nAnswer: '):

    updated_text = [f"{prefix}{review}{post_fix}" for review in batch["text"]]
    batch['text'] = updated_text
    # We also set the 'response', i.e., what the model should learn
    batch['labels'] = tokenizer(['Positive' if label == 1 else 'Negative' for label in batch["label"]], truncation=True, padding='max_length', return_tensors="pt").input_ids
    
    return batch

def train_model(
        peft_model,
        output_dir: str,
        peft_training_args,
        train_set,
        test_set = None,
        
) -> Tuple[transformers.Trainer, peft.PeftModel]:
    assert output_dir is not None, "Provide an output dir to save the model"
    assert not Path(output_dir).exists(), "Provided output dir is not unique!"
    
    peft_trainer = transformers.Trainer(
        model=original_model,
        args=peft_training_args,
        train_dataset=train_set,
        eval_dataset=test_set,
    )
    # Pre-train the model
    peft_trainer.train()
    # Set the fine-tuned model to evaluate, to remove non-deterministic
    #  behavior.
    peft_model.eval()
    return peft_model, peft_trainer


In [None]:
# Example of hyper-parameters.
RANK = 32               # Rank used in model update (lower is faster, less precise)
ALPHA = 64              # Scaling factor for update (∆W x dy ALPHA/RANK)           
DROPOUT = 0.05          # Regularization term
TRAIN_BATCH_SIZE = 32   # Number of samples
# GRADIENT_ACCUMULATION_STEPS=1 # If you have low GPU/hardware, you can increase effective batch-size through this.
#                               # It 'sums' gradient over GRADIENT_ACCUMULATION_STEPS, to create an effective-batch-size of
#                               # GRADIENT_ACCUMULATION_STEPS * TRAIN_BATCH_SIZE
TRAIN_EPOCHS = 5        # Total number of trainnig steps.

# If you want to save some time, you can store checkpoints, and load them, to create multiple levels
# in a single run. Do note, that huggingface by default uses learning-rate scheduling, so this may
# affect your results a bit.

# The modules are specific to the model itself.
MODULES =  ['o'] # Other options are for example, please read the documentation.
                 # ['o'], ['k', 'q'], ['q'], ['k', 'q', 'v'], 'or any other identifier of weights.
TORCH_DTYPE = torch.float16

# TODO: Decide the levels for your experiment. These can be any of the 
# aforementioned parameters, or any other hyper-parameter.

# Hint: Define the levels as a list of numbers for the unique count of 
#   levels for a parameter.
levels: List[int] = ...
# Create a list with the names to keep track of the parameters
parameters: List[str] = ...
# Create a list with levels for each parameter
parameter_levels: Dict[str, List[Any]] = ...

# EXAMPLE ONLY
# Don't actually use this configuration, as this will be a 3 * 2 * 2 * 3 = 36 experiments (without replication)
levels = [3, 2, 2, 3]
level_names = ['rank', 'alpha', 'dtype',  'epochs']
parameter_levels = {
    'rank': [8, 16, 32],
    'alpha': [16, 32],
    'dtype': [torch.float16, torch.float32],
    'epochs': [1, 2, 3]
}
# END OF EXAMPLE 
# YOUR CODE GOES HERE
...
# END OF YOUR CODE

In [None]:
# TODO: Decide the type of (fractional or full) factorial experiment you want to run.
# HINT: use the ANOVAandDOE.ipynb notebook as inspiration, and use functinos from pyDOE3
import pyDOE3
# EXAMPLE ONLY
reduction = 4 # for general factorial experiment.
experiment = pyDOE3.gsd(
    levels, reduction=reduction
)
# END OF EXAMPLE

# YOUR CODE GOES HERE
...
# END OF YOUR CODE

experiment_configs = pd.DataFrame(
    experiment,
    columns=[level_names],
    
)
experiment_configs.index.name = 'Experiment ID'


display(
    experiment_configs,
)


In [None]:
REPETITIONS = 2

# If the number of tokens is a level, you might need to change this
train_dataset = (
    train_set
    .map(truncate_to_50_tokens, batched=True)
    .map(
        tokenize_function, batched=True
    )
    .map(
        lambda batch: fast_tokenizer.batch_encode_plus(
            batch['text'],
            add_special_tokens=True,
            return_tensors="pt",
            padding=True,
            truncation=False,
        ), batched=True
    )
)
# Ensure we can effectively use the model
train_dataset.set_format(type='torch', columns=['input_ids', 'labels'])
EXPERIMENT_CONFIGURATIONS = []
for repetition in range(REPETITIONS):
    for experiment_id, config_row in enumerate(experiment_configs.iterrows()):
        experiment_config = {k[0]: parameter_levels[k[0]][v] for k, v in config_row[1].to_dict().items()}
        
        EXPERIMENT_CONFIGURATIONS.append(experiment_config)
        print(f"Running experiment: {experiment_id + 1}, repetition: {repetition + 1}")
        print(f"Experiment config: {experiment_config}")
        
        # BEGIN OF YOUR UPDATE TO THIS CODE
        rank = experiment_config['rank']
        alpha = experiment_config['alpha']
        exp_dtype = experiment_config['dtype']
        epochs = experiment_config['epochs']
        
        lora_config = LoraConfig(
            r=rank,
            lora_alpha=alpha,
            target_modules=MODULES,
            lora_dropout=DROPOUT,
            bias='none',
            task_type=TaskType.SEQ_2_SEQ_LM # Specific for FLAN-T5 model.
            # task_type=TaskType.CAUSAL_LM # Specific for Auto-regressive model
            # task_type=TaskType.TOKEN_CLS # Specific for Token based classification
        )
        
        original_model, tokenizer, tokenizer_fast = get_model(
            model_name=model_name,
            device=device,
            torch_dtype=exp_dtype,
        )
        
        output_dir = f'./exercise-3/exp_{repetition}_{experiment_id}_rank={rank}_alpha={alpha}_dtype={exp_dtype}_epochs={epochs}'

        
        peft_training_args = transformers.TrainingArguments(
            output_dir=output_dir,
            auto_find_batch_size=False,
            per_device_train_batch_size=TRAIN_BATCH_SIZE,
            learning_rate=1e-4,
            num_train_epochs=epochs,
            logging_steps=1000,     # You might need to change this, esp. if you subsample the train set.
            # max_steps=10000,        # You can use this instead of epochs, for more fine-grained control.
            save_total_limit=2,     # Limit the number of checkpoints to save
            save_strategy='steps',
            save_steps=1000         # You might need to change this
        )
        # END OF YOUR UPDATE CODE
        
        peft_model = get_peft_model(
            model=original_model,
            peft_config=lora_config,
        )
        peft_model, peft_trainer = train_model(
            peft_model=peft_model,
            peft_training_args=peft_training_args,
            output_dir=output_dir,
            train_set=train_dataset,
            test_set=None,
        )
        peft_model.save_pretrained(output_dir)
        
        del peft_model, peft_trainer
        if device == 'cuda':
            torch.cuda.empty_cache()
        
        print('Finished experiment!')
        

In [None]:
# Next do the evaluation

# ONLY SET THIS TO True IFF YOU NEED TO RE-RUN EXPERIMENTS, AS IT WILL
#  OVERWRITE YOUR RESULTS.
ALLOW_OVERWRITING_RESULTS = False
"""
# If you want to experiment with the side of the prompt, you will need to make
#  some changes here.
"""
# If the number of tokens is a level, you might need to change this
test_dataset = (
    test_set
    .map(truncate_to_50_tokens, batched=True)
    .map(
        tokenize_function, batched=True
    )
    .map(
        lambda batch: fast_tokenizer.batch_encode_plus(
            batch['text'],
            add_special_tokens=True,
            return_tensors="pt",
            padding=True,
            truncation=False,
        ), batched=True
    )
)
# Ensure we can effectively use the model
test_dataset.set_format(type='torch', columns=['input_ids', 'labels'])
test_datasloader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=15,  # Feel free to lower / higher this
            shuffle=False,  # Shuffling not needed during evaluation
            num_workers=2,
            prefetch_factor=10,
        )

EXPERIMENT_CONFIGURATIONS = []
for experiment_id, config_row in enumerate(experiment_configs.iterrows()):
    experiment_config = {k[0]: parameter_levels[k[0]][v] for k, v in config_row[1].to_dict().items()}
    
    EXPERIMENT_CONFIGURATIONS.append(experiment_config)
    
original_model, tokenizer, tokenizer_fast = get_model(
            model_name=model_name,
            device=device,
            torch_dtype=torch.float16, # You might need to change this.
)

In [None]:
for experiment_id, experiment_config in (exp_bar := tqdm(enumerate(EXPERIMENT_CONFIGURATIONS), leave=True)):
    # EXAMPLE CODE HERE
    for repetition in tqdm(range(REPETITIONS), leave=False):
    
        # BEGIN OF YOUR UPDATE TO THIS CODE
        rank = experiment_config['rank']
        alpha = experiment_config['alpha']
        exp_dtype = experiment_config['dtype']
        epochs = experiment_config['epochs']
        
        # END OF YOUR UPDATE TO THIS CODE
        # TODO: make sure that your output-dir here has the same format as during training.
        output_dir = f'./exercise-3/exp_{repetition}_{experiment_id}_rank={rank}_alpha={alpha}_dtype={exp_dtype}_epochs={epochs}'
        
        peft_model = peft.PeftModel.from_pretrained(original_model, output_dir)
        begin_time = time.time()

        prediction_list, label_list = run_q1_evaluation(
            test_dataset,  # This you should probably not change
            peft_model,  # You might need to change / load a different model for model-parameter
            experiment_config,  # You might need to update some kwargs int the generation config for your exp.
        )
        end_time = time.time()
        # Create a flat version to work with.
        prediction_list, labels_list = list(chain(*prediction_list)), list(chain(*label_list))
        # TODO: Store your results in a way such that you can load it later!
        # YOUR CODE GOES HERE
        ...
        # END OF YOUR CODE
        
        # Map the output if we don't recognize it to 
        label_lut = defaultdict(lambda: -1, {'positive': 1, 'negative': 0})
        
        predictions = list(map(lambda x: label_lut[x.split(' ')[0].lower()], prediction_list))
        
        accuracy = sum(map(lambda x: x[0] == x[1], zip(predictions, labels_list))) / len(predictions)
        unknown =  sum(map(lambda x: x[0] == -1, zip(predictions, labels_list))) / len(predictions)

        print(f"Accuracy ({configuration}): {accuracy}, Unknown: {unknown}")

        
        # Write file to disk
        save_path = Path(output_dir) / f'result_replication={repetition}.json'
        if not save_path.parent.exists():
            # Recursively create directory
            save_path.parent.mkdir(exist_ok=True, parents=True)
        if save_path.is_file() and not ALLOW_OVERWRITING_RESULTS:
            print("YOU ARE TRYING TO OVERWRITE AN EXISTING EXPERIMENT FILE!")
            raise Exception("Cannot overwrite existing experiment file without `ALLOW_OVERWRITING_RESULTS` flag set.")
        
        with open(save_path, 'w') as f:
            # TODO: You might want to save some additional results.
            json.dump({
                "experiment_config": experiment_config,
                "accuracy": accuracy,
                "unknown": unknown,
                "begin_time": begin_time,
                "end_time": end_time,
            }, f)


## Excercise 3.2 Experimental Analysis (10 points)

In [None]:
 # TODO: Put your code here to perform DoE
 
 # 1. Load data
 
 # 2. Create model and fit
 
 # 3. Check assumptions

# TODO: Write your report here, using appropriate tables, and or $math$, to support your claim.

Make sure to clearly state (among others):

1. Which hyper-parameters you are testing 
2. Which levels you are testing for each experiment
3. How many repetitions you use
4. Which design of experiment you use: full-factorial / fractional-factorial.
5. Whether the assumptionsn of hte model hold
