# Deep Learning for NLP - Exercise 04
In this exercise, we will first look at the inner workings of the attention mechanism and the importance of various heads in pretrained models. The second task compares multiple decoding strategies on the same model and prompt to highlight how much of an effect the decoding algorithm has on the quality of generated text.

Task 1 and Task 2 can be worked on independently.
___

General hints:
* Have a look at the imports below when solving the tasks
* Use the given modules and all submodules of the imports, but don't import anything else!
    * For instance, you can use other functions under the `torch` or `nn` namespace, but don't import e.g. PyTorch Lightning, etc.
* It is recommended to install all packages from the provided environment file
* Feel free to test your code between sub-tasks of the exercise sheet, so that you can spot mistakes early (wrong shapes, impossible numbers, NaNs, ...)
* Just keep in mind that your final submission should comply with the provided initial format of this file

Submission guidelines:
* Make sure that the code runs on the package versions from the provided environment file
* Do not add or change any imports (also don't change the naming of imports, e.g. `torch.nn.functional as f`)
* Remove your personal, additional tests and experiments throughout the notebook
* Do not change the class, function or naming structure as we will run tests on the given names
* Additionally export this notebook as a `.py` file, and submit **both** the executed `.ipynb` notebook with plots in it **and** the `.py` file
* **Deviation from the above guidelines will result in partial or full loss of points**

In [None]:
# !pip install transformers==4.24.0
# !pip install datasets==3.0.1
# !pip install bertviz==1.4.0
# !pip install plotly==5.17.0

# Task 1: Looking 'Inside' the Models

In [2]:
import numpy as np
import pandas as pd

import plotly.express as px
import matplotlib.pyplot as plt
from bertviz import head_view, model_view

import torch

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForSequenceClassification,
    BertConfig,
)

## Task 1.1: Visualizing and Analyzing Attention Maps

* In the following experiment, we compare three models: a randomly initialized BERT model, a trained BERT model, and a trained GPT-2 model
* Start by loading a randomly initialized BERT model
    * You can achieve this by loading an [AutoModelForSequenceClassification.from_config](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification.from_config)
    * Make use of the imported `BertConfig`
    * Specify `output_attentions=True`
* Repeat the process for pre-trained BERT and GPT-2
    * For pre-trained BERT, load [bhadresh-savani/bert-base-uncased-emotion](https://huggingface.co/bhadresh-savani/bert-base-uncased-emotion) using [AutoModelForSequenceClassification.from_pretrained](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForSequenceClassification.from_pretrained)
    * For GPT-2, it is enough to use the [AutoModel.from_pretrained](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel.from_config) method
        * This model will not have a language modelling head on top, but only returns hidden states. Since we are only interested in attention outputs, this is enough in this case
    * Also specify `output_attentions=True` for both
* Set all models into `eval()` mode
* Load BERT's and GPT-2's tokenizers accordingly using the `AutoTokenizer` import

In [None]:
gpt_model_name = 'gpt2'
bert_model_name = "bhadresh-savani/bert-base-uncased-emotion"

rnd_bert = #TODO
rnd_bert.eval()

bert = #TODO
# TODO

gpt2 = # TODO
# TODO

In [None]:
tokenizer_bert = # TODO
tokenizer_gpt = # TODO

* Encode the sentence using both tokenizers
    * Return the sentences as torch tensors
* Save the tokenized sequence as a list of strings in a variable (required for plotting below)

In [None]:
sentence = "I was so relieved after the phone call yesterday. I smiled the whole day."
# TODO

* Run the sequence through all 3 models
    * Extract and save the attention outputs
    * Disable gradient calculation to save memory and speed up the process

In [None]:
# TODO

* Visualize the attention maps using `bertviz`'s [model_view](https://github.com/jessevig/bertviz#model-view) and [head_view](https://github.com/jessevig/bertviz#head-view) for each of the three models
* Inspect the attention patterns and structures and answer the questions below
* If you encounter a `javascript: require is not defined` error, simply execute the cell again; sometimes JupyterLab's widgets need to be reloaded

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

In [None]:
# TODO

Hint: State approximately 2-3 observations/answers per question. Always point to the specific layer(s) and head(s) where your observation(s) can be found.

Questions:
* What difference can you observe between the randomly initialized model and both trained models?
* What general patterns and differences in attention maps can you see between the trained BERT and GPT-2 models? How can they be explained? Think about the attention and masking behavior of BERT vs. GPT models.
* Compare the trained BERT and GPT-2 model on the following aspects:
    * Hierarchical patterns: Do lower or higher layers attend to more local and syntactic structures?
    * Intra-layer head behavior: Do heads belonging to the same layer capture similar or different structures or patterns?
    * Redundancy: Are there repeated attention patterns across heads and/or layers? Is redundancy rather beneficial or hurtful?
    * Interpretability: Can you find interpretable structures in certain layers and/or heads? You could try to look for DET-NOUN dependencies, subject-verb-object structures, next- or previous-word attention. For instance, GPT-2's layer 5 head 4 attends 'smiled' to 'the whole day'.


___
Student answers here:

___

## Task 1.2: Role and Importance of Individual Heads

### Task 1.2.1: Entropy Per Head

* We saw in our visualizations that sometimes a single head attended only to one specific token, but also that often an individual head attended almost uniformly to all tokens.
* This phenomenon can be measured, i.e. quantified, using [Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory))
* First, implement the following entropy formula:
$$
H(X) := - \sum_{x \in X} p(x) \log{p(x)}
$$
    * Our attention output tensor of a single layer $L$ of shape `[batch_size, num_heads, seq_len, seq_len]` represents the discrete random variable $X$
    * $X$ consists of `num_heads=12` attention heads in the case of BERT
    * Each attention head has a sequence length of `seq_len` (whatever you chose above for your sentence)
    * $p(x)$ represents the probability distribution over the `seq_len` tokens in the sequence
    * Each $x \in X$ corresponds to a token position in the sequence
    * For each $x \in X$, $p(x)$ is a value between 0 and 1: the attention weight that the querying token assigns to the token at position $x$
        * Remember that due to the softmax operation during attention calculation, each attention head of our output already represents a probability distribution over the token sequence
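
Hint: To build intuition for the formula, here is a small worked example in plain Python on two hand-made attention rows. The function name `row_entropy` and the example rows are assumptions for illustration; your `entropy` function below should operate on whole tensors instead.

```python
import math

def row_entropy(p):
    """Shannon entropy (natural log) of one attention row.

    Terms with p(x) == 0 are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(px * math.log(px) for px in p if px > 0.0)

focused = [1.0, 0.0, 0.0, 0.0]      # all attention on a single token
uniform = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly

h_focused = row_entropy(focused)
h_uniform = row_entropy(uniform)
print(h_focused, h_uniform)
```

The focused row has zero entropy, while the uniform row reaches the maximum of $\log(4)$ for four tokens. In a tensor implementation, exact zeros (for example at causally masked positions) would evaluate `0 * log(0)` to NaN, so they likely need similar care.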

In [None]:
def entropy(p):
    # TODO
    return # TODO

* Apply the function to all attention layers
    * Per layer, the resulting entropy output should be of shape `[batch_size, num_heads, num_tokens]`
* Calculate the entropy per head by summing up the per-token entropy
    * Repeat the process for all layers in the model
* Compare the entropy across heads and layers by visualizing the trends in a line plot
    * The x-axis should represent the 12 head positions
    * The y-axis should represent the entropy per head
    * The plot should include 12 lines (i.e. the layers) with 12 positions (i.e. the heads)
    * Create an interactive plot by using [plotly.express.line](https://plotly.com/python-api-reference/generated/plotly.express.line)
        * Complete the function `create_plotly_lineplot`
        * As the documentation shows, it expects a DataFrame `df` where each column is a layer, and each row holds the entropy values of one head index across those layers
        * The `plotly.express.line` function returns a [Figure](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html)
        * We can [fig.update_layout](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html#plotly.graph_objects.Figure.update_layout) to include [custom buttons](https://plotly.com/python/custom-buttons/) to toggle between the lines
        * To achieve this, we need to create one dictionary per layer whose `visible` list is `True` at its layer number, i.e. if it's the third layer, the third index should be `True`, and all remaining ones `False`.
        * In the end, we need a list with 12 dictionaries in the following style
        ```
        {
            "label": "Layer 0",           # the name of the button
            "method": "update",           # means that we should 'update' the plot when clicking this
            "args": [                     # the visibility argument for the respective layer
                {                         
                    "visible": [
                        True,             # set True for the corresponding index of the layer
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                        False,
                    ]
                }
            ],
        }
        ```
        * Add to the list a `Show All` button with visibility settings to all `True`
        * Then, use `fig.update_layout`'s `updatemenus` (as linked above) to include the buttons in the plot
        * Lastly, use [fig.update_traces](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html#plotly.graph_objects.Figure.update_traces) with `mode="lines+markers"` to include markers at each x-tick in the plot
        * Return the figure at this stage
    * Save the plot [as an html-file](https://plotly.com/python-api-reference/generated/plotly.graph_objects.Figure.html#plotly.graph_objects.Figure.write_html). You can now open it in your browser and analyze the entropy per layer and head. You can also hover over each line and see the exact entropy value for that layer and head.
    * You can do this interactive plotting style for [all kinds of plots](https://plotly.com/python-api-reference/generated/plotly.express.html#module-plotly.express), which makes analyzing high-dimensional data much more intuitive
    * Include a screenshot of the html view in your submission
* Go back to the visualized BERT attention maps and check that some of the patterns here match with the observed visualizations
    * Choose some distinct results (e.g. outliers) and compare them with the `head_view`
    * In the drop-down menu, select the outlier layer
        * (1) Hover over the words to see the color patterns of each respective head
        * (2) Double click on the head to see only the attention weights and connections of that head
* Repeat for some (3 or more) patterns and describe what you see
    * What do high entropy heads correspond to when visualized?
    * What do low entropy heads correspond to when visualized?
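
Hint: The button dictionaries described above can be generated in a loop instead of being written out by hand. The following sketch only builds the plain Python list; the helper name `make_layer_buttons` is an assumption for illustration and no plotly call is made here.

```python
def make_layer_buttons(num_layers):
    """Build one toggle button per layer plus a final 'Show All' button."""
    buttons = []
    for layer in range(num_layers):
        # Only the trace belonging to this layer stays visible.
        visible = [i == layer for i in range(num_layers)]
        buttons.append({
            "label": f"Layer {layer}",
            "method": "update",
            "args": [{"visible": visible}],
        })
    buttons.append({
        "label": "Show All",
        "method": "update",
        "args": [{"visible": [True] * num_layers}],
    })
    return buttons

buttons = make_layer_buttons(12)
print(len(buttons), buttons[2]["label"])
```

A list like this could then be handed to `fig.update_layout` via `updatemenus=[{"buttons": buttons}]`, as shown in the linked custom-buttons documentation.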

In [None]:
def create_plotly_lineplot(df):
    # TODO

    return # TODO

In [None]:
# TODO

___
Student answers here:

___

### Task 1.2.2: Importance Per Head

* After having seen how different attention heads correspond to different entropy levels, we can also calculate an [importance score](http://arxiv.org/abs/1905.10650) for each head
* For this experiment, we define a mask $\xi_{h}$, which can drop out (i.e. set to zero) an attention head
    * If the mask is not active (i.e. $\xi_h = 1$), the attention weights of the head remain unchanged
* Then, we feed in a sample $x$, calculate the loss $\mathcal{L}(x)$, and analyze the sensitivity of the loss w.r.t. the masked head
* The importance of that head follows as
$$
I_h = \big\vert \frac{\partial \mathcal{L}(x)}{\partial \xi_h} \big\vert
$$
* The absolute value prevents large negative and positive values from cancelling each other out
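
Hint: The idea behind $I_h$ can be tried out on a toy problem without any neural network. In the sketch below, the "model" is just a mask-weighted sum of made-up per-head outputs, and the gradient is approximated with central finite differences instead of autograd. All names and numbers are assumptions for illustration.

```python
def toy_loss(xi, head_outputs, target):
    """Squared error of a toy 'model' whose prediction is the
    mask-weighted sum of per-head outputs."""
    prediction = sum(m * o for m, o in zip(xi, head_outputs))
    return (prediction - target) ** 2

def head_importance(xi, head_outputs, target, eps=1e-6):
    """Approximate |dL/dxi_h| for every head with central differences."""
    scores = []
    for h in range(len(xi)):
        up = list(xi)
        up[h] += eps
        down = list(xi)
        down[h] -= eps
        grad = (toy_loss(up, head_outputs, target)
                - toy_loss(down, head_outputs, target)) / (2 * eps)
        scores.append(abs(grad))
    return scores

head_outputs = [2.0, 0.0, -0.5]  # hypothetical per-head contributions
xi = [1.0, 1.0, 1.0]             # inactive mask: nothing is dropped
importance = head_importance(xi, head_outputs, target=1.0)
print([round(s, 4) for s in importance])
```

The head that contributes nothing to the prediction receives an importance of zero, while the head with the largest contribution is the most sensitive one. Note that the mask stays at 1 during this computation: the importance is a sensitivity, not the effect of actually removing the head.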

* Define the `device` to use the GPU

In [None]:
device = # TODO

* For this experiment, we will work with the [DAIR-AI Emotion Dataset](https://huggingface.co/datasets/dair-ai/emotion), but in this case you only need to load the `split='test'`
    * Select only the first 32 samples
    * Tokenize, pad, truncate to maximum 512 tokens, and return torch tensors of those 32 samples
    * Extract the labels of those 32 samples, too
    * This batch will serve as the exemplary classification dataset on which we calculate importance metrics


In [None]:
# TODO

* Then, we will write a function `get_head_importance`, which takes in the `model`, `samples`, `labels`, `device`, and an optional `mask`
* Define a zero-tensor of shape `[num_layers, num_heads]`, which stores the importance scores of all heads
* If no `mask` is provided, create a `mask` of ones (i.e. we are not masking anything currently) in the same shape
* In all cases, whether given or newly created, [enable gradient calculation](https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad_.html) for the mask
* Move all necessary objects to the `device`
* Forward the samples through the model
    * Include the [labels](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification.forward.labels) and [head_mask](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification.forward.head_mask)
    * Extract loss and logits from the [output](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput)
* Then, call `backward()` on the loss to distribute the gradients among the head mask
    * We do not train the model (i.e. no optimizer here), but we need the gradients (see formula above) for each head mask position w.r.t. each loss sample
    * PyTorch's autograd and `backward()` method allow us to calculate and backpropagate any form of gradients, even without the classical training/optimization setup
* Now, we can check the head importance by accessing the [grad](https://pytorch.org/docs/stable/generated/torch.Tensor.grad.html) attribute of the earlier created `mask` tensor
    * [detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) the tensor from the gradient calculation graph
    * Take the absolute value of it as shown in the formula
* It is common to normalize the head-importance scores in two ways
    * First, divide the head importance scores by the number of real (i.e. non-padding) tokens in the batch
    * Secondly, calculate the [l2-norm](https://mathworld.wolfram.com/L2-Norm.html) per layer
        * Make sure to adjust the shapes of the $l_2$ norm to fit the head importance tensor
        * Add a safety term of `1e-20` to the $l_2$ norm to avoid possible divisions by zero
        * Divide the head importance by the $l_2$ norm
* Return the head importance, logits, and labels
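
Hint: The two normalization steps can be checked by hand on a tiny example. The sketch below uses nested Python lists in place of a `[num_layers, num_heads]` tensor; the function name and the numbers are assumptions for illustration.

```python
import math

def normalize_importance(importance, num_real_tokens, eps=1e-20):
    """Apply both normalization steps to a [num_layers][num_heads] table."""
    normalized = []
    for layer_scores in importance:
        # Step 1: divide by the number of non-padding tokens in the batch.
        per_token = [s / num_real_tokens for s in layer_scores]
        # Step 2: divide by the layer's l2 norm (plus a small safety term).
        l2 = math.sqrt(sum(s * s for s in per_token))
        normalized.append([s / (l2 + eps) for s in per_token])
    return normalized

raw = [[3.0, 4.0], [0.0, 0.0]]  # two layers with two heads each
normalized = normalize_importance(raw, num_real_tokens=10)
print(normalized)
```

After normalization, each layer with non-zero scores has unit $l_2$ norm, and the all-zero layer stays at zero thanks to the safety term. With tensors, the per-layer norm has one entry per layer, so it presumably needs an extra trailing dimension before it can be broadcast against the head importance tensor.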

In [None]:
def get_head_importance(model, samples, labels, device, mask=None):
    # TODO
    return # TODO

* Calculate the head importance for our BERT model and the 32 samples

In [None]:
# TODO

* Visualize the head importance in the same interactive way as you did with the entropy plots
* You can re-use `create_plotly_lineplot`, just create a DataFrame from the returned head importance tensor instead of the entropy tensor
* Comment on any trends you see in the importance of heads. Which head is the most important?
* Create again an HTML version of the plot and include a screenshot of your plot in the submission

In [None]:
# TODO

___
Student answers here:

___

* Investigate whether there is a positive, negative, or no correlation between each attention head's entropy and importance score
* Calculate the [correlation coefficient](https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html) between the entropies and importance scores
* Create a scatter plot to visualize your findings
* Comment on the results

In [None]:
# TODO

___
Student answers here:

___

### Task 1.2.3: Masking and Pruning Heads

* As we can see from the line and scatter plots, a lot of attention heads have an importance of near 0
* As a consequence, we can investigate which and how many heads we can remove from our model before performance drops by more than a predefined threshold
* Dropping attention heads leads to dropping parameters, which results in a smaller model for further finetuning or inference
* However, directly dropping heads is risky, since dropping a head affects the calculations and, therefore, the performance of the remaining model
* Therefore, we remove heads through masking consecutively and test the performance after each removed head
* Once we found an acceptable performance-masking tradeoff, we can actually remove the heads from the model

* Write a function that takes in a `model`, `samples`, `labels`, `threshold` and `device`
* Start again with an inactive mask
* Set the starting head importance, logits, and labels
* Use logits and labels to calculate a base accuracy performance with the inactive mask
* We perform the following process until either the accuracy dropped more than 5% (i.e. threshold level) below starting accuracy, or until all heads are masked
    * We continuously select the next lowest importance head
    * We mask its position in the mask
    * Recalculate the head importance and accuracy level with the masked heads
      * Hint: Depending on your implementation, you might find [clone](https://pytorch.org/docs/stable/generated/torch.clone.html) and/or [detach](https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html) helpful when dealing with the `mask` and/or `head_importance`
    * Set the head's importance to positive infinity so that it will always be last in future importance rankings
* Save the number of masked heads with the corresponding accuracy level for each iteration as a tuple `(int, float)` in a list
    * The accuracy should be rounded to 4 decimal places
* The function returns the final `mask` and the list of tuples
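
Hint: The control flow of the masking loop can be prototyped without a model. In the sketch below, `evaluate` is a hypothetical callable standing in for the importance and accuracy computation, and `toy_evaluate` uses a made-up rule for how accuracy reacts to masked heads. Returning the mask from before the failing removal is one possible design choice and an assumption here, not a requirement of the task.

```python
import math

def mask_until_drop(evaluate, num_heads, threshold):
    """Mask heads one at a time, least important first.

    `evaluate(mask)` is a hypothetical callable returning
    (importance_per_head, accuracy) for a given 0/1 mask.
    """
    mask = [1.0] * num_heads
    importance, base_acc = evaluate(mask)
    importance = list(importance)
    history = []
    while sum(mask) > 0:
        head = min(range(num_heads), key=lambda h: importance[h])
        candidate = list(mask)
        candidate[head] = 0.0
        new_importance, acc = evaluate(candidate)
        history.append((num_heads - int(sum(candidate)), round(acc, 4)))
        if acc < base_acc - threshold:
            break  # the last removal cost too much; keep the previous mask
        mask = candidate
        importance = list(new_importance)
        # Already-masked heads must never be selected again.
        for h in range(num_heads):
            if mask[h] == 0.0:
                importance[h] = math.inf
    return mask, history

TOY_IMPORTANCE = [0.05, 0.9, 0.01, 0.4]

def toy_evaluate(mask):
    # Made-up rule: accuracy shrinks by the importance of every masked head.
    lost = sum(w for w, m in zip(TOY_IMPORTANCE, mask) if m == 0.0)
    return TOY_IMPORTANCE, 1.0 - lost

final_mask, history = mask_until_drop(toy_evaluate, num_heads=4, threshold=0.1)
print(final_mask, history)
```

In this toy run, the two least important heads are masked without crossing the threshold, and the attempt to mask a third head ends the loop.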

In [None]:
def find_min_heads(model, samples, labels, threshold, device):
    # TODO

    return # TODO

* Find the minimum number of heads required before performance drops below the given 5% threshold
* Plot the accuracy level per number of masked heads below
* Comment on your findings
    * Keep in mind that we are evaluating on a very small set of 32 samples
        * I.e., extreme results are possible, such as masking large percentages of the model's heads without noticing performance differences
    * It is possible that the accuracy can rise intermediately, even though heads are dropped. Try to explain what is happening in such cases.
    * In practice, you would usually run this experiment on the whole [GLUE dataset](https://huggingface.co/datasets/glue) to have a more representative and diverse accuracy baseline

In [None]:
# TODO

___
Student answers here:
___

* Now that we have our mask, we can actually remove from the model the heads we masked before (pruning)
* Create a dictionary that [maps each layer's index to the indices of the heads to prune](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.prune_heads)
* Calculate the number of parameters before and after pruning to show the effect
    * Note that the `prune_heads` method changes the model in-place

In [None]:
# TODO

# Task 2: Decoding

* In the following, we will have a look at various ways of generating text from model output probabilities
* Specifically, we will see how much of an impact the decoding strategy has on the generated text
* Using Hugging Face, we will try both basic and more advanced as well as commonly used decoding strategies for the same input prompt
* Therefore, we can see and analyze the shortcomings and advantages of various decoding strategies *while keeping the model and prompt equal*

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

* Define the device for decoding
* Load GPT-2 with its language modeling head and the corresponding tokenizer
    * Since we are doing open-ended text generation, there is no padding token necessary
    * To disable the warning by the model, you can set `pad_token_id=tokenizer.eos_token_id` inside the `from_pretrained` model loading process
    * Set it into eval mode

In [None]:
device = # TODO

tokenizer = # TODO
model = # TODO
# TODO

## Task 2.1: Greedy Search

* Greedy search is the most straightforward method for generating sequences
* Starting from an initial context, e.g. a prompt, the model generates the next token in the sequence by selecting the token that has the highest predicted probability according to the model
* The selected token is added to the sequence, and the process is repeated to generate the entire sequence
* Importantly, in each step the model *only* considers the most likely token based on its current context
* Greedy search is deterministic, meaning it always chooses the same token given the current context
* The decoding algorithm continues until the set maximum generated length has been reached, or until the end-token has been generated
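
Hint: The behavior described above can be reproduced on a tiny hand-made "language model". The sketch below is a toy in plain Python; the bigram table `TOY_LM` and its probabilities are made up for illustration and have nothing to do with GPT-2.

```python
# Hypothetical next-token probabilities of a tiny bigram "language model".
TOY_LM = {
    "I": {"came": 0.6, "went": 0.4},
    "came": {"back": 0.7, "home": 0.3},
    "went": {"home": 0.9, "back": 0.1},
    "back": {"home": 0.55, "<eos>": 0.45},
    "home": {"<eos>": 1.0},
}

def greedy_decode(prompt, max_new_tokens):
    """Repeatedly append the single most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_probs = TOY_LM[tokens[-1]]
        best = max(next_probs, key=next_probs.get)
        if best == "<eos>":
            break  # stop once the end token is the most likely choice
        tokens.append(best)
    return tokens

greedy_out = greedy_decode(["I"], max_new_tokens=10)
print(" ".join(greedy_out))
```

Greedy search follows "I came back home", which has a total probability of 0.6 * 0.7 * 0.55 = 0.231 under this toy table. The sequence "I went home" would have scored 0.4 * 0.9 = 0.36, but it is never considered because "went" loses the very first step.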

* Given the sentence below:
    * Tokenize the prompt
    * Generate the output using [model.generate()](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)
        * Limit the generation length to 128 tokens
    * Decode the generated output to a human readable format
    * Discuss the quality of output. Relate the possible advantages and disadvantages of the output to the algorithm used.

In [None]:
sentence = "I came back from holidays and"
# TODO

___
Student answers here:

___

## Task 2.2: Beam search

* Beam search is an extension of greedy search and aims to overcome some of its limitations
* Starting from an initial context, the model generates multiple candidate tokens for the next position, typically referred to as "beams"
* Each beam represents a hypothesis for the next token in the sequence
* The model scores and ranks these beams based on their predicted probabilities
* The top-k beams with the highest scores are selected to continue the generation process, where k is a user-defined parameter known as the "beam width"
* The selected beams are extended with candidate tokens, and the process is repeated iteratively
* Beam search maintains multiple active hypotheses in parallel, allowing it to optimize coherence in the sequence across multiple paths
* Decoding continues until one or more beams generate an end token or reach the maximum sequence length

* In the following, we will implement (a simplified version of) the beam search algorithm, and compare it to Hugging Face's implementation of beam search
    * Hint: Aside from efficiency tweaks, we do not include any length bonus or penalization of short sequences, as you may later notice (depending on the prompt)
* First, we define a helper function called `generate_candidates`, which will generate our possible next token along each hypothesis path
    * It takes as input the `model`, the `context`, which is the tokenized prompt, the so far newly generated `sequence`, the `beam_width`, and the `device`
    * The model predicts logits based on the concatenated context and sequence
    * We extract the last predicted token position of the logits, which serves as the next token prediction
    * From these logits, we return the top-k indices of the vocabulary, where the `beam_width` parameter determines the $k$
    * Make sure to place everything on the `device` as required
* The main `beam_search` function takes as input the `model`, `context`, `beam_width`, `max_length`, and `device`
    * Create a data structure `hypotheses` that holds all of the newly generated sequences along with their log-probability scores
    * In our simplified version, we generate new tokens until the maximum length is reached
    * In each step, we iterate through all hypotheses
    * Each hypothesis generates candidates with our above helper function based on the current progress of newly generated sequences
        * In the first iteration, this newly generated sequence is simply empty (it is not necessary to consider start tokens here)
        * As a result, we generate the first new token only based on the provided `context`
    * Next, we extend the hypothesis step by step with each candidate token
        * Each extension is paired with the initial context and put through the model
        * Then, we calculate log-probabilities for each position in the vocabulary and extract the candidate position
        * This log-probability is stored together with the hypothesis and extended/summed up with each new candidate token and log-probability
        * Save all scored hypothesis and candidate combinations
    * Based on all hypothesis combinations, extract the top-`beam_width` number of combinations based on the highest scores
    * Overwrite the earlier created `hypotheses` datastructure with those top-`beam_width` hypotheses, so that we start the next iteration with `beam_width`-many hypotheses
    * When `max_length` tokens have been generated, return the combined indices of the initial context and the generated ones
    * Disable gradient calculation and move everything to `device` as necessary
* Use the function with `beam_width=5` and `max_length=50` to generate the output, then decode it as done above.
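
Hint: The bookkeeping of hypotheses and summed log-probabilities can be tried out on a toy problem first. The sketch below runs beam search over a made-up bigram table `TOY_LM` in plain Python. It is not the required implementation: there is no model call, and unlike the simplified version described above, finished hypotheses ending in `<eos>` are simply carried over, which is an assumption of this toy.

```python
import math

# Hypothetical next-token probabilities of a tiny bigram "language model".
TOY_LM = {
    "I": {"came": 0.6, "went": 0.4},
    "came": {"back": 0.7, "home": 0.3},
    "went": {"home": 0.9, "back": 0.1},
    "back": {"home": 0.55, "<eos>": 0.45},
    "home": {"<eos>": 1.0},
}

def toy_beam_search(prompt, beam_width, max_new_tokens):
    """Keep the `beam_width` best partial sequences by summed log-probability."""
    hypotheses = [(list(prompt), 0.0)]  # (tokens, log-probability score)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in hypotheses:
            if tokens[-1] == "<eos>":
                candidates.append((tokens, score))  # finished: carry over
                continue
            for token, prob in TOY_LM[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the top-`beam_width` scored hypotheses for the next step.
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        hypotheses = candidates[:beam_width]
    return hypotheses[0]

best_tokens, best_score = toy_beam_search(["I"], beam_width=2, max_new_tokens=4)
narrow_tokens, _ = toy_beam_search(["I"], beam_width=1, max_new_tokens=4)
print(best_tokens, round(math.exp(best_score), 3))
```

With a beam width of 2, the search recovers "I went home", the sequence with the highest total probability under this table. A beam width of 1 reduces to greedy search, which commits to "came" in the first step.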

In [None]:
def generate_candidates(model, context, sequence, beam_width, device):
    # TODO

    return # TODO

In [None]:
def beam_search(model, context, beam_width, max_length, device):
    # TODO

    return # TODO

In [None]:
# TODO

* Now repeat the beam search process using [model.generate()](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)
    * Limit the generation length to 128 tokens
    * To choose beam search, set `num_beams=5` and `early_stopping=True`
* Decode the generated output to a human readable format
* Discuss the quality of our simplified beam search output and Hugging Face's output. Relate the possible advantages and disadvantages of the output to the algorithm used.

In [None]:
# TODO

___
Student answers here:
___

* In order to specifically reduce repetitions, beam search can be adapted with a `no_repeat_ngram_size=2` option
    * This prevents the model from generating any ngrams of size 2 twice
* Generate its output using the option and discuss the new results again. When can this option be useful? In which case does it (always) deteriorate the output?

In [None]:
# TODO

___
Student answers here:

___

## Task 2.3: Sampling
* Sampling decoding is a probabilistic approach for generating sequences of tokens
* Starting from an initial context, the model generates the next token in the sequence by randomly sampling from the distribution of predicted token probabilities
* Instead of deterministically selecting the most likely token as in greedy search, sampling decoding introduces randomness by considering the predicted probabilities as a probability distribution
* The model assigns probabilities to each possible token, and the next token is chosen probabilistically based on these probabilities
* Tokens with higher probabilities are more likely to be selected, but there is still an element of randomness involved
* Sampling decoding can be guided by a temperature parameter, where lower temperatures sharpen the distribution and thereby increase the likelihood of highly probable words, and higher temperatures smooth out the distribution
* The decoding continues iteratively, with each token influencing the distribution of probabilities for the next token.
* The process is repeated until an end token is generated, or the maximum sequence length is reached
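
Hint: The effect of the temperature can be seen on a handful of logits. The sketch below divides made-up logits by the temperature before applying the softmax; the function name and values are assumptions for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (temperature must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(toy_logits, temperature=0.6)
plain = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=2.0)

for name, probs in [("T=0.6", sharp), ("T=1.0", plain), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

A temperature below 1 puts more probability mass on the most likely token, while a temperature above 1 moves the distribution towards uniform.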

* Generate the output using [model.generate()](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)
    * Limit the generation length to 128 tokens
    * To choose sampling decoding, activate `do_sample=True`, set `top_k=0` and set the temperature e.g. to `temperature=0.6`
* Decode the generated output to a human readable format
* Discuss the quality of output. Relate the possible advantages and disadvantages of the output to the algorithm used.

In [None]:
# TODO

___
Student answers here:

___

## Task 2.4: Top-K Sampling

* Top-k sampling amends the above sampling approach by considering only a restricted set of the most likely tokens per step
* Instead of sampling from the entire vocabulary of possible tokens, top-k sampling limits the selection to the top-k tokens with the highest predicted probabilities
* K is a user-defined parameter and determines the size of the set of candidates
* The model assigns probabilities to each possible token, ranks them based on their predicted probabilities, and selects from the top-k tokens for the next position in the sequence
* This approach combines randomness with a restricted set of candidates
* Tokens within the top-k can still be selected probabilistically, with tokens having higher probabilities being more likely to be chosen
* The decoding continues iteratively until an end token is generated, or the maximum sequence length is reached
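
Hint: The filtering step can be illustrated on a small made-up distribution. The sketch below keeps the `k` most probable tokens, renormalizes them, and then draws a few samples with a fixed seed; all names and numbers are assumptions for illustration.

```python
import random

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]

toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(toy_probs, k=2)

# Sample token indices from the restricted distribution.
rng = random.Random(0)
samples = [rng.choices(range(len(filtered)), weights=filtered)[0]
           for _ in range(20)]
print(filtered, samples)
```

Tokens outside the top-k receive zero probability and can never be sampled, while the relative proportions among the remaining tokens are preserved.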

* Generate the output using [model.generate()](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)
    * Limit the generation length to 128 tokens
    * To choose top-k sampling decoding, activate `do_sample=True` and set `top_k=50`
* Decode the generated output to a human readable format
* Discuss the quality of output. Relate the possible advantages and disadvantages of the output to the algorithm used.

In [None]:
# TODO

___
Student answers here:

___

## Bonus: Contrastive Search

* [Contrastive search](https://arxiv.org/abs/2202.06417) addresses text generation issues by introducing a learnable decoding framework consisting of two components
* Contrastive Training: **Sim**ple **c**ontrastive framework for neural **t**ext **g**eneration (SimCTG)
    * Aims to improve the quality of token representations generated by language models
    * It trains the model to learn discriminative token representations
    * This greatly assists the model to produce more coherent and contextually relevant text
* Contrastive Search:
    * Complements SimCTG by first calculating a confidence score among the top-k candidate tokens represented by the model probabilities
    * Then, a degeneration penalty is introduced
    * It measures how discriminative the top-k candidate tokens are w.r.t. the previous context
    * The cosine similarity is used as a measure
    * The larger the degeneration penalty score is, the more similar the next token is to the previous context, and the more likely it is that the output degenerates
* A hyperparameter $\alpha$ regulates the tradeoff between the model confidence and the degeneration penalty
    * If $\alpha=0$, greedy search is performed
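
Hint: The trade-off controlled by $\alpha$ can be illustrated with two candidate tokens. The sketch below scores each candidate as `(1 - alpha) * probability - alpha * penalty`, where the penalty is the maximum cosine similarity to the previous context states. This scoring rule is written from the description above and the linked paper as we understand it; the two-dimensional hidden states and probabilities are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_pick(candidates, context_states, alpha):
    """Pick the candidate maximizing confidence minus degeneration penalty.

    `candidates` maps a token to (probability, hidden_state).
    """
    def score(item):
        prob, state = item[1]
        penalty = max(cosine(state, h) for h in context_states)
        return (1 - alpha) * prob - alpha * penalty
    return max(candidates.items(), key=score)[0]

context_states = [[1.0, 0.0]]       # hypothetical state of the previous token
candidates = {
    "again": (0.6, [1.0, 0.0]),     # likely, but identical to the context
    "today": (0.4, [0.0, 1.0]),     # less likely, but dissimilar
}

greedy_choice = contrastive_pick(candidates, context_states, alpha=0.0)
contrastive_choice = contrastive_pick(candidates, context_states, alpha=0.6)
print(greedy_choice, contrastive_choice)
```

With $\alpha=0$ the most probable token wins, as in greedy search. With $\alpha=0.6$ the candidate that repeats the context is penalized and the less probable but dissimilar token is chosen.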

* In terms of performance, contrastive search was a big step up over previous algorithms
* However, the algorithm was only recently implemented in the Transformers library (i.e. not available in the 4.19.4 version we used before)
* If you want to try it, install versions >=4.33
    * Depending on the prompt, the sequence could be more relevant and coherent, as well as less repetitive
    * However, due to our very small model size, the answers might still be very repetitive and less coherent, but that is the model's fault in this case
    * See [here](https://github.com/huggingface/transformers/issues/19182#demonstration) for some impressive demonstrations with OPT-6.7b and contrastive search vs. OPT-175b and nucleus sampling (a sampling variant related to top-k sampling)

Bonus: Code is given, simply try it out if you want
* Generate the output using [model.generate()](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate)
    * Limit the generation length to 128 tokens
    * To choose contrastive search, set `penalty_alpha=0.6`, `top_k=4`, and allow `max_new_tokens=128`
* Decode the generated output to a human readable format
* Discuss the quality of output. Relate the possible advantages and disadvantages of the output to the algorithm used.

```
contrastive_out = model.generate(**tokenized, penalty_alpha=0.6, top_k=4, max_new_tokens=128)
contrastive_decoded = tokenizer.decode(contrastive_out[0], skip_special_tokens=True)
print('Contrastive search result:')
print(contrastive_decoded)
```