<a href="https://colab.research.google.com/github/badlogic/genai-workshop/blob/main/05_generative_ai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Large Generative AI Models
Generative AI models come in many shapes and forms. They can [write code](https://github.com/features/copilot) or [chat casually](https://chat.openai.com), [generate images](https://stability.ai/stable-image) based on intricate prompts, [translate your whistled melody](https://deepmind.google/discover/blog/transforming-the-future-of-music-creation/) into a full-blown saxophone track, or create [photorealistic videos](https://openai.com/sora).

What all of these models have in common is that they are trained on massive amounts of (unlabeled) data, sourced largely from the internet.

In addition to massive training sets, these models are also massive in terms of their parameter count, which enables them to learn latent variables and thereby "understand" the real world to some degree.

Here, we want to focus on the family of models concerned with textual data, known as generative **large language models**.

We've already discussed some large language models briefly in the [unsupervised learning](https://colab.research.google.com/drive/10tlC17BRVoX9aPp66orqiI16iUzLx-4p?usp=sharing) section.

Let's dive a little bit more into the details of these models.



## Model(s)
Before large language models, [recurrent neural networks (RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network) were the de facto standard for [sequence to sequence language modelling](https://en.wikipedia.org/wiki/Seq2seq) tasks, where the goal is to learn a mapping from an input sequence to an output sequence, as required for tasks such as translation or summarization.

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Recurrent_neural_network_unfold.svg/2880px-Recurrent_neural_network_unfold.svg.png" width="600px"></center>

RNNs were designed to capture word order and relations between words across the input sequence in order to produce passable translations or summaries. They do this by processing the sequence one "word" (or token) at a time, building up information that is used to process subsequent words in the sequence.

However, RNNs have several limitations:

* **Long sequences**: RNNs are unable to handle long sequences, like long paragraphs from an essay, due to the vanishing/exploding gradient problem. As the RNN processes the sequence, it gradually "forgets" older information. When reaching the end of the sequence, it might have lost context from the beginning, such as the gender of a subject mentioned at the beginning.
* **Hard to train**: RNNs suffer from the [vanishing/exploding gradient problem](https://medium.com/metaor-artificial-intelligence/the-exploding-and-vanishing-gradients-problem-in-time-series-6b87d558d22), which cannot be easily mitigated.
* **Inherently sequential**: RNNs process one part of the sequence at a time, from left to right. This makes it impossible to parallelize training and inference across the positions of a sequence.
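The limitations above can be illustrated with a toy, single-unit RNN. This is a minimal sketch, not a real language model: the weights `w_x` and `w_h` and the function name `rnn_forward` are made up for illustration. It shows that each step depends on the previous one, and that the influence of the initial state on the final state shrinks as the sequence grows.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.5):
    """Run a toy one-unit RNN over a sequence, one step at a time."""
    h = 0.0
    grad_wrt_start = 1.0  # d h_t / d h_0, accumulated via the chain rule
    for x in inputs:
        # Step t needs the hidden state of step t-1, so the loop is sequential.
        h = math.tanh(w_x * x + w_h * h)
        # d tanh(z)/dz = 1 - tanh(z)^2, multiplied by w_h for the recurrence.
        grad_wrt_start *= w_h * (1.0 - h * h)
    return h, grad_wrt_start

h_short, grad_short = rnn_forward([1.0] * 5)
h_long, grad_long = rnn_forward([1.0] * 50)
print(grad_short, grad_long)  # the second value is many orders of magnitude smaller
```

With these (assumed) weights every factor in the product is at most 0.5, so the gradient vanishes quickly. With large weights the same product can instead blow up.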

An RNN variant called [Long short-term memory (LSTM)](https://en.wikipedia.org/wiki/Long_short-term_memory) (invented in Austria) is able to mitigate some of these issues, like the vanishing/exploding gradient problem or the gradual loss of earlier information. But the LSTM architecture is even more complex, and LSTMs are still inherently sequential models.

Despite these issues, LSTMs were used extensively, including in popular products such as [Siri](https://machinelearning.apple.com/research/voice-trigger).

But these models' limitations, especially their inherently sequential training and inference, meant that they could not be scaled up to bigger datasets and thus better models.

In 2017, a paper called [Attention is all you need](https://arxiv.org/abs/1706.03762) changed everything. The paper discusses the transformer model architecture for sequence to sequence language modelling in the context of language translation.

<center><img src="https://marioslab.io/uploads/genai/attention.png" width="480" /></center>

The transformer model architecture did away with all the limitations of RNN based language models. The key innovations of the transformer model are:

* **Positional encodings**: Instead of making the sequence order information a part of the model architecture through sequential processing, transformers encode the sequence order in the data representations they process. This enables parallel processing, where the model has access to the entire information extracted from the sequence at all times.
* **Inherently parallel**: Due to positional encodings, the model can process the entire input sequence at once, in parallel, allowing it to be trained efficiently on GPUs or TPUs. This enables training on training sets that are many orders of magnitude bigger.
* **Bigger, less complex model**: The model is composed of two stacks of relatively simple modules, which, in combination with data parallel training, enables models with orders of magnitude more parameters.
* **Self-attention**: The self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when processing each word (or token). This means that for any given word, the model can focus on the most relevant parts of the input for making predictions or generating output.
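To make the positional encoding idea more concrete, here is a minimal sketch of the sinusoidal scheme used in the original paper, where the encoding is simply added to each token embedding. The function names are made up for illustration, and many later models use learned or rotary position encodings instead.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: one d_model-sized vector per position."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimension pairs (2k, 2k+1) share one frequency.
            angle = pos / (10000 ** ((i // 2) * 2 / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_position(embeddings):
    """Add the position vector to each token embedding, element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # position 0: sines are 0, cosines are 1
```

Because the order information now lives in the data, all positions can be processed at the same time.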

Translating from English to French requires a model to not just blindly translate one word at a time. For example, in French, word order is often flipped compared to English. French is also a gendered language, so it is important for the model to be able to resolve and remember subjects within a sequence, to choose the proper gendered form for the translation.

Self-attention enables the model to capture exactly these important types of dependencies and relationships between words, regardless of their position in the sequence, effectively addressing long-range dependency challenges.
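The weighting described above can be sketched as single-head scaled dot-product self-attention. This is a simplified illustration in plain Python: the projection matrices `w_q`, `w_k`, `w_v` would normally be learned, and here they are set to the identity purely as an assumption to keep the example small.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """x holds one embedding per token; returns new embeddings and attention weights."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one score for every pair of tokens
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three tokens, two dimensions each
out, weights = self_attention(x, identity, identity, identity)
print(weights[2])  # how strongly the third token attends to each token
```

Every token attends to every other token in one step, regardless of distance, which is what addresses the long-range dependency problem.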

This plot found in a 2015 paper called [Neural machine translation by jointly learning to align and translate](https://arxiv.org/pdf/1409.0473.pdf), which was one of the first to introduce attention, illustrates this principle:

<center><img src="https://marioslab.io/uploads/genai/attention-2.png" width="480" /></center>

This plot shows which words the model attends to when translating a word. E.g. when the model outputs the word `zone`, it attends heavily to "Area" and "Economic". These two words are further along in the input sequence.

Transformers go one step further. Instead of just using attention to align words across two sequences, attention is baked into every layer of the model architecture and used to discover latent variables in the unlabeled input data, which allows the model to learn things like part-of-speech, rules of grammar, synonyms, entity resolution, and many other [natural language processing tasks](https://arxiv.org/abs/1905.05950). And on top of these linguistic latent variables, transformers seem to also be able to learn to model higher-level concepts and to some degree facts, such as writing styles, the logic of computer programs, or the birthday of a celebrity.

All of this is a side effect of being able to train larger models on vastly more data than before.

The original transformer architecture as presented in "Attention is all you need" looks like this:

<center><img src="https://marioslab.io/uploads/genai/attention-3.png" width="480" /></center>

The transformer consists of two blocks: the **encoder** (left) and the **decoder** (right).

### Encoder
The encoder processes the input sequence and transforms it into a continuous representation that holds both the individual token information and the contextual relationships between tokens.  It achieves this through a stack of identical encoder modules, each featuring a self-attention layer and feed-forward neural networks.

The input to the encoder block is a list of token ids with a fixed length, like we've seen in the last section. The fixed size of this token ids list defines the **maximum sequence length** the transformer can process. This is also known as the **token window size**.

To input a text into the encoder, it is first tokenized. The result is a list of token ids. If the list has fewer token ids than the token window size, then **padding token ids** are added at the end of the list. This indicates to the model that there are no tokens at those positions in the input sequence. If, on the other hand, the list of token ids is larger than the token window size, then it is **truncated**.
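Padding and truncation can be sketched in a few lines. The function name `fit_to_window` and the padding id `0` are assumptions for illustration. Real tokenizers define their own padding token id.

```python
def fit_to_window(token_ids, window_size, pad_id=0):
    """Pad or truncate a list of token ids to exactly window_size entries."""
    if len(token_ids) >= window_size:
        return token_ids[:window_size]  # truncate tokens beyond the window
    return token_ids + [pad_id] * (window_size - len(token_ids))  # pad at the end

print(fit_to_window([5, 6, 7], window_size=5))            # [5, 6, 7, 0, 0]
print(fit_to_window([5, 6, 7, 8, 9, 10], window_size=4))  # [5, 6, 7, 8]
```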

The token ids list (which is really a vector) is then pushed through an **embedding layer**, which generates a word embedding vector for each token id.

These resulting vectors are then combined with the positional encoding and pushed through the stack of `N` identical encoder modules.

The stack learns the context for each token, with each module enriching the embeddings for each token with additional context and information from the entire sequence. E.g. one encoder module might disambiguate synonyms and encode this information (along with other information) in the output embedding vector for a given token.

The output of the encoder block is a list of embedding vectors, one for each input token, which contain all the information the encoder block could learn from the tokens in the sequence, embedded in a latent space.

### Decoder
Structurally similar to the encoder, the decoder is composed of a stack of identical modules. Each module, however, includes additional components to integrate information from the encoder.

The input to the decoder is a list of token ids generated so far, including a special **start-of-sequence token** id. The list thus is always at least one token id long.

This list is also processed to be of a fixed length, through padding or truncation, to match the maximum sequence length capability of the transformer.

The list of token IDs for the sequence is transformed into word embedding vectors, one for each token in the sequence, similar to the encoder process. These embeddings are then combined with positional encodings to maintain the sequence order information.

The core of the decoder consists of N identical modules. Each module has three main components:

* **Self-Attention Layer**: This layer helps the decoder focus on different parts of the sequence generated so far, enabling it to handle dependencies within it.
* **Cross-Attention Layer**: Following the self-attention layer, the cross-attention layer allows each decoder module to attend to the encoder's output. This mechanism helps the decoder to utilize the context of the input sequence when generating the next token and also lets it "look back" at the entire sequence from the perspective of the encoder.
* **Feed-Forward Neural Network**: Similar to the encoder, this component further processes the information, integrating the insights gathered from self-attention and cross-attention mechanisms.

As the sequence passes through the stack of decoder modules, it becomes increasingly refined with context from both the input (via encoder output) and the partial output sequence generated so far.

This enriched sequence is then passed through a final **linear layer** and a **softmax layer**. The linear layer generates a vector of unnormalized log-probabilities, with one entry for each possible token id. These values are called **logits**. We've seen this in the last section! The softmax layer then transforms these unnormalized log-probabilities into "proper" probabilities by exponentiation and normalization.
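The exponentiation and normalization performed by the softmax layer can be sketched as follows. The logits are made-up numbers for a tiny four-token vocabulary. Subtracting the maximum first does not change the result and is a common trick to avoid numerical overflow.

```python
import math

def softmax(logits):
    """Turn unnormalized log-probabilities (logits) into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exponentiation
    total = sum(exps)
    return [e / total for e in exps]          # normalization

logits = [2.0, 1.0, 0.1, -1.0]  # one entry per token id in the vocabulary
probs = softmax(logits)
print(probs)  # all values are positive and sum to 1
```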

The output of the decoder is a vector of probabilities, one for each token from the tokenizer vocabulary. The next token in the output sequence is then generated by picking the token with the highest probability. We can also use a more sophisticated scheme to let the decoder be a little bit more creative, e.g. by picking a random token from the top-k tokens with the highest probabilities.
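Both selection strategies can be sketched in a few lines. The function names are made up for illustration. As an assumption, the top-k variant here samples among the k most likely tokens in proportion to their probabilities, which is a common choice. Picking uniformly among them would also match the description above.

```python
import random

def pick_greedy(probs):
    """Return the token id with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

def pick_top_k(probs, k, rng=random):
    """Sample a token id from the k most likely tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.1, 0.6, 0.3]
print(pick_greedy(probs))      # always token id 1
print(pick_top_k(probs, k=2))  # token id 1 or 2, varies between runs
```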

## Generating a full output sequence
The process described above only generates a single output token. Each time, we pass in the original input and the output generated so far.

To generate a full output sequence from an input sequence, we repeat this process, each time adding a new output token, until we've reached the maximum sequence length, or the transformer indicates that the sequence has come to a natural end, by assigning the highest probability to a special **end-of-sequence** token.
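The generation loop can be sketched like this. `toy_model` is a hypothetical stand-in for a trained transformer: it simply copies the input and then signals the end of the sequence. The token ids used for the start-of-sequence and end-of-sequence tokens are assumptions as well.

```python
def generate(next_token_probs, input_ids, bos_id, eos_id, max_len):
    """Repeatedly ask the model for the next token until EOS or max_len."""
    output_ids = [bos_id]  # the output always starts with the start-of-sequence token
    while len(output_ids) < max_len:
        probs = next_token_probs(input_ids, output_ids)
        next_id = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        output_ids.append(next_id)
        if next_id == eos_id:
            break  # the model signalled a natural end
    return output_ids

def toy_model(input_ids, output_ids):
    """Stand-in for a transformer: copies the input, then emits EOS (id 1)."""
    vocab_size = 10
    pos = len(output_ids) - 1
    target = input_ids[pos] if pos < len(input_ids) else 1
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

print(generate(toy_model, [5, 7, 3], bos_id=0, eos_id=1, max_len=8))  # [0, 5, 7, 3, 1]
```

Note that the full input and the output generated so far are passed to the model on every iteration.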

The way we pick the next token from the probabilities can make the output vary. For example, choosing randomly from the top-k most likely tokens means we might get different outputs for the same input. This means that LLMs used with such sampling schemes are **not deterministic**.

## Transformer model families
What we've described above is the original transformer model architecture, specifically targeted at solving the sequence to sequence learning task.

Over time, this architecture has been adapted to other learning tasks, resulting in 3 distinct architecture families:

<center><img src="https://raw.githubusercontent.com/Mooler0410/LLMsPracticalGuide/main/imgs/tree.jpg" /></center>

**Encoder-Decoder transformer models** follow the original transformer architecture and work as described above. They perform sequence to sequence learning, on top of which translation, summarization, and other similar tasks can be built.

**Encoder-only transformer models** only use the encoder block from the original transformer architecture and are used for masked in-fill learning, where during training one or more words of an input sequence are masked using a special token id, and the goal is to predict the most probable token for the masked position(s). These models **take into account the tokens to the left and right of a token** over the full sequence. These types of models lend themselves well to downstream natural language processing tasks like sentiment analysis, named entity recognition, text embeddings, and so on. Downstream tasks are usually implemented by adding an additional (classification) layer on top of the original encoder stack. [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) is such a model, and is considered the Swiss Army knife of natural language processing.

**Decoder-only transformer models** only use the decoder block from the original transformer architecture, with minor modifications, such as the removal of the connection to an encoder block. They are used for causal language modelling, also known as next token prediction. These models can **only take into account the preceding tokens** that have been input or generated so far. This model family is fundamental to models designed for text generation, creative writing, and more interactive applications where generating language is the primary objective. Decoder-only models have also proven to be fantastic **in-context learners**, which means they can **learn from the provided input without requiring additional training to complete novel tasks**. To some degree, they can be viewed as **systems that can be programmed via natural language**. [GPT](https://en.wikipedia.org/wiki/Generative_pre-trained_transformer) (which powers ChatGPT) and [LLaMA](https://en.wikipedia.org/wiki/LLaMA) (which powers many applications that do not want to rely on OpenAI's services) are prominent members of this family.
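The restriction to preceding tokens is commonly implemented with a causal attention mask. The sketch below shows the idea under that assumption, with made-up function names: attention scores for "future" positions are set to negative infinity, so they receive a probability of zero after the softmax.

```python
def causal_mask(seq_len):
    """mask[i][j] is True if token i may attend to token j (j is not in the future)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def apply_mask(scores, mask):
    """Replace scores of masked-out positions with -inf before the softmax."""
    return [[s if keep else float("-inf") for s, keep in zip(s_row, m_row)]
            for s_row, m_row in zip(scores, mask)]

for row in causal_mask(3):
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```

An encoder-only model would use no such mask, which is why it can look both left and right.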

For the remainder of this workshop, **we will only discuss decoder-only transformers**, as they are the most versatile and immediately applicable models.

## Training
Decoder-only transformers are trained in multiple stages.


### Pre-training
The first stage is called **pre-training**. The model is trained on dozens of terabytes of unlabeled text data, mostly sourced from the public internet, including sources like Wikipedia, Reddit, Twitter, and so on. During this training phase, the model learns to "understand" and generate natural language as well as artificial languages, like programming languages (if such textual data is part of the training set).

The training objective is to minimize the loss with regard to next token prediction, as described before. This process is **self-supervised**, meaning the training process generates the "labels" from the data itself, without the need for human labeling. The model itself is **autoregressive**: each token is predicted from the tokens that precede it.
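How the "labels" fall out of the data can be sketched as follows. The token ids and predicted probabilities are made up, and cross-entropy is used as the loss, which is the usual choice for next token prediction.

```python
import math

def make_training_pairs(token_ids):
    """Inputs are all tokens but the last; labels are the same tokens shifted by one."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(predicted_probs, labels):
    """Mean negative log-probability the model assigned to the correct next token."""
    losses = [-math.log(probs[label]) for probs, label in zip(predicted_probs, labels)]
    return sum(losses) / len(losses)

inputs, labels = make_training_pairs([4, 8, 15, 16])
print(inputs, labels)  # [4, 8, 15] [8, 15, 16]

# A model that spreads its probability evenly over a 4-token vocabulary:
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
print(cross_entropy(uniform, [0, 1, 2]))  # log(4), roughly 1.386
```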

The simple objective can be viewed as a sort of **bottleneck which forces the model to learn not only about language, but also about the real world**, as some correct predictions rely not only on language understanding, but also on an understanding of real-world concepts and facts. All this knowledge is encoded in the model parameters.

A popular view among practitioners is that the parameters of the model are basically a **lossily compressed knowledge base** of all the concepts the model observed during training, both from a language and a factual knowledge perspective. Being lossy means that **the model cannot reliably reproduce facts**.

The process of pre-training is a huge computational and financial undertaking, generally only carried out by corporations that can spend millions on a single training run. Pre-training of a model like GPT-3 or Llama 2 70B can take weeks to months, depending on the size of the training data as well as the number of model parameters.

Pre-trained models, such as LLaMA or the more recent and equally well-performing [Mistral](https://mistral.ai/) models, can be downloaded and used (with some restrictions) from [Hugging Face](https://huggingface.co/?activityType=update-model&feedType=following). You can run such models as shown in the previous section.

After pre-training, the learned model is not yet very useful on its own. It can be viewed as an "internet document dreaming machine" (credit: [Andrej Karpathy](https://karpathy.ai/)). When given an input such as a question, it will not respond with an answer, but with a continuation based on the most probable next tokens. It essentially assumes the input is the start of an "internet document" and completes it accordingly.

Try Ollama with the LLaMA 7B model for a taste of a pre-trained model:
```
ollama pull llama2:7b-text
ollama run llama2:7b-text
```

<center><img src="https://marioslab.io/uploads/genai/pretrained.png" width="480" /></center>



### Supervised fine-tuning
To turn a pre-trained model into a useful tool for a specific task, a second stage of training called **supervised fine-tuning** (SFT) is performed.

SFT is a form of **transfer learning**: the pre-trained model has learned an understanding of language, real-world concepts, and (some lossy) facts. This knowledge is then transferred to a specific, useful task, such as question answering as part of a conversational exchange, like we see when using ChatGPT. We essentially teach the model to stop being an "internet document dreaming machine" and instead become a helpful chatbot assistant. And instead of learning model parameters from scratch, we **continue to improve the already learned model parameters**.

To do so, a training set needs to be compiled. For the goal of making the pre-trained model behave like a chatbot, we collect many conversational exchanges that can serve as training examples, which follow a format like:

```
<user>
What is the distance between the Earth and the Sun?
<assistant>
The average distance between the Earth and the Sun is about 93 million miles, or approximately 150 million kilometers.
<user>
How long does light take to travel that distance?
<assistant>
Light takes approximately 8.34 minutes to travel from the Sun to the Earth.
```

The training set to construct a helpful assistant from a pre-trained model not only encodes the expected conversational format, but usually also includes a **wide range of tasks** the assistant should be able to accomplish, like question answering, classification, summarization, paraphrasing, creative writing, coding, and so on. This is where the true power of LLMs lies.

A training set for SFT is orders of magnitude smaller than the one used for pre-training, but it must be human-created, or at least human-curated and quality-checked. The required size of this training set may vary, but typically is in the range of a few thousand to a few hundred thousand samples.

Just like during pre-training, the training objective is to minimize the loss with regard to next token prediction probabilities. The training is also self-supervised in the sense that the "labels", that is, the expected next tokens, can be automatically derived from the training data without human intervention.

However, the difference to pre-training is that we specifically select examples of the format we expect the model to follow after SFT is complete. This again serves as a kind of bottleneck, which turns the "internet dreaming machine" into a chatbot. We also only consider the loss over tokens that are part of the expected response from the model, and not the entire sequence. And finally, the expected next token probabilities are often one-hot encoded, meaning the probability for the expected token is `1`, and the probability for all other tokens in the vocabulary is `0`. You can think of this token probability vector as the label for a sample sequence.
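Restricting the loss to the response tokens can be sketched with a simple mask. The probabilities, labels, and mask below are made up for illustration. With a one-hot label, the cross-entropy for a position reduces to the negative log-probability of the expected token, which is what the code computes.

```python
import math

def sft_loss(predicted_probs, labels, is_response):
    """Mean cross-entropy over the positions that belong to the assistant response."""
    losses = [
        -math.log(probs[label])  # cross-entropy against a one-hot label
        for probs, label, keep in zip(predicted_probs, labels, is_response)
        if keep                  # prompt positions do not contribute to the loss
    ]
    return sum(losses) / len(losses)

predicted_probs = [[0.9, 0.1], [0.5, 0.5], [0.25, 0.75]]
labels = [1, 0, 1]                  # expected next token id per position
is_response = [False, True, True]   # the first position is part of the user prompt
print(sft_loss(predicted_probs, labels, is_response))
```

The poor prediction at the first position is ignored, because that token belongs to the user's message and not to the expected response.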

**SFT generally does not add new knowledge**. Instead, SFT **influences the style and format** that the model uses to predict new tokens, and can help the model refine its understanding of how to apply its knowledge in context-specific ways.

#### Full Fine-Tuning
Fine-tuning is often applied to the full set of model parameters, also known as **full fine-tuning**. While this is usually computationally and financially cheaper than pre-training, it still requires the full set of model parameters to be loaded into (GPU) memory. In addition to the model parameters, training also requires data such as gradients, optimizer states, and activation outputs to be stored in (GPU) RAM.

For example, optimizing a 7B parameter model under the assumption that everything is encoded using float16 (a very, very optimistic assumption) requires:

* 7B * 2 bytes = 14GB for the model parameters (assuming float16 encoding)
* 7B * 2 bytes = 14GB for gradients
* 7B * 2 bytes * 2 state variables = 28GB for the optimizer state
* The activation output memory requirements are harder to estimate, as they depend on factors like model architecture, batch size, and maximum sequence length. E.g. for the Mistral model shown earlier, a very rough, very conservative estimate based on the model architecture would be 600k-1M activations per input token.

At a minimum 56GB of GPU memory are required to fully fine-tune a 7B model.  For reference, NVIDIA A100 and H100 GPUs used for deep neural network training have maximum memory in the range of 80-96GB.
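The arithmetic above can be captured in a small helper. This is a back-of-the-envelope sketch with a made-up function name. Like the estimate above, it assumes 2 bytes per value and two optimizer state variables per parameter, and it ignores activation memory entirely.

```python
def full_finetune_memory_gb(num_params, bytes_per_value=2, optimizer_states=2):
    """Rough lower bound of GPU memory for full fine-tuning, without activations."""
    params = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer = num_params * bytes_per_value * optimizer_states
    return (params + gradients + optimizer) / 1e9  # bytes to GB

print(full_finetune_memory_gb(7_000_000_000))   # 56.0
print(full_finetune_memory_gb(70_000_000_000))  # 560.0
```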

Larger and more capable (and production ready) models require linearly more GPU memory. The same is true for the compute time.

#### Parameter-Efficient Fine-Tuning (PEFT)
To reduce memory requirements, techniques like [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html), [gradient accumulation](https://huggingface.co/docs/transformers/v4.18.0/en/performance), and [gradient checkpointing](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) can be used. These all work under the assumption that the full model is loaded into a single GPU/TPU.

If the model does not fit into a single GPU/TPU, techniques like [CPU-offloading as implemented in DeepSpeed](https://huggingface.co/docs/transformers/deepspeed), or [Fully Sharded Data Parallel](https://huggingface.co/docs/transformers/fsdp) can be used.

To reduce computation times, techniques like [PyTorch's Distributed Data Parallel](https://pytorch.org/tutorials/beginner/dist_overview.html) can help.

SMEs and individuals usually lack the compute and financial resources to perform full fine-tuning on models above 13B parameters.

To bridge this gap, various **parameter-efficient fine-tuning** techniques have been developed. These techniques allow fine-tuning of large models on modest hardware by employing various tricks which reduce memory and compute requirements.

**Layer Freezing**: Instead of updating all parameters during fine-tuning, only a subset of layers (typically the last few layers) is updated. This reduces the memory required for gradients and optimizer states significantly. Layer freezing during training can be achieved with 2 lines of code per layer in [PyTorch](https://discuss.huggingface.co/t/how-to-freeze-layers-using-trainer/4702/8).

**Prompt Tuning**: Instead of updating all parameters during fine-tuning, all model parameters are frozen, and a small set of trainable embedding vectors (a so-called soft prompt made of virtual tokens) is introduced next to the input embedding layer of the decoder. These trainable vectors are prepended to the embedding vectors that the regular embedding layer generates for the input tokens. The training is guided by a training set with examples of inputs and expected outputs. The learned soft prompt then serves the same function as a manually crafted textual prompt for the task at hand. This is not only a PEFT method, but also an automated, supervised prompt engineering method, which can potentially yield better results than a manually crafted prompt. However, its effectiveness may vary considerably across tasks. Hugging Face has a pretty complete [guide on prompt tuning](https://colab.research.google.com/github/huggingface/notebooks/blob/main/peft_docs/en/pytorch/clm-prompt-tuning.ipynb#scrollTo=yRX65MaJ4FaL). Prompt tuning belongs to the [Soft prompts](https://huggingface.co/docs/peft/main/en/conceptual_guides/prompting) fine-tuning category.

**Adapter Modules**: Adapters are small neural network modules inserted between the layers of a pre-trained model. Only these adapters are trained, keeping the original model parameters frozen. This approach drastically reduces the number of trainable parameters.

<center><img src="https://miro.medium.com/v2/resize:fit:523/1*F7uWJePoMc6Qc1O2WxmQqQ.png" width="480" /></center>

**Low-Rank Adaptation (LoRA)**: A popular adapter-style PEFT method. Pre-trained model parameters across all layers are frozen. Some layers are wrapped with a LoRA adapter. The adapter holds two low-rank matrices which, when multiplied together, yield a matrix of the same shape as the frozen parameter matrix of the wrapped layer. Only the parameters in the low-rank matrices are trained. To combine the two, the product of the low-rank matrices is multiplied by a scaling factor and added to the frozen parameter matrix. The scaling factor specifies how strongly the low-rank parameters influence the result on top of the frozen pre-trained parameters.
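The combination of frozen and low-rank parameters can be sketched in plain Python. The names `a`, `b`, and `alpha` follow the LoRA paper, where the update is scaled by `alpha / r` with `r` being the rank. The tiny matrices are made up for illustration.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Frozen weight w plus the scaled product of the low-rank matrices b and a."""
    r = len(a)            # rank: a has shape (r, d_in), b has shape (d_out, r)
    delta = matmul(b, a)  # same shape as w: (d_out, d_in)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_counts(d_out, d_in, r):
    """Number of frozen parameters vs. trainable LoRA parameters for one layer."""
    return d_out * d_in, r * (d_out + d_in)

w = [[1.0, 2.0], [3.0, 4.0]]  # frozen pre-trained weights
a = [[0.5, 0.5]]              # trainable, rank 1
b = [[1.0], [0.0]]            # trainable, rank 1
print(lora_effective_weight(w, a, b, alpha=2))  # [[2.0, 3.0], [3.0, 4.0]]
print(lora_param_counts(4096, 4096, r=8))       # (16777216, 65536)
```

For a 4096 x 4096 layer and rank 8, the adapter trains well under 1% of the parameters of the wrapped layer.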

None of the methods above help when the pre-trained model does not fit into GPU RAM. However, quantization can be applied to the frozen pre-trained parameters, while trainable PEFT parameters, their optimizer states, and activations are kept at higher precision. This way, even 70B parameter models can be fine-tuned on a single compute node.

PEFT techniques offer a practical solution for leveraging large pre-trained models with limited computational resources. They enable SMEs and individuals to adapt these models to specific tasks without the need for extensive compute power or memory.

The [Hugging Face PEFT documentation](https://huggingface.co/docs/peft/main/en/index) goes into much more detail and includes techniques not mentioned here for brevity. It also includes notebooks demonstrating how to apply PEFT methods for various downstream tasks and model architectures.



### Alignment
In the literature (especially in press releases) one often finds the word **alignment** with respect to large language models. This term is a bit loaded but generally refers to ensuring that the model's outputs are in line with human values, ethics, and intentions. It aims to reduce the risk of the model generating harmful, biased, or undesirable content.

Note that supervised fine-tuning is usually not concerned with alignment in the sense described above, but with turning a pre-trained model into a useful tool, such as a classifier, or a helpful chatbot assistant. While the training data for SFT will inevitably (and hopefully) itself be aligned with human values and ethics, the focus is on the output format, not on creating benevolent, unbiased, non-racist machines.

As such, alignment is often an additional training step on top of SFT.

One such alignment technique is reinforcement learning from human feedback (RLHF), which was pioneered by OpenAI.

This approach involves several steps to iteratively refine the model's outputs:

* **Pre-training**: The model is initially pre-trained on a diverse dataset to learn a broad understanding of language and context.
* **Supervised fine-tuning**: The pre-trained model is turned into a useful tool, like a conversational assistant capable of completing a wide variety of tasks.
* **Reward Modeling**: Human annotators evaluate the outputs of the model based on certain criteria (such as coherence, relevance, safety, and alignment with ethical standards). These evaluations are used to train a reward model that can predict the human judgment of any given output.
* **Proximal Policy Optimization (PPO)**: The model is further fine-tuned using reinforcement learning, specifically PPO, where the reward model serves as a proxy for human feedback. The model generates outputs, the reward model evaluates these outputs, and the initial model is updated to maximize the predicted rewards. Think of reinforcement learning as a different kind of loss function to update model parameters with.
* **Iterative Refinement**: This process is repeated iteratively, with the reward model being updated based on additional human evaluations as needed. This cycle helps the model to increasingly align its outputs with human values and expectations.

RLHF allows for the fine-tuning of models in a way that directly incorporates human judgment into the training process, making it a powerful tool for aligning AI systems with human values. However, it's also resource-intensive, requiring significant human labor for feedback and evaluation, and there are ongoing discussions about its scalability and the representativeness of the feedback.

## Evaluation
Evaluating large language models (LLMs), especially causal LLMs, involves assessing their performance and alignment across various stages of development: pre-training, fine-tuning, and post-alignment fine-tuning, such as with reinforcement learning from human feedback (RLHF). Each stage presents unique challenges and objectives for evaluation.

### Evaluation of Pre-trained Models
The initial evaluation of pre-trained models typically focuses on their ability to predict the next token in a sequence, measured by metrics like perplexity or cross-entropy loss. These metrics provide a quantitative measure of how well the model has learned the structure and content of the language during pre-training. However, evaluating pre-trained models can be challenging due to their non-deterministic nature. The vast parameter space and the stochastic aspects of training mean that two models trained on the same data may produce slightly different outputs. Additionally, pre-training evaluation might not fully capture a model's potential for downstream tasks, as it primarily assesses language understanding and generation in a general context without task-specific optimization.
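Perplexity is the exponential of the mean cross-entropy loss. A minimal sketch, using made-up probabilities that a model assigned to the correct next token at each position:

```python
import math

def perplexity(correct_token_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the correct token a probability of 1/4 is
# as "perplexed" as if it had to choose among 4 equally likely tokens.
print(perplexity([0.25] * 8))       # approximately 4
print(perplexity([0.9, 0.8, 0.95])) # close to 1, i.e. a confident model
```

Lower is better: a perplexity of 1 would mean the model predicts every next token with certainty.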

### Evaluation of Fine-tuned Models
After fine-tuning, LLMs are evaluated based on their performance on specific tasks, such as text classification, summarization, question answering, or more complex, multi-task capabilities like acting as conversational agents. The choice of evaluation metrics here depends on the task:

**Task-Specific Evaluations**: For classification, metrics like accuracy, F1 score, or area under the [ROC curve (AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) are commonly used. For summarization, automated metrics such as [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) or [BLEU (Bilingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) can assess the quality of generated summaries against reference summaries. However, these metrics have limitations and may not fully capture the nuances of human language evaluation.

**General Evaluations for Multi-task Models**: For models fine-tuned to perform as helpful assistants capable of completing multiple tasks, evaluation becomes more complex. Benchmarks such as [BIG-bench (Beyond the Imitation Game benchmark)](https://github.com/google/BIG-bench?tab=readme-ov-file) and [HumanEval](https://github.com/openai/human-eval) offer a diverse set of tasks designed to probe models' capabilities across various domains and types of reasoning. Additionally, human evaluation plays a crucial role in assessing the model's effectiveness, coherence, and relevance of responses in more open-ended or conversational contexts. More recently, smaller LLMs have also been evaluated by larger, more capable LLMs. Based on these benchmarks, the community has established leaderboards to compare closed and open LLMs. A prominent example is the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), which uses the [Eleuther Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to benchmark LLMs against a large number of evaluation tasks. [Chatbot Arena](https://lmsys.org/blog/2023-05-03-arena/) is an interesting take on evaluating LLMs: it presents humans with the outputs of two anonymized LLMs for the same prompt and lets them select a winner. These choices are then used to calculate an Elo rating, similar to chess, for each benchmarked LLM.
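To give an idea of how pairwise human votes can be turned into a ranking, here is a sketch of the classic Elo update as used in chess. The starting ratings and the factor `k=32` are assumptions for illustration, and the exact computation used by any particular leaderboard may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B according to the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_wins, k=32):
    """Return the new ratings of A and B after a single comparison."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (s_a - e_a)
    new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start with equal ratings; a human prefers model A's answer.
print(update_elo(1000, 1000, a_wins=True))  # (1016.0, 984.0)
```

Beating a higher-rated model yields a larger rating gain than beating a lower-rated one, so the ratings converge towards a consistent ranking over many votes.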

### Evaluation of RLHF Fine-tuned Models
Models fine-tuned using reinforcement learning from human feedback (RLHF) are further evaluated for their alignment with human values, including aspects like bias, toxicity, and the ability to produce safe and ethical outputs. This evaluation often involves both automated metrics and extensive human judgment:

**Automated Metrics**: Tools and frameworks for measuring bias or toxicity in model outputs can provide initial indicators of potential issues. These might include proprietary or open-source toxicity filters or bias detection algorithms.

**Human Evaluation**: Ultimately, assessing the nuances of bias, toxicity, and ethical alignment requires human judgment. This involves setting up evaluation frameworks where human raters review model outputs against specific guidelines designed to capture a wide range of ethical, moral, and social norms. The complexity of these evaluations reflects the multifaceted nature of language and communication, requiring careful consideration of context, cultural differences, and the potential for harm.

In all stages, the evaluation of LLMs is an iterative process, involving both quantitative metrics and qualitative assessments. As models advance, so too do the methods for evaluating them, highlighting the ongoing need for robust, transparent, and ethical evaluation frameworks to ensure that LLMs serve the public good.