# Lecture 23 - Large Language Models

# **Under construction: Work in Progress**

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Natural_Language_Processing/Lecture_19-NLP.ipynb)
[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2023-Python-Programming-for-Data-Science/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Natural_Language_Processing/Lecture_19-NLP.ipynb)

<a id='top'></a>

- [23.1 Introduction to LLMs](#23.1-introduciton-to-llms)
- [23.2 Creating LLMs](#23.2-creating-llms)
- [Appendix](#appendix)  
- [References](#references)

## 23.1 Introduction to LLMs <a name='23.1-introduciton-to-llms'></a>

Large Language Models (LLMs) are designed to understand and generate natural human language, and have achieved state-of-the-art performance across various NLP tasks.

LLMs are a result of many years of research and advancement in NLP and Machine Learning. Important phases in the development include:

- Statistical language models (1980s-2000s): developed to predict the probability of a word in a sequence based on the preceding words. Examples include N-gram models. These models were widely used in tasks like speech recognition and machine translation but struggled to capture long-range dependencies in text.
- Neural network models (2003-2017): fully-connected NNs and recurrent NNs (RNNs) emerged as an alternative to statistical models. Long Short-Term Memory (LSTM) models are a type of RNN that was used in sequence-to-sequence models for tasks like machine translation, and they formed the basis for several early LLMs.
- Transformer models (2017-2022): the transformer architecture replaced the recurrent layers in RNNs with self-attention mechanisms. This breakthrough enabled the development of more powerful and efficient LLMs, laying the foundation for models such as GPT and BERT.
- ChatGPT ... (2022-present)


### 23.1.1 Architecture of Large Language Models

The architecture of LLMs is based on Transformer Networks, which rely on the self-attention mechanism to process and generate text sequences. The main components of the Transformer Network architecture include:

- **Input embeddings**: fixed-size continuous vectors that represent the tokens in the input text.
- **Positional encodings**: fixed-size continuous vectors that are added to the input embeddings to provide information about the relative positions of tokens in the input text sequence.
- **Encoder**: composed of a stack of multi-head attention modules and fully-connected (feed-forward) modules. The encoder block also includes dropout layers, residual connections, and layer normalization.
- **Decoder**: composed of a stack of multi-head self-attention modules and fully-connected (feed-forward) modules, with an additional masked multi-head attention module.
- **Output fully-connected layer**: the output of the decoder is passed through a fully-connected (dense, linear) layer to produce the next token in the text sequence.
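
As a small illustration of the first two components in the list above, the following sketch builds toy input embeddings and adds positional encodings to them. It assumes the sinusoidal encoding from the original Transformer formulation, which is only one option (other models learn the position vectors instead). The vocabulary, the embedding size `d_model`, and the randomly initialized embedding table are hypothetical values chosen for the example.

```python
import math
import random

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd dimensions."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

random.seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2}   # hypothetical toy vocabulary
d_model = 8                              # hypothetical embedding size

# Random embedding table (in a real model these vectors are learned)
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

tokens = ["the", "cat", "sat"]
embeddings = [embedding_table[vocab[t]] for t in tokens]
pe = positional_encoding(len(tokens), d_model)

# Element-wise sum of token embeddings and positional encodings
model_input = [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, pe)]
print(len(model_input), len(model_input[0]))
```

Because the encodings are added element-wise, the same token receives a different input vector at each position in the sequence.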

<img src='https://raw.githubusercontent.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-LLMs/images/transformer.jpg' width=450px/>

*Figure: Transformer Network architecture.* Source: [2].

The **self-attention mechanism** is a key component of the Transformer Network architecture that enables the model to weigh the importance of each token with respect to the other tokens in a sequence. It allows the model to capture long-range dependencies and relationships between the tokens, and helps it understand the context and structure of the input sequence.
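
The weighting described above can be sketched with scaled dot-product attention. This is a simplified illustration rather than a full implementation: it assumes a single attention head and uses the token vectors directly as queries, keys, and values, whereas a real Transformer first applies learned linear projections. The input `X` is a hypothetical set of three 2-dimensional token vectors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d = len(X[0])
    weights, outputs = [], []
    for q in X:
        # Similarity of this token to every token, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors
        outputs.append([sum(w_j * v[c] for w_j, v in zip(w, X)) for c in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical token vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how much one token attends to every token in the sequence, including itself.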

### 23.1.2 Variants of Transformer Network Architectures

Various LLMs are built on top of the Transformer architecture with slight modifications or adaptations. Some popular variants include:

- Decoder-only models: GPT (Generative Pre-trained Transformer) models are autoregressive models that utilize only the decoder part of the Transformer architecture to generate text.
- Encoder-only models: BERT is based on the encoder part of the Transformer architecture and is pre-trained using masked language modeling and next sentence prediction tasks.
- Encoder-Decoder models: T5 (Text-to-Text Transfer Transformer) adapts the original Transformer architecture with encoder and decoder sub-networks, enabling it to be used for various NLP tasks with minimal task-specific modifications.
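
One practical difference between these variants is which tokens each position is allowed to attend to. The sketch below contrasts a bidirectional mask, as used in encoder-style models, with a causal mask, as used in autoregressive decoder-style models. The function name `attention_mask` and the sequence length are hypothetical choices for illustration.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is 1 if position i may attend to position j, else 0."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

# Encoder-style: every token sees the whole sequence
encoder_mask = attention_mask(4, causal=False)

# Decoder-style: each token sees only itself and earlier tokens
decoder_mask = attention_mask(4, causal=True)

for row in decoder_mask:
    print(row)
```

The causal mask is what makes next-token prediction meaningful: the model cannot look ahead at the token it is asked to predict.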

## 23.2 Creating LLMs <a name='23.2-creating-llms'></a>

Creating modern LLMs such as ChatGPT or Llama 2 typically involves three phases:

1. **Pretraining**, the model extracts knowledge from large unlabeled text datasets,
2. **Supervised finetuning**, the model is refined to improve the quality of generated responses,
3. **Alignment**, the model is further refined to generate safe and helpful responses that are aligned with human preferences.

### 23.2.1 Pretraining

The first step in creating LLMs is **pretraining** the model on a large corpus of publicly available text data. This usually includes a large collection of web pages or e-books comprising billions or trillions of tokens, and ranging from gigabytes to terabytes of text. During pretraining, the model learns the structure of the language, grammar rules, facts about the world, and reasoning abilities. It also learns biases and harmful content present in the training data.

Pretraining is done using unsupervised learning techniques, and two common pretraining tasks are:

- **Causal Language Modeling**, also known as autoregressive language modeling, involves training the model to predict the next token in the sequence given the previous tokens. This approach is used for pretraining ChatGPT and Llama 2, and it is the more common choice for modern LLMs.
- **Masked Language Modeling**, where a certain percentage of input tokens are randomly masked, and the model is trained to predict the masked tokens based on the surrounding context. BERT and other LLMs are pre-trained with masked language modeling.
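
To make the masked language modeling task more concrete, the following sketch randomly replaces tokens with a `[MASK]` placeholder and keeps the original tokens as prediction targets. It is a simplification under stated assumptions: the function name `mask_tokens`, the whitespace tokenization, and the masking probability used in the demo are hypothetical, and schemes such as the one in BERT apply additional replacement rules that are omitted here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly mask tokens; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)       # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)      # position does not contribute to the loss
    return inputs, labels

tokens = "the model is trained to predict the masked tokens".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3)
print(inputs)
print(labels)
```

Only the masked positions carry a label, so the model is trained to fill in the gaps using the context on both sides.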

The following figure depicts the pretraining phase using causal language modeling, where the model learns to predict the next word in a sentence given the previous words.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-LLMs/images/pretraining.jpg' width=450px/>

*Figure: Pretraining LLMs.* Source: [3].

Pretraining makes it possible to extract knowledge from very large unlabeled datasets without the need for manual labeling. Or, to be more precise, the "label" in model pretraining is the next word in the text, to which we already have access since it is part of the training text. Such a pretraining approach is also called self-supervised training, since the model uses each next word in the text to self-supervise the training.
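
The idea that the labels come for free from the text itself can be illustrated with a few lines of code. The sketch below splits a sentence into (context, next word) training pairs. It assumes simple whitespace tokenization for readability, whereas actual LLMs operate on subword tokens, and the helper name `next_token_pairs` is hypothetical.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: each target is the token that follows its context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the model learns to predict the next word".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 training examples, none of which required a human annotator.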

Note that pretraining LLMs from scratch is computationally expensive and time-consuming. As we stated before, the pretraining phase can cost millions of dollars (e.g., the estimated cost for GPT-4 is $100 million).

### 23.2.2 Supervised Finetuning

After the pretraining phase, the model is finetuned on a much smaller dataset, which is carefully generated with human supervision. This dataset consists of samples where AI trainers provide both user queries (instructions) and model responses (outputs), as depicted in the following figure. That is, the instruction is the input given to the model, and the output is the desired response by the model. The model takes the instruction text as input (e.g., "Write a limerick about a pelican") and uses next-token prediction to generate the output text ("There once was a pelican so fine ...").

The finetuning process involves updating the model's weights using supervised learning techniques. To compile datasets for supervised finetuning, AI trainers need to write the desired instructions and responses, which is a laborious process. Typical datasets include between 1K and 100K instruction-output pairs. Based on the provided instruction-output pairs, the model learns to generate responses that are similar to those provided by AI trainers.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-LLMs/images/finetuning.jpg' width=500px/>

*Figure: Finetuning a pretrained LLM.* Source: [3].
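
One possible way to turn an instruction-output pair into a training sequence is sketched below. The prompt template (`Instruction: ... Response:`), the whitespace tokenization, and the loss mask that restricts training to the response tokens are assumptions made for illustration. They are not specified in this lecture, and actual finetuning pipelines differ in these details.

```python
def build_sft_example(instruction, output):
    """Concatenate instruction and output; mark which tokens contribute to the loss."""
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    output_tokens = output.split()
    tokens = prompt_tokens + output_tokens
    # 0 = instruction token (input only), 1 = response token (prediction target)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(output_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Write a limerick about a pelican",
    "There once was a pelican so fine",
)
for tok, m in zip(tokens, loss_mask):
    print(m, tok)
```

The training objective is still next-token prediction, as in pretraining. The difference lies in the data, which now consists of curated instruction-output pairs.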

### 23.2.3 Alignment with Reinforcement Learning from Human Feedback (RLHF)

To further improve the performance and align the model responses with human preferences, LLMs typically use **Reinforcement Learning from Human Feedback (RLHF)**. This process is depicted in the figure below and involves the following steps:

1. Collect human feedback. For this step, a new dataset is created by collecting sample prompts (from a database, for example) or by creating a set of new prompts. For each prompt, multiple responses are generated by the finetuned model. Next, AI trainers are asked to rank all responses generated by the model for the same prompt by quality, from best to worst. Such feedback will be used to define the human preferences and expectations about the responses by the model. Although this ranking process is time-consuming, it is usually less labor-intensive than creating the dataset for supervised finetuning, since ranking the responses is faster than writing the responses.
2. Create a reward model. The collected data with human feedback, containing the prompts and the rank scores of the different responses, is used to train a reward model (denoted with RM in the figure). The task for the reward model is to predict the quality of the different responses to a given prompt and output a ranking score. The ranking scores provided by AI trainers are used to establish a ground truth for training the reward model. Note that the reward model is a separate model that only needs to rank the generated responses.
3. Finetune the LLM with RL. The LLM is finetuned using a Reinforcement Learning algorithm, typically the Proximal Policy Optimization (PPO) algorithm. For a new prompt sampled from the dataset, the LLM generates a response, which is evaluated by the reward model to calculate a reward score. Next, the PPO algorithm uses the reward score to finetune the LLM so that the total rewards for the responses generated by the LLM are maximized. That is, the goal is for the LLM to generate responses that maximize the predicted rewards and, by that, are aligned with human preferences and are more useful to human users.
4. Iterative improvement. The RLHF process is performed iteratively, with multiple rounds of collecting additional feedback, re-training the reward model, and applying Reinforcement Learning. This allows for continuous improvements in the model's performance.

<img src='https://raw.githubusercontent.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_23-LLMs/images/RLHF.jpg' width=600px/>

*Figure: Reinforcement Learning from Human Feedback.* Source: [4].

This approach creates a reward system that is augmented by human feedback and is used to teach LLMs which responses are more aligned with human preferences. Through these iterations, LLMs can learn to better align with human values and preferences, which can lead to higher-quality outputs and improved performance on specific tasks. RLHF has been successfully applied to finetune models like OpenAI's ChatGPT. There are several variants of RLHF systems for finetuning LLMs. For example, Llama 2 employs two reward models, one based on the ranks of helpfulness of the responses, and another based on the ranks of safety of the responses. The final reward score is obtained as a combination of the helpfulness and safety scores.
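
The reward model training in step 2 above can be sketched as follows. A best-to-worst ranking is converted into (preferred, rejected) pairs, and a pairwise loss in the style of the one described in [4] penalizes the reward model whenever it scores the rejected response above the preferred one. The response names and reward scores are hypothetical placeholders, and a real reward model would be a neural network rather than a lookup table.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)): small when the preferred response scores higher."""
    diff = reward_preferred - reward_rejected
    return math.log(1.0 + math.exp(-diff))

# Ranking provided by AI trainers, from best to worst
ranked = ["response A", "response B", "response C"]

# Hypothetical scores currently predicted by the reward model
rewards = {"response A": 1.2, "response B": 0.3, "response C": 0.9}

pairs = ranking_to_pairs(ranked)
losses = [pairwise_loss(rewards[p], rewards[r]) for p, r in pairs]
mean_loss = sum(losses) / len(losses)

for (p, r), loss in zip(pairs, losses):
    print(f"{p} > {r}: loss = {loss:.3f}")
print(f"mean loss = {mean_loss:.3f}")
```

In this toy example the pair (response B, response C) produces the largest loss, because the hypothetical reward model scores response C higher even though the trainers ranked response B above it.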

### Limitations and Ethical Considerations of LLMs

Although LLMs have demonstrated impressive performance across a wide range of tasks, there are several limitations and ethical considerations that raise concerns.

Limitations:

- *Computational resources*: Training LLMs requires significant computational resources, making it difficult for researchers with limited access to GPUs or specialized hardware to develop and fine-tune these models.
- *Data bias*: LLMs are trained on vast amounts of data from the internet, which often contain biases present in the real world. As a result, the models may unintentionally learn and reproduce biases in their predictions and generated text.
- *Lack of understanding*: LLMs may not truly "understand" language in the way humans do. They are often sensitive to small perturbations in the inputs and can generate plausible-sounding but nonsensical text.
- *Inability to explain*: LLMs are inherently black-box models, making it challenging to explain their reasoning or decision-making processes, which is essential in certain applications like healthcare, finance, and legal domains.

Ethical Considerations:

- *Privacy concerns*: LLMs can inadvertently memorize information from their training data, potentially revealing sensitive information or violating user privacy.
- *Misinformation and manipulation*: LLMs can generate coherent and contextually relevant text, which can be exploited to create disinformation, fake news, or deepfake content that manipulates public opinion and undermines trust.
- *Accessibility and fairness*: The computational resources and expertise required to train LLMs may lead to an unequal distribution of benefits, with only a few organizations having the resources to develop and control these powerful models.
- *Environmental impact*: The large-scale training of LLMs consumes a significant amount of energy, contributing to the carbon footprint and raising concerns about the environmental sustainability of these models.

It is important to encourage transparency, collaboration, and responsible AI practices to ensure that LLMs benefit all members of society without causing harm.

### Prompt Engineering

### Interacting with LLMs

When interacting with LLMs, it's essential to consider the following aspects:

- Temperature: Temperature is a parameter that controls the randomness of the model's output. Higher values (e.g., 0.8) result in more diverse responses, while lower values (e.g., 0.2) produce more focused and deterministic responses.
- Max tokens: Setting a limit on the number of tokens (words or word pieces) in the generated response helps control verbosity and ensures that the model stays on topic.
- Iterative refinement: If the model's initial response is unsatisfactory, you can iteratively refine the prompt by incorporating feedback, adding context, or rephrasing the question.
- Prompt chaining: You can use the model's previous responses as context for subsequent prompts, allowing for more coherent and contextually relevant multi-turn interactions.
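
The effect of the temperature parameter can be seen by applying it to a set of next-token scores before sampling. In the sketch below, the vocabulary and the logits are hypothetical, and the logits are kept fixed for simplicity, whereas a real model recomputes them after every generated token.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["cat", "dog", "bird"]    # hypothetical toy vocabulary
logits = [2.0, 1.0, 0.5]          # hypothetical next-token scores

low = softmax_with_temperature(logits, 0.2)    # sharper distribution
high = softmax_with_temperature(logits, 0.8)   # flatter distribution
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])

# Sample at most max_tokens tokens from the higher-temperature distribution
rng = random.Random(0)
max_tokens = 5
sample = rng.choices(vocab, weights=high, k=max_tokens)
print(sample)
```

With the lower temperature, almost all probability mass is placed on the highest-scoring token, which is why the output becomes more deterministic. The higher temperature leaves more probability for the alternatives.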

## References

1. Introduction to Large Language Models, by Bernhard Mayrhofer, available at [https://github.com/datainsightat/introduction_llm](https://github.com/datainsightat/introduction_llm).
2. Understanding Encoder and Decoder LLMs, by Sebastian Raschka, available at [https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder](https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder).
3. LLM Training: RLHF and Its Alternatives, by Sebastian Raschka, available at [https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives](https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives).
4. Training Language Models to Follow Instructions with Human Feedback, by Long Ouyang et al., available at [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155).