# Presentation of Llama2
In this notebook, I present notes and code showing how to use and fine-tune the Llama2 model. It draws on various resources, ranging from the Llama2 documentation to YouTube videos and tutorials, all of which are linked and credited.

## What are Transformers?

https://www.youtube.com/watch?v=ec9IQMiJBhs&t=494s

- Transformers work with any kind of data
- We represent the data as a sequence of vectors



![image.png](images/attention4.png)

### How are they used in LLMs?

- Text needs to be represented as vectors
- So we have tokenization
    - Take a sequence of words and break it down into numerical tokens
 
### What is tokenization?

This refers to the following lines of code here.

- `from transformers import AutoTokenizer`
- `tokenizer = AutoTokenizer.from_pretrained("model")`

![image.png](images/tokenization.png)

Tokenization takes place before the training steps
- It is an independent pre-processing step

![image.png](images/tokenization2.png)

- Tokenization splits text into tokens and maps each one to an integer ID. The IDs are then turned into vectors, typically through word embeddings, where similar words sit closer together, so the distance between two vectors represents how similar the two words are.

- Word embedding is a simple concept at its core: it is learned from how frequently different words appear in the same context across different texts

Example (notice how the word "works" appears in each of the following phrases):

- Math works every time
- A computer works amazing
- Transformer works great
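
As a rough illustration of the "same context" idea, the sketch below counts which words share a phrase in the three examples above. It is only a toy: the helper name `neighbours` is made up for this example, and real embeddings are learned from far larger corpora rather than from raw counts like these.

```python
from collections import Counter
from itertools import combinations

# The three example phrases from the notes, lower-cased.
phrases = [
    "math works every time",
    "a computer works amazing",
    "transformer works great",
]

# Count how often two words appear in the same phrase (a crude context signal).
cooccurrence = Counter()
for phrase in phrases:
    words = sorted(set(phrase.split()))
    for left, right in combinations(words, 2):
        cooccurrence[(left, right)] += 1


def neighbours(word):
    """Return the set of words that share at least one phrase with `word`."""
    found = set()
    for left, right in cooccurrence:
        if left == word:
            found.add(right)
        elif right == word:
            found.add(left)
    return found


print(sorted(neighbours("works")))
print(sorted(neighbours("math")))
```

Because "works" shares a phrase with every other word, it ends up connected to all of them, while "math" is only connected to the words of its own phrase.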


### What are the steps in a transformer layer?

The transformer consists of multiple transformer layers. Each transformer layer contains 2 items
1. **The Feed forward neural network**
    - 1 GeLU activation layer to double the dimension
    - 1 GeLU activation layer to scale the dimension back down

But in every transformer layer, each word also needs some way to communicate with the other words. This is why we have the self-attention layer.

2. **Attention Layer**
    - Allows information to flow between a word and its neighbors in the sequence
    - Calculates how much weight each neighbor's representation gets when we compute the new representation of a word

**Note:** Self-attention refers to calculating the importance of the elements of a sequence with respect to the same sequence.
Attention is broader and refers to calculating the importance of a sequence with respect to any sequence, including a different one.

![image.png](images/transformers.png)

### The Math Behind Attention Layers

It is a lot of linear algebra, which is depicted in a series of pictures below.
Please note that attention scales quadratically with the sequence length.

- We have $W^Q$ which represents the query matrix
    - randomly initialized at the start of training
    - gradient descent adapts its values during backpropagation to reduce loss on training data
    - the same $W^Q$ is applied to every input vector
- We have $W^K$ which represents the key matrix
    - produces the keys that each query is compared against
- We also have $W^V$ which represents the value matrix
    - produces the values that are combined into the output using the attention weights

**Usage in Transformers:** Three areas that we can use Q,K,V vectors

1. **Encoder Self Attention**
    - Q = K = V = Our source sentence
2. **Decoder Self Attention**
    - Q = K = V = Our target sentence
3. **Decoder-Encoder attention**
    - Q = Our target sentence
    - K = V = Our source sentence 

![image.png](images/attention.png)

**Suppose we are calculating the new representation of <u>works</u> in the image above**

1. Compute the scalar product of the query vector for the word of interest with the key vector of every word
2. Divide each scalar product by the square root of the key dimension
3. Take the softmax of all of these values
4. Compute the weighted sum of the value vectors, using the softmax outputs as weights
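
The four steps can be sketched for a single query in plain Python. The vectors below are made-up toy values, and the function name `attend` is only illustrative; real implementations work on whole matrices with a tensor library.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    dim = len(query)
    # Step 1: scalar product of the query with every key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: divide by the square root of the dimension.
    scaled = [s / math.sqrt(dim) for s in scores]
    # Step 3: softmax over the scaled scores.
    weights = softmax(scaled)
    # Step 4: weighted sum of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output


# Toy 2-dimensional keys and values for a 3-word sequence (illustrative numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 1.0]

weights, output = attend(query, keys, values)
print(weights, output)
```

The third key is the most similar to the query, so it receives the largest weight in the weighted sum.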

![image.png](images/attention2.png)

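Multi-head attention is used because a single attention head is not enough to capture the different kinds of relationships between words. Running several heads in parallel gives us the following.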

![image.png](images/attention3.png)

### Position Embeddings
We also have position embeddings, which are independent of the transformer layers and encode the position of a particular word in the sequence.

### What about images?

Images work similarly, but we will not go into depth on them since they are not our concern here.


# Checklist of Items 
- [x] What is Llama2?
    - [x] Llama2 Tokenizer sentencepiece (https://github.com/google/sentencepiece)
- [ ] Llama2 Documentation and Resources (https://huggingface.co/docs/transformers/main/model_doc/llama2)
    - [x] Text Generation
    - [ ] Text Classification
    - [x] Optimization
    - [x] Inference
    - [ ] Deploy
- [x] Prompt Engineering for LMs
    - [x] Zero-shot prompting
    - [x] Few-shot prompting
    - [x] Chain of thought prompting
    - [x] Text Prompting
        - [x] Continuing responses from chat history


## 1) What is Llama2?

Taken from: https://www.youtube.com/watch?v=yZ9jkgN2xHQ

This section outlines a brief overview of what Llama2 is and how it relates to other LLMs.


### Brief Overview
An outline of exactly what Llama2 is and a basic look at its architecture

**A Brief Summary:**

**Important to understand**

- **Llama-2** $\rightarrow$ pretrained language model
- **Llama-2 Chat** $\rightarrow$ fine-tuned chatbot trained with RLHF (reinforcement learning from human feedback)

This is similar to the relationship between GPT-3 and ChatGPT.

**What is RLHF?**

A group of human reviewers selects the best answer given by the chatbot, and these preferences are fed back into the model for fine-tuning.

**Pretraining the Model**

![image.png](images/llm_basics1.png)

We feed the model tokens to pretrain it to predict the next token.

We also have the chat version, which is fine-tuned so that the model behaves as it would in a normal conversation.

![image.png](images/llm_basics2.png)

**Training Overview**

The figure below outlines the approximate architecture used in training Llama2.

A key difference from the original Transformer decoder is that normalization is done before the feed-forward and masked multi-head attention steps.

![image.png](images/decoder_architecture.png)

To refer to the model code, please visit this link: https://github.com/facebookresearch/llama/blob/main/llama/model.py

The model consists of a stack of transformer blocks.

It also contains a mask so the decoder cannot cheat. Without it, the decoder would have access to information after the current word during training, so the mask hides those later positions.
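
A minimal sketch of such a mask is shown below, assuming the common approach of setting disallowed scores to negative infinity before the softmax. The helper names are illustrative and are not taken from the linked `model.py`.

```python
import math


def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get zero weight."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Attention weights for the second word (index 1): it may only see words 0 and 1.
row = masked_softmax([0.3, 1.2, -0.5, 2.0], mask[1])
print(row)
```

Even though the last position has the highest raw score, it gets zero weight because it lies in the future of the word being computed.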

### Summary from Paper

Below is a summary and brief understanding of the paper written on Llama2

**Paper Summary:**

- Llama2 is considered safer than other open-source models
- Uses PPO (a type of reinforcement learning algorithm) $\rightarrow$ ability to score outputs (https://www.youtube.com/watch?v=5P7I-xPq8u8)
    - **Two types of rewards:** safety & helpfulness
    - The two rewards can conflict with each other

**Pretraining**
- Used an auto-regressive transformer
    - It is able to look at its own output: it predicts the next token based on all of the tokens it has already predicted
- Used grouped-query attention (GQA) (https://www.youtube.com/watch?v=pVP0bu8QA2w)
    - Imagine we have 8 query heads split into 4 subgroups.
    - Each subgroup shares one key head and one value head.
- Standard transformer architecture (https://arxiv.org/pdf/1706.03762.pdf)
    - Extremely important paper in the field of deep learning
    - Talks about how this standard architecture of transformers is all you need
- Prenormalization (https://arxiv.org/pdf/1910.07467.pdf)
    - Normalization is done before the attention and feed-forward sub-layers, using RMSNorm from the linked paper
- SwiGLU activation function (https://www.ai-contentlab.com/2023/03/swishglu-activation-function.html)
    - Combination of Swish and GLU activation functions
    - Swish: non-monotonic, which leads to better optimization and faster convergence
    - GLU: similar to Swish, but a linear function is gated by a sigmoid activation function
    - SwiGLU: the Swish function is used instead of the sigmoid to gate the linear part of GLU
- Rotary positional embeddings
    - A method of encoding position by rotating the query and key vectors inside each attention layer
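
Two of the ideas above are easy to sketch in a few lines: how 8 query heads map onto 4 shared key/value heads in GQA, and how Swish gates a linear term in SwiGLU. The function names are hypothetical, and the scalar `swiglu` stands in for what is, in a real model, an element-wise product of two learned linear projections of the same hidden vector.

```python
import math


def kv_head_for_query(query_head, n_query_heads=8, n_kv_heads=4):
    """Index of the shared key/value head used by a given query head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate_input, linear_input):
    """Scalar SwiGLU: the Swish of one projection gates the other projection."""
    return swish(gate_input) * linear_input


mapping = [kv_head_for_query(head) for head in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
print(swish(-5.0), swish(-1.0))  # Swish is not monotonic for negative inputs
```

With 8 query heads and 4 key/value heads, every pair of neighbouring query heads reads from the same key and value head, which shrinks the key/value cache compared with giving each query head its own.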

**Hyperparameters**
- AdamW optimizer, $\beta_1 = 0.9, \; \beta_2 = 0.95, \;  \epsilon=10^{-5}$

## 2) Llama2 Documentation
The section below covers how to use Llama2 based on the Hugging Face documentation found at: https://huggingface.co/docs/transformers/main/en/model_doc/llama2

The main topics covered are the ones below:
- [x] Setting up Llama2 locally (https://www.youtube.com/watch?v=AOzMbitpb00)
- [ ] Llama2 Starting Documents
    - [x] Intro, Transformers and PEFT (https://huggingface.co/blog/llama2#fine-tuning-with-peft)
        - [x] Using Transformers
    - [ ] Llama 2 - Every resource you need (https://www.philschmid.de/llama-2)
        - [ ] What is Llama2? Etc
- [ ] Text Generation
    - [ ] fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision
    - [ ] fine-tune the “Llama-v2-7b-guanaco” model with 4-bit QLoRA and generate Q&A datasets from PDFs
- [ ] Text Classification
    - [ ] fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset.
- [ ] Optimization
    - [ ] fine-tune Llama2 with DPO
    - [ ] Extended guide: instruction-tune Llama2
    - [ ] Notebook on fine-tune the Llama2 using QLoRa and TRL
- [ ] Inference
    - [ ] How to Quantize the Llama 2 model using GPTQ from the AutoGPTQ Library
    - [ ] How to run the Llama 2 Chat Model with 4-bit quantization on a local computer

### Setting up Llama2 locally

Below we go through a series of steps to set up Llama2 on a local machine with huggingface. 

**Step by Step Code:**

In [1]:
#classic libraries needed
import torch
import torch.nn as nn
import accelerate
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
import transformers

Now we set up the model with our token from Hugging Face. We set `cache_dir` so the model does not need to be downloaded again every time.

In [6]:
#use CUDA when it is available to speed up inference
import os
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#read the Hugging Face access token from the environment instead of hardcoding it
#(assumes the token has been stored in the HF_TOKEN environment variable)
hf_token = os.environ.get("HF_TOKEN")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir = "/data/yash/base_models",
    device_map = "auto",
    token = hf_token
)

#a common alternative is to run `huggingface-cli login` in a terminal
#and enter the token there, so that it never appears in the notebook.
#this is the more common practice, but we do not do so this time.

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf",
                                         cache_dir = "/data/yash/base_models",
                                         token = hf_token)

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:29<00:00, 14.58s/it]


Below is our input, which the tokenizer converts into tensors

In [7]:
#converting text to the numbers down below so we can feed into LLMs
inputs = tokenizer("She is", return_tensors="pt").to(device)
inputs

{'input_ids': tensor([[   1, 2296,  338]]), 'attention_mask': tensor([[1, 1, 1]])}

We pass these input tensors into our model to generate outputs

In [8]:
#the laptop is a bit slow at generating this...
#it took approximately 7 mins to run
outputs = model.generate(**inputs, max_new_tokens=10)

#these numbers are generated from the tokens
outputs

tensor([[    1,  2296,   338,  5279,  2323, 29889,  2296,   756, 29797,   263,
          1353,   310,  1757]])

We then decode the tensor above with the same tokenizer to get our response back into words.

In [9]:
'''
skip_special_tokens=True drops special tokens (such as the start-of-sequence token with id 1) from the decoded text
'''

response = tokenizer.decode(outputs[0],skip_special_tokens=True)
response

'She is currently single. She has dated a number of men'

Below, we create a function which does the above.

In [10]:
#function that returns the model's response to a prompt
'''
model.generate(temperature = ) setting.
lower temperatures lead to more deterministic outputs
it does not accept 0; the value must be strictly positive
it only has an effect when sampling is enabled (do_sample=True)

summary: a lower temperature leads to nearly the same generated response every time.
a higher temperature leads to more varied responses every time.
it controls how much randomness there is; it is not itself a random seed.
'''

def get_llama2_response(prompt,max_new_tokens=10):
    inputs = tokenizer(prompt,return_tensors="pt").to(device)
    #do_sample=True is set explicitly so that the temperature is actually used
    outputs = model.generate(**inputs,max_new_tokens=max_new_tokens, do_sample=True, temperature = 0.01)
    response = tokenizer.decode(outputs[0],skip_special_tokens=True)
    return response
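
To see why a low temperature behaves almost deterministically, the sketch below applies a temperature-scaled softmax to three made-up logits. The function name and the numbers are illustrative only; this is the underlying formula, not the library's internal code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.01)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 5.0)
print(cold, warm, hot)
```

At a temperature of 0.01 almost all of the probability sits on the top token, so sampling picks it nearly every time. At 5.0 the distribution is much flatter and different tokens get sampled from run to run.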

In [11]:
#example
prompt = "Q: What is 2 + 2? A:"
get_llama2_response(prompt,max_new_tokens=5)

#this base version of the model doesn't know when to stop after answering.
#that is why we want to use the chat version, so the model knows when to stop

'Q: What is 2 + 2? A: 4\nQ:'

In [12]:
#huggingface leaderboards to look at which models perform best under which settings.

### Difference between Llama2 and Llama2 Chat?
The main difference between Llama2 and Llama2 Chat lies in the ability to formulate responses and continue a conversation with the user.

When we give a base LLM a command or question, we should expect a completion, not an answer.

- So if we give it an instruction, it might give us more questions to finish the "completion" rather than just giving us an answer.
- We wish to 'chat' with our LLM instead. This is why we have a chat version.

**Code Example:**

In [13]:
#base model
import os
#read the access token from the environment instead of hardcoding it
#(assumes the token has been stored in the HF_TOKEN environment variable)
hf_token = os.environ.get("HF_TOKEN")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    cache_dir = "/data/yash/base_models",
    device_map = "auto",
    token = hf_token
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf",
                                         cache_dir = "/data/yash/base_models",
                                         token = hf_token)

#chat model
chat_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    cache_dir = "/data/yash/base_models",
    device_map = "auto",
    token = hf_token
)

chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
                                         cache_dir = "/data/yash/base_models",
                                         token = hf_token)


Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:29<00:00, 14.99s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.68s/it]


In [44]:
pipeline = transformers.pipeline(
    "text-generation",
    model = chat_model,
    torch_dtype = torch.float16,
    device_map = "auto",
    tokenizer = chat_tokenizer
)

sequences = pipeline(
     'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=chat_tokenizer.eos_token_id,
    max_length=100,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?

Comment: Of course! Based on your interest in "Breaking Bad" and "Band of Brothers," here are some other shows you might enjoy:

1. "The Sopranos" - This HBO series is a crime drama that explores the life of a New Jersey mob boss, Tony Soprano, as he


In [14]:
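def get_llama2_chat_response(prompt,max_new_tokens=10):
    inputs = chat_tokenizer(prompt,return_tensors="pt").to(device)
    #do_sample=True is set explicitly so that the temperature is actually used
    outputs = chat_model.generate(**inputs,max_new_tokens=max_new_tokens, do_sample=True, temperature = 0.01)
    #decode with the chat tokenizer, matching the one used to encode the prompt
    response = chat_tokenizer.decode(outputs[0],skip_special_tokens=True)
    return response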

In [15]:
prompt = "What is 2 in text?"
get_llama2_chat_response(prompt,20)

'What is 2 in text?\n2 in text is the number two written in a text format. It is represented as "2'

Notice how, unlike the base version, the chat version answers the question directly instead of continuing with further questions. Here the output is simply cut off once the maximum number of new tokens (20) is reached.

### Llama2-Chat Instructional Prompts

We must use specific tags since this is a chat scenario.
More can be found here.
https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

- `<<SYS>> Telling Llama2 model how you expect it to respond or answer <</SYS>>`
- `[INST] Commands or questions from user [/INST]`
- The tags are also needed around every message from the user
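
Writing these tags by hand is error-prone, so a small helper can assemble the prompt. This is a sketch that follows the layout used in this notebook (turns separated by newlines); `build_llama2_prompt` is a hypothetical name, and Meta's reference format additionally wraps each completed turn in `<s>` and `</s>` tokens, which are left out here.

```python
def build_llama2_prompt(user_message, system_prompt=None, history=()):
    """Format a conversation with the Llama-2 chat tags.

    history is a sequence of (user, assistant) pairs from earlier turns.
    """
    users = [user for user, _ in history] + [user_message]
    if system_prompt:
        # The system prompt is given once, inside the first [INST] block.
        users[0] = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{users[0]}"
    prompt = ""
    for (_, assistant), user in zip(history, users):
        prompt += f"[INST] {user} [/INST] {assistant}\n"
    prompt += f"[INST] {users[-1]} [/INST]"
    return prompt


single_turn = build_llama2_prompt("What is 2 in text?")
multi_turn = build_llama2_prompt(
    "Where do we cook?",
    system_prompt="Respond only by quoting from the TV series Breaking Bad",
    history=[("Hi", "Say my name. Say it.")],
)
print(single_turn)
print(multi_turn)
```

Generating the tags from one place keeps the opening and closing `<<SYS>>` and `<</SYS>>` markers balanced, which is easy to get wrong when typing prompts manually.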

**Example:**

In [2]:
#classic instructions form for chat version
prompt = "[INST] What is 2 in text? [/INST]"
get_llama2_chat_response(prompt,20)


**Using Prompt Engineering in Llama2** 

Prompt engineering is important to guide LLMs toward the answers we prefer.

Below is the reference site for the techniques used:
https://www.promptingguide.ai/

1. **Zero-shot Prompting** - No examples given, straightforward prompts that ask for the answer
2. **Few-shot Prompting** - Leading examples to guide the model to respond in a certain way
3. **Chain of Thought Prompting** - Guiding the response to give an answer that is appropriate after a series of connecting ideas

In [18]:
#Zero-shot prompting
prompt = '''
[INST]
Classify the text into neutral, negative or positive.
Text: I just love it
[/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=20))


[INST]
Classify the text into neutral, negative or positive.
Text: I just love it
[/INST]
Classification: Positive


In [None]:
prompt = '''
This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //
'''
print(get_llama2_chat_response(prompt,max_new_tokens=100))

In [19]:
#Few-shot prompting
#<<SYS>> is only given once per prompt. 
#the INST is given to the LLM and the rest is figured out by itself
#we expect the response of the LLM to be the final thing
prompt = '''
[INST] <<SYS>>
Classify the  text into neutral, negative or positive.
<</SYS>>

I like this. [/INST]
positive
[INST] I hate this [/INST]
negative
[INST] this is ok [/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=50))


[INST] <<SYS>>
Classify the  text into neutral, negative or positive.
<</SYS>>

I like this. [/INST]
positive
[INST] I hate this [/INST]
negative
[INST] this is ok [/INST]
neutral


In [20]:
#Chain of Thought Prompting
prompt = '''
[INST]The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.[/INST]
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
[INST]The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.[/INST]
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
[INST]The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.[/INST]
[INST]A: Adding all the odd numbers (11, 13) gives 24. The answer is True.
[INST]The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.[/INST]
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
[INST]The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. [/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=50))


[INST]The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.[/INST]
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.
[INST]The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.[/INST]
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.
[INST]The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.[/INST]
[INST]A: Adding all the odd numbers (11, 13) gives 24. The answer is True.
[INST]The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.[/INST]
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.
[INST]The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. [/INST]
A: Adding all the odd numbers (5, 13, 7) gives 25. The answer is True.
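
As a sanity check on the final question in the prompt, the arithmetic can be verified in a few lines of Python. The generated answer above lists only (5, 13, 7) and leaves out 15 and 1, so its total and its conclusion do not match the actual numbers.

```python
numbers = [15, 32, 5, 13, 82, 7, 1]

# Keep only the odd numbers and add them up.
odd_numbers = [n for n in numbers if n % 2 == 1]
total = sum(odd_numbers)

# The statement claims that the odd numbers add up to an even number.
claim_is_true = total % 2 == 0

print(odd_numbers, total, claim_is_true)  # [15, 5, 13, 7, 1] 41 False
```

This is a useful reminder that chain-of-thought examples guide the format of the reasoning but do not guarantee that each step is correct.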


**SYS Instructions for Llama2**

Below are examples of how to use `<<SYS>>` instructions to control how the model should respond

**Example 1:** 

In [21]:
prompt = '''
[INST] <<SYS>>
Respond only in emojis!
<</SYS>>

What is the weather like today?[/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=20))


[INST] <<SYS>>
Respond only in emojis!
<</SYS>>

What is the weather like today?[/INST]
🌞☁️🌨


**Example 2:**

In [31]:
prompt = '''
[INST] <<SYS>
Respond only by quoting from the TV series Breaking Bad
<</SYS>>

Hi[/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=20))


[INST] <<SYS>
Respond only by quoting from the TV series Breaking Bad
<</SYS>>

Hi[/INST]
"Say my name. Say it."


**Continuing the chat**

We can also include the earlier messages in the prompt so that the chat follows the conversation so far.

In [33]:
#the prompt below continues the conversation
prompt = '''
[INST] <<SYS>
Respond only by quoting from the TV series Breaking Bad
<</SYS>>

Hi[/INST]
Say my name. Say it.
[INST] Where do we cook? [/INST]
'''
print(get_llama2_chat_response(prompt,max_new_tokens=20))


[INST] <<SYS>
Respond only by quoting from the TV series Breaking Bad
<</SYS>>

Hi[/INST]
Say my name. Say it.
[INST] Where do we cook? [/INST]
"The cooking is in the kitchen, my friend." - Walter White


### Optimization
In this section, we go through documents and concepts that deal with the optimization portion of Llama2

- [x] Finetuning Llama2 with DPO
- [x] Instruction Tune Llama2
- [x] Llama2 using QLoRa and TRL

**Sections outlined below:**

#### 1 - Finetuning Llama2 with DPO

**What is DPO? -** DPO Stands for Direct Preference Optimization. 

##### **A Brief Paper Summary:**
Summary from the paper: https://www.youtube.com/watch?v=pzh2oc6shic&t=6s


<u>Recall the classic RLHF pipeline.</u>

1. **supervised fine-tuning (SFT)** - on a high-quality dataset for tasks of interest, e.g. dialogue, instruction following, summarization etc.
2. **preference sampling** - SFT model is prompted and its answers are sent to human reviewers who express preferences for answers.
3. **reinforcement-learning optimization** - We use a reward function learned from those preferences to provide feedback to the LLM. We formulate an optimization problem and maximize it using PPO.

<i>**So how is DPO different from PPO?**</i>: It bypasses <u>both</u> **explicit reward estimation** and **reinforcement learning** by directly optimizing the model using preference data.

<i>**Code for the Loss Function**</i>

![image.png](images/dpo_code.png)

##### **Diving into the Theory:** 

**Main Idea**: The reward can be rewritten in terms of the policy itself, which yields a loss function without the need for reinforcement learning.

$$ L_{DPO} (\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x,y_w,y_l)\sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right] $$

- This gives a loss function that maximizes the implicit reward for choosing the winning response over the losing response

- The more recent IPO method argues that DPO violates fundamental assumptions
- $\beta$ is a parameter that controls how large a KL distance can be travelled
    - i.e. how different the policy is allowed to be before and after training
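
For a single preference pair, the DPO loss from the original paper can be computed directly from four log-probabilities. The sketch below uses plain Python and made-up numbers; the function and argument names are illustrative, not the API of any library.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the log-probability of a whole response given the prompt.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# Policy identical to the reference: the margin is 0, so the loss is log(2).
untrained = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Policy now favours the chosen response more than the reference does.
improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)

print(untrained, improved)
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, and a larger `beta` makes the same shift count for more.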

##### **More Indepth Notes from Videos Below**
Below is the list of references:
- [x] Original Paper (https://arxiv.org/pdf/2305.18290.pdf)
- [x] DPO vs RL (https://www.youtube.com/watch?v=YJMCSVLRUNs)
- [x] Direct Preference Optimization (https://www.youtube.com/watch?v=XZLc09hkMwA)
- [ ] Video Explanation 2 (https://www.youtube.com/watch?v=HCFTXTn1PHA)
- [ ] Video Explanation 3 (https://www.youtube.com/watch?v=E5kzAbD8D0w)

###### **DPO v RL (Video 1)**
DPO acts as an alternative to RLHF without any RL update rule.



<u>Recall the optimal policy of the KL-constrained RLHF objective below</u>
$$\pi_r (y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)$$

- so with DPO the policy itself implicitly defines the reward, and no separate reward model is needed

<u>Some formulas in context of RLHF</u>

- Bradley-Terry preference model - $$p^*(y_1>y_2|x) = \frac{\exp(r^*(x,y_1))}{\exp(r^*(x,y_1)) + \exp(r^*(x,y_2))}$$
    - Probability that one entry wins over another, normalized with exponentials
- Reward model loss - $$L_R(r_\phi, D) = - \mathbb{E}_{(x,y_w,y_l)\sim D}[\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))] $$
    - Separates the winning completion $y_w$ from the losing completion $y_l$
- Policy objective - $$\underset{\pi_\theta}{\max} \mathbb{E}_{x \sim D,y \sim \pi_\theta(y|x)}[r_\phi(x,y)] - \beta \; \mathbb{D}_{KL}[\pi_\theta (y|x) || \pi_\text{ref}(y|x)]$$
    - KL-constrained maximization problem; the constraint prevents the model from generating gibberish.
    - KL stands for the Kullback-Leibler divergence (https://www.lesswrong.com/posts/bQ3xeaWsjzbAeStHB/distillation-rl-with-kl-penalties-is-better-viewed-as)
        - RL with KL penalties viewed as Bayesian inference
        - More intuitive ways of thinking about it in this article (https://www.lesswrong.com/posts/no5jDTut5Byjqb4j5/six-and-a-half-intuitions-for-kl-divergence)
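
Two of these formulas are small enough to try out numerically. The sketch below evaluates the Bradley-Terry probability for made-up rewards and the KL divergence between two made-up discrete distributions; the function names are illustrative.

```python
import math


def bradley_terry(reward_1, reward_2):
    """Probability that completion 1 is preferred over completion 2."""
    return math.exp(reward_1) / (math.exp(reward_1) + math.exp(reward_2))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Equal rewards give a 50/50 preference; a higher reward wins more often.
print(bradley_terry(1.0, 1.0), bradley_terry(2.0, 1.0))

# The divergence is 0 for identical distributions and grows as they differ.
reference = [0.5, 0.5]
print(kl_divergence(reference, reference), kl_divergence([0.9, 0.1], reference))
```

Only the difference between the two rewards matters in the Bradley-Terry model, which is why the probability can also be written as a sigmoid of that difference.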

**Optimal Policy**
Starting from the policy objective, the paper derives the following optimal solution:

$$\pi(y|x) = \pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp(\frac{1}{\beta}r(x,y))$$

- This is obtained through Gibbs' inequality, which says that the KL-divergence reaches its minimum of 0 if and only if the two distributions are identical.

Rearranging this lets us solve for the reward function, which has the form:

$$ r(x,y) = \beta \log \frac{\pi_r(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x) $$

- reward is logprob ratio
- increases the likelihood of chosen tokens

Plugging this into the Bradley-Terry model (rewritten as a sigmoid), the $\beta \log Z(x)$ terms cancel:

$$p^*(y_1>y_2|x) = \frac{1}{1 + \exp \left( \beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)} - \beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} \right) }$$

- This gives a way to calculate the probability that the chosen response is better than the rejected response based on the data

Reformulate the above to get a loss $\rightarrow$ get the MLE objective

$$ L_{DPO} (\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x,y_w,y_l)\sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right] $$

- Above is the difference between the log-ratio of $y_w$ given $x$ and the log-ratio of $y_l$ given $x$.
- It is a pairwise difference between the winning and losing responses
- The loss function is very simple in DPO compared to PPO

###### **Direct Preference Optimization (Video 2)**

This video covers the following four sections
- How does normal Training/Fine-Tuning work?
- How does DPO work?
- DPO datasets
- Training Notebook Runthrough

**Google Colab Code:** https://colab.research.google.com/drive/1AP9jewCrK6uSItWeRBePbkY9EFiYiii4?usp=sharing

Ran and followed the coding instructions to fine-tune TinyLlama using DPO. Also wrangled datasets from the website to fit the instruction-prompt format of Llama2 chat.

**How does normal training/fine-tuning work?**

Standard training - penalize the model based on predicted vs actual next token.

<u>Example:</u>  
Training Data:
- The Capital of Ireland is <u> Dublin </u>
- The Capital of Ireland is <u> Dublin </u>
- The Capital of Ireland is <u> Cork </u>

From the above, we see that the model will prefer to predict Dublin as the next token, since it appears more often in the training data. If we want to bias the model towards Dublin as the response, we feed it more data with Dublin as the answer.
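
A toy version of this can be worked out by hand. For a model that only has to choose between the two completions, the probabilities that minimize the average next-token loss on the three training lines are simply their frequencies. The sketch below checks this against a 50/50 guess; the helper name `average_nll` is made up for the example.

```python
import math
from collections import Counter

# The completions of "The Capital of Ireland is ..." in the training data above.
completions = ["Dublin", "Dublin", "Cork"]

# Maximum-likelihood estimate: the relative frequency of each completion.
counts = Counter(completions)
probs = {token: count / len(completions) for token, count in counts.items()}


def average_nll(model_probs, data):
    """Average negative log-likelihood (cross-entropy) of the data."""
    return -sum(math.log(model_probs[token]) for token in data) / len(data)


empirical_loss = average_nll(probs, completions)
uniform_loss = average_nll({"Dublin": 0.5, "Cork": 0.5}, completions)

print(probs, empirical_loss, uniform_loss)
```

The frequency-based model assigns Dublin a probability of 2/3 and reaches a lower loss than the uniform guess, which is the sense in which standard training follows whatever answer dominates the data.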

**What about Direct Preference Optimization?**

- Instead, we are going to drag the probability distribution away from one answer and towards another

<u> Example: </u>

- Prompt $\rightarrow$ The Capital of Ireland is ____

- Chosen Response $\rightarrow$ Dublin
- Rejected Response $\rightarrow$ Cork

Thus, we have a new penalization method that pushes the model towards
$$\mathbb{P}(\text{Dublin}) \text{ of the model being trained} > \mathbb{P}(\text{Dublin}) \text{ of the reference model}$$
$$\mathbb{P}(\text{Cork}) \text{ of the model being trained} < \mathbb{P}(\text{Cork}) \text{ of the reference model}$$
The reference model is just a copy of the model from before we started the DPO process
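
The effect can be simulated with a toy "model" that has just two logits, one for Dublin and one for Cork. The sketch below runs a few gradient steps on the DPO loss for this single preference pair; the starting logits, `beta` and the learning rate are arbitrary values chosen for illustration.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Index 0 is "Dublin" (chosen), index 1 is "Cork" (rejected).
reference_logits = [0.5, 0.0]
policy_logits = list(reference_logits)  # the policy starts as a copy of the reference
ref_probs = softmax(reference_logits)
beta, learning_rate = 0.1, 1.0

for _ in range(50):
    probs = softmax(policy_logits)
    margin = beta * (
        (math.log(probs[0]) - math.log(ref_probs[0]))
        - (math.log(probs[1]) - math.log(ref_probs[1]))
    )
    # Derivative of loss = -log(sigmoid(margin)) with respect to the margin.
    grad_margin = -sigmoid(-margin)
    # For softmax logits, d(margin)/d(chosen logit) = beta and
    # d(margin)/d(rejected logit) = -beta.
    policy_logits[0] -= learning_rate * grad_margin * beta
    policy_logits[1] -= learning_rate * grad_margin * (-beta)

final_probs = softmax(policy_logits)
print(ref_probs, final_probs)
```

After training, the policy gives Dublin a higher probability and Cork a lower probability than the reference model does, matching the two inequalities above.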

**DPO Datasets**
- Ultrachat
    - A collection of conversations gathered from many kinds of AI chats
        - Includes both good responses and bad responses
        - We form pairs from these and train our model on them
    - Not allowed for commercial use.
- Helpful and Harmless
    - This trains the model to steer away from harmful results

**DPO vs. RLHF**

The previous approach (RLHF) involved taking an entirely new model and training it to recognize good and bad answers.

That model becomes good at identifying good or bad answers. You then run a loop in which the main model produces an answer, the answer is scored, and the result is back-propagated.

##### **Code Examples:**

https://huggingface.co/blog/dpo-trl

#### 2 - Instruction Tune Llama2

#### 3 - Llama2 using QLoRa and TRL (Only SFTTrainer)
- [x] Google Colab (https://colab.research.google.com/drive/1SYpgFpcmtIUzdE7pxqknrM4ArCASfkFQ?usp=sharing#scrollTo=nCheb2Z9Aos8)
- [x] Youtube Video (https://www.youtube.com/watch?v=aI8cyr-gH6M&t=6s)

**Google Colab Code:** https://colab.research.google.com/drive/1MJ8lFC81RPjsIVtm5TBgClXOyihPc3Xj?usp=sharing

Followed it step by step and reached the training epochs in the Google Colab code

**Youtube Video**

Specifies and explains the repository in the Hugging Face GitHub for training Llama2 with DPO, 4-bit precision and QLoRA.

**What is LoRA?** https://www.youtube.com/watch?v=X4VvO3G6_vw

- Before LLMs, a model was typically pre-trained and then fine-tuned on a custom dataset
- Deploying these bulky LLMs is getting increasingly difficult
- Adapters were introduced to address this
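
The core idea of a LoRA adapter is to keep the large weight matrix frozen and learn only a low-rank update added on top of it. The sketch below shows this with tiny list-based matrices and compares parameter counts for one layer; the sizes (4096 by 4096, rank 8) and all names are illustrative assumptions, not values taken from the linked video.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_update(weight, lora_b, lora_a, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a); weight stays frozen."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(weight_row, delta_row)]
        for weight_row, delta_row in zip(weight, delta)
    ]


def trainable_parameters(d_out, d_in, rank):
    """Parameters trained by full fine-tuning versus by a LoRA adapter."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}


# A 2x2 frozen weight with a rank-1 adapter; B starts at zero, so nothing changes yet.
weight = [[1.0, 2.0], [3.0, 4.0]]
adapted = lora_update(weight, [[0.0], [0.0]], [[0.5, -0.5]], alpha=16, rank=1)

counts = trainable_parameters(4096, 4096, 8)
print(adapted, counts)
```

For this hypothetical layer the adapter trains 65,536 parameters instead of 16,777,216, which is why adapters make fine-tuning and deploying many task-specific variants of one bulky model much cheaper.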

![image.png](images/lora1.png)