# Proximal Policy Optimization

## LLMs as distributions

* An LLM can be viewed as a probability distribution $\pi(y|x)$ over possible responses $y$ for a given query $x$.
* Sampling responses -> rather than a single deterministic output, LLMs sample tokens from a probability distribution at each step.
* Dataset approximation -> $X \sim D$ indicates that queries $X$ are drawn from the training data distribution $D$, which in turn shapes the probabilities the model assigns to responses.

### Transformers & softmax probabilities

* Token by token generation -> For a query, the transformer predicts the next token using a distribution derived from the softmax function.
* Softmax function -> Transforms logits into probabilities for each token in the vocab.
* Sequential dependence -> The probability distribution at time $t+1$ depends on the generated token at time $t$. Past tokens influence future token probabilities.
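
As a rough illustration of the softmax step, the sketch below turns a list of logits into probabilities. The logit values and the 4-token vocabulary are made up for the example; a real model produces one logit per vocab entry.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)
```

Larger logits map to larger probabilities, and the ordering of the logits is preserved.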

### Sampling & generation process

* Initial Query -> convert words into tokenized embeddings
* Transformer Pass -> model outputs a probability distribution over the next token
* Random Selection -> pick a token based on the distribution (not necessarily the highest probability)
* Iterate -> Feed the chosen token back into the model and repeat until the sequence is complete or \<eos\> is generated.
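
The loop above can be sketched with a toy stand-in for the transformer. Everything here is hypothetical: `toy_logits` replaces a real model pass, `VOCAB` is a 4-entry vocabulary, and the logit values are arbitrary. Only the control flow (predict, sample, feed back, stop on `<eos>`) mirrors the steps listed.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat"]  # hypothetical tiny vocabulary

def toy_logits(tokens):
    """Hypothetical stand-in for a transformer pass: one logit per vocab entry."""
    # <eos> becomes more likely as the sequence grows; other logits are fixed.
    return [0.5 * len(tokens) - 1.0, 1.0, 0.8, 0.6]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Sample tokens one at a time until <eos> or the length limit."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))
        # Random selection: sample from the distribution, not argmax.
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the chosen token back in
    return tokens

print(generate(["the"]))
```

Changing `seed` gives a different sequence for the same prompt, which is the stochastic behaviour the notes describe.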

### Generation parameters

* temperature
    * scales the logits before applying softmax
    * effect
        * high temp -> more uniform distribution -> more randomness
        * low temp -> sharper distribution -> less randomness
* top-k sampling
    * restricts next-token choices to the top k highest prob tokens
    * process
        * compute probs
        * keep only top K tokens
        * re-normalize to ensure probabilities sum to 1
    * effect
        * limits randomness to the most likely tokens
* top-p sampling
    * chooses the smallest set of tokens whose cumulative prob exceeds p
    * effect
        * allows dynamic selection of the token set, ensuring coverage of high-prob mass
* beam search
    * keeps track of multiple top candidate sequences and expands them step by step
    * effect
        * systematically searches for high prob sequences instead of just sampling
* repetition penalty
    * penalizes repeated tokens or phrases to encourage diversity in the output
    * effect
        * reduces repetitive loops ("the the the...")
* max & min tokens
    * sets upper and lower limits for the number of generated tokens
    * effect
        * controls how long the generated text can be
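
A minimal sketch of temperature scaling, top-k and top-p filtering on a plain list of probabilities. The function names and example numbers are assumptions for illustration, not any library's API. The top-p cutoff here stops once the cumulative mass reaches `p`.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits scaled by 1/temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, re-normalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative prob reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]           # hypothetical logits
sharp = softmax(logits, temperature=0.5)  # low temp -> sharper
flat = softmax(logits, temperature=2.0)   # high temp -> flatter
print(sharp, flat)
print(top_k_filter([0.5, 0.3, 0.15, 0.05], k=2))
print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.7))
```

Top-k always keeps a fixed number of tokens, while top-p keeps however many are needed to cover the requested probability mass.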

### Summary

* LLMs can be perceived as stochastic generators, guided by the learned distribution
* Softmax & transformers are the core mechanisms for assigning probs at the generation step
* Parameters
    * Temperature -> alter randomness
    * Top K & Top P -> control token selection strategies
    * Beam search -> systematically explores multiple high-prob sequences
    * Repetition penalty & response length -> refine structure and length of the outputs
* Adjusting the parameters helps to tailor the outputs


## From distributions to policies

### Policy in reinforcement learning

* a policy maps the agent's current state to the action it should take
* the role of randomness -> policies often include stochasticity to explore unseen possibilities, leading to more robust and adaptable results
* application -> in RL tasks, policies guide action sequences to achieve goals or maximize rewards

### LLMs as policies/distributions

* a language model can be viewed as a policy $\pi(y|x)$ where, given an input $x$, the model produces a distribution over possible outputs $y$
* text generation -> the policy guides the model in exploring diverse text output paths, enhancing creativity and context relevance

### Roll-outs & policy distribution

* roll-outs in LLM
    * when a query is provided, the model generates multiple possible responses (roll-outs)
    * each rollout is a distinct sequence of tokens sampled from the policy distribution
* roll-outs in RL
    * roll-outs also involve reward signals that guide future policy updates
    * note -> LLM frameworks like Hugging Face (HF) may use the term rollout differently than RL, typically without explicit rewards

### Mapping $y$ from $x$ via a policy $\pi$

* $y \sim \pi(y|x)$ indicates the output $y$ is sampled from a distribution conditioned on $x$
* sequential probs -> the model breaks down the probability of a sentence into a product of conditional probs over tokens (transformer + softmax)
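
Written out, the factorization is the chain rule of probability. The token index $t$, the sequence length $T$, and the shorthand $y_{<t}$ for the tokens before position $t$ are notation introduced here for illustration:

$$\pi(y|x) = \prod_{t=1}^{T} \pi(y_t \mid x, y_{<t}), \qquad \log \pi(y|x) = \sum_{t=1}^{T} \log \pi(y_t \mid x, y_{<t})$$

The log form is what is usually computed in practice, since summing log probs avoids underflow from multiplying many small numbers.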

### Summary

* policy as distribution -> in both RL and LLM, a policy/distribution guides possible actions (tokens)
* viewing LLMs as policies highlights their capacity for exploration and action selection (text-generation)
* a given input has multiple possible outputs -> in RL these are scored with rewards, while in the plain LLM context they are just different responses

## Reinforcement learning with human feedback

* the concept uses human feedback (rewards) to guide and fine-tune a pre-trained language model
* analogy -> monkeys typing at random would eventually produce coherent text, but rewarding good output with a banana speeds up the creation of the desired text

### Reward function $r(X,Y)$

* a function assigning a score to the response $Y$ given a query $X$
* indicates quality or relevance of the model's output and provides a basis for learning
* example
    * query -> Which country owns Antarctica?
    * response 1 -> ?9dfsa -> reward: 0 (irrelevant)
    * response 2 -> No country owns Antarctica -> reward: 0.9 (mostly correct)
    * response 3 -> Antarctica is governed by an international treaty -> reward: 1 (ideal)

### Rollouts in RLHF

* a rollout is a query-response pair, and multiple roll-outs help the model explore diverse responses
* for each query, the model can produce several distinct responses, each receiving a reward, which are then used to adjust model parameters

### Expected reward $E(r(X,Y))$

* empirical estimate -> sum or average rewards across multiple queries and responses to approximate how well the model is performing
* $E[r(X,Y)] \approx \hat R = \frac{1}{N \times K} \sum_{n=1}^N \sum_{k=1}^K r(X_n, Y_{n,k})$
    * $N$ stands for number of queries, $K$ for number of responses per query (roll-outs)
    * $r(X_n, Y_{n,k})$ reward for the $k$-th response to the $n$-th query
* policy perspective -> the expected reward is an expectation over the data (queries) and the model's distribution
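
A small sketch of the empirical estimate: average the rewards over all queries and roll-outs. The reward values are invented for the example, and `estimate_expected_reward` is a hypothetical helper name.

```python
def estimate_expected_reward(rewards):
    """rewards[n][k] is the reward of the k-th response to the n-th query."""
    n_queries = len(rewards)
    k_rollouts = len(rewards[0])
    total = sum(sum(row) for row in rewards)
    return total / (n_queries * k_rollouts)

# Hypothetical rewards: N = 2 queries, K = 3 roll-outs each.
rewards = [
    [0.0, 0.9, 1.0],
    [0.2, 0.4, 0.5],
]
r_hat = estimate_expected_reward(rewards)
print(r_hat)
```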

### Fine-tuning with a reward model

* setup
    * agent/policy -> the LLM with parameters $\theta$
    * reward model -> evaluates query-response pairs and returns a reward

* process
    * step 1 -> input query $X$ to the LLM -> generate response $Y$
    * step 2 -> reward model takes $(X,Y)$ -> outputs reward $r(X,Y)$
    * step 3 -> LLM updates parameters $\theta$ to maximize reward signal

* example
    * query -> Which country owns Antarctica?
    * responses -> ?9dfsa, No country...
    * highest reward -> Antarctica is governed by an international treaty -> reward: 1
    * outcome -> the model learns to favor this type of correct response

### Summary

* human feedback drives learning, assigning rewards for correct responses focuses the model on producing better results
* roll-outs -> each q-r pair is evaluated, guiding how params are updated
* expected reward -> gives an overall measure of the model's performance
* RLHF -> balances exploration (response sampling) with exploitation (updating the model's params)


## Policy gradient foundations

* a policy $\pi_\theta$ assigns probabilities to potential actions or responses, given an input query $X$
* objective -> maximize the expected reward $E[r(X,Y)]$ over query-response pairs $(X,Y)$

### Proximal policy optimization

* a method to update policy parameters $\theta$ with stability, avoiding large destabilizing changes
* components
    * clipped surrogate objective -> prevents the new policy from diverging excessively from the old one
    * KL penalty coefficient ($\beta$) -> regularizes the policy update by penalizing high divergence between the old and the new policies
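
The clipped surrogate can be sketched for a single sample as below. This follows the commonly cited PPO form. The names `ratio` (new-policy prob divided by old-policy prob for the sampled response), `advantage` (for example, reward minus a baseline) and `epsilon` (clip range) are not defined in these notes and are assumptions for the illustration.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective (to be maximized)."""
    # Restrict the probability ratio to [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the more pessimistic of the clipped and unclipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

print(clipped_surrogate(1.5, 1.0))   # gain is capped once ratio > 1 + epsilon
print(clipped_surrogate(0.5, -1.0))  # ratio below 1 - epsilon is clipped
print(clipped_surrogate(1.0, 2.0))   # unchanged when the policies agree
```

Taking the minimum means the objective never rewards moving the ratio further outside the clip range, which is what keeps the update proximal.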

### Training process

* agent & reward model
    * an agent (LLM) with learnable parameters $\theta$ generates a response $Y$ to a query $X$
    * reward function/model evaluates $(X,Y)$ and returns a scalar reward $r(X,Y)$

* roll-outs
    * the combination $(X,Y)$ is often referred to as a rollout
    * multiple roll-outs across queries help in estimating the expected reward

* objective
    * to find parameter set $\theta$ that maximizes the expected reward
    * a reference model can be included as a regularization term, ensuring we don't deviate too much from the original model

### Log-derivative trick

* directly computing $\nabla_\theta E[r(X,Y)]$ can be intractable, so the log-derivative trick reformulates the expression for easier gradient estimation ->
    * express the objective as an expectation of rewards under the policy distribution
        * $E[r_Y|\theta] = \sum_Y r(X,Y)\ \pi_\theta(Y|X)$
        * $\hat \theta = \arg\max_\theta [\sum_Y r(X,Y)\ \pi_\theta(Y|X)]$
        * $\nabla_\theta E[r_Y|\theta] = \sum_Y r(X,Y)\ \nabla_\theta \pi_\theta(Y|X)$
    * introduce $\log \pi_\theta(Y|X)$
        * $\nabla_\theta \log \pi_\theta(Y|X) = \frac{\nabla_\theta \pi_\theta(Y|X)}{\pi_\theta(Y|X)}$
    * rearrange and substitute, turning the sum into an expectation that Monte Carlo sampling can estimate
        * $\nabla_\theta \pi_\theta(Y|X) = \pi_\theta(Y|X)\ \nabla_\theta \log \pi_\theta(Y|X)$
        * $\nabla_\theta E[r_Y|\theta] = E_{Y \sim \pi_\theta(Y|X)} [r(X,Y)\ \nabla_\theta \log \pi_\theta(Y|X)]$
        * $E_{X \sim D}[\nabla_\theta E[r_Y|\theta]] = \nabla_\theta E_{X \sim D}[E[r_Y|\theta]]$
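
To check the trick numerically, the sketch below uses a toy softmax policy over three candidate responses, with the logits playing the role of $\theta$. It compares the exact gradient of the expected reward with a Monte Carlo estimate built from reward times the gradient of the log prob. The three-response setup, the reward values and the sample count are assumptions for illustration.

```python
import math
import random

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    s = sum(exps)
    return [e / s for e in exps]

def exact_gradient(theta, rewards):
    """d/d theta_j of sum_y r_y * pi_y for a softmax policy."""
    probs = softmax(theta)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def score_function_gradient(theta, rewards, num_samples=20000, seed=0):
    """Monte Carlo estimate: average of r(y) * grad log pi(y) over samples."""
    rng = random.Random(seed)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        y = rng.choices(range(len(theta)), weights=probs, k=1)[0]
        for j in range(len(theta)):
            # For softmax: d log pi_y / d theta_j = 1[j == y] - pi_j
            grad_log_pi = (1.0 if j == y else 0.0) - probs[j]
            grad[j] += rewards[y] * grad_log_pi
    return [g / num_samples for g in grad]

theta = [0.0, 0.0, 0.0]
rewards = [0.0, 0.9, 1.0]  # hypothetical rewards per response
exact = exact_gradient(theta, rewards)
approx = score_function_gradient(theta, rewards)
print(exact)
print(approx)
```

The two gradients agree up to sampling noise, and the estimate only needs samples from the policy rather than a sum over every possible response.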

### Practical tips

* regular evaluation with human feedback
* moderate KL penalty to avoid overly large or small regularization (instability vs no-updates)
* temperature tuning -> go from lower (more exploitation) to higher (more exploration)

### Summary

* policy gradient framework to directly optimize policy $\pi_\theta$ by maximizing expected reward
* PPO introduces tools to keep policy updates proximal and stable
* log-derivative trick is a foundational RL trick for computing gradient updates
* model fine-tuning through generating roll-outs, computing rewards, updating $\theta$, always balancing performance and stability

# Direct Preference Optimization

## Partition function

* direct preference optimization (DPO) is a technique derived from the RLHF objective that fine-tunes models on human preferences more directly than traditional RL methods
* core ideas
    * collect data on human preferences by comparing different model outputs
    * directly optimize the model's parameters so outputs are better aligned with human preference

### DPO vs traditional RL

* traditional
    * often uses a reward function indirectly related to human feedback
    * requires an actor (policy) and critic (value/reward estimator), e.g. PPO
* DPO
    * directly incorporates preference data into optimization
    * aims to avoid the complexity of training reward models by leveraging direct comparisons between responses

### Three components

* reward function (encoder)
    * evaluates the relevance or quality of a response
    * example "this is a cat" -> low score if discussing LLMs, high score if relevant
* target decoder (parameters $\theta$)
    * the model to be fine-tuned (policy $\pi$)
* reference model
    * acts as a baseline or "initial" model to regularize how far the new policy can deviate

### Partition function & normalization

* partition function ($Z$)
    * ensures that the sum of probabilities over all possible outcomes is 1
    * converts unnormalized positive functions into valid probability distributions through scaling

* logistic example
    * $\sigma(x)$ -> sigmoid mapping $x$ to $(0,1)$
    * $P(y=1|x) = \sigma(x)$, $P(y=0|x) = 1 - \sigma(x)$
    * partition function is implicitly accounted for through the sigmoid shape
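
One way to see the implicit partition function is to write the sigmoid as a two-class normalization. The unnormalized scores $e^{x}$ (class 1) and $e^{0}$ (class 0) are an illustrative choice made here:

$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + e^{0}} = \frac{e^{x}}{Z}, \qquad Z = e^{x} + e^{0}$$

So $P(y=1|x) = e^{x}/Z = \sigma(x)$ and $P(y=0|x) = e^{0}/Z = 1 - \sigma(x)$, and the two probabilities sum to 1 because both are divided by the same $Z$.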

### Key DPO steps

* collect preference data -> show model outputs to humans, gather "Which is better?" judgments
* construct objective -> use these preferences (and the ref model) to form an objective function that pushes the policy to align with top choices
* normalize & fine-tune -> employ the partition function to ensure outputs form a valid distribution, optimize with gradient-based methods

### Summary

* DPO is more directly aligned with human choices, less complex than multi-stage RL setups
* partition function is critical to transform raw scores into proper probability distributions, and it appears in both logistic regression and more general RL-based preference learning contexts
* DPO is used for fine-tuning, preference based training in language tasks etc

## Optimal solution

### Objective functions

* an objective function measures how far a model's predictions deviate from targets; it guides model training, and optimizing it improves performance

### KL divergence

* measures dissimilarity between two probability distributions $\pi^*(y|x)$ and $\pi_{ref}(y|x)$
* zero divergence if the distributions are identical
* asymmetrical -> $D_{KL}(p||q) \neq D_{KL}(q||p)$ in general
* usage
    * minimizing KL divergence aligns a new policy $\pi^*$ with a reference policy $\pi_{ref}$
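
A minimal sketch of KL divergence for two discrete distributions, showing the zero and asymmetry properties. The example distributions are made up, and `kl_divergence` is a hypothetical helper name.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probs."""
    # Terms with p_i == 0 contribute nothing; q_i must be > 0 where p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical new policy over 3 responses
q = [0.4, 0.4, 0.2]  # hypothetical reference policy

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(forward, reverse, kl_divergence(p, p))
```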

### Converting max problem to min problem

* negation trick -> turning $\arg\max_w f(w)$ into $\arg\min_w [-f(w)]$
* multiplication by a positive scalar does not affect the location of the optimum, only rescales the func

### From RL objective to DPO

* initial RL setup
    * $\max_{\pi^*} E[r(x,y)] - \beta \ D_{KL}(\pi^*(y|x) || \pi_{ref}(y|x))$

* reformulation
    * multiply by -1, rearrange terms and convert into an expectation
    * express the reward term as a log exponential to combine with $log \pi_{ref}$

* partition func & normalization
    * introduce partition function $Z(x)$ to ensure probs sum to 1, ie
        * $\pi^*(y|x) = \frac{\pi_{ref}(y|x) \ \exp(\frac{1}{\beta} r(x,y))}{Z(x)}$
        * where $Z(x) = \sum_y \pi_{ref}(y|x) \ exp(\frac{1}{\beta} r(x,y))$


### Optimal DPO policy

* closed form $\pi^*(y|x) = \frac{\pi_{ref}(y|x) \ \exp(\frac{1}{\beta} r(x,y))}{Z(x)}$
* interpretation -> the new policy re-weights the reference policy by $\exp(\frac{1}{\beta} r)$
* $\beta$ parameter
    * controls how much we amplify the reward term relative to $\pi_{ref}$
    * smaller $\beta$ -> stronger emphasis on reward, larger $\beta$ -> closer to the reference policy (the reward enters as $\frac{1}{\beta} r$)
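
For a toy case with only three candidate responses the closed form can be evaluated directly, since the partition function is a three-term sum. The reference probabilities and rewards below are invented for the example. Evaluating two values of $\beta$ shows how the parameter trades off reward against staying near $\pi_{ref}$.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y|x) = pi_ref(y|x) * exp(r(x,y) / beta) / Z(x) for a small set of y."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights]

pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference policy
rewards = [0.0, 0.9, 1.0]  # hypothetical rewards

# Large beta: r / beta is small, so the result stays close to pi_ref.
near_ref = optimal_policy(pi_ref, rewards, beta=10.0)
# Small beta: mass concentrates on the highest-reward responses.
reward_heavy = optimal_policy(pi_ref, rewards, beta=0.1)
print(near_ref)
print(reward_heavy)
```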

### Partition function complexity

* $Z(x) = \sum_{y \in V^T} \pi_{ref}(y|x) \ \exp(\frac{1}{\beta} r(x,y))$, where $V^T$ is the set of all possible sequences of length $T$
* exponential growth
    * for large vocabularies and longer sequences, the partition func sum becomes huge
    * direct computation is typically impractical, motivating approximate or sampling-based methods
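
To get a feel for the growth, the number of terms in the sum is $|V|^T$. The sketch below counts them for a hypothetical vocabulary of 50,000 tokens; the vocabulary size and the lengths are illustrative values only.

```python
def num_sequences(vocab_size, length):
    """Number of distinct token sequences of exactly `length` tokens."""
    return vocab_size ** length

VOCAB_SIZE = 50_000  # hypothetical vocabulary size
for T in (1, 2, 5, 10):
    print(f"T = {T:>2}: {num_sequences(VOCAB_SIZE, T):.2e} terms in Z(x)")
```

Even at ten tokens the sum has far more terms than could ever be enumerated, and each term would need a reward evaluation.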

### Summary

* the objective balances maximizing a reward function $r$ against a KL penalty that keeps the policy close to a reference model ($\pi_{ref}$)
* closed form solution $\pi^*(y|x) = \frac{\pi_{ref}(y|x) \ \exp(\frac{1}{\beta} r(x,y))}{Z(x)}$
* $\beta$ is the tuning param that trades off reward maximization against staying close to the reference policy
* computing the partition function exactly is usually not feasible


## From PPO to DPO

* the goal of DPO is to fine-tune a causal language model on pairwise preference data, bypassing the complexity of traditional RL-based methods like PPO

### Ranking vs scoring

* it is hard for humans to numerically score responses; this challenge is addressed with pairwise comparisons, which are simple and still informative
* the Bradley-Terry model is a classical statistical model for pairwise comparisons, used here to guide the preference-based fine-tuning

### Bradley-Terry loss

* $l(\theta) = -ln \ \sigma(s(A)-s(B))$, where $\sigma$ is the logistic function and $s(.)$ is the score
* dataset notation
    * $X$ -> query, $Y_w$ -> winning response, $Y_l$ -> losing response
    * summation or expectation over $(X,Y_w,Y_l)$ samples in the dataset $D$

### DPO optimal policy & reference model

* $\pi^*(y|x) = \frac{\pi_{ref}(y|x) \ \exp(\frac{1}{\beta} r(x,y))}{Z(x)}$, where $r$ is the reward function, $\pi_{ref}$ is a reference policy, and $Z(x)$ is the partition function
* computing $Z(x)$ is not practical, but when we take log ratios of winning vs losing responses the partition func cancels out
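
The cancellation can be seen by solving the closed form for the reward and then subtracting the two rewards for the same query:

$$r(x,y) = \beta \ln \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \ln Z(x)$$

$$r(X,Y_w) - r(X,Y_l) = \beta \ln \frac{\pi^*(Y_w|X)}{\pi_{ref}(Y_w|X)} - \beta \ln \frac{\pi^*(Y_l|X)}{\pi_{ref}(Y_l|X)}$$

The term $\beta \ln Z(X)$ depends only on the query, so it appears in both rewards and drops out of the difference.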

### From policy ratio to loss

* log ratio trick
    * take the difference between the log probs of winning vs losing responses
    * combine it with a form of the reward (or equivalently the LLM outputs + reference model weights)

* Bradley-Terry-like loss
    * the new loss depends only on the ratio $\frac{\pi_\theta(Y_w|X)}{\pi_\theta(Y_l|X)}$ and the reference model, effectively removing need for an explicit reward function or partition function

### Simplified expression & plot

* set $\beta =1$ and reference policy to a constant $C$
* define $u = \frac{\pi_\theta(Y_w|X)}{\pi_\theta(Y_l|X)}$
* as $u$ increases the model is more likely to favor the winning response, loss decreases as $u$ grows above 1, encouraging correct ranking
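
Under these simplifications the loss becomes $-\ln \sigma(\ln u) = \ln(1 + \frac{1}{u})$, which can be tabulated in place of a plot. The sample values of $u$ below are arbitrary.

```python
import math

def simplified_dpo_loss(u):
    """-ln(sigmoid(ln u)) for u > 0, which simplifies to ln(1 + 1/u)."""
    return math.log1p(1.0 / u)

for u in (0.25, 0.5, 1.0, 2.0, 4.0):
    print(f"u = {u:<4}  loss = {simplified_dpo_loss(u):.3f}")
```

At $u = 1$ the model is indifferent between the two responses and the loss equals $\ln 2$; it falls toward 0 as $u$ grows and rises without bound as $u$ approaches 0.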

### Converting loss to cost

* BT model -> $l = -ln \ \sigma (log(\pi_\theta (Y_w|X))-log(\pi_\theta (Y_l|X)))$
* transformation
    * by re-expressing the above as a negative log-likelihood, we get a cost func that is differentiable wrt $\theta$
    * can be implemented directly in PyTorch or with the Hugging Face TRL `DPOTrainer`
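
As a framework-free sketch, the per-pair DPO loss can be computed from four sequence log probs: the policy and the reference model, each on the winning and the losing response. The function and argument names are hypothetical, plain floats stand in for tensors, and $\beta = 0.1$ is just an example value. This is not the API of any particular trainer.

```python
import math

def log_sigmoid(z):
    """Numerically stable ln(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-ln sigmoid(beta * [(log-ratio for winner) - (log-ratio for loser)])."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# Hypothetical sequence log probs.
same_as_ref = dpo_loss(-1.0, -1.0, -1.0, -1.0)     # policy equals reference
prefers_winner = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # policy favors Y_w
prefers_loser = dpo_loss(-3.0, -1.0, -2.0, -2.0)   # policy favors Y_l
print(same_as_ref, prefers_winner, prefers_loser)
```

In a real setup the log probs would be summed token log probs from the two models, and the loss would be averaged over a batch of $(X, Y_w, Y_l)$ triples before backpropagation.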

### Summary

* DPO replaces complex PPO with a direct optimization objective derived from pairwise preference
* partition function can be eliminated through log ratios of winning and losing responses, which cancel out the normalization term
* the BT model shows that pairwise comparison is simpler for humans and yields a straightforward log-loss for optimization
* using the log ratios reduces computational resources required
* resulting diff func -> $r(X,Y_w)-r(X,Y_l) = \beta \ln (\frac{\pi_\theta(Y_w|X)}{\pi_{ref}(Y_w|X)}) - \beta \ln (\frac{\pi_\theta(Y_l|X)}{\pi_{ref}(Y_l|X)})$
* resulting loss func -> $-\ln \sigma (\beta \ln (\frac{\pi_\theta(Y_w|X)}{\pi_{ref}(Y_w|X)}) - \beta \ln (\frac{\pi_\theta(Y_l|X)}{\pi_{ref}(Y_l|X)}))$



