Commit 87bed72

GRPO in the RL intro slides, by Claude Opus
Summaries by Claude:

Summary 1:

The new sections include:

1. **Policy Ratios and Importance Sampling**
   - Explains the limitation of REINFORCE (on-policy only)
   - Introduces policy ratios as a solution for reusing off-policy data
   - Provides concrete examples with numerical interpretations
2. **Clipping**
   - Shows the problem of unbounded policy updates
   - Explains PPO-style clipping with concrete examples
   - Demonstrates how clipping prevents destabilizing updates
3. **KL Divergence Penalty**
   - Introduces KL divergence as an additional safety mechanism
   - Explains the trade-off between β parameters
   - Shows why both clipping AND KL penalty are needed
4. **GRPO Algorithm**
   - Brings everything together showing GRPO as "REINFORCE + Group Baselines + Clipping + KL Penalty"
   - Provides the complete algorithm steps
   - Includes practical implementation code
   - Compares GRPO directly to REINFORCE with constant baseline
5. **GRPO Summary**
   - Synthesizes why GRPO works for LLMs
   - Shows the evolutionary path: REINFORCE → GRPO
   - Emphasizes that GRPO is REINFORCE evolved for modern training

The presentation now provides a complete pedagogical journey from basic REINFORCE to industrial-strength GRPO, with each concept building naturally on the previous ones. The audience will understand not just what GRPO is, but why each component was added and how they work together.

Summary 2:

I've made the flow clearer by:

1. Changed the initial mention from "Group Baselines" to "GRPO Innovation" to avoid assuming prior knowledge
2. Added a dedicated definition of Group Baselines as "The Key Innovation" that explains:
   - What it replaces (historical episodes from different queries)
   - How it works (G responses to same query)
   - Why it's better ("How good was this response compared to other attempts at the same problem?")

This maintains the narrative flow while properly introducing the concept when we first need to understand what makes GRPO special. The audience now gets the full context of why group baselines are GRPO's key innovation before diving into the algorithm details.
1 parent 71f2997 commit 87bed72

File tree

1 file changed (+230, -7 lines)


docs/slides-RL-REINFORCE.md

Lines changed: 230 additions & 7 deletions
@@ -353,18 +353,241 @@ REINFORCE with baseline is a simple actor-critic method:
{pause up=summary}
### Next Steps

- Implement the Sokoban environment
- Implement a policy network model
- Implement REINFORCE
- Enhance with the constant baseline

***

{pause center #policy-ratios}
## Policy Ratios and Importance Sampling

REINFORCE has a fundamental limitation: it can only use data from the **current** policy.

{pause up=policy-ratios}
### The Problem with Policy Updates

After each gradient step, our policy π(a|s,θ) changes. But what about all that expensive experience we just collected?

{.example title="Sokoban Training Reality"}
- Collect 1000 episodes with current policy → expensive!
- Update policy weights θ → policy changes
- Old episodes are now **off-policy** → can't use them directly

{pause center #importance-sampling}
> ### Solution: Importance Sampling
>
> **Key insight**: We can reuse off-policy data by weighting it appropriately.

{pause}

{.definition title="Policy Ratio"}
$$\text{ratio}_{t} = \frac{\pi_{\theta_{new}}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

This tells us how much more (or less) likely the action was under the new policy vs. the old policy.

{pause down}
**Importance-weighted REINFORCE update**:
$$\theta \leftarrow \theta + \alpha \cdot \text{ratio}_t \cdot G_t \cdot \nabla_\theta \ln \pi(a_t|s_t,\theta)$$

{pause up=importance-sampling}
{.example title="Policy Ratio Interpretation"}
- `ratio = 2.0`: New policy twice as likely to take this action
- `ratio = 0.5`: New policy half as likely to take this action
- `ratio = 1.0`: No change in action probability
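
To make the update rule concrete, here is a minimal sketch of one importance-weighted REINFORCE step for a single action. All names and numbers are toy values for illustration, not tied to any particular framework:

```python
import numpy as np

# Toy quantities for one (state, action) pair collected under the old policy
old_policy_prob = 0.10                        # π_θ_old(a_t | s_t)
new_policy_prob = 0.15                        # π_θ_new(a_t | s_t) after some updates
ratio = new_policy_prob / old_policy_prob     # 1.5: the action is now 50% more likely

G_t = 8.0                                     # return observed from this time step
grad_log_prob = np.array([0.3, -0.1, 0.05])   # ∇_θ ln π(a_t|s_t,θ) for a toy 3-parameter policy
alpha = 0.01                                  # learning rate

# Importance-weighted REINFORCE update: θ ← θ + α · ratio · G_t · ∇_θ ln π
theta_step = alpha * ratio * G_t * grad_log_prob
print(ratio, theta_step)                      # 1.5 [ 0.036 -0.012  0.006]
```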

***

{pause center #clipping}
## The Problem: Unbounded Policy Updates

Importance sampling allows off-policy learning, but it creates a new problem: **unbounded ratios**.

{pause up=clipping}

{.example title="When Ratios Explode"}
If the old policy had π_old(action) = 0.01 and the new policy has π_new(action) = 0.9:

**ratio = 0.9 / 0.01 = 90**

With a high return G_t = +10: **update = 90 × 10 = 900**

This massive update can destabilize training!

{pause}

### Solution: Clipped Policy Updates

**PPO-style clipping** limits how much the policy can change in one update:

$$L^{CLIP}(\theta) = \min\left(\text{ratio}_t \cdot A_t, \; \text{clip}(\text{ratio}_t, 1-\epsilon, 1+\epsilon) \cdot A_t\right)$$

{pause}

{.definition title="Clipping Parameters"}
- **ε = 0.2** (typical): Allow 20% change in action probabilities
- **clip(x, 1-ε, 1+ε)**: Forces ratio to stay in [0.8, 1.2] range
- **min(...)**: Takes the more conservative update

{pause down .example title="Clipping in Action (ε = 0.2)"}
- `ratio = 90` → clipped to `1.2` → much smaller update
- `ratio = 0.01` → clipped to `0.8` → prevents tiny updates too
- `ratio = 1.1` → no clipping needed, within [0.8, 1.2]
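
To check the numbers, here is a small NumPy sketch of the clipping rule applied to the three ratios above and to the "exploding" case with an advantage of +10 (all values are illustrative):

```python
import numpy as np

eps = 0.2
ratios = np.array([90.0, 0.01, 1.1])          # the three cases from the example above
print(np.clip(ratios, 1 - eps, 1 + eps))      # [1.2 0.8 1.1]

# PPO-style surrogate for the exploding case, with a positive advantage A = +10
A = 10.0
ratio = 90.0
surrogate = min(ratio * A, np.clip(ratio, 1 - eps, 1 + eps) * A)
print(surrogate)                              # 12.0 instead of 900.0
```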

***

{pause center #kl-penalty}
## KL Divergence: Keeping Policies Close

Even with clipping, we want an additional safety mechanism to prevent the policy from changing too drastically.

{pause up=kl-penalty}

{.definition title="KL Divergence Penalty"}
$$D_{KL}[\pi_{old} \| \pi_{new}] = \sum_a \pi_{old}(a|s) \log \frac{\pi_{old}(a|s)}{\pi_{new}(a|s)}$$

**Measures how different two probability distributions are.**

{pause #kl-objective}
### KL-Regularized Objective

$$L_{total}(\theta) = L_{policy}(\theta) - \beta \cdot D_{KL}[\pi_{old} \| \pi_{new}]$$

{pause}

{.definition title="KL Penalty Parameters"}
- **β**: Controls penalty strength (e.g., 0.01, 0.04)
- **Higher β**: Keeps policy very close to old policy (stable but slow learning)
- **Lower β**: Allows more exploration (faster learning but less stable)

{pause down .example title="KL Penalty in Practice"}
> If policy changes dramatically → high KL divergence → large penalty
> → discourages big changes
>
> The penalty acts like a "trust region" - we trust small changes more than large ones.

{pause up=kl-objective}
> ### Why Both Clipping AND KL Penalty?
>
> **Clipping**: Hard constraint on individual action probabilities
> **KL Penalty**: Soft constraint on overall policy distribution
>
> Together they provide robust stability for policy optimization.
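
To ground the formulas, here is a minimal NumPy sketch that computes the KL divergence between two made-up action distributions and applies the β penalty to an assumed policy-objective value (every number here is illustrative):

```python
import numpy as np

# Action distributions of the old and new policy in one state (4 discrete actions)
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
pi_new = np.array([0.70, 0.10, 0.10, 0.10])    # a fairly drastic change

kl = np.sum(pi_old * np.log(pi_old / pi_new))  # D_KL[π_old || π_new] ≈ 0.43

beta = 0.04
L_policy = 1.5                                 # assumed value of the clipped policy objective
L_total = L_policy - beta * kl                 # the KL term pulls the objective down
print(kl, L_total)
```

With a nearly unchanged π_new the KL term is close to zero, so the penalty only bites when the policy moves far from the old one.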

***

{pause center #grpo-algorithm}
## Group Relative Policy Optimization (GRPO)

Now we can understand GRPO: **REINFORCE + GRPO Innovation + Clipping + KL Penalty**

{pause up=grpo-algorithm}

{.definition title="GRPO: The Complete Picture"}
> GRPO combines all the techniques we've learned:
> 1. **Group baselines** - Compare responses to the **same query** (GRPO's key innovation)
> 2. **Policy ratios** for off-policy learning
> 3. **Clipping** for stable updates
> 4. **KL penalties** for additional safety

{pause}

{.definition title="Group Baselines: The Key Innovation"}
> Instead of comparing to historical episodes from different queries:
> - Generate **G responses** to the **same query**
> - Compute advantages relative to **this group** (worked example below): $A_i = (r_i - \text{mean}_\text{group}) / (\text{std}_\text{group} + ε)$
> - Much better signal: "How good was this response compared to other attempts at the same problem?"
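
A worked instance of the group-advantage formula above, assuming G = 4 responses to one query with made-up reward-model scores:

```python
import numpy as np

# Made-up reward-model scores for G = 4 responses to the same query
rewards = np.array([0.9, 0.2, 0.5, 0.4])

mean_group = rewards.mean()                    # 0.5
std_group = rewards.std()                      # ≈ 0.255
advantages = (rewards - mean_group) / (std_group + 1e-8)
print(advantages)                              # ≈ [ 1.57 -1.18  0.   -0.39]
```

Responses that beat the group average get positive advantages and are reinforced; below-average responses are pushed down, with no cross-query baseline needed.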

{pause center #grpo-steps}
### GRPO Algorithm Steps

{.block title="GRPO for LLM Fine-tuning"}
1. **Sample G responses** per query from current policy π_θ_old
2. **Evaluate rewards** r_i for each response using reward model
3. **Compute group advantages**: A_i = (r_i - mean_group) / (std_group + ε), where ε is a small stabilizing constant (distinct from the clipping ε)
4. **Calculate clipped loss**:
   - ratio_i = π_θ(response_i) / π_θ_old(response_i)
   - L_i = min(ratio_i × A_i, clip(ratio_i, 1-ε, 1+ε) × A_i)
5. **Add KL penalty**: L_total = L_policy - β × KL[π_θ_old || π_θ]
6. **Update policy**: θ ← θ + α ∇_θ L_total (gradient ascent, since L_total is an objective we maximize)

{pause up=grpo-steps}
### Why GRPO Works for LLMs

{.example title="GRPO vs REINFORCE Comparison" #grpo-for-llms}
>
> **REINFORCE with constant baseline**:
> - Baseline: Average of past episodes (different queries)
> - No clipping → unstable updates
> - No policy ratios → must stay on-policy
>
> **GRPO**:
> - **Better baseline**: Compare within responses to **same query**
> - **Clipped updates**: Stable learning with large batches
> - **Policy ratios**: Can reuse data across updates

{pause down=grpo-for-llms}

{#grpo-implementation}
### GRPO Implementation Reality

```python
import numpy as np

# For each query, generate G=4 responses with the current (old) policy
responses = model.generate(query, num_return_sequences=4)

# Compute group-relative advantages from the reward-model scores
rewards = np.array([reward_model(r) for r in responses])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Clipped policy-gradient term, element-wise over the G responses
# (new_probs / old_probs are the response probabilities under π_θ and π_θ_old)
ratios = new_probs / old_probs
clipped_loss = np.minimum(ratios * advantages,
                          np.clip(ratios, 0.8, 1.2) * advantages)
```
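
The snippet above stops at step 4 of the algorithm. A minimal sketch of steps 5 and 6, continuing from the variables above and assuming `old_log_probs` and `new_log_probs` are per-response log-probabilities (hypothetical arrays, not the output of any specific library):

```python
beta = 0.04   # KL penalty strength

# Simple Monte Carlo estimate of KL[π_θ_old || π_θ]: the responses were sampled
# from the old policy, so averaging (log π_old - log π_new) over them estimates
# the KL (practical implementations may use other estimators)
kl_estimate = np.mean(old_log_probs - new_log_probs)

# Step 5: subtract the KL penalty from the clipped policy objective
L_policy = np.mean(clipped_loss)
L_total = L_policy - beta * kl_estimate

# Step 6: gradient ascent on L_total; equivalently, hand -L_total to a
# gradient-descent optimizer as the loss
loss = -L_total
```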

***

{pause up=grpo-implementation #grpo-summary}
## GRPO: Why It Works

{.block title="Key Insights"}
>
> > **Group baselines are better** - Compare responses to the same query, not different queries
> >
> > **Clipping prevents instability** - Large policy updates are dangerous
> >
> > **KL penalties add safety** - Trust regions keep learning stable
> >
> > **Perfect for LLM fine-tuning** - Generate multiple responses easily

{pause up=grpo-summary}

### The Evolution: REINFORCE → GRPO

1. **REINFORCE**: Basic policy gradients with historical baselines
2. **+ Clipping**: Stable updates with policy ratios
3. **+ KL Penalty**: Additional safety through regularization
4. **+ Group Baselines**: Better comparisons for the LLM setting
5. **= GRPO**: Industrial-strength policy optimization for LLMs

{pause}

**GRPO doesn't replace REINFORCE - it's REINFORCE evolved for modern LLM training.**

**You now understand the complete journey from REINFORCE to GRPO!**

***

{pause down=fin}

## References

{#refs-sutton-barto}
Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction* (2nd ed.). MIT Press.

Skiredj, A. (2025). *The illustrated GRPO: A detailed and pedagogical explanation of the GRPO algorithm*. OCP Solutions & Mohammed VI Polytechnic University, Morocco.

{#fin}
