
Commit 632475b

Pre-initial sketch of tutorial slides, by Claude
Summary by Claude:

- Start with RL fundamentals for audiences familiar with neural network training but new to RL
- Progress logically from basic concepts to the REINFORCE algorithm
- Include mathematical details with proper LaTeX formatting
- Incorporate key quotes from Sutton & Barto to ground the presentation
- Use slipshow navigation with strategic pauses and viewport control
- Provide practical implementation guidance for neural network frameworks

The presentation covers the core RL framework, policy gradients, the policy gradient theorem, REINFORCE algorithm steps, variance reduction with baselines, and practical considerations including actor-critic methods.
1 parent aec63f5 commit 632475b


docs/RL_Introduction-REINFORCE.md

Lines changed: 324 additions & 0 deletions
@@ -0,0 +1,324 @@
# Reinforcement Learning: An Introduction to REINFORCE

Welcome to reinforcement learning! If you're familiar with supervised learning and neural network training, you're about to discover a fundamentally different approach to machine learning.

{pause}

## What is Reinforcement Learning? {#rl-definition}

{.definition title="Reinforcement Learning"}
Instead of learning from labeled examples, an **agent** learns by **acting** in an **environment** and receiving **rewards**.

{pause up=rl-definition}

### The RL Framework

> **Agent**: The learner (your neural network)
>
> **Environment**: The world the agent interacts with
>
> **Actions**: What the agent can do
>
> **States**: What the agent observes
>
> **Rewards**: Feedback signal (positive or negative)

{pause}

Think of it like learning to play a game:
- You don't know the rules initially
- You try actions and see what happens
- Good moves get rewarded, bad moves get punished
- You gradually learn a strategy

{pause center=rl-definition}

---

## Key Differences from Supervised Learning {#differences}

{.block title="Supervised Learning"}
- Fixed dataset with input-output pairs
- Learn to minimize prediction error
- Single training phase

{pause}

{.block title="Reinforcement Learning"}
- Dynamic interaction with environment
- Learn to maximize cumulative reward
- Continuous learning from experience

{pause}

**No labeled data** - the agent must discover what actions are good through trial and error.

{pause down=differences}

---

## The Policy: Your Agent's Strategy {#policy-intro}

{.definition title="Policy π(a|s)"}
The probability of taking action **a** in state **s**.

This is what your neural network learns to represent!

{pause up=policy-intro}

### Why Probabilistic Policies?

From Sutton & Barto:

> "action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values"

{pause}

**Smooth changes** = **stable learning**

{pause}

{.example title="Policy Examples"}
- **Discrete actions**: Softmax over action preferences
- **Continuous actions**: Mean and variance of Gaussian distribution
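
To make the discrete case concrete, here is a minimal OCaml sketch of a softmax policy, assuming a plain-array representation and nothing beyond the standard library; the `preferences` array stands in for the per-action scores your network would produce, and the names `softmax` and `sample_action` are ours, not from any particular framework.

```ocaml
(* Turn per-action preferences into a probability distribution. *)
let softmax preferences =
  let max_p = Array.fold_left max neg_infinity preferences in
  let exps = Array.map (fun p -> exp (p -. max_p)) preferences in
  let sum = Array.fold_left ( +. ) 0. exps in
  Array.map (fun e -> e /. sum) exps

(* Sample an action index from the distribution returned by softmax. *)
let sample_action probs =
  let r = Random.float 1.0 in
  let rec pick i acc =
    if i >= Array.length probs - 1 then i
    else if acc +. probs.(i) >= r then i
    else pick (i + 1) (acc +. probs.(i))
  in
  pick 0 0.
```

Subtracting the maximum preference before exponentiating keeps the computation numerically stable without changing the resulting probabilities.
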
{pause center=policy-intro}

---

## Episodes and Returns {#episodes}

{.definition title="Episode"}
A complete sequence of interactions from start to terminal state.

{.definition title="Return G_t"}
The total reward from time step t until the end of the episode:
$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$
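
As a quick sanity check on this definition, the sketch below computes every $G_t$ for an episode, assuming `rewards.(t)` holds $R_{t+1}$ (the reward received after the action at step t); accumulating backwards reuses each suffix sum instead of recomputing it.

```ocaml
(* rewards.(t) holds R_{t+1}; the result holds G_t for every step t. *)
let returns_of_rewards rewards =
  let n = Array.length rewards in
  let returns = Array.make n 0. in
  let acc = ref 0. in
  for t = n - 1 downto 0 do
    acc := rewards.(t) +. !acc;
    returns.(t) <- !acc
  done;
  returns

(* Example: returns_of_rewards [| 0.; 0.; 1. |] = [| 1.; 1.; 1. |] *)
```
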
{pause up=episodes}

### The Goal

**Maximize expected return** by learning a better policy.

{pause}

But how do we improve a policy that's represented by a neural network?

{pause down=episodes}

---

## Enter REINFORCE {#reinforce-intro}

{.theorem title="The REINFORCE Algorithm"}
A **policy gradient** method that directly optimizes the policy parameters to maximize expected return.

{pause up=reinforce-intro}

### Core Insight

We want to:
1. **Increase** the probability of actions that led to high returns
2. **Decrease** the probability of actions that led to low returns

{pause}

From Sutton & Barto:

> "it causes the parameter to move most in the directions that favor actions that yield the highest return"

{pause center=reinforce-intro}

---

## The Policy Gradient Theorem {#gradient-theorem}

The gradient of expected return with respect to policy parameters θ:

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a|s,\theta)$$

{pause}

This looks complicated, but REINFORCE gives us a simple way to estimate it!

{pause}

{.theorem title="REINFORCE Gradient Estimate"}
$$\nabla_\theta J(\theta) \propto \mathbb{E}_\pi\left[G_t \nabla_\theta \ln \pi(A_t|S_t,\theta)\right]$$
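
To sketch where this comes from (the standard derivation): sampling $S_t$ and $A_t \sim \pi$ turns the sums into an expectation, the identity $\nabla_\theta \pi = \pi \, \nabla_\theta \ln \pi$ introduces the logarithm, and the sampled return can replace the action value because $\mathbb{E}_\pi[G_t \mid S_t, A_t] = q_\pi(S_t, A_t)$:

$$\sum_a q_\pi(s,a)\, \nabla_\theta \pi(a|s,\theta) = \sum_a \pi(a|s,\theta)\, q_\pi(s,a)\, \nabla_\theta \ln \pi(a|s,\theta) = \mathbb{E}_{A_t \sim \pi}\left[q_\pi(s,A_t)\, \nabla_\theta \ln \pi(A_t|s,\theta)\right]$$
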
{pause up=gradient-theorem}

### What This Means

From Sutton & Barto:

> "Each increment is proportional to the product of a return G_t and a vector, the gradient of the probability of taking the action actually taken divided by the probability of taking that action"

{pause down=gradient-theorem}

---

## REINFORCE Algorithm Steps {#algorithm}

{.block title="REINFORCE Algorithm"}
1. **Initialize** policy parameters θ randomly
2. **For each episode**:
   - Generate episode following π(·|·,θ)
   - For each step t in episode:
     - Calculate return: $G_t = \sum_{k=t+1}^T R_k$
     - Update: $\theta \leftarrow \theta + \alpha G_t \nabla_\theta \ln \pi(A_t|S_t,\theta)$
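
Putting the loop into code, here is a hedged OCaml sketch of one REINFORCE update for a small tabular softmax policy, where `theta.(s).(a)` is the preference for action `a` in state `s`; it reuses the `softmax` and `returns_of_rewards` helpers sketched earlier. For a softmax over preferences, the gradient of $\ln \pi(a|s)$ with respect to state `s`'s preferences is one minus the taken action's probability for that action, and minus the probability for every other action, so this toy setting needs no automatic differentiation.

```ocaml
(* One REINFORCE update from a single episode (tabular softmax policy).
   episode : (state, action) pairs in order; rewards.(t) holds R_{t+1}. *)
let reinforce_update ~alpha theta episode rewards =
  let returns = returns_of_rewards rewards in
  List.iteri
    (fun t (s, a) ->
      let probs = softmax theta.(s) in
      Array.iteri
        (fun a' p ->
          (* d/d theta.(s).(a') of ln pi(a|s) = (1 if a' = a else 0) - p *)
          let grad = (if a' = a then 1. -. p else -. p) in
          theta.(s).(a') <- theta.(s).(a') +. alpha *. returns.(t) *. grad)
        probs)
    episode
```
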
{pause up=algorithm}

### Key Properties

From Sutton & Barto:

> "REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode"

{pause}

This makes it an **unbiased** but **high variance** estimator.

{pause center=algorithm}

---

## Implementation in Neural Networks {#implementation}

If your policy network outputs action probabilities, the gradient update becomes:

```ocaml
(* compute_gradient_log_prob and update_parameters are placeholders for your
   framework's automatic differentiation and optimizer step. *)
(* Compute the log-probability gradient for the action actually taken *)
let log_prob_grad = compute_gradient_log_prob action_taken state in
(* Scale by the return observed from this time step *)
let policy_grad = g_t *. log_prob_grad in
(* Take a gradient ascent step on the policy parameters *)
update_parameters policy_grad learning_rate
```

{pause up=implementation}

### In Practice

You'll typically:
1. Use **automatic differentiation** to compute ∇ ln π
2. **Collect episodes** in batches for stability
3. Apply **baseline subtraction** to reduce variance
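
For points 2 and 3, a common first step is to center the returns collected over a batch before using them as weights, which already behaves like a crude constant baseline; a minimal sketch:

```ocaml
(* Subtract the batch-average return so that above-average actions are
   reinforced and below-average ones are discouraged. *)
let center_returns returns =
  if Array.length returns = 0 then returns
  else
    let n = float_of_int (Array.length returns) in
    let mean = Array.fold_left ( +. ) 0. returns /. n in
    Array.map (fun g -> g -. mean) returns
```
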
{pause down=implementation}

---

## Reducing Variance with Baselines {#baselines}

REINFORCE can be **very noisy**. We can subtract a baseline b(s) from returns:

$$\nabla_\theta J(\theta) \propto \mathbb{E}_\pi\left[(G_t - b(S_t)) \nabla_\theta \ln \pi(A_t|S_t,\theta)\right]$$
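
A one-line check of why an action-independent baseline leaves the gradient unbiased: for any fixed state, the subtracted term vanishes because the action probabilities sum to one:

$$\sum_a b(s)\, \nabla_\theta \pi(a|s,\theta) = b(s)\, \nabla_\theta \sum_a \pi(a|s,\theta) = b(s)\, \nabla_\theta 1 = 0$$
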
{pause up=baselines}

From Sutton & Barto:

> "The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero"

{pause}

> "In some states all actions have high values and we need a high baseline to differentiate the higher valued actions from the less highly valued ones"

{pause}

{.example title="Common Baselines"}
- **Constant**: Average return over recent episodes
- **State-dependent**: Value function V(s) learned separately

{pause center=baselines}

---

## REINFORCE with Baseline {#reinforce-baseline}

{.block title="REINFORCE with Baseline Algorithm"}

1. **Initialize** policy parameters θ and baseline parameters w
2. **For each episode**:
   - Generate episode following π(·|·,θ)
   - For each step t:
     - $G_t = \sum_{k=t+1}^T R_k$
     - $\delta = G_t - b(S_t,w)$
     - $\theta \leftarrow \theta + \alpha_\theta \delta \nabla_\theta \ln \pi(A_t|S_t,\theta)$
     - $w \leftarrow w + \alpha_w \delta \nabla_w b(S_t,w)$
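
A hedged OCaml sketch of these coupled updates in the tabular setting, extending the earlier `reinforce_update`: `values.(s)` plays the role of the learned baseline b(s,w), and its update is a simple step toward the observed return.

```ocaml
(* REINFORCE with a learned state-value baseline (tabular sketch).
   theta.(s).(a) : action preference; values.(s) : baseline estimate of G_t. *)
let reinforce_baseline_update ~alpha_theta ~alpha_w theta values episode rewards =
  let returns = returns_of_rewards rewards in
  List.iteri
    (fun t (s, a) ->
      let delta = returns.(t) -. values.(s) in
      (* Critic: move the baseline toward the observed return. *)
      values.(s) <- values.(s) +. alpha_w *. delta;
      (* Actor: same softmax-policy update as before, weighted by delta. *)
      let probs = softmax theta.(s) in
      Array.iteri
        (fun a' p ->
          let grad = (if a' = a then 1. -. p else -. p) in
          theta.(s).(a') <- theta.(s).(a') +. alpha_theta *. delta *. grad)
        probs)
    episode
```
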
{pause up=reinforce-baseline}

The baseline is learned to predict expected returns, reducing variance without introducing bias.

{pause down=reinforce-baseline}

---

## Practical Considerations {#practical}

### Learning Rates

From Sutton & Barto:

> "Choosing the step size for values (here α_w) is relatively easy... much less clear how to set the step size for the policy parameters"

{pause up=practical}

**Policy updates are more sensitive** - start with smaller learning rates for θ.

{pause}

### Actor-Critic Methods

From Sutton & Barto:

> "Methods that learn approximations to both policy and value functions are often called actor–critic methods"

{pause}

REINFORCE with baseline is a simple actor-critic method:
- **Actor**: The policy π(a|s,θ)
- **Critic**: The baseline b(s,w)

{pause center=practical}

---

## Summary {#summary}

{.block title="Key Takeaways"}

**RL learns from interaction**, not labeled data

**REINFORCE optimizes policies directly** using policy gradients

**Returns weight gradient updates** - high returns → strengthen action probabilities

**Baselines reduce variance** without introducing bias

**Actor-critic architectures** combine policy and value learning

{pause up=summary}

### Next Steps

- Implement REINFORCE on a simple environment
- Experiment with different baseline functions
- Explore more advanced policy gradient methods (PPO, A3C)
- Consider trust region methods for more stable updates

{pause}

**You now have the foundation to start learning policies through interaction!**

{pause center=summary}

---

## References

Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction* (2nd ed.). MIT Press.
