# Reinforcement Learning: An Introduction to REINFORCE

Welcome to reinforcement learning! If you're familiar with supervised learning and neural network training, you're about to discover a fundamentally different approach to machine learning.

{pause}

## What is Reinforcement Learning? {#rl-definition}

{.definition title="Reinforcement Learning"}
Instead of learning from labeled examples, an **agent** learns by **acting** in an **environment** and receiving **rewards**.

{pause up=rl-definition}

### The RL Framework

> **Agent**: The learner (your neural network)
>
> **Environment**: The world the agent interacts with
>
> **Actions**: What the agent can do
>
> **States**: What the agent observes
>
> **Rewards**: Feedback signal (positive or negative)

{pause}

Think of it like learning to play a game:
- You don't know the rules initially
- You try actions and see what happens
- Good moves get rewarded, bad moves get punished
- You gradually learn a strategy

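To make the framework above concrete, here is a minimal OCaml sketch of the agent-environment interface; the type and module names are illustrative, not from any particular library:

```ocaml
(* A minimal, illustrative agent-environment interface. *)
type state = float array          (* what the agent observes *)
type action = int                 (* what the agent can do *)

type step_result = {
  next_state : state;
  reward : float;                 (* feedback signal, positive or negative *)
  terminal : bool;                (* true when the episode ends *)
}

(* An environment exposes a way to start an episode and to take one step. *)
module type ENV = sig
  val reset : unit -> state
  val step : state -> action -> step_result
end
```
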
{pause center=rl-definition}

---

## Key Differences from Supervised Learning {#differences}

{.block title="Supervised Learning"}
- Fixed dataset with input-output pairs
- Learn to minimize prediction error
- Single training phase

{pause}

{.block title="Reinforcement Learning"}
- Dynamic interaction with environment
- Learn to maximize cumulative reward
- Continuous learning from experience

{pause}

**No labeled data** - the agent must discover what actions are good through trial and error.

{pause down=differences}

---

## The Policy: Your Agent's Strategy {#policy-intro}

{.definition title="Policy π(a|s)"}
The probability of taking action **a** in state **s**.

This is what your neural network learns to represent!

{pause up=policy-intro}

### Why Probabilistic Policies?

From Sutton & Barto:

> "action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values"

{pause}

**Smooth changes** = **stable learning**

{pause}

{.example title="Policy Examples"}
- **Discrete actions**: Softmax over action preferences
- **Continuous actions**: Mean and variance of Gaussian distribution

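For the discrete case, here is a small OCaml sketch of a softmax policy over action preferences; the preference scores (logits) are assumed to come from your network, and the function names are illustrative:

```ocaml
(* Turn action preferences (logits) into a probability distribution. *)
let softmax logits =
  let m = Array.fold_left max neg_infinity logits in
  let exps = Array.map (fun z -> exp (z -. m)) logits in   (* subtract max for stability *)
  let sum = Array.fold_left ( +. ) 0. exps in
  Array.map (fun e -> e /. sum) exps

(* Sample an action index from the probabilities pi(.|s). *)
let sample_action probs =
  let r = Random.float 1.0 in
  let rec pick i acc =
    if i >= Array.length probs - 1 then i
    else
      let acc = acc +. probs.(i) in
      if r < acc then i else pick (i + 1) acc
  in
  pick 0 0.0
```

Because the probabilities come from a smooth softmax of the preferences, a small change in the parameters moves them smoothly, which is exactly the stability argument quoted above.
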
{pause center=policy-intro}

---

## Episodes and Returns {#episodes}

{.definition title="Episode"}
A complete sequence of interactions from start to terminal state.

{.definition title="Return G_t"}
The total reward from time step t until the end of the episode:
$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$$

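Computing returns is a single backward pass over the episode's rewards. A minimal sketch, undiscounted to match the formula above (a discount factor γ would multiply the accumulated term):

```ocaml
(* Given one episode's rewards [R_1; ...; R_T] (oldest first), return the
   list of returns-to-go: for each step, the sum of all rewards from that
   step to the end of the episode. *)
let returns_to_go rewards =
  List.fold_right
    (fun r acc ->
      match acc with
      | [] -> [ r ]
      | g :: _ -> (r +. g) :: acc)
    rewards []
```

For example, `returns_to_go [1.; 0.; 2.]` evaluates to `[3.; 2.; 2.]`.
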
{pause up=episodes}

### The Goal

**Maximize expected return** by learning a better policy.

{pause}

But how do we improve a policy that's represented by a neural network?

{pause down=episodes}

---

## Enter REINFORCE {#reinforce-intro}

{.theorem title="The REINFORCE Algorithm"}
A **policy gradient** method that directly optimizes the policy parameters to maximize expected return.

{pause up=reinforce-intro}

### Core Insight

We want to:
1. **Increase** the probability of actions that led to high returns
2. **Decrease** the probability of actions that led to low returns

{pause}

From Sutton & Barto:

> "it causes the parameter to move most in the directions that favor actions that yield the highest return"

{pause center=reinforce-intro}

---

## The Policy Gradient Theorem {#gradient-theorem}

The gradient of expected return with respect to policy parameters θ:

$$\nabla_\theta J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a) \nabla_\theta \pi(a|s,\theta)$$

{pause}

This looks complicated, but REINFORCE gives us a simple way to estimate it!

{pause}

{.theorem title="REINFORCE Gradient Estimate"}
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[G_t \nabla_\theta \ln \pi(A_t|S_t,\theta)\right]$$

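The step from the theorem to this sample-based estimate is the log-derivative identity $\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta)\,\nabla_\theta \ln \pi(a|s,\theta)$, which turns the weighted sum over all actions into an expectation over the action the policy actually samples:

$$\sum_a q_\pi(s,a)\,\nabla_\theta \pi(a|s,\theta) = \sum_a \pi(a|s,\theta)\, q_\pi(s,a)\,\nabla_\theta \ln \pi(a|s,\theta) = \mathbb{E}_{A_t \sim \pi}\left[q_\pi(s,A_t)\,\nabla_\theta \ln \pi(A_t|s,\theta)\right]$$

Sampling states as the policy visits them and replacing $q_\pi(S_t,A_t)$ with the observed return $G_t$ (whose expectation given $S_t, A_t$ is exactly $q_\pi(S_t,A_t)$) yields the estimate above.
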
{pause up=gradient-theorem}

### What This Means

From Sutton & Barto:

> "Each increment is proportional to the product of a return G_t and a vector, the gradient of the probability of taking the action actually taken divided by the probability of taking that action"

{pause down=gradient-theorem}

---

## REINFORCE Algorithm Steps {#algorithm}

{.block title="REINFORCE Algorithm"}
1. **Initialize** policy parameters θ randomly
2. **For each episode**:
   - Generate episode following π(·|·,θ)
   - For each step t in episode:
     - Calculate return: $G_t = \sum_{k=t+1}^T R_k$
     - Update: $\theta \leftarrow \theta + \alpha G_t \nabla_\theta \ln \pi(A_t|S_t,\theta)$

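In code, the whole algorithm is one loop over episodes and one inner loop over steps. A minimal OCaml sketch, reusing the `returns_to_go` function from earlier and assuming hypothetical helpers `run_episode` (plays one episode with the current parameters) and `grad_log_prob` (computes ∇_θ ln π):

```ocaml
(* One REINFORCE training run (sketch); theta is a mutable float array of
   policy parameters, alpha is the step size. *)
let reinforce ~episodes ~alpha ~theta =
  for _ = 1 to episodes do
    (* Generate an episode following pi(.|., theta):
       a list of (state, action, reward-received) triples. *)
    let trajectory = run_episode theta in
    let rewards = List.map (fun (_, _, r) -> r) trajectory in
    let returns = returns_to_go rewards in        (* G_t for every step t *)
    (* For each step t: theta <- theta + alpha * G_t * grad log pi(A_t|S_t,theta) *)
    List.iter2
      (fun (s, a, _) g_t ->
        let grad = grad_log_prob theta s a in     (* same length as theta *)
        Array.iteri
          (fun i d -> theta.(i) <- theta.(i) +. (alpha *. g_t *. d))
          grad)
      trajectory returns
  done
```
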
{pause up=algorithm}

### Key Properties

From Sutton & Barto:

> "REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode"

{pause}

This makes it an **unbiased** but **high-variance** estimator.

{pause center=algorithm}

---

## Implementation in Neural Networks {#implementation}

If your policy network outputs action probabilities, a single update step looks like this (schematically, with placeholder helper functions):

```ocaml
(* Compute the log-probability gradient for the action actually taken;
   compute_gradient_log_prob and update_parameters are placeholder helpers. *)
let log_prob_grad = compute_gradient_log_prob action_taken state in
(* Scale by the return observed from this step onward *)
let policy_grad = g_t *. log_prob_grad in
(* Take a gradient-ascent step on the policy parameters *)
update_parameters policy_grad learning_rate
```

{pause up=implementation}

### In Practice

You'll typically:
1. Use **automatic differentiation** to compute ∇ ln π
2. **Collect episodes** in batches for stability (see the sketch below)
3. Apply **baseline subtraction** to reduce variance

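One common recipe that combines points 2 and 3 is to collect a batch of returns and standardize them before scaling the gradients; the batch mean acts as a simple constant baseline. A sketch (the function name is chosen here for illustration):

```ocaml
(* Standardize a batch of returns: subtract the mean (a constant baseline)
   and divide by the standard deviation to keep update magnitudes stable. *)
let standardize_returns returns =
  let n = float_of_int (Array.length returns) in
  let mean = Array.fold_left ( +. ) 0. returns /. n in
  let var =
    Array.fold_left (fun acc g -> acc +. ((g -. mean) ** 2.)) 0. returns /. n
  in
  let std = sqrt var +. 1e-8 in        (* small epsilon avoids division by zero *)
  Array.map (fun g -> (g -. mean) /. std) returns
```
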
{pause down=implementation}

---

## Reducing Variance with Baselines {#baselines}

REINFORCE can be **very noisy**. We can subtract a baseline b(s) from returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[(G_t - b(S_t)) \nabla_\theta \ln \pi(A_t|S_t,\theta)\right]$$

{pause up=baselines}

From Sutton & Barto:

> "The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero"

{pause}

> "In some states all actions have high values and we need a high baseline to differentiate the higher valued actions from the less highly valued ones"

{pause}

{.example title="Common Baselines"}
- **Constant**: Average return over recent episodes
- **State-dependent**: Value function V(s) learned separately

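The constant variant is only a few lines of code. A sketch of a running-average baseline (the decay rate is an arbitrary illustrative choice); the returned value is what you subtract from $G_t$ before the update:

```ocaml
(* A constant (running-average) baseline over recent returns.
   The returned closure reports the current baseline and then folds the
   new return into the running average. *)
let make_average_baseline ?(decay = 0.99) () =
  let avg = ref 0.0 in
  fun g_t ->
    let baseline = !avg in
    avg := (decay *. baseline) +. ((1.0 -. decay) *. g_t);
    baseline
```

Usage: create it once with `let baseline = make_average_baseline () in`, then use `g_t -. baseline g_t` in place of `g_t` in the update.
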
{pause center=baselines}

---

## REINFORCE with Baseline {#reinforce-baseline}

{.block title="REINFORCE with Baseline Algorithm"}

1. **Initialize** policy parameters θ and baseline parameters w
2. **For each episode**:
   - Generate episode following π(·|·,θ)
   - For each step t:
     - $G_t = \sum_{k=t+1}^T R_k$
     - $\delta = G_t - b(S_t,w)$
     - $\theta \leftarrow \theta + \alpha_\theta \delta \nabla_\theta \ln \pi(A_t|S_t,\theta)$
     - $w \leftarrow w + \alpha_w \delta \nabla_w b(S_t,w)$

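Relative to plain REINFORCE, only the inner-loop body changes: compute δ once, then apply it to both parameter vectors with their own step sizes. A sketch of that inner step, assuming hypothetical helpers `baseline_value` and `baseline_grad` for $b(s,w)$ and $\nabla_w b(s,w)$, plus the same `grad_log_prob` as before:

```ocaml
(* One time step of REINFORCE with baseline (sketch): theta and w are
   mutable float arrays, updated in place with their own step sizes. *)
let step_with_baseline ~alpha_theta ~alpha_w ~theta ~w ~s ~a ~g_t =
  let delta = g_t -. baseline_value w s in          (* delta = G_t - b(S_t, w) *)
  let policy_grad = grad_log_prob theta s a in      (* grad_theta ln pi(a|s, theta) *)
  Array.iteri
    (fun i d -> theta.(i) <- theta.(i) +. (alpha_theta *. delta *. d))
    policy_grad;
  let b_grad = baseline_grad w s in                 (* grad_w b(s, w) *)
  Array.iteri
    (fun i d -> w.(i) <- w.(i) +. (alpha_w *. delta *. d))
    b_grad
```
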
{pause up=reinforce-baseline}

The baseline is learned to predict expected returns, reducing variance without introducing bias.

{pause down=reinforce-baseline}

---

## Practical Considerations {#practical}

### Learning Rates

From Sutton & Barto:

> "Choosing the step size for values (here α_w) is relatively easy... much less clear how to set the step size for the policy parameters"

{pause up=practical}

**Policy updates are more sensitive** - start with smaller learning rates for θ.

{pause}

### Actor-Critic Methods

From Sutton & Barto:

> "Methods that learn approximations to both policy and value functions are often called actor–critic methods"

{pause}

REINFORCE with baseline is a step toward actor-critic methods:
- **Actor**: The policy π(a|s,θ)
- **Critic**: The baseline b(s,w)

{pause center=practical}

---

## Summary {#summary}

{.block title="Key Takeaways"}

✓ **RL learns from interaction**, not labeled data

✓ **REINFORCE optimizes policies directly** using policy gradients

✓ **Returns weight gradient updates** - high returns → strengthen action probabilities

✓ **Baselines reduce variance** without introducing bias

✓ **Actor-critic architectures** combine policy and value learning

{pause up=summary}

### Next Steps

- Implement REINFORCE on a simple environment
- Experiment with different baseline functions
- Explore more advanced policy gradient methods (PPO, A3C)
- Consider trust region methods for more stable updates

{pause}

**You now have the foundation to start learning policies through interaction!**

{pause center=summary}

---

## References

Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction* (2nd ed.). MIT Press.