# AIO Q3 — Q-Learning: From Bellman to Approximation

## What is Reinforcement Learning?

Imagine teaching a dog new tricks. You don't give it a manual. Instead, you reward good behavior with treats and discourage bad behavior. Over time, the dog will learn which actions lead to rewards. **Reinforcement Learning (RL)** works the same way: an agent learns to make decisions by interacting with an environment and receiving feedback.

**Why does this matter?**
- **Game AI:** RL trained AlphaGo to defeat world champions and taught AI to master Atari games from raw pixels
- **Robotics:** Robots learn to walk, grasp objects, and navigate without explicit programming  
- **Self-driving cars:** Learning optimal driving policies from experience
- **Healthcare:** Personalized treatment recommendations that adapt over time

Unlike supervised learning (where we have "correct" answers), RL agents must **discover** good strategies through trial and error, balancing *exploration* (trying new things) and *exploitation* (using what works).

---

## Setup: Markov Decision Process (MDP)

We model decision-making problems as a **Markov Decision Process (MDP)** with five components:

- **States** $s \in \mathcal{S}$: Where the agent currently is (e.g., Pacman's position on the grid)
- **Actions** $a \in \mathcal{A}$: What the agent can do (e.g., UP, DOWN, LEFT, RIGHT)
- **Transitions** $T(s,a,s') = P(s'|s,a)$: Probability of reaching state $s'$ after taking action $a$ in state $s$ (e.g., 80% chance of moving in intended direction, 20% slip sideways)
- **Rewards** $R(s,a,s')$: Immediate feedback for a transition (e.g., +10 for eating a pellet, -500 for getting eaten by a ghost)
- **Discount factor** $\gamma \in [0,1)$: How much we value future rewards vs. immediate ones (e.g., with $\gamma = 0.9$, a reward received one step later is worth 90% of the same reward received now)

**The Goal:** Find a *policy* $\pi: \mathcal{S} \to \mathcal{A}$, mapping each state to an action, that maximizes the expected cumulative discounted reward.

The optimal **Q-function** (action-value function) tells us how good it is to take action $a$ in state $s$ and then act optimally. It satisfies the **Bellman optimality equation**:

$$
Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]
$$
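
When $T$ and $R$ are known, the Bellman optimality equation can be applied repeatedly as an update until the values stop changing (Q-value iteration). The following is a minimal sketch on a hypothetical two-state MDP; the states, actions, transition probabilities, and rewards are invented for illustration and are not part of this assignment.

```python
# Hypothetical two-state MDP, invented for illustration only.
# T[(s, a)] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
STATES = ["A", "B"]
ACTIONS = ["stay", "go"]
T = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}

def bellman_backup(Q, s, a):
    """Right-hand side of the Bellman optimality equation for one (s, a)."""
    return sum(
        p * (r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
        for p, s2, r in T[(s, a)]
    )

def q_value_iteration(sweeps=500):
    """Start from Q = 0 and apply the backup to every (s, a) pair each sweep."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        Q = {(s, a): bellman_backup(Q, s, a) for s in STATES for a in ACTIONS}
    return Q

Q_star = q_value_iteration()
```

After enough sweeps, `Q_star` is (numerically) a fixed point of `bellman_backup`, which is exactly what the equation above states.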


## Q1 — From Bellman to Sample-Based Update - 9 Points

In reinforcement learning, we don't know $T(s,a,s')$ or $R(s,a,s')$. Instead, we observe samples $(s, a, r, s')$.

The Q-learning update rule is:

$$
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
$$
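
As a rough illustration, the update rule might be coded as below. This is a sketch under stated assumptions: Q-values live in a dictionary keyed by `(state, action)`, unseen pairs default to 0, terminal states are not handled, and the state names and numbers are hypothetical.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one sample-based Q-learning update in place.

    Returns the bracketed term of the update rule (the TD error).
    """
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    td_error = sample - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# Hypothetical numbers, chosen only to exercise the function.
Q = defaultdict(float)          # unseen pairs default to 0.0
Q[("s1", "left")] = 2.0
Q[("s1", "right")] = 4.0
delta = q_learning_update(Q, "s0", "right", r=1.0, s_next="s1",
                          actions=["left", "right"], alpha=0.5, gamma=0.9)
```

Here the sample is $1 + 0.9 \cdot 4 = 4.6$, so `Q[("s0", "right")]` moves from 0 halfway toward 4.6.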

**Part (a) - 5 Points:** The update rule can be rewritten as an **exponential moving average**:

$$
Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') \right)
$$

After $k$ updates to the same pair $(s,a)$, what is the weight (the coefficient that multiplies it in the formula) on the **original** Q-value? What does this imply as $k \to \infty$?

```
Answer (a):


```

**Part (b) - 4 Points:** What is the interpretation of $\alpha$ in terms of weighting old vs. new information?

```
Answer (b):


```


## Q2 — The TD Error - 14 Points

The **temporal difference (TD) error** measures the "surprise": the difference between what we *expected* to get (our current Q-value) and what we *actually* observed (the reward plus estimated future value). It is defined as:

$$
\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)
$$

**Part (a) - 6 Points:** The TD error compares two estimates of the value of taking action $a$ in state $s$:

- **Current estimate:** $Q(s,a)$
- **Sample-based estimate:** $r + \gamma \max_{a'} Q(s',a')$

Show that when the agent consistently **underestimates** the value of a state-action pair (that is, $Q(s,a)$ is too low), $\delta$ will be positive. What happens to $Q(s,a)$ over time in this case?

```
Answer (a):


```

**Part (b) - 5 Points:** What does it mean when $\delta > 0$? What about $\delta < 0$? Interpret in terms of "surprise."

```
Answer (b):


```

**Part (c) - 3 Points:** At convergence (when $Q = Q^*$), what is the expected value of $\delta$? Justify your answer.

```
Answer (c):


```


## Q3 — Exploration Method: $\epsilon$-Greedy Analysis - 11 Points

In $\epsilon$-greedy, given a value $\epsilon \in [0,1]$, the agent chooses between two behaviors at random:
- Exploration: Take a random action with probability $\epsilon$ ($\epsilon = 0.1$ means explore 10% of the time)
- Exploitation: Take the greedy action $\arg\max_a Q(s,a)$ with probability $1-\epsilon$. This action is called "greedy" because the agent grabs whatever looks best right now based on current estimates, without considering that exploring might reveal something better.
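
The two behaviors above can be sketched in a few lines. This sketch assumes the random action is drawn uniformly from all available actions (the greedy one included) and that the Q-values of the current state are given as a dictionary; the function name and the `rng` parameter are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict {action: Q(s, action)} for the current state."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)            # explore: uniform over all actions
    return max(actions, key=q_values.get)     # exploit: greedy action

# Hypothetical Q-values for one state.
q = {"UP": 1.0, "DOWN": 3.0, "LEFT": 0.5, "RIGHT": 2.0}
action = epsilon_greedy(q, epsilon=0.1)
```

Passing a seeded `random.Random` instance as `rng` makes the choices reproducible.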

Suppose there are $|\mathcal{A}| = 4$ actions, and let $a^* = \arg\max_a Q(s,a)$ be the current greedy action, i.e., the action with the highest Q-value in the current state.

**Part (a) - 6 Points:** What is the probability of selecting a specific non-greedy action $a \neq a^*$?

```
Answer (a):


```

**Part (b) - 5 Points:** As $\epsilon \to 0$, what policy does $\epsilon$-greedy approach? As $\epsilon \to 1$?

```
Answer (b):


```


## Q4 — Exploration Functions - 9 Points

An alternative to $\epsilon$-greedy is to act greedily with respect to an **exploration function** instead of the raw Q-values:

$$
f(u, n) = u + \frac{k}{n}
$$

where $u$ is the current Q-value estimate, $n$ is the visit count for that state-action pair, and $k > 0$ is a constant.
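
One way to use $f$ for action selection is sketched below. The formula is undefined at $n = 0$, so this sketch assumes unvisited pairs get an infinite score and are tried first; that convention, the function names, and the numbers are assumptions, not part of the definition above.

```python
def exploration_value(u, n, k=1.0):
    """f(u, n) = u + k / n for n >= 1."""
    if n == 0:
        return float("inf")   # assumption: the formula does not cover n = 0
    return u + k / n

def select_action(q_values, counts, k=1.0):
    """Choose the action maximizing f(Q(s, a), N(s, a)) in the current state."""
    return max(q_values, key=lambda a: exploration_value(q_values[a], counts[a], k))

# Hypothetical estimates and visit counts for one state.
q = {"LEFT": 1.0, "RIGHT": 0.8}
counts = {"LEFT": 50, "RIGHT": 2}
choice = select_action(q, counts)   # f(LEFT) = 1.02, f(RIGHT) = 1.3
```

In this example the rarely tried action wins even though its current estimate is lower, because its bonus $k/n$ is still large.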

**Part (a) - 5 Points:** As $n \to \infty$, what does $f(u,n)$ approach? What does this mean for exploitation?

```
Answer (a):


```

**Part (b) - 4 Points:** Why might this approach have lower **regret** than pure $\epsilon$-greedy? Regret is the total reward you lost during learning by not acting optimally (it's the difference between what you could have earned with the optimal policy and what you actually earned while exploring).

```
Answer (b):


```


## Q5 — Linear Function Approximation (Approximate Q-Learning) - 11 Points

In games like Pacman, there are millions of possible states (every combination of Pacman position, ghost positions, and pellet configurations). We can't store a Q-value for every state-action pair because there are simply too many.

**Approximate Q-learning** solves this by representing Q-values as a weighted sum of **features**. Instead of storing $Q(s,a)$ directly, we learn **weights** that tell us how important each feature is. The same weights apply to ALL states, allowing us to generalize from states we've seen to states we haven't. For example, if the agent learns a weight $w_1 = -5$ for a "closeness to ghost" feature (that is, it learns that ghosts are bad), that knowledge applies to every state where a ghost is nearby, including states we've never visited before.

When the state space is too large, we approximate:

$$
Q(s,a) \approx \hat{Q}(s,a; \mathbf{w}) = \sum_{i=1}^{d} w_i f_i(s,a) = \mathbf{w}^\top \mathbf{f}(s,a)
$$

where $\mathbf{f}(s,a) = [f_1(s,a), \ldots, f_d(s,a)]^\top$ are **features** and $\mathbf{w} = [w_1, \ldots, w_d]^\top$ are **weights**.
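
In code, the approximation is just a dot product. The sketch below uses plain Python lists for $\mathbf{w}$ and $\mathbf{f}(s,a)$; the weight and feature values are hypothetical.

```python
def q_hat(weights, features):
    """Linear Q-value estimate: dot product of weights and feature values."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical weights and feature values for one (s, a) pair.
w = [1.5, -5.0]
f_sa = [0.2, 0.5]
estimate = q_hat(w, f_sa)   # 1.5 * 0.2 + (-5.0) * 0.5
```

Only `d` weights are stored, regardless of how many states exist; the per-state information enters through the feature values.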

**Part (a) - 6 Points:** Consider a game where the agent makes $n$ sequential decisions, each with branching factor $b$ (i.e., $b$ possible actions at each step), and there are $|\mathcal{A}|$ total actions available.

If we wanted each state-action pair to have its own independent Q-value (no generalization), how many features would we need? Express your answer in terms of $b$, $n$, and $|\mathcal{A}|$.

```
Answer (a):


```

**Part (b) - 5 Points:** Consider the feature $f(s,a) = \frac{1}{\text{distance to nearest ghost} + 1}$. 

(i) Would you expect the learned weight (the value of w that the agent discovers through training) for this feature to be positive or negative? Why?

```
Answer (b)(i):


```

(ii) If Pacman gets eaten by a ghost (large negative reward), how will this weight change during the update? 

*Hint: Use the TD error $\delta = r + \gamma V(s') - Q(s,a)$, where $V(s') = \max_{a'} Q(s',a')$, and the weight update rule $w \leftarrow w + \alpha \cdot \delta \cdot f(s,a)$*

```
Answer (b)(ii):


```


## Q6 — Deriving the Weight Update (Gradient Descent) - 13 Points

We want to minimize the squared TD error:

$$
L(\mathbf{w}) = \frac{1}{2} \left( y - \hat{Q}(s,a; \mathbf{w}) \right)^2
$$

where $y = r + \gamma \max_{a'} \hat{Q}(s', a'; \mathbf{w})$ is the **target** (treated as fixed for the gradient).

**Part (a) - 4 Points:** Compute the gradient $\nabla_{\mathbf{w}} L(\mathbf{w})$.

*Hint: Use the Q-function approximation from Q5.*

```
Answer (a):


```

**Part (b) - 5 Points:** The gradient descent update is $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} L$. Show that this gives:

$$
w_i \leftarrow w_i + \alpha \cdot \delta \cdot f_i(s,a)
$$

where $\delta$ is the TD error.
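
For reference, here is how this update rule might look in code. It is a sketch only: it assumes features are plain lists, that a terminal $s'$ is signaled by an empty list of next-state features (so its value is 0), and it uses hypothetical numbers.

```python
def q_hat(weights, features):
    """Linear Q-value estimate w^T f(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def approx_q_update(weights, f_sa, r, next_features, alpha, gamma):
    """One approximate Q-learning step.

    f_sa: feature vector of the (s, a) pair just taken.
    next_features: feature vectors f(s', a') for each action a' in s'
                   (empty if s' is terminal, an assumption of this sketch).
    Returns (new_weights, td_error).
    """
    next_value = max((q_hat(weights, f) for f in next_features), default=0.0)
    delta = r + gamma * next_value - q_hat(weights, f_sa)
    new_weights = [w + alpha * delta * f for w, f in zip(weights, f_sa)]
    return new_weights, delta

# Hypothetical example: two features, two actions available in s'.
new_w, delta = approx_q_update(
    weights=[1.0, 2.0], f_sa=[1.0, 0.5], r=3.0,
    next_features=[[0.0, 1.0], [1.0, 0.0]], alpha=0.1, gamma=0.5,
)
```

The target is computed from the old weights before any weight is changed, matching the convention that $y$ is treated as fixed.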

```
Answer (b):


```

**Part (c) - 4 Points:** Looking at the update rule $w_i \leftarrow w_i + \alpha \cdot \delta \cdot f_i(s,a)$: 

Suppose $\delta = -5$, $\alpha = 0.1$, and we have two features with $f_1(s,a) = 2$ and $f_2(s,a) = 0$. 

- Calculate the change in $w_1$
- Calculate the change in $w_2$
- Which weight changes more? Why?

```
Answer (c):


```


## Q7 — Bridge Crossing - 8 Points

Consider a robot crossing a narrow bridge over a cliff:

```
[START] ---- [BRIDGE] ---- [GOAL]
              |     |
           [CLIFF] [CLIFF]
```

**The Setup:**
- The bridge has 5 positions: START → B1 → B2 → B3 → GOAL
- At each bridge position (B1, B2, B3), the robot can choose: **WALK** (slow, safe) or **RUN** (fast, risky)
- **WALK:** Always moves forward 1 step. Reward = +1
- **RUN:** 80% chance to move forward 2 steps (Reward = +3), but 20% chance to slip and fall off the cliff (Reward = -100, episode ends)
- Reaching GOAL gives a final reward of +10

The robot uses **ε-greedy Q-learning** with discount γ = 0.9.
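
If you want to experiment, the dynamics above could be simulated roughly as follows. The setup leaves a few details open, so this sketch makes its own assumptions, each marked in the code: both actions are allowed at any non-goal position, a RUN that would overshoot stops at GOAL, and the +10 for reaching GOAL is added to the step reward. The function name and the `u` parameter are hypothetical.

```python
START, GOAL = 0, 4          # positions 0..4: START, B1, B2, B3, GOAL
CLIFF = -1                  # hypothetical terminal marker for a fall

def step(pos, action, u):
    """Simulate one move. `u` is a uniform random draw in [0, 1).

    Assumption: both actions are accepted at any non-goal position.
    Returns (next_pos, reward, done).
    """
    if action == "WALK":
        nxt, reward = pos + 1, 1.0
    elif action == "RUN":
        if u < 0.2:                             # 20% chance: slip off the cliff
            return CLIFF, -100.0, True
        nxt, reward = min(pos + 2, GOAL), 3.0   # assumption: overshoot stops at GOAL
    else:
        raise ValueError(f"unknown action: {action}")
    done = nxt == GOAL
    if done:
        reward += 10.0                          # assumption: bonus adds to step reward
    return nxt, reward, done
```

Passing the random draw in as `u` (for example `random.random()`) keeps the function deterministic and easy to check by hand.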

**Training constraint:** The agent only gets **50 training episodes** to learn before being tested.

---

**Part (a) - 4 Points:** What is the optimal policy to maximize expected reward? 

```
Answer (a):


```

---

**Part (b) - 4 Points:** Suppose we set ε = 0.3 (30% random exploration). During each episode, what is the likelihood that the agent *does not* fall?

```
Answer (b):


```





## Q8 — The Exploration-Exploitation Dilemma (Continued from Q7) - 20 Points

Using the Bridge Crossing setup from Q7:

**Part (a) - 5 Points:** Given only 50 training episodes, can you find values of ε (exploration rate) and α (learning rate) such that the Q-learning agent reliably learns the optimal (safe) policy?

```
Answer (a):


```

---

**Part (b) - 10 Points:** Explain why this is a difficult problem by answering these two questions:

(i) If ε is **high** (e.g., ε = 0.5), what happens to the agent during training?

```
Answer (b)(i):


```

(ii) If ε is **low** (e.g., ε = 0.01), what problem does the agent face?

```
Answer (b)(ii):


```

---

**Part (c) - 5 Points:** Based on your answers above, explain why ε-greedy exploration struggles with "dangerous" environments where random actions can lead to very bad outcomes.

```
Answer (c):


```
