## Temporal Difference (TD) Algorithms: Detailed Notes

### Roadmap: Temporal Difference (TD) Algorithms

The list below summarizes each algorithm and the role it plays in reinforcement learning:

1. **TD(0)**: A one-step Temporal Difference method that updates the value function using the immediate reward and next state’s value estimate.
2. **On-policy SARSA**: Learns action values $Q(s, a)$ for a policy by following it; updates are made using the actual actions taken by the policy.
3. **Q-learning**: An off-policy method that learns the optimal action-value function $Q^*(s, a)$, updating based on the maximum future $Q$-value, regardless of the policy followed.
4. **Expected SARSA**: A variation of SARSA that updates using the expected $Q$-value over all actions, weighted by the policy’s action probabilities.
5. **Replay Buffer and Off-policy Learning**: Enables efficient off-policy learning by storing and reusing past experiences to stabilize and enhance training.
6. **Q-learning for Continuous State Spaces**: Extends Q-learning to work with continuous states using function approximators like neural networks.
7. **n-step Returns**: Generalizes TD(0) by considering multi-step returns, balancing bias and variance in updates.
8. **Eligibility Traces**: Maintains a record of visited states to assign credit for rewards to past states/actions, enabling faster learning.
9. **TD($\lambda$)**: Combines n-step returns and eligibility traces into a single framework for more comprehensive credit assignment.

---

### **1. TD(0)**

- **Key Idea**: Update the value of a state after one step by bootstrapping (using the estimated value of the next state).
- **Update Rule**:
  $$
  V(s) \leftarrow V(s) + \alpha \left[ R_{t+1} + \gamma V(s') - V(s) \right]
  $$
- **Example**:
  If you move from state $s$ to $s'$ with a reward $R_{t+1} = 5$ and a discount factor $\gamma = 0.9$, TD(0) adjusts $V(s)$ toward $5 + 0.9 \cdot V(s')$.
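
As a rough illustration of the update rule above, the sketch below applies one TD(0) step to a dictionary-based value table. The function name `td0_update`, the state labels, and the starting values $V(s) = 2$ and $V(s') = 10$ are assumptions made for this example; only $R_{t+1} = 5$ and $\gamma = 0.9$ come from the text.

```python
def td0_update(V, s, reward, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the value table V (a dict) in place."""
    # Bootstrapped target: immediate reward plus the discounted estimate of
    # the next state. A terminal next state contributes no future value.
    next_value = 0.0 if terminal else V.get(s_next, 0.0)
    target = reward + gamma * next_value
    td_error = target - V.get(s, 0.0)
    # Move V(s) a fraction alpha of the way toward the target.
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error


# Assumed starting estimates for the two states (hypothetical values).
V = {"s": 2.0, "s_next": 10.0}
delta = td0_update(V, "s", reward=5.0, s_next="s_next", alpha=0.1, gamma=0.9)
print(delta, V["s"])
```

With these assumed numbers the target is $5 + 0.9 \cdot 10 = 14$, so a step size of $\alpha = 0.1$ moves $V(s)$ from 2.0 to about 3.2.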

---

### **2. On-policy SARSA**

- **Key Idea**: Learn $Q(s, a)$ by following the current policy and updating $Q$-values based on the actions actually taken.
- **Update Rule**:
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma Q(s', a') - Q(s, a) \right]
  $$
- **Example**:
  If $a'$ is the next action chosen by the policy at state $s'$, $Q(s, a)$ adjusts using $Q(s', a')$ (not the maximum $Q$-value).
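
A minimal sketch of the SARSA update, assuming a tabular $Q$ stored in a `defaultdict` keyed by `(state, action)`. The function name `sarsa_update`, the state and action labels, and the numeric values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, reward, s_next, a_next,
                 alpha=0.1, gamma=0.9, terminal=False):
    """One SARSA update using the next action the policy actually chose."""
    # The target bootstraps from Q(s', a'), where a' was sampled by the policy.
    next_value = 0.0 if terminal else Q[(s_next, a_next)]
    td_error = reward + gamma * next_value - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


Q = defaultdict(float)           # unseen (state, action) pairs default to 0.0
Q[("s1", "left")] = 1.0          # assumed current estimates
Q[("s1", "right")] = 4.0

# The policy happened to pick "left" in s1, so SARSA uses Q(s1, left) = 1.0
# even though "right" currently looks better.
sarsa_update(Q, "s0", "right", reward=0.0, s_next="s1", a_next="left",
             alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])
```

Here the target is $0 + 0.9 \cdot 1.0 = 0.9$, so $Q(s_0, \text{right})$ moves from 0 to about 0.45.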

---

### **3. Q-learning**

- **Key Idea**: Learn $Q^*(s, a)$ by bootstrapping from the largest estimated action value in the next state, $\max_{a'} Q(s', a')$.
- **Update Rule**:
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
  $$
- **Off-policy Nature**: The update uses the greedy action in $s'$ under the current estimates (not necessarily the action actually taken), so the greedy target policy is learned independently of the behavior policy that collects the data.
- **Example**:
  In a grid-world game, Q-learning can learn the shortest path even while exploring suboptimal actions.
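
The following sketch shows one tabular Q-learning update, assuming $Q$ is a `defaultdict` keyed by `(state, action)` and that the action set is a small list. The name `q_learning_update` and all labels and numbers are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, reward, s_next, actions,
                      alpha=0.1, gamma=0.9, terminal=False):
    """One Q-learning update using the greedy value of the next state."""
    # The target takes the max over next actions, whatever the agent does next.
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    td_error = reward + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "left")] = 1.0          # assumed current estimates
Q[("s1", "right")] = 4.0

q_learning_update(Q, "s0", "right", reward=0.0, s_next="s1", actions=actions,
                  alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])
```

The target is $0 + 0.9 \cdot \max(1.0, 4.0) = 3.6$, so $Q(s_0, \text{right})$ becomes about 1.8. The result does not depend on which action the agent goes on to take in $s_1$, which is what makes the update off-policy.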

---

### **4. Expected SARSA**

- **Key Idea**: Similar to SARSA but updates $Q(s, a)$ using the expected $Q(s', a')$ weighted by the action probabilities of the policy.
- **Update Rule**:
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \sum_{a'} \pi(a'|s') Q(s', a') - Q(s, a) \right]
  $$
- **Advantage**: Reduces variance compared to SARSA by averaging over all actions instead of depending on one sampled action.
- **Example**:
  With a stochastic policy such as $\epsilon$-greedy, Expected SARSA gives smoother updates because the target no longer depends on which next action happened to be sampled.
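
To make the expectation in the update rule concrete, the sketch below computes the Expected SARSA target under an $\epsilon$-greedy policy. The choice of $\epsilon$-greedy, the helper names `epsilon_greedy_probs` and `expected_sarsa_target`, and the numbers are assumptions for illustration; any policy with known action probabilities $\pi(a'|s')$ would work the same way.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy over q_values."""
    n = len(q_values)
    greedy = max(range(n), key=lambda i: q_values[i])
    probs = [epsilon / n] * n          # exploration mass spread uniformly
    probs[greedy] += 1.0 - epsilon     # remaining mass on the greedy action
    return probs


def expected_sarsa_target(reward, next_q_values, epsilon=0.1, gamma=0.9):
    """Target R + gamma * sum_a' pi(a'|s') Q(s', a')."""
    probs = epsilon_greedy_probs(next_q_values, epsilon)
    expected_q = sum(p * q for p, q in zip(probs, next_q_values))
    return reward + gamma * expected_q


# Assumed next-state estimates: Q(s', a0) = 1.0 and Q(s', a1) = 4.0.
target = expected_sarsa_target(0.0, [1.0, 4.0], epsilon=0.2, gamma=0.9)
print(target)
```

With $\epsilon = 0.2$ the policy probabilities are 0.1 and 0.9, the expected next value is $0.1 \cdot 1.0 + 0.9 \cdot 4.0 = 3.7$, and the target is about 3.33. SARSA would instead use either 0.9 or 3.6 depending on the sampled action.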

---

### **5. Replay Buffer and Off-policy Learning**

- **Key Idea**: Store past transitions $(s, a, R, s')$ in a replay buffer to reuse them for updates.
- **How it works**:
  1. Store transitions during episode execution.
  2. Sample mini-batches of transitions randomly for training.
  3. Update $Q(s, a)$ using these sampled experiences (off-policy updates).
- **Advantages**:
  - Breaks correlations in consecutive samples.
  - Allows updates from diverse experiences.
- **Common Application**: Deep Q-Learning.
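
The three steps above can be sketched with a fixed-capacity buffer built on the standard library. The class name `ReplayBuffer`, the `Transition` fields (including the extra `done` flag), and the capacity and batch size used here are assumptions for illustration, not a reference implementation.

```python
import random
from collections import deque, namedtuple

# A stored transition; the `done` flag is an added convenience field.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                       # five pushes into a buffer of size 3
    buffer.push(t, 0, 1.0, t + 1, False)
batch = buffer.sample(2)
print(len(buffer), [tr.state for tr in batch])
```

After five pushes only the three most recent transitions remain, and each sampled mini-batch can then be fed to an off-policy update such as Q-learning.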

---

### **6. Q-learning for Continuous State Spaces**

- **Key Idea**: Use function approximators (e.g., neural networks) to generalize $Q(s, a)$ across continuous state spaces.
- **Challenges**:
  - Large state-action spaces.
  - Stability of learning.
- **Solution**:
  - Use experience replay and target networks (in Deep Q-Learning).
  - Approximate $Q(s, a; \theta)$ as a parametric function.
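- **Example**: Training an autonomous car from continuous sensor readings (position, speed, heading), with steering discretized into a finite set of actions so that $\max_{a'} Q(s', a'; \theta)$ can be computed by enumeration.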
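
As a simplified stand-in for a neural network, the sketch below uses a linear approximator $Q(s, a; \theta) = \theta_a^\top \phi(s)$ with one weight vector per discrete action and applies a semi-gradient Q-learning step. The linear model, the function names, and the feature values are assumptions; experience replay and target networks are deliberately left out to keep the example short.

```python
def q_value(weights, features, action):
    """Linear estimate Q(s, a) = weights[action] . features."""
    return sum(w * x for w, x in zip(weights[action], features))


def semi_gradient_q_update(weights, features, action, reward, next_features,
                           alpha=0.01, gamma=0.99, done=False):
    """One semi-gradient Q-learning step on a linear approximator."""
    n_actions = len(weights)
    best_next = 0.0 if done else max(
        q_value(weights, next_features, b) for b in range(n_actions)
    )
    td_error = reward + gamma * best_next - q_value(weights, features, action)
    # For a linear model, the gradient of Q(s, a) with respect to
    # weights[action] is just the feature vector of s.
    weights[action] = [
        w + alpha * td_error * x for w, x in zip(weights[action], features)
    ]
    return td_error


# Two actions, two continuous state features (hypothetical values).
weights = [[0.0, 0.0], [0.0, 0.0]]
delta = semi_gradient_q_update(
    weights, features=[1.0, 0.5], action=1, reward=1.0,
    next_features=[0.0, 1.0], alpha=0.1, gamma=0.9,
)
print(delta, weights)
```

Because nearby states share features, a single update changes the estimate for every state with similar features, which is the generalization that a table cannot provide.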

---

### **7. n-step Returns**

- **Key Idea**: Combine the benefits of Monte Carlo and TD(0) by updating state values with the discounted rewards from the next $n$ steps plus a bootstrapped estimate of the state reached after them.
- **Update Rule**:
  $$
  G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(s_{t+n})
  $$
  Update:
  $$
  V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t^{(n)} - V(s_t) \right]
  $$
- **Advantages**:
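  - Longer returns propagate reward information back to earlier states in fewer updates.
  - The choice of $n$ trades bias against variance: small $n$ leans on the bootstrapped estimate (more bias, less variance), while large $n$ leans on sampled rewards (less bias, more variance).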
- **Example**:
  Using $n = 3$, you update $V(s_t)$ based on the rewards from the next three steps plus $\gamma^3 V(s_{t+3})$.
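
The $n$-step return can be computed by folding the rewards backward from the bootstrap value. In the sketch below, the function name `n_step_return`, the reward sequence, and $\gamma = 0.5$ are assumptions chosen so the arithmetic is exact.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.9):
    """Return R_{t+1} + gamma R_{t+2} + ... + gamma^n V(s_{t+n}).

    `rewards` holds the n rewards R_{t+1}, ..., R_{t+n} in order, and
    `bootstrap_value` is the current estimate V(s_{t+n}).
    """
    g = bootstrap_value
    # Work backward: each step discounts everything that comes after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# n = 3 with assumed rewards of 1.0 each and V(s_{t+3}) = 10.0.
g3 = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
print(g3)
```

Here $G_t^{(3)} = 1 + 0.5 + 0.25 + 0.125 \cdot 10 = 3.0$. Passing a single reward gives the TD(0) target, and passing a whole episode with a bootstrap value of 0 gives the Monte Carlo return.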

---

### **8. Eligibility Traces**

- **Key Idea**: Maintain a "trace" for visited states and update their values based on this trace.
- **How it works**:
  - Each visited state accumulates an "eligibility" measure that decays over time.
  - Updates are applied proportionally to eligibility traces.
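- **Update Rule** (for SARSA, with accumulating traces), applied to every pair $(s, a)$ at each step $t$:
  $$
  e(s, a) \leftarrow \gamma \lambda \, e(s, a) + \mathbf{1}\left[ (s, a) = (s_t, a_t) \right]
  $$
  $$
  Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t e(s, a)
  $$
  where $\delta_t = R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$ is the TD error, and the indicator adds 1 only for the pair just visited.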
- **Advantages**: Faster learning by crediting all recent states for the rewards.
- **Example**: Accelerating credit assignment in long episodes.
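
A minimal sketch of one SARSA($\lambda$) step with accumulating traces, assuming tabular `defaultdict` storage for both $Q$ and the traces. The function name `sarsa_lambda_step`, the two-step episode, and the parameter values are hypothetical. The decay is applied at the end of each step, which is equivalent to decaying at the start of the next one.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, E, s, a, reward, s_next, a_next,
                      alpha=0.1, gamma=0.9, lam=0.8, terminal=False):
    """One SARSA(lambda) step: bump the trace, update all pairs, decay."""
    next_value = 0.0 if terminal else Q[(s_next, a_next)]
    td_error = reward + gamma * next_value - Q[(s, a)]
    E[(s, a)] += 1.0                      # accumulating trace for this visit
    for key in list(E):
        # Every recently visited pair shares the TD error, scaled by its trace.
        Q[key] += alpha * td_error * E[key]
        E[key] *= gamma * lam             # decay in preparation for next step
    return td_error


Q = defaultdict(float)
E = defaultdict(float)
# Step 1: no reward yet, so nothing changes except the trace of (A, right).
sarsa_lambda_step(Q, E, "A", "right", 0.0, "B", "right")
# Step 2: reward of 1 on reaching the terminal state.
sarsa_lambda_step(Q, E, "B", "right", 1.0, None, None, terminal=True)
print(Q[("A", "right")], Q[("B", "right")])
```

After the second step, $Q(B, \text{right})$ is about 0.1 and $Q(A, \text{right})$ is about 0.072, since its trace had decayed to $\gamma\lambda = 0.72$. One-step SARSA would have left $Q(A, \text{right})$ at 0 until a later episode.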

---

### **9. TD($\lambda$)**

- **Key Idea**: Generalize n-step returns with a weighted average of all possible $n$-step returns using a parameter $\lambda$.
- **How it works**:
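  - Weight the $n$-step return by $(1 - \lambda)\lambda^{n-1}$, so shorter returns count more and the weights sum to 1.
  - Update the value toward the resulting $\lambda$-return:
    $$
    G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}, \qquad V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t^{\lambda} - V(s_t) \right]
    $$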
- **Advantages**:
  - Unifies Monte Carlo and TD learning.
  - Fine-tunes bias-variance trade-off using $\lambda$.
- **Example**:
  Setting $\lambda = 0$ reduces TD($\lambda$) to TD(0), while $\lambda = 1$ recovers the Monte Carlo return in episodic tasks.
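
The two limiting cases can be checked numerically with the normalized $\lambda$-return, $G_t^{\lambda} = (1 - \lambda) \sum_{n} \lambda^{n-1} G_t^{(n)}$, where in an episodic task all weight left over after termination goes to the full return. The function name `lambda_return`, the reward sequence, and the value estimates below are assumptions for illustration.

```python
def lambda_return(rewards, next_values, gamma=0.9, lam=0.8):
    """Forward-view lambda-return for one episode suffix.

    rewards[k]     = R_{t+k+1}
    next_values[k] = V(s_{t+k+1}), with 0.0 for the terminal state at the end.
    """
    T = len(rewards)
    g_lambda = 0.0
    reward_sum = 0.0
    for n in range(1, T + 1):
        reward_sum += gamma ** (n - 1) * rewards[n - 1]
        g_n = reward_sum + gamma ** n * next_values[n - 1]   # n-step return
        if n < T:
            g_lambda += (1.0 - lam) * lam ** (n - 1) * g_n
        else:
            # The last return is the full return; it gets the remaining weight.
            g_lambda += lam ** (n - 1) * g_n
    return g_lambda


rewards = [1.0, 1.0, 1.0]            # assumed rewards until termination
next_values = [2.0, 3.0, 0.0]        # assumed estimates; terminal value is 0
td0_target = lambda_return(rewards, next_values, gamma=0.5, lam=0.0)
mc_return = lambda_return(rewards, next_values, gamma=0.5, lam=1.0)
print(td0_target, mc_return)
```

With $\lambda = 0$ the result is the one-step target $1 + 0.5 \cdot 2 = 2.0$, and with $\lambda = 1$ it is the Monte Carlo return $1 + 0.5 + 0.25 = 1.75$. Intermediate values of $\lambda$ blend the $n$-step returns between these two extremes.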

---

### Summary Table

| **Algorithm**          | **Goal**                             | **Key Idea**                       | **Pros**                               | **Cons**                               |
|-------------------------|---------------------------------------|-------------------------------------|----------------------------------------|----------------------------------------|
| TD(0)                  | Policy evaluation                    | One-step bootstrapping             | Simple, computationally efficient      | Propagates reward information only one step per update |
| SARSA                  | On-policy control                    | Update based on policy’s actions   | Learns the value of the policy actually followed | Reaches the optimal policy only if exploration is reduced over time |
| Q-learning             | Off-policy control                   | Update using max next-state $Q$-value | Learns the optimal policy while following an exploratory one | Overestimates values because of the max; can diverge with function approximation |
| Expected SARSA         | On-policy control                    | Update using expected next-state $Q$-value | Reduces variance                | More computation for expectations      |
| Replay Buffer          | Improve stability for Q-learning     | Store and reuse transitions        | Breaks sample correlation              | Requires memory and additional logic   |
| Continuous Q-learning  | Control in continuous spaces         | Function approximation for $Q$     | Scalable, flexible                     | Needs careful design of approximators  |
| n-step Returns         | Extend TD to multiple steps          | Use multi-step rewards             | Balances bias and variance             | Selecting $n$ is tricky              |
| Eligibility Traces     | Assign credit to recent states       | Decaying eligibility over time     | Speeds up learning                     | Adds complexity                        |
| TD($\lambda$)         | Combine Monte Carlo and TD learning  | Weighted average of returns        | General-purpose, tunable               | Sensitive to $\lambda$ setting         |