# TD(0) and n-Step TD Algorithms

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand Temporal Difference (TD) learning
- Implement TD(0) algorithm
- Implement n-step TD algorithms
- Compare TD with Monte Carlo methods
- Apply TD algorithms to RL environments

## 🔗 Prerequisites

- ✅ Understanding of value functions
- ✅ Monte Carlo methods knowledge
- ✅ Python knowledge (functions, loops, NumPy)
- ✅ Understanding of bootstrapping

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 2**:
- Running TD(0) and n-step TD algorithms in simple RL environments
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

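**Temporal Difference (TD) learning** combines ideas from Monte Carlo methods (learning from sampled experience, without a model) and Dynamic Programming (bootstrapping). TD methods update each estimate from other estimates after every step instead of waiting for the episode's final return, which often makes them more sample-efficient than Monte Carlo in practice.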

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

print("✅ Libraries imported!")
print("\nTD(0) and n-Step TD Algorithms")
print("=" * 60)

## Part 1: Understanding TD Learning


In [None]:
print("=" * 60)
print("Part 1: Understanding TD Learning")
print("=" * 60)

print("\nTD Learning Key Concepts:")
print(" 1. Bootstrap: Update estimate using other estimates")
print(" 2. Incremental: Update after each step (no need to wait for episode end)")
print(" 3. TD Error: δ_t = R_{t+1} + γV(S_{t+1}) - V(S_t)")
print(" 4. Update: V(S_t) ← V(S_t) + α δ_t")

print("\nTD(0) Algorithm:")
print(" V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]")
print(" - Uses 1-step return: R_{t+1} + γV(S_{t+1})")
print(" - Updates after each step")

print("\nn-Step TD Algorithm:")
print(" - Uses n-step return: R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})")
print(" - Balances bias and variance")
print(" - n=1: TD(0), n=∞: Monte Carlo")

print("\n✅ TD learning concepts understood!")
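
As a concrete illustration of the TD error and update rule listed above, the following minimal sketch applies one TD(0) update to a single transition. The state names, starting estimates, reward, step size, and discount are all made-up numbers chosen for easy arithmetic; they are not taken from the course material.

```python
# One TD(0) update on a single hypothetical transition A -> B.
alpha, gamma = 0.1, 0.9            # illustrative step size and discount
V = {"A": 0.5, "B": 1.0}           # illustrative current value estimates
state, reward, next_state = "A", 1.0, "B"

td_target = reward + gamma * V[next_state]   # 1.0 + 0.9 * 1.0 = 1.9
td_error = td_target - V[state]              # 1.9 - 0.5 = 1.4
V[state] = V[state] + alpha * td_error       # 0.5 + 0.1 * 1.4 = 0.64

print(f"TD target = {td_target:.2f}")
print(f"TD error  = {td_error:.2f}")
print(f"New V(A)  = {V[state]:.2f}")
```

The estimate for A moves only a fraction `alpha` of the way towards the target, which is why repeated visits are needed before the values settle.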

## Part 2: TD(0) Implementation


In [None]:
print("\n" + "=" * 60)
print("Part 2: TD(0) Implementation")
print("=" * 60)

def td0(policy, env_simulator, n_episodes=100, alpha=0.1, gamma=0.99):
    """
    TD(0) for estimating state values.
    V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]
    """
    V = defaultdict(float)

    for episode in range(n_episodes):
        state = env_simulator.reset()
        done = False

        while not done:
            # Choose action: call the policy if it is a function, else treat it as a dict
            action = policy(state) if callable(policy) else policy.get(state, 0)

            # Take step (assumes step() returns (next_state, reward, done))
            next_state, reward, done = env_simulator.step(action)

            # TD(0) update; a terminal state is never updated, so its value stays 0.0
            td_target = reward + gamma * V[next_state]
            td_error = td_target - V[state]
            V[state] = V[state] + alpha * td_error

            state = next_state

    return V

# Simple example
class SimpleEnv:
    def __init__(self):
        self.state = 1
        self.n_states = 5

    def reset(self):
        self.state = 1
        return self.state

    def step(self, action):
        # Simple transition: move towards goal (state 4); the action is ignored
        if self.state < 4:
            self.state += 1
        reward = 1.0 if self.state == 4 else 0.0
        done = self.state == 4
        return self.state, reward, done

env = SimpleEnv()

def simple_policy(state):
    return 1  # Always move forward

V_td0 = td0(simple_policy, env, n_episodes=100, alpha=0.1, gamma=1.0)

print("\nTD(0) Estimated State Values:")
for state in sorted(V_td0.keys()):
 print(f" V({state}) = {V_td0[state]:.4f}")

print("\n✅ TD(0) implemented!")
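
One way to sanity-check the TD(0) estimates is to compare them with values that can be worked out by hand. On a deterministic chain 1 → 2 → 3 → 4 with a single reward of 1.0 on reaching state 4 and `gamma = 1.0`, every non-terminal state has a true value of 1.0. The sketch below is self-contained and re-implements the chain inline rather than reusing the cell above; the helper name `run_td0_chain` is hypothetical.

```python
from collections import defaultdict

def run_td0_chain(n_episodes, alpha=0.1, gamma=1.0):
    """TD(0) on a deterministic chain 1 -> 2 -> 3 -> 4 (terminal).

    The only non-zero reward is 1.0, received on entering state 4.
    """
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = 1
        while state != 4:
            next_state = state + 1
            reward = 1.0 if next_state == 4 else 0.0
            # V[4] is never updated, so the terminal value stays at 0.0
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V

# True values for gamma = 1.0: each non-terminal state eventually collects reward 1.0
true_values = {1: 1.0, 2: 1.0, 3: 1.0}

for n_episodes in (10, 100, 1000):
    V = run_td0_chain(n_episodes)
    max_err = max(abs(V[s] - v) for s, v in true_values.items())
    print(f"{n_episodes:>5} episodes: max |V - V_true| = {max_err:.4f}")
```

States closer to the goal converge first, because the reward information propagates backwards by one state per update.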

## Part 3: n-Step TD Implementation


In [None]:
print("\n" + "=" * 60)
print("Part 3: n-Step TD Implementation")
print("=" * 60)

def n_step_td(policy, env_simulator, n_episodes=100, n=2, alpha=0.1, gamma=0.99):
    """
    n-step TD for estimating state values.
    Uses n-step return: R_{t+1} + γR_{t+2} + ... + γ^{n-1}R_{t+n} + γ^n V(S_{t+n})
    """
    V = defaultdict(float)

    for episode in range(n_episodes):
        # states[k] is S_k and rewards[k] is R_{k+1}; both are indexed by absolute time
        states = [env_simulator.reset()]
        rewards = []
        t = 0
        T = float('inf')  # Episode length, unknown until termination

        while True:
            if t < T:
                # Take action
                action = policy(states[-1]) if callable(policy) else policy.get(states[-1], 0)
                next_state, reward, done = env_simulator.step(action)

                states.append(next_state)
                rewards.append(reward)

                if done:
                    T = t + 1

            # Time step whose state estimate is updated
            tau = t - n + 1

            if tau >= 0:
                # Calculate n-step return (truncated at the end of the episode)
                G = sum(gamma ** i * rewards[tau + i] for i in range(min(n, T - tau)))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]

                # Update value
                V[states[tau]] = V[states[tau]] + alpha * (G - V[states[tau]])

            t += 1
            if tau == T - 1:
                break

            # The full trajectory is kept: states and rewards are indexed by
            # absolute time, so dropping old entries would shift the indices above.

    return V

# Compare n-step TD for different n values
env2 = SimpleEnv()
V_n1 = n_step_td(simple_policy, env2, n_episodes=100, n=1, alpha=0.1, gamma=1.0)
env3 = SimpleEnv()
V_n2 = n_step_td(simple_policy, env3, n_episodes=100, n=2, alpha=0.1, gamma=1.0)
env4 = SimpleEnv()
V_n4 = n_step_td(simple_policy, env4, n_episodes=100, n=4, alpha=0.1, gamma=1.0)

print("\nn-Step TD Estimated State Values:")
print("n=1 (TD(0)):")
for state in sorted(V_n1.keys()):
 print(f" V({state}) = {V_n1[state]:.4f}")

print("\nn=2:")
for state in sorted(V_n2.keys()):
 print(f" V({state}) = {V_n2[state]:.4f}")

print("\nn=4:")
for state in sorted(V_n4.keys()):
 print(f" V({state}) = {V_n4[state]:.4f}")

print("\n✅ n-step TD implemented!")
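
To see how the choice of n changes the update target, it can help to compute the n-step return by hand for one recorded episode. The sketch below does this for a three-step trajectory; the helper `n_step_return`, the value estimates, and `gamma = 0.9` are illustrative assumptions, not part of the notebook's code above.

```python
def n_step_return(rewards, states, V, tau, n, gamma):
    """n-step return from time tau; rewards[k] holds R_{k+1}, states[k] holds S_k."""
    T = len(rewards)                      # the episode terminates at time T
    steps = min(n, T - tau)               # truncate the sum at the end of the episode
    G = sum(gamma ** i * rewards[tau + i] for i in range(steps))
    if tau + n < T:                       # bootstrap only from a non-terminal state
        G += gamma ** n * V[states[tau + n]]
    return G

# One recorded episode on a chain 1 -> 2 -> 3 -> 4 (terminal)
states = [1, 2, 3, 4]
rewards = [0.0, 0.0, 1.0]
V = {1: 0.0, 2: 0.2, 3: 0.5, 4: 0.0}      # made-up current estimates
gamma = 0.9

for n in (1, 2, 3, 10):
    G = n_step_return(rewards, states, V, tau=0, n=n, gamma=gamma)
    print(f"n={n:>2}: G = {G:.3f}")
```

With these numbers, n=1 gives 0.180 (bootstrapping from V(2)), n=2 gives 0.405 (bootstrapping from V(3)), and n=3 and n=10 both give 0.810, the full Monte Carlo return, because the episode ends before any bootstrapping is needed.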

## Summary

### Key Concepts:
1. **TD Learning**: Combines Monte Carlo (experience) and DP (bootstrapping)
2. **TD(0)**: 1-step TD, updates: V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]
3. **n-Step TD**: Uses n-step returns, balances bias and variance
4. **Bootstrapping**: Update an estimate using other estimates (allows updates before the episode ends, but the targets are biased while those estimates are inaccurate)

### Comparison:
- **Monte Carlo**: High variance, unbiased targets, requires complete episodes
- **TD(0)**: Lower variance, biased targets (due to bootstrapping), online (incremental)
- **n-Step TD**: Trade-off between MC and TD(0), controlled by n
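
The variance difference in the comparison above can be checked numerically. The sketch below assumes a hypothetical noisy variant of the three-step chain, in which every reward carries independent zero-mean Gaussian noise, and compares the spread of the Monte Carlo target with that of the TD(0) target for the start state. To isolate variance, the TD(0) target bootstraps from the true value of the next state; in real learning that value is an estimate, and this is where TD's bias comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_episodes = 5000

# Expected rewards for the transitions 1->2, 2->3, 3->4 (hypothetical noisy chain)
base_rewards = np.array([0.0, 0.0, 1.0])
true_V = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}   # true values for gamma = 1 and zero-mean noise

# One row per episode, one column per step; unit-variance noise on every reward
rewards = base_rewards + rng.normal(0.0, 1.0, size=(n_episodes, 3))

mc_targets = rewards.sum(axis=1)             # full return from state 1 (three noisy rewards)
td_targets = rewards[:, 0] + true_V[2]       # one noisy reward plus a bootstrapped value

print(f"MC target:    mean = {mc_targets.mean():.3f}, variance = {mc_targets.var():.3f}")
print(f"TD(0) target: mean = {td_targets.mean():.3f}, variance = {td_targets.var():.3f}")
```

Both targets average close to the true value of 1.0, but the Monte Carlo target sums three independent noise terms while the TD(0) target contains only one, so its variance is roughly three times larger in this setup.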

### Advantages:
- Online learning (no need to wait for episode end)
- Lower variance than Monte Carlo
- Often more sample-efficient than Monte Carlo in practice
- Works for continuing tasks

### Applications:
- Value function estimation
- Policy evaluation
- Online learning scenarios

### Next Steps:
- SARSA and Q-learning (TD control)
- Eligibility traces (TD(λ))
- Compare with Monte Carlo and DP

**Reference:** Course 09, Unit 2: "Prediction and Control without a Model" - TD algorithms practical content