# TD(0) and n-Step TD Algorithms

## 📚 Learning Objectives

By completing this notebook, you will:
- Understand Temporal Difference (TD) learning
- Implement the TD(0) algorithm
- Implement n-step TD algorithms
- Compare TD with Monte Carlo methods
- Apply TD algorithms to RL environments

## 🔗 Prerequisites

- ✅ Understanding of value functions
- ✅ Monte Carlo methods knowledge
- ✅ Python knowledge (functions, loops, NumPy)
- ✅ Understanding of bootstrapping

---

## Official Structure Reference

This notebook covers practical activities from **Course 09, Unit 2**:
- Running TD(0) and n-step TD algorithms in simple RL environments
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

**Temporal Difference (TD) learning** combines ideas from Monte Carlo methods (learning from sampled experience) and Dynamic Programming (bootstrapping). TD methods update a value estimate using other current estimates instead of waiting for the full return, which lets them learn online and often makes them more sample-efficient than Monte Carlo in practice.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

print("✅ Libraries imported!")
print("\nTD(0) and n-Step TD Algorithms")
print("=" * 60)

## Part 1: Understanding TD Learning
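
The core of TD learning is a single update that moves a state's value toward a bootstrapped target. The snippet below is a minimal sketch of one such update done by hand. The state names, the reward, the step size `alpha`, the discount `gamma`, and the helper `td0_update` are all illustrative assumptions, not part of the course material.

```python
# Hypothetical value estimates for two states (illustrative numbers only).
V = {"A": 0.2, "B": 0.5}

alpha = 0.1  # step size (assumed)
gamma = 0.9  # discount factor (assumed)


def td0_update(V, state, reward, next_state, alpha, gamma, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # A terminal state has value 0 by definition, so nothing is bootstrapped.
    bootstrap = 0.0 if terminal else V[next_state]
    td_target = reward + gamma * bootstrap
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error


# One observed transition: A --(reward 1.0)--> B
delta = td0_update(V, "A", 1.0, "B", alpha, gamma)
print(f"TD error: {delta:.3f}, updated V(A): {V['A']:.3f}")
```

Here the TD target is 1.0 + 0.9 × 0.5 = 1.45, so the TD error is 1.25 and V(A) moves one tenth of the way toward the target, from 0.2 to 0.325. The update uses only the current estimate of V(B), not the eventual return.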


In [None]:
print("=" * 60)
print("Part 1: Understanding TD Learning")
print("=" * 60)


## Part 2: TD(0) Implementation
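
One possible way to run TD(0) policy evaluation is sketched below. The environment is an assumption chosen for simplicity: a five-state random walk that starts in the middle, steps left or right with equal probability, and pays a reward of +1 only when it exits on the right. For that setup the true state values are 1/6 through 5/6, which gives something to compare against. The function name `td0_random_walk` and its hyperparameters are hypothetical.

```python
import numpy as np


def td0_random_walk(num_episodes=3000, alpha=0.02, gamma=1.0, n_states=5, seed=0):
    """Estimate state values under a random policy on a 1-D random walk with TD(0)."""
    rng = np.random.default_rng(seed)
    # Indices 0 and n_states + 1 are terminal; their values stay at 0.
    V = np.zeros(n_states + 2)
    V[1:-1] = 0.5  # neutral initial guess for the non-terminal states

    for _ in range(num_episodes):
        state = (n_states + 1) // 2  # start in the middle state
        while state not in (0, n_states + 1):
            # Random policy: step left or right with equal probability.
            next_state = state + (1 if rng.random() < 0.5 else -1)
            # Reward is +1 only when the walk exits on the right.
            reward = 1.0 if next_state == n_states + 1 else 0.0
            # TD(0) update toward the one-step bootstrapped target.
            V[state] += alpha * (reward + gamma * V[next_state] - V[state])
            state = next_state
    return V[1:-1]


values = td0_random_walk()
true_values = np.arange(1, 6) / 6.0
print("TD(0) estimates:", np.round(values, 3))
print("True values:    ", np.round(true_values, 3))
```

Because the step size is constant, the estimates keep fluctuating around the true values rather than converging exactly. A smaller `alpha` reduces that noise at the cost of slower learning.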


In [None]:
print("\n" + "=" * 60)
print("Part 2: TD(0) Implementation")
print("=" * 60)


## Part 3: n-Step TD Implementation
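
The sketch below extends the same idea to n-step returns: each state is updated from the next `n` rewards plus a bootstrapped value, or from the rewards alone if the episode ends first. It assumes the same five-state random walk described for TD(0) (start in the middle, equal-probability steps, +1 reward on the right exit). The function name `n_step_td_random_walk` and the hyperparameters are hypothetical.

```python
import numpy as np


def n_step_td_random_walk(n=3, num_episodes=3000, alpha=0.02, gamma=1.0,
                          n_states=5, seed=0):
    """Estimate state values on a 1-D random walk with n-step TD prediction."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)  # indices 0 and n_states + 1 are terminal
    V[1:-1] = 0.5
    terminals = (0, n_states + 1)

    for _ in range(num_episodes):
        states = [(n_states + 1) // 2]  # states[t] is S_t
        rewards = [0.0]                 # placeholder so rewards[t] is R_t
        T = float("inf")                # episode length, unknown until termination
        t = 0
        while True:
            if t < T:
                # Take one step with the random policy and store the outcome.
                next_state = states[t] + (1 if rng.random() < 0.5 else -1)
                states.append(next_state)
                rewards.append(1.0 if next_state == n_states + 1 else 0.0)
                if next_state in terminals:
                    T = t + 1
            tau = t - n + 1  # time step whose state is updated now
            if tau >= 0:
                # Discounted sum of up to n rewards following time tau.
                last = min(tau + n, T)
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, last + 1))
                # Bootstrap only if the episode did not end within n steps.
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V[1:-1]


values_n3 = n_step_td_random_walk(n=3)
print("3-step TD estimates:", np.round(values_n3, 3))
print("True values:        ", np.round(np.arange(1, 6) / 6.0, 3))
```

With `n=1` this reduces to the TD(0) update, and with `n` at least as large as the episode length every update uses the full Monte Carlo return. Intermediate values of `n` sit between the two.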


In [None]:
print("\n" + "=" * 60)
print("Part 3: n-Step TD Implementation")
print("=" * 60)


## Summary

### Key Concepts:
1. **TD Learning**: Combines Monte Carlo (experience) and DP (bootstrapping)
2. **TD(0)**: One-step TD, with the update V(S_t) ← V(S_t) + α[R_{t+1} + γV(S_{t+1}) - V(S_t)]
3. **n-Step TD**: Uses n-step returns, balances bias and variance
4. **Bootstrapping**: Updating an estimate from other estimates (typically faster, but biased while those estimates are inaccurate)

### Comparison:
- **Monte Carlo**: High variance, no bias, requires complete episodes
- **TD(0)**: Low variance, some bias, online (incremental)
- **n-Step TD**: Trade-off between MC and TD(0)
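
To make the trade-off concrete, the three update targets can be written side by side. These are the standard textbook definitions, with $T$ denoting the final time step of the episode. The n-step form assumes $t + n < T$; otherwise it coincides with the Monte Carlo return.

```latex
\begin{aligned}
\text{Monte Carlo:} \quad & G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T \\
\text{TD(0):} \quad & G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1}) \\
n\text{-step TD:} \quad & G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})
\end{aligned}
```

Each method applies the same update, $V(S_t) \leftarrow V(S_t) + \alpha\,[\text{target} - V(S_t)]$, and differs only in the target. More sampled rewards in the target mean more variance, while the bootstrapped term $V(\cdot)$ is the source of bias.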

### Advantages:
- Online learning (no need to wait for episode end)
- Lower variance than Monte Carlo
- Often more sample-efficient than Monte Carlo in practice
- Works for continuing tasks

### Applications:
- Value function estimation
- Policy evaluation
- Online learning scenarios

### Next Steps:
- SARSA and Q-learning (TD control)
- Eligibility traces (TD(λ))
- Compare with Monte Carlo and DP

**Reference:** Course 09, Unit 2: "Prediction and Control without a Model" - TD algorithms practical content