# Model-Based Reinforcement Learning

Model-based reinforcement learning (MBRL) aims to improve sample efficiency by learning a **model of the environment’s dynamics** and using it for **planning** or **simulated experience**.  
Instead of learning purely from environment interactions, the agent leverages its learned model to reason about future outcomes.


---

## 1. Simple Model-Based RL (Sample-Based Planning)

Train a neural model to predict:
- The **next state**: $\hat{s}' = f_\theta(s, a)$
- The **reward**: $\hat{r} = g_\phi(s, a)$
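
One common way to fit both predictors is a squared-error objective over a dataset of real transitions. The dataset symbol $\mathcal{D}$ and the choice of squared error are assumptions for illustration; the notes above do not fix a particular loss.

$$
\mathcal{L}(\theta, \phi) = \frac{1}{|\mathcal{D}|} \sum_{(s, a, r, s') \in \mathcal{D}} \left[ \lVert f_\theta(s, a) - s' \rVert^2 + \left( g_\phi(s, a) - r \right)^2 \right]
$$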

Once the model approximates the environment’s dynamics well, the agent can:
- **Sample transitions** from the model: $(s, a, \hat{r}, \hat{s}')$
- **Train a model-free RL algorithm** (e.g. DQN, A2C) using these simulated transitions.

### Example Workflow
1. Collect real transitions $(s, a, r, s')$ from the environment.  
2. Train neural networks $f_\theta$ and $g_\phi$ to predict $s'$ and $r$.  
3. Use the model to simulate new transitions.  
4. Train a policy/value function on this generated data.
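
Steps 1 to 3 of the workflow can be sketched as follows. This is a minimal illustration, not a reference implementation: the toy environment `env_step` is hypothetical, and ordinary least squares on the features `[s, a, 1]` stands in for the neural networks $f_\theta$ and $g_\phi$ so the example stays dependency-light.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Hypothetical toy environment with linear dynamics and reward."""
    s_next = 0.9 * s + 0.5 * a
    r = 1.0 - 0.2 * s_next
    return r, s_next

# 1. Collect real transitions (s, a, r, s') with a random behavior policy.
real = []
s = 0.0
for _ in range(200):
    a = rng.uniform(-1.0, 1.0)
    r, s_next = env_step(s, a)
    real.append((s, a, r, s_next))
    s = s_next
S, A, R, S_next = (np.array(col) for col in zip(*real))

# 2. Fit the dynamics and reward predictors. Least squares is an assumed
#    stand-in for training f_theta and g_phi by gradient descent.
X = np.column_stack([S, A, np.ones_like(S)])
theta, *_ = np.linalg.lstsq(X, S_next, rcond=None)
phi, *_ = np.linalg.lstsq(X, R, rcond=None)

def model(s, a):
    """Learned model: returns (r_hat, s_hat_next) for a state-action pair."""
    x = np.array([s, a, 1.0])
    return float(x @ phi), float(x @ theta)

# 3. Simulate new transitions, starting from previously visited states.
synthetic = []
for _ in range(1000):
    s_sim = float(rng.choice(S))
    a_sim = rng.uniform(-1.0, 1.0)
    r_hat, s_hat = model(s_sim, a_sim)
    synthetic.append((s_sim, a_sim, r_hat, s_hat))
```

Step 4 would pass `synthetic` to any model-free learner in place of (or alongside) real data. Because the toy environment is exactly linear, the fitted model here is essentially perfect; with a real environment the model error is what causes the model bias discussed in these notes.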

---

## 2. Dyna-Q (Combined Learning and Planning)

**Dyna-Q** (Sutton, 1990) is a **hybrid** approach that combines:
- **Model-free RL** (learning from real experience)
- **Model-based planning** (learning from simulated experience)

After each real environment step:
1. Update the **Q-function** using the real transition $(s, a, r, s')$.
2. Update the **learned model** $\hat{P}, \hat{R}$ using the same data.
3. **Sample “imaginary” experiences** from the model and perform additional Q-learning updates.

This allows the agent to “replay” and reinforce knowledge without re-interacting with the environment.
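
For reference, the Q-learning update used in steps 1 and 3 is the standard one, written here with a step size $\alpha$ and a discount factor $\gamma$ (symbols not otherwise used in these notes):

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

For an imagined transition, $r$ and $s'$ are replaced by the model's predictions.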

### Algorithm


**for** each real step $(s, a, r, s')$:

1. update $Q(s, a)$
2. update the model $(\hat{R}, \hat{P})$
3. **for** $k$ simulated steps:
   1. sample $(s_{sim}, a_{sim})$ from memory
   2. $s'_{sim}, r_{sim} = \text{model}(s_{sim}, a_{sim})$
   3. update $Q(s_{sim}, a_{sim})$
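
The loop above can be sketched as a tabular Dyna-Q agent. This is a minimal illustration under stated assumptions: the chain environment `env_step`, the hyperparameter values, and the deterministic lookup-table model are all hypothetical choices made to keep the example short.

```python
import random
from collections import defaultdict

random.seed(0)

N_STATES, GOAL = 6, 5            # hypothetical chain: states 0..5, goal on the right
ACTIONS = (-1, +1)               # move left / move right
ALPHA, GAMMA, EPSILON, K = 0.5, 0.95, 0.1, 10

def env_step(s, a):
    """Real environment: deterministic chain, reward 1 on reaching GOAL."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return (1.0 if s_next == GOAL else 0.0), s_next

Q = defaultdict(float)           # Q[(s, a)], initialized to 0
model = {}                       # model[(s, a)] = (r, s'), assumes determinism

def greedy(s):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

def q_update(s, a, r, s_next):
    """One Q-learning update; the goal state is terminal, so it has no future value."""
    future = 0.0 if s_next == GOAL else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * future - Q[(s, a)])

for episode in range(50):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS) if random.random() < EPSILON else greedy(s)
        r, s_next = env_step(s, a)
        q_update(s, a, r, s_next)                 # update Q from the real transition
        model[(s, a)] = (r, s_next)               # update the learned model
        for _ in range(K):                        # k simulated (planning) steps
            s_sim, a_sim = random.choice(list(model))
            r_sim, s_sim_next = model[(s_sim, a_sim)]
            q_update(s_sim, a_sim, r_sim, s_sim_next)
        s = s_next
```

Setting `K = 0` turns this into plain Q-learning, which makes it easy to compare how many real steps each variant needs before the greedy policy heads straight for the goal.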

---

## 3. Monte Carlo Tree Search (MCTS) and Monte Carlo Planning

Rather than sampling random one-step transitions, **Monte Carlo Tree Search (MCTS)** performs **lookahead planning** by simulating multiple future trajectories from a given state using the learned model.

The agent builds a **search tree**:
- Nodes represent states.
- Edges represent actions.
- Values are estimated via simulated rollouts.

This allows more strategic planning and decision-making at each step.
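
The search tree can be sketched with a small UCT-style MCTS. Treat it as an illustration rather than a canonical implementation: the UCB1 selection rule, the exploration constant `c`, the chain-shaped `model` function, and the constants are assumptions that do not appear in the notes above, and the model is assumed deterministic so that each action leads to a single child node.

```python
import math
import random

random.seed(0)

ACTIONS = (-1, +1)
GOAL, HORIZON, GAMMA = 4, 8, 0.8

def model(s, a):
    """Hypothetical learned model of a chain: returns (r_hat, s_hat_next, done)."""
    s_next = min(max(s + a, 0), GOAL)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

class Node:
    def __init__(self, state, done=False):
        self.state, self.done = state, done
        self.children = {}       # action -> (reward, child Node)
        self.visits = 0
        self.value = 0.0         # running mean of returns observed from this node

def rollout(s, depth):
    """Random-policy rollout inside the model; returns the discounted return."""
    ret, discount = 0.0, 1.0
    for _ in range(depth):
        r, s, done = model(s, random.choice(ACTIONS))
        ret += discount * r
        discount *= GAMMA
        if done:
            break
    return ret

def simulate(node, depth, c=1.4):
    """One MCTS iteration from `node`: select, expand, roll out, back up."""
    if node.done or depth == 0:
        ret = 0.0
    else:
        untried = [a for a in ACTIONS if a not in node.children]
        if untried:                                   # expansion + rollout
            a = random.choice(untried)
            r, s_next, done = model(node.state, a)
            child = Node(s_next, done)
            node.children[a] = (r, child)
            future = 0.0 if done else rollout(s_next, depth - 1)
            child.visits, child.value = 1, future
            ret = r + GAMMA * future
        else:                                         # UCB1 selection
            def ucb(a):
                r, child = node.children[a]
                bonus = c * math.sqrt(math.log(node.visits) / child.visits)
                return r + GAMMA * child.value + bonus
            a = max(ACTIONS, key=ucb)
            r, child = node.children[a]
            ret = r + GAMMA * simulate(child, depth - 1, c)
    node.visits += 1                                  # backup
    node.value += (ret - node.value) / node.visits
    return ret

def mcts(state, n_iter=500):
    """Run n_iter iterations and return the most-visited root action."""
    root = Node(state)
    for _ in range(n_iter):
        simulate(root, HORIZON)
    return max(root.children, key=lambda a: root.children[a][1].visits)
```

In this toy chain, `mcts(3)` selects `+1`, since that action reaches the goal in one step while every return available after moving left is discounted further.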

- MCTS can act as the **planning component** in Dyna-Q.
- Instead of sampling one-step transitions, the agent performs **multi-step rollouts** using the model to estimate expected returns for each action.
- The results are used to update Q-values or guide policy selection.
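
A simpler, tree-free version of the same idea estimates the return of each action with multi-step rollouts and writes the results into a Q-table. The sketch below is only an illustration: the chain-shaped `model`, the random rollout policy, and the constants are assumptions, and the estimates are returns under that random policy rather than optimal Q-values.

```python
import random

random.seed(0)

ACTIONS = (-1, +1)
GOAL, GAMMA = 4, 0.8

def model(s, a):
    """Hypothetical learned model of a chain: returns (r_hat, s_hat_next, done)."""
    s_next = min(max(s + a, 0), GOAL)
    done = s_next == GOAL
    return (1.0 if done else 0.0), s_next, done

def rollout_return(s, first_action, depth=10):
    """Discounted return of one rollout: take first_action, then act randomly."""
    ret, discount, a = 0.0, 1.0, first_action
    for _ in range(depth):
        r, s, done = model(s, a)
        ret += discount * r
        discount *= GAMMA
        if done:
            break
        a = random.choice(ACTIONS)
    return ret

def plan(s, n_rollouts=200):
    """Monte Carlo estimate of the return for each action from state s."""
    return {
        a: sum(rollout_return(s, a) for _ in range(n_rollouts)) / n_rollouts
        for a in ACTIONS
    }

Q = {}
state = 3
estimates = plan(state)
for a, value in estimates.items():
    Q[(state, a)] = value                    # use rollout estimates as Q-values
best_action = max(estimates, key=estimates.get)   # or use them to pick an action
```

Averaging over `n_rollouts` trajectories trades compute for lower-variance estimates; a full MCTS spends the same budget more selectively by focusing rollouts on promising branches.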

### Advantages
- Enables deeper lookahead than simple Dyna sampling.
- Averaging over many simulated rollouts can reduce the variance of value estimates.
- Strong theoretical and practical track record (e.g. AlphaGo Zero and AlphaZero, which run MCTS over known game rules rather than a learned model).

### Challenges
- Computationally expensive.
- Requires an accurate model for reliable rollouts, because prediction errors compound over multi-step trajectories.

---

## Summary Table

| Approach | Uses Learned Model? | Uses Real Env? | Planning Style | Pros | Cons |
|-----------|--------------------|----------------|----------------|------|------|
| **Simple Model-Based RL** | ✅ | Only to collect model-training data | Sampling synthetic transitions | High sample efficiency | Model bias |
| **Dyna-Q** | ✅ | ✅ | Sampled replay (mix of real + model) | Learns from both real and simulated experience | Must tune the ratio of real to simulated updates |
| **MCTS / MC Search** | ✅ | Optional | Tree-based rollouts | Strategic planning | High compute cost |

---

### References
- Sutton & Barto (2018): *Reinforcement Learning: An Introduction* (2nd ed.), Ch. 8 (Planning and Learning with Tabular Methods)
- Sutton (1990): *Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming*
- Silver et al. (2017): *Mastering the Game of Go without Human Knowledge* (AlphaGo Zero)
