### Model-Based RL

### University of Virginia
### Reinforcement Learning
#### Last updated: December 6, 2023

---


### SOURCES 

- Reinforcement Learning, RS Sutton & AG Barto, 2nd edition. Chapter 8

### LEARNING OUTCOMES

- Explain how methods like Dyna can be used
- Understand how to apply rollout algorithms to improve a policy
- Understand how Monte Carlo Tree Search works

### CONCEPTS

- Dyna and Dyna-style methods
- Using a model of the environment to simulate future states and rewards
- Distribution model vs sample model
- Rollout algorithms
- Monte Carlo Tree Search


---  

![tahiti](images/tahiti.jpg)

Models are used all the time, not just by computers but also by humans.

When we see this beautiful photo of Tahiti, we think of what a vacation might be like.

We might think of snorkeling or sunbathing. Models - simplified versions of a situation - help us plan.

### I. Working with Models for Planning

Model: anything an agent can use to predict how the environment will respond to its actions  
`f: (state, action) -> (next_state, reward)`

Model-based approaches rely on **planning**: the model generates simulated experience   
Model-free approaches rely on **learning**: real experience is generated by the environment

*Distribution models* produce all possible outcomes together with their probabilities. This is the model assumed in MDPs.

*Sample models* produce just one of the possible outcomes, sampled according to those probabilities. Example: simulating a card drawn from a deck.

Distribution models are more general: given a distribution model, we can always sample from it to simulate a result or trajectory.
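A minimal sketch of the difference, assuming a toy two-state MDP. The table `DISTRIBUTION_MODEL`, the state names, and both function names are hypothetical and chosen only for illustration. It shows that a sample model can be derived from a distribution model, while exact expectations need the distribution model.

```python
import random

# Hypothetical distribution model for a toy MDP (illustrative values only).
# Maps (state, action) -> list of (probability, next_state, reward).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "go"): [(1.0, "s0", 0.0)],
}

def sample_model(state, action, rng=random):
    """Sample model derived from the distribution model: one (next_state, reward)."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

def expected_reward(state, action):
    """Exact expectation over outcomes; this needs the full distribution."""
    return sum(p * r for p, _, r in DISTRIBUTION_MODEL[(state, action)])
```

Calling `sample_model("s0", "go")` returns a single outcome per call, whereas `expected_reward("s0", "go")` sums over every outcome. Going the other way (recovering the probabilities from a sample model) would require estimating them from many samples.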

**Using a Model**  

The main benefits of a model:

- The agent can plan ahead
- It can see what would happen for different choices
- It can decide between its options
- The results of planning can be used to improve a learned policy

**Real experience**  

Can be used to 1) update the model (model learning) and 2) directly improve the value function and policy (direct RL)
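As a rough tabular sketch of these two uses, assuming a deterministic environment and a one-step Q-learning update: the names `learn_from_real_step`, `Q`, `model`, `ACTIONS`, and the constants `ALPHA` and `GAMMA` are all hypothetical choices for illustration, not something fixed by these notes.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95        # assumed step size and discount factor
ACTIONS = ["left", "right"]     # hypothetical action set

Q = defaultdict(float)          # action-value table, Q[(state, action)]
model = {}                      # learned model: (s, a) -> (s_next, r), assumes determinism

def learn_from_real_step(s, a, r, s_next):
    """Use one real transition for both model learning and direct RL."""
    # 1) Model learning: remember how the environment responded.
    model[(s, a)] = (s_next, r)
    # 2) Direct RL: one-step Q-learning update from the same transition.
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
```

The same transition feeds both updates, so the stored `model` can later be queried for simulated experience without touching the real environment.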

---

### II. Dyna

Integrating planning, acting, and learning

[fill this in]

---

### III. Rollout Algorithms

Rollout algorithms are applied at *decision time*: when the agent encounters a new state and needs to select an action

Based on Monte Carlo control, they work like this:

- Decide on the number of trajectories to simulate and the number of time steps per trajectory
- Simulate multiple trajectories from the given state, starting with each possible action and then following the rollout policy
- For each starting action, compute the return of each of its trajectories
- Average the returns for each starting action to estimate its action value

The algorithm uses an important heuristic: the computation should be devoted to imminent events.  
Oftentimes, exhaustive search is expensive and unnecessary, as many (state, action) pairs may not be valuable.

This approach does not find an optimal policy; its simpler aim is to improve on the current policy.  
It can be surprisingly effective.  
However, it can be computationally expensive, since it runs in real time and may require many trajectories.

The simulated trajectories are independent of one another, so they can be run in parallel on multiple processors or machines.
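The steps above can be sketched roughly as follows. This is a minimal illustration rather than a reference implementation: the interface `sample_model(s, a) -> (next_state, reward, done)`, the function names, and the default parameter values are all assumptions made for the example.

```python
def rollout_action_values(state, actions, sample_model, rollout_policy,
                          n_trajectories=50, horizon=20, gamma=1.0):
    """Monte Carlo estimate of q(state, a) for each a under rollout_policy.

    sample_model(s, a) -> (next_state, reward, done) is an assumed interface.
    """
    estimates = {}
    for first_action in actions:
        total = 0.0
        for _ in range(n_trajectories):
            s, a, ret, discount = state, first_action, 0.0, 1.0
            for _ in range(horizon):
                s, r, done = sample_model(s, a)
                ret += discount * r
                discount *= gamma
                if done:
                    break
                a = rollout_policy(s)   # after the first action, follow the policy
            total += ret
        estimates[first_action] = total / n_trajectories  # average return
    return estimates

def rollout_decision(state, actions, sample_model, rollout_policy, **kwargs):
    """Pick the action with the highest estimated value; estimates are then discarded."""
    q = rollout_action_values(state, actions, sample_model, rollout_policy, **kwargs)
    return max(q, key=q.get)
```

Acting greedily with respect to these estimates is what lets the rollout decision improve on the rollout policy itself. The outer loops over actions and trajectories are independent, which is where parallel execution would apply.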

![rollout](images/rollout.png)

---

### IV. Monte Carlo Tree Search (MCTS)

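A successful example of a decision-time rollout algorithm. It helped computer Go improve from amateur level to grandmaster level.

Similar to the rollout algorithm above, but it accumulates value estimates from the Monte Carlo simulations and uses them to direct later simulations toward more rewarding trajectories.

Can be effective if a fast simulator of the environment model is available.

Uses a simple policy, called the *rollout policy*, to select actions during simulation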

Core idea:  
**successively focus multiple simulations, starting from current state and extending initial portions of promising trajectories**

MC estimates are maintained only for the subset of state-action pairs most likely to be reached in a few steps. These form a tree rooted at the current state.

Inside the tree, actions are selected using the value estimates on the edges, balancing exploration and exploitation. This selection rule is the *tree policy*.

Steps:
- **Selection** - Starting at the root, follow the tree policy to select a leaf node
- **Expansion** - On some iterations, expand the tree from the selected leaf node by adding one or more child nodes reached via unexplored actions
- **Simulation** - From the selected node or one of its new child nodes, simulate a complete episode, selecting actions with the rollout policy
- **Backup** - The return from the simulated episode is backed up to update the action values attached to the tree edges traversed by the tree policy.  
  Only values in the tree are saved; nothing is stored for states and actions visited beyond the tree.
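
The four steps can be sketched in a compact form. This is an illustrative sketch under several assumptions, not a definitive implementation: the model is deterministic with the assumed interface `model(s, a) -> (next_state, reward, done)`, one child is added on every iteration, the tree policy is UCB1 (one common choice among several), and the names `Node`, `ucb_child`, `mcts`, and the constant `c` are hypothetical.

```python
import math
import random

class Node:
    """Tree node; visits/value describe the edge (parent, action) leading here."""
    def __init__(self, state, parent=None, reward=0.0, done=False):
        self.state, self.parent = state, parent
        self.reward, self.done = reward, done   # reward received on entering
        self.children = {}                      # action -> Node
        self.visits, self.value = 0, 0.0        # value = running mean of returns

def ucb_child(node, c):
    """Tree policy (UCB1, an assumed choice): mean value plus exploration bonus."""
    return max(node.children.values(),
               key=lambda ch: ch.value
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, actions, model, rollout_policy,
         n_iter=500, horizon=50, gamma=1.0, c=1.4):
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # Selection: descend while the node is fully expanded and non-terminal.
        while not node.done and len(node.children) == len(actions):
            node = ucb_child(node, c)
        # Expansion: add one child via an untried action.
        if not node.done:
            a = random.choice([b for b in actions if b not in node.children])
            s, r, done = model(node.state, a)
            node.children[a] = Node(s, node, r, done)
            node = node.children[a]
        # Simulation: follow the rollout policy; nothing here is stored.
        ret, discount, s, done = 0.0, 1.0, node.state, node.done
        for _ in range(horizon):
            if done:
                break
            s, r, done = model(s, rollout_policy(s))
            ret += discount * r
            discount *= gamma
        # Backup: update the edge values along the path taken in the tree.
        while node is not None:
            node.visits += 1
            if node.parent is not None:
                ret = node.reward + gamma * ret
                node.value += (ret - node.value) / node.visits
            node = node.parent
    # Act at the root: here, the most-visited action (one common convention).
    return max(root.children, key=lambda a: root.children[a].visits)
```

Only `Node` objects persist between iterations, which mirrors the point that values are saved only inside the tree. A stochastic model would need chance nodes or state-keyed children, which this sketch leaves out.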


![mcts](images/mcts.png)

---