# Chapter 8: Planning and Learning with Tabular Methods

## 1. Models and Planning
- **model-based** methods:
    - require a model of the environment (DP, heuristic search)
    - rely on **planning**
- **model-free** methods:
    - do not require a model of the environment (MC, TD)
    - rely on **learning**
- at the heart of both kinds of methods is the computation of value functions


- **Model**: anything that an agent can use to predict how the environment will respond to its actions
    - **Distribution Models**: describe all possibilities and their probabilities, $p(s',r | s,a) \quad \forall s,a,s',r$
    - **Sample Models**: produce just one of the possibilities (a sampled experience for a given $s,a$)
    - A model is used to *simulate* the environment and produce *simulated experience*
        - distribution models are stronger, since they can always be used to produce samples
        - however, sample models are often much easier to obtain
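
The difference between the two kinds of model can be sketched in a few lines. This is a minimal illustration, not code from the book: the dictionary `DISTRIBUTION_MODEL`, its single entry, and both function names are hypothetical.

```python
import random

# Hypothetical distribution model: for each (state, action), every possible
# outcome as a (probability, next_state, reward) triple, i.e. p(s', r | s, a).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
}

def expected_reward(state, action):
    """Exact expectation; this needs the full distribution."""
    return sum(p * r for p, _, r in DISTRIBUTION_MODEL[(state, action)])

def sample_model(state, action, rng=random):
    """Sample model: returns a single (next_state, reward) outcome."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward
```

Here the sample model is derived from the distribution model, which shows why the latter is stronger. In practice a sample model may be all that is available, for example a simulator.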
        

- **Planning**: any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
    - **state-space** planning
    - **plan-space** planning: difficult to apply efficiently to stochastic sequential decision problems

![planning](assets/8.1.planning.png)

- state-space planning methods share a common structure:
    - compute value functions as a key intermediate step toward improving the policy
    - compute the value functions by backup (update) operations applied to simulated experience

![state-space planning](assets/8.1.state-space-planning.png)

- *planning* methods use simulated experience generated by a model
- *learning* methods use real experience generated by the environment
- example of a planning method: *random-sample one-step tabular Q-planning*

![Q-planning](assets/8.1.q-planning.png)
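
Random-sample one-step tabular Q-planning can be sketched as follows. This is a minimal sketch under stated assumptions: the sample model returns `(next_state, reward, done)`, and `chain_model` is a hypothetical deterministic 3-state chain used only for illustration.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates,
               alpha=0.1, gamma=0.95, seed=0):
    """Repeat: pick (s, a) at random, query the model, apply a Q-learning update."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(n_updates):
        s = rng.choice(states)
        a = rng.choice(actions)
        s_next, r, done = sample_model(s, a)
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

def chain_model(s, a):
    """Hypothetical model: states 0, 1, 2; reaching state 2 ends with reward 1."""
    s_next = min(s + 1, 2) if a == "right" else max(s - 1, 0)
    done = s_next == 2
    return s_next, (1.0 if done else 0.0), done

Q = q_planning(chain_model, states=[0, 1], actions=["left", "right"], n_updates=5000)
```

No real environment interaction happens here: every update uses experience simulated by the model.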

## 2. Dyna: Integrated Planning, Acting, and Learning
- 2 roles of real experience:
    - **model learning**: improve the model
        - planning with the improved model then improves the value function and policy; this route is called **indirect RL**
    - **direct RL**: directly improve the value function and policy

![Experience Relationship](assets/8.2.exp-relation.png)


- *indirect RL* (model-based)
    - makes fuller use of a limited amount of experience and thus achieves a better policy with fewer environmental interactions
- *direct RL* (model-free)
    - much simpler, and not affected by biases in the design of the model


- in Dyna agents, planning, acting, model learning, and direct RL occur simultaneously and in parallel
    - the overall structure is the Dyna architecture

![Dyna Architecture](assets/8.2.dyna-architecture.png)

- **Dyna-Q** algorithm
    - direct RL: step (d)
    - model learning: step (e); planning: step (f)

![Dyna-Q](assets/8.2.dyna-q.png)
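
The interplay of direct RL, model learning, and planning can be sketched in Python. This is an illustrative sketch, not a transcription of the boxed algorithm: it assumes a deterministic, episodic environment exposed as a hypothetical `env_step(s, a) -> (reward, next_state, done)` function, and `corridor_step` is a made-up 5-state corridor.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, actions, episodes, n_planning,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (s, a) -> (r, s_next, done); assumes deterministic dynamics

    def q_update(s, a, r, s_next, done):
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            # epsilon-greedy action selection with random tie-breaking
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s_next, done = env_step(s, a)
            q_update(s, a, r, s_next, done)      # direct RL
            model[(s, a)] = (r, s_next, done)    # model learning
            for _ in range(n_planning):          # planning from the model
                (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
                q_update(ps, pa, pr, pn, pd)
            s = s_next
    return Q

def corridor_step(s, a):
    """Hypothetical corridor: states 0..4, reward 1 on reaching state 4."""
    s_next = max(0, min(4, s + a))
    done = s_next == 4
    return (1.0 if done else 0.0), s_next, done

Q = dyna_q(corridor_step, start=0, actions=[-1, 1], episodes=30, n_planning=10)
```

With `n_planning=0` the function reduces to plain one-step tabular Q-learning, which makes it easy to compare how many real steps each variant needs.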

## 3. When the Model is Wrong
- the model may be incorrect because:
    - the environment is stochastic and only a limited number of samples have been observed
    - the model-learning function has generalized imperfectly
    - the environment has changed and its new behavior has not yet been observed
- when the model is incorrect, planning is likely to compute a suboptimal policy
    - in some cases, this quickly leads to the discovery and correction of the modeling error
- the general problem is another version of the conflict between exploration and exploitation
    - there is probably no solution that is both perfect and practical
    - in practice, simple heuristics are often effective
- **Dyna-Q+** method:
    - tracks, for each state-action pair, the number of time steps $\tau$ since it was last tried in a real interaction
    - in planning updates, adds a bonus to the modeled reward $r$: $r + \kappa\sqrt{\tau}$, for some small $\kappa$
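
The bonus is a one-line change to the reward used during planning. The sketch below assumes the agent records, per state-action pair, the time step at which it was last tried; the function and parameter names are hypothetical.

```python
import math

def planning_reward(modeled_reward, last_tried_step, current_step, kappa=1e-3):
    """Reward used in planning updates: r + kappa * sqrt(tau)."""
    tau = current_step - last_tried_step  # steps since the pair was last tried
    return modeled_reward + kappa * math.sqrt(tau)
```

For example, with `kappa=1e-3`, a pair left untried for 10,000 steps gets a bonus of 0.1, which encourages the agent to re-test long-untried transitions.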

## 4. Prioritized Sweeping
- selecting the starting state-action pairs for planning updates uniformly at random is usually not the best choice; planning should focus on particular state-action pairs
- work backward from any state whose value has changed
    - use a queue to maintain every state-action pair whose estimated value would change nontrivially if updated
    - prioritize by the size of the change

- limitation: in stochastic environments, the expected updates it uses can waste lots of computation on low-probability transitions

![Prioritized Sweeping](assets/8.4.prioritized_sweeping.png)
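
The backward-focusing idea can be sketched with Python's `heapq`. This is a simplified, planning-only sketch and not the full boxed algorithm: it assumes a known deterministic model given as a dictionary `(s, a) -> (reward, next_state)`, uses `alpha=1.0`, and allows duplicate queue entries instead of keeping only the highest priority per pair.

```python
import heapq
from collections import defaultdict

def prioritized_sweeping(model, actions, max_updates,
                         alpha=1.0, gamma=0.95, theta=1e-4):
    Q = defaultdict(float)
    predecessors = defaultdict(set)  # state -> {(s, a) predicted to lead to it}
    for (s, a), (_, s_next) in model.items():
        predecessors[s_next].add((s, a))

    def error(s, a):
        r, s_next = model[(s, a)]
        return r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]

    queue = []  # max-priority queue via negated priorities
    for (s, a) in model:
        if abs(error(s, a)) > theta:
            heapq.heappush(queue, (-abs(error(s, a)), s, a))

    updates = 0
    while queue and updates < max_updates:
        _, s, a = heapq.heappop(queue)
        Q[(s, a)] += alpha * error(s, a)
        updates += 1
        # Re-examine every pair predicted to lead into the updated state.
        for (ps, pa) in predecessors[s]:
            p = abs(error(ps, pa))
            if p > theta:
                heapq.heappush(queue, (-p, ps, pa))
    return Q, updates

# Hypothetical corridor model: states 0..3, terminal state 4, reward 1 at the goal.
corridor = {}
for s in range(4):
    for a in (-1, 1):
        s_next = max(0, min(4, s + a))
        corridor[(s, a)] = (1.0 if s_next == 4 else 0.0, s_next)

Q, updates = prioritized_sweeping(corridor, actions=[-1, 1], max_updates=100)
```

Only the pair leading into the goal has a nonzero priority at the start, so updates spread backward from the goal and the queue empties after a handful of updates.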

## 5. Expected vs Sample Updates
- one-step updates vary primarily along 3 binary dimensions
    - state values or action values
    - optimal policy or arbitrary given policy
    - expected updates or sample updates

![Backup Diagrams for one-step](assets/8.5.backup-diagrams.png)


- expected updates yield a better estimate because they are not affected by sampling error, but they require more computation
    - let $b$ be the *branching factor* (the number of possible next states); an expected update requires roughly $b$ times as much computation as a sample update
- in large problems, sample updates are preferable

![Expected vs Sample updates](assets/8.5.expected_vs_sample.png)
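
The two update types for $q_*$ can be compared side by side. This is a minimal sketch: `Q` is assumed to be a plain dictionary keyed by `(state, action)`, `outcomes` is a hypothetical list of `(probability, next_state, reward)` triples, and all function names are illustrative.

```python
def max_q(Q, state, actions):
    return max(Q.get((state, b), 0.0) for b in actions)

def expected_update(Q, outcomes, s, a, actions, gamma=0.9):
    """Sums over all b possible outcomes, weighted by their probabilities."""
    Q[(s, a)] = sum(p * (r + gamma * max_q(Q, s2, actions))
                    for p, s2, r in outcomes)

def sample_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Uses one sampled outcome and moves the estimate a step toward its target."""
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * max_q(Q, s_next, actions) - old)
```

The expected update costs one term per possible outcome, while the sample update costs a single term but leaves sampling error in the estimate.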

## 6. Trajectory Sampling
- **Trajectory Sampling**: simulates explicit individual trajectories and performs updates at the states or state-action pairs encountered along the way
- Seems both efficient and elegant
- Samples according to the **on-policy** distribution
    - results in faster planning initially but slower planning in the long run
    - in the long run, focusing on the on-policy distribution may hurt, because commonly occurring states already have accurate values; sampling other states may be more useful
    - for large problems, it can be a great advantage

## 7. Real-time Dynamic Programming - *RTDP*
- An on-policy trajectory-sampling version of the value-iteration algorithm of DP
    - an example of an asynchronous DP algorithm
- Allows states that cannot be reached by the given policy from any of the start states (*irrelevant* states) to be skipped completely
    - for certain problems, can find a policy that is optimal on the relevant states without visiting every state infinitely often (unlike methods such as *Sarsa*, whose convergence requires visiting all state-action pairs infinitely often)
    - A great advantage for very large state sets
- Selects a greedy action at each step
    - the value function approaches the optimal value function $v_*$
    - the policy used by the agent to generate trajectories approaches an optimal policy
- Strongly focuses on the subsets of states that are relevant to the problem's objective
- In the book's racetrack example, required roughly 50% of the computation needed by sweep-based value iteration
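
A compact sketch of RTDP on a small stochastic shortest-path problem is given below. It is an illustration under assumptions: the model is a hypothetical dictionary of `(probability, next_state, reward)` triples, all rewards are -1, values start at zero, and state 99 exists in the model but cannot be reached from the start state.

```python
import random

def rtdp(model, actions, start, terminal, n_trajectories,
         gamma=1.0, max_steps=1000, seed=0):
    rng = random.Random(seed)
    V = {}  # only states actually visited ever get an entry

    def q(s, a):
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in model[(s, a)])

    for _ in range(n_trajectories):
        s = start
        for _ in range(max_steps):
            if s in terminal:
                break
            qs = {a: q(s, a) for a in actions}
            best = max(qs.values())
            V[s] = best  # value-iteration update at the current state only
            action = rng.choice([a for a in actions if qs[a] == best])  # greedy
            outcomes = model[(s, action)]
            weights = [p for p, _, _ in outcomes]
            s = rng.choices(outcomes, weights=weights, k=1)[0][1]
    return V

def build_model():
    model = {}
    for s in (0, 1, 2):
        model[(s, "fwd")] = [(0.9, s + 1, -1.0), (0.1, s, -1.0)]
        model[(s, "stay")] = [(1.0, s, -1.0)]
    # State 99 is irrelevant: no trajectory from state 0 can reach it.
    model[(99, "fwd")] = [(1.0, 0, -1.0)]
    model[(99, "stay")] = [(1.0, 99, -1.0)]
    return model

V = rtdp(build_model(), ["fwd", "stay"], start=0, terminal={3}, n_trajectories=200)
```

The irrelevant state never receives a value, yet the values of the states on the greedy trajectories converge to their optimal values.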

## 8. Planning at Decision Time
- 2 ways of using planning:
    - **background planning**:
        - used to gradually improve a policy or value function on the basis of simulated experience obtained from a model (as in DP and Dyna)
        - not focused on the current state
    - **decision-time planning**:
        - begun and completed after encountering each new state $S_t$, producing a single action $A_t$
        - focused on a particular state
- in general, the two can be mixed
- decision-time planning is most useful in applications in which fast responses are not required

## 9. Heuristic Search
- A decision-time planning method
- For each state encountered, a large tree of possible continuations is considered
- The approximate value function is applied to the leaf nodes, then backed up toward the current state at the root
    - Backing up is the same as in the expected updates with maxes (those for $v_*$ and $q_*$)
    - Backing up stops at the state-action nodes for the current state
- Once the backed-up values of these nodes are computed
    - The best of them is chosen as the current action
    - All backed-up values are discarded
- Can be viewed as an extension of the idea of a greedy policy beyond a single step
    - Searching deeper than one step yields better action selections
    - The deeper the search, the more computation is required and the slower the response time
- Can be so effective because it focuses computation on the states and actions that might immediately follow the current state


- Method of heuristic search
    - Construct a search tree
    - Perform the individual one-step updates from the bottom up

![Heuristic Search](assets/8.9.heuristic-search.png)
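
A depth-limited version of this search can be sketched recursively. This is a minimal sketch under assumptions: the distribution model is a hypothetical dictionary of `(probability, next_state, reward)` triples, states missing from the model are treated as leaves, and `v_hat` stands for the approximate value function.

```python
def action_value(model, actions, v_hat, state, action, depth, gamma):
    """Expected update with a max at the next level, down to `depth` steps."""
    return sum(p * (r + gamma * state_value(model, actions, v_hat, s2, depth - 1, gamma))
               for p, s2, r in model[(state, action)])

def state_value(model, actions, v_hat, state, depth, gamma):
    if depth == 0 or (state, actions[0]) not in model:
        return v_hat(state)  # leaf: fall back on the approximate value function
    return max(action_value(model, actions, v_hat, state, a, depth, gamma)
               for a in actions)

def heuristic_search_action(model, actions, v_hat, state, depth, gamma=0.95):
    """Pick the best root action; the backed-up values are then discarded."""
    return max(actions,
               key=lambda a: action_value(model, actions, v_hat, state, a, depth, gamma))

# Hypothetical problem: "left" pays 1 immediately, "right" pays 10 one step later.
MODEL = {
    ("A", "left"): [(1.0, "L", 1.0)], ("A", "right"): [(1.0, "R", 0.0)],
    ("L", "left"): [(1.0, "end", 0.0)], ("L", "right"): [(1.0, "end", 0.0)],
    ("R", "left"): [(1.0, "end", 10.0)], ("R", "right"): [(1.0, "end", 10.0)],
}
zero_value = lambda state: 0.0
```

With `depth=1` the search is just the greedy policy and picks `"left"`; with `depth=2` it sees the delayed reward and picks `"right"`.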

## 10. Rollout Algorithms
- A decision-time planning algorithm based on MC control
    - Simulate trajectories that all begin at the current environment state
    - Estimate action values $q_\pi$ by averaging the returns of many simulated trajectories
    - Execute the action with the highest estimated value

- The goal
    - Not to estimate a complete optimal action-value function $q_*$, or a complete action-value function $q_\pi$ for a given policy $\pi$
    - Produce MC estimates of action values only for each current state and for a given policy, the **rollout policy**
    - Improve upon the rollout policy, not find an optimal policy
- The better the rollout policy and the more accurate the value estimates, the better the policy produced
- There is an important tradeoff
    - better rollout policies typically need more time to simulate enough trajectories
    - can be eased by running many trials in parallel on separate processors
    - can be eased by truncating the simulated trajectories and correcting the truncated returns by means of a stored evaluation function
- Not a learning algorithm
    - does not maintain long-term memories of values or policies
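
A rollout step for a single decision can be sketched as follows. The interfaces are assumptions made for illustration: `sample_model(s, a, rng)` returns `(next_state, reward, done)`, `rollout_policy(s, rng)` returns an action, and the corridor problem is hypothetical.

```python
import random

def rollout_action(sample_model, rollout_policy, actions, state,
                   n_rollouts=20, max_depth=50, gamma=1.0, seed=0):
    rng = random.Random(seed)

    def simulate(s, a):
        """Return of one simulated trajectory: take `a`, then follow the rollout policy."""
        total, discount = 0.0, 1.0
        for _ in range(max_depth):
            s, r, done = sample_model(s, a, rng)
            total += discount * r
            discount *= gamma
            if done:
                break
            a = rollout_policy(s, rng)
        return total

    # Monte Carlo estimates of q_pi(state, a) for the current state only.
    estimates = {a: sum(simulate(state, a) for _ in range(n_rollouts)) / n_rollouts
                 for a in actions}
    best = max(actions, key=lambda a: estimates[a])
    return best, estimates

def corridor_model(s, a, rng):
    """Hypothetical corridor: states 0..4, reward -1 per step, state 4 is terminal."""
    s_next = max(0, min(4, s + a))
    return s_next, -1.0, s_next == 4

def always_right(s, rng):
    return 1

best, estimates = rollout_action(corridor_model, always_right, [-1, 1], state=2)
```

The estimates are thrown away after the action is chosen, which is why this is planning rather than learning.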

## 11. Monte Carlo Tree Search
- A successful example of decision-time planning
- A rollout algorithm enhanced by a means of accumulating value estimates from the MC simulations
- Used in games and in single-agent sequential decision problems when the environment model is simple enough for fast multistep simulation
- Executed after encountering each new state to select an action for that state
    - Each execution is an iterative process that simulates many trajectories starting from the current state
- The core idea is to successively focus multiple simulations starting at the current state
    - benefits from online, incremental, sample-based value estimation and policy improvement
    - avoids the problem of globally approximating an action-value function while retaining the benefit of using past experience to guide exploration
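
The selection, expansion, simulation, and backup phases can be sketched compactly. This is a simplified illustration, not a reference implementation: it uses a UCB1 tree policy (an assumption, since the notes do not name one), keys statistics by state rather than by explicit tree nodes, and assumes a hypothetical `sample_model(s, a, rng)` returning `(next_state, reward, done)`.

```python
import math
import random

def mcts_action(sample_model, actions, root, n_iterations,
                max_depth=20, c=1.4, gamma=1.0, seed=0):
    rng = random.Random(seed)
    N, Q = {}, {}   # visit counts and mean returns for (state, action) in the tree
    tree = set()    # states that have been expanded

    def rollout(s, depth):
        """Simulation phase: random rollout policy outside the tree."""
        total, discount = 0.0, 1.0
        for _ in range(depth):
            s, r, done = sample_model(s, rng.choice(actions), rng)
            total += discount * r
            discount *= gamma
            if done:
                break
        return total

    def simulate(s, depth):
        if depth == 0:
            return 0.0
        if s not in tree:  # expansion, followed by a rollout from the new node
            tree.add(s)
            for a in actions:
                N[(s, a)], Q[(s, a)] = 0, 0.0
            return rollout(s, depth)
        n_s = sum(N[(s, a)] for a in actions)

        def ucb(a):  # selection: untried actions first, then UCB1
            if N[(s, a)] == 0:
                return math.inf
            return Q[(s, a)] + c * math.sqrt(math.log(n_s) / N[(s, a)])

        a = max(actions, key=ucb)
        s_next, r, done = sample_model(s, a, rng)
        g = r if done else r + gamma * simulate(s_next, depth - 1)
        N[(s, a)] += 1                       # backup
        Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
        return g

    for _ in range(n_iterations):
        simulate(root, max_depth)
    return max(actions, key=lambda a: N[(root, a)])  # most-visited root action

def corridor_model(s, a, rng):
    """Hypothetical corridor: states 0..4, reward -1 per step, state 4 is terminal."""
    s_next = max(0, min(4, s + a))
    return s_next, -1.0, s_next == 4

action = mcts_action(corridor_model, [-1, 1], root=2, n_iterations=1000)
```

Unlike the plain rollout algorithm, the statistics in `N` and `Q` accumulate across iterations and steer later simulations toward the more promising actions.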

## 12. Summary
- 3 key ideas in common:
    - Estimate value functions
    - Operate by backing up values along actual or possible state trajectories
    - Follow the general strategy of GPI (*generalized policy iteration*)

![Space of RL](assets/8.11.space-of-rl.png)


- 3rd dimension: *on-policy* or *off-policy*
- most important dimension: **function approximation**, covered in Part II of the book