# Chapter 5: Monte Carlo Methods

## 1. Introduction
- Do not assume complete knowledge of the environment
- Learning from experience of interaction with environment
    - Experience is divided into **episodes**
    - Based on averaging sample returns - complete returns of each episode
- Like an associative bandit
    - Nonstationary from the point of view of the earlier state
- Adapt the idea of GPI from DP

## 2. Monte Carlo Prediction
- Learning the state-value function $v_\pi(s)$ for a given policy $\pi$
- Estimate from experience by averaging the returns observed after visits to that state $s$
- a **visit** to $s$: each occurrence of state $s$ in an episode
- 2 methods:
    - **first-visit** MC: average of the returns only for the first visits to $s$
    - **every-visit** MC: average of the returns for all visits to $s$

- Only one choice is considered at each state (unlike DP, which considers all possible transitions): only the transitions sampled in the one episode
- Estimates for each state are independent
    - computational expense of estimating a single state is independent of the number of states
- Do not bootstrap (unlike DP)

- first-visit MC algorithm

![First-Visit MC](assets/5.1.first-visit-mc.png)
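
The first-visit procedure can be sketched in Python roughly as follows. This is a minimal illustration, not the pseudocode in the figure: `generate_episode` is a hypothetical callable that returns one episode under the policy being evaluated as a list of `(state, reward)` pairs, where the reward is the one received after leaving that state.

```python
from collections import defaultdict


def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi(s) by averaging the returns that follow first visits to s.

    generate_episode (hypothetical interface) returns a list of
    (state, reward) pairs: S_t together with R_{t+1}.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for _episode in range(num_episodes):
        episode = generate_episode()

        # Time step of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (state, _reward) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Walk backwards so the return G can be accumulated incrementally.
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            g = gamma * g + reward
            if first_visit[state] == t:  # skip later visits to the same state
                returns_sum[state] += g
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Toy usage: a fixed episode that visits A twice, then B, then terminates.
toy_episode = [("A", 1.0), ("A", 1.0), ("B", 2.0)]
values = first_visit_mc_prediction(lambda: list(toy_episode), num_episodes=10)
# With gamma = 1 the first-visit return is 4.0 for A and 2.0 for B.
```

Dropping the `first_visit[state] == t` check would turn this into every-visit MC, which for the toy episode would average the returns 4.0 and 3.0 for state A.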

## 3. Monte Carlo Estimation of Action Values
- If a model is not available, it is particularly useful to estimate action values rather than state values, with the goal of estimating $q_*$
- Average the returns obtained when starting from state $s$, taking action $a$, and thereafter following policy $\pi$
- a **visit** to the pair $s,a$: each occurrence in an episode of state $s$ with action $a$ taken in it
- 2 methods:
    - first-visit MC: average of the returns only for the first visits to $s, a$
    - every-visit MC: average of the returns for all visits to $s, a$
- Need to estimate the value of all the actions from each state
- **Exploring starts**: every pair $s,a$ has a nonzero probability of being selected as the start of an episode
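
As a rough sketch of the two ideas above, the snippet below picks a start pair uniformly at random (one simple way to give every pair nonzero probability) and extracts the first-visit return of each state-action pair from a single episode. The names `exploring_start` and `first_visit_returns`, and the `(state, action, reward)` episode format, are assumptions made for this illustration.

```python
import random


def exploring_start(states, actions_for, rng=random):
    """Pick a start pair uniformly, so every (s, a) has nonzero probability.

    actions_for (hypothetical) maps a state to its list of available actions.
    """
    state = rng.choice(states)
    action = rng.choice(actions_for(state))
    return state, action


def first_visit_returns(episode, gamma=1.0):
    """Return {(s, a): G} using the first visit to each pair in one episode.

    episode is a list of (state, action, reward) triples.
    """
    g = 0.0
    returns = {}
    # Iterating backwards, an earlier visit overwrites a later one, so the
    # value left in the dict belongs to the first visit of each pair.
    for state, action, reward in reversed(episode):
        g = gamma * g + reward
        returns[(state, action)] = g
    return returns


episode = [("s0", "a", 1.0), ("s0", "b", 0.0), ("s0", "a", 2.0)]
pair_returns = first_visit_returns(episode)
```

Averaging these per-pair returns over many episodes gives the first-visit estimate of $q_\pi(s,a)$.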

## 4. Monte Carlo Control
- Use GPI
$$\pi_0 \stackrel{E}{\longrightarrow} q_{\pi_0} \stackrel{I}{\longrightarrow} \pi_1 \stackrel{E}{\longrightarrow} q_{\pi_1} \stackrel{I}{\longrightarrow} \pi_2 \stackrel{E}{\longrightarrow} ... \stackrel{I}{\longrightarrow} \pi_* \stackrel{E}{\longrightarrow} q_{\pi_*}$$

- Policy evaluation $\stackrel{E}{\longrightarrow}$: using MC methods for prediction
- Policy improvement $\stackrel{I}{\longrightarrow}$: make the policy greedy with respect to the current value function
    - meets the conditions for policy improvement by the policy improvement theorem
    - if $\pi_{k+1}=\pi_k$, then $\pi_k=\pi_*$
$$
\begin{aligned}
q_{\pi_k}\big(s,\pi_{k+1}(s)\big) &= q_{\pi_k}\big(s,\arg\max_a q_{\pi_k}(s,a)\big)
\\ &= \max_a q_{\pi_k}(s,a)
\\ &\ge q_{\pi_k}\big(s,\pi_k(s)\big)
\\ &\ge v_{\pi_k}(s)
\end{aligned}
$$

- Convergence assumptions:
    - (1) episodes have exploring starts
    - (2) policy evaluation could be done with an infinite number of episodes
 
- Need to remove both assumptions in order to obtain a practical algorithm
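- Ways to remove assumption (2):
    - Obtain bounds on the magnitude and probability of error in the estimates, and take enough steps in each policy evaluation to ensure these bounds are sufficiently small
    - Give up completing policy evaluation before returning to policy improvement; value iteration is an extreme form, with only one iteration of policy evaluation between improvement steps
    - In-place value iteration is even more extreme: alternate between improvement and evaluation steps for single states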
    
- Monte Carlo Exploring Starts (*Monte Carlo ES*)
    - alternate between evaluation and improvement on an episode-by-episode basis
    - convergence to the optimal fixed point (the optimal policy $\pi_*$) seems inevitable, though it has not been formally proved in general

![Monte Carlo Exploring Starts](assets/5.3.mc-es.png)
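
A minimal Python sketch of Monte Carlo ES is given below, under several assumptions that are not part of the notes: the environment is exposed through a hypothetical `step(state, action)` function returning `(next_state, reward, done)`, every state has the same action list, action values are kept as incremental averages, and episodes are cut off after `max_steps` as a guard against policies that never terminate.

```python
import random
from collections import defaultdict


def monte_carlo_es(states, actions, step, num_episodes,
                   gamma=1.0, seed=0, max_steps=100):
    """Monte Carlo control with exploring starts (illustrative sketch)."""
    rng = random.Random(seed)
    q = defaultdict(float)      # action-value estimates, default 0
    counts = defaultdict(int)   # number of first-visit returns averaged
    policy = {s: rng.choice(actions) for s in states}  # arbitrary start

    for _episode in range(num_episodes):
        # Exploring start: every (s, a) pair can begin an episode.
        state, action = rng.choice(states), rng.choice(actions)
        episode = []
        for _t in range(max_steps):  # assumption: truncate long episodes
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            if done:
                break
            state, action = next_state, policy[next_state]

        first = {}
        for t, (s, a, _r) in enumerate(episode):
            first.setdefault((s, a), t)

        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            if first[(s, a)] == t:
                # Evaluation: incremental average of first-visit returns.
                counts[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
                # Improvement: make the policy greedy at s.
                policy[s] = max(actions, key=lambda b: q[(s, b)])
    return policy, q


def toy_step(state, action):
    """Toy two-state task: 'left' ends the episode with reward 0;
    'right' moves 0 -> 1, and from 1 ends the episode with reward 1."""
    if action == "left":
        return None, 0.0, True
    if state == 0:
        return 1, 0.0, False
    return None, 1.0, True


es_policy, es_q = monte_carlo_es([0, 1], ["left", "right"], toy_step,
                                 num_episodes=500)
```

On this toy task the greedy policy should settle on `"right"` in both states, since that is the only way to collect the reward.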

- Monte Carlo without Exploring Starts
    - **on-policy** methods: evaluate or improve the policy that is used to make decisions
    - **off-policy** methods: evaluate or improve a policy different from the one used to generate the data

## 5. On-Policy method
- Learn about the policy currently being executed
- Policy is generally soft: $\pi(a | s) > 0$ for all $s\in\mathcal S$ and $a\in\mathcal A(s)$
- ε-soft policy like ε-greedy:
    - each nongreedy action has probability $\dfrac{\epsilon}{| \mathcal A(s) |}$
    - the greedy action has probability $1-\epsilon+\dfrac{\epsilon}{| \mathcal A(s) |}$
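
The two probabilities above can be computed directly. The sketch below is only an illustration: `epsilon_greedy_probs` and `sample_action` are hypothetical helper names, and the action values for one state are assumed to be stored in a dict.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy in one state.

    q_values maps each available action to its current estimate.
    """
    n = len(q_values)
    greedy = max(q_values, key=q_values.get)
    probs = {a: epsilon / n for a in q_values}  # every action gets eps/|A(s)|
    probs[greedy] += 1.0 - epsilon              # greedy action gets the rest
    return probs


def sample_action(probs, rng=random):
    """Draw one action according to the given probabilities."""
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


probs = epsilon_greedy_probs({"a": 1.0, "b": 3.0, "c": 2.0}, epsilon=0.3)
action = sample_action(probs, random.Random(0))
```

Every action keeps probability at least $\epsilon / |\mathcal A(s)|$, which is what makes the policy ε-soft.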

- By the policy improvement theorem, any ε-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement over any ε-soft policy $\pi$:
$$
\begin{aligned}
q_\pi\big(s,\pi'(s)\big) &= \sum_a \pi'(a | s) q_\pi(s,a)
\\ &= \frac{\epsilon}{| \mathcal A(s) |}\sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a)
\\ &\ge \frac{\epsilon}{| \mathcal A(s) |}\sum_a q_\pi(s,a) + (1-\epsilon)\sum_a \frac{\pi(a | s)-\frac{\epsilon}{| \mathcal A(s) |}}{1-\epsilon} q_\pi(s,a)
\\ &= \frac{\epsilon}{| \mathcal A(s) |}\sum_a q_\pi(s,a) - \frac{\epsilon}{| \mathcal A(s) |}\sum_a q_\pi(s,a) + \sum_a \pi(a | s) q_\pi(s,a)
\\ &= v_\pi(s)
\end{aligned}
$$
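
The inequality can also be sanity-checked numerically for a single state. The sketch below is not a proof and uses made-up data: it draws random action values and random ε-soft policies (built as $\epsilon/|\mathcal A(s)|$ plus a scaled random distribution, an assumption of this example) and compares $\sum_a \pi'(a|s)\,q(s,a)$ with $\sum_a \pi(a|s)\,q(s,a)$.

```python
import random


def expected_q(probs, q_values):
    """Sum over actions of pi(a|s) * q(s, a) for one state."""
    return sum(p * q for p, q in zip(probs, q_values))


def epsilon_greedy(q_values, epsilon):
    """Epsilon-greedy action probabilities for a list of action values."""
    n = len(q_values)
    probs = [epsilon / n] * n
    probs[q_values.index(max(q_values))] += 1.0 - epsilon
    return probs


def random_epsilon_soft(n, epsilon, rng):
    """A random policy with pi(a|s) >= epsilon / n for every action."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [epsilon / n + (1.0 - epsilon) * w / total for w in weights]


rng = random.Random(0)
epsilon, n = 0.1, 4
gaps = []
for _trial in range(1000):
    q_values = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    pi = random_epsilon_soft(n, epsilon, rng)
    pi_prime = epsilon_greedy(q_values, epsilon)
    gaps.append(expected_q(pi_prime, q_values) - expected_q(pi, q_values))
# Every gap should be non-negative, up to floating-point rounding.
```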

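- Converges to the best ε-soft policy: improvement stops only when $v_\pi(s)=\tilde v_*(s)$ for all $s\in\mathcal S$, where $\tilde v_*$ is the optimal value function among ε-soft policies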

![On-Policy e-soft](assets/5.4.e-soft.png)
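
A minimal sketch of on-policy first-visit MC control with an ε-greedy policy follows. It rests on assumptions not found in the notes: a hypothetical `step(state, action)` function returning `(next_state, reward, done)`, a fixed start state, the same action list in every state, incremental averaging of returns, and a `max_steps` cut-off for long episodes.

```python
import random
from collections import defaultdict


def on_policy_mc_control(states, actions, step, start_state, num_episodes,
                         epsilon=0.1, gamma=1.0, seed=0, max_steps=100):
    """On-policy first-visit MC control for epsilon-soft policies (sketch)."""
    rng = random.Random(seed)
    q = defaultdict(float)
    counts = defaultdict(int)
    n = len(actions)
    # The uniform policy is epsilon-soft for any epsilon <= 1.
    policy = {s: {a: 1.0 / n for a in actions} for s in states}

    for _episode in range(num_episodes):
        state, episode = start_state, []
        for _t in range(max_steps):  # assumption: truncate long episodes
            weights = [policy[state][a] for a in actions]
            action = rng.choices(actions, weights=weights)[0]
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state

        first = {}
        for t, (s, a, _r) in enumerate(episode):
            first.setdefault((s, a), t)

        g = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            g = gamma * g + r
            if first[(s, a)] == t:
                counts[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
                # Make the policy epsilon-greedy with respect to q at s.
                best = max(actions, key=lambda b: q[(s, b)])
                for b in actions:
                    policy[s][b] = epsilon / n
                policy[s][best] += 1.0 - epsilon
    return policy, q


def toy_step(state, action):
    """Toy two-state task: 'left' ends the episode with reward 0;
    'right' moves 0 -> 1, and from 1 ends the episode with reward 1."""
    if action == "left":
        return None, 0.0, True
    if state == 0:
        return 1, 0.0, False
    return None, 1.0, True


soft_policy, soft_q = on_policy_mc_control(
    [0, 1], ["left", "right"], toy_step, start_state=0,
    num_episodes=5000, epsilon=0.2)
```

Unlike Monte Carlo ES, no exploring starts are needed here: the ε-soft policy itself keeps selecting every action with nonzero probability, so the learned policy is the best ε-soft one rather than a deterministic greedy policy.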

## 6. Off-policy method
