# Monte-Carlo Methods for Predictions and Control

> This is a summary of the lecture "Sample-based Learning Methods" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## What is Monte Carlo?

A pure Dynamic Programming approach requires the agent to know the environment's transition probabilities. This is often impractical:

- In some problems, the transition probabilities are unknown.
- Even when they can be derived, the computation can be error-prone and tedious.

**Monte Carlo** (MC) methods estimate values by averaging over a large number of random samples. In reinforcement learning, the samples are the returns observed in complete episodes.

### MC prediction, for estimating $V \approx v_{\pi} $

$ \begin{aligned} 
    &\text{Input: } \text{a policy } \pi \text{ to be evaluated} \\ 
    &\text{Initialize: } \\
    &\quad V(s) \in \mathbb{R} \text{, arbitrarily, for all } s \in \mathcal{S} \\
    &\quad Returns(s) \leftarrow \text{ an empty list, for all } s \in \mathcal{S} \\
    \newline
    &\text{Loop forever (for each episode):} \\
    &\quad \text{Generate an episode following } \pi: S_0, A_0, R_1, S_1, A_1, R_2, \dots, S_{T-1}, A_{T-1}, R_T \\
    &\quad G \leftarrow 0 \\
    &\quad \text{Loop for each step of episode, } t = T-1, T-2, \dots, 0: \\
    &\qquad G \leftarrow \gamma G + R_{t+1} \\
    &\qquad \text{Append } G \text{ to } Returns(S_t) \\
    &\qquad V(S_t) \leftarrow \text{ average} (Returns(S_t)) \\
\end{aligned} $
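
The pseudocode above can be sketched in Python roughly as follows. This is a minimal every-visit version (the pseudocode has no first-visit check), not a reference implementation. The two-state environment `toy_episode` and the names `mc_prediction` and `generate_episode` are hypothetical and introduced only for illustration.

```python
import random
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Every-visit Monte Carlo prediction.

    generate_episode() is assumed to return a list of (state, reward)
    pairs, where reward is R_{t+1}, received after leaving S_t.
    """
    returns = defaultdict(list)   # Returns(s)
    V = defaultdict(float)        # V(s), initialised to 0 here
    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards: t = T-1, T-2, ..., 0
        for state, reward in reversed(episode):
            G = gamma * G + reward
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])
    return V

# Hypothetical toy environment: chain "A" -> "B" -> terminal.
rng = random.Random(0)

def toy_episode():
    # Leaving A always pays +1; leaving B pays +1 or -1 with equal probability.
    return [("A", 1.0), ("B", rng.choice([1.0, -1.0]))]

V = mc_prediction(toy_episode, num_episodes=2000)
print(V["A"], V["B"])
```

In this toy chain the true values are $v_{\pi}(A) = 1$ and $v_{\pi}(B) = 0$, so the estimates should approach those numbers as the number of episodes grows.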

### Incremental Update

$ NewEstimate \leftarrow OldEstimate + StepSize [ Target - OldEstimate] $
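
As a small illustration of this rule, the sketch below uses a step size of $1/n$, which makes the incremental estimate equal to the sample average without storing a list of returns. The function name `incremental_mean` and the sample values are made up for the example.

```python
def incremental_mean(samples):
    """Apply NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)."""
    estimate = 0.0
    for n, target in enumerate(samples, start=1):
        step_size = 1.0 / n          # 1/n reproduces the plain average
        estimate = estimate + step_size * (target - estimate)
    return estimate

returns = [1.0, -1.0, 1.0, 0.0]      # made-up returns for one state
estimate = incremental_mean(returns)
print(estimate, sum(returns) / len(returns))
```

Replacing `1.0 / n` with a constant step size would instead weight recent returns more heavily, which is a common choice when the target is non-stationary.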

## Using Monte Carlo for Prediction

### Example - Blackjack

Collect cards so that their sum is as large as possible without exceeding 21.

#### Problem Formulation
- Undiscounted MDP where each game of blackjack corresponds to an episode
- Reward: -1 for a loss, 0 for a draw, and 1 for a win
- Action: Hit or Stick
- States (200 in total):
    - Whether the player has a usable ace (Yes or No)
    - The sum of the player's cards (12-21)
    - The card the dealer shows (Ace-10)
- Cards are dealt from a deck with replacement
- Policy: Stops requesting cards when the player's sum is 20 or 21
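
The formulation above could be simulated along these lines. This is a simplified sketch, not the course's environment. It assumes the dealer hits until reaching 17 or more, that the player always hits below 12 (so those sums are never states), and it does not treat a natural (21 on the first two cards) specially. All function names are hypothetical.

```python
import random

def draw_card(rng):
    # Ranks 1..13 drawn with replacement; J, Q, K count as 10, ace as 1.
    return min(rng.randint(1, 13), 10)

def hand_value(cards):
    """Return (sum, usable_ace), counting one ace as 11 if that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def policy(player_sum):
    # Policy from the text: stick on 20 or 21, otherwise hit.
    return "stick" if player_sum >= 20 else "hit"

def play_episode(rng):
    """Play one game; return ([(state, action), ...], final_reward)."""
    player = [draw_card(rng), draw_card(rng)]
    dealer = [draw_card(rng), draw_card(rng)]
    # Below 12 a hit can never bust, so keep drawing before states begin.
    while hand_value(player)[0] < 12:
        player.append(draw_card(rng))
    trajectory = []
    while True:
        total, usable = hand_value(player)
        state = (usable, total, dealer[0])   # dealer[0] is the showing card
        action = policy(total)
        trajectory.append((state, action))
        if action == "stick":
            break
        player.append(draw_card(rng))
        if hand_value(player)[0] > 21:
            return trajectory, -1            # player goes bust
    # Assumed dealer rule: hit until the sum is 17 or more.
    while hand_value(dealer)[0] < 17:
        dealer.append(draw_card(rng))
    player_total = hand_value(player)[0]
    dealer_total = hand_value(dealer)[0]
    if dealer_total > 21 or player_total > dealer_total:
        return trajectory, 1
    if player_total == dealer_total:
        return trajectory, 0
    return trajectory, -1

trajectory, reward = play_episode(random.Random(0))
print(trajectory, reward)
```

Because the only non-zero reward arrives at the end of the game and the MDP is undiscounted, the return from every state visited in an episode equals that final reward.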

### Implications of Monte Carlo learning

- We do not need to keep a **large model** of the environment.

- We are estimating the value of an individual state **independently** of the values of other states

- The **computation** needed to update the value of each state does not depend on the size of the MDP.

## Using Monte Carlo for Action Values

### Monte Carlo methods in RL

$ v_{\pi}(s) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s] \\
q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a] $

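To estimate action values, every state-action pair must be visited, so the agent has to explore. One option is **exploring starts**, where each episode begins from a randomly chosen initial state and action. Alternatively, we can follow a stochastic policy such as an **epsilon-greedy** policy.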
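
As an illustration of the epsilon-greedy idea, the sketch below picks a uniformly random action with probability `epsilon` and the greedy action otherwise. The function name, the list-of-values representation, and the example numbers are assumptions made for this example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q = [0.1, 0.5, 0.2]                     # made-up action values for one state
greedy_action = epsilon_greedy(q, epsilon=0.0)
explored = {epsilon_greedy(q, epsilon=1.0, rng=random.Random(0)) for _ in range(1)}
print(greedy_action)
```

With `epsilon > 0` every action keeps a non-zero probability of being selected, which is what allows all action values to keep being updated.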

## Using Monte Carlo methods for Generalized Policy Iteration

### Monte Carlo Generalized Policy Iteration

![gpi](image/gpi.png)

Generalized policy iteration produces a sequence of policies ($\pi_0 \rightarrow \pi_1 \rightarrow \pi_2 \dots$) by alternating two steps:

- Improvement:

$\pi_{k+1}(s) \doteq \arg \max_{a} q_{\pi_k}(s, a)$

- Evaluation:

Use Monte Carlo prediction to estimate $q_{\pi_k}$.

### Monte Carlo methods with ES (Exploring Starts), for estimating $\pi \approx \pi_{*}$

$\begin{aligned} & \text{Initialize:} \\ 
 & \quad \pi(s) \in \mathcal{A}(s) \text{ (arbitrarily), for all } s \in \mathcal{S} \\
 & \quad Q(s, a) \in \mathbb{R} \text{ (arbitrarily), for all } s \in \mathcal{S}, a \in \mathcal{A}(s) \\
 & \quad Returns(s, a) \leftarrow \text{ empty list, for all } s \in \mathcal{S}, a \in \mathcal{A}(s) \\
 \newline
 & \text{Loop forever (for each episode):} \\
 & \quad \text{Choose } S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0) \text{ randomly such that all pairs have probability } > 0 \\
 & \quad \text{Generate an episode from } S_0, A_0, \text{following } \pi: S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T \\
 & \quad G \leftarrow 0 \\
 & \quad \text{Loop for each step of episode, } t=T-1, T-2, \dots, 0: \\
 & \qquad G \leftarrow \gamma G + R_{t+1} \\
 & \qquad \text{Append } G \text{ to } Returns(S_t, A_t) \\
 & \qquad Q(S_t, A_t) \leftarrow \text{ average}(Returns(S_t, A_t)) \\
 & \qquad \pi(S_t) \leftarrow \arg \max_a Q(S_t, a) 
 \end{aligned} $
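
A minimal Python sketch of the algorithm above is shown below, again as an every-visit variant. The two-state MDP defined by `step` is hypothetical and exists only so that the example runs. `STATES`, `ACTIONS`, and `mc_exploring_starts` are likewise illustrative names.

```python
import random
from collections import defaultdict

STATES = (0, 1)
ACTIONS = (0, 1)

def step(state, action, rng):
    """Hypothetical toy MDP. Returns (reward, next_state); None means terminal."""
    if state == 0:
        return (0.0, 1) if action == 0 else (1.0, None)
    if action == 0:
        return 0.0, None
    return rng.choice([1.0, 3.0]), None      # noisy reward with mean 2

def mc_exploring_starts(num_episodes, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                    # Q(s, a), initialised to 0
    returns = defaultdict(list)               # Returns(s, a)
    policy = {s: ACTIONS[0] for s in STATES}  # arbitrary initial policy
    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has probability > 0.
        state, action = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while state is not None:
            reward, next_state = step(state, action, rng)
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]        # follow pi after the first step
        G = 0.0
        for s, a, r in reversed(episode):
            G = gamma * G + r
            returns[(s, a)].append(G)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            policy[s] = max(ACTIONS, key=lambda b: Q[(s, b)])   # greedy improvement
    return policy, Q

policy, Q = mc_exploring_starts(num_episodes=2000)
print(policy, dict(Q))
```

In this toy MDP the best behaviour is action 0 in state 0 (move on to state 1) and action 1 in state 1, so the learned policy is expected to settle on that pair. Note that evaluation and improvement are interleaved after every step of every episode rather than run to convergence in turn.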