# Chapter 5: Sampling Methods
Generally, any method based on averaging complete returns is referred to as a Monte Carlo (MC) method. Relying on complete returns limits the method's scope to episodic tasks. 
The main idea here is to let our agent learn from experience.
## Monte Carlo Prediction 
### State Value function Prediction
Let's start with predicting the value function. There is 1 algorithm with 2 variants: 
1. First visit prediction
2. Every visit prediction
Let's start with the first:

* Input: the policy $\pi$ to be evaluated
1. Initialize $V(s)$ arbitrarily and set returns($s$) to an empty list for each state $s$
2. Loop (for each episode):
    * Generate an episode following $\pi$: $S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T$
    * $G \leftarrow 0$
    * for each step of the episode, $t = T - 1, T - 2, ..., 0$:
        * $G \leftarrow \gamma \cdot G + R_{t+1}$
        * if $t$ is the first occurrence of $S_t$ (i.e. $S_t$ does not appear in $S_0, ..., S_{t-1}$):
            * append $G$ to returns($S_t$)
            * $V(S_t) \leftarrow$ avg(returns($S_t$))

The 2nd variant (every visit) appends $G$ at every occurrence of $S_t$ in the episode, not only the first one.
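
The two variants can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the episode format (a list of `(state, reward)` pairs, where the reward is $R_{t+1}$) and the toy episodes are assumptions made for the illustration.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) by averaging the returns observed after visits to s.

    `episodes` is a list of episodes, each a list of (state, reward) pairs,
    where reward is R_{t+1}, the reward received after leaving S_t.
    """
    returns = defaultdict(list)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk backwards so that G is the discounted return from step t.
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # First visit: skip if the state already occurred earlier.
            if first_visit and state in states[:t]:
                continue
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Hypothetical toy data: two episodes over the states "a" and "b".
episodes = [
    [("a", 1.0), ("b", 0.0), ("a", 2.0)],
    [("b", 1.0)],
]
v_first = mc_prediction(episodes, first_visit=True)
v_every = mc_prediction(episodes, first_visit=False)
```

The only difference between the two calls is whether the second visit to `"a"` in the first episode contributes a return.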

* Monte Carlo methods offer several advantages over DP. Unlike DP, they do not need the complete dynamics of the environment.
* The value of a given state is estimated independently of the other states. This is particularly useful if only a subset of states is of interest.
* The computation needed to update the value of a state does not depend on the size of the MDP.

### Action Value function Prediction
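In the absence of a model, state values alone are not sufficient: picking the best action from $V$ requires a one-step lookahead through the environment's dynamics. In this case, we estimate the values of state-action pairs $Q(s, a)$ instead. The algorithm is basically the same, with visits to a pair $(s, a)$ replacing visits to a state $s$. Nevertheless, for a given policy, it is quite possible that some state-action pairs are never visited, which leaves their values without any estimate.  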

This serious issue is solved by *exploring starts*: choosing the initial state-action pair randomly, with every pair having a non-zero probability of being selected, guarantees that each pair is visited an infinite number of times in the limit of an infinite number of episodes. 

## Monte Carlo ES Policy Control
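1. Initialize:
    * $\pi(s)$: an arbitrary policy
    * $Q(s, a)$: arbitrary state-action values, with returns($s, a$) an empty list for each pair
2. Loop (for each episode):
    1. Choose $S_0, A_0$ such that each pair has a probability $p > 0$ of being chosen
    2. Generate an episode starting from $S_0, A_0$ and following $\pi$: $S_0, A_0, R_1, S_1, A_1, ..., S_{T - 1}, A_{T - 1}, R_T$
    3. $G \leftarrow 0$, then loop through each step of the episode, $t = T - 1, T - 2, ..., 0$:
        * $G \leftarrow \gamma \cdot G + R_{t+1}$
        * if the pair $S_t, A_t$ appears for the first time (it does not appear earlier in the episode):
            * append $G$ to returns($S_t, A_t$) and set $Q(S_t, A_t) \leftarrow$ avg(returns($S_t, A_t$))
            * update the policy: $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$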
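
To make the loop concrete, here is a minimal sketch of Monte Carlo control with exploring starts. The two-state deterministic MDP (`TOY_MDP`), the function name and the hyperparameters are all hypothetical and chosen only so that the example stays short.

```python
import random
from collections import defaultdict

# Hypothetical deterministic toy MDP: (state, action) -> (reward, next_state).
# A next_state of None marks the end of the episode.
TOY_MDP = {
    (0, 0): (0.0, 1), (0, 1): (1.0, None),
    (1, 0): (0.0, None), (1, 1): (5.0, None),
}
STATES, ACTIONS = [0, 1], [0, 1]

def mc_exploring_starts(n_episodes=2000, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: 0 for s in STATES}          # arbitrary initial policy
    for _ in range(n_episodes):
        # Exploring start: every (state, action) pair has probability > 0.
        state, action = rng.choice(STATES), rng.choice(ACTIONS)
        episode = []
        while state is not None:
            reward, next_state = TOY_MDP[(state, action)]
            episode.append((state, action, reward))
            state = next_state
            if state is not None:
                action = policy[state]       # follow pi after the first step
        G = 0.0
        pairs = [(s, a) for s, a, _ in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:      # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(ACTIONS, key=lambda b: Q[(s, b)])
    return Q, policy

Q, policy = mc_exploring_starts()
```

In this toy problem the greedy policy should end up taking action 0 in state 0 (to reach the larger reward available in state 1) and action 1 in state 1.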

## The problem with Exploring Starts
Even though giving every state-action pair a non-zero probability of being the starting pair solves the exploration problem, it cannot be seen as a general solution, mainly when the agent is interacting with an actual environment where episodes cannot be started from arbitrary pairs. We will consider 2 approaches to tackle this problem:
### On policy learning
In the first few sections, we considered possible techniques to embed exploration in the policy such as $\epsilon$ greedy methods. The same technique can be applied to the Monte Carlo approach. The updated pseudo-code will be as follows:

1. Initialize:
    * $\pi$: an arbitrary $\epsilon$-soft policy
    * $Q(s, a)$: arbitrary state-action values, with returns($s, a$) an empty list for each pair

2. Loop (for each episode):
    1. Generate an episode following $\pi$ (no exploring start is needed): $S_0, A_0, R_1, S_1, A_1, ..., S_{T - 1}, A_{T - 1}, R_T$
    2. $G \leftarrow 0$, then loop through each step of the episode, $t = T - 1, T - 2, ..., 0$:
        * $G \leftarrow \gamma \cdot G + R_{t+1}$
        * if the pair $S_t, A_t$ appears for the first time (it does not appear earlier in the episode):
            * append $G$ to returns($S_t, A_t$) and set $Q(S_t, A_t) \leftarrow$ avg(returns($S_t, A_t$))
            * $A^{*} \leftarrow \arg\max_a Q(S_t, a)$
            * update the policy for every action $a \in A(S_t)$ as follows: 
            $$ \pi(a | S_t) \leftarrow
                \begin{cases}
                1 - \epsilon + \frac{\epsilon}{|A(S_t)|}  & \text{if } a = A^{*}\\
                \frac{\epsilon}{|A(S_t)|} & \text{if } a \neq A^{*}
                \end{cases}
            $$

This approach does not reach the task's optimal policy (as such a policy is deterministic); it only finds the best policy among the $\epsilon$-soft ones. Nevertheless, with a smaller $\epsilon$ and more training episodes, the difference between the 2 policies becomes insignificant, which is a small price to pay for overcoming the impracticality of exploring starts. 
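
The $\epsilon$-soft policy update for a single state can be sketched as follows. The function name `epsilon_soft_probs` and the example action values are made up for the illustration.

```python
def epsilon_soft_probs(q_values, epsilon):
    """Return pi(a | s) for every action, given the action values of one state."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])   # index of A*
    probs = [epsilon / n] * n                            # every action keeps epsilon / |A(s)|
    probs[greedy] += 1.0 - epsilon                       # the greedy action gets the rest
    return probs

# Hypothetical action values for a state with 4 actions.
probs = epsilon_soft_probs([0.2, 1.5, -0.3, 0.0], epsilon=0.1)
```

Every action keeps a probability of at least $\frac{\epsilon}{|A(s)|}$, which is what keeps exploration alive without exploring starts.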

## Off policy Control
The off policy is still confusing to me at the moment.

# Temporal Difference Learning
This is one of the central ideas in RL. First, let's consider an incremental version of the MC update with a constant step size:
$$V(S_t) \leftarrow V(S_t) + \alpha \cdot (G_t - V(S_t))$$
as we would like $V(S_t)$ to converge to the expected value of $G_t$. 

TD(0) is a powerful algorithm described as follows:
* Input: the policy $\pi$ to be evaluated, step size $\alpha \in (0, 1]$
1. Initialize $V(s)$ arbitrarily for each state $s$, with $V(terminal) = 0$
2. Loop (for each episode):
    1. Initialize $S$
    2. Loop (for each step of the episode) until $S$ is terminal:
        1. $A \leftarrow$ action given by $\pi$ for $S$
        2. Take action $A$, observe $R$ and the next state $S^{'}$
        3. $V(S) \leftarrow V(S) + \alpha (R + \gamma V(S^{'}) - V(S))$
        4. $S \leftarrow S^{'}$


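The intuition behind $V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$ is that $R_{t+1} + \gamma V(S_{t+1})$ is an estimate of $G_t$ that is available after a single step: the unknown rest of the return, $G_{t+1}$, is replaced by the current estimate $V(S_{t+1})$.  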
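
As an illustration, here is a minimal sketch of TD(0) evaluating the uniform random policy on a 5-state random walk. The environment, the function name and the hyperparameters are assumptions chosen for the example; for this particular walk the true values are $\frac{1}{6}, \frac{2}{6}, ..., \frac{5}{6}$.

```python
import random

def td0_random_walk(n_episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """TD(0) evaluation of the uniform random policy on a 5-state random walk.

    States 0..4 are non-terminal. Stepping left of 0 ends the episode with
    reward 0, stepping right of 4 ends it with reward 1.
    """
    rng = random.Random(seed)
    V = [0.5] * 5
    for _ in range(n_episodes):
        s = 2  # every episode starts in the middle
        while True:
            s_next = s + rng.choice((-1, 1))
            done = s_next < 0 or s_next > 4
            reward = 1.0 if s_next > 4 else 0.0
            v_next = 0.0 if done else V[s_next]   # V(terminal) = 0
            # Move V(S) toward the one-step target R + gamma * V(S').
            V[s] += alpha * (reward + gamma * v_next - V[s])
            if done:
                break
            s = s_next
    return V

V = td0_random_walk()
```

Because the step size is constant, the estimates keep fluctuating around the true values instead of settling exactly on them.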

Another important identity:  
$$G_t - V(S_t) = \sum_{k = t}^{T - 1} \gamma ^{k - t} \delta_k$$
where $\delta_t = R_{t+1} + \gamma \cdot V(S_{t+1}) - V(S_t)$ is the TD error. The identity is exact only if $V$ does not change during the episode. It does not hold exactly when $V$ is updated at every step, as it is in TD(0), but when the step size $\alpha$ is small enough, it can be seen as a good approximation.
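
A short derivation of the identity, assuming $V$ is held fixed during the episode and $V(S_T) = 0$ for the terminal state (so that $G_T - V(S_T) = 0$). It only uses $G_t = R_{t+1} + \gamma G_{t+1}$ and adds and subtracts $\gamma V(S_{t+1})$:

$$
\begin{aligned}
G_t - V(S_t) &= R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1}) \\
&= \delta_t + \gamma \left(G_{t+1} - V(S_{t+1})\right) \\
&= \delta_t + \gamma \delta_{t+1} + \gamma^2 \left(G_{t+2} - V(S_{t+2})\right) \\
&= \delta_t + \gamma \delta_{t+1} + ... + \gamma^{T - t - 1} \delta_{T-1} + \gamma^{T - t} \left(G_T - V(S_T)\right) \\
&= \sum_{k = t}^{T - 1} \gamma^{k - t} \delta_k
\end{aligned}
$$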

The main advantage introduced by TD is that it waits only for the next time step and not for the end of an entire episode, which is quite a critical point in real applications. Certain tasks have significantly long episodes, which makes MC methods too slow. 

## TD-Control
### On policy Control
Control is usually built on the state-action value function, as it allows choosing actions without a model of the environment. Thus, we need the update rule for state-action pairs. The derivation is pretty straightforward, so we include it in the main algorithm (Sarsa) directly:
* Input: step size $\alpha \in (0, 1]$, small $\epsilon > 0$
1. Initialize $Q(s, a)$ arbitrarily, except $Q(terminal, \cdot) = 0$
2. Loop (for each episode):
    1. Initialize $S$
    2. Choose $A$ from $S$ using a policy derived from $Q$: mainly $\epsilon$-greedy
    3. Loop (for each step of the episode) until $S$ is terminal:
        1. Take action $A$, observe $R$, $S^{'}$  
        2. Choose $A^{'}$ from $S^{'}$ using the policy derived from $Q$  
        3. $Q(S, A) \leftarrow Q(S, A) + \alpha \cdot [R + \gamma \cdot Q(S', A') - Q(S, A)]$
        4. $S\leftarrow S'$ , $A \leftarrow A'$

Having the values $Q(s, a)$ is what provides control, as the final policy can simply be an $\epsilon$-greedy policy with respect to $Q$.
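
A minimal sketch of this on-policy TD control loop (Sarsa), on a hypothetical 5-state corridor where every step costs $-1$ and the goal is the rightmost state. The environment, helper names and hyperparameters are assumptions made for the example.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics with a reward of -1 per step."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def sarsa(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[GOAL] stays at 0
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s_next = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy target: uses the action A' that will actually be taken.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

Q = sarsa()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

Reading the greedy action out of `Q` for each non-terminal state gives the learned behaviour, which here should be "move right" everywhere.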

## Off Policy Control
Q-learning can be considered one of the major breakthroughs in Reinforcement Learning. It is based on the Bellman optimality equation and thus directly estimates $q_{*}$, regardless of the policy being followed:

* Input: step size $\alpha \in (0, 1]$, small $\epsilon > 0$
1. Initialize $Q(s, a)$ arbitrarily, except $Q(terminal, \cdot) = 0$
2. Loop (for each episode):
    1. Initialize $S$
    2. Loop (for each step of the episode) until $S$ is terminal:
        1. Choose $A$ from $S$ using a policy derived from $Q$: mainly $\epsilon$-greedy
        2. Take action $A$, observe $R$, $S^{'}$    
        3. $Q(S, A) \leftarrow Q(S, A) + \alpha \cdot [R + \gamma \cdot \max_a~Q(S', a) - Q(S, A)]$
        4. $S\leftarrow S'$

The main difference from Sarsa is that the update uses the maximum state-action value over the actions available in the next state $S'$, rather than the value of the action actually taken next.
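
A minimal Q-learning sketch on a hypothetical 5-state corridor (reward of $-1$ per step, goal at the rightmost state). The environment, helper names and hyperparameters are assumptions made for the example.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics with a reward of -1 per step."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def q_learning(n_episodes=2000, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[GOAL] stays at 0
    for _ in range(n_episodes):
        s = 0
        while s != GOAL:
            a = epsilon_greedy(Q, s, epsilon, rng)   # behaviour policy
            r, s_next = step(s, a)
            # Off-policy target: value of the best action in S', whatever
            # action the behaviour policy ends up taking there.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

Since this toy environment is deterministic, the values of the "right" action should approach the optimal values $-4, -3, -2, -1$ for the states $0, ..., 3$.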

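Let's explain the reason why Q-learning is considered off-policy learning. One main point is that it acts using a policy derived from the current $Q$ (e.g. $\epsilon$-greedy). Nevertheless, it does not use the term $Q(S_{t+1}, A_{t+1})$ of the action it actually takes next in the update: it uses $\max_a Q(S_{t+1}, a)$, the value of the greedy action. In other words, the target policy (greedy) is not the same as the behavior policy ($\epsilon$-greedy).   

On the other hand, Sarsa is considered on-policy, as it updates $Q$ using the value $Q(S_{t+1}, A_{t+1})$ of the action selected by the same policy it follows.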

## Expected Sarsa
At least from an algorithmic point of view, the only difference is the update equation:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\cdot (R_{t+1} + \gamma \sum_{a' \in A(S_{t+1})} \pi(a' | S_{t+1}) \cdot Q(S_{t+1}, a') - Q(S_t, A_t))$$ 

The update is more stable than Sarsa's: averaging over the possible next actions, instead of sampling a single $A_{t+1}$, lowers the variance of the updates, with the downside of increased computation per step.
We can say that Expected Sarsa is more robust to larger values of the step size. 
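
The Expected Sarsa target can be sketched for one transition as follows, assuming the policy $\pi$ is $\epsilon$-greedy with respect to $Q$. The function name and the numbers are hypothetical.

```python
def expected_sarsa_target(reward, q_next, epsilon, gamma=1.0):
    """R + gamma * sum_a pi(a | S') * Q(S', a) for an epsilon-greedy pi."""
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    probs = [epsilon / n] * n          # exploration mass, spread uniformly
    probs[greedy] += 1.0 - epsilon     # remaining mass on the greedy action
    expected_q = sum(p * q for p, q in zip(probs, q_next))
    return reward + gamma * expected_q

# Hypothetical transition: reward -1, two actions available in S'.
target = expected_sarsa_target(reward=-1.0, q_next=[-2.0, -4.0], epsilon=0.2)
greedy_target = expected_sarsa_target(reward=-1.0, q_next=[-2.0, -4.0], epsilon=0.0)
```

With $\epsilon = 0$ the policy is greedy and the target coincides with the Q-learning target $R + \gamma \max_a Q(S', a)$.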

# Chapter 8: Models
The term ***model*** should be seen as any mechanism that enables the agent to predict the environment's next states and rewards. There are 2 main types of models:
1. distribution models: They give the probabilities of every possible transition, providing complete knowledge of the environment
2. sample models: They produce a single transition and reward, sampled according to those probabilities, without exposing the probabilities themselves.
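
A minimal sketch of the difference between the two kinds of models. The table `DISTRIBUTION_MODEL`, the state and action names and both helper functions are hypothetical.

```python
import random

# Hypothetical distribution model:
# (state, action) -> list of (probability, reward, next_state).
DISTRIBUTION_MODEL = {
    ("s0", "go"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
}

def expected_reward(state, action):
    """A distribution model allows exact expectations over all outcomes."""
    return sum(p * r for p, r, _ in DISTRIBUTION_MODEL[(state, action)])

def sample_model(state, action, rng):
    """A sample model only returns one outcome drawn from those probabilities."""
    outcomes = DISTRIBUTION_MODEL[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, reward, next_state = rng.choices(outcomes, weights=weights, k=1)[0]
    return reward, next_state

exact = expected_reward("s0", "go")
reward, next_state = sample_model("s0", "go", random.Random(0))
```

A distribution model can always be used as a sample model (as done here), while the opposite direction requires estimating the probabilities from many samples.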

## Dyna Q:
The term planning should be thought of as using the model we built of the environment, rather than real interaction, to improve the value function and hence the choice of actions. One path to improve ***Q-Learning*** is to introduce a planning component. How, you might ask? At each step of interaction with the world, the model saves the observed transition: the reward and the next state for the visited state-action pair. Using this knowledge, we can simulate experience and update the state-action value function with it. (A policy for sampling the simulated pairs should be determined.)   
Let's formalize this idea with the following algorithm:


1. Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$ and $a \in A$
2. Loop:
    1. $S \leftarrow$ current (non-terminal) state
    2. $A \leftarrow \epsilon$-greedy action for $S$ with respect to $Q$
    3. Take action $A$, observe $R, S'$
    4. Update $Q(S, A)$ according to the ***Q-Learning*** update: 
        $$Q(S, A) \leftarrow Q(S, A) + \alpha\cdot (R + \gamma \cdot \max_{a'} Q(S', a')  - Q(S, A))$$ 
    5. $Model(S, A) \leftarrow R, S'$ assuming the environment is deterministic
    6. Loop for $n$ times ($n$: number of planning iterations):
        1. $S \leftarrow$ a random previously observed state
        2. $A \leftarrow$ a random action previously taken in $S$
        3. $R, S' \leftarrow Model(S, A)$: $~$ the deterministic outcome of the state-action pair $S, A$
        4. Update $Q(S, A)$ using the ***Q-Learning*** update above

We can see how Dyna-Q better leverages the limited experience the agent acquires from interacting with the environment. This is especially crucial in practical tasks where the interaction can be quite expensive. 
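
A minimal sketch of tabular Dyna-Q on a hypothetical deterministic corridor (states 0 to 4, goal at 4, reward of $-1$ per step). The environment, the function names and the hyperparameters are assumptions chosen to keep the example short.

```python
import random

GOAL = 4                       # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (0, 1)               # 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor dynamics with a reward of -1 per step."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return -1.0, next_state

def dyna_q(n_episodes=100, n_planning=10, alpha=0.1, gamma=0.95,
           epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    model = {}                 # (state, action) -> (reward, next_state)
    for _ in range(n_episodes):
        s = 0
        while s != GOAL:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[s][b])
            r, s_next = step(s, a)
            # Direct RL: Q-learning update from the real transition.
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            # Model learning: remember the deterministic outcome.
            model[(s, a)] = (r, s_next)
            # Planning: replay randomly chosen remembered transitions.
            for _ in range(n_planning):
                ps, pa = rng.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                Q[ps][pa] += alpha * (pr + gamma * max(Q[ps_next]) - Q[ps][pa])
            s = s_next
    return Q, model

Q, model = dyna_q()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

Each real step is followed by `n_planning` simulated updates, so the same amount of interaction produces many more updates of $Q$ than plain Q-learning.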
## Changing Environment
Relying heavily on the model raises a natural question: what if the model is inaccurate? This question is just an extension of the exploration-exploitation tradeoff, as the environment can change with time, leaving the model outdated, and the agent should take such a possibility into account.   
One way to address this issue is to add an exploration component to the planning update formula.
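
One concrete form such a component can take is the bonus used by Dyna-Q+ in Sutton and Barto: during planning, the modeled reward $r$ is replaced by $r + \kappa \sqrt{\tau}$, where $\tau$ is the number of time steps since the pair was last tried in the real environment. A minimal sketch, where the function name and the default value of $\kappa$ are hypothetical:

```python
import math

def planning_reward(model_reward, steps_since_tried, kappa=0.01):
    """Modeled reward plus an exploration bonus kappa * sqrt(tau)."""
    return model_reward + kappa * math.sqrt(steps_since_tried)

# A pair not tried for 100 steps looks more attractive during planning
# than one that was tried on the current step.
stale = planning_reward(1.0, steps_since_tried=100)
fresh = planning_reward(1.0, steps_since_tried=0)
```

The longer a pair goes untested, the larger its bonus, which pushes the agent to re-check parts of the environment that the model may describe incorrectly.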
