# Model Free Control

> A notebook that helps us to discover Reinforcement Learning
- toc: true
- branch: master
- badges: true
- comments: true
- image: https://www.mdpi.com/symmetry/symmetry-12-01685/article_deploy/html/images/symmetry-12-01685-g002.png
- description: Fourth in a series on understanding Reinforcement Learning.

# Objectives

Previously, we learned about [Model Free Prediction](https://dnlam.github.io/fastblog/2020/06/21/Model_Free_Prediction_Control.html), which lets us estimate the value function of an unknown MDP from sampled experience (Monte Carlo and Temporal-Difference learning algorithms).

In this notebook, we turn to model-free control, where the goal is to optimize these value functions and, through them, the policy.

# Monte Carlo for Control
## Model Free Policy Iteration Using Action-Value Function

As we saw in Dynamic Programming, we can improve a policy (policy improvement) by making it greedy with respect to its value function. However, greedy policy improvement over the state value function requires knowledge of the MDP model:
$$
\pi'(s) = \argmax _ {a} \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t=s, A_t=a]
$$

Evaluating this expectation requires the transition and reward model of the MDP, which is exactly what we lack in the model-free setting, so this form of policy improvement becomes cumbersome.

Fortunately, instead of estimating state values, we can estimate state-action values. The greedy policy is then much easier to find: pick the highest-valued action in every state.

$$
\pi '(s) = \argmax _a q(s,a)
$$
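As a minimal sketch of this greedy step, assume the action values are stored in a nested dictionary (the table `q` and its state and action names are hypothetical, made up for illustration). No model of the MDP is needed, only the table itself:

```python
def greedy_policy(q):
    """Return, for each state, the action with the highest estimated value."""
    return {
        state: max(action_values, key=action_values.get)
        for state, action_values in q.items()
    }


# Hypothetical action-value estimates for two states and two actions.
q = {
    "s0": {"left": 0.1, "right": 0.7},
    "s1": {"left": 0.4, "right": 0.2},
}

policy = greedy_policy(q)
print(policy)  # {'s0': 'right', 's1': 'left'}
```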

We can then apply Generalised Policy Iteration with the action-value function. Starting from an arbitrary action-value function and policy, we alternate between estimating the action values (e.g. using Monte Carlo policy evaluation) and performing policy improvement.

However, if we improve the policy fully greedily, we stop exploring, which means we cannot sample every pair $(s, a)$ while learning from interaction. This makes policy evaluation unreliable. For example, if we evaluate a fully greedy policy with an MC algorithm, the policy may never select certain actions in certain states. We then have no reliable estimate of the value of those actions, and the improvement step could be wrong.

To deal with this problem, we can use an $\epsilon$-greedy policy: with probability $1-\epsilon$ we pick the greedy action, and with probability $\epsilon$ we pick an action uniformly at random, so every action keeps a small probability of being selected.

By doing so, we obtain a fully model-free algorithm that can be used for policy iteration.
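One possible way to write $\epsilon$-greedy action selection is sketched below. The function name, the dictionary layout, and the example values are assumptions made for illustration, and the exploratory branch is assumed to pick uniformly among all actions (including the greedy one):

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    action_values: dict mapping each action to its current estimate.
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore: uniform over all actions
    return max(actions, key=action_values.get)  # exploit: greedy action


# Hypothetical estimates for one state.
action_values = {"left": 0.1, "right": 0.7}
rng = random.Random(0)
samples = [epsilon_greedy(action_values, 0.2, rng) for _ in range(10_000)]

# With two actions, the non-greedy action should appear in roughly
# epsilon / 2 of the samples.
fraction_left = samples.count("left") / len(samples)
print(fraction_left)
```

With $\epsilon = 0$ this reduces to the purely greedy policy, and with $\epsilon = 1$ to a uniformly random one.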
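To make the loop concrete, here is a rough sketch of Monte Carlo control with an $\epsilon$-greedy policy. The five-state corridor environment, the `step` function, and all constants are hypothetical choices for illustration. The sketch uses every-visit returns with a running mean, which is one of several reasonable variants:

```python
import random

N_STATES = 5        # states 0..4; states 0 and 4 are terminal (assumed toy setup)
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9


def step(state, action):
    """Hypothetical corridor dynamics: reward 1 only on reaching the right end."""
    next_state = state + action
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state in (0, N_STATES - 1)
    return next_state, reward, done


def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    # Break ties at random so the initial all-zero table still explores.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def mc_control(episodes=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    counts = {key: 0 for key in q}
    for _ in range(episodes):
        # Roll out one complete episode with the current epsilon-greedy policy.
        state, done, trajectory = N_STATES // 2, False, []
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
        # Policy evaluation: move each visited q(s, a) toward its observed return.
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + GAMMA * g
            counts[(s, a)] += 1
            q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]
        # Policy improvement is implicit: epsilon_greedy reads the updated q.
    return q


q = mc_control()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(1, N_STATES - 1)}
print(greedy)
```

Note that nothing in `mc_control` looks inside `step`: the agent only sees sampled transitions, which is what makes the procedure model-free.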

# TD for Control

The idea is to apply TD learning to the action-value function, which keeps the convenience of being able to pick a greedy or $\epsilon$-greedy policy. In addition, we can update at every time step, because TD learns from individual transitions without waiting for the episode to finish.

## SARSA Algorithm
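One way to do this is the SARSA algorithm, which is simply TD learning applied to state-action values. The name comes from the quantities used in each update: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}$.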
$$
q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha _t(R_{t+1} + \gamma q_t(S_{t+1},A_{t+1}) - q_t(S_t,A_t) )
$$

We can then use this update in place of MC learning for the policy evaluation step. The policy improvement step is still $\epsilon$-greedy improvement.
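The SARSA update can be sketched as a single function acting on a table of action values. The table layout, the function name, the `done` flag for terminal transitions, and the numbers in the example are all assumptions made for illustration:

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, done=False):
    """One SARSA step: bootstrap from the action actually chosen next."""
    # Assumption: the value after a terminal transition is zero.
    next_value = 0.0 if done else q[(next_state, next_action)]
    td_error = reward + gamma * next_value - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical estimates before the update.
q = defaultdict(float)
q[("s0", "right")] = 0.5
q[("s1", "left")] = 0.2
q[("s1", "right")] = 1.0

# The agent took "right" in s0, received reward 0, and its epsilon-greedy
# policy happened to choose the exploratory action "left" in s1.
sarsa_update(q, "s0", "right", reward=0.0, next_state="s1", next_action="left")
print(q[("s0", "right")])  # approximately 0.468
```

Because the target uses the action the policy really selected in the next state, the estimate reflects the value of the $\epsilon$-greedy policy itself, exploration included.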

## Off-policy TD and Q-learning


In the previous cases, MC and SARSA, we interleaved policy evaluation and policy improvement for the policy we were following. Now we turn to `Off-policy` learning, which means learning about a policy different from the one we are following. A particularly popular off-policy algorithm is `Q-learning`.

Q-learning corresponds to a sampled version of the following value iteration update: $q_{k+1}(s,a) = \mathbb{E}[R_{t+1} + \gamma \max _{a'} q_k(S_{t+1},a') | S_t=s, A_t=a ]$

Here, the iteration takes a maximum over the actions available one step ahead, that is, it uses the action that maximizes our current action-value estimates in the next state.

We can sample this update as follows:
$$
q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha _t(R_{t+1} + \gamma \max _{a'} q_t(S_{t+1},a') -q_t(S_t,A_t))
$$

Inside the parentheses, we bootstrap differently from SARSA. Instead of using the action actually taken in the next state, we use the best action we could take according to our current estimates, so the target is the greedy value of our current value function in the next state.
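A minimal sketch of the Q-learning update is shown below. As before, the table layout, the function name, the `done` flag, and the example values are illustrative assumptions:

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """One Q-learning step: bootstrap from the best next action."""
    # Assumption: the value after a terminal transition is zero.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


actions = ("left", "right")

# Hypothetical estimates before the update.
q = defaultdict(float)
q[("s0", "right")] = 0.5
q[("s1", "left")] = 0.2
q[("s1", "right")] = 1.0

# Whatever the agent does next in s1, the target uses max(0.2, 1.0) = 1.0.
q_learning_update(q, "s0", "right", reward=0.0, next_state="s1", actions=actions)
print(q[("s0", "right")])  # approximately 0.54
```

The function never receives the next action: the target depends only on the current estimates in the next state, not on what the behaviour policy goes on to do there.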

## On and Off-Policy Learning

On-policy learning is about learning the value of a behaviour policy from experience sampled from that same policy. Here there is just one policy (as in Monte Carlo control and SARSA): it generates the behaviour, and it is also the policy whose value function we evaluate/predict. We call this on-policy learning because we are learning about the policy that we are following.

Off-policy learning, on the other hand, learns about a `target` policy $\pi$ while the experience is sampled from a different policy $\mu$. It essentially means learning counterfactually about other things we could do: "what if...?"

In general, in off-policy learning we want to evaluate a target policy $\pi(a|s)$, that is, to estimate $v_\pi(s)$ or $q_\pi(s,a)$, while using a behaviour policy $\mu(a|s)$ to generate actions.

In practice, this is important for several reasons:
- Learn from an observed database (stored experience)
- Reuse experience from old policies (past experience)
- Learn about multiple policies while following one policy
- Learn about the greedy policy while following an exploratory policy

Q-learning estimates the value of the greedy policy:
$$
q_{t+1}(S_t,A_t) = q_t(S_t,A_t) + \alpha _t(R_{t+1} + \gamma \max _{a'} q_t(S_{t+1},a') -q_t(S_t,A_t))
$$

It implies that we are learning the value of taking this action and then acting greedily from the next step on. This is a valid update target, and it learns about the greedy policy, but we do not need to act according to that policy all the time, because the greedy policy does not explore sufficiently.

<i>Theorem: Q-learning control converges to the optimal action-value function, $q \rightarrow q^*$, as long as we take each action in each state infinitely often. </i>

This has been proved to hold for any behaviour policy that eventually selects all actions sufficiently often, provided the step sizes decay appropriately: $\sum _t \alpha _t = \infty$ and $\sum _t \alpha _t ^2 < \infty$.
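The sketch below illustrates the off-policy aspect: the agent behaves uniformly at random, yet the Q-learning updates estimate the values of the greedy policy. The five-state corridor, the `step` function, and the constant step size are hypothetical choices for illustration. A constant step size is enough here only because this toy environment is deterministic:

```python
import random

N_STATES = 5        # states 0..4; states 0 and 4 are terminal (assumed toy setup)
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9


def step(state, action):
    """Hypothetical corridor dynamics: reward 1 only on reaching the right end."""
    next_state = state + action
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state in (0, N_STATES - 1)
    return next_state, reward, done


def q_learning(episodes=2000, alpha=0.5, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = N_STATES // 2, False
        while not done:
            # Behaviour policy mu: uniformly random, so it never stops exploring.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target policy pi: greedy, expressed through the max in the target.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (
                reward + GAMMA * best_next - q[(state, action)]
            )
            state = next_state
    return q


q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in (1, 2, 3)}
print(greedy)
```

In this corridor the optimal values for moving right are $1$, $0.9$ and $0.81$ in states 3, 2 and 1, and the learned table should approach them even though the agent never follows the greedy policy while collecting data.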
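As a concrete check of these step-size conditions (a standard example, not specific to this notebook), the schedule $\alpha_t = 1/t$ satisfies both:

$$
\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty
$$

A constant step size $\alpha_t = \alpha > 0$ meets the first condition but violates the second, since $\sum _t \alpha^2 = \infty$.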