
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Model-Free Control  

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you learn:<br>
 - On-Policy MC Control
 - On-Policy TD Learning
 - Off-Policy Learning
  
## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) References
* [David Silver lecture](https://www.youtube.com/watch?v=2pWv7GOvuf0&list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)
* Sutton and Barto, *Reinforcement Learning: An Introduction* - Chapters 5, 6, 7, 8

### Knowledge check ###
- What is the difference between Control and Prediction problems?
- What are some examples of problems that can be modeled with MDPs?
- In what cases do you use Model-free control?

### On and Off-Policy Learning ###
0. On-policy learning
 - You learn about policy \\(\pi \\) from the experience sampled from \\(\pi \\). In other words, you learn from policy \\(\pi \\) by taking actions based on policy \\(\pi\\)
0. Off-policy learning
 - You learn about policy \\(\pi \\) from experience sampled from \\(\mu\\)

### Questions (refresher) ###
<br>
0. What was policy iteration?
0. Can you apply that method using MC? Why?

### Questions ###
<br>
0. What is the solution?

### \\(\epsilon-Greedy\\) Exploration ###
0. A simple way to keep exploring the action space
0. With m actions, choose the greedy action with probability \\(1 - \epsilon\\) and a uniformly random action with probability \\(\epsilon\\). Every action therefore has probability at least \\(\frac{\epsilon}{m}\\), i.e.
$$ \pi(a|s) = \begin{cases}
   \frac{\epsilon}{m} + 1 - \epsilon &\text{if } a = a^{\*} = argmax\_{a'\in A}   Q(s,a')\\\
   \frac{\epsilon}{m} &\text{otherwise}
\end{cases} $$
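
The distribution above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `epsilon_greedy_probs` and `epsilon_greedy_action` are made up for this example, and ties between equal action values are assumed to go to the lowest action index.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for one state, given its action values Q(s, .)."""
    m = len(q_values)
    # Greedy action a*; ties go to the lowest index (an assumption of this sketch).
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m          # every action gets epsilon / m
    probs[greedy] += 1.0 - epsilon     # the greedy action gets the remaining mass
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

probs = epsilon_greedy_probs([1.0, 3.0, 2.0], epsilon=0.1)
action = epsilon_greedy_action([1.0, 3.0, 2.0], epsilon=0.1, rng=random.Random(0))
```

With three actions and \\(\epsilon = 0.1\\), the greedy action (index 1) ends up with probability \\(\frac{0.1}{3} + 0.9\\) and each other action with \\(\frac{0.1}{3}\\).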

### Some points to keep in mind ###
- The \\(\epsilon-greedy\\) policy improves over time (proof out of scope; talk to me)
- **For any \\(\epsilon-greedy\\) policy \\(\pi\\), the \\(\epsilon-greedy\\) policy \\(\pi'\\) with respect to \\(q\_{\pi}\\) is an improvement, \\(v\_{\pi'}(s) \ge v\_{\pi}(s)\\)** (proof out of scope, talk to me if you need a proof)
- You do not have to run many episodes before updating the policy. You can take one episode or two, update Q(s,a), apply \\(\epsilon-greedy\\) improvement, then repeat

### Definition ###
- Greedy in the Limit with Infinite Exploration (GLIE)
 - All state-action pairs are explored infinitely many times, $$lim\_{k\rightarrow \infty} N\_{k}(s,a)=\infty$$
 - The policy converges on a greedy policy, $$ lim\_{k \rightarrow \infty} \pi\_{k}(a\bigm\vert s) = \mathbf{1}\big(a = argmax\_{a'\in A} Q\_{k}(s, a')\big)$$
- \\(\epsilon-greedy\\) is GLIE if \\(\epsilon \\) decays to zero as learning progresses, for example \\(\epsilon\_{k} = \frac{1}{k}\\)

### GLIE MC Control ###
0. Sample the kth episode using the current policy \\(\pi\\)
0. For **every episode**
 - For each state \\(S\_{t}\\) and action \\(A\_{t}\\) in the episode,
 - \\(N\big(S\_{t}, A\_{t}\big) \longleftarrow N\big(S\_{t}, A\_{t}\big) + 1  \\), where \\(N\big(S\_{t}, A\_{t}\big)\\) is the number of times that \\(\big(S\_{t}, A\_{t}\big)\\) has been visited.
 - \\(Q\big(S\_{t}, A\_{t}\big)\longleftarrow Q\big(S\_{t}, A\_{t} \big) + \frac{1}{N\big(S\_{t}, A\_{t}\big )} \big(G\_{t} -Q\big(S\_{t}, A\_{t}\big)\big) \\),
 - Then improve the policy based on the new action values: \\(\epsilon \longleftarrow \frac{1}{k}\\),
 - \\(\pi \longleftarrow \epsilon-greedy (Q)\\)
0. If we do this long enough, \\(Q\big(s,a\big)\longrightarrow q\_{\*}(s,a)\\)
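
The loop above can be sketched end to end on a toy problem. Everything about the environment here is an assumption made for illustration (a hypothetical three-state corridor where `RIGHT` advances and pays 1 at the end, and `LEFT` quits with reward 0), as are the names `step`, `epsilon_greedy`, and `glie_mc_control`. Ties in the greedy choice are broken at random so that the untrained agent still explores.

```python
import random
from collections import defaultdict

N_STATES = 3            # hypothetical corridor: states 0, 1, 2; reaching 3 ends the episode
LEFT, RIGHT = 0, 1
GAMMA = 1.0

def step(state, action):
    """Toy dynamics (an assumption of this sketch). Returns (next_state, reward, done)."""
    if action == LEFT:
        return state, 0.0, True          # quitting ends the episode with no reward
    nxt = state + 1
    done = nxt == N_STATES
    return nxt, (1.0 if done else 0.0), done

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one (random tie-break)."""
    if rng.random() < epsilon:
        return rng.choice([LEFT, RIGHT])
    best = max(Q[(state, a)] for a in (LEFT, RIGHT))
    return rng.choice([a for a in (LEFT, RIGHT) if Q[(state, a)] == best])

def glie_mc_control(num_episodes, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # action-value estimates Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k    # GLIE schedule: epsilon_k = 1 / k
        # Sample the k-th episode with the current epsilon-greedy policy.
        episode, state, done = [], 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            episode.append((state, action, reward))
            state = next_state
        # Walk backwards to accumulate the return G_t, then apply the incremental mean.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + GAMMA * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q, N

Q, N = glie_mc_control(500)
greedy_policy = [max((LEFT, RIGHT), key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
```

The policy improvement step is implicit: `epsilon_greedy` always reads the latest `Q`, so each new episode is sampled from the \\(\epsilon-greedy\\) policy with respect to the updated action values.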

### Refresher ###
- What are the advantages of TD over MC?

### Questions ###
- How can you use TD instead of MC for a Control problem?

### Convergence of SARSA ###
- Sarsa converges to the optimal action-value function, \\(Q\big(s,a)\longrightarrow q\_{\*}(s,a)\\), under the following conditions:
 - GLIE sequence of policies \\(\pi\_{t}(a \bigm\vert s)\\) 
 - Robbins-Monro sequence of step-sizes \\(\alpha\_{t}\\)
$$ \sum\_{t = 1}^{\infty} \alpha\_{t} = \infty $$
$$ \sum\_{t = 1}^{\infty} \alpha\_{t}^{2} \lt \infty $$
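
As a quick worked check (standard series results, not taken from the lecture), the schedule \\(\alpha\_{t} = \frac{1}{t}\\) satisfies both conditions:

$$ \sum\_{t = 1}^{\infty} \frac{1}{t} = \infty, \qquad \sum\_{t = 1}^{\infty} \frac{1}{t^{2}} = \frac{\pi^{2}}{6} \lt \infty $$

A constant step size \\(\alpha\_{t} = c \gt 0\\) satisfies the first condition but fails the second:

$$ \sum\_{t = 1}^{\infty} c = \infty, \qquad \sum\_{t = 1}^{\infty} c^{2} = \infty $$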

### Questions ###
What graph do you expect to see when you use SARSA to train the agent? Can you plot Episodes vs Time steps for a typical RL problem?

### n-Step Sarsa ###

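- Same idea as n-step TD (the building block of TD(\\(\lambda\\))), applied to action values
- We can define the n-step Q-return for different values of n
$$ q\_{t}^1 = R\_{t+1} + \gamma Q(S\_{t+1}, A\_{t+1}) $$
$$ q\_{t}^2 = R\_{t+1} + \gamma R\_{t+2} + \gamma^2 Q(S\_{t+2}, A\_{t+2}) $$
$$ \vdots $$
$$ q\_{t}^\infty = R\_{t+1} + \gamma R\_{t+2} + ...+ \gamma^{T-t-1} R\_{T} $$
- In general, the n-step Q-return and the n-step Sarsa update are
$$ q\_{t}^n = R\_{t+1} + \gamma R\_{t+2} + ... + \gamma^{n-1}R\_{t+n} + \gamma^{n} Q(S\_{t+n}, A\_{t+n}) $$
$$ Q\big(S\_{t}, A\_{t}\big)\longleftarrow Q\big(S\_{t}, A\_{t} \big) + \alpha \big(q\_{t}^n -Q\big(S\_{t}, A\_{t}\big)\big) $$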
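
To make the n-step Q-return concrete, here is a small sketch that computes it from a list of rewards and a bootstrap value, then applies the update. The helper names `n_step_q_return` and `n_step_sarsa_update` and the numbers in the example are illustrative assumptions.

```python
def n_step_q_return(rewards, q_next, n, gamma):
    """n-step Q-return: discounted sum of n rewards plus a bootstrapped action value.

    rewards: [R_{t+1}, ..., R_{t+n}] (at least n entries)
    q_next:  current estimate of Q(S_{t+n}, A_{t+n}); pass 0.0 if S_{t+n} is terminal
    """
    discounted = sum(gamma ** i * rewards[i] for i in range(n))
    return discounted + gamma ** n * q_next

def n_step_sarsa_update(q_sa, target, alpha):
    """Move the current estimate Q(S_t, A_t) a fraction alpha towards the n-step target."""
    return q_sa + alpha * (target - q_sa)

rewards = [1.0, 0.0, 2.0]
one_step = n_step_q_return(rewards, q_next=10.0, n=1, gamma=0.5)    # 1 + 0.5 * 10
three_step = n_step_q_return(rewards, q_next=10.0, n=3, gamma=0.5)  # 1 + 0 + 0.25 * 2 + 0.125 * 10
new_q = n_step_sarsa_update(q_sa=4.0, target=three_step, alpha=0.1)
```

For simplicity the same bootstrap value is reused for both calls; in a real trajectory `q_next` would be the estimate at the state-action pair reached after n steps.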

### Sarsa(\\(\lambda\\)) Forward and Backward View
- **Forward view:**
$$ Q\big(S\_{t}, A\_{t}\big)\longleftarrow Q\big(S\_{t}, A\_{t} \big) + \alpha \big(q\_{t}^\lambda -Q\big(S\_{t}, A\_{t}\big)\big) $$
where 
$$ q\_{t}^\lambda = (1-\lambda) \sum\_{n = 1}^{\infty}\lambda^{n-1}q\_{t}^n $$

- **Backward view:**
 - Keep an eligibility trace for every state-action pair (s, a):
 - \\(E\_{0}(s, a) = 0\\)
 - \\(E\_{t}(s, a) = \gamma\lambda E\_{t-1}(s, a) + \mathbf{1}(S\_{t} = s, A\_{t} = a)\\), where \\(\mathbf{1}\\) is an indicator function
 - Update Q(s, a) for every state s and action a, in proportion to the TD error \\(\delta\_{t}\\) and the eligibility trace \\(E\_{t}(s, a)\\)
$$\delta\_{t} = R\_{t+1} + \gamma Q(S\_{t+1}, A\_{t+1}) - Q(S\_{t}, A\_{t}) $$
$$ Q(s, a) \longleftarrow Q(s, a) + \alpha \delta\_{t}E\_{t}(s, a) $$
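
One backward-view step can be sketched as follows. This is a minimal tabular illustration under assumed names (`sarsa_lambda_step`, string-labelled states and actions); it stores `Q` and `E` in dictionaries keyed by `(state, action)` and treats the value of a terminal successor as zero. Setting `lam=0` reduces it to the one-step Sarsa update.

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """Apply one Sarsa(lambda) backward-view update for the transition (s, a, r, s_next, a_next)."""
    # TD error: delta = R + gamma * Q(S', A') - Q(S, A), with Q(S', A') = 0 at termination.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    delta = target - Q[(s, a)]
    # Decay every existing trace by gamma * lambda, then bump the pair just visited.
    for key in list(E):
        E[key] *= gamma * lam
    E[(s, a)] += 1.0
    # Update every pair in proportion to its eligibility trace.
    for key, trace in E.items():
        Q[key] += alpha * delta * trace
    return delta

Q = defaultdict(float)
E = defaultdict(float)
# Hypothetical two-step episode: A --right--> B --right--> terminal, reward 1 at the end.
sarsa_lambda_step(Q, E, "A", "right", 0.0, "B", "right", False, alpha=0.1, gamma=0.9, lam=0.5)
sarsa_lambda_step(Q, E, "B", "right", 1.0, None, None, True, alpha=0.1, gamma=0.9, lam=0.5)
```

After the second step the reward has been credited not only to `("B", "right")` but also, through its decayed trace of \\(\gamma\lambda = 0.45\\), to `("A", "right")`.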

### Off-Policy Learning ###
<br>

You might want to evaluate a **target** policy \\(\pi(a \bigm\vert s)\\) to compute \\(v\_{\pi}(s)\\) or \\(q\_{\pi}(s, a)\\) while following a different **behavior** policy \\(\mu(a \bigm\vert s)\\).

### Questions ###
0. Why would one do that?
0. How do you do that?

### Importance Sampling ###
- Simple idea: rewrite an expectation under one distribution as an expectation under another.
$$ E\_{X \sim P}  \big[f(X) \big] = \sum P(X)f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = E\_{X \sim Q}\bigg[f(X)\frac{P(X)}{Q(X)}\bigg] $$
- Here is how you apply this in practice:
 - Use returns generated from behavior policy \\(\mu\\) to evaluate \\(\pi\\)
 - Weight each return by how likely its actions are under \\(\pi\\) relative to \\(\mu\\): if the policies are similar (meaning the distributions are similar) then put higher weight on the return. Otherwise, put less weight on the return
 - $$ G\_{t}^{\frac {\pi}{\mu}} = \frac{\pi\big(A\_{t}\bigm\vert S\_{t}\big)}{\mu \big(A\_{t}\bigm\vert S\_{t}\big)} \frac{\pi\big(A\_{t+1}\bigm\vert S\_{t+1}\big)}{\mu \big(A\_{t+1}\bigm\vert S\_{t+1}\big)} ... \frac{\pi\big(A\_{T}\bigm\vert S\_{T}\big)}{\mu \big(A\_{T}\bigm\vert S\_{T}\big)} G\_{t} $$
 - $$ V\big(S\_{t} \big)\longleftarrow V\big(S\_{t} \big) + \alpha \big(G\_{t}^{\frac {\pi}{\mu}} -V\big(S\_{t}\big)\big) $$
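
Both pieces can be checked numerically with a short sketch. The distributions, the function `f`, and the helper names below are made-up illustrations; the expectations are computed exactly over a finite support rather than by sampling, so the identity holds up to floating-point error.

```python
def expectation(dist, f):
    """E_{X~dist}[f(X)] over a finite support given as {outcome: probability}."""
    return sum(p * f(x) for x, p in dist.items())

def importance_weighted_expectation(p_dist, q_dist, f):
    """E_{X~Q}[f(X) * P(X) / Q(X)], which equals E_{X~P}[f(X)] when Q(x) > 0 wherever P(x) > 0."""
    return sum(q * (p_dist[x] / q) * f(x) for x, q in q_dist.items())

def importance_sampled_return(G, pi_probs, mu_probs):
    """Scale the return G by the product of pi(A_k|S_k) / mu(A_k|S_k) along the trajectory."""
    rho = 1.0
    for pi_p, mu_p in zip(pi_probs, mu_probs):
        rho *= pi_p / mu_p
    return rho * G

p_dist = {"a": 0.2, "b": 0.8}     # target distribution P
q_dist = {"a": 0.5, "b": 0.5}     # sampling distribution Q
f = {"a": 10.0, "b": 0.0}.get

direct = expectation(p_dist, f)
reweighted = importance_weighted_expectation(p_dist, q_dist, f)
# Two-step trajectory where pi favors the taken actions more than mu did.
corrected = importance_sampled_return(G=5.0, pi_probs=[0.9, 0.9], mu_probs=[0.5, 0.5])
```

Here `direct` and `reweighted` agree, and the corrected return is scaled up by \\(1.8^{2} = 3.24\\) because \\(\pi\\) is more likely than \\(\mu\\) to produce this trajectory.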

### Question ###
- What can go wrong with the above formula?
- Solutions?

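- Importance sampling for off-policy TD: use the TD target instead of the full return, so only a single correction ratio is needed
$$ V\big(S\_{t} \big)\longleftarrow V\big(S\_{t} \big) + \alpha \bigg(\frac{\pi\big(A\_{t}\bigm\vert S\_{t}\big)}{\mu \big(A\_{t}\bigm\vert S\_{t}\big)} (R\_{t+1} + \gamma V(S\_{t+1})) -V\big(S\_{t}\big)\bigg) $$
 - Much lower variance than Monte-Carlo importance sampling, because we do not multiply many ratios together
 - The policies only need to be similar over a single step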
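
The single-step correction can be sketched as a tabular update. The function name `off_policy_td_update`, the dictionary-based value table, and the example numbers are assumptions for illustration; the value of a terminal successor is treated as zero.

```python
def off_policy_td_update(V, s, r, s_next, pi_prob, mu_prob, alpha, gamma, done=False):
    """Importance-sampled TD(0) update of V[s] from one transition generated by mu.

    pi_prob: pi(A_t | S_t) for the action actually taken
    mu_prob: mu(A_t | S_t) for the same action (must be > 0)
    """
    rho = pi_prob / mu_prob                                   # single-step importance ratio
    td_target = r + (0.0 if done else gamma * V[s_next])      # R_{t+1} + gamma * V(S_{t+1})
    V[s] += alpha * (rho * td_target - V[s])
    return V[s]

V = {"A": 0.0, "B": 2.0}
updated = off_policy_td_update(V, "A", r=1.0, s_next="B",
                               pi_prob=0.9, mu_prob=0.6, alpha=0.1, gamma=0.5)
```

In this example the ratio is 1.5 and the TD target is 2.0, so `V["A"]` moves from 0.0 to 0.3.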

### Q-Learning ###
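- It does not require importance sampling
- It is off-policy learning of action values Q(s, a)
- The next action actually taken is chosen by the behavior policy, \\(A\_{t+1} \sim \mu(\cdot \bigm\vert S\_{t+1})\\)
- But we consider an alternative successor action from the target policy, \\(A' \sim \pi(\cdot \bigm\vert S\_{t+1})\\)
- Update \\(Q(S\_{t}, A\_{t})\\) towards the value of the alternative action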

$$ Q\big(S\_{t}, A\_{t} \big)\longleftarrow Q\big(S\_{t}, A\_{t} \big) + \alpha  \big(R\_{t+1} + \gamma Q(S\_{t+1}, A') -Q\big(S\_{t}, A\_{t}\big)\big) $$
- A special case of the above formula is when the target policy is greedy with respect to Q(s,a)
 - \\(\pi\big(S\_{t+1}\big) = argmax\_{a'} Q\big(S\_{t+1}, a'\big)\\)
 - The behavior policy is \\(\epsilon\\)-greedy with respect to Q(s,a)
 - i.e. \\(R\_{t+1} + \gamma Q(S\_{t+1}, A') = R\_{t+1} + \gamma max\_{a'} Q(S\_{t+1}, a') \\)
- This is the Q-learning update you have probably heard about:
$$ Q\big(S\_{t}, A\_{t} \big)\longleftarrow Q\big(S\_{t}, A\_{t} \big) + \alpha \big(R\_{t+1} + \gamma max\_{a'} Q(S\_{t+1}, a') -Q\big(S\_{t}, A\_{t}\big)\big) $$
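
The update above can be written as a short tabular sketch. The function name `q_learning_update`, the dictionary keyed by `(state, action)`, and the example values are assumptions for illustration; the bootstrap term is taken to be zero when the successor state is terminal.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, done=False):
    """One Q-learning step: bootstrap from the best action in s_next,
    regardless of which action the behavior policy takes there."""
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)
Q[("B", "left")] = 1.0
Q[("B", "right")] = 3.0
updated = q_learning_update(Q, "A", "right", r=1.0, s_next="B",
                            actions=("left", "right"), alpha=0.5, gamma=0.9)
```

The target here is \\(1 + 0.9 \cdot 3 = 3.7\\), so `Q[("A", "right")]` moves halfway from 0 to 3.7. In a full agent, the action passed in as `a` would typically come from an \\(\epsilon\\)-greedy behavior policy.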

### Summary ###
## ![comparison](https://chunpai.github.io/assets/img/DP_and_TD.png)

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>