# Lecture 12 - Fast RL II

provided by [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)

---

<div class="alert alert-block alert-info">
Table of Contents: <br>
    
<ul>
    <li>1. <a href="#1.-Introduction">Introduction</a></li>
    <li>2. <a href="#2.-Bayesian-Bandits">Bayesian Bandits</a></li>
    <li>3. <a href="#3.-Probability-Matching">Probability Matching</a></li>
    <li>4. <a href="#4.-Framework:-Probably-Approximately-Correct">Framework: Probably Approximately Correct</a></li>
    <li>5. <a href="#5.-Fast-RL-in-MDPs">Fast RL in MDPs</a></li>
    <li>6. <a href="#6.-Resource">Resource</a></li>
</ul>
</div>

# 1. Introduction

So far we have covered several exploration strategies for bandits, including algorithms built on the concept of __optimism under uncertainty__.

We have looked at the following approaches:
* __Greedy__: linear total regret
* __Constant $\epsilon$-greedy__: linear total regret
* __Decaying $\epsilon$-greedy__: sublinear regret
* __UCB__: sublinear regret
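To see why a constant $\epsilon$ gives linear regret, here is a rough worked bound. It assumes exploration picks uniformly among the $|\mathcal{A}|$ arms and writes $\Delta_{a} = Q(a^{*}) - Q(a)$ for the gap of arm $a$ (this notation is my own shorthand, not defined elsewhere in these notes). Every step pays for exploration with probability $\epsilon$, so:

$$
\mathbb{E}[Regret_{T}] \ge \sum_{t = 1}^{T} \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_{a} = T \cdot \frac{\epsilon}{|\mathcal{A}|} \sum_{a \in \mathcal{A}} \Delta_{a}
$$

For example, with two arms, a gap of $0.3$ and $\epsilon = 0.1$, each step contributes at least $\frac{0.1}{2} \cdot 0.3 = 0.015$, so the expected regret after $T = 1000$ steps is at least $15$ and keeps growing linearly in $T$.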

# 2. Bayesian Bandits

Previously, with UCB, we made no assumptions about the unknown reward distribution $R$ except for bounds on the rewards.

Another approach, __Bayesian bandits__, exploits prior knowledge of the rewards, $p[R]$. It computes a posterior distribution over rewards, $p[R~|~h_{t}]$, from the history $h_{t}$ of action-reward pairs observed so far.

It leverages Bayes' rule:

$$
p(\phi_{i}~|~r_{i1}) = \frac{p(r_{i1}~|~\phi_{i})p(\phi_{i})}{p(r_{i1})} \hspace{1em} (Eq.~1)\\
$$

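Here $\phi_{i}$ is the unknown parameter of arm $i$'s reward distribution and $r_{i1}$ is the first reward observed from that arm. If the posterior $p(\phi_{i}~|~r_{i1})$ belongs to the same parametric family as the prior $p(\phi_{i})$, then the prior $p(\phi_{i})$ and the model $p(r_{i1}~|~\phi_{i})$ are called __conjugate__. Why is this useful? It means we can update the posterior analytically.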
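As a minimal sketch of conjugacy, assume a Bernoulli reward with unknown success probability $\phi_{i}$ and a Beta prior. This model choice is an assumption for illustration; the notes do not fix one. The posterior after each reward is again a Beta distribution, so the update is just two additions. The function name `beta_bernoulli_update` is hypothetical.

```python
def beta_bernoulli_update(alpha, beta, reward):
    """Posterior Beta parameters after one Bernoulli reward (0 or 1).

    Prior: phi ~ Beta(alpha, beta); likelihood: r ~ Bernoulli(phi).
    """
    if reward not in (0, 1):
        raise ValueError("reward must be 0 or 1 for a Bernoulli model")
    return alpha + reward, beta + (1 - reward)


# Start from a uniform prior Beta(1, 1) and observe rewards 1, 0, 1.
alpha, beta = 1, 1
for r in (1, 0, 1):
    alpha, beta = beta_bernoulli_update(alpha, beta, r)

posterior_mean = alpha / (alpha + beta)  # mean of Beta(alpha, beta)
```

After two successes and one failure the posterior is Beta(3, 2), with mean 0.6.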

There are two frameworks for measuring regret:
* __frequentist regret__: (the framework we have been using so far) assumes a true, unknown set of parameters $\theta$

$$
Regret(\mathcal{A}, T; \theta) = \sum_{t = 1}^{T} \mathbb{E}[Q(a^{*}) - Q(a_{t})] \hspace{1em} (Eq.~2)\\
$$

* __Bayesian regret__: assumes there's a prior over parameters

$$
BayesRegret(\mathcal{A}, T; \theta) = \mathbb{E}_{\theta \sim p_{\theta}} \left[\sum_{t = 1}^{T} \mathbb{E}[Q(a^{*}) - Q(a_{t})~|~\theta]\right] \hspace{1em} (Eq.~3)\\
$$
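The difference between the two regret definitions can be sketched in a few lines. The version below scores one realized action sequence rather than taking the inner expectation over the algorithm's randomness, and it approximates the outer expectation in Eq. 3 by averaging over a handful of parameter draws. The names `frequentist_regret`, `bayes_regret` and the `policy` callable are hypothetical.

```python
import numpy as np


def frequentist_regret(q_true, actions):
    """Total regret of one action sequence for a fixed set of true values (Eq. 2)."""
    q_true = np.asarray(q_true, dtype=float)
    gaps = q_true.max() - q_true          # Q(a*) - Q(a) for every arm
    return float(gaps[np.asarray(actions)].sum())


def bayes_regret(sampled_q, policy, horizon):
    """Monte Carlo estimate of Eq. 3: average regret over prior draws.

    sampled_q: iterable of true-value vectors, each drawn from the prior.
    policy:    hypothetical callable (q_true, horizon) -> list of chosen arms;
               q_true is passed only so a simulator could generate rewards.
    """
    regrets = [frequentist_regret(q, policy(q, horizon)) for q in sampled_q]
    return float(np.mean(regrets))


def always_arm0(q_true, horizon):
    """A trivial policy that ignores all feedback and always pulls arm 0."""
    return [0] * horizon


single = frequentist_regret([0.2, 0.5], [0, 0, 1])   # two pulls of the worse arm
averaged = bayes_regret([[0.2, 0.5], [0.9, 0.1]], always_arm0, horizon=10)
print(single, averaged)
```

Always pulling arm 0 is costly under the first parameter draw and free under the second; the Bayesian regret averages the two.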

We tackle the Bayesian framework with __probability matching__.

# 3. Probability Matching

We assume we have a parametric distribution over rewards for each arm. 

__Probability matching__ selects each action $a$ with the probability that $a$ is the optimal action, given the history $h_{t}$.

$$
\pi(a~|~h_{t}) = \mathbb{P}[Q(a) > Q(a'), \forall a' \ne a ~|~ h_{t}] \hspace{1em} (Eq.~4)\\
$$

Probability matching is optimistic in the face of uncertainty: uncertain actions have a higher probability of being the max.
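Eq. 4 rarely has a closed form, but it can be estimated by sampling from the posterior. The sketch below assumes each arm has a Beta posterior over its mean reward (an assumption for illustration) and counts how often each arm comes out on top. The function name `prob_optimal` is hypothetical.

```python
import numpy as np


def prob_optimal(alphas, betas, n_samples=20_000, seed=0):
    """Monte Carlo estimate of pi(a | h_t) = P[Q(a) is the max | h_t] (Eq. 4).

    Assumes arm a has a Beta(alphas[a], betas[a]) posterior over its mean reward.
    """
    rng = np.random.default_rng(seed)
    # Shape (n_samples, n_arms): each row is one joint draw of all arm means.
    samples = rng.beta(alphas, betas, size=(n_samples, len(alphas)))
    best = samples.argmax(axis=1)
    return np.bincount(best, minlength=len(alphas)) / n_samples


# Arm 0: Beta(9, 3), i.e. 8 successes and 2 failures on top of a uniform prior.
# Arm 1: Beta(1, 1), i.e. no data yet.
pi = prob_optimal(alphas=[9, 1], betas=[3, 1])
```

Arm 1 has a lower posterior mean (0.5 versus 0.75) but is still selected roughly a quarter of the time, because its wide posterior leaves a real chance that it is the better arm.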

Initialize a prior $p(R_{a})$ over each arm $a$ <br>
loop <br>
$\quad$ For each arm $a$, _sample_ a reward distribution $R_{a}$ from the posterior <br>
$\quad$ Compute action-value function $Q(a) = \mathbb{E}[R_{a}]$ <br>
$\quad$ $a_{t} = \underset{a \in \mathcal{A}}{argmax}~Q(a)$ <br>
$\quad$ Observe reward $r$ <br>
$\quad$ Update posterior $p(R_{a_{t}}~|~r)$ using Bayes' rule <br>

_Algorithm 1. Thompson Sampling._
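Algorithm 1 can be sketched for Bernoulli arms with Beta(1, 1) priors. Both the reward model and the prior are assumptions made for illustration. In this case sampling a reward distribution amounts to sampling a mean $\phi_{a}$, and $Q(a) = \mathbb{E}[R_{a}] = \phi_{a}$. The function name `thompson_sampling` and the arm means are hypothetical.

```python
import numpy as np


def thompson_sampling(true_means, n_steps=2000, seed=0):
    """Thompson sampling with Beta(1, 1) priors on Bernoulli arms.

    true_means is used only to simulate rewards; the agent never reads it.
    Returns the number of pulls per arm.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    alpha = np.ones(n_arms)   # 1 + observed successes per arm
    beta = np.ones(n_arms)    # 1 + observed failures per arm
    pulls = np.zeros(n_arms, dtype=int)

    for _ in range(n_steps):
        # Sample one plausible mean reward per arm from its posterior.
        q_sample = rng.beta(alpha, beta)
        a = int(np.argmax(q_sample))
        # Simulate the environment's Bernoulli reward.
        r = int(rng.random() < true_means[a])
        # Conjugate posterior update for the pulled arm only.
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8])
```

As the posteriors sharpen, the samples for the weaker arms rarely exceed the sample for the best arm, so most pulls concentrate on it.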

I found this resource really helpful for understanding Thompson sampling: https://www.youtube.com/watch?v=Zgwfw3bzSmQ.

_Thompson sampling achieves regret bounds comparable to those of UCB._

# 4. Framework: Probably Approximately Correct

Because we evaluate algorithms on total regret, we can't tell whether the regret comes from many small mistakes or a few large ones.

We can tackle this problem with the __Probably Approximately Correct (PAC)__ framework.
 
 $$
 Q(a) \ge Q(a^{*}) - \epsilon \hspace{1em} (Eq.~5)\\
 $$
 
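An action $a$ that satisfies Eq. 5 is called $\epsilon$-optimal. A PAC algorithm chooses an $\epsilon$-optimal action on all but a polynomial number of time steps, with probability at least $1 - \delta$. The polynomial is in the problem's parameters, such as the number of actions, $\frac{1}{\epsilon}$ and $\frac{1}{\delta}$. The algorithms themselves operate much like before (optimism or Thompson sampling); the slack $\epsilon$ only changes what counts as a mistake.

As far as I understand, this framework can be applied to both optimism under uncertainty and probability matching/Thompson sampling.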
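Unlike total regret, the PAC view counts mistakes instead of summing their sizes. A minimal sketch, assuming the true values are known for evaluation purposes (the function name `count_mistakes` and the numbers are hypothetical):

```python
import numpy as np


def count_mistakes(q_true, actions, eps):
    """Number of time steps whose action is not eps-optimal (violates Eq. 5)."""
    q_true = np.asarray(q_true, dtype=float)
    chosen = q_true[np.asarray(actions)]
    return int(np.sum(chosen < q_true.max() - eps))


q_true = [1.0, 0.95, 0.5]
actions = [2, 1, 1, 0, 2, 0]

loose = count_mistakes(q_true, actions, eps=0.1)    # arm 1 counts as good enough
strict = count_mistakes(q_true, actions, eps=0.01)  # only arm 0 is acceptable
```

With $\epsilon = 0.1$ only the two pulls of arm 2 are mistakes; with $\epsilon = 0.01$ the two pulls of arm 1 count as well.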

# 5. Fast RL in MDPs

For the MDP setting (we've been covering the multi-armed bandit setting), we can use the same frameworks. This section focuses on the PAC framework.

I'm not entirely sure, but as far as I understand, UCB and Thompson sampling as presented above apply only to the multi-armed bandit setting. In the (tabular) MDP setting, the algorithms carry the same ideas but aren't exactly the same.

The lecture begins with __optimistic initialization__. In the MDP setting, we can use any of the model-free algorithms (e.g. SARSA, MC, Q-learning) we've learned to estimate $Q(s, a)$. 

We can initialize our q-values optimistically, for example by setting them to $\frac{r_{max}}{1 - \gamma}$ or by initializing $V(s) = \frac{r_{max}}{(1 - \gamma) \prod_{i=1}^{T} \alpha_{i}}$. Here $r_{max}$ is the maximum reward over all state-action pairs, so $\frac{r_{max}}{1 - \gamma} = \sum_{k=0}^{\infty} \gamma^{k} r_{max}$ upper-bounds any discounted return. $\gamma$ is the discount factor. $\alpha_{i}$ is the learning rate at the $i$-th time step, and $T$ is the number of samples needed to learn near-optimal q-values.
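A minimal sketch of optimistic initialization with a Q-learning backup. The table size, $r_{max}$, $\gamma$ and the learning rate are hypothetical values chosen for illustration, and only the simpler $\frac{r_{max}}{1 - \gamma}$ initialization is shown.

```python
import numpy as np

n_states, n_actions = 4, 2
r_max, gamma, lr = 1.0, 0.9, 0.5   # hypothetical values for illustration

# Optimistic start: no discounted return can exceed r_max / (1 - gamma).
Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))


def q_learning_update(Q, s, a, r, s_next):
    """One standard Q-learning backup, applied in place."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += lr * (td_target - Q[s, a])


# A single disappointing transition lowers only the pair that was tried,
# so the untried action in state 0 now looks better to a greedy agent.
q_learning_update(Q, s=0, a=0, r=0.0, s_next=1)
greedy_action = int(Q[0].argmax())
```

Because every untried pair starts at the highest possible value, a purely greedy policy is pushed toward actions it has not tried yet.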

Optimistic initialization is one way to make RL faster in the MDP setting.
Other approaches include:
* be very optimistic until we are confident that the empirical estimates are close to the true parameters
* be optimistic given the information you have
    * compute confidence sets on dynamics/reward models
    * add reward bonuses

Given $\epsilon, \delta, m$ <br>
$\beta = \frac{1}{1 - \gamma} \sqrt{0.5 \ln(2|S||A|\frac{m}{\delta})}$ <br>
$n_{sas}(s, a, s') = 0, \forall s \in S, a \in A, s' \in S$ <br>
$rc(s, a) = 0, n_{sa}(s, a) = 0, \tilde{Q}(s, a) = \frac{1}{1 - \gamma}, \forall s \in S, a \in A$ <br>
$t = 0; s_{t} = s_{init}$ <br>
loop <br>
$\quad$ $a_{t} = \underset{a \in A}{argmax}~\tilde{Q}(s_{t}, a)$ <br>
$\quad$ Observe reward $r_{t}$ and state $s_{t + 1}$ <br>
$\quad$ $n_{sa}(s_{t}, a_{t}) += 1$ <br>
$\quad$ $n_{sas}(s_{t}, a_{t}, s_{t + 1}) += 1$ <br>
$\quad$ $rc(s_{t}, a_{t}) = \frac{rc(s_{t}, a_{t})(n_{sa}(s_{t}, a_{t}) - 1) + r_{t}}{n_{sa}(s_{t}, a_{t})}$ <br>
$\quad$ $\hat{R}(s_{t}, a_{t}) = rc(s_{t}, a_{t})$ <br>
$\quad$ $\hat{T}(s'~|~s_{t}, a_{t}) = \frac{n_{sas}(s_{t}, a_{t}, s')}{n_{sa}(s_{t}, a_{t})}, \forall s' \in S$ <br>
$\quad$ while not converged do <br>
$\quad\quad$ $\tilde{Q}(s, a) = \hat{R}(s, a) + \gamma \sum_{s'}\hat{T}(s'~|~s, a)\underset{a'}{max}\tilde{Q}(s', a') + \underbrace{\frac{\beta}{\sqrt{n_{sa}(s, a)}}}_{reward ~ bonus}, \forall s \in S, a \in A$ <br>
$\quad$ $t = t + 1$

_Algorithm 2. Model-Based Interval Estimation with Exploration Bonus (MBIE-EB)._
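Below is a rough Python sketch of Algorithm 2 for a small tabular MDP with rewards in $[0, 1]$. It is an illustration under several assumptions of mine, not a reference implementation: rewards are deterministic, a fixed number of value iteration sweeps stands in for "while not converged", and rewards are stored as a cumulative sum rather than a running mean. Unvisited pairs would divide by zero in the bonus term, so the sketch assigns them $\frac{1 + \beta}{1 - \gamma}$, the largest value any visited pair can reach. The function name `mbie_eb` and the toy MDP are hypothetical.

```python
import numpy as np


def mbie_eb(P, R, gamma=0.9, m=100, delta=0.1, n_steps=300, vi_sweeps=50, seed=0):
    """Sketch of MBIE-EB. P[s, a, s'] and R[s, a] are the true MDP and are
    used only to simulate transitions, never for planning."""
    rng = np.random.default_rng(seed)
    n_s, n_a = R.shape
    beta = np.sqrt(0.5 * np.log(2 * n_s * n_a * m / delta)) / (1 - gamma)
    q_opt = (1.0 + beta) / (1 - gamma)      # assumed value for unvisited pairs

    n_sa = np.zeros((n_s, n_a))
    n_sas = np.zeros((n_s, n_a, n_s))
    r_sum = np.zeros((n_s, n_a))            # cumulative reward per pair
    Q = np.full((n_s, n_a), q_opt)

    s = 0
    for _ in range(n_steps):
        a = int(Q[s].argmax())
        r = R[s, a]
        s_next = int(rng.choice(n_s, p=P[s, a]))
        n_sa[s, a] += 1
        n_sas[s, a, s_next] += 1
        r_sum[s, a] += r

        visited = n_sa > 0
        counts = np.maximum(n_sa, 1)        # avoid division by zero
        R_hat = r_sum / counts              # empirical mean reward
        T_hat = n_sas / counts[:, :, None]  # empirical transition model
        bonus = beta / np.sqrt(counts)

        # Value iteration on the estimated model plus the reward bonus.
        for _sweep in range(vi_sweeps):
            backup = R_hat + gamma * T_hat @ Q.max(axis=1) + bonus
            Q = np.where(visited, backup, q_opt)
        s = s_next
    return Q, n_sa


# Toy 3-state, 2-action MDP: action 0 tends to stay, action 1 tends to move on.
P = np.full((3, 2, 3), 0.2)
for state in range(3):
    P[state, 0, state] += 0.4
    P[state, 1, (state + 1) % 3] += 0.4
R = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 0.3]])

Q, n_sa = mbie_eb(P, R)
```

The bonus makes rarely tried pairs look valuable, so the greedy choice with respect to $\tilde{Q}$ keeps returning to them until their counts grow and the bonus fades.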

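Algorithm 2 (MBIE-EB) is model-based: it estimates the reward and dynamics models from counts, then runs value iteration on the estimated model to obtain $\tilde{Q}$. The exploration bonus (or reward bonus) $\frac{\beta}{\sqrt{n_{sa}(s, a)}}$ shrinks as a state-action pair is visited more often, so rarely tried pairs look more attractive.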

# 6. Resource

If you missed the link right below the title, I'm providing the resource here again along with the course website.

- [Stanford CS234](https://www.youtube.com/watch?v=FgzM3zpZ55o)
- [Course Website](http://web.stanford.edu/class/cs234/index.html)

This is a series of 15 lectures provided by Stanford.
