# Bandits and Exploration/Exploitation

> In this post, we will learn about a simple reinforcement learning problem, the bandit problem. This is a summary of the lecture "Fundamentals of Reinforcement Learning" from Coursera.

- toc: true 
- badges: true
- comments: true
- author: Chanseok Kang
- categories: [Python, Coursera, Reinforcement_Learning]
- image: 

## Sequential Decision Making with Evaluative Feedback

### Action-Values (or Action-Value function)

The value of an action is its expected reward:

$$ \begin{aligned} q_{*}(a) &\doteq \mathbb{E}[R_t \vert A_t = a] \quad \forall a \in \{1, \dots, k\} \\ &= \sum_{r} p(r \vert a ) r \end{aligned}$$

Here, $\doteq$ means **"is defined as"**. 

The agent's goal is to maximize the expected reward, which means choosing the action with the highest value. As a formula,

$$ \arg\max_a q_{*}(a) $$

Then, how can we calculate $q_{*}(a)$?

We can illustrate the $k$-armed bandit problem with a medicine prescription example. Suppose we have three types of medicine.

![medicine](image/bandit_q_a.png)

Each medicine has its own reward distribution.

![medicine](image/bandit_distribution.png)

The red one is a binomial distribution, the green one is normal, and the blue one is uniform. Based on this information, we can calculate each $q_{*}(a)$.

$$ \begin{aligned} \text{Red :} q_{*}(a) &= 0.5 \times (-11) + 0.5 \times 9 = -1 \\ 
   \text{Green :} q_{*}(a) &= 1 \\
   \text{Blue :} q_{*}(a) &= 3 \\ \end{aligned}$$
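The sum $\sum_r p(r \vert a) r$ can be checked with a few lines of Python. This is only a minimal sketch: the helper name `expected_reward` and the dictionary layout are illustrative choices, and only the red medicine is computed because its two outcomes ($-11$ and $9$, each with probability $0.5$) are the ones given above.

```python
def expected_reward(distribution):
    """Return sum_r p(r | a) * r for a discrete reward distribution.

    `distribution` maps each possible reward r to its probability p(r | a).
    (Hypothetical helper, written for this example only.)
    """
    return sum(prob * reward for reward, prob in distribution.items())


# Red medicine: rewards -11 and 9, each with probability 0.5.
red = {-11: 0.5, 9: 0.5}
q_red = expected_reward(red)
print(q_red)  # -1.0
```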

## Learn Action Values

### Value of an Action

- The value of an action is the **expected reward** when that action is taken

$$ q_{*}(a) \doteq \mathbb{E}[R_t \vert A_t = a] $$

- Usually, $q_{*}(a)$ is not known to the agent, so we need to estimate it. One way to estimate it is the **sample-average method**.

### Sample Average Method

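$$ \begin{aligned} Q_t(a) &\doteq \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} \\ &= \frac{\sum_{i=1}^{t-1}R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}} \end{aligned} $$

Here $\mathbf{1}_{A_i = a}$ is 1 if action $a$ was taken at step $i$ and 0 otherwise. If $a$ is the only action ever taken, this reduces to $\frac{\sum_{i=1}^{t-1}R_i}{t-1}$.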
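As a rough illustration of the sample-average method, the sketch below averages the rewards recorded for one action. The function name `sample_average`, the `(action, reward)` history format, the default of `0.0` for an action that was never taken, and the data itself are all assumptions made up for this example.

```python
def sample_average(history, action, default=0.0):
    """Estimate Q_t(action) from a list of (action, reward) pairs."""
    rewards = [r for a, r in history if a == action]
    if not rewards:
        # The action was never taken, so the ratio is undefined.
        # Falling back to a default estimate is an assumption of this sketch.
        return default
    return sum(rewards) / len(rewards)


# Made-up history: action 0 was taken three times with rewards 1, 3 and 2.
history = [(0, 1.0), (1, 0.0), (0, 3.0), (2, 5.0), (0, 2.0)]
q0 = sample_average(history, 0)
print(q0)  # 2.0
```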

## Estimating Action Values Incrementally

### Incremental update rule

Consider a single action, and let $R_i$ be the reward received after the $i$-th time that action was selected. $Q_n$ is then the estimate of its value after it has been selected $n-1$ times.

$$ \begin{aligned} Q_{n+1} &= \frac{1}{n} \sum_{i=1}^n R_i \\ 
 &= \frac{1}{n} (R_n + \sum_{i=1}^{n-1}R_i) \\ 
 &= \frac{1}{n} \big(R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1}R_i \big) \end{aligned} $$
 
Recall that,

$$ Q_n = \frac{1}{n-1} \sum_{i=1}^{n-1}R_i $$

So we can substitute it with $Q_n$.

$$ \begin{aligned} Q_{n+1} &= \frac{1}{n} (R_n + (n-1)Q_n) \\
  &= \frac{1}{n}(R_n + nQ_n - Q_n) \\
  &= Q_n + \frac{1}{n}(R_n - Q_n) \end{aligned} $$
  
In the formula above, $R_n - Q_n$ is the error between the target (the new reward) and the previous estimate. Here $\frac{1}{n}$ acts as the step size, but we can choose another one. Writing the step size as $\alpha_n$, we get the general form.

$$ Q_{n+1} = Q_n + \alpha_n ( R_n - Q_n) $$
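The following sketch applies the update $Q_{n+1} = Q_n + \alpha_n (R_n - Q_n)$ with $\alpha_n = \frac{1}{n}$ and compares the result with the plain average of the rewards. The function name `incremental_update` and the reward values are made up for illustration.

```python
def incremental_update(q, reward, step_size):
    """One step of Q_{n+1} = Q_n + step_size * (R_n - Q_n)."""
    return q + step_size * (reward - q)


rewards = [1.0, 3.0, 2.0, 6.0]  # made-up rewards for one action

q = 0.0  # Q_1; with step size 1/n its value is overwritten by the first reward
for n, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, 1.0 / n)

batch_mean = sum(rewards) / len(rewards)
print(q, batch_mean)
```

The incremental form only needs to store the current estimate and the count $n$, not the whole reward history.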

With a constant step size, however, rewards observed further in the past contribute less and less to the estimate as $n$ grows.

### Decaying past rewards

Using the update rule derived above with a constant step size $\alpha_n = \alpha$, we can see how past rewards decay.

$$ \begin{aligned} Q_{n+1} &= Q_n + \alpha (R_n - Q_n) \\
 &= \alpha R_n + Q_n - \alpha Q_n \\
 &= \alpha R_n + (1 - \alpha)Q_n \\
 &= \alpha R_n + (1 - \alpha) [\alpha R_{n-1} + (1 - \alpha) Q_{n-1}] \\
 &= \alpha R_n + (1 - \alpha) \alpha R_{n-1} + (1 - \alpha)^2 Q_{n-1} \end{aligned} $$
 
We can unroll this until the initial action value appears.

$$ \begin{aligned} Q_{n+1} &= Q_n + \alpha (R_n - Q_n) \\
 &= \alpha R_n + (1-\alpha)\alpha R_{n-1} + (1 - \alpha)^2 \alpha R_{n-2} + \dots + (1-\alpha)^{n-1} \alpha R_1 + (1- \alpha)^{n} Q_1 \\
 &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i \end{aligned} $$
 
From the formula, we can see that the current estimate $Q_{n+1}$ depends on the initial action value $Q_1$ and all the observed rewards. For $0 < \alpha < 1$, the weight on $Q_1$, $(1-\alpha)^n$, shrinks exponentially as $n$ grows, and so do the weights on older rewards. The influence of the initial action value therefore goes to zero as more data arrives.
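To make the unrolled formula concrete, the sketch below computes the estimate both recursively and with the closed form $(1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$, and lists the weight given to each reward. All function names and the numbers ($Q_1 = 10$, $\alpha = 0.5$, four rewards) are assumptions chosen for illustration.

```python
def recursive_estimate(q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward, starting from Q_1."""
    q = q1
    for r in rewards:
        q = q + alpha * (r - q)
    return q


def unrolled_estimate(q1, rewards, alpha):
    """Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    weighted = sum(
        alpha * (1 - alpha) ** (n - i) * r
        for i, r in enumerate(rewards, start=1)
    )
    return (1 - alpha) ** n * q1 + weighted


def reward_weights(n, alpha):
    """Weight of R_1, ..., R_n in Q_{n+1}; the most recent reward comes last."""
    return [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]


rewards = [1.0, 2.0, 3.0, 4.0]  # made-up rewards
q_rec = recursive_estimate(10.0, rewards, alpha=0.5)
q_unr = unrolled_estimate(10.0, rewards, alpha=0.5)
weights = reward_weights(len(rewards), alpha=0.5)
print(q_rec, q_unr)
print(weights)  # [0.0625, 0.125, 0.25, 0.5]
```

The weights double from one reward to the next here, so the most recent reward counts eight times as much as the oldest one.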

## What is the trade-off?

### Exploration and Exploitation

- **Exploration**: improve knowledge for long-term benefit

- **Exploitation**: exploit knowledge for short-term benefit

- How do we choose when to explore and when to exploit? (Exploration-Exploitation dilemma)

### Epsilon-Greedy Action Selection

Here, Epsilon($\epsilon$) stands for the probability of choosing to explore.

$$ A_t \leftarrow \begin{cases} \arg\max_a Q_t(a) &\mbox{with probability } 1 - \epsilon \\
 a \sim \text{Uniform}(\{a_1 \dots a_k\}) &\mbox{with probability } \epsilon \end{cases} $$
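A minimal sketch of this selection rule in Python, using only the standard library. The function name `epsilon_greedy`, the optional `rng` parameter, and the random tie-breaking among equally good greedy actions are assumptions of this example, not part of the lecture.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index chosen by epsilon-greedy selection.

    q_values: list of current estimates Q_t(a), one entry per action.
    epsilon:  probability of exploring (choosing uniformly at random).
    rng:      the `random` module or a `random.Random` instance.
    """
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest estimate.
    # Ties are broken at random (an assumption of this sketch).
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)


q_values = [0.1, 0.7, 0.3]  # made-up estimates
rng = random.Random(0)      # seeded generator for reproducibility
action = epsilon_greedy(q_values, epsilon=0.1, rng=rng)
```

Note that the exploration branch can also return the greedy action, since it samples uniformly over all actions.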

## Optimistic Initial Values

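We can make the initial action values optimistic: that is, set the initial estimates higher than any reward the agent is likely to receive. Every action then looks disappointing once it is tried, which pushes a greedy agent to try the others.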
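The effect can be seen in a toy experiment. The sketch below runs a purely greedy agent with sample-average updates on a made-up bandit whose rewards are deterministic (an assumption that keeps the example reproducible); the function name `greedy_run` and all numbers are illustrative.

```python
def greedy_run(true_rewards, initial_value, steps):
    """Run a greedy agent and return (estimates, selection counts).

    true_rewards:  fixed reward of each action (deterministic toy bandit).
    initial_value: initial estimate Q_1(a) used for every action.
    """
    k = len(true_rewards)
    q = [initial_value] * k
    counts = [0] * k
    for _ in range(steps):
        # Greedy choice; ties go to the lowest action index.
        action = max(range(k), key=lambda a: q[a])
        counts[action] += 1
        reward = true_rewards[action]
        # Sample-average update with step size 1 / N(a).
        q[action] += (reward - q[action]) / counts[action]
    return q, counts


true_rewards = [0.2, 0.8, 0.5]  # made-up values; action 1 is the best
_, optimistic_counts = greedy_run(true_rewards, initial_value=5.0, steps=20)
_, neutral_counts = greedy_run(true_rewards, initial_value=0.0, steps=20)
print(optimistic_counts)  # [1, 18, 1]
print(neutral_counts)     # [20, 0, 0]
```

With an initial value of 5, the agent tries each action once and then settles on the best one. With an initial value of 0, the first action already looks better than the untried ones, so the agent never leaves it.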

### Limitation of optimistic initial values

- Optimistic initial values only drive **early exploration**

 $\rightarrow$ This means the agent will stop exploring after some time.
- They are not well-suited for **non-stationary problems**.
 
 $\rightarrow$ Action values may change over time, but an optimistic agent will not notice that a different action has become better.
- We may not know what the optimistic initial value should be.

## Upper-Confidence Bound (UCB) Action Selection

### Uncertainty in Estimates

![ucb](image/ucb.png)

As we have seen, the agent cannot observe the true action value $q_*(a)$ directly. Given the current estimate $Q(a)$, the true value may be lower or higher. To handle this, we place a confidence interval around the current estimate: the true value is expected to lie inside this interval, and the width of the interval reflects our uncertainty.

Following the principle of optimism in the face of uncertainty, we act as if each action's value were at the upper bound of its confidence interval, and select the action whose upper bound is highest.

![ucb2](image/ucb2.png)

As an action is selected more often, its confidence interval shrinks and the estimate tends to move closer to the true value. We can express the selection rule with the formula

$$ A_t \doteq \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln{t}}{N_t(a)}} \right] $$

The first term, $Q_t(a)$, is the exploitation term. The second term, $c \sqrt{\frac{\ln{t}}{N_t(a)}}$, is the exploration term: it is large for actions that have rarely been selected. Here $N_t(a)$ is the number of times action $a$ has been selected prior to time $t$, and $c > 0$ controls how much the agent explores.
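A minimal sketch of UCB action selection with the standard library. The function name `ucb_select` is hypothetical, and selecting any untried action first (since $N_t(a) = 0$ makes the bonus undefined) is a common convention assumed here rather than something stated above.

```python
import math


def ucb_select(q_values, counts, t, c):
    """Return argmax_a [Q_t(a) + c * sqrt(ln(t) / N_t(a))].

    q_values: current estimates Q_t(a).
    counts:   N_t(a), how often each action has been selected so far.
    t:        current time step (t >= 1).
    c:        exploration strength (c >= 0).
    """
    # An action that was never tried has an undefined bonus, so this
    # sketch selects it immediately (assumed convention).
    for action, n in enumerate(counts):
        if n == 0:
            return action
    scores = [
        q + c * math.sqrt(math.log(t) / n)
        for q, n in zip(q_values, counts)
    ]
    # Ties go to the lowest action index.
    return max(range(len(scores)), key=lambda a: scores[a])


# Made-up numbers: action 0 has the higher estimate but was tried 100 times,
# while action 1 was tried only once, so its exploration bonus is much larger.
chosen = ucb_select([1.0, 0.9], counts=[100, 1], t=101, c=2.0)
print(chosen)  # 1
```

Setting `c=0` removes the bonus, and the rule falls back to greedy selection.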