# Chapter 2: Multi-armed Bandits

## 1. *k*-armed Bandit Problem
- Simplest RL problem, with only a single state
- Set of `k` options (*actions*)
- At each time step $t$, choose an *action* $A_t$, then receive a *reward* $R_t \in \mathbb R$
- Expected reward (true *value*) of action $a$ is $q_*(a)=\mathbb E[R_t | A_t=a]$
- The true values and the reward distributions are unknown
- They must be estimated: $Q_t(a) \approx q_*(a)$ is the estimated value of action $a$ at time $t$
- Goal is to maximize the expected total reward
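
The setup above can be sketched as a small simulator. This is only an illustration under stated assumptions: the class name `GaussianBandit`, the unit-variance Gaussian rewards, and the way the true values are drawn are choices made here, not part of the notes.

```python
import random


class GaussianBandit:
    """Hypothetical k-armed bandit: arm a pays a reward drawn from N(q_*(a), 1)."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        # True action values q_*(a); a learner would not be allowed to read these.
        self.q_star = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, action):
        """Return a noisy reward R_t for the chosen action A_t."""
        return self.rng.gauss(self.q_star[action], 1.0)


bandit = GaussianBandit(k=10, seed=0)
rewards = [bandit.pull(3) for _ in range(5000)]
# Averaging many rewards from one arm approximates its true value q_*(3).
empirical_mean = sum(rewards) / len(rewards)
```

A learner only ever sees the output of `pull`; `q_star` is kept on the object purely so the estimate can be checked afterwards.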

## 2. Exploration vs Exploitation
- Greedy Action at time $t$ is $A_t^* =\arg\max\limits_a Q_t(a)$
- *Exploiting* if $A_t = A_t^*$
- *Exploring* if $A_t \neq A_t^*$
- Exploitation maximizes the expected reward on the current step
- Exploration may produce a greater total reward in the long run
- Can't do both with any single action selection
- Need to balance Exploitation and Exploration

## 3. Action-value Methods
- Estimate the values of actions and use the estimates to make action selection decisions
- *sample-average* method:
$$Q_t(a)=\dfrac{\sum_{i=1}^{t-1}R_i \cdot \mathbb 1_{A_i=a}}{\sum_{i=1}^{t-1}\mathbb 1_{A_i=a}}$$
- $Q_t(a)$ converges to $q_*(a)$ by the law of large numbers, where $N_t(a)$ is the number of times $a$ has been selected before time $t$:
$$\lim\limits_{N_t(a)\rightarrow\infty}Q_t(a)=q_*(a)$$
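
As a small illustration, the sample-average estimate can be computed directly from a history of actions and rewards. The function name `sample_average` and the `default` returned for an action that has never been selected (where the ratio is undefined) are assumptions made for this sketch.

```python
def sample_average(actions, rewards, a, default=0.0):
    """Q_t(a): mean of the rewards received on the steps where action a was taken."""
    total = 0.0
    count = 0
    for A_i, R_i in zip(actions, rewards):
        if A_i == a:  # indicator 1_{A_i = a}
            total += R_i
            count += 1
    # The ratio is undefined if a was never selected; fall back to a default value.
    return total / count if count > 0 else default


actions = [0, 1, 0, 2, 1, 0]
rewards = [1.0, 0.0, 3.0, 5.0, 2.0, 2.0]
estimates = [sample_average(actions, rewards, a) for a in range(4)]
```

Here action 3 never appears in the history, so its estimate is just the default.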

## 4. ε-greedy Methods
- Usually select greedy actions
- With probability `ε`, pick an action uniformly at random (greedy and non-greedy actions alike)
- Every action keeps being selected, so every $Q_t(a)$ can converge to $q_*(a)$
- `ε` can be reduced over time to try to get the best of both high and low values

***************************************
Initialize, for $a = 1$ to $k$:
$$
\begin{aligned}
Q(a) & \leftarrow 0
\\
N(a) & \leftarrow 0
\end{aligned}
$$

Loop forever:
$$
\begin{aligned}
A & \leftarrow
    \begin{cases}
        \arg\max_a Q(a) &\text{with probability }(1-\epsilon)
        \\
        \text{a random action} &\text{with probability }\epsilon
    \end{cases}
\\
R & \leftarrow \text{bandit}(A)
\\
N(A) & \leftarrow N(A) + 1
\\
Q(A) & \leftarrow Q(A) + \dfrac{1}{N(A)}\big[R-Q(A)\big]
\end{aligned}  
$$
***************************************
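
A minimal Python sketch of the loop above, assuming Gaussian rewards with unit variance around hypothetical true values `q_star` and a finite number of steps instead of "loop forever". Ties among greedy actions are broken at random, which is a choice made here.

```python
import random


def epsilon_greedy(q_star, epsilon=0.1, steps=5000, seed=0):
    """Run epsilon-greedy with sample-average updates; return (Q, N)."""
    rng = random.Random(seed)
    k = len(q_star)
    Q = [0.0] * k  # value estimates
    N = [0] * k    # selection counts
    for _ in range(steps):
        if rng.random() < epsilon:
            A = rng.randrange(k)  # explore: any action, uniformly at random
        else:
            best = max(Q)
            # exploit: greedy action, ties broken at random
            A = rng.choice([a for a in range(k) if Q[a] == best])
        R = rng.gauss(q_star[A], 1.0)  # stand-in for bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]      # incremental sample average
    return Q, N


q_star = [0.2, -0.5, 1.5, 0.0]  # illustrative true values
Q, N = epsilon_greedy(q_star)
```

With a gap this large between the best arm and the rest, most selections should end up on the arm with the highest true value, while the other arms are still sampled occasionally through exploration.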


## 5. Incremental Implementation
- Let $R_i$ be the reward received after the $i$th selection of a given action
- Let $Q_{n+1}$ be the estimate of its value after it has been selected $n$ times:
$$Q_{n+1}=\dfrac{1}{n}\sum_{i=1}^nR_i$$
- Rewrite $Q_{n+1}$ as an incremental update, which needs only $Q_n$, $n$ and the latest reward $R_n$:
$$Q_{n+1}=Q_n + \dfrac{1}{n}\big[R_n-Q_n\big]$$

- **General form**:

$$NewEstimate \leftarrow OldEstimate + StepSize\big[Target - OldEstimate\big]$$

- $\big[Target - OldEstimate\big]$ is an **error** in the estimate
- $StepSize$ is denoted by $\alpha$ or, more generally, by $\alpha_t(a)$
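
A quick check that the incremental rule reproduces the plain average, using made-up rewards. The function name `incremental_mean` is illustrative.

```python
def incremental_mean(rewards):
    """Apply Q <- Q + (1/n) * (R - Q) once per reward and return the final estimate."""
    Q = 0.0  # Q_1 is arbitrary: the first update uses step size 1 and overwrites it
    for n, R in enumerate(rewards, start=1):
        Q += (R - Q) / n  # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    return Q


rewards = [2.0, 4.0, 9.0, 1.0]
batch = sum(rewards) / len(rewards)  # average computed from the full history
incremental = incremental_mean(rewards)
```

The incremental version needs constant memory and constant work per step, whereas the batch average has to store every reward.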


## 6. Tracking a Nonstationary Problem
- **Nonstationary**: the reward probabilities change over time
- Use an *exponential recency-weighted average*, which gives more weight to recent rewards:
$$
\begin{aligned}
Q_{n+1} &= Q_n+\alpha\big[R_n-Q_n\big]
\\
& = (1-\alpha)^nQ_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i}R_i
\end{aligned}
$$
where the step-size parameter $\alpha\in(0, 1]$ is constant
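
The two forms of the update can be compared numerically. This sketch uses made-up rewards; the helper names `constant_step_update`, `closed_form` and `weights` are illustrative.

```python
def constant_step_update(Q1, rewards, alpha):
    """Apply Q <- Q + alpha * (R - Q) once per reward."""
    Q = Q1
    for R in rewards:
        Q += alpha * (R - Q)
    return Q


def closed_form(Q1, rewards, alpha):
    """(1 - alpha)^n * Q1 + sum_i alpha * (1 - alpha)^(n - i) * R_i."""
    n = len(rewards)
    weighted = sum(
        alpha * (1 - alpha) ** (n - i) * R
        for i, R in enumerate(rewards, start=1)
    )
    return (1 - alpha) ** n * Q1 + weighted


def weights(n, alpha):
    """Weight on Q1 followed by the weights on R_1, ..., R_n."""
    return [(1 - alpha) ** n] + [
        alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)
    ]


rewards = [1.0, 0.0, 2.0, 5.0, 3.0]
recursive = constant_step_update(0.5, rewards, alpha=0.1)
direct = closed_form(0.5, rewards, alpha=0.1)
w = weights(len(rewards), alpha=0.1)
```

The weights sum to 1, and the weight on $R_i$ decays exponentially with its age $n-i$, so the most recent reward counts the most.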

- Can vary the step-size parameter from step to step as $\alpha_n(a)$
- Stochastic approximation theory assures convergence with probability 1 when both of the following conditions hold:
$$
\begin{cases}
\displaystyle\sum_{n=1}^{\infty}\alpha_n(a) = \infty
\\
\displaystyle\sum_{n=1}^{\infty}\alpha_n^2(a) < \infty
\end{cases}
$$
- The sample-average step size $\alpha_n(a)=\dfrac{1}{n}$ satisfies both conditions, so its estimates converge
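
As a worked check of the two conditions, the sample-average step size $\alpha_n(a)=\dfrac{1}{n}$ gives
$$
\sum_{n=1}^{\infty}\frac{1}{n} = \infty \quad\text{(harmonic series)},
\qquad
\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} < \infty
$$
so both hold. For a constant step size $\alpha_n(a)=\alpha>0$,
$$
\sum_{n=1}^{\infty}\alpha = \infty,
\qquad
\sum_{n=1}^{\infty}\alpha^2 = \infty
$$
so the second condition fails: the estimates never settle completely and keep responding to the most recent rewards, which is the behavior wanted in a nonstationary problem.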

## 7. Optimistic Initial Values
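- All methods above are biased by their initial estimates $Q_1(a)$
- With a constant step size, the bias decreases over time because $(1-\alpha)<1$; with sample averages, it disappears once every action has been selected at least once
- The initial values can also be used to supply prior knowledge about the expected rewards
- Setting $Q_1(a)$ optimistically (above the true values) encourages exploration early on; the exploration then decreases with time
- Quite effective on stationary problems, but not well suited to nonstationary problems, because the drive to explore is only temporary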
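
A small sketch of the effect, assuming a hypothetical Bernoulli bandit (rewards 0 or 1 with success probabilities `p`), purely greedy selection and a constant step size. Because every reward is below the optimistic start of 5, each pull lowers the estimate of the arm just tried, so the greedy rule moves on to an arm it has not tried yet.

```python
import random


def greedy_run(p, Q1, alpha=0.1, steps=1000, seed=0):
    """Pure greedy selection with a constant step size; returns (Q, actions)."""
    rng = random.Random(seed)
    k = len(p)
    Q = [Q1] * k  # every arm starts from the same initial estimate
    actions = []
    for _ in range(steps):
        best = max(Q)
        A = rng.choice([a for a in range(k) if Q[a] == best])  # random tie-break
        R = 1.0 if rng.random() < p[A] else 0.0                # Bernoulli reward
        Q[A] += alpha * (R - Q[A])
        actions.append(A)
    return Q, actions


p = [0.2, 0.5, 0.8, 0.4]  # illustrative success probabilities
Q_opt, actions_opt = greedy_run(p, Q1=5.0)    # optimistic start
Q_zero, actions_zero = greedy_run(p, Q1=0.0)  # neutral start, for comparison
```

With `Q1=5.0` the first four selections cover all four arms even though no random exploration is used. With `Q1=0.0` the greedy rule can lock onto whichever arm pays off first.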

## 8. Upper-Confidence-Bound Action Selection
- Select the action with the largest upper confidence bound (**UCB**)
$$A_t = \arg\max\limits_a \Bigg[ Q_t(a) + c\sqrt{\dfrac{\ln t}{N_t(a)}} \Bigg]$$

  where $c>0$ controls the degree of exploration
  
- Exploration is reduced over time: the bonus of an action shrinks each time it is selected, because $N_t(a)$ grows
- The square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value
- $c$ determines the confidence level
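
The selection rule can be sketched as a single function. The name `ucb_action` and the default `c=2.0` are choices made here. Actions with $N_t(a)=0$ are treated as maximizing, since the bonus is undefined for them.

```python
import math


def ucb_action(Q, N, t, c=2.0):
    """Return the action maximizing Q[a] + c * sqrt(ln(t) / N[a])."""
    # An action that has never been tried is considered maximizing.
    for a, n in enumerate(N):
        if n == 0:
            return a
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return max(range(len(Q)), key=scores.__getitem__)


# Arm 0 looks slightly better but is well explored; arm 1 has been tried once.
choice_wide = ucb_action([1.0, 0.9], [100, 1], t=101, c=2.0)
choice_narrow = ucb_action([1.0, 0.9], [100, 1], t=101, c=0.01)
```

With `c=2.0` the large uncertainty bonus of the rarely tried arm wins. With a very small `c` the rule behaves almost greedily and keeps the arm with the higher estimate.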

## 9. Gradient Bandit Algorithms

## 10. Associative Search (Contextual Bandits)