# Part I: Classical Bandit Algorithms

We consider a time-slotted bandit system ($t=1,2, \ldots$) with three arms, and denote the arm set by $\{1,2,3\}$. Pulling arm $j$ ($j \in\{1,2,3\}$) yields a random reward $r_{j}$ that follows a Bernoulli distribution with mean $\theta_{j}$, i.e., $\operatorname{Bern}\left(\theta_{j}\right)$. Specifically,

$$
r_{j}= \begin{cases}1, & \text{w.p. } \theta_{j} \\ 0, & \text{w.p. } 1-\theta_{j}\end{cases}
$$

where $\theta_{j}, j \in\{1,2,3\}$ are parameters within $(0,1)$.
Now we run this bandit system for $N$ ($N \gg 3$) time slots. In each time slot $t$, we choose exactly one of the three arms, denoted $I(t) \in\{1,2,3\}$, pull it, and obtain a random reward $r_{I(t)}$. Our objective is to find a policy for choosing $I(t)$ in each time slot $t$ that maximizes the expected aggregate reward over $N$ time slots, i.e.,

$$
\max _{I(t), t=1, \ldots, N} \mathbb{E}\left[\sum_{t=1}^{N} r_{I(t)}\right]
$$
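
The reward model and the $N$-slot interaction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `pull` and `run_policy`, the `thetas` values, and the fixed-arm policy are assumptions made for the example and are not part of the problem statement.

```python
import random

def pull(theta, rng):
    """Draw one Bernoulli(theta) reward: 1 with probability theta, else 0."""
    return 1 if rng.random() < theta else 0

def run_policy(thetas, policy, n_slots, rng):
    """Play n_slots rounds; policy(t) returns the arm I(t) in {1, 2, 3}."""
    total = 0
    for t in range(1, n_slots + 1):
        arm = policy(t)                      # I(t), 1-indexed
        total += pull(thetas[arm - 1], rng)  # reward r_{I(t)}
    return total

rng = random.Random(0)        # seeded so the run is reproducible
thetas = [0.2, 0.6, 0.5]      # hypothetical means, chosen only for illustration
total = run_policy(thetas, lambda t: 2, n_slots=1000, rng=rng)
print(total)                  # aggregate reward when arm 2 is pulled every slot
```

Any rule that maps the slot index (and, in a real algorithm, the observed history) to an arm can take the place of the hypothetical constant policy used here.
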

If we know the values of $\theta_{j}, j \in\{1,2,3\}$, this problem is trivial. Since $r_{I(t)} \sim \operatorname{Bern}\left(\theta_{I(t)}\right)$,

$$
\mathbb{E}\left[\sum_{t=1}^{N} r_{I(t)}\right]=\sum_{t=1}^{N} \mathbb{E}\left[r_{I(t)}\right]=\sum_{t=1}^{N} \theta_{I(t)}
$$

Let $I(t)=I^{*}=\arg \max_{j \in\{1,2,3\}} \theta_{j}$ for $t=1,2, \ldots, N$. Then

$$
\max _{I(t), t=1, \ldots, N} \mathbb{E}\left[\sum_{t=1}^{N} r_{I(t)}\right]=N \cdot \theta_{I^{*}}
$$
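
As a small numerical check of the two identities above, the sketch below evaluates $\sum_{t} \theta_{I(t)}$ for a fixed arm sequence and compares it with $N \cdot \theta_{I^{*}}$. The function names, the `thetas` values, and the round-robin sequence are hypothetical and serve only as an example.

```python
def expected_total_reward(thetas, choices):
    """Expected aggregate reward of a fixed arm sequence: sum of theta_{I(t)}."""
    return sum(thetas[arm - 1] for arm in choices)  # arms are 1-indexed

def oracle_value(thetas, n_slots):
    """Maximum expected aggregate reward: always pull the arm with largest theta."""
    return n_slots * max(thetas)

thetas = [0.2, 0.6, 0.5]   # hypothetical means, chosen only for illustration
n_slots = 10
round_robin = [(t % 3) + 1 for t in range(n_slots)]  # arms 1, 2, 3, 1, 2, 3, ...

print(expected_total_reward(thetas, round_robin))
print(oracle_value(thetas, n_slots))
```

No fixed sequence can exceed the oracle value, because every term $\theta_{I(t)}$ in the sum is at most $\theta_{I^{*}}$.
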

However, in reality, we do not know the values of $\theta_{j}, j \in\{1,2,3\}$. We must estimate them from empirical samples and then make a decision in each time slot based on those estimates. Next we introduce three classical bandit algorithms: $\epsilon$-greedy, upper confidence bound (UCB), and Thompson sampling (TS).

![](greedy_UCB.png)
![](TS.png)

## Problem 1
### Question  
Now suppose we obtain the parameters of the Bernoulli distributions from an oracle, as shown in the following table. Choose $N=5000$ and compute the theoretical maximum of the expected aggregate reward over $N$ time slots. We call it the oracle value. Note that the parameters $\theta_{j}, j \in \{1,2,3\}$ and the oracle value are unknown to all bandit algorithms.
<center>
<table>
  <tr>
    <th>Arm j</th>
    <th>1</th>
    <th>2</th>
    <th>3</th>
  </tr>
  <tr>
    <td>θ<sub>j</sub></td>
    <td>0.7</td>
    <td>0.5</td>
    <td>0.4</td>
  </tr>
</table>
</center>

### Solution
Since the oracle gives us each arm's parameter, the expected aggregate reward over $N$ time slots is maximized by pulling the arm with the largest parameter in every slot.

Given $\theta_1 = 0.7, \theta_2 = 0.5, \theta_3 = 0.4$,
we have $\theta_1 > \theta_2 > \theta_3$.
Thus, we choose arm 1 every time.

i.e., $$\forall t,\ I(t)=I^*=\arg \max\limits_{j\in\{1,2,3\}}\theta_j=1$$
$$\theta_{I(t)} = \theta_1 = 0.7$$

Since $r_{I(t)} \sim \text{Bern}(\theta_{I(t)})$,

$$\mathbb{E}\left[r_{I(t)}\right] = \theta_{I(t)}$$

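The maximum expected aggregate reward is

$$
\begin{aligned}
\max_{I(t),\,t=1,\ldots,N} \mathbb{E}\left[\sum_{t=1}^{N} r_{I(t)}\right]
&= \max_{I(t),\,t=1,\ldots,N} \sum_{t=1}^{N} \mathbb{E}\left[r_{I(t)}\right] \\
&= \max_{I(t),\,t=1,\ldots,N} \sum_{t=1}^{N} \theta_{I(t)} \\
&= N \cdot \theta_{I^*} = 5000 \times 0.7 = 3500
\end{aligned}
$$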

Therefore, with the given oracle parameters, the maximum expected value is 3500.
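
The value 3500 can also be sanity-checked by simulation. The following is a minimal Monte Carlo sketch, assuming the table's parameters and $N=5000$; the number of trials (200), the seed, and the helper name `simulate_fixed_arm` are arbitrary choices made for illustration.

```python
import random

THETAS = {1: 0.7, 2: 0.5, 3: 0.4}  # oracle parameters from the table
N = 5000                           # number of time slots

def simulate_fixed_arm(theta, n_slots, rng):
    """Aggregate reward from pulling one Bernoulli(theta) arm n_slots times."""
    return sum(1 for _ in range(n_slots) if rng.random() < theta)

best_arm = max(THETAS, key=THETAS.get)  # arm with the largest theta
oracle = N * THETAS[best_arm]           # N * theta_{I*}

rng = random.Random(42)                 # seeded for reproducibility
n_trials = 200                          # arbitrary number of repetitions
totals = [simulate_fixed_arm(THETAS[best_arm], N, rng) for _ in range(n_trials)]
empirical_mean = sum(totals) / n_trials

print(best_arm, oracle, empirical_mean)
```

The empirical mean should lie close to the oracle value. Individual runs fluctuate around it, since a single run has standard deviation $\sqrt{N \theta_1 (1-\theta_1)} \approx 32.4$.
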
