# Upper Confidence Bound

**Author:** ZHENG Wenjie

**Last Update:** 2021-08-13

This notebook is related to Section 2.7 of the book.

For the plot renderer, I used 'notebook_connected' to reduce the file size. For personal use, replace it with 'notebook'.

In [1]:
import numpy as np
from tqdm import trange
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook_connected' # or 'notebook' for personal use

from chapter02 import Bandit, Greedy, UCB, play
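
The `UCB` player imported above lives in `chapter02` and is not shown in this notebook. As a rough sketch of the idea only, the upper-confidence-bound rule is usually stated as $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]$, with untried arms treated as maximizing. The function `ucb_select` and the small loop below are hypothetical illustrations written for this note, not the `chapter02` implementation.

```python
import numpy as np

def ucb_select(q, counts, t, c):
    """Pick an arm by the UCB rule (illustrative sketch, not chapter02's code).

    q      : current value estimates, one per arm
    counts : number of times each arm has been pulled so far
    t      : current time step, counted from 1
    c      : exploration strength
    """
    # Untried arms are treated as maximizing; take the lowest index first.
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    # All counts are positive here, so the division is safe.
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))

# Hypothetical 10-armed bandit with unit-variance Gaussian rewards.
rng = np.random.RandomState(0)
means = rng.randn(10)
q = np.zeros(10)
counts = np.zeros(10)
pulls = []
for t in range(1, 31):
    a = ucb_select(q, counts, t, c=2)
    r = means[a] + rng.randn()
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental sample-average update
    pulls.append(a)
```

In this sketch the first ten pulls visit arms 0 through 9 in order, because every arm starts untried. Handling untried arms explicitly also avoids dividing by a zero count.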

## M reruns on the same bandit

Here, we have a *single* bandit machine and run each algorithm on it repeatedly. Each run uses different random numbers. This experiment does not appear in the book.

In [2]:
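M = 1000  # number of reruns
K = 10    # number of arms
T = 1000  # time steps per run
σ = 1     # second argument passed to Bandit
# Rows index the three algorithms, columns index the time steps.
average_reward = np.zeros((3, T))
average_hit = np.zeros((3, T))
for m in trange(M):
    reward = np.zeros((3, T))
    hit = np.zeros((3, T))
    for i, player in enumerate([Greedy(K, 0.1, seed=m), UCB(K, 1), UCB(K, 2)]):
        # Fixed seed: the arm means drawn on the next line are identical in every run.
        np.random.seed(0)
        bandit = Bandit(np.random.randn(K), σ, seed=m)
        reward[i], hit[i] = play(bandit, player, T)
    average_reward += reward
    average_hit += hit
# Average over the M reruns; average_hit is plotted below as '% Optimal Pull'.
average_reward /= M
average_hit /= M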


invalid value encountered in true_divide


invalid value encountered in true_divide


divide by zero encountered in true_divide

100%|██████████| 1000/1000 [00:59<00:00, 16.76it/s]


In [3]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{M} times rerun on the same bandit')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='Greedy: ε=0.1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='UCB: c=1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='UCB: c=2'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.show()

## N independent bandits

Here, we have $N$ independent bandit machines, and every algorithm is tested on each of them. This is the experiment prescribed in the book. Note that this setup differs slightly from Zhang Shangtong's, which uses a different set of $N$ bandits for each algorithm.
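
The difference between the two designs can be illustrated with a small sketch. The helper `arm_means`, the algorithm labels, and the seeds below are all hypothetical and chosen only for illustration. The "separate" variant shows the general idea of giving each algorithm its own bandits and is not a reproduction of anyone's code.

```python
import numpy as np

K = 10
algorithms = ["greedy", "ucb_c1", "ucb_c2"]  # placeholder labels only

def arm_means(seed, K=K):
    # A dedicated RandomState makes the draw reproducible for a given seed.
    return np.random.RandomState(seed).randn(K)

# Shared design (as in this notebook): one bandit, reused by every algorithm.
shared = {name: arm_means(seed=7) for name in algorithms}

# Separate design: each algorithm gets its own, differently seeded bandit.
separate = {name: arm_means(seed=7 * len(algorithms) + i)
            for i, name in enumerate(algorithms)}

same_in_shared = all(np.array_equal(shared["greedy"], m) for m in shared.values())
same_in_separate = all(np.array_equal(separate["greedy"], m) for m in separate.values())
```

With shared bandits, differences between the curves come from the algorithms and the reward noise rather than from the luck of the bandit draw. With separate bandits that extra variation only averages out over the $N$ machines.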

In [4]:
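N = 1000  # number of independent bandits
K = 10    # number of arms
T = 1000  # time steps per run
σ = 1     # second argument passed to Bandit
# Rows index the three algorithms, columns index the time steps.
average_reward = np.zeros((3, T))
average_hit = np.zeros((3, T))
for n in trange(N):
    reward = np.zeros((3, T))
    hit = np.zeros((3, T))
    for i, player in enumerate([Greedy(K, 0.1, seed=n), UCB(K, 1), UCB(K, 2)]):
        # Seed n: bandit n has its own arm means, shared by all three algorithms.
        np.random.seed(n)
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        reward[i], hit[i] = play(bandit, player, T)
    average_reward += reward
    average_hit += hit
# Average over the N bandits; average_hit is plotted below as '% Optimal Pull'.
average_reward /= N
average_hit /= N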


invalid value encountered in true_divide


invalid value encountered in true_divide


divide by zero encountered in true_divide

100%|██████████| 1000/1000 [00:56<00:00, 17.73it/s]


In [5]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{N} independent bandits')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='Greedy: ε=0.1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='UCB: c=1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='UCB: c=2'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.show()

## Discussion

The results match those presented in the book, with one difference. In the first of the two figures above, the *spike* appears at $t=3$ rather than at $t=10$ as in the second figure (and in the book). The reason is that the optimal arm of that particular bandit is the 4th. Assuming UCB first tries the untried arms in index order, that arm is pulled at $t=3$ (time steps are counted from 0) in every rerun.
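
The claim about the optimal arm can be checked directly. The first experiment draws its arm means with `np.random.seed(0)` followed by `np.random.randn(K)`, so the same two calls reproduce them; the names `means` and `best_arm` are introduced here only for this check.

```python
import numpy as np

# Reproduce the arm means of the single bandit used in the first experiment.
np.random.seed(0)
means = np.random.randn(10)

# 0-based index of the arm with the highest mean reward.
best_arm = int(np.argmax(means))
print(best_arm)  # → 3, i.e. the 4th arm
```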