# The 10-armed Testbed

**Author:** ZHENG Wenjie

**Last Update:** 2021-08-13

This notebook is related to Section 2.3 of the book.

In the book, the authors claim that the pure greedy algorithm performs worse than the $\varepsilon$-greedy versions because it does not explore enough.

Here, I show that the pure greedy algorithm is just as good as the $\varepsilon$-greedy versions (if not better) in terms of average reward.

For the plot renderer, I used 'notebook_connected' to reduce the file size.
For personal use, replace it with 'notebook'.

In [1]:
import numpy as np
from tqdm import trange
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook_connected' # or 'notebook' for personal use

from chapter02 import Bandit, Greedy, play

## M times rerun on the same bandit

Here, we have a *single* bandit machine, and we run each algorithm on it $M$ times.
Each run uses different random numbers. This experiment does not appear in the book.

In [2]:
M = 2000
K = 10
T = 1000
σ = 1
average_reward = np.zeros((3, T))
average_hit = np.zeros((3, T))
for m in trange(M):
    reward = np.zeros((3, T))
    hit = np.zeros((3, T))
    for i, player in enumerate([Greedy(K, 0), Greedy(K, 0.1, seed=m), Greedy(K, 0.01, seed=m)]):
        np.random.seed(0)  # fixed seed: every run draws the same true action values
        bandit = Bandit(np.random.randn(K), σ, seed=m)
        reward[i], hit[i] = play(bandit, player, T)
    average_reward += reward
    average_hit += hit
average_reward /= M
average_hit /= M


invalid value encountered in true_divide

100%|██████████| 2000/2000 [01:10<00:00, 28.54it/s]


In [3]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{M} times rerun on the same bandit')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='ε=0'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='ε=0.1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='ε=0.01'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.show()

## N independent bandits

Here, we have $N$ independent bandit machines, and each one is tested against all the algorithms. This is the experiment prescribed in the book. Note that the experiment presented here differs slightly from Zhang Shangtong's, which uses a different set of $N$ bandits for each algorithm.
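The difference between the two designs can be sketched as follows. This is a minimal illustration, not the code in `chapter02`: the helper `draw_means` is hypothetical, and a local `RandomState` stands in for the global seeding used in the cell below. Reseeding with the bandit index before drawing the arm means gives every algorithm the same machine, so differences between the curves come from the algorithms rather than from the luck of the draw.

```python
import numpy as np

K = 10  # number of arms, as in this notebook


def draw_means(n, K=K):
    """Hypothetical helper: true action values of bandit number n."""
    rng = np.random.RandomState(n)  # same n -> same random stream
    return rng.randn(K)


# Shared design: all three algorithms face the same bandit n.
shared = [draw_means(7) for _ in range(3)]
same = all(np.array_equal(shared[0], m) for m in shared)

# Separate design: each algorithm gets its own bandit.
separate = [draw_means(7 + 1000 * i) for i in range(3)]
differ = not np.array_equal(separate[0], separate[1])
```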

In [4]:
N = 2000
K = 10
T = 1000
σ = 1
average_reward = np.zeros((3, T))
average_hit = np.zeros((3, T))
for n in trange(N):
    reward = np.zeros((3, T))
    hit = np.zeros((3, T))
    for i, player in enumerate([Greedy(K, 0), Greedy(K, 0.1, seed=n), Greedy(K, 0.01, seed=n)]):
        np.random.seed(n)  # seed n: all three players face the same bandit n
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        reward[i], hit[i] = play(bandit, player, T)
    average_reward += reward
    average_hit += hit
average_reward /= N
average_hit /= N


invalid value encountered in true_divide

100%|██████████| 2000/2000 [01:14<00:00, 26.82it/s]


In [5]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{N} independent bandits')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='ε=0'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='ε=0.1'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='ε=0.01'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.show()

## Discussion

We can observe that the pure greedy version often works better than the $\varepsilon$-greedy versions in terms of average reward, although the latter are still better in terms of the percentage of optimal pulls.

In this notebook, we arrive at a different conclusion from the one in the book. Sutton and Barto made a mistake because they failed to realize that their implementation has two parameters: the recognized $\varepsilon$ and the hidden initial action-value estimates. They believe that these estimates are overwritten immediately after each arm's first pull. In fact, the estimates are already in use (for comparison) before they are ever updated.
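To see how the initial estimates act as a hidden parameter, consider a minimal sketch with deterministic rewards. It is an illustration under stated assumptions, not the book's code or the code in `chapter02`: the function `greedy_pulls` is hypothetical, ties are broken toward the lowest index, and the estimates are updated by sample averaging.

```python
import numpy as np


def greedy_pulls(rewards_per_arm, q_init, steps):
    """Pure greedy play on a toy bandit whose arms pay fixed rewards.

    Hypothetical helper for illustration only. Returns the list of
    arms pulled at each step.
    """
    K = len(rewards_per_arm)
    q = np.full(K, q_init, dtype=float)  # initial action-value estimates
    n = np.zeros(K, dtype=int)           # pull counts
    pulled = []
    for _ in range(steps):
        a = int(np.argmax(q))            # compares updated AND untouched estimates
        r = rewards_per_arm[a]
        n[a] += 1
        q[a] += (r - q[a]) / n[a]        # sample-average update
        pulled.append(a)
    return pulled


rewards = [-0.5, 1.0, 2.0]               # arm 2 is the best arm
print(greedy_pulls(rewards, 0.0, 5))     # → [0, 1, 1, 1, 1]
print(greedy_pulls(rewards, 5.0, 5))     # → [0, 1, 2, 2, 2]
```

With an initial estimate of 0, arm 2 is never tried: its untouched estimate of 0 loses the comparison against arm 1's average of 1.0. With an initial estimate of 5, every arm is tried once before the estimates settle. The same greedy rule thus behaves differently depending on a value that was supposed to be irrelevant.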

In my implementation, I used the true first-pull rewards as the initial estimates and thus avoided the hidden parameter. In addition, this simple trick significantly improves the performance.
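One possible reading of this trick is sketched below. It is not the `Greedy` class from `chapter02`, whose source is not shown here: the function `first_pull_greedy` and its parameters are hypothetical, and the sketch assumes that every arm is pulled once in an initial sweep so that each estimate starts from an observed reward.

```python
import numpy as np


def first_pull_greedy(means, sigma, epsilon, steps, seed=0):
    """ε-greedy play where each arm's first reward seeds its estimate.

    Hypothetical sketch: the first K steps pull every arm once, so no
    arbitrary initial value ever enters the argmax comparison.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    q = np.zeros(K)               # placeholders, overwritten during the sweep
    n = np.zeros(K, dtype=int)    # pull counts
    rewards = np.empty(steps)
    for t in range(steps):
        if t < K:
            a = t                           # initial sweep over all arms
        elif rng.random() < epsilon:
            a = int(rng.integers(K))        # explore uniformly at random
        else:
            a = int(np.argmax(q))           # exploit the current estimates
        r = rng.normal(means[a], sigma)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]           # first pull sets q[a] = r exactly
        rewards[t] = r
    return rewards, n


# Noise-free check with pure greedy (epsilon = 0): after the sweep,
# the player settles on the best arm.
rewards, counts = first_pull_greedy([0.5, 1.0, 2.0], sigma=0.0, epsilon=0.0, steps=6)
print(counts.tolist())  # → [1, 1, 4]
```

Because the first update divides by a count of 1, the placeholder zeros in `q` are replaced by real rewards before any greedy comparison takes place.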