# Parameter Study

**Author:** ZHENG Wenjie

**Last Update:** 2021-08-13

This notebook is related to Section 2.10 of the book.

For the plot renderer, I used 'notebook_connected' to reduce the file size. For personal use, replace it with 'notebook'.

In [1]:
from collections import namedtuple
from tqdm import trange
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook_connected' # or 'notebook' for personal use

from chapter02 import Bandit, Greedy, Optimistic, UCB, Gradient, play

## N independent bandits

We simulate $N$ independent bandit machines and run every algorithm, at every parameter value, on each of them. This is the experiment prescribed in the book. Note that it differs slightly from Zhang Shangtong's implementation, which uses a different set of $N$ bandits for each algorithm.
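
As a minimal sketch of why reusing the seed `n` gives every algorithm the same machine, the snippet below draws the true action values from a generator that depends only on the seed. It uses the standard library's `random` module and a hypothetical helper `make_true_values`; the notebook itself relies on `np.random.seed(n)` and the `Bandit` class from `chapter02`, whose internals are not shown here.

```python
import random

K = 10  # number of arms, as in the notebook


def make_true_values(seed, k=K):
    """Draw the k true action values of one bandit from N(0, 1).

    Hypothetical helper for illustration; not part of chapter02.
    """
    rng = random.Random(seed)  # private generator: the draw depends only on the seed
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


# Bandit number 3 as seen by two different algorithms: same seed, same machine.
values_for_greedy = make_true_values(seed=3)
values_for_ucb = make_true_values(seed=3)
same_bandit = values_for_greedy == values_for_ucb

# A different seed gives a different machine.
other_bandit = make_true_values(seed=4)
```

Here `same_bandit` is `True`, while `other_bandit` differs from both lists.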

In [2]:
Configuration = namedtuple('Configuration', ['algo', 'parameter', 'reward', 'hit'])

In [3]:
T = 1000
greedy = Configuration(Greedy, np.logspace(-7, -2, num=6, base=2), np.zeros((6, T)), np.zeros((6, T)))
optimistic = Configuration(Optimistic, np.logspace(-2, 2, num=5, base=2), np.zeros((5, T)), np.zeros((5, T)))
ucb = Configuration(UCB, np.logspace(-4, 2, num=7, base=2), np.zeros((7, T)), np.zeros((7, T)))
gradient = Configuration(Gradient, np.logspace(-5, 1, num=7, base=2), np.zeros((7, T)), np.zeros((7, T)))
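
The four grids above are consecutive powers of two. The sketch below rebuilds them as exact fractions with the standard library, which makes it easy to check the tick labels used in the plots later on. The helper `log2_grid` is an illustrative assumption, not part of the notebook; for integer exponents it should yield the same values as `np.logspace(start, stop, num=stop - start + 1, base=2)`.

```python
from fractions import Fraction


def log2_grid(start, stop):
    """Powers of two 2**start, ..., 2**stop (inclusive), as exact fractions."""
    return [Fraction(2) ** e for e in range(start, stop + 1)]


# Exponent ranges copied from the Configuration definitions above.
grids = {
    'ε-greedy (ε)': log2_grid(-7, -2),
    'optimistic (Q0)': log2_grid(-2, 2),
    'UCB (c)': log2_grid(-4, 2),
    'gradient (α)': log2_grid(-5, 1),
}

for name, grid in grids.items():
    print(name, [str(v) for v in grid])

total_values = sum(len(grid) for grid in grids.values())
```

The grids contain 6, 5, 7 and 7 values respectively, so `total_values` is 25.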

In [4]:
N = 1000
K = 10
σ = 1

for n in trange(N):
    for i, p in enumerate(greedy.parameter):
        np.random.seed(n)
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        player = Greedy(K, ε=p, seed=n)
        reward, hit = play(bandit, player, T)
        greedy.reward[i] += reward
        greedy.hit[i] += hit
    for i, p in enumerate(optimistic.parameter):
        np.random.seed(n)
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        player = Optimistic(K, q0=p)
        reward, hit = play(bandit, player, T)
        optimistic.reward[i] += reward
        optimistic.hit[i] += hit
    for i, p in enumerate(ucb.parameter):
        np.random.seed(n)
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        player = UCB(K, c=p)
        reward, hit = play(bandit, player, T)
        ucb.reward[i] += reward
        ucb.hit[i] += hit
    for i, p in enumerate(gradient.parameter):
        np.random.seed(n)
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        player = Gradient(K, α=p, seed=n)
        reward, hit = play(bandit, player, T)
        gradient.reward[i] += reward
        gradient.hit[i] += hit
    
for conf in (greedy, optimistic, ucb, gradient):
    conf.reward[:] /= N
    conf.hit[:] /= N


invalid value encountered in true_divide


invalid value encountered in true_divide


divide by zero encountered in true_divide

100%|██████████| 1000/1000 [13:46<00:00,  1.21it/s]


## Average over first 1000 pulls

With 4 algorithms and a total of 25 parameter values, plotting every learning curve against time would clutter the figure.
Instead, as in the book, we summarize each curve by its average over the first 1000 pulls, an AUC (area under the curve) style summary.
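
To make the summary concrete, here is a minimal sketch on made-up numbers. Each row of `toy_reward` stands for one averaged learning curve (one parameter value), and the summary is the mean over the time axis, which is what `np.mean(..., axis=1)` computes in the plotting cell. The values and the helper `curve_summary` are hypothetical.

```python
def curve_summary(curve):
    """Mean of one learning curve over all of its time steps."""
    return sum(curve) / len(curve)


# Hypothetical averaged curves: one row per parameter value, one column per step.
toy_reward = [
    [0.0, 0.5, 1.0, 1.5],  # slow start, high finish
    [1.0, 1.0, 1.0, 1.0],  # flat from the first pull
]

summaries = [curve_summary(row) for row in toy_reward]
```

Up to the constant factor $1/T$, this mean is the area under the curve, so a parameter that learns slowly is penalized even if it ends high: here the first curve scores 0.75 and the flat one scores 1.0.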

In [5]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=False, vertical_spacing=0.08, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'Average over first {T} pulls')
fig.update_xaxes(
    type='log',
    tickmode='array',
    tickvals=np.logspace(-7, 2, num=10, base=2),
    ticktext=['1/128', '1/64', '1/32', '1/16', '1/8', '1/4', '1/2', '1', '2', '4']
)

fig.add_scatter(x=greedy.parameter, y=np.mean(greedy.reward, axis=1), legendgroup=0, name='ε-greedy', row=1, col=1)
fig.add_scatter(x=optimistic.parameter, y=np.mean(optimistic.reward, axis=1), legendgroup=1, name='optimistic, α=0.1', row=1, col=1)
fig.add_scatter(x=ucb.parameter, y=np.mean(ucb.reward, axis=1), legendgroup=2, name='UCB', row=1, col=1)
fig.add_scatter(x=gradient.parameter, y=np.mean(gradient.reward, axis=1), legendgroup=3, name='Gradient', row=1, col=1)

fig.add_scatter(x=greedy.parameter, y=np.mean(greedy.hit, axis=1), legendgroup=0, showlegend=False, marker_color='#636EFA', name='ε-greedy', row=2, col=1)
fig.add_scatter(x=optimistic.parameter, y=np.mean(optimistic.hit, axis=1), legendgroup=1, showlegend=False, marker_color='#EF553B', name='optimistic, α=0.1', row=2, col=1)
fig.add_scatter(x=ucb.parameter, y=np.mean(ucb.hit, axis=1), legendgroup=2, showlegend=False, marker_color='#00CC96', name='UCB', row=2, col=1)
fig.add_scatter(x=gradient.parameter, y=np.mean(gradient.hit, axis=1), legendgroup=3, showlegend=False, marker_color='#AB63FA', name='Gradient', row=2, col=1)


fig.show()

## Animation

We use an animation to show the temporal dimension. This animation does not appear in the book.
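
The cells in this section first flatten each (parameter × time) array into long-format rows, so that every animation frame is simply the subset of rows sharing one time index. A minimal sketch of that reshaping in plain Python, with made-up values and a hypothetical helper `to_long`:

```python
def to_long(algorithm, parameters, values):
    """Flatten a (parameter x time) table into (algorithm, parameter, time, value) rows."""
    return [
        (algorithm, p, t, v)
        for p, row in zip(parameters, values)
        for t, v in enumerate(row)
    ]


# Two parameter values, three time steps (made-up numbers).
toy = to_long('Greedy', [0.25, 0.5], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# One animation frame is all rows with the same time index.
frame_1 = [row for row in toy if row[2] == 1]
```

`frame_1` then holds one point per parameter value, which is the curve drawn for that frame.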

In [6]:
conf = [greedy, optimistic, ucb, gradient]

In [7]:
reward = pd.DataFrame(
    ((x.algo.__name__, x.parameter[i], j, x.reward[i,j]) for x in conf for i in range(len(x.parameter)) for j in range(T)), 
    columns=['Algorithm', 'Parameter', 'Time', 'Reward'])

In [8]:
fig = px.line(reward, x='Parameter', y='Reward', animation_frame='Time', animation_group='Parameter', color='Algorithm', 
              log_x=True, range_y=[-1, 2])

In [9]:
fig.update_xaxes(
    type='log',
    tickmode='array',
    tickvals=np.logspace(-7, 2, num=10, base=2),
    ticktext=['1/128', '1/64', '1/32', '1/16', '1/8', '1/4', '1/2', '1', '2', '4']
)
fig.show()

In [10]:
hit = pd.DataFrame(
    ((x.algo.__name__, x.parameter[i], j, x.hit[i,j]) for x in conf for i in range(len(x.parameter)) for j in range(T)), 
    columns=['Algorithm', 'Parameter', 'Time', 'Hit'])

In [11]:
fig = px.line(hit, x='Parameter', y='Hit', animation_frame='Time', animation_group='Parameter', color='Algorithm', 
        log_x=True, range_y=[0, 1])

In [12]:
fig.update_xaxes(
    type='log',
    tickmode='array',
    tickvals=np.logspace(-7, 2, num=10, base=2),
    ticktext=['1/128', '1/64', '1/32', '1/16', '1/8', '1/4', '1/2', '1', '2', '4']
)
fig.show()

## Discussion

In this experiment, I used objective rewards as the initial estimates for the $\varepsilon$-greedy algorithm, as opposed to the subjective ones in the book. As a consequence, it performs much better than in the book. We can observe that the optimistic algorithm starts the fastest, while the gradient method starts slowly but later catches up. All algorithms achieve good results asymptotically.

One thing of particular interest here is that all curves go up and down in the same rhythm. This is because, in each simulation, all algorithms work on the same bandit (with the same random number generator seed). A bad day for one algorithm is also a bad day for the others.
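
A toy model can illustrate this effect. It is not the notebook's simulation: the shared `luck` term and the noise scales below are assumptions chosen only for illustration. When two algorithms share the same bandit, the shared luck cancels in their difference, so the comparison is much less noisy than when each algorithm gets its own bandit.

```python
import random
import statistics

rng = random.Random(0)
n_runs = 2000

paired, unpaired = [], []
for _ in range(n_runs):
    luck = rng.gauss(0.0, 1.0)        # shared "good day / bad day" of one bandit
    other_luck = rng.gauss(0.0, 1.0)  # luck of an unrelated bandit
    a = luck + rng.gauss(0.0, 0.1)              # algorithm A on the shared bandit
    b = luck + rng.gauss(0.0, 0.1)              # algorithm B on the same bandit
    b_alone = other_luck + rng.gauss(0.0, 0.1)  # algorithm B on its own bandit
    paired.append(a - b)
    unpaired.append(a - b_alone)

var_paired = statistics.variance(paired)      # shared luck cancels out
var_unpaired = statistics.variance(unpaired)  # both sources of luck remain
```

In this model the theoretical variances are 0.02 for the paired difference and 2.02 for the unpaired one, so `var_paired` comes out far smaller than `var_unpaired`.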

One may question whether it is legitimate to compare the parameters of these different algorithms on a single axis. The comparison is meaningful because each of them (Greedy: $\varepsilon$; Optimistic: $Q_0$; UCB: $c$; Gradient: $\alpha$) characterizes the explore-exploit tradeoff. For an explanation of the $\alpha$ of the Gradient method, see the Discussion section of Notebook 02.08 Gradient Bandit Algorithm.
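
For reference, the four parameters enter the standard textbook formulations as follows. These are the usual forms from Chapter 2 of the book; the classes in `chapter02` may differ in details that this notebook does not show.

$\varepsilon$-greedy explores uniformly with probability $\varepsilon$:

$$
A_t =
\begin{cases}
\arg\max_a Q_t(a) & \text{with probability } 1 - \varepsilon, \\
\text{a uniformly random action} & \text{with probability } \varepsilon.
\end{cases}
$$

The optimistic method acts greedily but starts from inflated estimates, updated with a constant step size (here $\alpha = 0.1$):

$$
Q_1(a) = Q_0 \quad \text{for all } a, \qquad Q_{n+1} = Q_n + \alpha \, (R_n - Q_n).
$$

UCB adds an exploration bonus scaled by $c$:

$$
A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right].
$$

The gradient method samples from a softmax over preferences $H_t$, updated with step size $\alpha$:

$$
\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b} e^{H_t(b)}}, \qquad
H_{t+1}(a) = H_t(a) + \alpha \, (R_t - \bar{R}_t) \left( \mathbb{1}_{a = A_t} - \pi_t(a) \right).
$$

In the first three cases a larger parameter means more exploration. For $\alpha$ the link is less direct, which is why the discussion in Notebook 02.08 is needed.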