# Gradient Bandit Algorithm

**Author:** ZHENG Wenjie

**Last Update:** 2021-08-13

This notebook is related to Section 2.8 of the book.

For the plot renderer, I used 'notebook_connected' to reduce the file size. For personal use, replace it with 'notebook'.

In [1]:
import numpy as np
from tqdm import trange
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook_connected' # or 'notebook' for personal use

from chapter02 import Bandit, Gradient, play

## M times rerun on the same bandit

Here, we have a *single* bandit machine and run each algorithm on it repeatedly, with different random numbers in every run. This experiment does not appear in the book.

In [2]:
M = 1000
K = 10
T = 1000
σ = 1
average_reward = np.zeros((4, T))
average_hit = np.zeros((4, T))
for m in trange(M):
    reward = np.zeros((4, T))
    hit = np.zeros((4, T))
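    # Four players: α ∈ {0.1, 0.4}, each with the default (empirical-average) baseline or a fixed baseline of -4.
    players = [
        Gradient(K, 0.1, seed=m),
        Gradient(K, 0.4, seed=m),
        Gradient(K, 0.1, baseline=-4, seed=m),
        Gradient(K, 0.4, baseline=-4, seed=m),
    ]
    for i, player in enumerate(players):
        np.random.seed(0)  # fixed seed, so every run draws the same arm means (the same bandit)
        bandit = Bandit(np.random.randn(K), σ, seed=m)
        reward[i], hit[i] = play(bandit, player, T)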
    average_reward += reward
    average_hit += hit
average_reward /= M
average_hit /= M

100%|██████████| 1000/1000 [04:55<00:00,  3.39it/s]


In [3]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{M} times rerun on the same bandit')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='α=0.1, objective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='α=0.4, objective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='α=0.1, subjective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[3, :], legendgroup=3, name='α=0.4, subjective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[3, :], legendgroup=3, showlegend=False, marker_color='#AB63FA'), row=2, col=1)
fig.show()

The label "subjective" refers to the use of a fixed, subjective baseline (in this experiment, `baseline=-4`, that is, 4 units below the average reward across all bandits).
The label "objective" refers to the use of the empirical average reward as the baseline. In the book, the algorithm as stated excludes the current reward from the baseline, but the book's experiment *does* include it.
In my implementation, the baseline does *not* include the current reward. The difference is negligible, though.
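
To make the two baseline conventions concrete, here is a minimal sketch of a single gradient-bandit update. It is not the `Gradient` class from `chapter02`: the names `softmax`, `gradient_step`, and `include_current` are hypothetical, and starting the running mean at 0 before any reward is seen is an assumption made only for this illustration.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax over the action preferences h."""
    z = np.exp(h - h.max())
    return z / z.sum()


def gradient_step(h, baseline, n, action, reward, alpha, include_current=False):
    """One gradient-bandit update (illustrative sketch, not the notebook's class).

    `baseline` is the running mean of the `n` rewards seen so far.
    Returns the updated (preferences, running mean, reward count).
    """
    n_new = n + 1
    baseline_new = baseline + (reward - baseline) / n_new  # incremental mean
    # Choose which average the update compares the reward against.
    b = baseline_new if include_current else baseline
    pi = softmax(h)
    one_hot = np.zeros_like(h)
    one_hot[action] = 1.0
    h_new = h + alpha * (reward - b) * (one_hot - pi)
    return h_new, baseline_new, n_new


# First step on 3 arms: pull arm 1, observe reward 2.0.
h0 = np.zeros(3)
h_excl, _, _ = gradient_step(h0, 0.0, 0, action=1, reward=2.0, alpha=0.1)
h_incl, _, _ = gradient_step(h0, 0.0, 0, action=1, reward=2.0, alpha=0.1,
                             include_current=True)
print(h_excl, h_incl)
```

On the very first step the two conventions differ visibly: when the current reward is included, the baseline equals the reward and the preferences do not move. After many steps a single reward barely shifts the running mean, which is consistent with the difference being negligible in the plots.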

## N independent bandits

Here, we have $N$ independent bandit machines, and each one is tested against every algorithm. This is the experiment prescribed in the book. Note that the experiment presented here differs slightly from Zhang Shangtong's, which uses a different set of $N$ bandits for each algorithm.

In [4]:
N = 1000
K = 10
T = 1000
σ = 1
average_reward = np.zeros((4, T))
average_hit = np.zeros((4, T))
for n in trange(N):
    reward = np.zeros((4, T))
    hit = np.zeros((4, T))
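    # Four players: α ∈ {0.1, 0.4}, each with the default (empirical-average) baseline or a fixed baseline of -4.
    players = [
        Gradient(K, 0.1, seed=n),
        Gradient(K, 0.4, seed=n),
        Gradient(K, 0.1, baseline=-4, seed=n),
        Gradient(K, 0.4, baseline=-4, seed=n),
    ]
    for i, player in enumerate(players):
        np.random.seed(n)  # seed depends on n, so each bandit has its own arm means, shared by all four players
        bandit = Bandit(np.random.randn(K), σ, seed=n)
        reward[i], hit[i] = play(bandit, player, T)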
    average_reward += reward
    average_hit += hit
average_reward /= N
average_hit /= N

100%|██████████| 1000/1000 [04:40<00:00,  3.57it/s]


In [5]:
fig = make_subplots(rows=2, cols=1, shared_xaxes=True, vertical_spacing=0.05, subplot_titles=['Average Reward', '% Optimal Pull'])
fig.update_layout(height=700, width=700, title=f'{N} independent bandits')
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[0, :], legendgroup=0, name='α=0.1, objective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[1, :], legendgroup=1, name='α=0.4, objective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[2, :], legendgroup=2, name='α=0.1, subjective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_reward[3, :], legendgroup=3, name='α=0.4, subjective'), row=1, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[0, :], legendgroup=0, showlegend=False, marker_color='#636EFA'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[1, :], legendgroup=1, showlegend=False, marker_color='#EF553B'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[2, :], legendgroup=2, showlegend=False, marker_color='#00CC96'), row=2, col=1)
fig.add_trace(go.Scatter(x=np.arange(T), y=average_hit[3, :], legendgroup=3, showlegend=False, marker_color='#AB63FA'), row=2, col=1)
fig.show()

The label "subjective" refers to the use of a fixed, subjective baseline (in this experiment, `baseline=-4`, that is, 4 units below the average reward across all bandits).
The label "objective" refers to the use of the empirical average reward as the baseline. In the book, the algorithm as stated excludes the current reward from the baseline, but the book's experiment *does* include it.
In my implementation, the baseline does *not* include the current reward. The difference is negligible, though.

## Discussion

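One thing not emphasized by the authors is that the parameter $\alpha$ is not only a step size but also reflects the willingness to explore. Indeed, $\alpha$ can be removed from the update rule by multiplying $H$ by $\alpha$ in the sampling probability formula: writing $H = \alpha H'$, the preferences $H'$ are updated with unit step size, while actions are sampled with probability proportional to $e^{\alpha H'(a)}$. In this form $\alpha$ acts as an inverse temperature of the softmax, so the lower $\alpha$ is, the flatter the sampling distribution and the more likely the algorithm is to explore.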
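
The claim that $\alpha$ can be moved from the update rule into the sampling probabilities can be checked numerically. The sketch below is a self-contained illustration, not the `Gradient` player from `chapter02`: the function `run`, its `rescaled` flag, and the arm means are hypothetical choices. Both variants start from zero preferences, use the empirical average (excluding the current reward) as the baseline, and share the same random seed.

```python
import numpy as np


def softmax(h):
    """Numerically stable softmax."""
    z = np.exp(h - h.max())
    return z / z.sum()


def run(alpha, means, steps, seed, rescaled):
    """Return the sampling probabilities used at each step.

    rescaled=False: H += alpha * (...), sample from softmax(H).
    rescaled=True:  H' += 1 * (...),   sample from softmax(alpha * H').
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    h = np.zeros(k)
    baseline = 0.0
    history = []
    for t in range(1, steps + 1):
        pi = softmax(alpha * h) if rescaled else softmax(h)
        action = rng.choice(k, p=pi)
        reward = means[action] + rng.standard_normal()
        one_hot = np.eye(k)[action]
        step = 1.0 if rescaled else alpha
        h = h + step * (reward - baseline) * (one_hot - pi)
        baseline += (reward - baseline) / t  # running mean, updated after use
        history.append(pi)
    return np.array(history)


means = np.array([0.2, -0.5, 1.0, 0.1])  # hypothetical arm means
p_standard = run(0.4, means, steps=200, seed=0, rescaled=False)
p_rescaled = run(0.4, means, steps=200, seed=0, rescaled=True)
print(np.abs(p_standard - p_rescaled).max())

# For fixed preferences, a smaller alpha gives a flatter distribution.
g = np.array([1.0, 2.0, 3.0])
print(softmax(0.1 * g), softmax(0.4 * g))
```

The two runs should produce the same probabilities up to floating-point rounding, which is what makes $\alpha$ interpretable as an inverse temperature rather than only as a step size.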