# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) as the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)). LinUCB assumes a linear dependency between the expected reward of an action and its context.

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.

See the previous discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).
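
To make the linear assumption concrete, here is a minimal NumPy sketch of the LinUCB idea. It is not RLlib's implementation: the class `TinyLinUCB`, the `alpha` exploration weight, and the noiseless synthetic rewards are all assumptions made for illustration. The sketch keeps one shared ridge-regression model over item features and picks the candidate whose optimistic estimate (predicted reward plus a confidence width) is highest.

```python
import numpy as np

class TinyLinUCB:
    """Hypothetical helper: one shared linear model over item features."""

    def __init__(self, feature_dim, alpha=1.0):
        self.alpha = alpha                 # exploration weight (assumed value)
        self.A = np.eye(feature_dim)       # regularized Gram matrix
        self.b = np.zeros(feature_dim)     # reward-weighted feature sum

    def select(self, items):
        """Return the index of the candidate with the highest upper bound."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b             # ridge-regression estimate
        means = items @ theta              # predicted reward per candidate
        # Confidence width sqrt(x^T A^-1 x) for each candidate row x.
        widths = np.sqrt(np.einsum("ij,jk,ik->i", items, A_inv, items))
        return int(np.argmax(means + self.alpha * widths))

    def update(self, x, reward):
        """Fold the observed (features, reward) pair into the model."""
        self.A += np.outer(x, x)
        self.b += reward * x

# Synthetic problem: rewards are exactly linear in the item features.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=4)
agent = TinyLinUCB(feature_dim=4)
total_regret = 0.0

for _ in range(500):
    items = rng.normal(size=(10, 4))       # 10 fresh candidates per step
    choice = agent.select(items)
    rewards = items @ true_theta
    agent.update(items[choice], rewards[choice])
    total_regret += rewards.max() - rewards[choice]

print(f"cumulative regret after 500 steps: {total_regret:.3f}")
```

A larger `alpha` widens the confidence bonus and so favors exploration, while a smaller one leans on the current reward estimate.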

In [2]:
import os
import time
import pandas as pd
import numpy as np

from ray import tune
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```

This environment is deliberately complicated, so it can be confusing at first. Let's look at its behavior. We'll create one using the default settings:

In [5]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)


In [6]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [7]:
take_step()
take_step()


    action = 4, 
    obs:
        'item': [[0.22930181 0.17829404 0.06249102 0.25761133 0.0502669  0.31445318
  0.42194352 0.21400254 0.01203363 0.23430287 0.02236233 0.18997925
  0.37325581 0.04037145 0.51661284 0.14369684]] (25 items), 
        'item_id': [71 82 96 79 98 55 63 91 49 93 83 57 88 44  2 25  6 94 14 28 92 18 68  0
 12],
        'response': [0.8536230186667692], 
    reward = 0.8536230186667692, 
    finished? = True, 
    info = {'regret': 0.0}
    

    action = 20, 
    obs:
        'item': [[0.0577672  0.09927186 0.33542344 0.13267641 0.29488992 0.29222172
  0.33993029 0.29699188 0.16373869 0.28890032 0.3047448  0.30690089
  0.1451707  0.16043715 0.26718488 0.25505312]] (25 items), 
        'item_id': [28 92 61 34  1  3 16 46 96 88 18  7 54 31  0 87 47 56 65 79 43 26 62 72
 10],
        'response': [0.6931497895856744], 
    reward = 0.6931497895856744, 
    finished? = True, 
    info = {'regret': 0.18019264848245853}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by multiplying the various randomly generated matrices of data and then selecting the response (reward) indexed by the action passed to `step`. As constructed, the reward always comes out between about 0.6 and 0.9. The regret is the maximum reward over all possible actions minus the reward for the chosen action.
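
The regret definition can be illustrated with a few lines of NumPy. The reward values below are made up for the example, and `regret_for` is a hypothetical helper, not part of the environment.

```python
import numpy as np

# Hypothetical rewards for five candidate actions at a single step.
candidate_rewards = np.array([0.69, 0.85, 0.78, 0.62, 0.80])

def regret_for(action, rewards):
    """Regret = best achievable reward minus the reward actually obtained."""
    return rewards.max() - rewards[action]

for action in range(len(candidate_rewards)):
    print(f"action {action}: regret = {regret_for(action, candidate_rewards):.2f}")
```

Only the action with the highest reward (index 1 here) has zero regret. Every other choice has a positive regret.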

The `item` entry holds the feature vectors for a subset of all the _items_ in the environment, and `item_id` holds the indices of those items in the larger collection. This list of 25 items is randomly chosen _for each step_, as you can see by comparing the two steps above.
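
The relationship between `item` and `item_id` can be mimicked with a small sketch. It assumes the candidates are drawn without replacement, which is consistent with the unique ids in the output above but is not taken from the environment's source. `sample_candidates` and `all_items` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, num_candidates, feature_dim = 100, 25, 16

# Hypothetical pool of items, each with a random feature vector.
all_items = rng.random((num_items, feature_dim))

def sample_candidates():
    """Draw a fresh set of candidate items, as happens on every step."""
    item_id = rng.choice(num_items, size=num_candidates, replace=False)
    return {"item": all_items[item_id], "item_id": item_id}

obs = sample_candidates()
print(obs["item"].shape)   # one feature row per candidate
print(obs["item_id"][:5])  # indices into the full item pool
```

An action is an index into this candidate list, not into the full pool, so the same action number refers to a different item on each step.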

In the following `num_candidates` steps (25 by default), you may see a regret of 0.0, which occurs whenever the sampled action happens to be the one with the maximum possible reward.

In [8]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward}, regret = {info["regret"]}')

  0: reward = 0.6931497895856744, regret = 0.23028768259291255
  1: reward = 0.7779366192192996, regret = 0.14550085295928739
  2: reward = 0.8220624595353498, regret = 0.0222807569062885
  3: reward = 0.6275801005995154, regret = 0.29585737157907155
  4: reward = 0.8010393917897537, regret = 0.07230304627837925
  5: reward = 0.7478999739330106, regret = 0.1324006363343575
  6: reward = 0.8443432164416383, regret = 0.03595739382572971
  7: reward = 0.6630402744543217, regret = 0.2603971977242653
  8: reward = 0.6161082505283795, regret = 0.28456101687019575
  9: reward = 0.8526157113818931, regret = 0.04805355601668215
 10: reward = 0.711322759627876, regret = 0.162019678440257
 11: reward = 0.7867376019897191, regret = 0.06688541667705006
 12: reward = 0.7645978004360046, regret = 0.1588396717425824
 13: reward = 0.7412949151610494, regret = 0.15937435223752583
 14: reward = 0.6931497895856744, regret = 0.15946592179621866
 15: reward = 0.6858551011185544, regret = 0.18748733694957853

The upshot is that training to find the optimal mean reward will be more challenging than with our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

In [9]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Each training iteration runs timesteps_per_iteration (100 by default) time steps,
# so 20 iterations = 2,000 time steps in total.
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


The next cell will print a lot of output. Use the right-click menu, option _Enable Scrolling for Outputs_ to encapsulate the output in a scrollable text box.

In [12]:
start_time = time.time()

analysis = tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=67405)[0m 2020-06-10 08:51:28,692	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=67405)[0m 2020-06-10 08:51:28,694	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=67405)[0m 2020-06-10 08:51:28,709	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=67406)[0m 2020-06-10 08:51:28,687	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=67406)[0m 2020-06-10 08:51:28,688	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=67406)[0m 2020-06-10 08:51:28,703	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=67404)[0m 2020-06-10 08:51:28,718	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eage

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:67405,4,0.942036,400,0.862955
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:67404,3,0.705564,300,0.86544
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:67406,3,0.709957,300,0.901393
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:67403,3,0.597427,300,0.864558
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:67402,3,0.528558,300,0.869265


Result for contrib_LinUCB_ParametricItemRecoEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_08-51-34
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 0.8773313559202169
  episode_reward_mean: 0.8679658655059045
  episode_reward_min: 0.8043509367643551
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: 1eeb178ff9a544b685d9be3bab76ddd8
  experiment_tag: '0'
  grad_time_ms: 1.919
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.919
    learner:
      cumulative_regret: 6.566111139458839
      update_latency: 0.0003650188446044922
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 521.051
    opt_samples: 1.0
    sample_peak_throughput: 198.728
    sample_time_ms: 5.032
    update_time_ms: 0.002
  iterations_since_restore: 20
  learner:
    cumulative_regret: 6.566111139458839
    update_latency: 0.0003650188446044922
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  o

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,20,5.04874,2000,0.867966
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,20,5.07494,2000,0.872261
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,20,5.08487,2000,0.912281
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,20,5.01032,2000,0.862798
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,20,4.84901,2000,0.872881


The trials took 9.848594903945923 seconds



In [13]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,config/seed,config/shuffle_buffer_size,config/soft_horizon,config/synchronize_filters,config/tf_session_args,config/timesteps_per_iteration,config/train_batch_size,config/use_exec_api,config/use_pytorch,logdir
0,0.877331,0.804351,0.867966,1.0,100,2000,2000,5.032,1.919,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
1,0.89048,0.830026,0.872261,1.0,100,2000,2000,1.897,0.983,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
2,0.927985,0.844586,0.912281,1.0,100,2000,2000,1.425,0.801,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
3,0.907334,0.786058,0.862798,1.0,100,2000,2000,3.37,1.151,0.003,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
4,0.88454,0.80913,0.872881,1.0,100,2000,2000,2.502,2.681,0.003,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...


Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?

In [23]:
# trial_dataframes is a dict with one DataFrame per trial
# (five here, because num_samples=5), so use the built-in len().
len(analysis.trial_dataframes)

In [14]:
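# Stack the per-trial DataFrames into one frame. pd.concat replaces
# DataFrame.append, which was removed in pandas 2.0.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

# Summarize the cumulative regret across trials at each training step count.
df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])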

In [28]:
dd = {'a': 0, 'b': 1, 'c': 2}
# dict views have no size() method; use the built-in len() instead.
len(dd.items())

3

In [15]:
df

num_steps_trained,mean,max,min,std
100,3.493613,3.678152,3.226954,0.184957
200,4.251147,4.681852,3.788691,0.421572
300,4.638031,5.257113,4.090875,0.490399
400,4.924515,5.701073,4.207127,0.62335
500,5.150134,6.113882,4.385138,0.696177
600,5.348266,6.414613,4.48048,0.742806
700,5.494449,6.689018,4.562759,0.828577
800,5.639926,6.955515,4.650081,0.886309
900,5.739757,7.277345,4.674388,0.99424
1000,5.831165,7.483627,4.701875,1.059063
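
The concatenate-and-aggregate step that produced this table can be tried on toy data. The two trial frames below are made up, and `trial_a` and `trial_b` are hypothetical keys. Only the column names match the real trial data.

```python
import pandas as pd

# Two hypothetical trials, each reporting cumulative regret at two checkpoints.
trial_dataframes = {
    "trial_a": pd.DataFrame({
        "num_steps_trained": [100, 200],
        "learner/cumulative_regret": [3.0, 4.0],
    }),
    "trial_b": pd.DataFrame({
        "num_steps_trained": [100, 200],
        "learner/cumulative_regret": [4.0, 6.0],
    }),
}

# Stack all trials, then summarize across trials at each step count.
frame = pd.concat(list(trial_dataframes.values()), ignore_index=True)
summary = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
print(summary)
```

Each row of `summary` describes the spread of cumulative regret across trials at one checkpoint, which is what the band in a mean-plus-or-minus-std plot visualizes.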


It will be easier to understand these results with a graph:

In [16]:
from bokeh.plotting import figure, show, output_file
from bokeh.models import Band, ColumnDataSource, Range1d
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [17]:
df['lower'] = df['mean'] - df['std']
df['upper'] = df['mean'] + df['std']
ymin=df['lower'].min()
ymax=df['upper'].max()

source = ColumnDataSource(df.reset_index())

TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(tools=TOOLS, y_range=Range1d(ymin,ymax))

p.scatter(x='num_steps_trained', y='mean', line_color='black', fill_alpha=0.3, size=5, source=source)
band = Band(base='num_steps_trained', lower='lower', upper='upper', source=source, level='underlay',
            fill_alpha=0.3, line_width=1, line_color='blue')
p.add_layout(band)

p.title.text = "Cumulative Regret"
p.xgrid[0].grid_line_alpha=0.5
p.ygrid[0].grid_line_alpha=0.5
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Regret'

show(p)

So the _cumulative_ regret increases over the entire run for all five trials. At larger step numbers, however, the amount of regret added decreases as the agent learns, so the graph begins to level off as the system gets better at optimizing the mean reward.
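
One way to check the leveling-off numerically is to look at the regret added in each 100-step interval. The sketch below uses the ten mean cumulative regret values from the table shown earlier and assumes the cumulative regret starts at zero before the first step.

```python
import pandas as pd

# Mean cumulative regret copied from the table above (first 1,000 steps).
mean_regret = pd.Series(
    [3.493613, 4.251147, 4.638031, 4.924515, 5.150134,
     5.348266, 5.494449, 5.639926, 5.739757, 5.831165],
    index=pd.RangeIndex(100, 1100, 100, name="num_steps_trained"),
)

# Regret added per interval; the first interval starts from zero regret.
increments = mean_regret.diff().fillna(mean_regret.iloc[0])
print(increments.round(3))
```

The increments shrink from interval to interval, which is the flattening visible in the graph.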