# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) as the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)). LinUCB assumes a linear dependency between the expected reward of an action and its context.
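
To make the linear assumption concrete, here is a minimal NumPy sketch of the textbook LinUCB scoring rule with a single shared parameter vector. It is an illustration only, not RLlib's implementation: the class name `LinUCBSketch`, the parameters `alpha` and `lam`, and the simulated contexts and rewards are all assumptions made for this example.

```python
import numpy as np


class LinUCBSketch:
    """Textbook LinUCB with one shared parameter vector (illustrative only)."""

    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.alpha = alpha            # exploration weight (assumed value)
        self.A = lam * np.eye(dim)    # regularized Gram matrix of chosen contexts
        self.b = np.zeros(dim)        # reward-weighted sum of chosen contexts

    def scores(self, contexts):
        """Estimated reward plus a confidence bonus for each row of `contexts`."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b        # ridge-regression estimate of the weights
        mean = contexts @ theta
        # x^T A^-1 x for every context row x
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        return mean + self.alpha * bonus

    def select(self, contexts):
        return int(np.argmax(self.scores(contexts)))

    def update(self, context, reward):
        self.A += np.outer(context, context)
        self.b += reward * context


# Simulated problem (made up for this sketch): rewards are linear in the context.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=4)
bandit = LinUCBSketch(dim=4)
regrets = []
for _ in range(500):
    contexts = rng.normal(size=(5, 4))      # 5 candidate actions per round
    expected = contexts @ true_theta
    arm = bandit.select(contexts)
    reward = expected[arm] + 0.1 * rng.normal()
    bandit.update(contexts[arm], reward)
    regrets.append(expected.max() - expected[arm])
```

Comparing the average of `regrets` over the first and last 100 rounds should show the per-round regret shrinking as the weight estimate improves.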

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.

See the previous discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).

In [8]:
import os
import time
import pandas as pd
import numpy as np

import ray
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```

This environment is deliberately complicated and hence confusing to understand at first. So, let's look at its behavior. We'll create one using the default settings:

In [2]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)




In [3]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [4]:
take_step()
take_step()


    action = 19, 
    obs:
        'item': [[0.0836686  0.03413296 0.3548183  0.3186919  0.05647112 0.1685622
  0.06037586 0.21491483 0.36191626 0.18987564 0.35289666 0.39885143
  0.38335529 0.15497378 0.12475218 0.21387429]] (25 items), 
        'item_id': [96 45 47 64 51 29 10 24 67 36 98 49 73 39 90 14 17  3 69 58 52 75 60  6
 56],
        'response': [0.7742432650765447], 
    reward = 0.7742432650765447, 
    finished? = True, 
    info = {'regret': 0.05152240241759887}
    

    action = 3, 
    obs:
        'item': [[0.33545319 0.28567338 0.24924367 0.31581556 0.20958879 0.21383441
  0.11115569 0.25297519 0.06565145 0.35436604 0.01753631 0.37512328
  0.18226128 0.21928755 0.18862395 0.30033273]] (25 items), 
        'item_id': [91 79 50 40 15 20 93 25 35 39 47 32 64 17 37 77 58 97 92 76 44 75 16 57
 23],
        'response': [0.7915956008945028], 
    reward = 0.7915956008945028, 
    finished? = True, 
    info = {'regret': 0.04058009412694119}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by matrix multiplication of the randomly generated matrices of data, followed by selecting a response (the reward) indexed by the action passed to `step`. As constructed, the reward typically comes out between roughly 0.6 and 0.9. The regret is the maximum reward over all possible actions minus the reward for the selected action.

The `item` shown is a subset of all the _items_ in the environment, and `item_id` holds the indices of those items in the larger collection. This list of 25 items is randomly chosen _for each step_, as you can see from these two steps.
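
The bookkeeping described above can be mimicked with a few lines of NumPy. This is a simplified sketch, not the code in `parametric.py`: the scoring function (a plain dot product with a hypothetical `user` vector) and the helper name `sketch_step` are assumptions chosen for brevity, and the candidate list is drawn and scored in the same call.

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, feature_dim, num_candidates = 100, 16, 25

items = rng.random((num_items, feature_dim))  # stand-in for the item features
user = rng.random(feature_dim)                # hypothetical user vector


def sketch_step(action):
    """Draw a fresh set of candidates, then score the chosen one."""
    # Indices of the sampled candidates within the full item collection.
    item_id = rng.choice(num_items, size=num_candidates, replace=False)
    candidates = items[item_id]               # shape: (num_candidates, feature_dim)
    responses = candidates @ user             # one score per candidate
    reward = responses[action]
    regret = responses.max() - reward         # zero only for the best candidate
    return item_id, reward, regret


item_id, reward, regret = sketch_step(action=3)
print(f"reward = {reward:7.5f}, regret = {regret:7.5f}")
```

Because `item_id` is resampled on every call, the same `action` index refers to a different item from one step to the next, which is what makes the problem contextual.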

In the following `num_candidates` steps (25 by default), you may see a regret of 0.0, which occurs when the selected action happens to yield the maximum possible reward. This won't happen in every run. Which step has the lowest regret?

In [6]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward:7.5f}, regret = {info["regret"]:7.5f}')

  0: reward = 0.62367, regret = 0.21503
  1: reward = 0.77301, regret = 0.00412
  2: reward = 0.83870, regret = 0.00000
  3: reward = 0.69889, regret = 0.08194
  4: reward = 0.74752, regret = 0.10743
  5: reward = 0.84322, regret = 0.00000
  6: reward = 0.59587, regret = 0.23830
  7: reward = 0.59587, regret = 0.23630
  8: reward = 0.70778, regret = 0.14716
  9: reward = 0.76287, regret = 0.08035
 10: reward = 0.55542, regret = 0.27676
 11: reward = 0.59587, regret = 0.24282
 12: reward = 0.61115, regret = 0.21461
 13: reward = 0.66076, regret = 0.17342
 14: reward = 0.79247, regret = 0.03329
 15: reward = 0.77712, regret = 0.05706
 16: reward = 0.56246, regret = 0.27624
 17: reward = 0.76270, regret = 0.07148
 18: reward = 0.74752, regret = 0.09118
 19: reward = 0.70118, regret = 0.13752
 20: reward = 0.76287, regret = 0.07131
 21: reward = 0.56246, regret = 0.27624
 22: reward = 0.66846, regret = 0.17024
 23: reward = 0.66395, regret = 0.17475
 24: reward = 0.74212, regret = 0.09206


The upshot is that training to find the optimal mean reward will be more challenging than with our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

Note that we imported `UCB_CONFIG` above, which defines the properties expected by _LinUCB_. We'll add another property to it for the environment. (Subsequent lessons will show other ways to work with the configuration.)

In [7]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be 20 * timesteps_per_iteration (100 by default) = 2,000
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


Now let's use [Ray Tune](http://tune.io) to train. First we'll ensure that Ray is properly initialized:

In [9]:
!../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [10]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15832',
 'object_store_address': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764'}

The next cell will print a lot of output. Use the right-click menu, option _Enable Scrolling for Outputs_ to encapsulate the output in a scrollable text box.

In [11]:
start_time = time.time()

analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    verbose=2,  # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,    # Don't allow Tune to initialize Ray.
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,454	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,456	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,483	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,448	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,449	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,465	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76834)[0m 2020-06-13 10:25:09,455	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eage

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:76835,1.0,0.494151,100.0,0.846617
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,,,,,


Result for contrib_LinUCB_ParametricItemRecoEnv_00002:
  custom_metrics: {}
  date: 2020-06-13_10-25-09
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9280944831908227
  episode_reward_mean: 0.8694560746347274
  episode_reward_min: 0.6353103557043434
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 56209c31bedd4affb6dff6ae552da3c4
  experiment_tag: '2'
  grad_time_ms: 0.552
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.552
    learner:
      cumulative_regret: 3.551424354495237
      update_latency: 0.00023221969604492188
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1810.856
    opt_samples: 1.0
    sample_peak_throughput: 706.6
    sample_time_ms: 1.415
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 3.551424354495237
    update_latency: 0.00023221969604492188
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_p

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:76835,16,5.17504,1600,0.882625
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:76836,16,4.78658,1600,0.897233
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:76834,16,4.79267,1600,0.90677
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:76837,17,5.19255,1700,0.918764
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:76838,16,5.06302,1600,0.904948


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-13_10-25-15
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9470153333930189
  episode_reward_mean: 0.8946172684156892
  episode_reward_min: 0.8196083749122994
  episodes_this_iter: 100
  episodes_total: 1700
  experiment_id: 1bf5e4e7869047f88dd6e07b98e3cb24
  experiment_tag: '1'
  grad_time_ms: 0.912
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.912
    learner:
      cumulative_regret: 5.8236768922896625
      update_latency: 0.00036406517028808594
    num_steps_sampled: 1700
    num_steps_trained: 1700
    opt_peak_throughput: 1096.808
    opt_samples: 1.0
    sample_peak_throughput: 498.077
    sample_time_ms: 2.008
    update_time_ms: 0.003
  iterations_since_restore: 17
  learner:
    cumulative_regret: 5.8236768922896625
    update_latency: 0.00036406517028808594
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1700
  num_steps_trained: 17

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,20,6.31882,2000,0.887076
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,20,6.16512,2000,0.892522
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,20,6.17001,2000,0.905924
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,20,6.09252,2000,0.917207
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,20,6.17614,2000,0.897268


The trials took 14.795217990875244 seconds



In [12]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,config/seed,config/shuffle_buffer_size,config/soft_horizon,config/synchronize_filters,config/tf_session_args,config/timesteps_per_iteration,config/train_batch_size,config/use_exec_api,config/use_pytorch,logdir
0,0.913371,0.829464,0.887076,1.0,100,2000,2000,1.678,0.8,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
1,0.947015,0.816333,0.892522,1.0,100,2000,2000,1.454,0.726,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
2,0.928094,0.848285,0.905924,1.0,100,2000,2000,1.441,0.75,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
3,0.951993,0.841317,0.917207,1.0,100,2000,2000,1.251,0.646,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
4,0.935275,0.837494,0.897268,1.0,100,2000,2000,1.743,0.95,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...


Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?

One of the columns in the trial DataFrames is `learner/cumulative_regret`. Let's combine the trial DataFrames into a single DataFrame, then group by `num_steps_trained` and project out `learner/cumulative_regret`. Finally, aggregate for each `num_steps_trained` value to compute the `mean`, `max`, `min`, and `std` (standard deviation) of the cumulative regret.

In [13]:
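# Stack the per-trial DataFrames into one. pd.concat is used because
# DataFrame.append was removed in pandas 2.0.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

# For each step count, summarize the cumulative regret across the trials.
df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])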

In [14]:
df

num_steps_trained,mean,max,min,std
100,3.510061,3.938958,3.107594,0.301512
200,4.051401,4.31726,3.648172,0.249409
300,4.408159,4.781662,4.232443,0.222284
400,4.662473,5.035561,4.466877,0.219358
500,4.902268,5.236158,4.716805,0.201651
600,5.110317,5.535199,4.862963,0.25914
700,5.268552,5.753674,4.976128,0.299698
800,5.39789,5.978279,5.077381,0.343824
900,5.488008,6.080962,5.126822,0.355669
1000,5.590762,6.180468,5.222625,0.354345
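
To see what the group-and-aggregate step does on a small scale, here is a toy example with two hypothetical trials. The regret values are made up for illustration and are not taken from the training run above; only the column names match the real trial DataFrames.

```python
import pandas as pd

# Two hypothetical trials, each reporting at 100 and 200 steps.
trial_a = pd.DataFrame({"num_steps_trained": [100, 200],
                        "learner/cumulative_regret": [3.0, 4.0]})
trial_b = pd.DataFrame({"num_steps_trained": [100, 200],
                        "learner/cumulative_regret": [5.0, 5.0]})

# Stack the rows, then summarize across trials for each step count.
combined = pd.concat([trial_a, trial_b], ignore_index=True)
summary = combined.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
print(summary)
```

Each row of `summary` describes the spread across trials at one step count, which is the same shape as the table shown above.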


It will be easier to understand these results with a graph:

In [16]:
from bokeh_util import plot_cumulative_regret
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [17]:
plot_cumulative_regret(df)

[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,014	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,015	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,024	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,013	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,014	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,024	INFO trainable.py:217 -- Getting current IP.


![LinUCB Cumulative Regret](../../images/rllib/LinUCB-Cumulative-Regret.png)

So the _cumulative_ regret increases over the entire training run for all five trials. At larger step numbers, however, the amount of regret added decreases as we learn, so the graph begins to level off as the system gets better at optimizing the mean reward.
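
One way to check the leveling-off numerically is to look at how much regret each block of 100 steps adds. The sketch below copies the `mean` column from the table printed earlier into a Series (the variable names `mean_regret` and `added` are ours) and takes successive differences.

```python
import pandas as pd

# Mean cumulative regret per 100 training steps, copied from the table above.
mean_regret = pd.Series(
    [3.510061, 4.051401, 4.408159, 4.662473, 4.902268,
     5.110317, 5.268552, 5.397890, 5.488008, 5.590762],
    index=pd.Index(range(100, 1001, 100), name="num_steps_trained"),
)

# Regret added during each 100-step interval.
added = mean_regret.diff().dropna()
print(added.round(3))
```

The interval ending at 200 steps adds about 0.54, while the interval ending at 1,000 steps adds about 0.10, so the increments shrink to roughly a fifth of their early size even though they are not strictly decreasing.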

The environment we're using randomly generates data on every step, so there will always be some regret even if we train for a longer period of time.

## Exercise 1

Change `training_iterations` from 20 to 40. Does the characteristic behavior of the cumulative regret change at higher step counts?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.