# Ray RLlib Multi-Armed Bandits - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) for the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)), which assumes a linear dependency between the expected reward of an action and its context.
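
To make the linear assumption concrete, here is a minimal NumPy sketch of the standard LinUCB idea with a single shared weight vector: estimate the weights with ridge regression, then score each candidate context by its predicted reward plus an uncertainty bonus. This is an illustration only, not RLlib's implementation. The class name `LinUCBSketch`, the `alpha` exploration parameter, and the synthetic `true_theta` are all assumptions made for the example.

```python
import numpy as np


class LinUCBSketch:
    """Illustrative LinUCB with one shared linear model (not RLlib's class)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha      # exploration strength (assumed value)
        self.A = np.eye(dim)    # identity plus the sum of x x^T over chosen contexts
        self.b = np.zeros(dim)  # sum of reward * x over chosen contexts

    def scores(self, contexts):
        """Return the upper confidence bound for each row of `contexts`."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b  # ridge-regression estimate of the weights
        mean = contexts @ theta
        # Per-row uncertainty: sqrt(x^T A^-1 x).
        width = np.sqrt(np.einsum("ij,jk,ik->i", contexts, A_inv, contexts))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Fold one observed (context, reward) pair into the model."""
        self.A += np.outer(x, x)
        self.b += reward * x


rng = np.random.default_rng(0)
true_theta = np.array([0.1, 0.9, 0.3])  # hypothetical "true" reward weights
model = LinUCBSketch(dim=3)

for _ in range(2000):
    contexts = rng.random((5, 3))  # five candidate actions, three features each
    action = int(np.argmax(model.scores(contexts)))
    reward = contexts[action] @ true_theta + rng.normal(scale=0.01)
    model.update(contexts[action], reward)

theta_hat = np.linalg.inv(model.A) @ model.b
print("estimated weights:", np.round(theta_hat, 2))
```

As more contexts are observed, the uncertainty bonus shrinks and the choice is driven mostly by the estimated weights, which is the exploration-to-exploitation shift that UCB methods rely on.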

Now we'll use _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. This increases the complexity of the context and the challenge of finding the optimal action to achieve the highest mean reward over time.

See the previous discussion of UCB in [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb) and the [previous lesson](03-Simple-Multi-Armed-Bandit.ipynb).

In [1]:
import os
import time
import pandas as pd
import numpy as np

from ray import tune
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

Use `ParametricItemRecoEnv` ([parametric.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" (the "parameters") with randomly generated features, some visible and some optionally hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```
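
If you want to experiment with different sizes, one approach is to copy the defaults and override individual keys. The sketch below works on a plain dictionary that mirrors the values shown above, so it runs without Ray. The override values are arbitrary, and passing the result to the environment constructor is an assumption that you should check against `parametric.py`.

```python
import copy

# Plain-dict copy of the defaults listed in this lesson.
DEFAULT_RECO_CONFIG = {
    "num_users": 1,
    "num_items": 100,
    "feature_dim": 16,
    "slate_size": 1,
    "num_candidates": 25,
    "seed": 1,
}

# Copy first so that the shared defaults are left untouched.
custom_config = copy.copy(DEFAULT_RECO_CONFIG)
custom_config.update({"num_candidates": 10, "seed": 42})  # arbitrary example values

print(custom_config)

# Assumption: the constructor accepts a config dictionary, e.g.
# env = ParametricItemRecoEnv(custom_config)
```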

This environment is deliberately complicated, so it can be confusing at first. Let's look at its behavior by creating one with the default settings:

In [2]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space} (number of actions that can be selected)')

action space: Discrete(25) (number of actions that can be selected)




In [3]:
def take_step():
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
    print(f"""
    action = {action}, 
    obs:
        'item': {obs_item_foo}, 
        'item_id': {obs['item_id']},
        'response': {obs['response']}, 
    reward = {reward}, 
    finished? = {finished}, 
    info = {info}
    """)

In [4]:
take_step()
take_step()


    action = 23, 
    obs:
        'item': [[0.00123255 0.06023896 0.12626766 0.14187066 0.33332807 0.2679012
  0.06254647 0.43646254 0.25802858 0.13698179 0.08840018 0.21472145
  0.32366547 0.25727624 0.41840832 0.31261815]] (25 items), 
        'item_id': [77 75  1 41 80 36 47 90 85 27 86 97 78 87 26  8 16 83 66  4 63 54 34 98
 51],
        'response': [0.6064902582015401], 
    reward = 0.6064902582015401, 
    finished? = True, 
    info = {'regret': 0.26688273744069746}
    

    action = 8, 
    obs:
        'item': [[0.35339753 0.10633859 0.11532623 0.10206193 0.34478443 0.3140944
  0.17849625 0.04557977 0.09574596 0.33896961 0.15481522 0.3529807
  0.40987704 0.20351658 0.23457959 0.22702185]] (25 items), 
        'item_id': [92 31 28 95 96 17 56 46  9 27 40 32 64 24  4 43 89 44 39 34 59 68  7 82
 23],
        'response': [0.5816493716833073], 
    reward = 0.5816493716833073, 
    finished? = True, 
    info = {'regret': 0.293763122180254}
    


> **Note:** If you see a warning about _Box bound precision lowered by casting to float32_, you can safely ignore it.

The reward at each step is computed by multiplying the various randomly generated matrices of data, then selecting the response (reward) indexed by the action passed to `step`. As constructed, the reward in the output shown in this lesson falls roughly between 0.55 and 0.95. The regret is the maximum reward over all possible actions minus the reward for the chosen action.

The `item` array holds the feature vectors for a subset of all the _items_ in the environment, and `item_id` gives the corresponding indices of those items in the larger collection. This list of 25 items is randomly chosen _for each step_, as you can see by comparing the two steps above.
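
The sampling, reward, and regret logic described above can be sketched in a few lines of NumPy. This is a simplified stand-in, not the environment's actual code: `item_pool`, `user_preferences`, `sample_candidates`, and `respond` are hypothetical names, and the reward values will not match the environment's scale.

```python
import numpy as np

rng = np.random.default_rng(1)
num_items, feature_dim, num_candidates = 100, 16, 25

# Hypothetical stand-ins for the environment's randomly generated data.
item_pool = rng.random((num_items, feature_dim))
user_preferences = rng.random(feature_dim)


def sample_candidates():
    """Pick a fresh random subset of items, as happens for each step."""
    item_id = rng.choice(num_items, size=num_candidates, replace=False)
    return item_id, item_pool[item_id]


def respond(items, action):
    """Score every candidate, then return the reward and regret for `action`."""
    scores = items @ user_preferences
    reward = scores[action]
    regret = scores.max() - reward
    return reward, regret


item_id, items = sample_candidates()
reward, regret = respond(items, action=3)

# Choosing the highest-scoring candidate gives a regret of exactly zero.
best_action = int(np.argmax(items @ user_preferences))
_, best_regret = respond(items, best_action)
print(f"regret for action 3: {regret:.3f}, regret for best action: {best_regret:.3f}")
```

Because the candidates are resampled for every step, the best action index changes from step to step, which is why a randomly sampled action only occasionally produces zero regret.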

In the next `num_candidates` steps (25 by default), you may see a regret of 0.0, which occurs when the sampled action happens to be the one with the maximum possible reward. This won't happen in every run. Which step has the lowest regret?

In [5]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward}, regret = {info["regret"]}')

  0: reward = 0.7186498060135409, regret = 0.17915926898749757
  1: reward = 0.7483002163164545, regret = 0.15056446307741933
  2: reward = 0.8059510864387087, regret = 0.04499838755036256
  3: reward = 0.7186498060135409, regret = 0.11791640993122465
  4: reward = 0.755352087640741, regret = 0.1435125917531328
  5: reward = 0.6748210620044942, regret = 0.17612841198457707
  6: reward = 0.7585861123785249, regret = 0.11682638148503643
  7: reward = 0.5721543161822529, regret = 0.32671036321162095
  8: reward = 0.8120346928929392, regret = 0.08682998650093465
  9: reward = 0.71962790916608, regret = 0.17923677022779383
 10: reward = 0.8571418130920031, regret = 0.037885442715129725
 11: reward = 0.7774867171688689, regret = 0.07346275682020242
 12: reward = 0.8059510864387087, regret = 0.08907616936842411
 13: reward = 0.5776779011614113, regret = 0.3201311738396272
 14: reward = 0.680615742978746, regret = 0.21824893641512777
 15: reward = 0.8733729956422376, regret = 0.002039498221323

The upshot is that training to find the optimal mean reward will be more challenging than with our previous simple bandit.

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

Note that we imported `UCB_CONFIG` above, which defines the properties that _LinUCB_ expects. We'll add another property to it for the environment. (Subsequent lessons will show other ways to work with the configuration.)

In [6]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be 20 * timesteps_per_iteration (100 by default) = 2,000
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


The next cell will print a lot of output. Use the right-click menu, option _Enable Scrolling for Outputs_ to encapsulate the output in a scrollable text box.

In [7]:
start_time = time.time()

analysis = tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False,
    verbose=2,  # Change to 0 or 1 to reduce the output.
)

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-10 14:20:35,426 INFO resource_spec.py:212 -- Starting Ray with 3.91 GiB memory available for workers and up to 1.97 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-10 14:20:35,777 INFO services.py:1170 -- View the Ray dashboard at localhost:8265


Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=79940)[0m 2020-06-10 14:20:45,092	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=79939)[0m 2020-06-10 14:20:45,091	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=79939)[0m 2020-06-10 14:20:45,093	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=79943)[0m 2020-06-10 14:20:45,089	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=79943)[0m 2020-06-10 14:20:45,091	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=79945)[0m 2020-06-10 14:20:45,093	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=79945)[0m

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:79943,1.0,0.28953,100.0,0.843312


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_14-20-45
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9466290368309259
  episode_reward_mean: 0.8814512459870827
  episode_reward_min: 0.746679146555184
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 53a2481a3fb3420992e1d0232f610cb5
  experiment_tag: '1'
  grad_time_ms: 0.54
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.54
    learner:
      cumulative_regret: 3.311302597429333
      update_latency: 0.0005810260772705078
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1850.156
    opt_samples: 1.0
    sample_peak_throughput: 714.423
    sample_time_ms: 1.4
    update_time_ms: 0.005
  iterations_since_restore: 1
  learner:
    cumulative_regret: 3.311302597429333
    update_latency: 0.0005810260772705078
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,20,5.03052,2000,0.858867
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,20,4.86778,2000,0.911818
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:79944,19,4.75186,1900,0.875099
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,20,4.80601,2000,0.827548
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,20,4.89883,2000,0.874058


Result for contrib_LinUCB_ParametricItemRecoEnv_00002:
  custom_metrics: {}
  date: 2020-06-10_14-20-50
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 0.8998296948606409
  episode_reward_mean: 0.8712075069247619
  episode_reward_min: 0.8047072961812431
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: 93feb05ebf6148a2833d5d617360cf2a
  experiment_tag: '2'
  grad_time_ms: 1.004
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.004
    learner:
      cumulative_regret: 5.309169914533084
      update_latency: 0.0008361339569091797
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 996.272
    opt_samples: 1.0
    sample_peak_throughput: 544.312
    sample_time_ms: 1.837
    update_time_ms: 0.003
  iterations_since_restore: 20
  learner:
    cumulative_regret: 5.309169914533084
    update_latency: 0.0008361339569091797
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  o

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,20,5.03052,2000,0.858867
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,20,4.86778,2000,0.911818
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,20,5.07479,2000,0.871208
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,20,4.80601,2000,0.827548
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,20,4.89883,2000,0.874058


The trials took 15.244249105453491 seconds



In [8]:
df = analysis.dataframe()
df

| | episode_reward_max | episode_reward_min | episode_reward_mean | episode_len_mean | episodes_this_iter | num_steps_trained | num_steps_sampled | sample_time_ms | grad_time_ms | update_time_ms | ... | config/seed | config/shuffle_buffer_size | config/soft_horizon | config/synchronize_filters | config/tf_session_args | config/timesteps_per_iteration | config/train_batch_size | config/use_exec_api | config/use_pytorch | logdir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.869082 | 0.81686 | 0.858867 | 1.0 | 100 | 2000 | 2000 | 1.668 | 0.782 | 0.003 | ... | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | False | True | /Users/deanwampler/ray_results/contrib/LinUCB/... |
| 1 | 0.946629 | 0.86514 | 0.911818 | 1.0 | 100 | 2000 | 2000 | 1.606 | 0.72 | 0.002 | ... | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | False | True | /Users/deanwampler/ray_results/contrib/LinUCB/... |
| 2 | 0.89983 | 0.804707 | 0.871208 | 1.0 | 100 | 2000 | 2000 | 1.837 | 1.004 | 0.003 | ... | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | False | True | /Users/deanwampler/ray_results/contrib/LinUCB/... |
| 3 | 0.841995 | 0.774359 | 0.827548 | 1.0 | 100 | 2000 | 2000 | 1.407 | 0.912 | 0.002 | ... | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | False | True | /Users/deanwampler/ray_results/contrib/LinUCB/... |
| 4 | 0.891802 | 0.817305 | 0.874058 | 1.0 | 100 | 2000 | 2000 | 1.428 | 0.741 | 0.002 | ... | | 0 | False | True | {'allow_soft_placement': True, 'device_count':... | 100 | 1 | False | True | /Users/deanwampler/ray_results/contrib/LinUCB/... |


Note the `episode_reward_mean` values. Now let's analyze the _cumulative regrets_ of the trials. It's inevitable that we sometimes pick a suboptimal action, but was this done less often as time progressed?

One of the columns in the trial DataFrames is `learner/cumulative_regret`. Let's combine the trial DataFrames into a single DataFrame, then group by `num_steps_trained` and project out `learner/cumulative_regret`. Finally, aggregate for each `num_steps_trained` value to compute the `mean`, `max`, `min`, and `std` (standard deviation) of the cumulative regret.
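
Before applying this to the real trial data, here is the same combine-then-aggregate pattern on two tiny hand-made DataFrames. The names `trial_a` and `trial_b` and the numbers in them are made up for illustration. Only the column names match the ones used in this lesson.

```python
import pandas as pd

# Two hypothetical trials, each reporting cumulative regret at 100 and 200 steps.
trial_a = pd.DataFrame({
    "num_steps_trained": [100, 200],
    "learner/cumulative_regret": [3.0, 4.0],
})
trial_b = pd.DataFrame({
    "num_steps_trained": [100, 200],
    "learner/cumulative_regret": [3.5, 4.2],
})

# Stack the trials, then summarize the regret across trials at each step count.
combined = pd.concat([trial_a, trial_b], ignore_index=True)
summary = combined.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])

print(summary)
```

Each row of `summary` describes the spread across trials at one step count, so a growing `std` means the trials are diverging from each other.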

In [9]:
# Combine the per-trial DataFrames into a single DataFrame.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])

In [10]:
df

| num_steps_trained | mean | max | min | std |
|---|---|---|---|---|
| 100 | 3.271095 | 3.499168 | 2.863206 | 0.252953 |
| 200 | 3.953709 | 4.115531 | 3.774561 | 0.152823 |
| 300 | 4.434133 | 4.613814 | 4.143466 | 0.178298 |
| 400 | 4.793887 | 5.009001 | 4.233726 | 0.322037 |
| 500 | 5.058663 | 5.537511 | 4.353675 | 0.435214 |
| 600 | 5.262876 | 5.911412 | 4.467905 | 0.521761 |
| 700 | 5.378028 | 6.081548 | 4.564454 | 0.549898 |
| 800 | 5.538494 | 6.266633 | 4.65342 | 0.58501 |
| 900 | 5.66974 | 6.456948 | 4.690469 | 0.642897 |
| 1000 | 5.775418 | 6.575187 | 4.766592 | 0.656184 |


It will be easier to understand these results with a graph:

In [14]:
from bokeh_util import plot_cumulative_regret, plot_model_weights
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [15]:
plot_cumulative_regret(df)

(Can't see the graph? Here's an [image](../../images/rllib/LinUCB-cumulative-regret.png).)

So the _cumulative_ regret keeps increasing over all the training steps for all five trials, but at larger step counts the amount of regret added decreases as we learn. The graph begins to level off as the system gets better at optimizing the mean reward.
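
One way to check the leveling off numerically is to look at how much regret each block of 100 steps adds. The sketch below uses the `mean` column printed earlier in this lesson for that particular run, so your own numbers will differ.

```python
import numpy as np

steps = np.arange(100, 1100, 100)

# Mean cumulative regret per step count, copied from the output shown earlier.
mean_cumulative_regret = np.array([
    3.271095, 3.953709, 4.434133, 4.793887, 5.058663,
    5.262876, 5.378028, 5.538494, 5.66974, 5.775418,
])

# Regret added between consecutive reporting points.
added_regret = np.diff(mean_cumulative_regret)

for start, end, delta in zip(steps[:-1], steps[1:], added_regret):
    print(f"steps {start:4d}-{end:4d}: regret added = {delta:.3f}")
```

The increments are always positive, because cumulative regret can't decrease, but they shrink overall. The trend is not perfectly smooth from one block to the next, since the environment is random.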

The environment we're using randomly generates data on every step, so there will always be some regret even if we train for a longer period of time.

## Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change at higher steps?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.