# Ray RLlib - Linear Upper Confidence Bound

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](02-Simple-Multi-Armed-Bandit.ipynb), we used _LinUCB_ (Linear Upper Confidence Bound) for the exploration-exploitation strategy ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-upper-confidence-bound-contrib-linucb)), which assumes a linear dependency between the expected reward of an action and its context. We pointed out that a linear function has a form such as $z = ax + by + c$, where $x$, $y$, and $z$ are variables and $a$, $b$, and $c$ are constants. LinUCB models the representation space using a set of linear predictors.
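To make the idea of a linear predictor plus a confidence bound concrete, here is a minimal sketch of the disjoint LinUCB rule described by Li et al. (2010), written in plain Python. It is an illustration under stated assumptions, not RLlib's implementation: the class `LinUCBArm`, the helper `choose`, and the exploration weight `alpha` are hypothetical names chosen for this example.

```python
import math


class LinUCBArm:
    """One arm of a disjoint LinUCB model (illustrative sketch, not RLlib code)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # hypothetical exploration weight
        # Track A^{-1} directly; A starts as the identity matrix, b as zeros.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim

    def _A_inv_times(self, x):
        return [sum(a * xj for a, xj in zip(row, x)) for row in self.A_inv]

    def ucb(self, x):
        """Linear reward estimate plus an exploration bonus for context x."""
        A_inv_x = self._A_inv_times(x)
        # theta = A^{-1} b, so theta . x == b . (A^{-1} x) since A^{-1} is symmetric.
        estimate = sum(bi * vi for bi, vi in zip(self.b, A_inv_x))
        width = math.sqrt(sum(xi * vi for xi, vi in zip(x, A_inv_x)))
        return estimate + self.alpha * width

    def update(self, x, reward):
        """Fold in one observation: A <- A + x x^T and b <- b + reward * x."""
        A_inv_x = self._A_inv_times(x)
        denom = 1.0 + sum(xi * vi for xi, vi in zip(x, A_inv_x))
        n = len(x)
        # Sherman-Morrison rank-one update keeps A^{-1} current without inverting.
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def choose(arms, x):
    """Return the index of the arm with the highest upper confidence bound."""
    scores = [arm.ucb(x) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


arms = [LinUCBArm(dim=2) for _ in range(3)]
context = [1.0, 0.5]
picked = choose(arms, context)
arms[picked].update(context, reward=0.8)
```

An arm that has been tried often has a small bonus term, so it is picked only if its estimated reward is high. A rarely tried arm keeps a large bonus, which is what drives exploration.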

Now we'll explore _LinUCB_ in a recommendation environment with _parametric actions_, which are discrete actions that have continuous parameters. At each step, the agent must select which action to use and which parameters to use with that action. 

From [Sutton 2018](../06-RL-References.ipynb#Books), LinUCB 

References for LinUCB and parameterized actions:

* Lihong Li, Wei Chu, John Langford, Robert E. Schapire, "A contextual-bandit approach to personalized news article recommendation", Proceedings of the 19th International Conference on World Wide Web (WWW 2010), [pdf](https://arxiv.org/abs/1003.0146).
* Wei Chu, Lihong Li, Lev Reyzin, Robert E. Schapire, "Contextual bandits with linear payoff functions", Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS 2011), [pdf](https://arxiv.org/abs/1003.0146).
* T.L. Lai, Herbert Robbins, “Asymptotically efficient adaptive allocation rules”, Advances in Applied Mathematics, Volume 6, Issue 1 (1985), pp 4-22, [link](https://doi.org/10.1016/0196-8858(85)90002-8).
* M N Katehakis and H Robbins, “Sequential choice from several populations”, Proc Natl Acad Sci U S A. 1995 Sep 12; 92(19): 8584–8585, [link](https://doi.org/10.1073/pnas.92.19.8584).
* Warrick Masson, Pravesh Ranchod, George Konidaris, "Reinforcement Learning with Parameterized Actions" [pdf](https://arxiv.org/abs/1509.01644).

In [9]:
import os
import time
import pandas as pd
import numpy as np

from ray import tune
from ray.rllib.contrib.bandits.agents import LinUCBTrainer
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

Use `ParametricItemRecoEnv` ([parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py)) as the environment. It is a recommendation environment ("RecoEnv") that generates "items" with randomly generated features, some visible and optionally some hidden. The default sizes are governed by `DEFAULT_RECO_CONFIG`, also in [parametric.py](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/parametric.py):

```python
DEFAULT_RECO_CONFIG = {
    "num_users": 1,        # More than one user at a time?
    "num_items": 100,      # Number of items to randomly sample.
    "feature_dim": 16,     # Number of features per item, with randomly generated values
    "slate_size": 1,       # More than one step at a time?
    "num_candidates": 25,  # Determines the action space and the number of items randomly sampled from the num_items items.
    "seed": 1              # For randomization
}
```
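One way to picture how `num_items` and `num_candidates` relate is the sketch below. It is a hedged illustration in plain Python, not the environment's own code. The function `sample_candidates` is a hypothetical helper, and treating an action as an index into the sampled candidates is an assumption made for this example.

```python
import random


def sample_candidates(num_items=100, num_candidates=25, seed=1):
    """Pick `num_candidates` distinct item ids out of `num_items` (illustrative only)."""
    rng = random.Random(seed)
    return rng.sample(range(num_items), num_candidates)


item_ids = sample_candidates()

# Assumption: an action is an index into the candidate list,
# so valid actions are the integers 0 .. num_candidates - 1.
action = 4
chosen_item_id = item_ids[action]
print(f"{len(item_ids)} candidates; action {action} selects item {chosen_item_id}")
```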

Let's look at the default behavior:

In [52]:
pire = ParametricItemRecoEnv()
pire.reset()
print(f'action space: {pire.action_space}')
action = pire.action_space.sample()
obs, reward, finished, info = pire.step(action)
obs_item_foo = f"{obs['item'][:1]} ({len(obs['item'])} items)"
print(f"""
action = {action}, 
obs:
    'item': {obs_item_foo}, 
    'item_id': {obs['item_id']},
    'response': {obs['response']}, 
reward = {reward}, 
finished? = {finished}, 
info = {info}
""")

action space: Discrete(25)

action = 4, 
obs:
    'item': [[0.10202451 0.19664231 0.44530743 0.25907888 0.1517713  0.20830519
  0.39972454 0.26024591 0.09253173 0.19719537 0.16187536 0.43700101
  0.0049241  0.34413293 0.0666158  0.06370218]] (25 items), 
    'item_id': [ 7 74 16 18 62 25 51 48 33 94 85 34 41 69 53 24 77  9 71 45 66  0  5  2
 20],
    'response': [0.7306249741937558], 
reward = 0.7306249741937558, 
finished? = True, 
info = {'regret': 0.16393385117766057}



The reward at each step is computed by multiplying the randomly generated feature matrices, then selecting the response indexed by the action passed to `step`. As constructed, the reward for a randomly sampled action comes out between roughly 0.6 and 0.9 in the runs shown here. The regret is the maximum reward over all possible actions minus the reward for the chosen action, so a regret of 0.0 means the chosen action earned the maximum possible reward. In the following `num_candidates` random steps (25 by default), you may see a regret of 0.0 when the sampled action happens to be the best one.
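Before running more steps, the regret definition can be sketched in a few lines of plain Python. This is a simplified illustration with made-up feature values, not the environment's actual computation. The names `dot` and `step_outcome` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def step_outcome(user, candidate_items, action):
    """Return (reward, regret) for choosing `action` among the candidates."""
    scores = [dot(user, item) for item in candidate_items]
    reward = scores[action]
    regret = max(scores) - reward  # zero only for a best-scoring action
    return reward, regret


# Made-up features: one user and three candidate items.
user = [0.5, 0.5]
candidate_items = [[1.0, 0.4], [0.8, 0.9], [0.6, 0.68]]

best = step_outcome(user, candidate_items, action=1)
worst = step_outcome(user, candidate_items, action=2)
print(f"best action:  reward={best[0]:.2f}, regret={best[1]:.2f}")
print(f"worst action: reward={worst[0]:.2f}, regret={worst[1]:.2f}")
```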

In [75]:
for i in range(pire.num_candidates):
    action = pire.action_space.sample()
    obs, reward, finished, info = pire.step(action)
    print(f'{i:3d}: reward = {reward}, regret = {info["regret"]}')

  0: reward = 0.7055033114274591, regret = 0.1411572092375436
  1: reward = 0.7802960917713582, regret = 0.12002323247687607
  2: reward = 0.7055033114274591, regret = 0.15768762623138977
  3: reward = 0.8461772423074813, regret = 0.04838158306393514
  4: reward = 0.8466605206650027, regret = 0.0
  5: reward = 0.7992618174651833, regret = 0.10105750678305092
  6: reward = 0.6390725545541198, regret = 0.2612467696941144
  7: reward = 0.6156676430300783, regret = 0.284651681218156
  8: reward = 0.7714667319285453, regret = 0.09575891178979923
  9: reward = 0.8461772423074813, regret = 0.021048401410863282
 10: reward = 0.7714667319285453, regret = 0.0930994725548352
 11: reward = 0.7714667319285453, regret = 0.0930994725548352
 12: reward = 0.8001591804210235, regret = 0.04650134024397923
 13: reward = 0.6993927265111476, regret = 0.20092659773708665
 14: reward = 0.8418776391141665, regret = 0.025348004604178076
 15: reward = 0.6467887151108733, regret = 0.2139794535191073
 16: reward =

Now that we've explored `ParametricItemRecoEnv`, let's use it with _LinUCB_.

In [2]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be 20 * timesteps_per_iteration (100 by default) = 2,000
training_iterations = 20

print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


In [4]:
start_time = time.time()

# Pass the registered trainer name, "contrib/LinUCB", as the first argument instead of the `LinUCBTrainer` class.

analysis = tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

In [4]:
print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=22756)[0m 2020-06-08 17:18:32,657	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=22756)[0m 2020-06-08 17:18:32,658	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=22755)[0m 2020-06-08 17:18:32,656	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=22755)[0m 2020-06-08 17:18:32,657	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=22759)[0m 2020-06-08 17:18:32,655	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=22759)[0m 2020-06-08 17:18:32,656	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:22760,1,0.181221,100,0.796752
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:22757,1,0.173921,100,0.876807
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:22759,1,0.186402,100,0.860686
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:22755,2,0.398609,200,0.877462
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:22756,1,0.186187,100,0.877893


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-08_17-18-37
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9571880359594797
  episode_reward_mean: 0.9068167257285522
  episode_reward_min: 0.8661255888998514
  episodes_this_iter: 100
  episodes_total: 1800
  experiment_id: f43f5309cf454998b4536481c33692a0
  experiment_tag: '1'
  grad_time_ms: 0.755
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.755
    learner:
      cumulative_regret: 6.284078888789494
      update_latency: 0.0005629062652587891
    num_steps_sampled: 1800
    num_steps_trained: 1800
    opt_peak_throughput: 1324.963
    opt_samples: 1.0
    sample_peak_throughput: 194.274
    sample_time_ms: 5.147
    update_time_ms: 0.002
  iterations_since_restore: 18
  learner:
    cumulative_regret: 6.284078888789494
    update_latency: 0.0005629062652587891
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1800
  num_steps_trained: 1800
 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:22760,18,4.89113,1800,0.828919
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:22757,18,4.89365,1800,0.906817
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:22759,18,4.8371,1800,0.891643
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:22755,19,5.05503,1900,0.885349
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:22756,19,5.14412,1900,0.907343


Result for contrib_LinUCB_ParametricItemRecoEnv_00000:
  custom_metrics: {}
  date: 2020-06-08_17-18-38
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8619566422683141
  episode_reward_mean: 0.8281987895570359
  episode_reward_min: 0.7403604095551464
  episodes_this_iter: 100
  episodes_total: 1900
  experiment_id: 5f052f26fe604c458e58f04a7b4945ad
  experiment_tag: '0'
  grad_time_ms: 0.995
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.995
    learner:
      cumulative_regret: 5.545677306202458
      update_latency: 0.0006070137023925781
    num_steps_sampled: 1900
    num_steps_trained: 1900
    opt_peak_throughput: 1004.648
    opt_samples: 1.0
    sample_peak_throughput: 601.696
    sample_time_ms: 1.662
    update_time_ms: 0.002
  iterations_since_restore: 19
  learner:
    cumulative_regret: 5.545677306202458
    update_latency: 0.0006070137023925781
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1900
  num_steps_trained: 1900
 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,20,5.41301,2000,0.828398
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,20,5.38405,2000,0.910278
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,20,5.34727,2000,0.884615
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,20,5.31074,2000,0.886636
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,20,5.39993,2000,0.906308


The trials took 10.629887104034424 seconds



In [8]:
df = analysis.dataframe()
df

,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,config/seed,config/shuffle_buffer_size,config/soft_horizon,config/synchronize_filters,config/tf_session_args,config/timesteps_per_iteration,config/train_batch_size,config/use_exec_api,config/use_pytorch,logdir
0,0.861957,0.747152,0.828398,1.0,100,2000,2000,1.445,0.737,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
1,0.957188,0.843713,0.910278,1.0,100,2000,2000,1.636,0.902,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
2,0.923136,0.816543,0.884615,1.0,100,2000,2000,1.877,1.051,0.003,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
3,0.919342,0.822906,0.886636,1.0,100,2000,2000,1.514,0.753,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...
4,0.937596,0.83443,0.906308,1.0,100,2000,2000,1.534,0.852,0.002,...,,0,False,True,"{'allow_soft_placement': True, 'device_count':...",100,1,False,True,/Users/deanwampler/ray_results/contrib/LinUCB/...


In [17]:
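# Analyze cumulative regrets of the trials.
# Stack the per-trial result frames into one frame; pd.concat replaces the
# deprecated DataFrame.append and avoids reusing `df` as a loop variable.
frame = pd.concat(analysis.trial_dataframes.values(), ignore_index=True)

# For each training-step count, summarize cumulative regret across the trials.
df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])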

In [18]:
df

num_steps_trained,mean,max,min,std
100,3.364865,3.602027,3.069012,0.220297
200,3.98838,4.367711,3.84254,0.219889
300,4.303345,4.760648,4.115545,0.261865
400,4.649001,4.978133,4.445932,0.223887
500,4.859206,5.296558,4.560299,0.291189
600,5.01271,5.385695,4.717595,0.282159
700,5.103951,5.428125,4.740504,0.282206
800,5.217915,5.556915,4.822609,0.328547
900,5.325117,5.66671,4.924149,0.338508
1000,5.410992,5.780081,4.950459,0.397302
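Each row of this table summarizes the five trials at one value of `num_steps_trained`. The sketch below reproduces that kind of summary for a single row using only the standard library. The five regret values are made up for illustration, not taken from the actual trials. It relies on the fact that pandas' `"std"` aggregate is the sample standard deviation (dividing by n - 1), which matches `statistics.stdev`.

```python
import statistics

# Made-up cumulative regrets for five trials at one value of num_steps_trained.
regrets_at_step = [3.1, 3.6, 3.4, 3.2, 3.5]

mean = statistics.mean(regrets_at_step)
std = statistics.stdev(regrets_at_step)  # sample std (n - 1), like pandas' "std"

# One standard deviation either side of the mean.
lower, upper = mean - std, mean + std

print(f"mean={mean:.3f} max={max(regrets_at_step):.3f} "
      f"min={min(regrets_at_step):.3f} std={std:.3f}")
print(f"band: [{lower:.3f}, {upper:.3f}]")
```

With only five trials per row, the standard deviation is a rough estimate, so a mean ± std band built from it is best read as indicative rather than as a tight confidence interval.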


In [13]:
from bokeh.plotting import figure, show, output_file
from bokeh.models import Band, ColumnDataSource, Range1d
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [96]:
df['lower'] = df['mean'] - df['std']
df['upper'] = df['mean'] + df['std']
ymin=df['lower'].min()
ymax=df['upper'].max()

source = ColumnDataSource(df.reset_index())

TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(tools=TOOLS, y_range=Range1d(ymin,ymax))

p.scatter(x='num_steps_trained', y='mean', line_color='black', fill_alpha=0.3, size=5, source=source)
band = Band(base='num_steps_trained', lower='lower', upper='upper', source=source, level='underlay',
            fill_alpha=0.3, line_width=1, line_color='blue')
p.add_layout(band)

p.title.text = "Cumulative Regret"
p.xgrid[0].grid_line_alpha=0.5
p.ygrid[0].grid_line_alpha=0.5
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Regret'

show(p)

So the cumulative regret increases, then begins to level off as the system gets better at optimizing the mean reward.
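The leveling off is easier to see by looking at how much regret each block of 100 training steps adds. The sketch below takes the `mean` column from the cumulative regret table above and differences it. The variable names are chosen for this example only.

```python
# Mean cumulative regret at 100, 200, ..., 1000 training steps (from the table above).
mean_cumulative_regret = [
    3.364865, 3.98838, 4.303345, 4.649001, 4.859206,
    5.01271, 5.103951, 5.217915, 5.325117, 5.410992,
]

# Regret added within each block of 100 steps; the first block starts from zero.
increments = [mean_cumulative_regret[0]] + [
    later - earlier
    for earlier, later in zip(mean_cumulative_regret, mean_cumulative_regret[1:])
]

for steps, added in zip(range(100, 1100, 100), increments):
    print(f"{steps:5d}: regret added in this block = {added:.3f}")
```

The first 100 steps account for well over half of the regret accumulated by step 1000, while the later blocks each add only about a tenth of a unit.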