# Ray RLlib Multi-Armed Bandits - Linear Thompson Sampling

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson uses a second exploration strategy we discussed briefly in lesson [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb), _Thompson Sampling_, with a linear variant, [LinTS](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints).

## Wheel Bandit

We'll use it on the `Wheel Bandit` problem ([RLlib discrete.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py)), which is an artificial problem designed to force exploration. It is described in the paper [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127) (see _The Wheel Bandit_ section). The paper uses it to model 2D contexts, but it can be generalized to more than two dimensions.

You can visualize this problem as a wheel: a central circle with four other regions around it. An exploration parameter $\delta$ defines a threshold: if the norm of the context vector is less than or equal to $\delta$ (inside the central circle), the leader action (conventionally numbered `1`) is optimal. Otherwise, one of the other four actions is optimal, depending on the region, so those actions must be explored.

From figure 3 in [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127), the Wheel Bandit can be visualized this way:

![Wheel Bandit](../../images/rllib/Wheel-Bandit.png)

The radius of the entire colored circle is 1.0, while the radius of the blue "core" is $\delta$.

Contexts are sampled randomly within the unit circle (radius 1.0). The optimal action for the blue, red, green, black, or yellow region is action 1, 2, 3, 4, or 5, respectively. In other words, if the context is in the blue region (radius ≤ $\delta$), action 1 is optimal; if it is in the upper-right quadrant with radius between $\delta$ and 1.0, action 2 is optimal; and so on.
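The region-to-action rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the RLlib implementation: the function name `optimal_action` is hypothetical, and only the upper-right quadrant mapping (action 2) is stated in the text. The counterclockwise assignment of actions 3, 4, and 5 to the remaining quadrants is an assumption.

```python
import math

def optimal_action(context, delta=0.5):
    """Return the optimal action (1-5) for a 2D context inside the unit circle."""
    x, y = context
    if math.hypot(x, y) <= delta:
        return 1  # blue core: the leader action
    if x >= 0 and y >= 0:
        return 2  # upper-right quadrant (stated in the text)
    if x < 0 and y >= 0:
        return 3  # upper-left quadrant (assumed ordering)
    if x < 0 and y < 0:
        return 4  # lower-left quadrant (assumed ordering)
    return 5      # lower-right quadrant (assumed ordering)

print(optimal_action((0.1, 0.2)))  # norm is about 0.22, inside the core
print(optimal_action((0.6, 0.6)))  # norm is about 0.85, upper-right ring
```

Treating the boundary `norm == delta` as part of the core follows the "less than or equal" convention used in the description above.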

The parameter $\delta$ controls how aggressively we explore. The reward $r$ for each action and context combination is drawn from a normal distribution as follows:

Action 1 offers the reward, $r \sim \mathcal{N}({\mu_1,\sigma^2})$, independent of context.

Actions 2-5 offer the reward, $r \sim \mathcal{N}({\mu_2,\sigma^2})$ where $\mu_2 < \mu_1$, _when they are suboptimal choices_. When they are optimal, the reward is $r \sim \mathcal{N}({\mu_3,\sigma^2})$ where $\mu_3 \gg \mu_1$.

In addition to $\delta$, the parameters $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma$ are configurable. The default values for these parameters in the paper and in the [RLlib implementation](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py) are as follows:

```python
DEFAULT_CONFIG_WHEEL = {
    "delta": 0.5,
    "mu_1": 1.2,
    "mu_2": 1.0,
    "mu_3": 50.0,
    "std": 0.01  # sigma
}
```
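To make the reward rule concrete, here is a minimal sketch that samples a reward using these default values. It is not the RLlib code: `sample_reward` and `DEFAULTS` are hypothetical names, and the caller is assumed to already know which action is optimal for the current context.

```python
import random

# Copy of the default values listed above.
DEFAULTS = {"delta": 0.5, "mu_1": 1.2, "mu_2": 1.0, "mu_3": 50.0, "std": 0.01}

def sample_reward(action, optimal, cfg=DEFAULTS, rng=random):
    """Sample a reward for `action`, given the optimal action for the context."""
    if action == 1:
        mean = cfg["mu_1"]   # action 1 pays mu_1 regardless of context
    elif action == optimal:
        mean = cfg["mu_3"]   # an outer action pays mu_3 only when it is optimal
    else:
        mean = cfg["mu_2"]   # otherwise an outer action pays mu_2 < mu_1
    return rng.gauss(mean, cfg["std"])

rng = random.Random(0)
print(sample_reward(1, optimal=2, rng=rng))  # close to 1.2
print(sample_reward(2, optimal=2, rng=rng))  # close to 50.0
print(sample_reward(3, optimal=2, rng=rng))  # close to 1.0
```

With such a small standard deviation the three cases are easy to tell apart, which is why the learning difficulty comes from rarely seeing the high-reward case rather than from noise.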

Note that the probability of a context randomly falling in the high-reward region (not blue) is 1 − $\delta^2$. Therefore, the difficulty of the problem increases with $\delta$, and algorithms used with this bandit are more likely to get stuck repeatedly selecting action 1 for large $\delta$.
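The $1 - \delta^2$ figure is the ratio of areas: the blue core has area $\pi\delta^2$ and the unit circle has area $\pi$. A quick Monte Carlo check illustrates it, assuming contexts are drawn uniformly over the unit circle. The helper `high_reward_fraction` is a hypothetical name, and rejection sampling is used here only for simplicity.

```python
import math
import random

def high_reward_fraction(delta, n=20000, seed=0):
    """Estimate the fraction of uniform unit-circle contexts outside the core."""
    rng = random.Random(seed)
    hits = 0
    total = 0
    while total < n:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        r = math.hypot(x, y)
        if r > 1.0:
            continue  # reject points outside the unit circle
        total += 1
        if r > delta:
            hits += 1  # context falls in one of the four outer regions
    return hits / total

for delta in (0.5, 0.7, 0.9):
    print(delta, round(high_reward_fraction(delta), 3), "expected", round(1 - delta**2, 3))
```

For $\delta = 0.9$ fewer than one context in five lands in a high-reward region, so an algorithm has few chances to discover that the outer actions can pay off.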

## Use Wheel Bandit with Thompson Sampling

Note the import in the next cell of `LinTSTrainer` and how it is used below when setting up the _Tune_ job. For the `LinUCB` example in the [previous lesson](04-Linear-Upper-Confidence-bound.ipynb), we didn't import the corresponding `LinUCBTrainer`, but passed a "magic" string to Tune, `contrib/LinUCB`. Importing the trainer class, as we do here, is an alternative.

In [1]:
import time
import numpy as np
import pandas as pd
import ray
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [2]:
from bokeh_util import plot_cumulative_regret, plot_wheel_bandit_model_weights
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [3]:
wbe = WheelBanditEnv()
wbe.config



{'delta': 0.5, 'mu_1': 1.2, 'mu_2': 1, 'mu_3': 50, 'std': 0.01}

With `training_iterations = 20` and the default of `100` timesteps per iteration, each trial runs for `20 * timesteps_per_iteration == 2,000` time steps.

In [4]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


What's in the standard config object for _LinTS_, anyway?

In [5]:
TS_CONFIG

{'num_workers': 0,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 1,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 1,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [256, 256],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': ray.rllib.contrib.bandits.envs.discrete.WheelBanditEnv,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr':

Initialize Ray...

In [8]:
!../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [9]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15832',
 'object_store_address': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764'}

In [10]:
start_time = time.time()

analysis = ray.tune.run(
    LinTSTrainer,
    config=TS_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=2,
    checkpoint_at_end=True,
    verbose=2,              # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,    # Don't allow Tune to initialize Ray.
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,014	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,015	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76840)[0m 2020-06-13 10:30:43,024	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,013	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,014	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76844)[0m 2020-06-13 10:30:43,024	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-13_10-30-43
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:76840,1,0.163929,100,29.4253
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:76844,2,0.320174,200,17.1813


Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-13_10-30-46
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 50.0242677018256
  episode_reward_mean: 40.68799880282648
  episode_reward_min: 0.9804621121217783
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: 9cb9808a002f42b3ba283ed86cf12573
  experiment_tag: '0'
  grad_time_ms: 0.281
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.281
    learner:
      cumulative_regret: 1960.7670248947884
      update_latency: 0.00020194053649902344
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 3560.832
    opt_samples: 1.0
    sample_peak_throughput: 1041.649
    sample_time_ms: 0.96
    update_time_ms: 0.001
  iterations_since_restore: 20
  learner:
    cumulative_regret: 1960.7670248947884
    update_latency: 0.00020194053649902344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  off_policy_esti

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,2.96992,2000,40.688
LinTS_WheelBanditEnv_00001,TERMINATED,,20,3.01704,2000,16.1915


The trials took 7.954367160797119 seconds



Analyze cumulative regrets of the trials

In [11]:
# Combine the per-trial results into one DataFrame.
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
df = pd.concat(analysis.trial_dataframes.values(), ignore_index=True)

regrets = df \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

In [12]:
regrets

num_steps_trained,mean,max,min,std
100,1620.953058,2306.874826,935.031289,970.039868
200,2752.74155,4368.643812,1136.839287,2285.230895
300,3762.284111,6284.25278,1240.315442,3566.602296
400,4625.602712,7907.10804,1344.097384,4640.74934
500,5610.191994,9823.580427,1396.80356,5958.631066
600,6496.823674,11543.509368,1450.13798,7137.091353
700,7432.89228,13360.898731,1504.885829,8383.467121
800,8319.80409,15129.141741,1510.46644,9629.857656
900,9084.371585,16653.967193,1514.775977,10705.024771
1000,9996.3943,18472.364732,1520.423868,11986.832339


In [13]:
plot_cumulative_regret(regrets)

![LinTS Cumulative Regret](../../images/rllib/LinTS-Cumulative-Regret-05.png)

Here is how you can restore a trainer from a checkpoint:

In [14]:
trial = analysis.trials[0]
trainer = LinTSTrainer(config=TS_CONFIG)
trainer.restore(trial.checkpoint.value)

2020-06-13 10:32:02,189	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 10:32:02,196	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 10:32:02,210	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:32:02,216	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:32:02,217	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-13_10-30-381336kilh/checkpoint_20/checkpoint-20
2020-06-13 10:32:02,218	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 2.9699223041534424, '_episodes_total': 2000}


Get the model so we can plot the distribution of the arm weights

In [15]:
model = trainer.get_policy().model
means = [model.arms[i].theta.numpy() for i in range(5)]
covs = [model.arms[i].covariance.numpy() for i in range(5)]
model, means, covs

(DiscreteLinearModelThompsonSampling(
   (arms): ModuleList(
     (0): OnlineLinearRegression()
     (1): OnlineLinearRegression()
     (2): OnlineLinearRegression()
     (3): OnlineLinearRegression()
     (4): OnlineLinearRegression()
   )
 ),
 [array([-0.520057 ,  0.8268448], dtype=float32),
  array([45.117825, 44.797653], dtype=float32),
  array([-42.706234,  46.898422], dtype=float32),
  array([-43.92157 , -44.499313], dtype=float32),
  array([ 44.84488 , -43.830235], dtype=float32)],
 [array([[0.7197646 , 0.1892137 ],
         [0.18921368, 0.715647  ]], dtype=float32),
  array([[ 0.01288966, -0.00779896],
         [-0.00779896,  0.01257929]], dtype=float32),
  array([[0.01161321, 0.00728622],
         [0.00728622, 0.01245194]], dtype=float32),
  array([[ 0.0144654 , -0.00871688],
         [-0.00871688,  0.01327495]], dtype=float32),
  array([[0.01235591, 0.00775846],
         [0.00775846, 0.01252754]], dtype=float32)])
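These per-arm means and covariances are what a Thompson Sampling policy draws from when it picks an arm: sample a weight vector for each arm from its normal distribution, score the context with each sample, and take the highest score. The sketch below illustrates that selection step using rounded copies of the values printed above. It is a simplified illustration, not RLlib's code: `sample_theta` and `thompson_choose` are hypothetical helpers, and the 2D normal sample uses a hand-written Cholesky factor so that only the standard library is needed.

```python
import math
import random

# Rounded copies of the means and covariances printed above (arm index 0-4).
MEANS = [(-0.52, 0.83), (45.12, 44.80), (-42.71, 46.90),
         (-43.92, -44.50), (44.84, -43.83)]
COVS = [
    ((0.720, 0.189), (0.189, 0.716)),
    ((0.0129, -0.0078), (-0.0078, 0.0126)),
    ((0.0116, 0.0073), (0.0073, 0.0125)),
    ((0.0145, -0.0087), (-0.0087, 0.0133)),
    ((0.0124, 0.0078), (0.0078, 0.0125)),
]

def sample_theta(mean, cov, rng):
    """Draw one sample from a 2D normal using its 2x2 Cholesky factor."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2)

def thompson_choose(context, rng):
    """Return the arm index whose sampled weights score the context highest."""
    scores = []
    for mean, cov in zip(MEANS, COVS):
        theta = sample_theta(mean, cov, rng)
        scores.append(theta[0] * context[0] + theta[1] * context[1])
    return max(range(len(scores)), key=scores.__getitem__)

rng = random.Random(0)
print(thompson_choose((0.7, 0.7), rng))    # upper-right context
print(thompson_choose((-0.7, -0.7), rng))  # lower-left context
```

Arm index `i` here corresponds to action `i + 1` in the problem description. The covariance for arm 0 is much larger than for the others, so its sampled weights vary more from draw to draw.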

Plot weight distributions for different arms

In [16]:
plot_wheel_bandit_model_weights(means, covs)

![LinTS Weight Distribution of Arms](../../images/rllib/LinTS-Weight-Distribution-of-Arms-05.png)

## Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this exercise.