# Ray RLlib Multi-Armed Bandits - Linear Thompson Sampling

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson uses a second exploration strategy we discussed briefly in lesson [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb), _Thompson Sampling_, with a linear variant, [LinTS](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints).

## Wheel Bandit

We'll use LinTS on the `Wheel Bandit` problem ([RLlib discrete.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py)), which is an artificial problem designed to force exploration. It is described in the paper [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127) (see _The Wheel Bandit_ section). The paper uses it to model 2D contexts, but it can be generalized to more than two dimensions.

You can visualize this problem as a wheel: a central circle (the core) with four other regions around it. An exploration parameter $\delta$ defines a threshold. If the norm of the context vector is less than or equal to $\delta$ (inside the core), then the "leader" action (conventionally numbered `1`) is optimal. Otherwise, one of the other four actions is optimal, depending on which outer region contains the context, and the algorithm has to explore to discover this.

From figure 3 in [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127), the Wheel Bandit can be visualized this way:

![Wheel Bandit](../../images/rllib/Wheel-Bandit.png)

The radius of the entire colored circle is 1.0, while the radius of the blue "core" is $\delta$.

Contexts are sampled uniformly at random within the unit circle (radius 1.0). The optimal action for the blue, red, green, black, or yellow region is action 1, 2, 3, 4, or 5, respectively. In other words, if the context is in the blue region (radius $\le \delta$), action 1 is optimal. If it is in the upper-right-hand quadrant with radius between $\delta$ and 1.0, then action 2 is optimal, and so on.

The parameter $\delta$ controls how much exploration is needed to find the high-reward actions. The reward $r$ for each action and context combination is drawn from a normal distribution as follows:

Action 1 offers the reward, $r \sim \mathcal{N}({\mu_1,\sigma^2})$, independent of context.

Actions 2-5 offer the reward, $r \sim \mathcal{N}({\mu_2,\sigma^2})$ where $\mu_2 < \mu_1$, _when they are suboptimal choices_. When they are optimal, the reward is $r \sim \mathcal{N}({\mu_3,\sigma^2})$ where $\mu_3 \gg \mu_1$.

In addition to $\delta$, the parameters $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma$ are configurable. The default values for these parameters in the paper and in the [RLlib implementation](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py) are as follows:

```python
DEFAULT_CONFIG_WHEEL = {
    "delta": 0.5,
    "mu_1": 1.2,
    "mu_2": 1.0,
    "mu_3": 50.0,
    "std": 0.01  # sigma
}
```
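The reward rules above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the RLlib implementation: the helper names (`optimal_action`, `expected_reward`, `sample_reward`) are hypothetical, and the counter-clockwise assignment of quadrants to actions 3, 4, and 5 is an assumption (the text above only states that the upper-right quadrant maps to action 2).

```python
import numpy as np

# Default Wheel Bandit parameters, copied from the config shown above.
WHEEL_CONFIG = {"delta": 0.5, "mu_1": 1.2, "mu_2": 1.0, "mu_3": 50.0, "std": 0.01}


def optimal_action(context, delta):
    """Return the optimal action (1-5) for a 2D context.

    Assumption: outer quadrants are numbered counter-clockwise,
    starting with the upper-right quadrant as action 2.
    """
    x, y = context
    if np.hypot(x, y) <= delta:
        return 1  # inside the core
    if x >= 0 and y >= 0:
        return 2  # upper right
    if x < 0 and y >= 0:
        return 3  # upper left (assumed)
    if x < 0 and y < 0:
        return 4  # lower left (assumed)
    return 5      # lower right (assumed)


def expected_reward(context, action, cfg=WHEEL_CONFIG):
    """Mean reward for taking `action` in `context`."""
    if action == 1:
        return cfg["mu_1"]  # independent of context
    if action == optimal_action(context, cfg["delta"]):
        return cfg["mu_3"]  # an outer action chosen in its own region
    return cfg["mu_2"]      # an outer action chosen anywhere else


def sample_reward(context, action, rng, cfg=WHEEL_CONFIG):
    """Draw a noisy reward r ~ N(mean, std^2)."""
    return rng.normal(expected_reward(context, action, cfg), cfg["std"])
```

For example, with the default $\delta = 0.5$, the context `(0.8, 0.5)` lies outside the core in the upper-right quadrant, so action 2 has mean reward `50.0`, action 1 has mean `1.2`, and the remaining actions have mean `1.0`.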

Note that because contexts are sampled uniformly from the unit circle, the probability of a context falling in the high-reward region (not blue) is the ratio of the areas, $1 - \delta^2$. Therefore, the difficulty of the problem increases with $\delta$, and algorithms used with this bandit are more likely to get stuck repeatedly selecting action 1 for large $\delta$.
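The $1 - \delta^2$ figure can be checked numerically. The sketch below assumes contexts are drawn uniformly from the unit disk; the function names are hypothetical and the sampling method (square-root radius with a uniform angle) is one standard way to get a uniform distribution over a disk, not necessarily what RLlib does.

```python
import numpy as np


def sample_unit_disk(n, rng):
    """Draw n points uniformly from the unit disk."""
    # Taking the square root of a uniform variate makes the density
    # uniform over area rather than over radius.
    r = np.sqrt(rng.uniform(0.0, 1.0, n))
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])


def high_reward_fraction(delta, n=200_000, seed=0):
    """Estimate the fraction of contexts that fall outside the core."""
    rng = np.random.default_rng(seed)
    points = sample_unit_disk(n, rng)
    return float(np.mean(np.linalg.norm(points, axis=1) > delta))


for delta in (0.5, 0.7, 0.9):
    print(delta, high_reward_fraction(delta), 1.0 - delta**2)
```

The estimates should land close to the analytic values 0.75, 0.51, and 0.19. At $\delta = 0.9$, fewer than one context in five can reveal a high reward, which is why large $\delta$ makes the problem harder.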

## Use Wheel Bandit with Thompson Sampling

Note the import in the next cell of `LinTSTrainer` and how it is used below when setting up the _Tune_ job. For the `LinUCB` example in the [previous lesson](04-Linear-Upper-Confidence-bound.ipynb), we didn't import the corresponding `LinUCBTrainer`, but passed a "magic" string to Tune, `contrib/LinUCB`. This approach is an alternative.

In [2]:
import time
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from ray import tune
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [10]:
from bokeh_util import plot_cumulative_regret, plot_wheel_bandit_model_weights
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [30]:
wbe = WheelBanditEnv()
wbe.config

{'delta': 0.5, 'mu_1': 1.2, 'mu_2': 1, 'mu_3': 50, 'std': 0.01}

The total number of training time steps will be `20 * timesteps_per_iteration == 2,000`, where `training_iterations` is `20` and the number of time steps per iteration is `100` by default.

In [3]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


What's in the standard config object for _LinTS_, anyway?

In [32]:
TS_CONFIG

{'num_workers': 0,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 1,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 1,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [256, 256],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': ray.rllib.contrib.bandits.envs.discrete.WheelBanditEnv,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr':

In [5]:
start_time = time.time()

analysis = tune.run(
    LinTSTrainer,
    config=TS_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=2,
    checkpoint_at_end=True,
    verbose=2            # Change to 0 or 1 to reduce the output.
)

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-10 16:28:55,252	INFO resource_spec.py:212 -- Starting Ray with 4.15 GiB memory available for workers and up to 2.09 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-10 16:28:56,511	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=84292)[0m 2020-06-10 16:29:02,338	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=84292)[0m 2020-06-10 16:29:02,339	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=84292)[0m 2020-06-10 16:29:02,347	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=84291)[0m 2020-06-10 16:29:02,338	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=84291)[0m 2020-06-10 16:29:02,339	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=84291)[0m 2020-06-10 16:29:02,347	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_16-29-02
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:84292,1.0,0.140619,100.0,22.0946
LinTS_WheelBanditEnv_00001,RUNNING,,,,,


Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_16-29-02
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 50.01794466461183
  episode_reward_mean: 25.510228026493014
  episode_reward_min: 0.9805564814351245
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 6d2db622fd5343bc8d42934e88b6c783
  experiment_tag: '1'
  grad_time_ms: 0.254
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.254
    learner:
      cumulative_regret: 1570.4835152788055
      update_latency: 0.00016260147094726562
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 3934.988
    opt_samples: 1.0
    sample_peak_throughput: 1123.725
    sample_time_ms: 0.89
    update_time_ms: 0.001
  iterations_since_restore: 1
  learner:
    cumulative_regret: 1570.4835152788055
    update_latency: 0.00016260147094726562
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimat

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,2.67065,2000,39.2208
LinTS_WheelBanditEnv_00001,TERMINATED,,20,2.65737,2000,36.2794


The trials took 10.201154947280884 seconds



Analyze cumulative regrets of the trials
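As a reminder of what is being measured, a common definition of cumulative regret is the running sum of the gap between the expected reward of the optimal action and the expected reward of the action actually chosen. The sketch below illustrates that definition with made-up rounds; the function name is hypothetical, and it is an assumption that RLlib's `cumulative_regret` metric follows exactly this definition.

```python
import numpy as np


def cumulative_regret(optimal_means, chosen_means):
    """Running sum of (optimal expected reward - chosen expected reward)."""
    gaps = np.asarray(optimal_means, dtype=float) - np.asarray(chosen_means, dtype=float)
    return np.cumsum(gaps)


# Four hypothetical rounds using the default Wheel Bandit means:
# round 1: context outside the core, action 1 chosen   -> gap 50.0 - 1.2
# round 2: context inside the core, action 1 chosen    -> gap 0
# round 3: context outside, wrong outer action chosen  -> gap 50.0 - 1.0
# round 4: context outside, correct outer action       -> gap 0
optimal = [50.0, 1.2, 50.0, 50.0]
chosen = [1.2, 1.2, 1.0, 50.0]
regret_curve = cumulative_regret(optimal, chosen)
print(regret_curve)
```

A curve that flattens out means the algorithm has stopped making costly mistakes; a curve that keeps climbing means it is still choosing suboptimal arms.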

In [6]:
# Combine the per-trial result DataFrames into one.
# (DataFrame.append was removed in pandas 2.0; pd.concat works across versions.)
df = pd.concat(analysis.trial_dataframes.values(), ignore_index=True)

regrets = df \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

In [13]:
regrets

| num_steps_trained | mean | max | min | std | lower | upper |
|---|---|---|---|---|---|---|
| 100 | 1619.37935 | 1668.275186 | 1570.483515 | 69.149153 | 1550.230197 | 1688.528504 |
| 200 | 2456.68921 | 2456.814819 | 2456.563601 | 0.177638 | 2456.511572 | 2456.866848 |
| 300 | 3025.534856 | 3392.419121 | 2658.65059 | 518.852704 | 2506.682152 | 3544.38756 |
| 400 | 3177.364391 | 3446.949872 | 2907.778909 | 381.251444 | 2796.112947 | 3558.615835 |
| 500 | 3378.607935 | 3647.269597 | 3109.946273 | 379.944966 | 2998.662969 | 3758.552901 |
| 600 | 3456.559719 | 3800.351341 | 3112.768097 | 486.194774 | 2970.364945 | 3942.754493 |
| 700 | 3534.039792 | 3903.296124 | 3164.783461 | 522.207312 | 3011.832481 | 4056.247104 |
| 800 | 3661.753925 | 4104.330007 | 3219.177842 | 625.897098 | 3035.856827 | 4287.651022 |
| 900 | 3789.177402 | 4207.167747 | 3371.187057 | 591.127615 | 3198.049787 | 4380.305017 |
| 1000 | 3843.254799 | 4213.163619 | 3473.345979 | 523.13007 | 3320.124729 | 4366.384869 |
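The `lower` and `upper` columns in this output are not produced by the aggregation cell above. Their values are consistent with `mean - std` and `mean + std`, so they were presumably added by the plotting helper when the cells were run out of order; that is an inference from the numbers, not something confirmed here. The self-contained sketch below reproduces the first two rows from the per-trial `min` and `max` values (with two trials, those are the two raw observations).

```python
import pandas as pd

# The two trials' cumulative regret at 100 and 200 steps, taken from the
# min/max columns of the table above.
df = pd.DataFrame({
    "num_steps_trained": [100, 100, 200, 200],
    "learner/cumulative_regret": [1570.483515, 1668.275186,
                                  2456.563601, 2456.814819],
})

regrets = (
    df.groupby("num_steps_trained")["learner/cumulative_regret"]
      .aggregate(["mean", "max", "min", "std"])
)

# Assumed definition of the band: one (sample) standard deviation
# on either side of the mean.
regrets["lower"] = regrets["mean"] - regrets["std"]
regrets["upper"] = regrets["mean"] + regrets["std"]
print(regrets)
```

With only two samples per step (`num_samples=2`), the standard deviation is a rough indicator at best; more trials would give a more meaningful band.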


In [12]:
plot_cumulative_regret(regrets)

![LinTS cumulative regret](../../images/rllib/LinTS-cumulative-regret.png)

Here is how you can restore a trainer from a checkpoint:

In [14]:
trial = analysis.trials[0]
trainer = LinTSTrainer(config=TS_CONFIG)
trainer.restore(trial.checkpoint.value)

2020-06-10 16:38:13,500	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-10 16:38:13,512	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-10 16:38:13,568	INFO trainable.py:217 -- Getting current IP.
2020-06-10 16:38:13,582	INFO trainable.py:217 -- Getting current IP.
2020-06-10 16:38:13,583	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-10_16-28-561mm4h9yf/checkpoint_20/checkpoint-20
2020-06-10 16:38:13,584	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 2.670645236968994, '_episodes_total': 2000}


Get model to plot arm weights distribution

In [15]:
model = trainer.get_policy().model
means = [model.arms[i].theta.numpy() for i in range(5)]
covs = [model.arms[i].covariance.numpy() for i in range(5)]
model, means, covs

(DiscreteLinearModelThompsonSampling(
   (arms): ModuleList(
     (0): OnlineLinearRegression()
     (1): OnlineLinearRegression()
     (2): OnlineLinearRegression()
     (3): OnlineLinearRegression()
     (4): OnlineLinearRegression()
   )
 ),
 [array([ 0.43531385, -1.338794  ], dtype=float32),
  array([41.889137, 46.24012 ], dtype=float32),
  array([-44.802326,  45.43111 ], dtype=float32),
  array([-44.95004 , -42.628048], dtype=float32),
  array([ 42.929142, -44.037468], dtype=float32)],
 [array([[0.54407024, 0.14276597],
         [0.14276595, 0.18919492]], dtype=float32),
  array([[ 0.01185324, -0.00700816],
         [-0.00700816,  0.01187685]], dtype=float32),
  array([[0.01302548, 0.00896426],
         [0.00896426, 0.01466974]], dtype=float32),
  array([[ 0.01259771, -0.00803566],
         [-0.00803566,  0.01341333]], dtype=float32),
  array([[0.01261109, 0.00748223],
         [0.00748223, 0.01256214]], dtype=float32)])
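To see how these per-arm means and covariances drive action selection, here is a minimal sketch of linear Thompson Sampling: for each arm, sample a weight vector from its Gaussian posterior, score the context with the sampled weights, and pick the arm with the highest score. The numbers are rounded from the output above. The function `thompson_select` is hypothetical and is not the RLlib code path; RLlib's model may differ in details.

```python
import numpy as np

# Posterior means and covariances per arm, rounded from the output above.
# Arm index i corresponds to action i + 1.
means = [
    np.array([0.435, -1.339]),
    np.array([41.889, 46.240]),
    np.array([-44.802, 45.431]),
    np.array([-44.950, -42.628]),
    np.array([42.929, -44.037]),
]
covs = [
    np.array([[0.544, 0.143], [0.143, 0.189]]),
    np.array([[0.0119, -0.0070], [-0.0070, 0.0119]]),
    np.array([[0.0130, 0.0090], [0.0090, 0.0147]]),
    np.array([[0.0126, -0.0080], [-0.0080, 0.0134]]),
    np.array([[0.0126, 0.0075], [0.0075, 0.0126]]),
]


def thompson_select(context, means, covs, rng):
    """Pick an arm by sampling each arm's weights from its posterior."""
    scores = []
    for mean, cov in zip(means, covs):
        theta = rng.multivariate_normal(mean, cov)  # one posterior sample
        scores.append(theta @ context)              # predicted reward
    return int(np.argmax(scores))


rng = np.random.default_rng(0)
print(thompson_select(np.array([0.7, 0.7]), means, covs, rng))    # upper-right context
print(thompson_select(np.array([-0.7, -0.7]), means, covs, rng))  # lower-left context
```

For contexts well outside the core, the learned means are so large relative to the covariances that the sampled scores almost always favor the arm whose weight vector points toward that quadrant. The first arm's covariance is noticeably larger than the others, which reflects greater remaining uncertainty about its weights.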

Plot weight distributions for different arms

In [29]:
plot_wheel_bandit_model_weights(means, covs)

![LinTS weight distribution of arms](../../images/rllib/LinTS-Weight-Distribution-of-Arms.png)

## Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.