# Ray RLlib Multi-Armed Bandits - Linear Thompson Sampling

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This lesson uses a second exploration strategy we discussed briefly in lesson [02 Exploration vs. Exploitation Strategies](02-Exploration-vs-Exploitation-Strategies.ipynb), _Thompson Sampling_, with a linear variant, [LinTS](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints).

## Wheel Bandit

We'll use it on the `Wheel Bandit` problem ([RLlib discrete.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py)), which is an artificial problem designed to force exploration. It is described in the paper [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127) (see _The Wheel Bandit_ section). The paper uses it to model 2D contexts, but it can be generalized to more than two dimensions.

You can visualize this problem as a wheel (circle) with four other regions around it. An exploration parameter $\delta$ defines a threshold: if the norm of the context vector is less than or equal to $\delta$ (inside the “wheel”), then the leader action (conventionally numbered `1`) is the optimal choice. Otherwise, one of the other four actions is optimal, so they must be explored.

From figure 3 in [Deep Bayesian Bandits Showdown](https://arxiv.org/abs/1802.09127), the Wheel Bandit can be visualized this way:

![Wheel Bandit](../../images/rllib/Wheel-Bandit.png)

The radius of the entire colored circle is 1.0, while the radius of the blue "core" is $\delta$.

Contexts are sampled randomly within the unit circle (radius 1.0). The optimal action for the blue, red, green, black, or yellow region is action 1, 2, 3, 4, or 5, respectively. In other words, if the context is in the blue region (radius $\leq \delta$), action 1 is optimal; if it is in the upper-right-hand quadrant with radius between $\delta$ and 1.0, action 2 is optimal; and so on.

The parameter $\delta$ controls how aggressively we explore. The reward $r$ for each action and context combination is drawn from a normal distribution as follows:

Action 1 offers the reward, $r \sim \mathcal{N}({\mu_1,\sigma^2})$, independent of context.

Actions 2-5 offer the reward, $r \sim \mathcal{N}({\mu_2,\sigma^2})$ where $\mu_2 < \mu_1$, _when they are suboptimal choices_. When they are optimal, the reward is $r \sim \mathcal{N}({\mu_3,\sigma^2})$ where $\mu_3 \gg \mu_1$.

In addition to $\delta$, the parameters $\mu_1$, $\mu_2$, $\mu_3$, and $\sigma$ are configurable. The default values for these parameters in the paper and in the [RLlib implementation](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py) are as follows:

```python
DEFAULT_CONFIG_WHEEL = {
    "delta": 0.5,
    "mu_1": 1.2,
    "mu_2": 1.0,
    "mu_3": 50.0,
    "std": 0.01  # sigma
}
```
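
The reward rules above can be sketched in a few lines of NumPy. This is only an illustration, not the RLlib implementation: the function names `optimal_action` and `sample_reward` are hypothetical, and the counterclockwise numbering of the outer quadrants (upper right is action 2, then upper left, lower left, lower right) is an assumption beyond the one quadrant the text specifies.

```python
import numpy as np

# Defaults copied from DEFAULT_CONFIG_WHEEL above.
DELTA, MU_1, MU_2, MU_3, STD = 0.5, 1.2, 1.0, 50.0, 0.01

def optimal_action(context, delta=DELTA):
    """Return the optimal action (1-5) for a 2D context."""
    if np.linalg.norm(context) <= delta:
        return 1  # inside the blue core
    x, y = context
    # Assumed quadrant order: counterclockwise, starting at the upper right.
    if x >= 0 and y >= 0:
        return 2
    if x < 0 and y >= 0:
        return 3
    if x < 0 and y < 0:
        return 4
    return 5

def sample_reward(action, context, rng, delta=DELTA):
    """Draw a reward for taking `action` in `context`."""
    if action == 1:
        mean = MU_1  # the safe action, independent of context
    elif action == optimal_action(context, delta):
        mean = MU_3  # the large payoff for the correct outer action
    else:
        mean = MU_2  # a suboptimal outer action
    return rng.normal(mean, STD)

rng = np.random.default_rng(0)
context = np.array([0.6, 0.6])  # outside the core, upper-right quadrant
for action in range(1, 6):
    print(action, sample_reward(action, context, rng))
```

For this context, only action 2 pays roughly `mu_3`; action 1 pays roughly `mu_1` and the remaining actions roughly `mu_2`.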

Note that the probability of a context randomly falling in the high-reward region (not blue) is $1 - \delta^2$, because the blue core covers a fraction $\delta^2$ of the unit circle's area. Therefore, the difficulty of the problem increases with $\delta$, and algorithms used with this bandit are more likely to get stuck repeatedly selecting action 1 for large $\delta$.
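
A quick Monte Carlo check of the $1 - \delta^2$ figure, assuming contexts are drawn uniformly from the unit circle by rejection sampling. The helper name `high_reward_fraction` is hypothetical and exists only for this sketch.

```python
import numpy as np

def high_reward_fraction(delta, n=200_000, seed=0):
    """Estimate the fraction of contexts that land outside the blue core."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1.0, 1.0, size=(n, 2))
    radii = np.linalg.norm(points, axis=1)
    in_circle = radii <= 1.0  # keep only points inside the unit circle
    return float(np.mean(radii[in_circle] > delta))

for delta in (0.5, 0.7, 0.9):
    estimate = high_reward_fraction(delta)
    print(f"delta={delta}: estimated {estimate:.3f}, expected {1 - delta**2:.3f}")
```

As $\delta$ grows, fewer contexts can ever reveal the large reward, which is why larger values make the problem harder.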

## Use Wheel Bandit with Thompson Sampling

Note the import in the next cell of `LinTSTrainer` and how it is used below when setting up the _Tune_ job. For the `LinUCB` example in the [previous lesson](04-Linear-Upper-Confidence-bound.ipynb), we didn't import the corresponding `LinUCBTrainer`, but passed a "magic" string to Tune, `contrib/LinUCB`, which RLlib already knows how to associate with the corresponding `LinUCBTrainer` implementation. Passing the class explicitly, as we do here, is an alternative. The [RLlib environments documentation](https://docs.ray.io/en/latest/rllib-env.html) discusses these techniques.

In [1]:
import time
import numpy as np
import pandas as pd
import ray
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [2]:
from bokeh_util import plot_cumulative_regret, plot_wheel_bandit_model_weights
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [3]:
wbe = WheelBanditEnv()
wbe.config



{'delta': 0.5, 'mu_1': 1.2, 'mu_2': 1, 'mu_3': 50, 'std': 0.01}

The total number of time steps will be `training_iterations * timesteps_per_iteration == 2,000`, where `training_iterations` is `20` and the time steps per iteration is `100` by default.

In [4]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


What's in the standard config object for _LinTS_?

In [5]:
TS_CONFIG

{'num_workers': 0,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 1,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 1,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [256, 256],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': ray.rllib.contrib.bandits.envs.discrete.WheelBanditEnv,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr':

Initialize Ray...

In [6]:
!../../tools/start-ray.sh --check --verbose


INFO: Ray is not running. Run ../../tools/start-ray.sh with no options in a terminal window to start Ray.
INFO: (You can start a terminal in Jupyter. Click the + under the Edit menu.)



In [8]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:11135',
 'object_store_address': '/tmp/ray/session_2020-06-19_14-40-43_289473_44550/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-19_14-40-43_289473_44550/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-19_14-40-43_289473_44550'}

In [9]:
start_time = time.time()

analysis = ray.tune.run(
    LinTSTrainer,
    config=TS_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=2,
    checkpoint_at_end=True,
    verbose=2,              # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,    # Don't allow Tune to initialize Ray.
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=44561)[0m 2020-06-20 11:21:01,910	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=44561)[0m 2020-06-20 11:21:01,912	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=44561)[0m 2020-06-20 11:21:01,924	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=44566)[0m 2020-06-20 11:21:01,910	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=44566)[0m 2020-06-20 11:21:01,912	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=44566)[0m 2020-06-20 11:21:01,924	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-20_11-21-02
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:44566,3,0.526102,300,40.2003
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:44561,3,0.535795,300,35.7897


Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-20_11-21-04
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 50.0227460003925
  episode_reward_mean: 37.74865802450403
  episode_reward_min: 0.9789736950063161
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: f335bc2ce5654abab5b8b006e32c2913
  experiment_tag: '1'
  grad_time_ms: 0.236
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.236
    learner:
      cumulative_regret: 3774.2085419596924
      update_latency: 0.0001442432403564453
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 4243.1
    opt_samples: 1.0
    sample_peak_throughput: 1214.895
    sample_time_ms: 0.823
    update_time_ms: 0.001
  iterations_since_restore: 20
  learner:
    cumulative_regret: 3774.2085419596924
    update_latency: 0.0001442432403564453
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  off_policy_estimat

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,2.6097,2000,34.8097
LinTS_WheelBanditEnv_00001,TERMINATED,,20,2.62832,2000,37.7487


The trials took 7.3318190574646 seconds



Analyze cumulative regrets of the trials

In [10]:
# Combine the per-trial result DataFrames into one.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

regrets = df \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

In [11]:
regrets

num_steps_trained,mean,max,min,std
100,2060.226,2352.468031,1767.983968,413.292644
200,2237.348204,2506.42218,1968.274229,380.528065
300,2315.606592,2610.05111,2021.162073,416.407431
400,2345.716312,2616.469813,2074.962811,382.903274
500,2449.665841,2671.913107,2227.418574,314.305099
600,2479.899188,2727.12522,2232.673156,349.630407
700,2582.671408,2879.313302,2286.029514,419.51499
800,2661.054523,2933.503237,2388.605809,385.300666
900,2740.362677,2988.261121,2492.464232,350.581342
1000,2793.841588,2992.966167,2594.717009,281.60468
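
To see what this aggregation computes, here is a small sketch with two made-up trials. The regret numbers are hypothetical and chosen only to make the arithmetic easy to follow; the column names match the ones used above.

```python
import pandas as pd

# Two hypothetical trials reporting cumulative regret at the same steps.
trial_a = pd.DataFrame({
    "num_steps_trained": [100, 200],
    "learner/cumulative_regret": [10.0, 30.0],
})
trial_b = pd.DataFrame({
    "num_steps_trained": [100, 200],
    "learner/cumulative_regret": [20.0, 50.0],
})

toy = pd.concat([trial_a, trial_b], ignore_index=True)

# One row per step count, with statistics taken across the trials.
toy_regrets = (
    toy.groupby("num_steps_trained")["learner/cumulative_regret"]
    .aggregate(["mean", "max", "min", "std"])
)
print(toy_regrets)
```

Each row summarizes all trials at one step count. With only two trials (`num_samples=2`), `std` is the sample standard deviation of two values, so it is a rough indicator of the spread at best.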


In [12]:
plot_cumulative_regret(regrets)

As always, here is an [image](../../images/rllib/LinTS-Cumulative-Regret-05.png) from a previous run. How similar is your graph? We have observed a great deal of variability from one run to the next, more than we have seen with _LinUCB_. This suggests that extra caution is required when using _LinTS_ to ensure that good results are achieved.

Here is how you can restore a trainer from a checkpoint:

In [13]:
trial = analysis.trials[0]
trainer = LinTSTrainer(config=TS_CONFIG)
trainer.restore(trial.checkpoint.value)

2020-06-20 11:24:58,631	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-20 11:24:58,635	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-20 11:24:58,649	INFO trainable.py:217 -- Getting current IP.
2020-06-20 11:24:58,655	INFO trainable.py:217 -- Getting current IP.
2020-06-20 11:24:58,657	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-20_11-20-5780on2kxk/checkpoint_20/checkpoint-20
2020-06-20 11:24:58,657	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 2.6096982955932617, '_episodes_total': 2000}


Get the model so we can plot the distribution of the arm weights

In [14]:
model = trainer.get_policy().model
means = [model.arms[i].theta.numpy() for i in range(5)]
covs = [model.arms[i].covariance.numpy() for i in range(5)]
model, means, covs

(DiscreteLinearModelThompsonSampling(
   (arms): ModuleList(
     (0): OnlineLinearRegression()
     (1): OnlineLinearRegression()
     (2): OnlineLinearRegression()
     (3): OnlineLinearRegression()
     (4): OnlineLinearRegression()
   )
 ),
 [array([-0.46475023, -0.2449946 ], dtype=float32),
  array([43.755466, 46.074657], dtype=float32),
  array([-45.390217,  44.895374], dtype=float32),
  array([-44.445683, -44.668976], dtype=float32),
  array([ 43.503498, -44.165836], dtype=float32)],
 [array([[0.4216544, 0.2537617],
         [0.2537617, 0.5554002]], dtype=float32),
  array([[ 0.01255945, -0.00741833],
         [-0.00741833,  0.01186487]], dtype=float32),
  array([[0.01207665, 0.00747614],
         [0.00747614, 0.0125689 ]], dtype=float32),
  array([[ 0.01200655, -0.00679587],
         [-0.00679587,  0.01178853]], dtype=float32),
  array([[0.01293112, 0.00790997],
         [0.00790997, 0.01320421]], dtype=float32)])
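
The per-arm `theta` and `covariance` values suggest how Thompson sampling can choose an action. The sketch below assumes each arm holds a Gaussian over its weight vector, draws one weight sample per arm, scores the context with each sample, and picks the highest score. It is not RLlib's actual code: `thompson_select` is a hypothetical helper, and the numbers are rounded from the output above.

```python
import numpy as np

def thompson_select(context, means, covs, rng):
    """Pick an arm index by sampling each arm's weights once and scoring the context."""
    scores = []
    for mean, cov in zip(means, covs):
        theta = rng.multivariate_normal(mean, cov)  # one draw of the arm's weights
        scores.append(float(theta @ context))       # linear reward estimate
    return int(np.argmax(scores))

# Rounded from the model output above; arm index 0 corresponds to action 1.
means = [
    np.array([-0.46, -0.24]),
    np.array([43.76, 46.07]),
    np.array([-45.39, 44.90]),
    np.array([-44.45, -44.67]),
    np.array([43.50, -44.17]),
]
covs = [
    np.array([[0.42, 0.25], [0.25, 0.56]]),
    np.array([[0.0126, -0.0074], [-0.0074, 0.0119]]),
    np.array([[0.0121, 0.0075], [0.0075, 0.0126]]),
    np.array([[0.0120, -0.0068], [-0.0068, 0.0118]]),
    np.array([[0.0129, 0.0079], [0.0079, 0.0132]]),
]

rng = np.random.default_rng(0)
print(thompson_select(np.array([0.6, 0.6]), means, covs, rng))  # prints 1 (action 2)
```

The first arm's covariance is much larger than the others, so its sampled weights vary more from draw to draw. That sampling noise is the source of exploration in this approach.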

Plot weight distributions for different arms

In [15]:
plot_wheel_bandit_model_weights(means, covs)

[2m[36m(pid=44563)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/solutions/../market.tsv (config: {})
[2m[36m(pid=44562)[0m 2020-06-20 11:36:45,873	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=44562)[0m 2020-06-20 11:36:45,877	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=44562)[0m 2020-06-20 11:36:45,891	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=44567)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/solutions/../market.tsv (config: {})
[2m[36m(pid=44567)[0m 2020-06-20 11:36:45,872	INFO trainer.py:421 -- Tip: set '

Here is an [image](../../images/rllib/LinTS-Weight-Distribution-of-Arms-05.png) from a previous run. How similar is your graph?

## Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this exercise.