# Ray RLlib - Multi-Armed Bandits - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [1]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import random
from ray import tune
from ray.tune.progress_reporter import JupyterNotebookReporter
import time

## 03: Simple Multi-Armed Bandits - Exercise 1

First, set up a function to generate the rewards for n arms. To keep it somewhat simple, just use the original rewards for -1 in `SimpleBandit`, `[-10,0,10]` and repeat it as much as necessary, and optionally offset the start.
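One way to read the description above is a small helper that cycles through the base rewards until it has one value per arm. The sketch below is only an illustration: the name `rewards_for_n_arms` and its `offset` parameter are hypothetical and do not appear in the lesson code.

```python
from itertools import cycle, islice

BASE_REWARDS = [-10, 0, 10]  # Rewards quoted in the text for context -1.

def rewards_for_n_arms(n, offset=0):
    """Repeat BASE_REWARDS until n values exist, starting `offset` items in."""
    return list(islice(cycle(BASE_REWARDS), offset, offset + n))

print(rewards_for_n_arms(5))
print(rewards_for_n_arms(4, offset=1))
```

With three arms and no offset, the helper returns the original list unchanged.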

In [2]:
class SimpleContextualBandit2 (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        self.current_context = random.choice([-1.,1.])
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBandit2(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'
    

In [3]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [-1.  1.], bandit = SimpleContextualBandit2(action_space=Discrete(3), observation_space=Box(2,), current_context=1.0, rewards per context={-1.0: [-10, 0, 10], 1.0: [10, 0, -10]})'

In [5]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit2,
}
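The `stop` dictionary lists three separate criteria. The sketch below assumes that Tune ends a trial as soon as any one of the listed metrics reaches its threshold. `should_stop` is a hypothetical helper written for illustration, not part of Tune's API, and the sample results mirror the first two iterations reported by the run in the next cell.

```python
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

def should_stop(result, criteria):
    """Return True as soon as any reported metric reaches its threshold."""
    return any(
        result.get(name, float("-inf")) >= limit
        for name, limit in criteria.items()
    )

iteration_1 = {"training_iteration": 1, "timesteps_total": 100, "episode_reward_mean": 9.9}
iteration_2 = {"training_iteration": 2, "timesteps_total": 200, "episode_reward_mean": 10.0}

print(should_stop(iteration_1, stop))  # No criterion met yet.
print(should_stop(iteration_2, stop))  # The reward mean reached 10.0.
```

Under this reading, a trial that never reaches a reward mean of 10.0 keeps running until it hits 200 training iterations.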

In [6]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-08 13:58:52,018	INFO resource_spec.py:212 -- Starting Ray with 4.44 GiB memory available for workers and up to 2.22 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-08 13:58:52,344	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


Trial name,status,loc
contrib_LinUCB_SimpleContextualBandit2_00000,RUNNING,


[2m[36m(pid=13285)[0m 2020-06-08 13:59:00,475	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=13285)[0m 2020-06-08 13:59:00,478	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=13285)[0m 2020-06-08 13:59:00,486	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-08_13-59-00
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 91d770ae39234ca1bfd77582362df8dc
  experiment_tag: '0'
  grad_time_ms: 0.246
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.246
    learner:
      cumulative_regret: 10.0
      update_latency: 0.0001289844512939453
    num_steps_sampled: 100
    num_steps_trained: 100

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBandit2_00000,TERMINATED,,2,0.239915,200,10


The trials took 8.765048027038574 seconds



In [7]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.894,0.285,0.001,...,1118.72,1.0,10.0,0.00023,,,10.0,0.00023,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


It trains just as easily as the original implementation that didn't switch contexts between steps. Is this surprising? Probably not, because the relationship between the reward and the context remains linear, so what LinUCB learns for one context is correct for the second context, too. Also, _Tune_ runs many episodes, so it studies both contexts.
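To make the linearity claim concrete, the check below fits one weight vector per arm by least squares and confirms that it reproduces the rewards in both contexts exactly. It assumes each arm is modeled as reward ≈ θ · observation with no intercept term. That is an assumption about the model form, not a description of RLlib's internals, and `fit_arm` is a hypothetical helper.

```python
import numpy as np

rewards_for_context = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}

def fit_arm(arm):
    """Least-squares fit of reward = theta . observation for one arm (no intercept)."""
    contexts = sorted(rewards_for_context)
    X = np.array([[-c, c] for c in contexts])   # Same layout as the env's observations.
    y = np.array([rewards_for_context[c][arm] for c in contexts], dtype=float)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta, X @ theta, y

for arm in range(3):
    theta, predicted, actual = fit_arm(arm)
    print(f"arm {arm}: theta={theta}, predicted={predicted}, actual={actual}")
```

Because a single weight vector per arm fits both contexts, a sample seen in one context also improves the prediction for the other.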

## 03: Simple Multi-Armed Bandits - Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.
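As a rough illustration of "a set of linear predictors", here is a minimal from-scratch sketch of the textbook disjoint LinUCB rule: one ridge-regression model per arm, plus an exploration bonus. It is not RLlib's implementation. The class name `TinyLinUCB`, the `alpha` value, and the simulation loop are assumptions made for this example, which reuses the linear rewards recalled above and redraws the context at every step.

```python
import random
import numpy as np

class TinyLinUCB:
    """Illustrative disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                               # Exploration weight (assumed value).
        self.A = [np.eye(dim) for _ in range(n_arms)]    # Per-arm regularized Gram matrices.
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # Per-arm reward-weighted context sums.

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # Current linear estimate for this arm.
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # Upper-confidence exploration bonus.
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

rewards_for_context = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}
rng = random.Random(0)
agent = TinyLinUCB(n_arms=3, dim=2)
rewards = []
for _ in range(200):
    context = rng.choice([-1., 1.])
    x = np.array([-context, context])   # Same observation layout as the env.
    arm = agent.select(x)
    reward = rewards_for_context[context][arm]
    agent.update(arm, x, reward)
    rewards.append(reward)

total_regret = sum(10 - r for r in rewards)
print(f"total regret = {total_regret}, mean of last 100 rewards = {np.mean(rewards[-100:])}")
```

With these linear rewards, the sketch makes at most a couple of exploratory mistakes before it settles on the best arm in each context.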

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Also remove the change made for exercise 1, the line `self.current_context = random.choice([-1.,1.])` in the `step` method.

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

In [8]:
class SimpleContextualBanditNonlinear (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {   # Changed here:
            -1.: [-10, 10, 0],
            1.: [0, 10, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBanditNonlinear(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

In [9]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBanditNonlinear(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 10, 0], 1.0: [0, 10, -10]})'

In [10]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}


In [11]:
# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

In [12]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,


[2m[36m(pid=13289)[0m 2020-06-08 14:01:10,502	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=13289)[0m 2020-06-08 14:01:10,505	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=13289)[0m 2020-06-08 14:01:10,512	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-10
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.4
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.255
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.255
    learner:
      cumulative_regret: 460.0
      update_latency: 0.00013589859008789062
    num_steps_sampled: 100
    num_steps_

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,17,1.88235,1700,5.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-15
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 4400
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.306
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.306
    learner:
      cumulative_regret: 21780.0
      update_latency: 0.0001728534698486328
    num_steps_sampled: 4400
    num_steps_trained: 4400
    opt_peak_throughput: 3264.304
    opt_samples: 1.0
    sample_peak_throughput: 1297.462
    sample_time_ms: 0.771
    update_time_ms: 0.001
  iterations_since_restore: 44
  learner:
    cumulative_regret: 21780.0
    update_latency: 0.0001728534698486328
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4400
  num_steps_trained: 4400
  off_policy_estimator: {}
  opt_peak_throughput: 3264

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,60,6.5581,6000,4.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-20
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.8
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 8700
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.321
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.321
    learner:
      cumulative_regret: 42800.0
      update_latency: 0.000209808349609375
    num_steps_sampled: 8700
    num_steps_trained: 8700
    opt_peak_throughput: 3111.501
    opt_samples: 1.0
    sample_peak_throughput: 1437.045
    sample_time_ms: 0.696
    update_time_ms: 0.001
  iterations_since_restore: 87
  learner:
    cumulative_regret: 42800.0
    update_latency: 0.000209808349609375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 8700
  num_steps_trained: 8700
  off_policy_estimator: {}
  opt_peak_throughput: 3111.5

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,102,11.2425,10200,5.1


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-25
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.4
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 12800
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.406
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.406
    learner:
      cumulative_regret: 63530.0
      update_latency: 0.00024175643920898438
    num_steps_sampled: 12800
    num_steps_trained: 12800
    opt_peak_throughput: 2461.59
    opt_samples: 1.0
    sample_peak_throughput: 1320.625
    sample_time_ms: 0.757
    update_time_ms: 0.002
  iterations_since_restore: 128
  learner:
    cumulative_regret: 63530.0
    update_latency: 0.00024175643920898438
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 12800
  num_steps_trained: 12800
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,142,15.898,14200,5.5


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-30
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 16700
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.38
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.38
    learner:
      cumulative_regret: 83060.0
      update_latency: 0.00027298927307128906
    num_steps_sampled: 16700
    num_steps_trained: 16700
    opt_peak_throughput: 2634.613
    opt_samples: 1.0
    sample_peak_throughput: 1390.085
    sample_time_ms: 0.719
    update_time_ms: 0.001
  iterations_since_restore: 167
  learner:
    cumulative_regret: 83060.0
    update_latency: 0.00027298927307128906
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 16700
  num_steps_trained: 16700
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,181,20.6512,18100,5.2


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-35
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.7
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 20000
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.445
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.445
    learner:
      cumulative_regret: 99230.0
      update_latency: 0.0003299713134765625
    num_steps_sampled: 20000
    num_steps_trained: 20000
    opt_peak_throughput: 2245.946
    opt_samples: 1.0
    sample_peak_throughput: 1244.97
    sample_time_ms: 0.803
    update_time_ms: 0.001
  iterations_since_restore: 200
  learner:
    cumulative_regret: 99230.0
    update_latency: 0.0003299713134765625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 20000
  num_steps_trained: 20000
  off_policy_estimator: {}
  opt_peak_throughput: 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,22.9847,20000,5.7


The trials took 27.73778510093689 seconds



In [13]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,5.7,1.0,100,20000,20000,0.803,0.445,0.001,...,1244.97,1.0,99230.0,0.00033,22.1,66.1,99230.0,0.00033,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinUCB/...


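It ran the maximum of 200 training iterations (20,000 steps), and the reward mean never approached 10.0. In the output above it fluctuates between about 4.6 and 5.7, ending at 5.7. The `episode_reward_mean` plot is chaotic:

![Nonlinear model with LinUCB](../../../images/rllib/TensorBoard2.png)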

Because LinUCB expects a linear relationship between the context and each reward, it's not surprising that it fails to converge to the desired reward mean.
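The sketch below shows why no linear fit exists for these rewards. It assumes each arm is modeled as reward ≈ θ · observation with no intercept term, which is an assumption about the model form rather than a statement about RLlib's internals. `best_linear_predictions` is a hypothetical helper that returns the closest least-squares predictions for one arm.

```python
import numpy as np

nonlinear_rewards = {-1.: [-10, 10, 0], 1.: [0, 10, -10]}
contexts = sorted(nonlinear_rewards)
X = np.array([[-c, c] for c in contexts])   # Observations for context -1 and context 1.

def best_linear_predictions(arm):
    """Return (predicted, actual) rewards for the best no-intercept linear fit."""
    y = np.array([nonlinear_rewards[c][arm] for c in contexts], dtype=float)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ theta, y

for arm in range(3):
    predicted, actual = best_linear_predictions(arm)
    worst = np.max(np.abs(predicted - actual))
    print(f"arm {arm}: predicted={predicted}, actual={actual}, worst error={worst:.1f}")
```

The two observations are negatives of each other, so a no-intercept linear model must predict opposite-signed rewards for them. Arm 1 pays 10 in both contexts, which such a model cannot represent, and its best fit predicts roughly 0 for both.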

## 03: Simple Multi-Armed Bandits - Exercise 3

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](../02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using the linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.
```

Change `contrib/LinUCB` to `contrib/LinTS`.  
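For intuition about what changes when you switch algorithms, here is a minimal from-scratch sketch of the textbook Linear Thompson Sampling rule. Instead of adding a confidence bonus, it samples each arm's weight vector from a Gaussian posterior and picks the arm whose sampled model predicts the highest reward. It is not RLlib's implementation. The class name `TinyLinTS`, the scale `v`, the seeds, and the simulation loop are assumptions made for this example.

```python
import random
import numpy as np

class TinyLinTS:
    """Illustrative Linear Thompson Sampling: one Gaussian posterior per arm."""

    def __init__(self, n_arms, dim, v=1.0, seed=0):
        self.v = v                                       # Posterior scale (assumed value).
        self.rng = np.random.default_rng(seed)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # Per-arm precision-like matrices.
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # Per-arm reward-weighted context sums.

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            cov = self.v ** 2 * A_inv
            cov = (cov + cov.T) / 2                      # Keep the covariance symmetric.
            theta = self.rng.multivariate_normal(A_inv @ b, cov)  # Sample a plausible model.
            scores.append(theta @ x)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

rewards_for_context = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}
context_rng = random.Random(0)
agent = TinyLinTS(n_arms=3, dim=2)
rewards = []
for _ in range(500):
    context = context_rng.choice([-1., 1.])
    x = np.array([-context, context])   # Same observation layout as the env.
    arm = agent.select(x)
    reward = rewards_for_context[context][arm]
    agent.update(arm, x, reward)
    rewards.append(reward)

print(f"mean of last 100 rewards = {np.mean(rewards[-100:])}")
```

Both rules keep the same per-arm statistics. They differ only in how they turn uncertainty into exploration.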

In [14]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBandit2,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinTS_SimpleContextualBandit2_00000,RUNNING,


(pid=13291) 2020-06-08 14:02:51,052	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=13291) 2020-06-08 14:02:51,056	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=13291) 2020-06-08 14:02:51,063	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-08_14-02-51
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 481647293a464969a3d73baab12f468b
  experiment_tag: '0'
  grad_time_ms: 0.25
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.25
    learner:
      cumulative_regret: 10.0
      update_latency: 0.00013208389282226562
    num_steps_sampled: 100
    num_steps_trained: 100
 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBandit2_00000,TERMINATED,,2,0.213528,200,10


The trials took 3.132218837738037 seconds



In [15]:
df = analysis.dataframe()
df



Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.732,0.232,0.001,...,1366.312,1.0,10.0,0.000131,,,10.0,0.000131,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinTS/c...


As before, the training takes only 200 steps and converges to the desired reward mean of `10.0`.

Now let's try the nonlinear bandit:

In [None]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

In [16]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,-10.0,4.5,1.0,100,20000,20000,0.74,0.446,0.003,...,1350.692,1.0,100750.0,0.00042,16.9,68.7,100750.0,0.00042,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinTS/c...


This run with Thompson sampling yields similar results: the reward mean ends at about 4.5 and stays chaotic over the full 20,000 steps, as you can see in TensorBoard, so it also fails to converge.
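The two nonlinear runs can also be compared over their full length using the reported `cumulative_regret` values. The arithmetic below assumes that `cumulative_regret` is the running sum of the per-step `regret` value returned by the environment (`10 - reward`), so the average reward per step is 10 minus the average regret. The dictionary and variable names are made up for this example.

```python
# Final values copied from the two dataframe outputs for the nonlinear bandit.
reported = {
    "LinUCB": {"cumulative_regret": 99230.0, "steps": 20000},
    "LinTS": {"cumulative_regret": 100750.0, "steps": 20000},
}
BEST_REWARD = 10  # Best achievable reward per step in either context.

average_reward = {
    name: BEST_REWARD - run["cumulative_regret"] / run["steps"]
    for name, run in reported.items()
}
for name, value in average_reward.items():
    print(f"{name}: average reward over the whole run = {value:.2f}")
```

Under that assumption, both algorithms average roughly 5 per step across the whole run, about half of the best achievable reward.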