# Ray RLlib - Multi-Armed Bandits - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [1]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import pandas as pd
import os, time, random
from ray import tune
from ray.tune.progress_reporter import JupyterNotebookReporter

## 03: Simple Multi-Armed Bandits - Exercise 1

First, set up a function to generate the rewards for n arms. To keep it simple, reuse the original rewards for context `-1` in `SimpleBandit`, `[-10, 0, 10]`, repeating them as often as necessary and optionally offsetting the start.

In [2]:
class SimpleContextualBandit2 (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # 2-D observation (-c, c), where the context c is -1 or 1
        self.current_context = None
        self.rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

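    def step (self, action):
        # The reward depends on the context that was shown to the agent for this step.
        reward = self.rewards_for_context[self.current_context][action]
        # Exercise 1 change: draw a new context for the next observation.
        self.current_context = random.choice([-1.,1.])
        # Each episode is a single step, so `done` is always True.
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })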

    def __repr__(self):
        return f'SimpleContextualBandit2(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'
    

In [3]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBandit2(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 0, 10], 1.0: [10, 0, -10]})'

In [4]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit2,
}
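
The `stop` dictionary can be read as "whichever comes first". The sketch below mimics that rule in plain Python. `should_stop` is a hypothetical helper written for illustration, not part of Tune's API, and it assumes a trial ends as soon as any reported metric reaches its threshold. The three sample results are made-up values of the kind a trial reports.

```python
# Hypothetical helper, for illustration only (not part of Tune's API).
# Assumption: a trial stops as soon as ANY metric in `stop` reaches its threshold.
def should_stop(result, stop):
    return any(key in result and result[key] >= threshold
               for key, threshold in stop.items())

stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

# Reaches the target reward mean after two iterations.
early = {"training_iteration": 2, "timesteps_total": 200, "episode_reward_mean": 10.0}
# Has not reached any threshold yet.
running = {"training_iteration": 50, "timesteps_total": 5000, "episode_reward_mean": 4.5}
# Hits the iteration limit without reaching the reward target.
exhausted = {"training_iteration": 200, "timesteps_total": 20000, "episode_reward_mean": 5.2}

print(should_stop(early, stop), should_stop(running, stop), should_stop(exhausted, stop))
# prints: True False True
```

With these criteria, a trial that reaches a reward mean of 10.0 ends early, while one that never gets there runs until the 200-iteration limit.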

In [5]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-10 20:26:19,461	INFO resource_spec.py:212 -- Starting Ray with 4.15 GiB memory available for workers and up to 2.08 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-10 20:26:19,780	INFO services.py:1170 -- View the Ray dashboard at localhost:8267


Trial name,status,loc
contrib_LinUCB_SimpleContextualBandit2_00000,RUNNING,


[2m[36m(pid=89086)[0m 2020-06-10 20:26:29,643	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89086)[0m 2020-06-10 20:26:29,647	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89086)[0m 2020-06-10 20:26:29,655	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-29
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 8d581c013a78430f925515d40e8004da
  experiment_tag: '0'
  grad_time_ms: 0.262
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.262
    learner:
      cumulative_regret: 10.0
      update_latency: 0.0001342296600341797
    num_steps_sampled: 100
    num_steps_trained: 100

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBandit2_00000,TERMINATED,,2,0.215459,200,10


The trials took 10.468096017837524 seconds



In [6]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.798,0.254,0.001,...,1252.405,1.0,10.0,0.000151,,,10.0,0.000151,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


It trains just as easily as the original implementation that didn't switch contexts between steps. Is this surprising? Probably not, because the relationship between the reward and the context remains linear, so what LinUCB learns for one context is correct for the second context, too. Also, _Tune_ runs many episodes, so it studies both contexts.
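
One way to see the linearity: the two observations are `[1, -1]` and `[-1, 1]`, that is, `x` and `-x`. A predictor of the form `theta · x` (assuming no bias term, which is an assumption about the model rather than a statement about RLlib's internals) must give opposite values for `x` and `-x`. So the reward table is exactly representable when each arm's reward flips sign between the two contexts. A quick check, using a hypothetical helper `is_odd_in_context`:

```python
rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}

# Hypothetical helper: True if every arm's reward changes sign when the context flips.
def is_odd_in_context(table):
    return all(r_neg == -r_pos for r_neg, r_pos in zip(table[-1.], table[1.]))

print(is_odd_in_context(rewards_for_context))
# prints: True
```

Because the property holds, whatever an arm's predictor learns from one context also gives the right prediction for the other.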

## 03: Simple Multi-Armed Bandits - Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.
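
To make "a set of linear predictors" concrete, here is a minimal sketch of the textbook "disjoint" form of LinUCB, with one ridge-regression predictor per arm. It is not RLlib's implementation. The class name `LinUCBArm`, the exploration weight `alpha=1.0`, and the identity regularizer are all assumptions chosen for illustration.

```python
import random
import numpy as np

REWARDS = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}   # the linear reward table

class LinUCBArm:
    """One ridge-regression predictor per arm (textbook 'disjoint' LinUCB)."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)       # regularized Gram matrix: I + sum of x x^T
        self.b = np.zeros(dim)     # sum of reward * x
        self.alpha = alpha         # exploration strength (assumed value)

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b     # current linear estimate for this arm
        # Predicted reward plus an uncertainty bonus that shrinks with more data.
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

random.seed(0)
arms = [LinUCBArm(dim=2) for _ in range(3)]
total_regret = 0
for _ in range(200):
    context = random.choice([-1., 1.])
    x = np.array([-context, context])             # observation, as in the env
    action = int(np.argmax([arm.ucb(x) for arm in arms]))
    reward = REWARDS[context][action]
    arms[action].update(x, reward)
    total_regret += 10 - reward                   # same regret definition as the env

print("cumulative regret after 200 steps:", total_regret)
```

On this table, each arm's estimate generalizes from one context to the other, so the sketch needs only a handful of exploratory pulls before it settles on the best arm for both contexts.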

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Also remove the change made for exercise 1, the line `self.current_context = random.choice([-1.,1.])` in the `step` method.

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

In [7]:
class SimpleContextualBanditNonlinear (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {   # Changed here:
            -1.: [-10, 10, 0],
            1.: [0, 10, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBanditNonlinear(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

In [8]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBanditNonlinear(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 10, 0], 1.0: [0, 10, -10]})'

In [9]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

current_context = -1.0
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 1, reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], action = 1, reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
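
The random actions above suggest a useful baseline: how well does a policy do if it picks each arm with equal probability? The sketch below computes the expected reward and regret of such a uniform policy directly from the reward table. The names `BEST_REWARD` and `mean_regret` are introduced here for illustration.

```python
rewards_for_context = {-1.: [-10, 10, 0], 1.: [0, 10, -10]}
BEST_REWARD = 10   # the best achievable reward in either context

mean_regret = {}
for context, rewards in rewards_for_context.items():
    mean_reward = sum(rewards) / len(rewards)     # uniform choice over the 3 arms
    mean_regret[context] = BEST_REWARD - mean_reward
    print(f"context {context:+.0f}: mean reward = {mean_reward}, mean regret = {mean_regret[context]}")
```

A uniformly random policy averages a reward of 0 (regret 10) in both contexts, which gives a reference point for judging the trained agent.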


In [10]:
# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

In [11]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,


[2m[36m(pid=89082)[0m 2020-06-10 20:26:33,671	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89082)[0m 2020-06-10 20:26:33,675	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89082)[0m 2020-06-10 20:26:33,696	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-33
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.232
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.232
    learner:
      cumulative_regret: 500.0
      update_latency: 0.00012922286987304688
    num_steps_sampled: 100
    num_steps_

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,12,1.24188,1200,5.5


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-38
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 4000
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.347
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.347
    learner:
      cumulative_regret: 19950.0
      update_latency: 0.00017189979553222656
    num_steps_sampled: 4000
    num_steps_trained: 4000
    opt_peak_throughput: 2877.936
    opt_samples: 1.0
    sample_peak_throughput: 1223.864
    sample_time_ms: 0.817
    update_time_ms: 0.001
  iterations_since_restore: 40
  learner:
    cumulative_regret: 19950.0
    update_latency: 0.00017189979553222656
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4000
  num_steps_trained: 4000
  off_policy_estimator: {}
  opt_peak_throughput: 28

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,50,5.98747,5000,4.5


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-43
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.7
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 7200
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.362
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.362
    learner:
      cumulative_regret: 35730.0
      update_latency: 0.0002009868621826172
    num_steps_sampled: 7200
    num_steps_trained: 7200
    opt_peak_throughput: 2761.591
    opt_samples: 1.0
    sample_peak_throughput: 1268.733
    sample_time_ms: 0.788
    update_time_ms: 0.001
  iterations_since_restore: 72
  learner:
    cumulative_regret: 35730.0
    update_latency: 0.0002009868621826172
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7200
  num_steps_trained: 7200
  off_policy_estimator: {}
  opt_peak_throughput: 2761

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,81,10.7096,8100,5.2


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-48
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.8
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 10900
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.37
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.37
    learner:
      cumulative_regret: 53930.0
      update_latency: 0.00030112266540527344
    num_steps_sampled: 10900
    num_steps_trained: 10900
    opt_peak_throughput: 2702.864
    opt_samples: 1.0
    sample_peak_throughput: 1422.859
    sample_time_ms: 0.703
    update_time_ms: 0.001
  iterations_since_restore: 109
  learner:
    cumulative_regret: 53930.0
    update_latency: 0.00030112266540527344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 10900
  num_steps_trained: 10900
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,119,15.4452,11900,4.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-54
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 14800
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.382
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.382
    learner:
      cumulative_regret: 73670.0
      update_latency: 0.0002741813659667969
    num_steps_sampled: 14800
    num_steps_trained: 14800
    opt_peak_throughput: 2618.658
    opt_samples: 1.0
    sample_peak_throughput: 1361.212
    sample_time_ms: 0.735
    update_time_ms: 0.001
  iterations_since_restore: 148
  learner:
    cumulative_regret: 73670.0
    update_latency: 0.0002741813659667969
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 14800
  num_steps_trained: 14800
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,156,20.1823,15600,5.2


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-26-59
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.5
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 18300
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.421
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.421
    learner:
      cumulative_regret: 91230.0
      update_latency: 0.0002949237823486328
    num_steps_sampled: 18300
    num_steps_trained: 18300
    opt_peak_throughput: 2373.955
    opt_samples: 1.0
    sample_peak_throughput: 1410.609
    sample_time_ms: 0.709
    update_time_ms: 0.001
  iterations_since_restore: 183
  learner:
    cumulative_regret: 91230.0
    update_latency: 0.0002949237823486328
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 18300
  num_steps_trained: 18300
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89082,192,24.9043,19200,4.5


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-01
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.2
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 20000
  experiment_id: 831cfafc3876426793903ba9b6d1231e
  experiment_tag: '0'
  grad_time_ms: 0.461
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.461
    learner:
      cumulative_regret: 99810.0
      update_latency: 0.0006542205810546875
    num_steps_sampled: 20000
    num_steps_trained: 20000
    opt_peak_throughput: 2168.944
    opt_samples: 1.0
    sample_peak_throughput: 1215.811
    sample_time_ms: 0.822
    update_time_ms: 0.002
  iterations_since_restore: 200
  learner:
    cumulative_regret: 99810.0
    update_latency: 0.0006542205810546875
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 20000
  num_steps_trained: 20000
  off_policy_estimator: {}
  opt_peak_throughput:

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,25.9232,20000,5.2


The trials took 31.534225702285767 seconds



In [12]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,5.2,1.0,100,20000,20000,0.822,0.461,0.002,...,1215.811,1.0,99810.0,0.000654,,,99810.0,0.000654,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinUCB/...
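
The cumulative regret in this table gives a steadier summary than any single iteration's reward mean. Assuming the learner's `cumulative_regret` is the running sum of the per-step `regret` the environment reports (`10 - reward`), the average over the whole run works out as follows:

```python
cumulative_regret = 99810.0   # learner/cumulative_regret from the table above
steps = 20000                 # num_steps_trained from the table above

mean_regret = cumulative_regret / steps
mean_reward = 10 - mean_regret   # because regret = 10 - reward at every step

print(f"mean regret per step = {mean_regret:.4f}, implied mean reward = {mean_reward:.4f}")
# prints: mean regret per step = 4.9905, implied mean reward = 5.0095
```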


It ran for the maximum of 200 training iterations (20,000 steps), and the reward mean never gets close to 10.0; it fluctuates around 5 (5.2 in the final iteration). The `episode_reward_mean` is chaotic:

![Nonlinear model with LinUCB](../../../images/rllib/TensorBoard2.png).

Because LinUCB expects a linear relationship between the context and each reward, it's not surprising that it fails to converge to the desired reward mean.
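
The mismatch can be checked numerically. The sketch below fits the best least-squares predictor of the form `theta · x` to each arm of the modified reward table, assuming no bias term (an assumption about the model, not a statement about RLlib's internals), and reports how far the best linear prediction is from the actual rewards.

```python
import numpy as np

rewards_for_context = {-1.: [-10, 10, 0], 1.: [0, 10, -10]}
contexts = [-1., 1.]
X = np.array([[-c, c] for c in contexts])   # observations [1, -1] and [-1, 1]

fit_error = {}
for arm in range(3):
    y = np.array([rewards_for_context[c][arm] for c in contexts], dtype=float)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # best theta with no bias term
    prediction = X @ theta
    fit_error[arm] = float(np.max(np.abs(prediction - y)))
    print(f"arm {arm}: actual = {y}, best linear prediction = {prediction}")

print("largest prediction error per arm:", fit_error)
```

Every arm has a nonzero error. Arm 1 is the clearest case: it pays 10 in both contexts, but a bias-free linear predictor must give opposite values for `x` and `-x`, so its best guess is 0 for both.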

## 03: Simple Multi-Armed Bandits - Exercise 3

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](../02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using the linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.
```

Change `contrib/LinUCB` to `contrib/LinTS`.  
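
For intuition about what changes between the two algorithms, here is a minimal sketch of textbook Linear Thompson Sampling on the linear reward table. Instead of adding an uncertainty bonus to each arm's estimate, it samples a weight vector from each arm's Gaussian posterior and acts greedily on the samples. This is not RLlib's implementation. The class name `LinTSArm`, the identity prior, and the unit posterior scale are assumptions made for illustration.

```python
import random
import numpy as np

REWARDS = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}   # linear table from exercise 1

class LinTSArm:
    """Gaussian posterior over one arm's weights (textbook Linear Thompson Sampling)."""
    def __init__(self, dim):
        self.A = np.eye(dim)       # posterior precision: I + sum of x x^T
        self.b = np.zeros(dim)     # sum of reward * x

    def mean(self):
        return np.linalg.solve(self.A, self.b)

    def sample_score(self, x, rng):
        cov = np.linalg.inv(self.A)
        cov = (cov + cov.T) / 2    # keep the covariance exactly symmetric
        theta = rng.multivariate_normal(self.mean(), cov)   # one posterior draw
        return theta @ x

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

random.seed(0)
rng = np.random.default_rng(0)
arms = [LinTSArm(dim=2) for _ in range(3)]
total_regret = 0
for _ in range(200):
    context = random.choice([-1., 1.])
    x = np.array([-context, context])             # observation, as in the env
    action = int(np.argmax([arm.sample_score(x, rng) for arm in arms]))
    reward = REWARDS[context][action]
    arms[action].update(x, reward)
    total_regret += 10 - reward

print("cumulative regret after 200 steps:", total_regret)
```

Both methods maintain the same per-arm linear estimate; they differ only in how they turn uncertainty into exploration, so they share the same linearity assumption.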

In [13]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBandit2,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinTS_SimpleContextualBandit2_00000,RUNNING,


[2m[36m(pid=89080)[0m 2020-06-10 20:27:05,179	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89080)[0m 2020-06-10 20:27:05,182	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89080)[0m 2020-06-10 20:27:05,192	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-05
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 10.0
  episode_reward_min: 10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 8cca3aff318a455890dbbd10281d2760
  experiment_tag: '0'
  grad_time_ms: 0.255
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.255
    learner:
      cumulative_regret: 0.0
      update_latency: 0.00024199485778808594
    num_steps_sampled: 100
    num_steps_trained: 100

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBandit2_00000,TERMINATED,,1,0.170974,100,10


The trials took 3.844277858734131 seconds



In [14]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,100,100,0.747,0.255,0.001,...,1339.22,1.0,0.0,0.000242,31.2,64.1,0.0,0.000242,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinTS/c...


As before, training converges almost immediately to the desired reward mean of `10.0`, this time after a single iteration of 100 steps.

Now let's try the nonlinear bandit:

In [15]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,


[2m[36m(pid=89084)[0m 2020-06-10 20:27:08,545	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89084)[0m 2020-06-10 20:27:08,549	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89084)[0m 2020-06-10 20:27:08,559	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-08
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.6
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.311
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.311
    learner:
      cumulative_regret: 540.0
      update_latency: 0.00014400482177734375
    num_steps_sampled: 100
    num_steps_t

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,16,1.7827,1600,5.1


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-13
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 3800
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.373
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.373
    learner:
      cumulative_regret: 19280.0
      update_latency: 0.0002040863037109375
    num_steps_sampled: 3800
    num_steps_trained: 3800
    opt_peak_throughput: 2680.409
    opt_samples: 1.0
    sample_peak_throughput: 1177.348
    sample_time_ms: 0.849
    update_time_ms: 0.001
  iterations_since_restore: 38
  learner:
    cumulative_regret: 19280.0
    update_latency: 0.0002040863037109375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3800
  num_steps_trained: 3800
  off_policy_estimator: {}
  opt_peak_throughput: 2680.

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,51,6.42993,5100,4.5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-18
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.6
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 7500
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.337
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.337
    learner:
      cumulative_regret: 38160.0
      update_latency: 0.0002791881561279297
    num_steps_sampled: 7500
    num_steps_trained: 7500
    opt_peak_throughput: 2969.419
    opt_samples: 1.0
    sample_peak_throughput: 1429.503
    sample_time_ms: 0.7
    update_time_ms: 0.001
  iterations_since_restore: 75
  learner:
    cumulative_regret: 38160.0
    update_latency: 0.0002791881561279297
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7500
  num_steps_trained: 7500
  off_policy_estimator: {}
  opt_peak_throughput: 2969.41

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,83,11.199,8300,4.8


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-23
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.4
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 10300
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.548
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.548
    learner:
      cumulative_regret: 52350.0
      update_latency: 0.00028586387634277344
    num_steps_sampled: 10300
    num_steps_trained: 10300
    opt_peak_throughput: 1824.166
    opt_samples: 1.0
    sample_peak_throughput: 805.25
    sample_time_ms: 1.242
    update_time_ms: 0.001
  iterations_since_restore: 103
  learner:
    cumulative_regret: 52350.0
    update_latency: 0.00028586387634277344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 10300
  num_steps_trained: 10300
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,112,15.9366,11200,5.6


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-29
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 12500
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.494
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.494
    learner:
      cumulative_regret: 63370.0
      update_latency: 0.00031304359436035156
    num_steps_sampled: 12500
    num_steps_trained: 12500
    opt_peak_throughput: 2024.473
    opt_samples: 1.0
    sample_peak_throughput: 1021.357
    sample_time_ms: 0.979
    update_time_ms: 0.001
  iterations_since_restore: 125
  learner:
    cumulative_regret: 63370.0
    update_latency: 0.00031304359436035156
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 12500
  num_steps_trained: 12500
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,130,20.8831,13000,6


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-34
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.6
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 13500
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 10.051
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 10.051
    learner:
      cumulative_regret: 68550.0
      update_latency: 0.0007760524749755859
    num_steps_sampled: 13500
    num_steps_trained: 13500
    opt_peak_throughput: 99.488
    opt_samples: 1.0
    sample_peak_throughput: 300.06
    sample_time_ms: 3.333
    update_time_ms: 0.003
  iterations_since_restore: 135
  learner:
    cumulative_regret: 68550.0
    update_latency: 0.0007760524749755859
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 13500
  num_steps_trained: 13500
  off_policy_estimator: {}
  opt_peak_throughput:

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,138,26.5052,13800,5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-39
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.3
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 14000
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 8.026
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 8.026
    learner:
      cumulative_regret: 71030.0
      update_latency: 0.0008411407470703125
    num_steps_sampled: 14000
    num_steps_trained: 14000
    opt_peak_throughput: 124.601
    opt_samples: 1.0
    sample_peak_throughput: 97.781
    sample_time_ms: 10.227
    update_time_ms: 0.003
  iterations_since_restore: 140
  learner:
    cumulative_regret: 71030.0
    update_latency: 0.0008411407470703125
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 14000
  num_steps_trained: 14000
  off_policy_estimator: {}
  opt_peak_throughput: 1

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,146,31.9249,14600,5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-44
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.8
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 14800
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 1.82
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.82
    learner:
      cumulative_regret: 75270.0
      update_latency: 0.0006432533264160156
    num_steps_sampled: 14800
    num_steps_trained: 14800
    opt_peak_throughput: 549.46
    opt_samples: 1.0
    sample_peak_throughput: 201.683
    sample_time_ms: 4.958
    update_time_ms: 0.003
  iterations_since_restore: 148
  learner:
    cumulative_regret: 75270.0
    update_latency: 0.0006432533264160156
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 14800
  num_steps_trained: 14800
  off_policy_estimator: {}
  opt_peak_throughput: 549.

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,151,36.9188,15100,4


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-49
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.4
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 15400
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 1.052
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.052
    learner:
      cumulative_regret: 78500.0
      update_latency: 0.0005450248718261719
    num_steps_sampled: 15400
    num_steps_trained: 15400
    opt_peak_throughput: 950.55
    opt_samples: 1.0
    sample_peak_throughput: 583.458
    sample_time_ms: 1.714
    update_time_ms: 0.002
  iterations_since_restore: 154
  learner:
    cumulative_regret: 78500.0
    update_latency: 0.0005450248718261719
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 15400
  num_steps_trained: 15400
  off_policy_estimator: {}
  opt_peak_throughput: 95

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,158,41.9607,15800,4.8


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-54
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.1
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 16100
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.697
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.697
    learner:
      cumulative_regret: 82160.0
      update_latency: 0.0004942417144775391
    num_steps_sampled: 16100
    num_steps_trained: 16100
    opt_peak_throughput: 1434.49
    opt_samples: 1.0
    sample_peak_throughput: 870.766
    sample_time_ms: 1.148
    update_time_ms: 0.001
  iterations_since_restore: 161
  learner:
    cumulative_regret: 82160.0
    update_latency: 0.0004942417144775391
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 16100
  num_steps_trained: 16100
  off_policy_estimator: {}
  opt_peak_throughput: 1

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:89084,176,46.6483,17600,4.1


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-10_20-27-59
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 6.3
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 18800
  experiment_id: ba83bb44ce2649f8b5647d19f07502ba
  experiment_tag: '0'
  grad_time_ms: 0.576
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.576
    learner:
      cumulative_regret: 95430.0
      update_latency: 0.0004210472106933594
    num_steps_sampled: 18800
    num_steps_trained: 18800
    opt_peak_throughput: 1737.563
    opt_samples: 1.0
    sample_peak_throughput: 1144.921
    sample_time_ms: 0.873
    update_time_ms: 0.004
  iterations_since_restore: 188
  learner:
    cumulative_regret: 95430.0
    update_latency: 0.0004210472106933594
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 18800
  num_steps_trained: 18800
  off_policy_estimator: {}
  opt_peak_throughput:

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,50.1301,20000,4.7


The trials took 56.41524696350098 seconds



In [16]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,4.7,1.0,100,20000,20000,0.804,0.455,0.001,...,1243.457,1.0,101700.0,0.000336,,,101700.0,0.000336,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinTS/c...


This run with Thompson sampling yields similar results: the mean reward is about 4.5, and the behavior remains fairly chaotic over 20,000 steps, as shown in the TensorBoard graph.
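One way to quantify how little the policy improves is to divide the logged cumulative regret by the number of steps trained. The sketch below uses only the `num_steps_trained` and `cumulative_regret` pairs that appear in the output above; the variable names are illustrative and not part of the lesson code.

```python
# (num_steps_trained, cumulative_regret) pairs copied from the logged results above.
logged = [
    (14000, 71030.0),
    (14800, 75270.0),
    (15400, 78500.0),
    (16100, 82160.0),
    (18800, 95430.0),
    (20000, 101700.0),
]

# Average regret per step up to each logged point.
regret_per_step = [regret / steps for steps, regret in logged]

for (steps, _), rate in zip(logged, regret_per_step):
    print(f"{steps:>6} steps: {rate:.3f} regret per step")
```

The rate stays nearly constant across these points, so cumulative regret is growing roughly linearly, which suggests that the policy is not getting closer to the optimal arm choice as training proceeds.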

## 04: Linear Upper Confidence Bound - Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change with more steps?

In [17]:
from ray import tune
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

In [18]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Actual number of time steps will be 40 iterations * timesteps_per_iteration (100 by default) = 4,000
training_iterations = 40

print("Running training for %s time steps" % training_iterations)

Running training for 40 time steps


In [19]:
start_time = time.time()

analysis = tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=89083)[0m 2020-06-10 20:28:09,614	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89083)[0m 2020-06-10 20:28:09,616	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89081)[0m 2020-06-10 20:28:09,616	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89081)[0m 2020-06-10 20:28:09,618	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89085)[0m 2020-06-10 20:28:09,615	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89085)[0m 2020-06-10 20:28:09,617	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,1.0,0.30033,100.0,0.849664
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,,,,,


Result for contrib_LinUCB_ParametricItemRecoEnv_00003:
  custom_metrics: {}
  date: 2020-06-10_20-28-09
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9136942913748326
  episode_reward_mean: 0.8700057969748397
  episode_reward_min: 0.6567531097785329
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 8c4788bdf8f64a7f91266275bfc08f3f
  experiment_tag: '3'
  grad_time_ms: 0.775
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.775
    learner:
      cumulative_regret: 2.849333132520525
      update_latency: 0.0004000663757324219
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1290.714
    opt_samples: 1.0
    sample_peak_throughput: 467.103
    sample_time_ms: 2.141
    update_time_ms: 0.003
  iterations_since_restore: 1
  learner:
    cumulative_regret: 2.849333132520525
    update_latency: 0.0004000663757324219
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_p

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,12,5.21053,1200,0.8839
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:89085,11,4.92014,1100,0.891387
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:89083,11,4.87588,1100,0.85429
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:89079,11,4.79274,1100,0.898247
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:89109,10,4.42531,1000,0.903


Result for contrib_LinUCB_ParametricItemRecoEnv_00003:
  custom_metrics: {}
  date: 2020-06-10_20-28-15
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9136942913748326
  episode_reward_mean: 0.8972688585039341
  episode_reward_min: 0.8643415920205066
  episodes_this_iter: 100
  episodes_total: 1200
  experiment_id: 8c4788bdf8f64a7f91266275bfc08f3f
  experiment_tag: '3'
  grad_time_ms: 0.767
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.767
    learner:
      cumulative_regret: 7.219961850945604
      update_latency: 0.0005240440368652344
    num_steps_sampled: 1200
    num_steps_trained: 1200
    opt_peak_throughput: 1303.105
    opt_samples: 1.0
    sample_peak_throughput: 525.569
    sample_time_ms: 1.903
    update_time_ms: 0.002
  iterations_since_restore: 12
  learner:
    cumulative_regret: 7.219961850945604
    update_latency: 0.0005240440368652344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1200
  num_steps_trained: 1200
 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,20,9.66195,2000,0.882946
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:89085,20,9.47419,2000,0.886802
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:89083,20,9.56191,2000,0.857657
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:89079,21,10.53,2100,0.899929
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:89109,19,9.11211,1900,0.904892


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_20-28-20
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8969559357751138
  episode_reward_mean: 0.8890713670447853
  episode_reward_min: 0.8345529000328934
  episodes_this_iter: 100
  episodes_total: 2100
  experiment_id: 405b0d904c6c4aa6bf44735330c8fac1
  experiment_tag: '1'
  grad_time_ms: 5.212
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 5.212
    learner:
      cumulative_regret: 6.309720893171111
      update_latency: 0.0007548332214355469
    num_steps_sampled: 2100
    num_steps_trained: 2100
    opt_peak_throughput: 191.869
    opt_samples: 1.0
    sample_peak_throughput: 351.011
    sample_time_ms: 2.849
    update_time_ms: 0.9
  iterations_since_restore: 21
  learner:
    cumulative_regret: 6.309720893171111
    update_latency: 0.0007548332214355469
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2100
  num_steps_trained: 2100
  of

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,29,15.081,2900,0.880819
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:89085,30,15.1687,3000,0.88958
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:89083,29,14.7734,2900,0.85274
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:89079,30,15.0704,3000,0.899246
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:89109,29,14.7182,2900,0.90343


Result for contrib_LinUCB_ParametricItemRecoEnv_00002:
  custom_metrics: {}
  date: 2020-06-10_20-28-26
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8702599828274352
  episode_reward_mean: 0.8527141656112239
  episode_reward_min: 0.7691227537481429
  episodes_this_iter: 100
  episodes_total: 3000
  experiment_id: c9a84219b371497f9125b136b44769ed
  experiment_tag: '2'
  grad_time_ms: 9.778
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 9.778
    learner:
      cumulative_regret: 5.430067624476357
      update_latency: 0.0008330345153808594
    num_steps_sampled: 3000
    num_steps_trained: 3000
    opt_peak_throughput: 102.266
    opt_samples: 1.0
    sample_peak_throughput: 57.661
    sample_time_ms: 17.343
    update_time_ms: 0.003
  iterations_since_restore: 30
  learner:
    cumulative_regret: 5.430067624476357
    update_latency: 0.0008330345153808594
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3000
  num_steps_trained: 3000
  

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,34,20.2304,3400,0.886639
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:89085,34,19.6838,3400,0.888633
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:89083,34,20.0088,3400,0.85437
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:89079,35,20.1097,3500,0.898365
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:89109,34,19.7022,3400,0.904904


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_20-28-30
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8969559357751138
  episode_reward_mean: 0.8890794798592643
  episode_reward_min: 0.812728163799366
  episodes_this_iter: 100
  episodes_total: 3500
  experiment_id: 405b0d904c6c4aa6bf44735330c8fac1
  experiment_tag: '1'
  grad_time_ms: 1.639
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.639
    learner:
      cumulative_regret: 6.591459757186618
      update_latency: 0.0015501976013183594
    num_steps_sampled: 3500
    num_steps_trained: 3500
    opt_peak_throughput: 609.965
    opt_samples: 1.0
    sample_peak_throughput: 414.801
    sample_time_ms: 2.411
    update_time_ms: 0.002
  iterations_since_restore: 35
  learner:
    cumulative_regret: 6.591459757186618
    update_latency: 0.0015501976013183594
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3500
  num_steps_trained: 3500
  o

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:89081,38,24.9558,3800,0.88312
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:89085,38,23.7907,3800,0.88997
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:89083,39,25.4867,3900,0.849437
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:89079,39,24.5508,3900,0.90025
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:89109,38,24.5268,3800,0.906002


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_20-28-36
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8969559357751138
  episode_reward_mean: 0.8884875956375108
  episode_reward_min: 0.8285140566161642
  episodes_this_iter: 100
  episodes_total: 3900
  experiment_id: 405b0d904c6c4aa6bf44735330c8fac1
  experiment_tag: '1'
  grad_time_ms: 18.347
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 18.347
    learner:
      cumulative_regret: 6.6656569029152095
      update_latency: 0.0009992122650146484
    num_steps_sampled: 3900
    num_steps_trained: 3900
    opt_peak_throughput: 54.504
    opt_samples: 1.0
    sample_peak_throughput: 66.911
    sample_time_ms: 14.945
    update_time_ms: 0.016
  iterations_since_restore: 39
  learner:
    cumulative_regret: 6.6656569029152095
    update_latency: 0.0009992122650146484
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3900
  num_steps_trained: 3900

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,40,28.033,4000,0.886721
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,40,26.8793,4000,0.889061
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,40,27.2216,4000,0.851777
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,40,26.1726,4000,0.900764
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,40,27.7942,4000,0.903961


The trials took 37.75342679023743 seconds



In [20]:
# Combine the per-trial results into one DataFrame.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df

num_steps_trained,mean,max,min,std
100,3.252337,3.485253,2.849333,0.241374
200,3.932223,4.075063,3.724355,0.153372
300,4.391678,4.598596,4.236468,0.160653
400,4.681613,4.913034,4.433639,0.232332
500,4.909012,5.260438,4.462939,0.322807
600,5.122754,5.710011,4.569432,0.434812
700,5.327174,6.086834,4.681075,0.540261
800,5.457958,6.333549,4.820762,0.590528
900,5.573001,6.634517,4.870603,0.691867
1000,5.675521,6.873927,4.947293,0.757268


In [21]:
import sys
sys.path.append("..")  # So we can load the "bokeh_util" from the parent directory.

In [22]:
from bokeh_util import plot_cumulative_regret, plot_model_weights
# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [23]:
plot_cumulative_regret(df)

([image](../../../images/rllib/LinUCB-cumulative-regret-linUCB2.png))

The slope appears to stop flattening, suggesting that the previous number of steps, 2000, was sufficient to reach the best behavior this model achieves. Beyond that point, regret continues to accumulate roughly linearly in the number of steps, getting neither better nor worse.
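To inspect the slope numerically rather than by eye, you can look at the increase in mean cumulative regret per 100 steps. The following sketch copies the ten `mean` values shown in the table above into a standalone `Series` (the name `mean_regret` is illustrative); on the full aggregated frame from the run, the equivalent call would be `df["mean"].diff()`.

```python
import pandas as pd

# Mean cumulative regret for the first 1,000 steps, copied from the table above.
mean_regret = pd.Series(
    [3.252337, 3.932223, 4.391678, 4.681613, 4.909012,
     5.122754, 5.327174, 5.457958, 5.573001, 5.675521],
    index=range(100, 1100, 100),
    name="mean",
)

# Increase in regret over each block of 100 steps (the first entry has no predecessor).
increments = mean_regret.diff().dropna()
print(increments.round(3))
```

Over these first 1,000 steps the increments are still shrinking, which is the flattening phase. A run of roughly constant increments at later steps would correspond to the linear growth described above.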

## 05: Linear Thompson Sampling - Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

In [24]:
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [25]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s time steps" % training_iterations)

Running training for 20 time steps


In [26]:
def run_ts(delta):
    TS_CONFIG["delta"] = delta

    start_time = time.time()

    analysis = tune.run(
        LinTSTrainer,
        config=TS_CONFIG,
        stop={"training_iteration": training_iterations},
        num_samples=2,
        checkpoint_at_end=True
        )

    print("The trials took", time.time() - start_time, "seconds\n")

    # Combine the per-trial results into one DataFrame.
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
    df = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

    ts_regrets = df \
        .groupby("num_steps_trained")["learner/cumulative_regret"] \
        .aggregate(["mean", "max", "min", "std"])
    
    trial = analysis.trials[0]
    trainer = LinTSTrainer(config=TS_CONFIG)
    trainer.restore(trial.checkpoint.value)
    
    model = trainer.get_policy().model
    means = [model.arms[i].theta.numpy() for i in range(5)]
    covs = [model.arms[i].covariance.numpy() for i in range(5)]

    return ts_regrets, model, means, covs

In [27]:
delta = 0.7
ts_regrets7, model7, means7, covs7 = run_ts(delta)

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=89141)[0m 2020-06-10 20:28:54,082	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89141)[0m 2020-06-10 20:28:54,084	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89141)[0m 2020-06-10 20:28:54,099	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=89142)[0m 2020-06-10 20:28:54,082	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89142)[0m 2020-06-10 20:28:54,084	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89142)[0m 2020-06-10 20:28:54,100	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_20-28-54
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,,,,,
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:89142,1.0,0.641963,100.0,31.3818


Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_20-28-54
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 50.025029497609985
  episode_reward_mean: 13.254214191368032
  episode_reward_min: 0.9705291322371877
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: abf6b70f24cd448a9f651e62fa79108c
  experiment_tag: '0'
  grad_time_ms: 0.563
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.563
    learner:
      cumulative_regret: 2210.7582341361626
      update_latency: 0.0005869865417480469
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1775.066
    opt_samples: 1.0
    sample_peak_throughput: 547.138
    sample_time_ms: 1.828
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 2210.7582341361626
    update_latency: 0.0005869865417480469
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimato

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:89141,15,5.21548,1500,31.3893
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:89142,16,5.46622,1600,38.7291


Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_20-28-59
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 50.020323012682724
  episode_reward_mean: 27.96025183665854
  episode_reward_min: 0.9823477492033534
  episodes_this_iter: 100
  episodes_total: 1600
  experiment_id: abf6b70f24cd448a9f651e62fa79108c
  experiment_tag: '0'
  grad_time_ms: 0.593
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.593
    learner:
      cumulative_regret: 18446.15024438094
      update_latency: 0.0005130767822265625
    num_steps_sampled: 1600
    num_steps_trained: 1600
    opt_peak_throughput: 1687.577
    opt_samples: 1.0
    sample_peak_throughput: 528.363
    sample_time_ms: 1.893
    update_time_ms: 0.002
  iterations_since_restore: 16
  learner:
    cumulative_regret: 18446.15024438094
    update_latency: 0.0005130767822265625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1600
  num_steps_trained: 1600
  off_policy_estim

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,6.98209,2000,26.4905
LinTS_WheelBanditEnv_00001,TERMINATED,,20,7.03456,2000,38.2409


2020-06-10 20:29:01,526	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-10 20:29:01,536	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.


The trials took 19.274649143218994 seconds



2020-06-10 20:29:01,672	INFO trainable.py:217 -- Getting current IP.
2020-06-10 20:29:01,683	INFO trainable.py:217 -- Getting current IP.
2020-06-10 20:29:01,685	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-10_20-28-42w3gugqec/checkpoint_20/checkpoint-20
2020-06-10 20:29:01,686	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 6.982086658477783, '_episodes_total': 2000}


In [28]:
plot_cumulative_regret(ts_regrets7)

([image](../../../images/rllib/LinTS-cumulative-regret2.png))

The cumulative regret values are much higher than for $\delta = 0.5$ in the lesson, and the standard deviation across the trials is very large. We mentioned in the lesson that the problem becomes harder for higher $\delta$, which fits this result.
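A rough intuition for why higher $\delta$ is harder can be sketched numerically. The code below assumes the standard wheel bandit setup, in which contexts are drawn uniformly from the unit disk and only contexts with norm greater than $\delta$ reward the outer arms; this assumption was not checked against the `WheelBanditEnv` source, and `outer_fraction` is a hypothetical helper, not part of RLlib.

```python
import numpy as np

def outer_fraction(delta, n=200_000, seed=0):
    """Estimate the fraction of uniform unit-disk contexts with norm > delta."""
    rng = np.random.default_rng(seed)
    # Rejection sampling: draw from the square, keep the points inside the unit disk.
    points = rng.uniform(-1.0, 1.0, size=(n, 2))
    norms = np.linalg.norm(points, axis=1)
    norms = norms[norms <= 1.0]
    return float(np.mean(norms > delta))

for delta in (0.5, 0.7, 0.9):
    # Under this assumption the exact value is 1 - delta**2.
    print(delta, round(outer_fraction(delta), 3), round(1 - delta**2, 3))
```

Under this assumption, about 75% of contexts fall in the outer region for $\delta = 0.5$, but only about 51% for $\delta = 0.7$ and 19% for $\delta = 0.9$, so the informative contexts become rarer as $\delta$ grows.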

In [29]:
import numpy as np

In [30]:
plot_model_weights(means7, covs7)

([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms2.png))

The clusters are much less separated than for $\delta = 0.5$.

In [31]:
delta = 0.9
ts_regrets9, model9, means9, covs9 = run_ts(delta)

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=89145)[0m 2020-06-10 20:29:10,511	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89145)[0m 2020-06-10 20:29:10,512	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89145)[0m 2020-06-10 20:29:10,522	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=89151)[0m 2020-06-10 20:29:10,510	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=89151)[0m 2020-06-10 20:29:10,512	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=89151)[0m 2020-06-10 20:29:10,522	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_20-29-10
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:89145,1.0,0.197057,100.0,22.078
LinTS_WheelBanditEnv_00001,RUNNING,,,,,


Result for LinTS_WheelBanditEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_20-29-10
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 50.01564006885617
  episode_reward_mean: 18.158459166330296
  episode_reward_min: 0.9789907581841277
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 61cd909d01034d6891987d05915f8ebb
  experiment_tag: '1'
  grad_time_ms: 0.393
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.393
    learner:
      cumulative_regret: 1866.657109199669
      update_latency: 0.00030422210693359375
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 2547.097
    opt_samples: 1.0
    sample_peak_throughput: 710.899
    sample_time_ms: 1.407
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 1866.657109199669
    update_latency: 0.00030422210693359375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimator

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,4.10791,2000,35.2992
LinTS_WheelBanditEnv_00001,TERMINATED,,20,4.17366,2000,29.9162


2020-06-10 20:29:15,037	INFO trainable.py:217 -- Getting current IP.
2020-06-10 20:29:15,042	INFO trainable.py:217 -- Getting current IP.
2020-06-10 20:29:15,044	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-10_20-29-0253r0mn3i/checkpoint_20/checkpoint-20
2020-06-10 20:29:15,045	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 4.107909917831421, '_episodes_total': 2000}


The trials took 12.801892042160034 seconds



In [32]:
plot_cumulative_regret(ts_regrets9)

([image](../../../images/rllib/LinTS-cumulative-regret3.png))

The result is qualitatively the same as for $\delta = 0.7$, but the cumulative regret values are even higher.

In [33]:
plot_model_weights(means9, covs9)


([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms3.png))

Curiously, this result looks much closer to $\delta = 0.5$ than $0.7$!