# Ray RLlib - Multi-Armed Bandits - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [21]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import pandas as pd
import os, time, random
from ray import tune
from ray.tune.progress_reporter import JupyterNotebookReporter

## 03: Simple Multi-Armed Bandits - Exercise 1

First, modify the bandit so that the context changes between steps: in `step`, pick a new context at random with `self.current_context = random.choice([-1.,1.])` before returning the next observation. The rewards per context stay the same as before, `[-10, 0, 10]` for context `-1` and `[10, 0, -10]` for context `1`.

In [2]:
class SimpleContextualBandit2 (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        self.current_context = random.choice([-1.,1.])
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBandit2(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'
    

In [3]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [-1.  1.], bandit = SimpleContextualBandit2(action_space=Discrete(3), observation_space=Box(2,), current_context=1.0, rewards per context={-1.0: [-10, 0, 10], 1.0: [10, 0, -10]})'
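The `regret` entry in `info` is the gap between the best reward available in either context, 10, and the reward actually received. As a rough illustration, here is a gym-free sketch that replays the same reward table with a uniformly random policy and accumulates the regret. The helper name `random_policy_regret` and the seed are illustrative and not part of the notebook.

```python
import random

# Same reward table as SimpleContextualBandit2: context -> reward per arm.
REWARDS_FOR_CONTEXT = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
BEST_REWARD = 10  # Each context has exactly one arm that pays 10.

def random_policy_regret(num_steps, seed=0):
    """Pull arms uniformly at random; return (rewards, cumulative_regret)."""
    rng = random.Random(seed)
    rewards = []
    cumulative_regret = 0
    for _ in range(num_steps):
        context = rng.choice([-1., 1.])
        action = rng.randrange(3)
        reward = REWARDS_FOR_CONTEXT[context][action]
        rewards.append(reward)
        cumulative_regret += BEST_REWARD - reward  # same formula as info["regret"]
    return rewards, cumulative_regret

rewards, regret = random_policy_regret(1000)
print(f"mean reward = {sum(rewards) / len(rewards):.2f}, cumulative regret = {regret}")
```

A random policy averages a reward near 0, so its regret grows by roughly 10 per step. That gives a baseline for judging the `cumulative_regret` values reported by the training runs.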

In [5]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit2,
}

In [6]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-08 13:58:52,018	INFO resource_spec.py:212 -- Starting Ray with 4.44 GiB memory available for workers and up to 2.22 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-08 13:58:52,344	INFO services.py:1170 -- View the Ray dashboard at localhost:8266


Trial name,status,loc
contrib_LinUCB_SimpleContextualBandit2_00000,RUNNING,


(pid=13285) 2020-06-08 13:59:00,475	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=13285) 2020-06-08 13:59:00,478	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=13285) 2020-06-08 13:59:00,486	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-08_13-59-00
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 91d770ae39234ca1bfd77582362df8dc
  experiment_tag: '0'
  grad_time_ms: 0.246
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.246
    learner:
      cumulative_regret: 10.0
      update_latency: 0.0001289844512939453
    num_steps_sampled: 100
    num_steps_trained: 100

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBandit2_00000,TERMINATED,,2,0.239915,200,10


The trials took 8.765048027038574 seconds



In [7]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.894,0.285,0.001,...,1118.72,1.0,10.0,0.00023,,,10.0,0.00023,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


It trains just as easily as the original implementation that didn't switch contexts between steps. Is this surprising? Probably not, because the relationship between the reward and the context remains linear, so what LinUCB learns for one context is correct for the second context, too. Also, _Tune_ runs many episodes, so it studies both contexts.
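To see that the rewards really are linear in the observation, we can fit one weight vector per arm by least squares and check that the fit reproduces the reward table exactly. This is only a sketch: the names `X`, `Y`, and `theta` are ours, and we assume a linear model with no intercept term.

```python
import numpy as np

# Observations returned by the bandit: [-context, context].
X = np.array([[1., -1.],    # context -1
              [-1., 1.]])   # context  1

# Rewards per arm (columns) for each context (rows).
Y = np.array([[-10., 0., 10.],
              [10., 0., -10.]])

# Least-squares fit of one weight vector per arm (no intercept term).
theta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
predictions = X @ theta

print("weights per arm (columns):")
print(np.round(theta, 3))
print("fit reproduces the rewards exactly:", np.allclose(predictions, Y))
```

Because the two observations are negatives of each other and so are the rewards of every arm, a single weight vector per arm covers both contexts.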

## 03: Simple Multi-Armed Bandits - Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.
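As a reminder of what "a set of linear predictors" means, here is a minimal sketch of the textbook disjoint LinUCB rule: one ridge-regression estimate per arm plus a confidence bonus. It is not RLlib's `contrib/LinUCB` implementation. The names `LinUCBSketch` and `run_linucb`, the identity prior, and `alpha=1.0` are all assumptions made for illustration.

```python
import numpy as np

class LinUCBSketch:
    """Disjoint LinUCB: one ridge-regression model per arm plus a confidence bonus."""

    def __init__(self, num_arms, dim, alpha=1.0):
        self.alpha = alpha                                  # exploration weight (assumed)
        self.A = [np.eye(dim) for _ in range(num_arms)]     # I + sum of x x^T per arm
        self.b = [np.zeros(dim) for _ in range(num_arms)]   # sum of reward * x per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                               # linear reward estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)     # upper-confidence bonus
            scores.append(x @ theta + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def run_linucb(rewards_for_context, num_steps=200, seed=0):
    """Play the two-context bandit and return the list of rewards received."""
    rng = np.random.default_rng(seed)
    agent = LinUCBSketch(num_arms=3, dim=2)
    rewards = []
    for _ in range(num_steps):
        context = float(rng.choice([-1., 1.]))
        x = np.array([-context, context])   # same observation layout as the bandit
        arm = agent.select(x)
        reward = rewards_for_context[context][arm]
        agent.update(arm, x, reward)
        rewards.append(reward)
    return rewards


linear_rewards = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}
rewards = run_linucb(linear_rewards)
print("mean reward over the last 100 steps:", np.mean(rewards[-100:]))
```

Passing a different reward table to `run_linucb` is a quick way to experiment with how the rule behaves when the linear assumption holds or breaks.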

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Also remove the change made for exercise 1, the line `self.current_context = random.choice([-1.,1.])` in the `step` method.

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

In [8]:
class SimpleContextualBanditNonlinear (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {   # Changed here:
            -1.: [-10, 10, 0],
            1.: [0, 10, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBanditNonlinear(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

In [9]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBanditNonlinear(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 10, 0], 1.0: [0, 10, -10]})'

In [10]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}


In [11]:
# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

In [12]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,


(pid=13289) 2020-06-08 14:01:10,502	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=13289) 2020-06-08 14:01:10,505	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=13289) 2020-06-08 14:01:10,512	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-10
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.4
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.255
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.255
    learner:
      cumulative_regret: 460.0
      update_latency: 0.00013589859008789062
    num_steps_sampled: 100
    num_steps_

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,17,1.88235,1700,5.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-15
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 4400
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.306
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.306
    learner:
      cumulative_regret: 21780.0
      update_latency: 0.0001728534698486328
    num_steps_sampled: 4400
    num_steps_trained: 4400
    opt_peak_throughput: 3264.304
    opt_samples: 1.0
    sample_peak_throughput: 1297.462
    sample_time_ms: 0.771
    update_time_ms: 0.001
  iterations_since_restore: 44
  learner:
    cumulative_regret: 21780.0
    update_latency: 0.0001728534698486328
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4400
  num_steps_trained: 4400
  off_policy_estimator: {}
  opt_peak_throughput: 3264

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,60,6.5581,6000,4.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-20
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.8
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 8700
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.321
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.321
    learner:
      cumulative_regret: 42800.0
      update_latency: 0.000209808349609375
    num_steps_sampled: 8700
    num_steps_trained: 8700
    opt_peak_throughput: 3111.501
    opt_samples: 1.0
    sample_peak_throughput: 1437.045
    sample_time_ms: 0.696
    update_time_ms: 0.001
  iterations_since_restore: 87
  learner:
    cumulative_regret: 42800.0
    update_latency: 0.000209808349609375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 8700
  num_steps_trained: 8700
  off_policy_estimator: {}
  opt_peak_throughput: 3111.5

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,102,11.2425,10200,5.1


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-25
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.4
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 12800
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.406
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.406
    learner:
      cumulative_regret: 63530.0
      update_latency: 0.00024175643920898438
    num_steps_sampled: 12800
    num_steps_trained: 12800
    opt_peak_throughput: 2461.59
    opt_samples: 1.0
    sample_peak_throughput: 1320.625
    sample_time_ms: 0.757
    update_time_ms: 0.002
  iterations_since_restore: 128
  learner:
    cumulative_regret: 63530.0
    update_latency: 0.00024175643920898438
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 12800
  num_steps_trained: 12800
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,142,15.898,14200,5.5


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-30
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 16700
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.38
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.38
    learner:
      cumulative_regret: 83060.0
      update_latency: 0.00027298927307128906
    num_steps_sampled: 16700
    num_steps_trained: 16700
    opt_peak_throughput: 2634.613
    opt_samples: 1.0
    sample_peak_throughput: 1390.085
    sample_time_ms: 0.719
    update_time_ms: 0.001
  iterations_since_restore: 167
  learner:
    cumulative_regret: 83060.0
    update_latency: 0.00027298927307128906
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 16700
  num_steps_trained: 16700
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:13289,181,20.6512,18100,5.2


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-08_14-01-35
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.7
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 20000
  experiment_id: 1ff2fc0067bd458aaaabe106df9751e1
  experiment_tag: '0'
  grad_time_ms: 0.445
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.445
    learner:
      cumulative_regret: 99230.0
      update_latency: 0.0003299713134765625
    num_steps_sampled: 20000
    num_steps_trained: 20000
    opt_peak_throughput: 2245.946
    opt_samples: 1.0
    sample_peak_throughput: 1244.97
    sample_time_ms: 0.803
    update_time_ms: 0.001
  iterations_since_restore: 200
  learner:
    cumulative_regret: 99230.0
    update_latency: 0.0003299713134765625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 20000
  num_steps_trained: 20000
  off_policy_estimator: {}
  opt_peak_throughput: 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,22.9847,20000,5.7


The trials took 27.73778510093689 seconds



In [13]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,5.7,1.0,100,20000,20000,0.803,0.445,0.001,...,1244.97,1.0,99230.0,0.00033,22.1,66.1,99230.0,0.00033,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinUCB/...


It ran the maximum of 20,000 steps (200 iterations of 100 steps each), and the reward mean only hovers around 5 (5.7 in the final iteration of this run), not 10.0. The `episode_reward_mean` is chaotic:

![Nonlinear model with LinUCB](../../../images/rllib/TensorBoard2.png)

Because LinUCB expects a linear relationship between the context and each reward, it's not surprising that it fails to converge to the desired reward mean.
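A rough way to see the problem is to fit the modified reward table with one weight vector per arm by least squares. This is a sketch under the assumption that the model has no intercept term, which the notebook does not state. The names `X`, `Y`, and `theta` are ours.

```python
import numpy as np

X = np.array([[1., -1.],    # observation for context -1
              [-1., 1.]])   # observation for context  1

# Modified rewards per arm (columns) for each context (rows).
Y = np.array([[-10., 10., 0.],
              [0., 10., -10.]])

# Best linear fit without an intercept, one weight vector per arm.
theta, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
predictions = X @ theta

print("fitted rewards per context:")
print(np.round(predictions, 3))
print("best arm per context (true):  ", Y.argmax(axis=1))
print("best arm per context (fitted):", predictions.argmax(axis=1))
```

The two observations are negatives of each other, so any such model must predict opposite rewards in the two contexts. Arm 1 pays 10 in both, which the fit can only approximate with 0, and a greedy choice on the fitted values never selects it.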

## 03: Simple Multi-Armed Bandits - Exercise 3

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](../02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using the linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.
```

Change `contrib/LinUCB` to `contrib/LinTS`.  
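For intuition about what changes when we switch algorithms, here is a minimal sketch of textbook Linear Thompson Sampling. Instead of adding a confidence bonus to each arm's estimate, it samples a weight vector from a Gaussian posterior per arm and acts greedily on the sample. It is not RLlib's `contrib/LinTS` implementation. The names `LinTSSketch` and `run_lints`, the identity prior, and the scale `v=1.0` are assumptions made for illustration.

```python
import numpy as np

class LinTSSketch:
    """Linear Thompson Sampling with one Gaussian posterior per arm."""

    def __init__(self, num_arms, dim, rng, v=1.0):
        self.rng = rng
        self.v = v                                          # posterior scale (assumed)
        self.A = [np.eye(dim) for _ in range(num_arms)]     # I + sum of x x^T per arm
        self.b = [np.zeros(dim) for _ in range(num_arms)]   # sum of reward * x per arm

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            cov = (cov + cov.T) / 2                 # keep the covariance symmetric
            mean = cov @ b                          # posterior mean of the weights
            theta = self.rng.multivariate_normal(mean, self.v ** 2 * cov)
            scores.append(x @ theta)                # reward under the sampled weights
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def run_lints(rewards_for_context, num_steps=500, seed=0):
    """Play the two-context bandit and return the list of rewards received."""
    rng = np.random.default_rng(seed)
    agent = LinTSSketch(num_arms=3, dim=2, rng=rng)
    rewards = []
    for _ in range(num_steps):
        context = float(rng.choice([-1., 1.]))
        x = np.array([-context, context])   # same observation layout as the bandit
        arm = agent.select(x)
        reward = rewards_for_context[context][arm]
        agent.update(arm, x, reward)
        rewards.append(reward)
    return rewards


linear_rewards = {-1.: [-10, 0, 10], 1.: [10, 0, -10]}
rewards = run_lints(linear_rewards)
print("mean reward over the last 100 steps:", np.mean(rewards[-100:]))
```

Both algorithms keep the same per-arm linear estimate, so they share the same linearity assumption. Only the exploration mechanism differs.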

In [14]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBandit2,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

In [14]:
df = analysis.dataframe()
df

Trial name,status,loc
contrib_LinTS_SimpleContextualBandit2_00000,RUNNING,


(pid=13291) 2020-06-08 14:02:51,052	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=13291) 2020-06-08 14:02:51,056	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=13291) 2020-06-08 14:02:51,063	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-08_14-02-51
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 481647293a464969a3d73baab12f468b
  experiment_tag: '0'
  grad_time_ms: 0.25
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.25
    learner:
      cumulative_regret: 10.0
      update_latency: 0.00013208389282226562
    num_steps_sampled: 100
    num_steps_trained: 100
 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBandit2_00000,TERMINATED,,2,0.213528,200,10


The trials took 3.132218837738037 seconds



Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.732,0.232,0.001,...,1366.312,1.0,10.0,0.000131,,,10.0,0.000131,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinTS/c...


As before, the training takes only 200 steps and converges to the desired reward mean of `10.0`.
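The run ends after only 200 steps because of the `stop` dictionary: as we understand Tune's behavior, a trial stops as soon as any listed metric reaches its threshold. The helper `should_stop` below is a hypothetical stand-in for that rule, not a Tune API, and the result dictionaries copy values from the status tables printed in this notebook.

```python
def should_stop(result, stop):
    """Return True once any metric named in `stop` has reached its threshold."""
    return any(result[name] >= threshold for name, threshold in stop.items())

stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

# Values taken from the status tables printed by the runs in this notebook.
linear_result = {"training_iteration": 2, "timesteps_total": 200, "episode_reward_mean": 10.0}
nonlinear_midway = {"training_iteration": 102, "timesteps_total": 10200, "episode_reward_mean": 5.1}
nonlinear_final = {"training_iteration": 200, "timesteps_total": 20000, "episode_reward_mean": 5.7}

print(should_stop(linear_result, stop))      # reward mean reached 10.0
print(should_stop(nonlinear_midway, stop))   # no threshold reached yet
print(should_stop(nonlinear_final, stop))    # iteration limit reached
```

The linear bandit stops on the reward criterion, while the nonlinear bandit never reaches a reward mean of 10.0 and runs until the iteration limit.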

Now let's try the nonlinear bandit:

In [None]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

start_time = time.time()

analysis = tune.run("contrib/LinTS", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

In [16]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,-10.0,4.5,1.0,100,20000,20000,0.74,0.446,0.003,...,1350.692,1.0,100750.0,0.00042,16.9,68.7,100750.0,0.00042,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinTS/c...


This run with Thompson sampling yields similar results: the reward mean is about 4.5 after 20,000 steps, and the TensorBoard graph shows the same chaotic failure to converge.

## 04: Linear Upper Confidence Bound - Exercise 1

Change the `training_iterations` from 20 to 50. Does the characteristic behavior of cumulative regret change at higher steps?

In [33]:
from ray import tune
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

In [29]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be training_iterations * timesteps_per_iteration (100 by default) = 5,000
training_iterations = 50

print("Running training for %s time steps" % training_iterations)

Running training for 50 time steps


In [30]:
start_time = time.time()

analysis = tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


(pid=80282) 2020-06-10 14:39:59,214	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=80282) 2020-06-10 14:39:59,221	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=80284) 2020-06-10 14:39:59,229	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=80284) 2020-06-10 14:39:59,235	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=80285) 2020-06-10 14:39:59,216	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=80285) 2020-06-10 14:39:59,221	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:80285,1.0,0.273223,100.0,0.85564
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,,,,,


Result for contrib_LinUCB_ParametricItemRecoEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_14-39-59
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9130084799930833
  episode_reward_mean: 0.85169154951746
  episode_reward_min: 0.6321990038185692
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: a696831a2d8f458aa1154d8dc84e8e7f
  experiment_tag: '0'
  grad_time_ms: 0.513
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.513
    learner:
      cumulative_regret: 3.75249494289159
      update_latency: 0.00026988983154296875
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1950.658
    opt_samples: 1.0
    sample_peak_throughput: 693.354
    sample_time_ms: 1.442
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 3.75249494289159
    update_latency: 0.00026988983154296875
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_pol

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:80283,15,4.96707,1500,0.888318
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:80285,15,4.95175,1500,0.887565
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:80284,15,4.88446,1500,0.909507
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:80288,16,4.97795,1600,0.886607
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:80282,15,4.8785,1500,0.85026


Result for contrib_LinUCB_ParametricItemRecoEnv_00001:
  custom_metrics: {}
  date: 2020-06-10_14-40-04
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8989514652605076
  episode_reward_mean: 0.8875141480598222
  episode_reward_min: 0.8489871972592722
  episodes_this_iter: 100
  episodes_total: 1600
  experiment_id: c2df8311aa46490ebcd32e5bcdd6a560
  experiment_tag: '1'
  grad_time_ms: 0.801
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.801
    learner:
      cumulative_regret: 6.35025204543502
      update_latency: 0.0005362033843994141
    num_steps_sampled: 1600
    num_steps_trained: 1600
    opt_peak_throughput: 1247.785
    opt_samples: 1.0
    sample_peak_throughput: 602.517
    sample_time_ms: 1.66
    update_time_ms: 0.003
  iterations_since_restore: 16
  learner:
    cumulative_regret: 6.35025204543502
    update_latency: 0.0005362033843994141
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1600
  num_steps_trained: 1600
  of

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:80283,33,9.71592,3300,0.888259
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:80285,34,9.88724,3400,0.889061
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:80284,33,9.51818,3300,0.905144
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:80288,34,9.65986,3400,0.891363
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:80282,33,9.65147,3300,0.848303


Result for contrib_LinUCB_ParametricItemRecoEnv_00004:
  custom_metrics: {}
  date: 2020-06-10_14-40-09
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8777501761402411
  episode_reward_mean: 0.8497868334009115
  episode_reward_min: 0.7606959960591636
  episodes_this_iter: 100
  episodes_total: 3400
  experiment_id: 143155dd06b94fa2a5a1cd9ab939ba8b
  experiment_tag: '4'
  grad_time_ms: 0.996
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.996
    learner:
      cumulative_regret: 5.793794073964872
      update_latency: 0.0009129047393798828
    num_steps_sampled: 3400
    num_steps_trained: 3400
    opt_peak_throughput: 1004.215
    opt_samples: 1.0
    sample_peak_throughput: 659.192
    sample_time_ms: 1.517
    update_time_ms: 0.002
  iterations_since_restore: 34
  learner:
    cumulative_regret: 5.793794073964872
    update_latency: 0.0009129047393798828
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3400
  num_steps_trained: 3400
 



Result for contrib_LinUCB_ParametricItemRecoEnv_00004:
  custom_metrics: {}
  date: 2020-06-10_14-40-14
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8777501761402411
  episode_reward_mean: 0.8552289931757145
  episode_reward_min: 0.7807767063805695
  episodes_this_iter: 100
  episodes_total: 4400
  experiment_id: 143155dd06b94fa2a5a1cd9ab939ba8b
  experiment_tag: '4'
  grad_time_ms: 1.035
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.035
    learner:
      cumulative_regret: 5.938656220595568
      update_latency: 0.0010323524475097656
    num_steps_sampled: 4400
    num_steps_trained: 4400
    opt_peak_throughput: 966.385
    opt_samples: 1.0
    sample_peak_throughput: 706.73
    sample_time_ms: 1.415
    update_time_ms: 0.002
  iterations_since_restore: 44
  learner:
    cumulative_regret: 5.938656220595568
    update_latency: 0.0010323524475097656
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4400
  num_steps_trained: 4400
  o

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:80283,43,12.396,4300,0.886025
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:80285,43,12.458,4300,0.886222
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:80284,43,12.3439,4300,0.901075
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:80288,43,12.0727,4300,0.882945
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:80282,44,12.635,4400,0.855229


Result for contrib_LinUCB_ParametricItemRecoEnv_00000:
  custom_metrics: {}
  date: 2020-06-10_14-40-14
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9130084799930833
  episode_reward_mean: 0.8845187990388982
  episode_reward_min: 0.8283102401581812
  episodes_this_iter: 100
  episodes_total: 4400
  experiment_id: a696831a2d8f458aa1154d8dc84e8e7f
  experiment_tag: '0'
  grad_time_ms: 1.146
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.146
    learner:
      cumulative_regret: 6.195141434894597
      update_latency: 0.0009942054748535156
    num_steps_sampled: 4400
    num_steps_trained: 4400
    opt_peak_throughput: 872.632
    opt_samples: 1.0
    sample_peak_throughput: 619.808
    sample_time_ms: 1.613
    update_time_ms: 0.002
  iterations_since_restore: 44
  learner:
    cumulative_regret: 6.195141434894597
    update_latency: 0.0009942054748535156
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4400
  num_steps_trained: 4400
  

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,50,14.9868,5000,0.888229
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,50,14.9885,5000,0.888094
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,50,14.8721,5000,0.899576
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,50,14.7232,5000,0.8949
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,50,14.8611,5000,0.848998


The trials took 23.51621699333191 seconds



In [31]:
# DataFrame.append was removed in pandas 2.0, so concatenate the per-trial frames instead.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df

num_steps_trained,mean,max,min,std
100,3.511102,3.817931,3.093602,0.3557
200,4.064315,4.49137,3.377228,0.424538
300,4.47064,4.822568,3.520875,0.543014
400,4.685584,5.06892,3.6534,0.588148
500,4.847873,5.323248,3.707895,0.650397
600,4.959078,5.494924,3.765879,0.689057
700,5.07485,5.605005,3.806786,0.737699
800,5.150581,5.696489,3.882395,0.742712
900,5.254922,5.808644,3.927877,0.773881
1000,5.321351,5.881939,3.981708,0.79023


In [27]:
from bokeh.plotting import figure, show, output_file
from bokeh.models import Band, ColumnDataSource, Range1d
import bokeh.io
# The next two lines prevent Bokeh from opening the graph in a new window.
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [32]:
df['lower'] = df['mean'] - df['std']
df['upper'] = df['mean'] + df['std']
ymin=df['lower'].min()
ymax=df['upper'].max()

source = ColumnDataSource(df.reset_index())

TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(tools=TOOLS, y_range=Range1d(ymin,ymax))

p.scatter(x='num_steps_trained', y='mean', line_color='black', fill_alpha=0.3, size=5, source=source)
band = Band(base='num_steps_trained', lower='lower', upper='upper', source=source, level='underlay',
            fill_alpha=0.3, line_width=1, line_color='blue')
p.add_layout(band)

p.title.text = "Cumulative Regret"
p.xgrid[0].grid_line_alpha=0.5
p.ygrid[0].grid_line_alpha=0.5
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Regret'

show(p)

The slope appears to stop flattening, which suggests that the previous number of steps, 2000, was sufficient to reach the optimal behavior. Beyond that point regret continues to accumulate, but it grows linearly with the number of steps, getting neither better nor worse.
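
One way to check the slope numerically is to difference the mean cumulative regret between successive reporting points. The sketch below is only an illustration: it hard-codes the ten `mean` values displayed in the table above, which cover just the first 1,000 steps, and the names `steps`, `mean_regret`, and `increments` are made up for this example.

```python
# Mean cumulative regret copied from the ten rows displayed in the table above.
steps = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
mean_regret = [3.511102, 4.064315, 4.47064, 4.685584, 4.847873,
               4.959078, 5.07485, 5.150581, 5.254922, 5.321351]

# Regret added during each 100-step interval: the discrete slope of the curve.
increments = [round(b - a, 6) for a, b in zip(mean_regret, mean_regret[1:])]

for end_step, inc in zip(steps[1:], increments):
    print(f"steps {end_step - 100}-{end_step}: +{inc:.6f}")
```

Increments that settle to a roughly constant value would indicate linear growth. Over these first ten rows they are still shrinking, so applying the same differencing to the full aggregated frame (for example with `df["mean"].diff()`) would be needed to see where the slope levels off.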

## 04: Linear Upper Confidence Bound - Exercise 2

Change the `training_iterations` back to the original value of 20 and try the `LinearDiscreteEnv` ([discrete.py source code](https://github.com/ray-project/ray/blob/master/rllib/contrib/bandits/envs/discrete.py)) as the environment instead of the `ParametricItemRecoEnv`. Also replace `UCB_CONFIG` with `DEFAULT_CONFIG_LINEAR`, which you'll need to import:

```python
from ray.rllib.contrib.bandits.envs import LinearDiscreteEnv
from ray.rllib.contrib.bandits.envs.discrete import DEFAULT_CONFIG_LINEAR
```

`LinearDiscreteEnv` samples data from linearly parameterized arms. The reward for context $X$ and arm $a$ is given by $X^T \theta_a$, for some latent (hidden) set of parameters $\theta_a : a = 1, \ldots, k$. The $\theta$ values are sampled uniformly at random, the contexts are Gaussian, and Gaussian noise is added to the rewards.
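
To make the reward model concrete, here is a minimal pure-Python sketch of a linearly parameterized bandit. It is not the RLlib implementation. The helper names (`make_thetas`, `dot`, `pull`), the uniform range of $[-1, 1]$ for $\theta$, and the definition of regret as the gap between the best and the chosen expected reward are all assumptions made for illustration. The sizes (4 arms, 8 features) and the noise level (0.01) are the defaults noted in the comments of a later cell in this notebook.

```python
import random

def make_thetas(num_actions, feature_dim, rng):
    """Sample one hidden parameter vector per arm, uniformly at random (assumed range [-1, 1])."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
            for _ in range(num_actions)]

def dot(x, theta):
    """Inner product X^T theta for two equal-length lists."""
    return sum(xi * ti for xi, ti in zip(x, theta))

def pull(thetas, context, action, noise_std, rng):
    """Return (noisy reward, regret) for choosing `action` in `context`."""
    expected = [dot(context, theta) for theta in thetas]
    reward = expected[action] + rng.gauss(0.0, noise_std)   # Gaussian reward noise
    regret = max(expected) - expected[action]               # assumed regret definition
    return reward, regret

rng = random.Random(0)  # fixed seed so the sketch is repeatable
thetas = make_thetas(num_actions=4, feature_dim=8, rng=rng)
context = [rng.gauss(0.0, 1.0) for _ in range(8)]           # Gaussian context X
reward, regret = pull(thetas, context, action=2, noise_std=0.01, rng=rng)
print(f"reward={reward:.4f}, regret={regret:.4f}")
```

In this sketch the learner never sees `thetas`. It only observes the context and the noisy reward, which is why it has to estimate each arm's parameters from experience.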

What does the cumulative regret look like?

In [42]:
from ray.rllib.contrib.bandits.envs import LinearDiscreteEnv
from ray.rllib.contrib.bandits.envs.discrete import DEFAULT_CONFIG_LINEAR

In [1]:
lde = LinearDiscreteEnv()
lde.sigma  # the default standard deviation value for the result noise

NameError: name 'LinearDiscreteEnv' is not defined

In [45]:
DEFAULT_CONFIG_LINEAR["env"] = LinearDiscreteEnv

training_iterations = 20
start_time = time.time()

analysis = tune.run(
    "contrib/LinUCB",
    config=DEFAULT_CONFIG_LINEAR,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00001,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00002,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00003,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00004,PENDING,


2020-06-10 15:47:33,783	ERROR trial_runner.py:519 -- Trial contrib_LinUCB_LinearDiscreteEnv_00000: Error processing event.
Traceback (most recent call last):
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::LinUCB.train() (pid=80938, ip=192.168.1.149)
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 462, in ray._r

(pid=80937) 2020-06-10 15:47:33,756	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=80938) 2020-06-10 15:47:33,757	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution


Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00001,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00002,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00003,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00004,RUNNING,

Trial name,# failures,error file
contrib_LinUCB_LinearDiscreteEnv_00000,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_0_2020-06-10_15-47-27rfpae3sv/error.txt


2020-06-10 15:47:33,821	ERROR trial_runner.py:519 -- Trial contrib_LinUCB_LinearDiscreteEnv_00001: Error processing event.
Traceback (most recent call last):
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::LinUCB.train() (pid=80937, ip=192.168.1.149)
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 462, in ray._r

Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00001,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00002,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00003,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00004,ERROR,

Trial name,# failures,error file
contrib_LinUCB_LinearDiscreteEnv_00000,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_0_2020-06-10_15-47-27rfpae3sv/error.txt
contrib_LinUCB_LinearDiscreteEnv_00001,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_1_2020-06-10_15-47-27iy93qn91/error.txt
contrib_LinUCB_LinearDiscreteEnv_00002,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_2_2020-06-10_15-47-28d6xj_u3v/error.txt
contrib_LinUCB_LinearDiscreteEnv_00003,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_3_2020-06-10_15-47-28c2j50oiz/error.txt
contrib_LinUCB_LinearDiscreteEnv_00004,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_4_2020-06-10_15-47-28fyl8d44r/error.txt


(pid=81678) 2020-06-10 15:47:33,913	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=81679) 2020-06-10 15:47:33,939	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=81680) 2020-06-10 15:47:33,915	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution


TuneError: ('Trials did not complete', [contrib_LinUCB_LinearDiscreteEnv_00000, contrib_LinUCB_LinearDiscreteEnv_00001, contrib_LinUCB_LinearDiscreteEnv_00002, contrib_LinUCB_LinearDiscreteEnv_00003, contrib_LinUCB_LinearDiscreteEnv_00004])

In [46]:
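# Stack the per-trial result frames into a single DataFrame.
# (pd.concat replaces DataFrame.append, which newer pandas releases no longer provide.)
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)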

df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df

num_steps_trained,mean,max,min,std
100,7.611318,9.448277,5.593304,1.464855
200,8.258989,10.153932,6.232646,1.488315
300,8.41741,10.44285,6.266983,1.584913
400,8.562657,10.681834,6.279945,1.689853
500,8.671795,10.738362,6.414168,1.698496
600,8.719458,10.779685,6.429091,1.720209
700,8.79343,10.876758,6.601496,1.691483
800,8.841272,10.902503,6.622167,1.69092
900,8.899291,10.906337,6.722436,1.675522
1000,8.941444,10.918904,6.756243,1.691254


In [47]:
df['lower'] = df['mean'] - df['std']
df['upper'] = df['mean'] + df['std']
ymin=df['lower'].min()
ymax=df['upper'].max()

source = ColumnDataSource(df.reset_index())

TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p = figure(tools=TOOLS, y_range=Range1d(ymin,ymax))

p.scatter(x='num_steps_trained', y='mean', line_color='black', fill_alpha=0.3, size=5, source=source)
band = Band(base='num_steps_trained', lower='lower', upper='upper', source=source, level='underlay',
            fill_alpha=0.3, line_width=1, line_color='blue')
p.add_layout(band)

p.title.text = "Cumulative Regret"
p.xgrid[0].grid_line_alpha=0.5
p.ygrid[0].grid_line_alpha=0.5
p.xaxis.axis_label = 'Training Steps'
p.yaxis.axis_label = 'Regret'

show(p)

The regret appears to flatten more quickly, but the standard deviation is huge! So, let's see if changing any of the parameters defined in `DEFAULT_CONFIG_LINEAR` makes a difference. 

> **NOTE:** If you change one value and run the trials, then change a different value and run again, remember to reset the first value so that each run tests a single change.
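
One way to avoid forgetting a reset is to build each experiment's settings from an untouched baseline rather than mutating a shared dictionary. The following is a hedged sketch, not part of RLlib: `BASELINE` and `with_overrides` are hypothetical names, and the three baseline values are the defaults noted in the comments of the next cell.

```python
import copy

# Hypothetical baseline holding the default values noted in the next cell's comments.
BASELINE = {"reward_noise_std": 0.01, "num_actions": 4, "feature_dim": 8}

def with_overrides(baseline, **overrides):
    """Return a fresh config with `overrides` applied; `baseline` is left unmodified."""
    config = copy.deepcopy(baseline)
    config.update(overrides)
    return config

# Each experiment changes exactly one setting relative to the baseline.
low_noise = with_overrides(BASELINE, reward_noise_std=0.001)
more_arms = with_overrides(BASELINE, num_actions=8)

print(low_noise)
print(more_arms)
print(BASELINE)
```

Because every variant starts from a deep copy, a change made for one run cannot leak into the next.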

In [64]:
DEFAULT_CONFIG_LINEAR["reward_noise_std"] = 0.001  # default 0.01
DEFAULT_CONFIG_LINEAR["num_actions"] = 4           # default 4
DEFAULT_CONFIG_LINEAR["feature_dim"] = 8           # default 8

analysis = tune.run(
    "contrib/LinUCB",
    config=DEFAULT_CONFIG_LINEAR,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False
)

Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00001,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00002,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00003,PENDING,
contrib_LinUCB_LinearDiscreteEnv_00004,PENDING,


2020-06-10 16:02:12,675	ERROR trial_runner.py:519 -- Trial contrib_LinUCB_LinearDiscreteEnv_00000: Error processing event.
Traceback (most recent call last):
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::LinUCB.train() (pid=82991, ip=192.168.1.149)
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 462, in ray._r

Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00001,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00002,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00003,RUNNING,
contrib_LinUCB_LinearDiscreteEnv_00004,RUNNING,

Trial name,# failures,error file
contrib_LinUCB_LinearDiscreteEnv_00000,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_0_2020-06-10_16-02-071h2ye8rk/error.txt


(pid=82991) 2020-06-10 16:02:12,667	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution


2020-06-10 16:02:13,341	ERROR trial_runner.py:519 -- Trial contrib_LinUCB_LinearDiscreteEnv_00002: Error processing event.
Traceback (most recent call last):
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/Users/deanwampler/anaconda3/envs/anyscale-academy/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::LinUCB.train() (pid=83043, ip=192.168.1.149)
  File "python/ray/_raylet.pyx", line 424, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 462, in ray._r

(pid=83041) 2020-06-10 16:02:13,338	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=83043) 2020-06-10 16:02:13,335	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=83042) 2020-06-10 16:02:13,338	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution


Trial name,status,loc
contrib_LinUCB_LinearDiscreteEnv_00000,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00001,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00002,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00003,ERROR,
contrib_LinUCB_LinearDiscreteEnv_00004,ERROR,

Trial name,# failures,error file
contrib_LinUCB_LinearDiscreteEnv_00000,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_0_2020-06-10_16-02-071h2ye8rk/error.txt
contrib_LinUCB_LinearDiscreteEnv_00001,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_1_2020-06-10_16-02-07v9cieqvg/error.txt
contrib_LinUCB_LinearDiscreteEnv_00002,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_2_2020-06-10_16-02-07wtuk5mi3/error.txt
contrib_LinUCB_LinearDiscreteEnv_00003,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_3_2020-06-10_16-02-078rv3syv0/error.txt
contrib_LinUCB_LinearDiscreteEnv_00004,1,/Users/deanwampler/ray_results/contrib/LinUCB/contrib_LinUCB_LinearDiscreteEnv_4_2020-06-10_16-02-072f0qh259/error.txt


(pid=83044) 2020-06-10 16:02:13,381	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution


TuneError: ('Trials did not complete', [contrib_LinUCB_LinearDiscreteEnv_00000, contrib_LinUCB_LinearDiscreteEnv_00001, contrib_LinUCB_LinearDiscreteEnv_00002, contrib_LinUCB_LinearDiscreteEnv_00003, contrib_LinUCB_LinearDiscreteEnv_00004])

In [65]:
# Stack the per-trial result frames into frame2. The original loop appended to
# `frame` instead, leaving frame2 empty and causing KeyError: 'num_steps_trained'.
frame2 = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df2 = frame2.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df2

In [57]:
df2['lower'] = df2['mean'] - df2['std']
df2['upper'] = df2['mean'] + df2['std']
ymin2=df2['lower'].min()
ymax2=df2['upper'].max()

source2 = ColumnDataSource(df2.reset_index())

TOOLS = "pan,wheel_zoom,box_zoom,reset,save"
p2 = figure(tools=TOOLS, y_range=Range1d(ymin2,ymax2))

p2.scatter(x='num_steps_trained', y='mean', line_color='black', fill_alpha=0.3, size=5, source=source2)
band2 = Band(base='num_steps_trained', lower='lower', upper='upper', source=source2, level='underlay',
            fill_alpha=0.3, line_width=1, line_color='blue')
p2.add_layout(band2)

p2.title.text = "Cumulative Regret"
p2.xgrid[0].grid_line_alpha=0.5
p2.ygrid[0].grid_line_alpha=0.5
p2.xaxis.axis_label = 'Training Steps'
p2.yaxis.axis_label = 'Regret'

show(p2)

Nope, changing the standard deviation or any of the other fields didn't make a difference.