# Ray RLlib - Multi-Armed Bandits - Exercise Solutions

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [1]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import pandas as pd
import os, time, random
import ray
from ray.tune.progress_reporter import JupyterNotebookReporter

In [3]:
!../../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [4]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15832',
 'object_store_address': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-12_08-58-38_626987_40764'}

## 03: Simple Multi-Armed Bandits - Exercise 1

First, set up a function to generate the rewards for n arms. To keep it somewhat simple, just use the original rewards for -1 in `SimpleBandit`, `[-10,0,10]` and repeat it as much as necessary, and optionally offset the start.

In [5]:
class SimpleContextualBandit2 (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        self.current_context = random.choice([-1.,1.])
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBandit2(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'
    

In [6]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBandit2(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 0, 10], 1.0: [10, 0, -10]})'

In [7]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit2,
}

In [8]:
start_time = time.time()

analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop, 
                        progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                        verbose=2,              # Change to 0 or 1 to reduce the output.
                        ray_auto_init=False)    # Don't allow Tune to initialize Ray.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextualBandit2_00000,RUNNING,


[2m[36m(pid=78530)[0m 2020-06-13 10:46:56,758	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=78530)[0m 2020-06-13 10:46:56,762	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=78530)[0m 2020-06-13 10:46:56,787	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-13_10-46-57
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.7
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: a4bf63942e5f4d5ab357e19001c0f7dd
  experiment_tag: '0'
  grad_time_ms: 0.272
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.272
    learner:
      cumulative_regret: 30.0
      update_latency: 0.00013494491577148438
    num_steps_sampled: 100
    num_steps_trained: 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBandit2_00000,TERMINATED,,2,0.38694,200,10


The trials took 8.502413749694824 seconds



In [9]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.781,0.258,0.001,...,1280.234,1.0,30.0,0.00013,,,30.0,0.00013,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


It trains just as easily as the original implementation that didn't switch contexts between steps. Is this surprising? Probably not, because the relationship between the reward and the context remains linear, so what LinUCB learns for one context is correct for the second context, too. Also, _Tune_ runs many episodes, so it studies both contexts.

## 03: Simple Multi-Armed Bandits - Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.
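To make the linear assumption concrete, here is a minimal sketch of a disjoint LinUCB learner written with NumPy. It is not RLlib's implementation: the class name `TinyLinUCB`, the `alpha` exploration weight, the identity-matrix prior, and the deterministic alternation of contexts are all assumptions made for illustration. Each arm keeps its own ridge-regression estimate `theta` and is scored by its predicted reward plus a confidence bonus.

```python
import numpy as np

class TinyLinUCB:
    """Hypothetical, minimal disjoint LinUCB (not RLlib's implementation)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                                # exploration weight (assumed value)
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm I + sum of x x^T
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm sum of reward * x

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # linear predictor for this arm
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # upper confidence term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))                     # ties go to the lowest arm index

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# The original, linear reward table from the lesson.
rewards_for_context = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}

def run_tiny_linucb(n_steps=50):
    agent = TinyLinUCB(n_arms=3, dim=2)
    rewards = []
    for t in range(n_steps):
        # Alternate contexts deterministically (an assumption; the env picks them at random).
        context = -1.0 if t % 2 == 0 else 1.0
        x = np.array([-context, context])
        arm = agent.select(x)
        reward = rewards_for_context[context][arm]
        agent.update(arm, x, reward)
        rewards.append(reward)
    return rewards

rewards = run_tiny_linucb()
print("mean reward over the last 40 steps:", np.mean(rewards[-40:]))
```

With the linear reward table, this sketch needs only a handful of exploratory pulls before it selects the best arm for each context.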

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Also remove the change made for exercise 1, the line `self.current_context = random.choice([-1.,1.])` in the `step` method.

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

In [9]:
class SimpleContextualBanditNonlinear (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {   # Changed here:
            -1.: [-10, 10, 0],
            1.: [0, 10, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBanditNonlinear(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

In [10]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], bandit = SimpleContextualBanditNonlinear(action_space=Discrete(3), observation_space=Box(2,), current_context=-1.0, rewards per context={-1.0: [-10, 10, 0], 1.0: [0, 10, -10]})'

In [11]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

current_context = -1.0
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 1, reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], action = 0, reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], action = 2, reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], action = 1, reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], action = 1, reward =   10, done = True , info = {'regret': 0}


In [12]:
# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

In [13]:
start_time = time.time()

analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop, 
                        progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                        verbose=2,  # Change to 0 or 1 to reduce the output.
                        ray_auto_init=False)    # Don't allow Tune to initialize Ray.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,


[2m[36m(pid=76391)[0m 2020-06-13 10:05:21,193	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76391)[0m 2020-06-13 10:05:21,196	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76391)[0m 2020-06-13 10:05:21,204	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-21
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.7
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.246
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.246
    learner:
      cumulative_regret: 530.0
      update_latency: 0.0003178119659423828
    num_steps_sampled: 100
    num_steps_t

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,8,1.05643,800,4.6


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-26
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 3.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 3600
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.283
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.283
    learner:
      cumulative_regret: 18500.0
      update_latency: 0.0002932548522949219
    num_steps_sampled: 3600
    num_steps_trained: 3600
    opt_peak_throughput: 3535.916
    opt_samples: 1.0
    sample_peak_throughput: 1361.699
    sample_time_ms: 0.734
    update_time_ms: 0.001
  iterations_since_restore: 36
  learner:
    cumulative_regret: 18500.0
    update_latency: 0.0002932548522949219
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3600
  num_steps_trained: 3600
  off_policy_estimator: {}
  opt_peak_throughput: 3535

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,44,5.84628,4400,5.1


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-31
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.1
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 7400
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.335
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.335
    learner:
      cumulative_regret: 37180.0
      update_latency: 0.0002689361572265625
    num_steps_sampled: 7400
    num_steps_trained: 7400
    opt_peak_throughput: 2986.545
    opt_samples: 1.0
    sample_peak_throughput: 1276.144
    sample_time_ms: 0.784
    update_time_ms: 0.001
  iterations_since_restore: 74
  learner:
    cumulative_regret: 37180.0
    update_latency: 0.0002689361572265625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7400
  num_steps_trained: 7400
  off_policy_estimator: {}
  opt_peak_throughput: 2986

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,81,10.6179,8100,4.8


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-36
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.3
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 10600
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.416
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.416
    learner:
      cumulative_regret: 53150.0
      update_latency: 0.000308990478515625
    num_steps_sampled: 10600
    num_steps_trained: 10600
    opt_peak_throughput: 2401.548
    opt_samples: 1.0
    sample_peak_throughput: 1454.184
    sample_time_ms: 0.688
    update_time_ms: 0.001
  iterations_since_restore: 106
  learner:
    cumulative_regret: 53150.0
    update_latency: 0.000308990478515625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 10600
  num_steps_trained: 10600
  off_policy_estimator: {}
  opt_peak_throughput: 

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,114,15.3349,11400,5.2


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-41
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.5
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 14300
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.426
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.426
    learner:
      cumulative_regret: 71850.0
      update_latency: 0.0002779960632324219
    num_steps_sampled: 14300
    num_steps_trained: 14300
    opt_peak_throughput: 2347.514
    opt_samples: 1.0
    sample_peak_throughput: 1266.74
    sample_time_ms: 0.789
    update_time_ms: 0.001
  iterations_since_restore: 143
  learner:
    cumulative_regret: 71850.0
    update_latency: 0.0002779960632324219
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 14300
  num_steps_trained: 14300
  off_policy_estimator: {}
  opt_peak_throughput:

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,150,20.0695,15000,5.4


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-46
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 17600
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.426
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.426
    learner:
      cumulative_regret: 88060.0
      update_latency: 0.00030612945556640625
    num_steps_sampled: 17600
    num_steps_trained: 17600
    opt_peak_throughput: 2346.464
    opt_samples: 1.0
    sample_peak_throughput: 1297.662
    sample_time_ms: 0.771
    update_time_ms: 0.001
  iterations_since_restore: 176
  learner:
    cumulative_regret: 88060.0
    update_latency: 0.00030612945556640625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 17600
  num_steps_trained: 17600
  off_policy_estimator: {}
  opt_peak_throughp

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76391,184,24.9029,18400,4.3


Result for contrib_LinUCB_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-50
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 20000
  experiment_id: f82f25d45bae4a7da26d0b5e47261df9
  experiment_tag: '0'
  grad_time_ms: 0.487
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.487
    learner:
      cumulative_regret: 100110.0
      update_latency: 0.0003261566162109375
    num_steps_sampled: 20000
    num_steps_trained: 20000
    opt_peak_throughput: 2051.907
    opt_samples: 1.0
    sample_peak_throughput: 1363.735
    sample_time_ms: 0.733
    update_time_ms: 0.001
  iterations_since_restore: 200
  learner:
    cumulative_regret: 100110.0
    update_latency: 0.0003261566162109375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 20000
  num_steps_trained: 20000
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,27.1658,20000,5


The trials took 33.04778599739075 seconds



In [14]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,5.0,1.0,100,20000,20000,0.733,0.487,0.001,...,1363.735,1.0,100110.0,0.000326,,,100110.0,0.000326,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinUCB/...


It ran the maximum of 20,000 steps, yet the reward mean never settles. It hovers around 5 (5.0 at the end), nowhere near 10.0. The `episode_reward_mean` plot is chaotic:

![Nonlinear model with LinUCB](../../../images/rllib/TensorBoard2.png)

Because LinUCB expects a linear relationship between the context and each reward, it's not surprising that it fails to converge to the desired reward mean.
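One way to check this claim is to fit a linear model to each arm's two (observation, reward) pairs and look at the leftover error. The sketch below assumes each arm's expected reward is modeled as `theta @ x` with no intercept term (RLlib's actual model may differ in detail). The helper name `max_fit_error` is hypothetical.

```python
import numpy as np

# The two observations the environments can emit: [-context, context].
contexts = [-1.0, 1.0]
X = np.array([[-c, c] for c in contexts])

linear_rewards = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}
nonlinear_rewards = {-1.0: [-10, 10, 0], 1.0: [0, 10, -10]}

def max_fit_error(rewards_for_context):
    """For each arm, the largest absolute error of the best intercept-free linear fit."""
    errors = []
    for arm in range(3):
        y = np.array([rewards_for_context[c][arm] for c in contexts], dtype=float)
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares weights for this arm
        errors.append(float(np.max(np.abs(X @ theta - y))))
    return errors

print("linear table:   ", max_fit_error(linear_rewards))
print("nonlinear table:", max_fit_error(nonlinear_rewards))
```

The two observations are negatives of each other, so an intercept-free linear model must predict rewards that are negatives of each other, too. The original table satisfies that constraint exactly. The modified table does not: no such model can predict 10 for arm 1 in both contexts.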

## 03: Simple Multi-Armed Bandits - Exercise 3

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](../02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using the linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.
```

Change `contrib/LinUCB` to `contrib/LinTS`.  
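For intuition about what that swap changes, here is a minimal sketch of Linear Thompson Sampling. It is not RLlib's `contrib/LinTS` code: the class name `TinyLinTS`, the unit-variance Gaussian posterior, and the seeded random generators are assumptions for illustration. Where LinUCB adds a confidence bonus to each arm's estimate, Thompson sampling draws a plausible weight vector for each arm from its posterior and acts greedily on the draw.

```python
import numpy as np

class TinyLinTS:
    """Hypothetical, minimal Linear Thompson Sampling (not RLlib's implementation)."""

    def __init__(self, n_arms, dim, seed=0):
        self.rng = np.random.default_rng(seed)
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm precision matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm sum of reward * x

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            cov = np.linalg.inv(A)
            # Draw one weight vector from this arm's Gaussian posterior.
            theta = self.rng.multivariate_normal(cov @ b, cov)
            scores.append(theta @ x)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# The original, linear reward table from the lesson.
rewards_for_context = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}

def run_tiny_lints(n_steps=300, seed=0):
    agent = TinyLinTS(n_arms=3, dim=2, seed=seed)
    env_rng = np.random.default_rng(seed + 1)             # separate stream for contexts
    rewards = []
    for _ in range(n_steps):
        context = float(env_rng.choice([-1.0, 1.0]))
        x = np.array([-context, context])
        arm = agent.select(x)
        reward = rewards_for_context[context][arm]
        agent.update(arm, x, reward)
        rewards.append(reward)
    return rewards

rewards = run_tiny_lints()
print("mean reward over the last 100 steps:", np.mean(rewards[-100:]))
```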

In [15]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBandit2,
}

start_time = time.time()

analysis = ray.tune.run("contrib/LinTS", config=config, stop=stop, 
                        progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                        verbose=2,  # Change to 0 or 1 to reduce the output.
                        ray_auto_init=False)    # Don't allow Tune to initialize Ray.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinTS_SimpleContextualBandit2_00000,RUNNING,


[2m[36m(pid=76405)[0m 2020-06-13 10:05:54,140	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76405)[0m 2020-06-13 10:05:54,145	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76405)[0m 2020-06-13 10:05:54,152	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBandit2_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-54
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.3
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: ea0c3d97122c4b8c9649b78b480d4dcf
  experiment_tag: '0'
  grad_time_ms: 0.263
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.263
    learner:
      cumulative_regret: 70.0
      update_latency: 0.0001361370086669922
    num_steps_sampled: 100
    num_steps_trained: 10

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBandit2_00000,TERMINATED,,2,0.21326,200,10


The trials took 4.11173415184021 seconds



In [16]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.923,0.287,0.001,...,1082.876,1.0,70.0,0.000164,,,70.0,0.000164,<class '__main__.SimpleContextualBandit2'>,/Users/deanwampler/ray_results/contrib/LinTS/c...


As before, the training takes only 200 steps and converges to the desired reward mean of `10.0`.

Now let's try the nonlinear bandit:

In [17]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

start_time = time.time()

analysis = ray.tune.run("contrib/LinTS", config=config, stop=stop, 
                        progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                        verbose=2,  # Change to 0 or 1 to reduce the output.
                        ray_auto_init=False)    # Don't allow Tune to initialize Ray.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,


[2m[36m(pid=76412)[0m 2020-06-13 10:05:59,196	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76412)[0m 2020-06-13 10:05:59,200	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76412)[0m 2020-06-13 10:05:59,215	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-05-59
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.8
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.26
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.26
    learner:
      cumulative_regret: 520.0
      update_latency: 0.0001418590545654297
    num_steps_sampled: 100
    num_steps_trai

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,3,0.319366,300,5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-04
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.0
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 3600
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.281
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.281
    learner:
      cumulative_regret: 18400.0
      update_latency: 0.00016379356384277344
    num_steps_sampled: 3600
    num_steps_trained: 3600
    opt_peak_throughput: 3563.555
    opt_samples: 1.0
    sample_peak_throughput: 1432.774
    sample_time_ms: 0.698
    update_time_ms: 0.001
  iterations_since_restore: 36
  learner:
    cumulative_regret: 18400.0
    update_latency: 0.00016379356384277344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3600
  num_steps_trained: 3600
  off_policy_estimator: {}
  opt_peak_throughput: 356

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,38,5.02932,3800,4.8


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-09
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.6
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 7300
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.302
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.302
    learner:
      cumulative_regret: 37480.0
      update_latency: 0.00019097328186035156
    num_steps_sampled: 7300
    num_steps_trained: 7300
    opt_peak_throughput: 3309.899
    opt_samples: 1.0
    sample_peak_throughput: 1437.981
    sample_time_ms: 0.695
    update_time_ms: 0.001
  iterations_since_restore: 73
  learner:
    cumulative_regret: 37480.0
    update_latency: 0.00019097328186035156
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7300
  num_steps_trained: 7300
  off_policy_estimator: {}
  opt_peak_throughput: 330

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,75,9.80871,7500,5.5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-14
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 5.5
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 11200
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.391
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.391
    learner:
      cumulative_regret: 57350.0
      update_latency: 0.00023698806762695312
    num_steps_sampled: 11200
    num_steps_trained: 11200
    opt_peak_throughput: 2555.477
    opt_samples: 1.0
    sample_peak_throughput: 1358.875
    sample_time_ms: 0.736
    update_time_ms: 0.001
  iterations_since_restore: 112
  learner:
    cumulative_regret: 57350.0
    update_latency: 0.00023698806762695312
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 11200
  num_steps_trained: 11200
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,113,14.5061,11300,5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-19
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.8
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 14300
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 2.134
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 2.134
    learner:
      cumulative_regret: 72870.0
      update_latency: 0.0008649826049804688
    num_steps_sampled: 14300
    num_steps_trained: 14300
    opt_peak_throughput: 468.669
    opt_samples: 1.0
    sample_peak_throughput: 200.271
    sample_time_ms: 4.993
    update_time_ms: 0.002
  iterations_since_restore: 143
  learner:
    cumulative_regret: 72870.0
    update_latency: 0.0008649826049804688
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 14300
  num_steps_trained: 14300
  off_policy_estimator: {}
  opt_peak_throughput: 4

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,145,19.2632,14500,5


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-24
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.6
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 17800
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.575
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.575
    learner:
      cumulative_regret: 90470.0
      update_latency: 0.0004169940948486328
    num_steps_sampled: 17800
    num_steps_trained: 17800
    opt_peak_throughput: 1739.004
    opt_samples: 1.0
    sample_peak_throughput: 1146.486
    sample_time_ms: 0.872
    update_time_ms: 0.001
  iterations_since_restore: 178
  learner:
    cumulative_regret: 90470.0
    update_latency: 0.0004169940948486328
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 17800
  num_steps_trained: 17800
  off_policy_estimator: {}
  opt_peak_throughput:

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,RUNNING,192.168.1.149:76412,179,24.0366,17900,4


Result for contrib_LinTS_SimpleContextualBanditNonlinear_00000:
  custom_metrics: {}
  date: 2020-06-13_10-06-28
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 4.9
  episode_reward_min: 0.0
  episodes_this_iter: 100
  episodes_total: 20000
  experiment_id: c1d9402d73a442bdb1da70d0de203ecc
  experiment_tag: '0'
  grad_time_ms: 0.492
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.492
    learner:
      cumulative_regret: 101450.0
      update_latency: 0.0003581047058105469
    num_steps_sampled: 20000
    num_steps_trained: 20000
    opt_peak_throughput: 2033.405
    opt_samples: 1.0
    sample_peak_throughput: 1217.752
    sample_time_ms: 0.821
    update_time_ms: 0.001
  iterations_since_restore: 200
  learner:
    cumulative_regret: 101450.0
    update_latency: 0.0003581047058105469
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 20000
  num_steps_trained: 20000
  off_policy_estimator: {}
  opt_peak_throughput

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinTS_SimpleContextualBanditNonlinear_00000,TERMINATED,,200,27.4536,20000,4.9


The trials took 34.26559376716614 seconds



In [18]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,0.0,4.9,1.0,100,20000,20000,0.821,0.492,0.001,...,1217.752,1.0,101450.0,0.000358,,,101450.0,0.000358,<class '__main__.SimpleContextualBanditNonline...,/Users/deanwampler/ray_results/contrib/LinTS/c...


This run with Thompson sampling yields similar results: the reward mean fluctuates between roughly 4 and 5.5 (4.9 at the end) and fails to converge over 20,000 steps. The TensorBoard graph is just as chaotic.

## 04: Linear Upper Confidence Bound - Exercise 1

Change the `training_iterations` from 20 to 50. Does the characteristic behavior of cumulative regret change at higher steps? (The run shown here uses 40 iterations, which is 4,000 time steps.)

In [19]:
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

In [20]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total time steps will be 40 * timesteps_per_iteration (100 by default) = 4,000
training_iterations = 40

print("Running training for %s iterations" % training_iterations)

Running training for 40 iterations


In [21]:
start_time = time.time()

analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False,
    ray_auto_init=False,
)

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,
contrib_LinUCB_ParametricItemRecoEnv_00001,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00002,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00003,PENDING,
contrib_LinUCB_ParametricItemRecoEnv_00004,PENDING,


[2m[36m(pid=76459)[0m 2020-06-13 10:06:39,415	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76459)[0m 2020-06-13 10:06:39,417	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76459)[0m 2020-06-13 10:06:39,435	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76458)[0m 2020-06-13 10:06:39,417	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76458)[0m 2020-06-13 10:06:39,419	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76458)[0m 2020-06-13 10:06:39,436	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76460)[0m 2020-06-13 10:06:39,415	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eage

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,,,,,
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:76461,1.0,0.578362,100.0,0.822345
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,,,,,


Result for contrib_LinUCB_ParametricItemRecoEnv_00002:
  custom_metrics: {}
  date: 2020-06-13_10-06-40
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9231671675046684
  episode_reward_mean: 0.860258364102751
  episode_reward_min: 0.6743905821642643
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 2a6481b6551e49f4af4870ec5d6848c2
  experiment_tag: '2'
  grad_time_ms: 0.676
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.676
    learner:
      cumulative_regret: 3.103353487167749
      update_latency: 0.00021004676818847656
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 1478.325
    opt_samples: 1.0
    sample_peak_throughput: 649.233
    sample_time_ms: 1.54
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 3.103353487167749
    update_latency: 0.00021004676818847656
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_p

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:76460,17,5.27654,1700,0.892079
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:76458,17,5.29252,1700,0.831759
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:76459,17,5.36803,1700,0.893171
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:76461,17,5.30406,1700,0.855757
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:76462,17,5.27179,1700,0.873623


Result for contrib_LinUCB_ParametricItemRecoEnv_00003:
  custom_metrics: {}
  date: 2020-06-13_10-06-45
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.8818416347487551
  episode_reward_mean: 0.856395662721718
  episode_reward_min: 0.7640262960572408
  episodes_this_iter: 100
  episodes_total: 1800
  experiment_id: c625535dd6d5410f8cc9f0f539f3bdb8
  experiment_tag: '3'
  grad_time_ms: 0.666
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.666
    learner:
      cumulative_regret: 4.8699662480696455
      update_latency: 0.0003490447998046875
    num_steps_sampled: 1800
    num_steps_trained: 1800
    opt_peak_throughput: 1501.397
    opt_samples: 1.0
    sample_peak_throughput: 753.544
    sample_time_ms: 1.327
    update_time_ms: 0.002
  iterations_since_restore: 18
  learner:
    cumulative_regret: 4.8699662480696455
    update_latency: 0.0003490447998046875
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1800
  num_steps_trained: 1800


Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,RUNNING,192.168.1.149:76460,32,9.51628,3200,0.894662
contrib_LinUCB_ParametricItemRecoEnv_00001,RUNNING,192.168.1.149:76458,33,9.61933,3300,0.837605
contrib_LinUCB_ParametricItemRecoEnv_00002,RUNNING,192.168.1.149:76459,33,10.1649,3300,0.896673
contrib_LinUCB_ParametricItemRecoEnv_00003,RUNNING,192.168.1.149:76461,33,9.66055,3300,0.85647
contrib_LinUCB_ParametricItemRecoEnv_00004,RUNNING,192.168.1.149:76462,33,9.46918,3300,0.864681


Result for contrib_LinUCB_ParametricItemRecoEnv_00004:
  custom_metrics: {}
  date: 2020-06-13_10-06-50
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 0.9056056794660591
  episode_reward_mean: 0.8672499534258682
  episode_reward_min: 0.7896459683429845
  episodes_this_iter: 100
  episodes_total: 3500
  experiment_id: 22dabc92f24f4577989c895a58f3092c
  experiment_tag: '4'
  grad_time_ms: 1.019
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 1.019
    learner:
      cumulative_regret: 6.052107350361885
      update_latency: 0.0005650520324707031
    num_steps_sampled: 3500
    num_steps_trained: 3500
    opt_peak_throughput: 981.009
    opt_samples: 1.0
    sample_peak_throughput: 638.859
    sample_time_ms: 1.565
    update_time_ms: 0.002
  iterations_since_restore: 35
  learner:
    cumulative_regret: 6.052107350361885
    update_latency: 0.0005650520324707031
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3500
  num_steps_trained: 3500
  

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_ParametricItemRecoEnv_00000,TERMINATED,,40,13.2641,4000,0.893892
contrib_LinUCB_ParametricItemRecoEnv_00001,TERMINATED,,40,13.0403,4000,0.833237
contrib_LinUCB_ParametricItemRecoEnv_00002,TERMINATED,,40,13.2846,4000,0.895149
contrib_LinUCB_ParametricItemRecoEnv_00003,TERMINATED,,40,12.9609,4000,0.857493
contrib_LinUCB_ParametricItemRecoEnv_00004,TERMINATED,,40,12.8471,4000,0.864985


The trials took 24.783339977264404 seconds



In [22]:
frame = pd.DataFrame()

for key, df in analysis.trial_dataframes.items():
    frame = frame.append(df, ignore_index=True)

df = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df

num_steps_trained,mean,max,min,std
100,3.482287,4.015056,3.103353,0.347313
200,4.029911,4.587597,3.728915,0.335497
300,4.388879,4.896651,3.925946,0.347402
400,4.658526,5.233587,4.12147,0.405123
500,4.800057,5.378828,4.276671,0.399433
600,4.92398,5.453361,4.333705,0.415956
700,5.066738,5.619135,4.440111,0.440764
800,5.18613,5.742702,4.481608,0.468715
900,5.257798,5.869861,4.548723,0.485774
1000,5.322279,5.927804,4.588776,0.490643
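
Note that `DataFrame.append`, used in the loop above, was deprecated in pandas 1.4 and removed in pandas 2.0. On newer pandas versions, a sketch like the following, which uses `pd.concat`, should produce the same kind of aggregated table. The `trial_dataframes` dictionary here is a small hypothetical stand-in for `analysis.trial_dataframes`, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for analysis.trial_dataframes (trial key -> DataFrame).
trial_dataframes = {
    "trial_0": pd.DataFrame({
        "num_steps_trained": [100, 200],
        "learner/cumulative_regret": [3.0, 4.0],
    }),
    "trial_1": pd.DataFrame({
        "num_steps_trained": [100, 200],
        "learner/cumulative_regret": [4.0, 5.0],
    }),
}

# Concatenate all per-trial frames in one call instead of repeated append().
frame = pd.concat(list(trial_dataframes.values()), ignore_index=True)

# Same aggregation as in the cell above.
regret_stats = frame.groupby("num_steps_trained")[
    "learner/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
print(regret_stats)
```

Building the list once and concatenating also avoids copying the growing frame on every loop iteration.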


In [25]:
import sys
sys.path.append("..")          # So we can load the bokeh_util from the parent directory...
sys.path.append('../../../..') # ... and line_plot functions from "util"

from bokeh_util import plot_cumulative_regret, plot_wheel_bandit_model_weights
from util.line_plots import plot_line, plot_line_with_stddev, plot_line_with_min_max

In [26]:

# The next two lines prevent Bokeh from opening the graph in a new window.
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [27]:
plot_cumulative_regret(df)

([image](../../../images/rllib/LinUCB-cumulative-regret2.png))

The slope appears to stop flattening, suggesting that the previous number of steps, 2000, was sufficient to reach the optimal behavior. Beyond that point, regret continues to accumulate, but it grows linearly with the number of steps, getting neither better nor worse.
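
As a rough numerical check on the slope, the sketch below computes how much mean cumulative regret is added in each 100-step interval. It uses only the ten rows of the aggregated table printed earlier, so it covers the first 1,000 steps; applying the same `np.diff` call to the full `df["mean"]` column would be one way to inspect the later steps. The array names are illustrative.

```python
import numpy as np

# Mean cumulative regret for the ten rows shown in the table above.
steps = np.arange(100, 1100, 100)
mean_regret = np.array([
    3.482287, 4.029911, 4.388879, 4.658526, 4.800057,
    4.923980, 5.066738, 5.186130, 5.257798, 5.322279,
])

# Regret added during each 100-step interval (a finite-difference slope).
regret_per_interval = np.diff(mean_regret)

for step, added in zip(steps[1:], regret_per_interval):
    print(f"steps {step - 100:4d} -> {step:4d}: +{added:.3f}")
```

A slope that keeps shrinking means the agent is still improving; a roughly constant slope corresponds to the linear growth described above.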

## 05: Linear Thompson Sampling - Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

In [28]:
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [29]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s iterations" % training_iterations)

Running training for 20 iterations


In [30]:
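def run_ts(delta):
    """Train LinTS on the wheel bandit for the given delta and return the
    aggregated regrets plus the learned model, arm means, and covariances."""
    # Note: this updates the shared TS_CONFIG dictionary in place.
    TS_CONFIG["delta"] = delta

    start_time = time.time()

    analysis = ray.tune.run(
        LinTSTrainer,
        config=TS_CONFIG,
        stop={"training_iteration": training_iterations},
        num_samples=2,
        checkpoint_at_end=True,
        ray_auto_init=False,
        )

    print("The trials took", time.time() - start_time, "seconds\n")

    # Combine the per-trial results into a single DataFrame.
    df = pd.DataFrame()

    for key, df_trial in analysis.trial_dataframes.items():
        df = df.append(df_trial, ignore_index=True)

    # Aggregate cumulative regret across trials at each step count.
    ts_regrets = df \
        .groupby("num_steps_trained")["learner/cumulative_regret"] \
        .aggregate(["mean", "max", "min", "std"])
    
    # Restore a trainer from the first trial's final checkpoint to inspect its model.
    trial = analysis.trials[0]
    trainer = LinTSTrainer(config=TS_CONFIG)
    trainer.restore(trial.checkpoint.value)
    
    # Per-arm weight means and covariances (the wheel bandit has 5 arms).
    model = trainer.get_policy().model
    means = [model.arms[i].theta.numpy() for i in range(5)]
    covs = [model.arms[i].covariance.numpy() for i in range(5)]

    return ts_regrets, model, means, covs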

In [31]:
delta = 0.7
ts_regrets7, model7, means7, covs7 = run_ts(delta)

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=76474)[0m 2020-06-13 10:09:15,205	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76474)[0m 2020-06-13 10:09:15,206	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76477)[0m 2020-06-13 10:09:15,205	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76477)[0m 2020-06-13 10:09:15,207	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76477)[0m 2020-06-13 10:09:15,215	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76474)[0m 2020-06-13 10:09:15,215	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-13_10-09-15
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:76477,2,0.259939,200,33.8326
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:76474,1,0.127478,100,16.696


Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-13_10-09-18
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 50.02264905218151
  episode_reward_mean: 37.750428775334136
  episode_reward_min: 0.9787377924646361
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: cd6fafa3caf34c499e41aa1e78c7a37e
  experiment_tag: '0'
  grad_time_ms: 0.328
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.328
    learner:
      cumulative_regret: 3134.2099765460766
      update_latency: 0.00029015541076660156
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 3044.425
    opt_samples: 1.0
    sample_peak_throughput: 1056.526
    sample_time_ms: 0.946
    update_time_ms: 0.001
  iterations_since_restore: 20
  learner:
    cumulative_regret: 3134.2099765460766
    update_latency: 0.00029015541076660156
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  off_policy_e

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,2.67278,2000,37.7504
LinTS_WheelBanditEnv_00001,TERMINATED,,20,2.71919,2000,38.2392


2020-06-13 10:09:18,149	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 10:09:18,155	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 10:09:18,188	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:09:18,196	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:09:18,197	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-13_10-09-103chovigs/checkpoint_20/checkpoint-20
2020-06-13 10:09:18,198	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 2.6727774143218994, '_episodes_total': 2000}


The trials took 7.720871925354004 seconds



In [32]:
plot_cumulative_regret(ts_regrets7)

([image](../../../images/rllib/LinTS-cumulative-regret-07.png))

The cumulative regret values are much higher than for $\delta = 0.5$ in the lesson, and the standard deviation is very large. We mentioned in the lesson that the problem becomes harder for higher $\delta$, which fits this result.
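
One way to see why a larger $\delta$ is harder, assuming `WheelBanditEnv` follows the usual wheel bandit construction (contexts drawn uniformly from the unit disk, with the large rewards available only when the context norm exceeds $\delta$): the fraction of contexts that fall in that outer region is $1 - \delta^2$, so informative high-reward contexts become rarer as $\delta$ grows. The sketch below checks this with a Monte Carlo estimate. It does not use RLlib, and the sampling routine is an illustration rather than the environment's actual code.

```python
import numpy as np

def outer_fraction(delta):
    """Exact fraction of the unit disk whose points have norm greater than delta."""
    return 1.0 - delta ** 2

def estimate_outer_fraction(delta, n=100_000, seed=0):
    """Monte Carlo estimate using points sampled uniformly in the unit disk."""
    rng = np.random.default_rng(seed)
    # Rejection sampling: draw from the square, keep points inside the disk.
    points = rng.uniform(-1.0, 1.0, size=(2 * n, 2))
    norms = np.linalg.norm(points, axis=1)
    norms = norms[norms <= 1.0]
    return float(np.mean(norms > delta))

for delta in (0.5, 0.7, 0.9):
    print(delta, round(outer_fraction(delta), 2),
          round(estimate_outer_fraction(delta), 3))
```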

In [33]:
plot_wheel_bandit_model_weights(means7, covs7)

([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-07.png))

Compare the separation of the clusters with the result for $\delta = 0.5$:

![image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-05.png)


In [34]:
delta = 0.9
ts_regrets9, model9, means9, covs9 = run_ts(delta)

Trial name,status,loc
LinTS_WheelBanditEnv_00000,RUNNING,
LinTS_WheelBanditEnv_00001,PENDING,


[2m[36m(pid=76476)[0m 2020-06-13 10:09:53,171	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76476)[0m 2020-06-13 10:09:53,172	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76473)[0m 2020-06-13 10:09:53,171	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76473)[0m 2020-06-13 10:09:53,172	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76476)[0m 2020-06-13 10:09:53,181	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76473)[0m 2020-06-13 10:09:53,181	INFO trainable.py:217 -- Getting current IP.
Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-13_10-09-53
  done: false
  episode_len_mean: 1.0
  episode_rewa

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,RUNNING,192.168.1.149:76473,4,0.491196,400,36.2807
LinTS_WheelBanditEnv_00001,RUNNING,192.168.1.149:76476,3,0.372519,300,31.3865


Result for LinTS_WheelBanditEnv_00000:
  custom_metrics: {}
  date: 2020-06-13_10-09-56
  done: true
  episode_len_mean: 1.0
  episode_reward_max: 50.01714874129944
  episode_reward_mean: 35.79030847841163
  episode_reward_min: 0.9871210376901981
  episodes_this_iter: 100
  episodes_total: 2000
  experiment_id: b2fc1ddd2318419caf2b5311b33c6d7e
  experiment_tag: '0'
  grad_time_ms: 0.279
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.279
    learner:
      cumulative_regret: 2699.2833997831626
      update_latency: 0.00014781951904296875
    num_steps_sampled: 2000
    num_steps_trained: 2000
    opt_peak_throughput: 3580.591
    opt_samples: 1.0
    sample_peak_throughput: 1063.087
    sample_time_ms: 0.941
    update_time_ms: 0.001
  iterations_since_restore: 20
  learner:
    cumulative_regret: 2699.2833997831626
    update_latency: 0.00014781951904296875
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2000
  num_steps_trained: 2000
  off_policy_es

Trial name,status,loc,iter,total time (s),ts,reward
LinTS_WheelBanditEnv_00000,TERMINATED,,20,3.04152,2000,35.7903
LinTS_WheelBanditEnv_00001,TERMINATED,,20,3.04471,2000,37.2604


2020-06-13 10:09:56,516	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:09:56,520	INFO trainable.py:217 -- Getting current IP.
2020-06-13 10:09:56,522	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: /Users/deanwampler/ray_results/LinTS/LinTS_WheelBanditEnv_0_2020-06-13_10-09-486ikvzt6a/checkpoint_20/checkpoint-20
2020-06-13 10:09:56,522	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 20, '_timesteps_total': 2000, '_time_total': 3.0415170192718506, '_episodes_total': 2000}


The trials took 7.924978971481323 seconds



In [35]:
plot_cumulative_regret(ts_regrets9)

([image](../../../images/rllib/LinTS-Cumulative-Regret-09.png))

The result is qualitatively the same as for $\delta = 0.7$, but the cumulative regret values are even higher.

In [36]:
plot_wheel_bandit_model_weights(means9, covs9)

([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-09.png))

## 06: Market Example - Exercise 1

Try using a `LinUCBTrainer`-based trainer. How does the annualized return compare?

In [37]:
# Some properties we'll need:
DEFAULT_MAX_INFLATION = 100.0
DEFAULT_TICKERS = ["sp500", "t.bill", "t.bond", "corp"]
DEFAULT_DATA_FILE = os.path.abspath(os.path.curdir) + '/../market.tsv'  # full path

def load_market_data (file_name):
    with open(file_name, "r") as f:
        return pd.read_table(f)

In [38]:
df = load_market_data(DEFAULT_DATA_FILE)
df

Unnamed: 0,year,inflation,sp500,t.bill,t.bond,corp
0,1928,-1.15,45.49,4.28,2.01,4.42
1,1929,0.00,-8.30,3.16,4.20,3.02
2,1930,-2.67,-23.07,7.42,7.41,3.30
3,1931,-8.93,-38.33,12.34,7.00,-7.41
4,1932,-10.30,1.85,12.68,21.28,37.78
...,...,...,...,...,...,...
87,2015,0.12,1.26,-0.07,1.16,-0.82
88,2016,1.26,10.38,-0.93,-0.56,8.99
89,2017,2.13,19.07,-1.17,0.66,7.44
90,2018,2.44,-6.51,-0.49,-2.40,-5.08


In [39]:
n_years = len(df)

In [40]:
from ray.rllib.agents.trainer import with_base_config, with_common_config
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.agents.lin_ucb import LinUCBTrainer
import ray
from gym.utils import seeding  # used by MarketBandit.seed() below

In [41]:
class MarketBandit (gym.Env):

    def __init__ (self, config=None):
        config = config or {}  # avoid a shared mutable default argument
        self.max_inflation = config.get('max-inflation', DEFAULT_MAX_INFLATION)
        self.tickers = config.get('tickers', DEFAULT_TICKERS)
        self.data_file = config.get('data-file', DEFAULT_DATA_FILE)
        print(f"MarketBandit: max_inflation: {self.max_inflation}, tickers: {self.tickers}, data file: {self.data_file} (config: {config})")

        self.action_space = Discrete(4)
        self.observation_space = Box(
            low  = -self.max_inflation,
            high =  self.max_inflation,
            shape=(1, )
        )
        self.df = load_market_data(self.data_file)
        self.cur_context = None


    def reset (self):
        self.year = self.df["year"].min()
        self.cur_context = self.df.loc[self.df["year"] == self.year]["inflation"][0]
        self.done = False
        self.info = {}

        return [self.cur_context]


    def step (self, action):
        if self.done:
            reward = 0.
            regret = 0.
        else:
            row = self.df.loc[self.df["year"] == self.year]

            # calculate reward
            ticker = self.tickers[action]
            reward = float(row[ticker])

            # calculate regret
            max_reward = max(map(lambda t: float(row[t]), self.tickers))
            regret = round(max_reward - reward)

            # update the context
            self.cur_context = float(row["inflation"])

            # increment the year
            self.year += 1

            if self.year >= self.df["year"].max():
                self.done = True

        context = [self.cur_context]
        #context = self.observation_space.sample()

        self.info = {
            "regret": regret,
            "year": self.year
        }

        return [context, reward, self.done, self.info]


    def seed (self, seed=None):
        """Sets the seed for this env's random number generator(s).
        Note:
            Some environments use multiple pseudorandom number generators.
            We want to capture all such seeds used in order to ensure that
            there aren't accidental correlations between multiple generators.
        Returns:
            list<bigint>: Returns the list of seeds used in this env's random
              number generators. The first value in the list should be the
              "main" seed, or the value which a reproducer should pass to
              'seed'. Often, the main seed equals the provided 'seed', but
              this won't be true if seed=None, for example.
        """
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
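
To make the regret calculation in `step` concrete, here is a small worked sketch using the 1928 row from the market data shown above. It mirrors the reward and regret lines of `step` on a plain dictionary; the `row` dictionary and the chosen action are illustrative assumptions, not part of the environment.

```python
tickers = ["sp500", "t.bill", "t.bond", "corp"]

# Returns (in percent) for 1928, copied from the first row of the market data.
row = {"sp500": 45.49, "t.bill": 4.28, "t.bond": 2.01, "corp": 4.42}

action = 1  # hypothetical choice: the agent picks "t.bill"
reward = float(row[tickers[action]])

# Regret is the gap to the best ticker that year, rounded as in step().
max_reward = max(float(row[t]) for t in tickers)
regret = round(max_reward - reward)

print(f"reward={reward}, best={max_reward}, regret={regret}")
```

Because the regret is rounded to an integer at every step, the cumulative regret reported by the trainer is a sum of whole numbers, which matches the integer-valued `cumulative_regret` entries in the trial output.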

In [42]:
market_config = with_base_config(UCB_CONFIG, {
    "env":           MarketBandit,
    'max-inflation': DEFAULT_MAX_INFLATION,
    'tickers':       DEFAULT_TICKERS,
    'data-file':     DEFAULT_DATA_FILE
})

stop = {
    "training_iteration": 100
}

In [43]:
MarketLinUCBTrainer = LinUCBTrainer.with_updates(
    name="MarketLinUCBTrainer",
    default_config=market_config,      # Will be merged with Trainer.COMMON_CONFIG (rllib/agents/trainer.py)
    #default_policy=[somePolicyClass]  # If we had a policy...
)

In [44]:
analysis = ray.tune.run(
    MarketLinUCBTrainer,
    config=market_config,
    stop=stop,
    num_samples=3,    
    checkpoint_at_end=True,
    verbose=2,            # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,    # Don't allow Tune to initialize Ray.
)

Trial name,status,loc
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,
MarketLinUCBTrainer_MarketBandit_00001,PENDING,
MarketLinUCBTrainer_MarketBandit_00002,PENDING,


[2m[36m(pid=76475)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/solutions/../market.tsv (config: {})
[2m[36m(pid=76472)[0m 2020-06-13 10:14:38,586	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76472)[0m 2020-06-13 10:14:38,590	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76472)[0m 2020-06-13 10:14:38,607	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76472)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/solutions/../market.tsv (config: {})
[2m[36m(pid=76475)[0m 2020-06-13 10:14:38,586	INFO trainer.py:421 -- Tip: set '

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,1.0,0.474065,100.0,342.59
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,,,,,
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,,,,,


Result for MarketLinUCBTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-13_10-14-39
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 342.5899999999999
  episode_reward_min: 342.5899999999999
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: f7f614d6c85c4fe0aa31e38c17e3ee8b
  experiment_tag: '2'
  grad_time_ms: 0.29
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.29
    learner:
      cumulative_regret: 1206.0
      update_latency: 0.00017786026000976562
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 3447.279
    opt_samples: 1.0
    sample_peak_throughput: 603.306
    sample_time_ms: 1.658
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 1206.0
    update_latency: 0.00017786026000976562
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimator: {}
  opt_peak_t

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,19,5.25636,1900,339.417
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,192.168.1.149:76475,18,4.90777,1800,339.426
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,192.168.1.149:76472,19,5.24229,1900,339.417


Result for MarketLinUCBTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-13_10-14-44
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 339.4169999999999
  episode_reward_min: 339.2499999999999
  episodes_this_iter: 1
  episodes_total: 20
  experiment_id: 9bdf32c76324416b9b1f0ed734f7137b
  experiment_tag: '1'
  grad_time_ms: 0.455
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.455
    learner:
      cumulative_regret: 21446.0
      update_latency: 0.00021505355834960938
    num_steps_sampled: 1900
    num_steps_trained: 1900
    opt_peak_throughput: 2196.316
    opt_samples: 1.0
    sample_peak_throughput: 415.64
    sample_time_ms: 2.406
    update_time_ms: 0.002
  iterations_since_restore: 19
  learner:
    cumulative_regret: 21446.0
    update_latency: 0.00021505355834960938
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1900
  num_steps_trained: 1900
  off_policy_estimator: {}
  o

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,35,10.1404,3500,339.338
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,192.168.1.149:76475,35,9.97874,3500,339.338
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,192.168.1.149:76472,36,10.2242,3600,339.336


Result for MarketLinUCBTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-13_10-14-49
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 339.33564102564094
  episode_reward_min: 339.2499999999999
  episodes_this_iter: 1
  episodes_total: 39
  experiment_id: 9bdf32c76324416b9b1f0ed734f7137b
  experiment_tag: '1'
  grad_time_ms: 0.414
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.414
    learner:
      cumulative_regret: 40693.0
      update_latency: 0.00019884109497070312
    num_steps_sampled: 3600
    num_steps_trained: 3600
    opt_peak_throughput: 2416.074
    opt_samples: 1.0
    sample_peak_throughput: 536.143
    sample_time_ms: 1.865
    update_time_ms: 0.002
  iterations_since_restore: 36
  learner:
    cumulative_regret: 40693.0
    update_latency: 0.00019884109497070312
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 3600
  num_steps_trained: 3600
  off_policy_estimator: {}
 

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,52,15.0298,5200,339.309
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,192.168.1.149:76475,52,14.8944,5200,339.309
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,192.168.1.149:76472,53,15.0954,5300,339.308


Result for MarketLinUCBTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-13_10-14-54
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 339.3075862068965
  episode_reward_min: 339.2499999999999
  episodes_this_iter: 1
  episodes_total: 58
  experiment_id: 9bdf32c76324416b9b1f0ed734f7137b
  experiment_tag: '1'
  grad_time_ms: 0.371
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.371
    learner:
      cumulative_regret: 59810.0
      update_latency: 0.00028896331787109375
    num_steps_sampled: 5300
    num_steps_trained: 5300
    opt_peak_throughput: 2697.475
    opt_samples: 1.0
    sample_peak_throughput: 506.565
    sample_time_ms: 1.974
    update_time_ms: 0.001
  iterations_since_restore: 53
  learner:
    cumulative_regret: 59810.0
    update_latency: 0.00028896331787109375
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 5300
  num_steps_trained: 5300
  off_policy_estimator: {}
  

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,69,19.9528,6900,339.295
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,192.168.1.149:76475,69,19.8022,6900,339.295
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,192.168.1.149:76472,70,20.0296,7000,339.294


Result for MarketLinUCBTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-13_10-14-59
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 339.293947368421
  episode_reward_min: 339.2499999999999
  episodes_this_iter: 1
  episodes_total: 76
  experiment_id: 9bdf32c76324416b9b1f0ed734f7137b
  experiment_tag: '1'
  grad_time_ms: 0.403
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.403
    learner:
      cumulative_regret: 78938.0
      update_latency: 0.00023221969604492188
    num_steps_sampled: 7000
    num_steps_trained: 7000
    opt_peak_throughput: 2482.277
    opt_samples: 1.0
    sample_peak_throughput: 547.574
    sample_time_ms: 1.826
    update_time_ms: 0.002
  iterations_since_restore: 70
  learner:
    cumulative_regret: 78938.0
    update_latency: 0.00023221969604492188
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7000
  num_steps_trained: 7000
  off_policy_estimator: {}
  o

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,RUNNING,192.168.1.149:76471,87,24.9684,8700,339.285
MarketLinUCBTrainer_MarketBandit_00001,RUNNING,192.168.1.149:76475,88,25.2606,8800,339.285
MarketLinUCBTrainer_MarketBandit_00002,RUNNING,192.168.1.149:76472,87,24.8037,8700,339.285


Result for MarketLinUCBTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-13_10-15-05
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 342.5899999999999
  episode_reward_mean: 339.2844329896907
  episode_reward_min: 339.2499999999999
  episodes_this_iter: 1
  episodes_total: 97
  experiment_id: f7f614d6c85c4fe0aa31e38c17e3ee8b
  experiment_tag: '2'
  grad_time_ms: 0.41
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.41
    learner:
      cumulative_regret: 100414.0
      update_latency: 0.00023174285888671875
    num_steps_sampled: 8900
    num_steps_trained: 8900
    opt_peak_throughput: 2441.246
    opt_samples: 1.0
    sample_peak_throughput: 547.416
    sample_time_ms: 1.827
    update_time_ms: 0.002
  iterations_since_restore: 89
  learner:
    cumulative_regret: 100414.0
    update_latency: 0.00023174285888671875
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 8900
  num_steps_trained: 8900
  off_policy_estimator: {}
  

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinUCBTrainer_MarketBandit_00000,TERMINATED,,100,29.5877,10000,339.25
MarketLinUCBTrainer_MarketBandit_00001,TERMINATED,,100,29.4127,10000,339.25
MarketLinUCBTrainer_MarketBandit_00002,TERMINATED,,100,29.4084,10000,339.25


In [45]:
# Combine the per-trial result DataFrames into a single DataFrame.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df_ts = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df_ts.head()

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/update_time_ms,info/opt_peak_throughput,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency
0,342.59,342.59,342.59,91.0,1,100,100,1.64,0.319,0.001,...,0.001,3138.04,609.575,1.0,1206.0,0.000166,28.4,65.9,1206.0,0.000166
1,342.59,339.25,340.92,91.0,1,200,200,1.879,0.342,0.002,...,0.002,2921.435,532.259,1.0,2334.0,0.000171,57.0,64.7,2334.0,0.000171
2,342.59,339.25,340.363333,91.0,1,300,300,5.626,0.872,0.002,...,0.002,1146.925,177.75,1.0,3493.0,0.000295,,,3493.0,0.000295
3,342.59,339.25,340.085,91.0,1,400,400,1.556,0.291,0.001,...,0.001,3435.42,642.736,1.0,4656.0,0.000154,,,4656.0,0.000154
4,342.59,339.25,339.918,91.0,1,500,500,1.604,0.315,0.003,...,0.003,3177.262,623.41,1.0,5763.0,0.000187,25.0,64.7,5763.0,0.000187


In [46]:
rewards = df_ts \
    .groupby("num_steps_trained")["episode_reward_mean"] \
    .aggregate(["mean", "max", "min", "std"])

rewards

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,342.590000,342.590000,342.590000,0.0
200,340.920000,340.920000,340.920000,0.0
300,340.363333,340.363333,340.363333,0.0
400,340.085000,340.085000,340.085000,0.0
500,339.918000,339.918000,339.918000,0.0
...,...,...,...,...
9600,339.250000,339.250000,339.250000,0.0
9700,339.250000,339.250000,339.250000,0.0
9800,339.250000,339.250000,339.250000,0.0
9900,339.250000,339.250000,339.250000,0.0


In [47]:
regrets = df_ts \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

regrets

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,1206.0,1206.0,1206.0,0.0
200,2334.0,2334.0,2334.0,0.0
300,3493.0,3493.0,3493.0,0.0
400,4656.0,4656.0,4656.0,0.0
500,5763.0,5763.0,5763.0,0.0
...,...,...,...,...
9600,108363.0,108363.0,108363.0,0.0
9700,109489.0,109489.0,109489.0,0.0
9800,110555.0,110555.0,110555.0,0.0
9900,111693.0,111693.0,111693.0,0.0


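The results for _LinTS_ were ~570 for the reward mean, and the cumulative regret stayed under 10000. Here, with _LinUCB_, the reward mean settles at ~339 and the cumulative regret climbs past 110000, so training with _LinUCB_ is much less successful.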
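To make the comparison concrete, here is a minimal sketch that pulls the final values out of aggregated tables like `rewards` and `regrets`. The numbers are copied from the last rows shown above. The helper `final_value` is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Last two rows of the `rewards` and `regrets` tables shown above (LinUCB run).
steps = pd.Index([9800, 9900], name="num_steps_trained")
rewards_tail = pd.DataFrame({"mean": [339.25, 339.25]}, index=steps)
regrets_tail = pd.DataFrame({"mean": [110555.0, 111693.0]}, index=steps)

def final_value(df, col="mean"):
    """Hypothetical helper: return the value of `col` in the last row of `df`."""
    return df[col].iloc[-1]

final_reward = final_value(rewards_tail)
final_regret = final_value(regrets_tail)

# Average regret added per training step over the whole run.
regret_per_step = final_regret / regrets_tail.index[-1]

print(f"final reward mean: {final_reward:.2f}")
print(f"final cumulative regret: {final_regret:.0f}")
print(f"average regret per step: {regret_per_step:.2f}")
```

The regret table grows by roughly the same amount every 100 steps (1206 at step 100, 111693 at step 9900), which suggests the policy is not improving as training proceeds.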

In [48]:
plot_line_with_stddev(rewards, x_col='num_steps_trained', y_col='mean', stddev_col='std', 
                      title='Rewards vs. Steps', x_axis_label='step', y_axis_label='reward')

In [49]:
plot_cumulative_regret(regrets)

[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,454	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,456	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76836)[0m 2020-06-13 10:25:09,483	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,448	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,449	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=76835)[0m 2020-06-13 10:25:09,465	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=76834)[0m 2020-06-13 10:25:09,455	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eage

What's the annualized return?

In [54]:
print("{:5.2f}% optimized return annualized".format(max(rewards["mean"]) / n_years))

 3.72% optimized return annualized
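Note that `max(rewards["mean"])` picks the largest per-step mean, which in the `rewards` table is the first row (342.59), not the value the trials converge to (339.25). The sketch below redoes the arithmetic for both. `n_years` is defined outside this section, so the value 92 used here is an assumption, chosen only because it reproduces the printed 3.72%.

```python
# Assumption: n_years is not shown in this section; 92 reproduces the printed 3.72%.
n_years = 92

peak_mean_reward = 342.59       # first row of the `rewards` table (num_steps_trained = 100)
converged_mean_reward = 339.25  # last rows of the `rewards` table

# Simple annualization: total reward divided by the number of years.
peak_annualized = peak_mean_reward / n_years
converged_annualized = converged_mean_reward / n_years

print("{:5.2f}% annualized from the peak mean reward".format(peak_annualized))
print("{:5.2f}% annualized from the converged mean reward".format(converged_annualized))
```

Either way, the figure is a simple per-year average (total divided by years), not a compounded rate.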


The result is almost the same as the completely random choices investigated in the lesson!

The market that we're modeling doesn't exhibit a linear relationship between the context (inflation, in our case) and the rewards. Hence, it's not too surprising that a linear algorithm fails to model the behavior well. What's interesting here is that Thompson Sampling did a noticeably better job than Upper Confidence Bound.
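As a rough illustration of why a linear model can struggle, the sketch below fits a straight line to a deliberately non-linear relationship. The data are synthetic and hypothetical. They are not the market data used in this notebook, and the fit is plain least squares, not _LinUCB_ or _LinTS_.

```python
import numpy as np

# Synthetic, hypothetical data: NOT the market data from this notebook.
x = np.linspace(-1.0, 1.0, 201)   # stand-in for a one-dimensional context feature
y = 1.0 - x**2                    # reward peaks at x = 0, so it is non-linear in x

# Ordinary least-squares fit of y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

# R^2: fraction of the variance in y that the linear fit explains.
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

Because the synthetic reward is symmetric around zero, the best-fitting line is essentially flat and explains almost none of the variance. A linear bandit facing a similarly shaped payoff has little signal to exploit.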