# Ray RLlib Multi-Armed Bandits - Market Bandit Example

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Now that we've learned about multi-armed bandits and methods for optimizing rewards, let's look at real-world applications, starting with a stock market example. We'll also learn a little more about configuring RLlib trainers.

We'll load a dataset derived from this [NYU Stern table](http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/histretSP.html) that shows returns for nearly a century of market data, including dividends and adjustments for inflation. The `market.tsv` file in this folder contains the data.

In [1]:
import pandas as pd
import numpy as np
import os, sys

In [2]:
# Some properties we'll need:
DEFAULT_MAX_INFLATION = 100.0
DEFAULT_TICKERS = ["sp500", "t.bill", "t.bond", "corp"]
DEFAULT_DATA_FILE = os.path.abspath(os.path.curdir) + '/market.tsv'  # full path

In [3]:
def load_market_data (file_name):
    with open(file_name, "r") as f:
        return pd.read_table(f)

In [4]:
df = load_market_data(DEFAULT_DATA_FILE)
df

Unnamed: 0,year,inflation,sp500,t.bill,t.bond,corp
0,1928,-1.15,45.49,4.28,2.01,4.42
1,1929,0.00,-8.30,3.16,4.20,3.02
2,1930,-2.67,-23.07,7.42,7.41,3.30
3,1931,-8.93,-38.33,12.34,7.00,-7.41
4,1932,-10.30,1.85,12.68,21.28,37.78
...,...,...,...,...,...,...
87,2015,0.12,1.26,-0.07,1.16,-0.82
88,2016,1.26,10.38,-0.93,-0.56,8.99
89,2017,2.13,19.07,-1.17,0.66,7.44
90,2018,2.44,-6.51,-0.49,-2.40,-5.08


As you can see, the data spans 92 years, from 1928 to 2019. The columns represent:
  * the year
  * inflation rate at the time
  * [S&P500](https://en.wikipedia.org/wiki/S%26P_500_Index) (composite stock index)
  * [Treasury Bills](https://www.investopedia.com/terms/t/treasurybill.asp) (short-term gov bonds)
  * [Treasury Bonds](https://www.investopedia.com/terms/t/treasurybond.asp) (long-term gov bonds)
  * [Moody's Baa Corporate Bonds](https://en.wikipedia.org/wiki/Moody%27s_Investors_Service#Moody's_credit_ratings) (moderate risk)

In [5]:
df.describe()

Unnamed: 0,year,inflation,sp500,t.bill,t.bond,corp
count,92.0,92.0,92.0,92.0,92.0,92.0
mean,1973.5,3.041957,8.413261,0.434239,2.166413,4.21663
std,26.70206,3.803579,19.619605,3.573035,8.126432,8.625809
min,1928.0,-10.3,-38.9,-12.05,-14.57,-14.85
25%,1950.75,1.415,-2.74,-1.185,-2.62,-1.3225
50%,1973.5,2.75,10.515,0.59,1.07,3.91
75%,1996.25,4.275,20.6225,2.1175,7.0375,9.2875
max,2019.0,14.39,58.2,12.68,25.14,37.78


"Corp" refers to corporate bonds.

## Analysis of the Data

What are the worst-case and best-case scenarios? In other words, if one could predict future market performance, what is the range between total failure and total success over the past century? By "total", we mean having all your money in a given year invested in the worst-performing _sector_ (S&P 500, T-bills, ...) or in the best-performing sector for that year.

In [6]:
n_years = len(df)
min_list = []
max_list = []

for i in range(n_years):
    row = df.iloc[i, 2:]
    min_list.append(min(row))
    max_list.append(max(row))
    
print("{:5.2f}% worst case annualized".format(sum(min_list) / n_years))
print("{:5.2f}% best case annualized".format(sum(max_list) / n_years))

-5.64% worst case annualized
15.18% best case annualized
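
The per-year loop above can also be written with vectorized pandas operations. The following is a minimal sketch, not part of the original notebook: it rebuilds only the first five rows shown in the table earlier as an inline `DataFrame` (the name `sample` is ours), so its averages differ from the full 92-year figures.

```python
import pandas as pd

TICKERS = ["sp500", "t.bill", "t.bond", "corp"]

# First five rows of market.tsv, copied from the table shown earlier.
sample = pd.DataFrame({
    "year":      [1928, 1929, 1930, 1931, 1932],
    "inflation": [-1.15, 0.00, -2.67, -8.93, -10.30],
    "sp500":     [45.49, -8.30, -23.07, -38.33, 1.85],
    "t.bill":    [4.28, 3.16, 7.42, 12.34, 12.68],
    "t.bond":    [2.01, 4.20, 7.41, 7.00, 21.28],
    "corp":      [4.42, 3.02, 3.30, -7.41, 37.78],
})

# Row-wise minimum and maximum over the ticker columns only
# (year and inflation are excluded, as in the loop version).
worst = sample[TICKERS].min(axis=1)
best = sample[TICKERS].max(axis=1)

print("{:5.2f}% worst case (5-year sample)".format(worst.mean()))
print("{:5.2f}% best case (5-year sample)".format(best.mean()))
```

The per-year values in `worst` and `best` line up with the first five rows of the `min_max` table shown further down.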


In [7]:
sys.path.append('../..')
from util.line_plots import plot_line, plot_line_with_stddev, plot_between_lines
from bokeh_util import plot_cumulative_regret

In [8]:
import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [9]:
min_max = pd.DataFrame.from_dict({'year': df['year'], 'min':min_list, 'max':max_list})
min_max

Unnamed: 0,year,min,max
0,1928,2.01,45.49
1,1929,-8.30,4.20
2,1930,-23.07,7.42
3,1931,-38.33,12.34
4,1932,1.85,37.78
...,...,...,...
87,2015,-0.82,1.26
88,2016,-0.93,10.38
89,2017,-1.17,19.07
90,2018,-6.51,-0.49


In [10]:
plot_between_lines(min_max, x_col='year', lower_col='min', upper_col='max', 
                   title='Best to Worst', x_axis_label='year', y_axis_label='%')

## Defining an Environment

Now let's define a Gym environment so that we can train a contextual bandit to optimize annual investments over that period.

In [11]:
import gym
from gym.spaces import Discrete, Box
from gym.utils import seeding
import numpy as np
import random

This is the bandit we'll use to represent the market "environment".

In [12]:
class MarketBandit (gym.Env):
    
    def __init__ (self, config=None):
        config = config or {}  # avoid a shared mutable default argument
        self.max_inflation = config.get('max-inflation', DEFAULT_MAX_INFLATION)
        self.tickers = config.get('tickers', DEFAULT_TICKERS)
        self.data_file = config.get('data-file', DEFAULT_DATA_FILE)
        print(f"MarketBandit: max_inflation: {self.max_inflation}, tickers: {self.tickers}, data file: {self.data_file} (config: {config})")

        self.action_space = Discrete(len(self.tickers))  # one arm per ticker
        self.observation_space = Box(
            low  = -self.max_inflation,
            high =  self.max_inflation,
            shape=(1, )
        )
        self.df = load_market_data(self.data_file)
        self.cur_context = None


    def reset (self):
        self.year = self.df["year"].min()
        self.cur_context = self.df.loc[self.df["year"] == self.year]["inflation"][0]
        self.done = False
        self.info = {}

        return [self.cur_context]


    def step (self, action):
        if self.done:
            reward = 0.
            regret = 0.
        else:
            row = self.df.loc[self.df["year"] == self.year]

            # calculate reward
            ticker = self.tickers[action]
            reward = float(row[ticker])

            # calculate regret
            max_reward = max(map(lambda t: float(row[t]), self.tickers))
            regret = round(max_reward - reward)

            # update the context
            self.cur_context = float(row["inflation"])

            # increment the year
            self.year += 1

            if self.year >= self.df["year"].max():
                self.done = True

        context = [self.cur_context]
        #context = self.observation_space.sample()

        self.info = {
            "regret": regret,
            "year": self.year
        }
         
        return [context, reward, self.done, self.info]


    def seed (self, seed=None):
        """Sets the seed for this env's random number generator(s).
        Note:
            Some environments use multiple pseudorandom number generators.
            We want to capture all such seeds used in order to ensure that
            there aren't accidental correlations between multiple generators.
        Returns:
            list<bigint>: Returns the list of seeds used in this env's random
              number generators. The first value in the list should be the
              "main" seed, or the value which a reproducer should pass to
              'seed'. Often, the main seed equals the provided 'seed', but
              this won't be true if seed=None, for example.
        """
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
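
One consequence of the termination check in `step()` is worth spelling out: the year is incremented before it is compared with the last year in the data, so the final year is never played. The sketch below isolates just that counter logic (the helper `count_episode_steps` is ours, for illustration only) and counts how many steps one episode takes.

```python
def count_episode_steps(first_year, last_year):
    """Mimic the year counter in MarketBandit.step() and count the steps."""
    year = first_year
    steps = 0
    done = False
    while not done:
        steps += 1           # one call to step() plays the current year
        year += 1            # then the year is incremented...
        if year >= last_year:
            done = True      # ...and the episode ends once it reaches the last year
    return steps


episode_len = count_episode_steps(1928, 2019)
print(episode_len)
```

For data running from 1928 to 2019 this gives 91 steps per episode, one fewer than the 92 rows in the data.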

Let's see it in action:

In [13]:
bandit = MarketBandit()
bandit.reset()

for i in range(10):
    action = bandit.action_space.sample()
    obs = bandit.step(action)
    print(action, obs)

MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/paco/src/academy/ray-rllib/multi-armed-bandits/market.tsv (config: {})
1 [[-1.15], 4.28, False, {'regret': 41, 'year': 1929}]
1 [[0.0], 3.16, False, {'regret': 1, 'year': 1930}]
0 [[-2.67], -23.07, False, {'regret': 30, 'year': 1931}]
2 [[-8.93], 7.0, False, {'regret': 5, 'year': 1932}]
3 [[-10.3], 37.78, False, {'regret': 0, 'year': 1933}]
3 [[-5.19], 19.15, False, {'regret': 39, 'year': 1934}]
3 [[3.48], 14.82, False, {'regret': 0, 'year': 1935}]
2 [[2.55], 1.87, False, {'regret': 41, 'year': 1936}]
1 [[1.03], -0.85, False, {'regret': 31, 'year': 1937}]
0 [[3.73], -37.66, False, {'regret': 35, 'year': 1938}]
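
To make the `regret` values in this output concrete, here is a small worked example that mirrors the arithmetic in `step()` for the first row of the data (1928). The helper name `regret_for` is ours; the returns are copied from the table shown earlier.

```python
TICKERS = ["sp500", "t.bill", "t.bond", "corp"]

# Returns for 1928, copied from the first row of the market data.
returns_1928 = {"sp500": 45.49, "t.bill": 4.28, "t.bond": 2.01, "corp": 4.42}


def regret_for(action, returns, tickers=TICKERS):
    """Return (reward, rounded regret) for choosing `action` in one year."""
    reward = returns[tickers[action]]
    best = max(returns[t] for t in tickers)
    return reward, round(best - reward)


reward, regret = regret_for(1, returns_1928)  # action 1 is "t.bill"
print(reward, regret)
```

Choosing T-bills in 1928 earns 4.28 while the S&P 500 earned 45.49, so the rounded regret is 41, matching the first line of the output above.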




We can use this environment in a kind of *Monte Carlo simulation* to measure a baseline: what the rewards would be over a long period if you always chose a random action.

In [14]:
done = True   # forces a reset on the first iteration
reward_list = []
iterations = 10000 #50000

for i in range(iterations):
    if done:
        bandit.reset()

    action = bandit.action_space.sample()
    context, reward, done, info = bandit.step(action)
    reward_list.append(reward)
    #print(action, context, reward, done, info)

In [15]:
df_mc = pd.DataFrame(reward_list, columns=["reward"])
df_mc.mean()

reward    3.726313
dtype: float64

Depending on the number of iterations, you'll probably get a value approaching 3.75% as a baseline for random actions. That's more than the -5.64% worst case and much less than the 15.18% best case for the reward!
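
As a sanity check on the simulated baseline, a uniformly random policy should earn, on average, the mean of the four tickers' average returns. The sketch below computes that from the `mean` row of the `df.describe()` output above (values copied by hand). Note that those means cover all 92 years, whereas an episode in `MarketBandit` appears to stop before playing the final year, so the simulated number need not match exactly.

```python
# Mean annual return per ticker, copied from the df.describe() output.
ticker_means = {
    "sp500": 8.413261,
    "t.bill": 0.434239,
    "t.bond": 2.166413,
    "corp": 4.21663,
}

# Each arm is picked with probability 1/4, so the expected reward
# is the plain average of the per-ticker means.
baseline = sum(ticker_means.values()) / len(ticker_means)
print("{:5.2f}% expected return for random actions".format(baseline))
```

This comes out to about 3.81%, in the same neighborhood as the simulated value.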

In [16]:
plot_line(df_mc, x_col='index', y_col='reward', title='Reward Over Time')

([image](../../images/rllib/MarketReward-Random.png))

Yes, it looks quite random...

## Training a policy in RLlib

Now let's train a policy using our contextual bandit, specifically using _Linear Thompson Sampling_ in RLlib. Hopefully it will do better than the random results we just computed!

Recall that the `__init__()` method of `MarketBandit` sets some parameters from the passed-in `config` object (with defaults). We don't construct the environment explicitly ourselves; RLlib does this. So, we need to construct the canonical `config` object we want to use. To do this, we use the idioms shown in the next several cells:

In [17]:
from ray.rllib.agents.trainer import with_base_config, with_common_config
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.agents.lin_ts import LinTSTrainer
import ray

Initialize Ray as required:

In [18]:
!../../tools/start-ray.sh --check --verbose

INFO: Ray is already running.


In [19]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.244',
 'raylet_ip_address': '192.168.1.244',
 'redis_address': '192.168.1.244:42572',
 'object_store_address': '/tmp/ray/session_2020-06-14_18-38-05_207638_28375/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-14_18-38-05_207638_28375/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-14_18-38-05_207638_28375'}

We need a custom config object with our parameters for `MarketBandit`. We do this building on the default `TS_CONFIG` object for _LinTS_:

In [20]:
market_config = with_base_config(TS_CONFIG, {
    "env":           MarketBandit,
    'max-inflation': DEFAULT_MAX_INFLATION,
    'tickers':       DEFAULT_TICKERS,
    'data-file':     DEFAULT_DATA_FILE
})

stop = {
    "training_iteration": 100
}
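
The way `MarketBandit.__init__()` resolves its settings can be illustrated without RLlib. The sketch below uses a hypothetical helper, `resolve_env_settings`, to show the `config.get(key, default)` pattern: keys present in the passed-in dictionary win, and everything else falls back to the defaults. As far as we know, RLlib hands the environment constructor the dictionary stored under the trainer config's `env_config` key, which may explain why the `MarketBandit:` log lines in this notebook print `config: {}`; treat that as an assumption to verify against your Ray version.

```python
# Defaults mirroring the DEFAULT_* constants defined at the top of the notebook.
DEFAULTS = {
    "max-inflation": 100.0,
    "tickers": ["sp500", "t.bill", "t.bond", "corp"],
}


def resolve_env_settings(config=None):
    """Hypothetical helper: merge a user config over the defaults."""
    config = config or {}
    return {key: config.get(key, default) for key, default in DEFAULTS.items()}


print(resolve_env_settings())                         # all defaults
print(resolve_env_settings({"max-inflation": 50.0}))  # one override
```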

Next, we'll define a custom trainer that builds on `LinTSTrainer` with "updates". It's the first argument we'll pass to `tune.run()` below. When all we need is `LinTSTrainer` as is, with no extra custom config settings, we can just pass the string `contrib/LinTS` to `tune.run()`.

In [21]:
MarketLinTSTrainer = LinTSTrainer.with_updates(
    name="MarketLinTSTrainer",
    default_config=market_config,      # Will be merged with Trainer.COMMON_CONFIG (rllib/agents/trainer.py)
    #default_policy=[somePolicyClass]  # If we had a policy...
)

In [22]:
analysis = ray.tune.run(
    MarketLinTSTrainer,
    config=market_config,
    stop=stop,
    num_samples=3,    
    checkpoint_at_end=True,
    verbose=2,              # Change to 0 or 1 to reduce the output.
    ray_auto_init=False,    # Don't allow Tune to initialize Ray.
)

Trial name,status,loc
MarketLinTSTrainer_MarketBandit_00000,RUNNING,
MarketLinTSTrainer_MarketBandit_00001,PENDING,
MarketLinTSTrainer_MarketBandit_00002,PENDING,


[2m[36m(pid=33093)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/paco/src/academy/ray-rllib/multi-armed-bandits/market.tsv (config: {})
[2m[36m(pid=33094)[0m 2020-06-15 10:19:46,152	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33094)[0m 2020-06-15 10:19:46,156	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33094)[0m 2020-06-15 10:19:46,174	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=33092)[0m 2020-06-15 10:19:46,151	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=33092)[0m 2020-06-15 10:19:46,158	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=33092)[0m 2020-06-15 

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,1.0,0.341164,100.0,699.75
MarketLinTSTrainer_MarketBandit_00001,RUNNING,,,,,
MarketLinTSTrainer_MarketBandit_00002,RUNNING,,,,,


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-15_10-19-46
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 605.5599999999998
  episode_reward_mean: 605.5599999999998
  episode_reward_min: 605.5599999999998
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: e679e611363a4675acdd0fa84fb5b2b1
  experiment_tag: '2'
  grad_time_ms: 0.421
  hostname: derwen
  info:
    grad_time_ms: 0.421
    learner:
      cumulative_regret: 903.0
      update_latency: 0.00022292137145996094
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 2373.552
    opt_samples: 1.0
    sample_peak_throughput: 371.964
    sample_time_ms: 2.688
    update_time_ms: 0.002
  iterations_since_restore: 1
  learner:
    cumulative_regret: 903.0
    update_latency: 0.00022292137145996094
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimator: {}
  opt_peak_throughput: 237

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,13,4.95179,1300,640.989
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,13,5.16736,1300,596.343
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,13,5.05258,1300,595.9


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-15_10-19-51
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 769.2
  episode_reward_mean: 639.266
  episode_reward_min: 511.5000000000001
  episodes_this_iter: 1
  episodes_total: 15
  experiment_id: 5064292cd79a42f1a44c7212014c26f8
  experiment_tag: '0'
  grad_time_ms: 0.524
  hostname: derwen
  info:
    grad_time_ms: 0.524
    learner:
      cumulative_regret: 11216.0
      update_latency: 0.0002949237823486328
    num_steps_sampled: 1400
    num_steps_trained: 1400
    opt_peak_throughput: 1906.935
    opt_samples: 1.0
    sample_peak_throughput: 293.503
    sample_time_ms: 3.407
    update_time_ms: 0.003
  iterations_since_restore: 14
  learner:
    cumulative_regret: 11216.0
    update_latency: 0.0002949237823486328
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 1400
  num_steps_trained: 1400
  off_policy_estimator: {}
  opt_peak_throughput: 1906.935
  opt_sa

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,27,10.0337,2700,633.594
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,26,9.80288,2600,607.083
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,26,9.76779,2600,604.045


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-15_10-19-57
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 825.3800000000001
  episode_reward_mean: 639.987
  episode_reward_min: 506.83000000000004
  episodes_this_iter: 1
  episodes_total: 30
  experiment_id: 5064292cd79a42f1a44c7212014c26f8
  experiment_tag: '0'
  grad_time_ms: 0.533
  hostname: derwen
  info:
    grad_time_ms: 0.533
    learner:
      cumulative_regret: 22602.0
      update_latency: 0.0003528594970703125
    num_steps_sampled: 2800
    num_steps_trained: 2800
    opt_peak_throughput: 1876.478
    opt_samples: 1.0
    sample_peak_throughput: 373.704
    sample_time_ms: 2.676
    update_time_ms: 0.002
  iterations_since_restore: 28
  learner:
    cumulative_regret: 22602.0
    update_latency: 0.0003528594970703125
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 2800
  num_steps_trained: 2800
  off_policy_estimator: {}
  opt_peak_throughput: 1876

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,42,15.0553,4200,622.961
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,41,14.7031,4100,578.259
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,41,14.6984,4100,607.488


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-15_10-20-02
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 825.3800000000001
  episode_reward_mean: 623.5225531914894
  episode_reward_min: 433.5900000000002
  episodes_this_iter: 1
  episodes_total: 47
  experiment_id: 5064292cd79a42f1a44c7212014c26f8
  experiment_tag: '0'
  grad_time_ms: 0.529
  hostname: derwen
  info:
    grad_time_ms: 0.529
    learner:
      cumulative_regret: 35230.0
      update_latency: 0.00031685829162597656
    num_steps_sampled: 4300
    num_steps_trained: 4300
    opt_peak_throughput: 1889.071
    opt_samples: 1.0
    sample_peak_throughput: 303.138
    sample_time_ms: 3.299
    update_time_ms: 0.002
  iterations_since_restore: 43
  learner:
    cumulative_regret: 35230.0
    update_latency: 0.00031685829162597656
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 4300
  num_steps_trained: 4300
  off_policy_estimator: {}
  opt_peak_throu

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,55,19.844,5500,617.358
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,55,19.8123,5500,573.928
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,55,19.8577,5500,608.255


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-15_10-20-07
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 825.3800000000001
  episode_reward_mean: 617.0150000000001
  episode_reward_min: 412.7800000000001
  episodes_this_iter: 1
  episodes_total: 62
  experiment_id: 5064292cd79a42f1a44c7212014c26f8
  experiment_tag: '0'
  grad_time_ms: 0.51
  hostname: derwen
  info:
    grad_time_ms: 0.51
    learner:
      cumulative_regret: 46987.0
      update_latency: 0.0005507469177246094
    num_steps_sampled: 5700
    num_steps_trained: 5700
    opt_peak_throughput: 1961.055
    opt_samples: 1.0
    sample_peak_throughput: 414.814
    sample_time_ms: 2.411
    update_time_ms: 0.002
  iterations_since_restore: 57
  learner:
    cumulative_regret: 46987.0
    update_latency: 0.0005507469177246094
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 5700
  num_steps_trained: 5700
  off_policy_estimator: {}
  opt_peak_throughpu

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,70,24.7657,7000,618.022
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,69,24.4407,6900,581.528
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,69,24.4407,6900,604.847


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-15_10-20-12
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 825.3800000000001
  episode_reward_mean: 616.9360759493669
  episode_reward_min: 412.7800000000001
  episodes_this_iter: 1
  episodes_total: 79
  experiment_id: 5064292cd79a42f1a44c7212014c26f8
  experiment_tag: '0'
  grad_time_ms: 0.643
  hostname: derwen
  info:
    grad_time_ms: 0.643
    learner:
      cumulative_regret: 59485.0
      update_latency: 0.0002827644348144531
    num_steps_sampled: 7200
    num_steps_trained: 7200
    opt_peak_throughput: 1555.116
    opt_samples: 1.0
    sample_peak_throughput: 389.606
    sample_time_ms: 2.567
    update_time_ms: 0.002
  iterations_since_restore: 72
  learner:
    cumulative_regret: 59485.0
    update_latency: 0.0002827644348144531
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 7200
  num_steps_trained: 7200
  off_policy_estimator: {}
  opt_peak_through

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,85,29.9496,8500,617.886
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,84,29.3452,8400,585.452
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,84,29.3447,8400,603.84


Result for MarketLinTSTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-15_10-20-17
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 841.56
  episode_reward_mean: 585.3155319148938
  episode_reward_min: 332.6600000000001
  episodes_this_iter: 1
  episodes_total: 94
  experiment_id: 2cd344d2667849aa9fdbdef0f2c31603
  experiment_tag: '1'
  grad_time_ms: 0.707
  hostname: derwen
  info:
    grad_time_ms: 0.707
    learner:
      cumulative_regret: 74084.0
      update_latency: 0.0003807544708251953
    num_steps_sampled: 8600
    num_steps_trained: 8600
    opt_peak_throughput: 1414.939
    opt_samples: 1.0
    sample_peak_throughput: 340.311
    sample_time_ms: 2.938
    update_time_ms: 0.002
  iterations_since_restore: 86
  learner:
    cumulative_regret: 74084.0
    update_latency: 0.0003807544708251953
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 8600
  num_steps_trained: 8600
  off_policy_estimator: {}
  opt_peak_throughput: 1414.9

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.244:33094,96,34.5845,9600,614.769
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.244:33093,97,34.8873,9700,589.489
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.244:33092,96,34.5518,9600,603.514


Result for MarketLinTSTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-15_10-20-22
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 841.56
  episode_reward_mean: 591.0752000000001
  episode_reward_min: 332.6600000000001
  episodes_this_iter: 1
  episodes_total: 107
  experiment_id: 2cd344d2667849aa9fdbdef0f2c31603
  experiment_tag: '1'
  grad_time_ms: 0.612
  hostname: derwen
  info:
    grad_time_ms: 0.612
    learner:
      cumulative_regret: 83817.0
      update_latency: 0.0002999305725097656
    num_steps_sampled: 9800
    num_steps_trained: 9800
    opt_peak_throughput: 1635.079
    opt_samples: 1.0
    sample_peak_throughput: 380.868
    sample_time_ms: 2.626
    update_time_ms: 0.002
  iterations_since_restore: 98
  learner:
    cumulative_regret: 83817.0
    update_latency: 0.0002999305725097656
  node_ip: 192.168.1.244
  num_healthy_workers: 0
  num_steps_sampled: 9800
  num_steps_trained: 9800
  off_policy_estimator: {}
  opt_peak_throughput: 1635.

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,TERMINATED,,100,36.2721,10000,611.694
MarketLinTSTrainer_MarketBandit_00001,TERMINATED,,100,36.1511,10000,591.395
MarketLinTSTrainer_MarketBandit_00002,TERMINATED,,100,36.1943,10000,603.947


## Analyzing the results

Let's analyze the rewards and cumulative regrets of these trials.

In [23]:
# Combine the per-trial results into a single DataFrame.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
df_ts = pd.concat(analysis.trial_dataframes.values(), ignore_index=True)
    
df_ts.head()

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/update_time_ms,info/opt_peak_throughput,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency
0,699.75,699.75,699.75,91.0,1,100,100,2.489,0.479,0.002,...,0.002,2085.784,401.699,1.0,773.0,0.000194,20.9,69.3,773.0,0.000194
1,769.2,699.75,734.475,91.0,1,200,200,2.685,0.459,0.002,...,0.002,2179.086,372.46,1.0,1505.0,0.000275,87.5,69.3,1505.0,0.000275
2,769.2,669.49,712.813333,91.0,1,300,300,3.401,0.791,0.002,...,0.002,1263.573,294.007,1.0,2236.0,0.000204,,,2236.0,0.000204
3,769.2,592.71,682.7875,91.0,1,400,400,2.791,0.516,0.002,...,0.002,1938.487,358.288,1.0,3066.0,0.000273,88.6,69.3,3066.0,0.000273
4,769.2,592.71,672.922,91.0,1,500,500,3.408,0.757,0.003,...,0.003,1320.749,293.46,1.0,3852.0,0.000301,,,3852.0,0.000301


In [24]:
rewards = df_ts \
    .groupby("num_steps_trained")["episode_reward_mean"] \
    .aggregate(["mean", "max", "min", "std"])

rewards

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,582.633333,699.750000,442.590000,130.103960
200,631.551667,734.475000,544.485000,95.982423
300,640.688889,712.813333,581.656667,66.551277
400,652.943333,682.787500,630.000000,27.061895
500,630.269333,672.922000,587.094000,42.916387
...,...,...,...,...
9600,602.146867,614.769400,588.157100,13.358728
9700,601.768167,613.269900,589.488600,11.909716
9800,602.187433,613.681800,591.075200,11.308144
9900,601.724133,612.312700,590.199900,11.086054
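
To see what this aggregation does, the sketch below rebuilds the first row of the table (100 steps) from three per-trial values. Two of them, 699.75 and 605.56, appear in the trial output earlier in this notebook; the third, 442.59, is taken from the `min` column and is assumed to belong to the remaining trial. The frame name `df_demo` is ours.

```python
import pandas as pd

# One row per trial, all at the same training step.
df_demo = pd.DataFrame({
    "num_steps_trained": [100, 100, 100],
    "episode_reward_mean": [699.75, 605.56, 442.59],
})

# Same groupby/aggregate call as in the cell above, applied to the tiny frame.
stats = (
    df_demo
    .groupby("num_steps_trained")["episode_reward_mean"]
    .aggregate(["mean", "max", "min", "std"])
)

print(stats)
```

The `std` column is the sample standard deviation across trials, which is why it shrinks as the three trials settle on similar mean rewards.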


In [25]:
plot_line_with_stddev(rewards, x_col='num_steps_trained', y_col='mean', stddev_col='std', 
                      title='Rewards vs. Steps', x_axis_label='step', y_axis_label='reward')

([image](../../images/rllib/Market-Bandit-Rewards-vs-Steps.png))

The rewards reach what appears to be a nearly optimal level by 3000 steps, then show some slow improvement beyond 8000.

In [26]:
regrets = df_ts \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

regrets

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,902.000000,1030.0,773.0,128.502918
200,1683.333333,1859.0,1505.0,177.015065
300,2427.666667,2554.0,2236.0,168.767098
400,3265.666667,3421.0,3066.0,181.604882
500,4092.666667,4249.0,3852.0,211.509653
...,...,...,...,...
9600,80717.000000,82349.0,79222.0,1567.995217
9700,81543.000000,83076.0,80171.0,1459.176823
9800,82379.666667,83817.0,81052.0,1385.758396
9900,83170.000000,84673.0,81810.0,1436.846895


In [27]:
plot_cumulative_regret(regrets)

([image](../../images/rllib/Market-Bandit-Cumulative-Regret.png))
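
A quick way to read the regret table is to look at how much regret accrues per training step. The sketch below does this with the `mean` values visible in the table above (the elided middle rows are left out), using `numpy.diff`. If the policy were converging on the best arm each year, the per-step regret would shrink over time.

```python
import numpy as np

# Mean cumulative regret, copied from the visible rows of the table.
early_steps = np.array([100, 200, 300, 400, 500])
early_regret = np.array([902.0, 1683.333333, 2427.666667, 3265.666667, 4092.666667])
late_steps = np.array([9600, 9700, 9800, 9900])
late_regret = np.array([80717.0, 81543.0, 82379.666667, 83170.0])


def regret_per_step(steps, cumulative):
    """Average regret added per training step between consecutive rows."""
    return np.diff(cumulative) / np.diff(steps)


early_rate = regret_per_step(early_steps, early_regret)
late_rate = regret_per_step(late_steps, late_regret)

print("early:", np.round(early_rate, 2))
print("late: ", np.round(late_rate, 2))
```

With these rows, the rate stays at roughly 8 regret units per step both early and late, which is what a near-linear cumulative regret curve looks like.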

## Evaluating the Trained Policy

Overall, how well did the trained policy perform? The results should be better than random, but less than the best case.

In [28]:
print("{:5.2f}% optimized return annualized".format(max(rewards["mean"]) / n_years))

 7.10% optimized return annualized


That's better than the random action baseline of 3.75%, but nowhere near the best-case scenario of a 15.18% return. Hence, our regrets grow...
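
The annualized figure is simply a mean episode reward divided by the number of years. The sketch below applies that arithmetic to two rows visible in the rewards table above, the 400-step row (the largest visible mean) and the final 9900-step row, to show how much the choice of row matters. Whether 400 steps is the true maximum cannot be confirmed from the truncated table, so treat it as an assumption.

```python
N_YEARS = 92

# Mean episode rewards copied from the visible rows of the rewards table.
visible_mean_rewards = {400: 652.943333, 9900: 601.724133}

# Divide each total episode reward by the number of years.
annualized = {
    steps: reward / N_YEARS
    for steps, reward in visible_mean_rewards.items()
}

for steps, pct in annualized.items():
    print("{:5d} steps: {:5.2f}% annualized".format(steps, pct))
```

The 400-step row reproduces the 7.10% printed above, while the last visible row works out to about 6.54%.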

Note that investing solely in the S&P stock index would have produced a better than 8% return over that period, if one could wait 92 years. However, putting one's entire portfolio into stocks can be quite a risky policy in the short term, so we explored how to balance a portfolio given only limited information.

In any case, the contextual bandit performed well considering that it could only use *inflation* for the context of its decisions, and could only take actions once each year.

## Exercise 1

Try using a `LinUCBTrainer`-based trainer. How does the annualized return compare?

---

## Extra - Restoring from a Checkpoint

In the previous lesson, [05 Thompson Sampling](05-Thompson-Sampling.ipynb), we showed how to restore a trainer from a checkpoint, but almost "in passing". Let's use this feature again, this time with our custom trainer class `MarketLinTSTrainer`.

In [29]:
trial = analysis.trials[0]
path = trial.checkpoint.value
print(f'checkpoint_path: {path}')

checkpoint_path: /Users/paco/ray_results/MarketLinTSTrainer/MarketLinTSTrainer_MarketBandit_0_2020-06-15_10-19-40ufbjjnn4/checkpoint_100/checkpoint-100


In [30]:
trainer = MarketLinTSTrainer(market_config)  # create instance and then restore from checkpoint
trainer.restore(path)

2020-06-15 10:24:14,450	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-15 10:24:14,463	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-15 10:24:14,476	INFO trainable.py:217 -- Getting current IP.
2020-06-15 10:24:14,486	INFO trainable.py:217 -- Getting current IP.
2020-06-15 10:24:14,488	INFO trainable.py:423 -- Restored on 192.168.1.244 from checkpoint: /Users/paco/ray_results/MarketLinTSTrainer/MarketLinTSTrainer_MarketBandit_0_2020-06-15_10-19-40ufbjjnn4/checkpoint_100/checkpoint-100
2020-06-15 10:24:14,489	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 100, '_timesteps_total': 10000, '_time_total': 36.272128105163574, '_episodes_total': 109}


MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/paco/src/academy/ray-rllib/multi-armed-bandits/market.tsv (config: {})


Access the model to review the distribution of arm weights:

In [32]:
model = trainer.get_policy().model
means = [model.arms[i].theta.numpy() for i in range(3)]
covs = [model.arms[i].covariance.numpy() for i in range(3)]
means, covs, model.arms[0].theta.numpy()

([array([1.3029345], dtype=float32),
  array([-0.08178909], dtype=float32),
  array([0.17635827], dtype=float32)],
 [array([[6.5596405e-06]], dtype=float32),
  array([[3.9466147e-05]], dtype=float32),
  array([[2.8418384e-05]], dtype=float32)],
 array([1.3029345], dtype=float32))
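
To give a feel for how these per-arm parameters could drive decisions, here is a minimal NumPy sketch of linear Thompson sampling with a one-dimensional context. It is an illustration under simplifying assumptions, not RLlib's implementation: each arm's weight is drawn from a normal distribution with the mean and variance printed above, the score is the sampled weight times the inflation context, and the highest score wins. Only the three arms printed above are included; the function name `choose_arm` and the example inflation values are ours.

```python
import numpy as np

# The first three arms, in the order of the tickers list.
ARM_NAMES = ["sp500", "t.bill", "t.bond"]

# Posterior means and variances copied from the output above.
theta_mean = np.array([1.3029345, -0.08178909, 0.17635827])
theta_var = np.array([6.5596405e-06, 3.9466147e-05, 2.8418384e-05])


def choose_arm(inflation, rng):
    """Sample one weight per arm, score the context, and pick the best arm."""
    sampled_theta = rng.normal(theta_mean, np.sqrt(theta_var))
    scores = sampled_theta * inflation
    return int(np.argmax(scores))


rng = np.random.default_rng(0)
for inflation in (2.0, -2.0):
    print(inflation, ARM_NAMES[choose_arm(inflation, rng)])
```

Because the variances are tiny after training, the sampled weights barely move, so with these numbers positive inflation favors the S&P 500 arm and negative inflation favors T-bills.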

A final note: once checkpoints exist, they change how training behaves if you rerun it in this notebook. If you intend to run fresh experiments here, be sure to start from scratch!