# Ray RLlib Multi-Armed Bandits - Market Bandit Example

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Now that we've learned about multi-armed bandits and methods for optimizing rewards, let's look at real-world applications, starting with a stock market example. We'll also learn a little more about configuring RLlib trainers.

We'll load a dataset derived from this [NYU Stern table](http://pages.stern.nyu.edu/~adamodar/New_Home_Page/datafile/histretSP.html) that shows returns for nearly a century of market data, including dividends and adjustments for inflation. The `market.tsv` file in this folder contains the data.

In [31]:
import pandas as pd
import numpy as np
import os

In [33]:
# Some properties we'll need:
DEFAULT_MAX_INFLATION = 100.0
DEFAULT_TICKERS = ["sp500", "t.bill", "t.bond", "corp"]
DEFAULT_DATA_FILE = os.path.abspath(os.path.curdir) + '/market.tsv'  # full path

In [32]:
def load_market_data (file_name):
    with open(file_name, "r") as f:
        return pd.read_table(f)

In [34]:
df = load_market_data(DEFAULT_DATA_FILE)
df

Unnamed: 0,year,inflation,sp500,t.bill,t.bond,corp
0,1928,-1.15,45.49,4.28,2.01,4.42
1,1929,0.00,-8.30,3.16,4.20,3.02
2,1930,-2.67,-23.07,7.42,7.41,3.30
3,1931,-8.93,-38.33,12.34,7.00,-7.41
4,1932,-10.30,1.85,12.68,21.28,37.78
...,...,...,...,...,...,...
87,2015,0.12,1.26,-0.07,1.16,-0.82
88,2016,1.26,10.38,-0.93,-0.56,8.99
89,2017,2.13,19.07,-1.17,0.66,7.44
90,2018,2.44,-6.51,-0.49,-2.40,-5.08


As you can see, the data spans 92 years, from 1928 to 2019. The columns represent:
  * the year
  * inflation rate at the time
  * [S&P500](https://en.wikipedia.org/wiki/S%26P_500_Index) (composite stock index)
  * [Treasury Bills](https://www.investopedia.com/terms/t/treasurybill.asp) (short-term gov bonds)
  * [Treasury Bonds](https://www.investopedia.com/terms/t/treasurybond.asp) (long-term gov bonds)
  * [Moody's Baa Corporate Bonds](https://en.wikipedia.org/wiki/Moody%27s_Investors_Service#Moody's_credit_ratings) (moderate risk)

In [35]:
df.describe()

Unnamed: 0,year,inflation,sp500,t.bill,t.bond,corp
count,92.0,92.0,92.0,92.0,92.0,92.0
mean,1973.5,3.041957,8.413261,0.434239,2.166413,4.21663
std,26.70206,3.803579,19.619605,3.573035,8.126432,8.625809
min,1928.0,-10.3,-38.9,-12.05,-14.57,-14.85
25%,1950.75,1.415,-2.74,-1.185,-2.62,-1.3225
50%,1973.5,2.75,10.515,0.59,1.07,3.91
75%,1996.25,4.275,20.6225,2.1175,7.0375,9.2875
max,2019.0,14.39,58.2,12.68,25.14,37.78


"Corp" refers to corporate bonds.

## Analysis of the Data

What are the worst-case and best-case scenarios? In other words, if one could predict future market performance, what is the range between total failure and total success over the past century? By "total", we mean having all your money in a given year invested in the worst-performing _sector_ (S&P500 or T bills or ...) or in the best-performing sector for that year.

In [36]:
n_years = len(df)
min_list = []
max_list = []

for i in range(n_years):
    row = df.iloc[i, 2:]
    min_list.append(min(row))
    max_list.append(max(row))
    
print("{:5.2f}% worst case annualized".format(sum(min_list) / n_years))
print("{:5.2f}% best case annualized".format(sum(max_list) / n_years))

-5.64% worst case annualized
15.18% best case annualized


In [37]:
from bokeh_util import plot_line, plot_line_with_stddev, plot_between_lines, plot_cumulative_regret

import bokeh
bokeh.io.reset_output()
bokeh.io.output_notebook()

In [38]:
min_max = pd.DataFrame.from_dict({'year': df['year'], 'min':min_list, 'max':max_list})
min_max

Unnamed: 0,year,min,max
0,1928,2.01,45.49
1,1929,-8.30,4.20
2,1930,-23.07,7.42
3,1931,-38.33,12.34
4,1932,1.85,37.78
...,...,...,...
87,2015,-0.82,1.26
88,2016,-0.93,10.38
89,2017,-1.17,19.07
90,2018,-6.51,-0.49


In [39]:
plot_between_lines(min_max, x_col='year', lower_col='min', upper_col='max', 
                   title='Best to Worst', x_axis_label='year', y_axis_label='%')

## Defining an Environment

Now let's define a Gym environment so that we can train a contextual bandit to optimize annual investments over that period.

In [40]:
import gym
from gym.spaces import Discrete, Box
from gym.utils import seeding
import numpy as np
import random

This is the bandit we'll use to represent the market "environment".

In [41]:
class MarketBandit (gym.Env):
    
    def __init__ (self, config=None):
        config = config or {}  # avoid a shared mutable default argument
        self.max_inflation = config.get('max-inflation', DEFAULT_MAX_INFLATION)
        self.tickers = config.get('tickers', DEFAULT_TICKERS)
        self.data_file = config.get('data-file', DEFAULT_DATA_FILE)
        print(f"MarketBandit: max_inflation: {self.max_inflation}, tickers: {self.tickers}, data file: {self.data_file} (config: {config})")

        self.action_space = Discrete(len(self.tickers))  # one arm per ticker
        self.observation_space = Box(
            low  = -self.max_inflation,
            high =  self.max_inflation,
            shape=(1, )
        )
        self.df = load_market_data(self.data_file)
        self.cur_context = None


    def reset (self):
        self.year = self.df["year"].min()
        self.cur_context = self.df.loc[self.df["year"] == self.year]["inflation"][0]
        self.done = False
        self.info = {}

        return [self.cur_context]


    def step (self, action):
        if self.done:
            reward = 0.
            regret = 0.
        else:
            row = self.df.loc[self.df["year"] == self.year]

            # calculate reward
            ticker = self.tickers[action]
            reward = float(row[ticker])

            # calculate regret
            max_reward = max(map(lambda t: float(row[t]), self.tickers))
            regret = round(max_reward - reward)

            # update the context
            self.cur_context = float(row["inflation"])

            # increment the year
            self.year += 1

            if self.year >= self.df["year"].max():
                self.done = True

        context = [self.cur_context]
        #context = self.observation_space.sample()

        self.info = {
            "regret": regret,
            "year": self.year
        }
         
        return [context, reward, self.done, self.info]


    def seed (self, seed=None):
        """Sets the seed for this env's random number generator(s).
        Note:
            Some environments use multiple pseudorandom number generators.
            We want to capture all such seeds used in order to ensure that
            there aren't accidental correlations between multiple generators.
        Returns:
            list<bigint>: Returns the list of seeds used in this env's random
              number generators. The first value in the list should be the
              "main" seed, or the value which a reproducer should pass to
              'seed'. Often, the main seed equals the provided 'seed', but
              this won't be true if seed=None, for example.
        """
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
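
The termination check in `step()` determines how many decisions make up one episode. The sketch below traces only the year counter, using the 1928 to 2019 span stated earlier; `count_steps` is a hypothetical helper written for this illustration and is not part of `MarketBandit`.

```python
# Trace only the year counter from MarketBandit.step(); rewards are ignored.
first_year, last_year = 1928, 2019   # span of market.tsv as stated above

def count_steps(first_year, last_year):
    """Count how many step() calls happen before `done` becomes True."""
    year = first_year              # what reset() does
    steps = 0
    done = False
    while not done:
        steps += 1                 # one call to step() acts on `year`
        year += 1                  # "increment the year"
        if year >= last_year:      # same comparison used in step()
            done = True
    return steps

episode_len = count_steps(first_year, last_year)
print(episode_len)  # 91
```

Under this check an episode has 91 steps, so the row for the final year is never used as a reward.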

Let's see it in action:

In [42]:
bandit = MarketBandit()
bandit.reset()

for i in range(10):
    action = bandit.action_space.sample()
    obs = bandit.step(action)
    print(action, obs)

MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/market.tsv (config: {})
2 [[-1.15], 2.01, False, {'regret': 43, 'year': 1929}]
1 [[0.0], 3.16, False, {'regret': 1, 'year': 1930}]
0 [[-2.67], -23.07, False, {'regret': 30, 'year': 1931}]
0 [[-8.93], -38.33, False, {'regret': 51, 'year': 1932}]
0 [[-10.3], 1.85, False, {'regret': 36, 'year': 1933}]
0 [[-5.19], 58.2, False, {'regret': 0, 'year': 1934}]
1 [[3.48], -3.09, False, {'regret': 18, 'year': 1935}]
0 [[2.55], 43.09, False, {'regret': 0, 'year': 1936}]
3 [[1.03], 10.25, False, {'regret': 20, 'year': 1937}]
1 [[3.73], -3.33, False, {'regret': 1, 'year': 1938}]
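
To make the `regret` values in this output concrete, here is a minimal sketch that repeats the calculation from `step()` for the first two years, with the returns copied from the table at the top of this lesson. The `reward_and_regret` helper and the `rows` dictionary are illustrative stand-ins, not part of the environment.

```python
TICKERS = ["sp500", "t.bill", "t.bond", "corp"]

# Returns for the first two years, copied from the market data table.
rows = {
    1928: {"sp500": 45.49, "t.bill": 4.28, "t.bond": 2.01, "corp": 4.42},
    1929: {"sp500": -8.30, "t.bill": 3.16, "t.bond": 4.20, "corp": 3.02},
}

def reward_and_regret(row, action):
    """Reward of the chosen arm and its rounded gap to the best arm that year."""
    reward = row[TICKERS[action]]
    best = max(row[t] for t in TICKERS)
    return reward, round(best - reward)

print(reward_and_regret(rows[1928], 2))  # (2.01, 43)
print(reward_and_regret(rows[1929], 1))  # (3.16, 1)
```

Choosing T bonds in 1928 earned 2.01% when the S&P500 returned 45.49%, hence a regret of 43. Because the regret is rounded to an integer, small gaps such as the 1.04 points in 1929 lose their fractional part.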




We can use this environment in a kind of *Monte Carlo simulation* to measure a baseline for what the rewards would be over a long period if you always chose a random action.

In [43]:
done = 1
reward_list = []
iterations = 10000 #50000

for i in range(iterations):
    if done == 1:
        bandit.reset()

    action = bandit.action_space.sample()
    obs = bandit.step(action)
    context, reward, done, info = obs
    reward_list.append(reward)
    #print(action, context, reward, done, info)

In [44]:
df_mc = pd.DataFrame(reward_list, columns=["reward"])
df_mc.mean()

reward    3.775347
dtype: float64

Depending on the number of iterations, you'll probably get a value approaching 3.75% as a baseline for random actions. That's more than the -5.64% worst case and much less than the 15.18% best case for the reward!
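
As a rough cross-check on this baseline, a uniformly random policy picks each of the four tickers with probability 1/4, so its expected reward is the average of the four column means. The sketch below uses the means reported by `df.describe()` above; it is only an approximation of what the simulation measures.

```python
# Column means copied from the df.describe() output above.
column_means = {
    "sp500": 8.413261,
    "t.bill": 0.434239,
    "t.bond": 2.166413,
    "corp": 4.21663,
}

# Each arm is chosen with probability 1/4, so average the four means.
expected_random = sum(column_means.values()) / len(column_means)
print(f"{expected_random:.2f}%")  # 3.81%
```

The simulated value can differ slightly from this figure because of sampling noise and because the environment ends each episode before using the final year of data.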

In [45]:
from bokeh_util import plot_line, plot_line_with_stddev, plot_cumulative_regret

In [46]:
plot_line(df_mc, x_col='index', y_col='reward', title='Reward Over Time')

![image](../../images/rllib/MarketReward-Random.png)

Yes, it looks quite random...

## Training a policy in RLlib

Now let's train a policy using our contextual bandit, specifically using _Linear Thompson Sampling_ in RLlib. Hopefully it will do better than the random results we just computed!

Recall that the `__init__()` method for `MarketBandit` sets some parameters from the passed-in `config` object (with defaults). We don't construct the environment explicitly ourselves. Rather, RLlib will do this. So, we need to construct the canonical `config` object we want to use. To do this, we use the idioms shown in the next several cells:

In [47]:
from ray.rllib.agents.trainer import with_base_config, with_common_config
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.agents.lin_ts import LinTSTrainer
from ray import tune

We need a custom config object with our parameters for `MarketBandit`. We do this building on the default `TS_CONFIG` object for _LinTS_:

In [48]:
market_config = with_base_config(TS_CONFIG, {
    "env":           MarketBandit,
    'max-inflation': DEFAULT_MAX_INFLATION,
    'tickers':       DEFAULT_TICKERS,
    'data-file':     DEFAULT_DATA_FILE
})

stop = {
    "training_iteration": 100
}
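
Conceptually, building on a base config amounts to copying the base dictionary and overlaying the extra keys. The sketch below illustrates that idea with plain dictionaries; `overlay_config` and the keys in `base` are hypothetical stand-ins chosen for illustration, not RLlib code or real `TS_CONFIG` settings.

```python
import copy

def overlay_config(base, extra):
    """Return a new dict: a deep copy of `base` with `extra` laid on top."""
    merged = copy.deepcopy(base)   # leave the shared base config untouched
    merged.update(extra)           # extra keys win over base keys
    return merged

# Hypothetical base settings, used only for this illustration.
base = {"num_workers": 0, "options": {"log_level": "WARN"}}
merged = overlay_config(base, {"env": "MarketBandit", "num_workers": 2})

print(merged["num_workers"], base["num_workers"])  # 2 0
```

Copying first matters because the base config is a module-level object that other trainers may also use.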

Next, we'll define a custom trainer that builds on `LinTSTrainer` using its `with_updates()` method. Note that it's the first argument we'll pass to `tune.run()` in the following cell. When all we need is `LinTSTrainer`, as is, with no extra custom config settings, we can just pass the string `contrib/LinTS` to `tune.run()`.

In [49]:
MarketLinTSTrainer = LinTSTrainer.with_updates(
    name="MarketLinTSTrainer",
    default_config=market_config,      # Will be merged with Trainer.COMMON_CONFIG (rllib/agents/trainer.py)
    #default_policy=[somePolicyClass]  # If we had a policy...
)

In [50]:
analysis = tune.run(
    MarketLinTSTrainer,
    config=market_config,
    stop=stop,
    num_samples=3,    
    checkpoint_at_end=True,
    verbose=2            # Change to 0 or 1 to reduce the output.
)

Trial name,status,loc
MarketLinTSTrainer_MarketBandit_00000,RUNNING,
MarketLinTSTrainer_MarketBandit_00001,PENDING,
MarketLinTSTrainer_MarketBandit_00002,PENDING,


[2m[36m(pid=10036)[0m 2020-06-11 16:27:08,924	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=10036)[0m 2020-06-11 16:27:08,930	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2m[36m(pid=10036)[0m 2020-06-11 16:27:08,948	INFO trainable.py:217 -- Getting current IP.
[2m[36m(pid=10036)[0m MarketBandit: max_inflation: 100.0, tickers: ['sp500', 't.bill', 't.bond', 'corp'], data file: /Users/deanwampler/projects/anyscale/academy/academy-git/ray-rllib/multi-armed-bandits/market.tsv (config: {})
[2m[36m(pid=10042)[0m 2020-06-11 16:27:08,924	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
[2m[36m(pid=10042)[0m 2020-06-11 16:27:08,930	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
[2

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,,,,,
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,1.0,0.501466,100.0,450.83
MarketLinTSTrainer_MarketBandit_00002,RUNNING,,,,,


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-11_16-27-09
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 544.54
  episode_reward_mean: 544.54
  episode_reward_min: 544.54
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: 9819da7db08c414fabc6dcbe36c4937e
  experiment_tag: '0'
  grad_time_ms: 0.462
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.462
    learner:
      cumulative_regret: 953.0
      update_latency: 0.00028014183044433594
    num_steps_sampled: 100
    num_steps_trained: 100
    opt_peak_throughput: 2163.239
    opt_samples: 1.0
    sample_peak_throughput: 396.523
    sample_time_ms: 2.522
    update_time_ms: 0.004
  iterations_since_restore: 1
  learner:
    cumulative_regret: 953.0
    update_latency: 0.00028014183044433594
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 100
  num_steps_trained: 100
  off_policy_estimator: {}
  opt_peak_throughput: 2163.239
  opt_samples:

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,13,5.45295,1300,624.999
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,12,5.21273,1200,502.278
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,12,5.24629,1200,603.065


Result for MarketLinTSTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-11_16-27-14
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 713.95
  episode_reward_mean: 505.6835714285714
  episode_reward_min: 306.9
  episodes_this_iter: 1
  episodes_total: 14
  experiment_id: ae9bbd69d8d0484796a0660223d0ffd8
  experiment_tag: '1'
  grad_time_ms: 0.398
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.398
    learner:
      cumulative_regret: 12343.0
      update_latency: 0.0002391338348388672
    num_steps_sampled: 1300
    num_steps_trained: 1300
    opt_peak_throughput: 2514.42
    opt_samples: 1.0
    sample_peak_throughput: 424.804
    sample_time_ms: 2.354
    update_time_ms: 0.001
  iterations_since_restore: 13
  learner:
    cumulative_regret: 12343.0
    update_latency: 0.0002391338348388672
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 1300
  num_steps_trained: 1300
  off_policy_estimator: {}
  opt_peak_throughput: 2514.4

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,26,10.0907,2600,599.461
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,27,10.408,2700,528.242
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,26,10.2704,2600,597.301


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-11_16-27-19
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 756.1500000000001
  episode_reward_mean: 598.6675862068964
  episode_reward_min: 382.44000000000005
  episodes_this_iter: 1
  episodes_total: 29
  experiment_id: 9819da7db08c414fabc6dcbe36c4937e
  experiment_tag: '0'
  grad_time_ms: 0.724
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.724
    learner:
      cumulative_regret: 22915.0
      update_latency: 0.00036716461181640625
    num_steps_sampled: 2700
    num_steps_trained: 2700
    opt_peak_throughput: 1381.569
    opt_samples: 1.0
    sample_peak_throughput: 339.851
    sample_time_ms: 2.942
    update_time_ms: 0.003
  iterations_since_restore: 27
  learner:
    cumulative_regret: 22915.0
    update_latency: 0.00036716461181640625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 2700
  num_steps_trained: 2700
  off_policy_estimator: {}
  

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,39,15.3247,3900,591.615
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,39,15.0944,3900,534.462
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,38,15.2163,3800,609.973


Result for MarketLinTSTrainer_MarketBandit_00001:
  custom_metrics: {}
  date: 2020-06-11_16-27-25
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 768.78
  episode_reward_mean: 531.5581395348838
  episode_reward_min: 306.9
  episodes_this_iter: 1
  episodes_total: 43
  experiment_id: ae9bbd69d8d0484796a0660223d0ffd8
  experiment_tag: '1'
  grad_time_ms: 0.44
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.44
    learner:
      cumulative_regret: 36696.0
      update_latency: 0.0003590583801269531
    num_steps_sampled: 4000
    num_steps_trained: 4000
    opt_peak_throughput: 2272.966
    opt_samples: 1.0
    sample_peak_throughput: 442.479
    sample_time_ms: 2.26
    update_time_ms: 0.002
  iterations_since_restore: 40
  learner:
    cumulative_regret: 36696.0
    update_latency: 0.0003590583801269531
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 4000
  num_steps_trained: 4000
  off_policy_estimator: {}
  opt_peak_throughput: 2272.966

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,49,19.9764,4900,584.736
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,49,19.7996,4900,529.925
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,49,20.1261,4900,608.746


Result for MarketLinTSTrainer_MarketBandit_00000:
  custom_metrics: {}
  date: 2020-06-11_16-27-30
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 756.1500000000001
  episode_reward_mean: 583.86
  episode_reward_min: 382.44000000000005
  episodes_this_iter: 1
  episodes_total: 54
  experiment_id: 9819da7db08c414fabc6dcbe36c4937e
  experiment_tag: '0'
  grad_time_ms: 3.698
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 3.698
    learner:
      cumulative_regret: 42958.0
      update_latency: 0.0013041496276855469
    num_steps_sampled: 5000
    num_steps_trained: 5000
    opt_peak_throughput: 270.447
    opt_samples: 1.0
    sample_peak_throughput: 66.01
    sample_time_ms: 15.149
    update_time_ms: 0.004
  iterations_since_restore: 50
  learner:
    cumulative_regret: 42958.0
    update_latency: 0.0013041496276855469
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 5000
  num_steps_trained: 5000
  off_policy_estimator: {}

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,56,24.7966,5600,583.021
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,56,24.6287,5600,533.374
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,56,25.1073,5600,606.694


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-11_16-27-35
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 832.9500000000002
  episode_reward_mean: 608.0277419354837
  episode_reward_min: 442.8600000000001
  episodes_this_iter: 1
  episodes_total: 62
  experiment_id: 2aa5ef21334943059f0e63c12ac81339
  experiment_tag: '2'
  grad_time_ms: 0.679
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.679
    learner:
      cumulative_regret: 47683.0
      update_latency: 0.00031685829162597656
    num_steps_sampled: 5700
    num_steps_trained: 5700
    opt_peak_throughput: 1473.495
    opt_samples: 1.0
    sample_peak_throughput: 342.44
    sample_time_ms: 2.92
    update_time_ms: 0.002
  iterations_since_restore: 57
  learner:
    cumulative_regret: 47683.0
    update_latency: 0.00031685829162597656
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 5700
  num_steps_trained: 5700
  off_policy_estimator: {}

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,64,29.8362,6400,583.793
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,64,29.639,6400,537.168
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,64,30.1051,6400,610.69


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-11_16-27-40
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 832.9500000000002
  episode_reward_mean: 611.9404166666667
  episode_reward_min: 386.4100000000001
  episodes_this_iter: 1
  episodes_total: 72
  experiment_id: 2aa5ef21334943059f0e63c12ac81339
  experiment_tag: '2'
  grad_time_ms: 0.569
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.569
    learner:
      cumulative_regret: 54898.0
      update_latency: 0.00044083595275878906
    num_steps_sampled: 6600
    num_steps_trained: 6600
    opt_peak_throughput: 1757.218
    opt_samples: 1.0
    sample_peak_throughput: 372.933
    sample_time_ms: 2.681
    update_time_ms: 0.002
  iterations_since_restore: 66
  learner:
    cumulative_regret: 54898.0
    update_latency: 0.00044083595275878906
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 6600
  num_steps_trained: 6600
  off_policy_estimator: {}

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,75,34.678,7500,590.148
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,75,34.6964,7500,540.027
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,74,34.6881,7400,609.31


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-11_16-27-46
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 832.9500000000002
  episode_reward_mean: 610.8549999999999
  episode_reward_min: 386.4100000000001
  episodes_this_iter: 1
  episodes_total: 84
  experiment_id: 2aa5ef21334943059f0e63c12ac81339
  experiment_tag: '2'
  grad_time_ms: 0.53
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.53
    learner:
      cumulative_regret: 64131.0
      update_latency: 0.0003941059112548828
    num_steps_sampled: 7700
    num_steps_trained: 7700
    opt_peak_throughput: 1886.098
    opt_samples: 1.0
    sample_peak_throughput: 387.046
    sample_time_ms: 2.584
    update_time_ms: 0.002
  iterations_since_restore: 77
  learner:
    cumulative_regret: 64131.0
    update_latency: 0.0003941059112548828
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 7700
  num_steps_trained: 7700
  off_policy_estimator: {}

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,RUNNING,192.168.1.149:10044,87,39.563,8700,583.133
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,87,39.6216,8700,541.389
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,86,39.4748,8600,608.763


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-11_16-27-51
  done: false
  episode_len_mean: 91.0
  episode_reward_max: 832.9500000000002
  episode_reward_mean: 609.7514285714286
  episode_reward_min: 386.4100000000001
  episodes_this_iter: 1
  episodes_total: 98
  experiment_id: 2aa5ef21334943059f0e63c12ac81339
  experiment_tag: '2'
  grad_time_ms: 0.513
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.513
    learner:
      cumulative_regret: 75057.0
      update_latency: 0.00030112266540527344
    num_steps_sampled: 9000
    num_steps_trained: 9000
    opt_peak_throughput: 1950.295
    opt_samples: 1.0
    sample_peak_throughput: 443.316
    sample_time_ms: 2.256
    update_time_ms: 0.002
  iterations_since_restore: 90
  learner:
    cumulative_regret: 75057.0
    update_latency: 0.00030112266540527344
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 9000
  num_steps_trained: 9000
  off_policy_estimator: {}

Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,TERMINATED,,100,44.4563,10000,578.19
MarketLinTSTrainer_MarketBandit_00001,RUNNING,192.168.1.149:10042,100,44.4556,10000,551.019
MarketLinTSTrainer_MarketBandit_00002,RUNNING,192.168.1.149:10036,99,44.6541,9900,610.316


Result for MarketLinTSTrainer_MarketBandit_00002:
  custom_metrics: {}
  date: 2020-06-11_16-27-55
  done: true
  episode_len_mean: 91.0
  episode_reward_max: 801.4599999999999
  episode_reward_mean: 608.0228999999999
  episode_reward_min: 386.4100000000001
  episodes_this_iter: 1
  episodes_total: 109
  experiment_id: 2aa5ef21334943059f0e63c12ac81339
  experiment_tag: '2'
  grad_time_ms: 0.859
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.859
    learner:
      cumulative_regret: 83460.0
      update_latency: 0.0008792877197265625
    num_steps_sampled: 10000
    num_steps_trained: 10000
    opt_peak_throughput: 1164.438
    opt_samples: 1.0
    sample_peak_throughput: 327.304
    sample_time_ms: 3.055
    update_time_ms: 0.002
  iterations_since_restore: 100
  learner:
    cumulative_regret: 83460.0
    update_latency: 0.0008792877197265625
  node_ip: 192.168.1.149
  num_healthy_workers: 0
  num_steps_sampled: 10000
  num_steps_trained: 10000
  off_policy_estimator: {}


Trial name,status,loc,iter,total time (s),ts,reward
MarketLinTSTrainer_MarketBandit_00000,TERMINATED,,100,44.4563,10000,578.19
MarketLinTSTrainer_MarketBandit_00001,TERMINATED,,100,44.4556,10000,551.019
MarketLinTSTrainer_MarketBandit_00002,TERMINATED,,100,45.1381,10000,608.023


## Analyzing the results

Let's analyze the rewards and cumulative regrets of these trials.

In [51]:
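# Stack the per-trial result tables into one DataFrame.
# (pd.concat replaces the row-by-row DataFrame.append, which newer pandas releases no longer provide.)
df_ts = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)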
    
df_ts.head()

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/update_time_ms,info/opt_peak_throughput,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency
0,544.54,544.54,544.54,91.0,1,100,100,2.522,0.462,0.004,...,0.004,2163.239,396.523,1.0,953.0,0.00028,34.9,59.6,953.0,0.00028
1,606.54,544.54,575.54,91.0,1,200,200,2.688,0.484,0.002,...,0.002,2064.126,372.05,1.0,1752.0,0.000417,68.5,59.6,1752.0,0.000417
2,756.15,544.54,635.743333,91.0,1,300,300,22.347,4.848,0.003,...,0.003,206.292,44.748,1.0,2419.0,0.000212,63.6,60.8,2419.0,0.000212
3,756.15,544.54,651.355,91.0,1,400,400,3.789,0.505,0.002,...,0.002,1980.407,263.942,1.0,3233.0,0.000361,,,3233.0,0.000361
4,756.15,544.54,664.086,91.0,1,500,500,2.137,0.465,0.002,...,0.002,2151.587,467.973,1.0,3979.0,0.000324,0.0,60.8,3979.0,0.000324


In [52]:
rewards = df_ts \
    .groupby("num_steps_trained")["episode_reward_mean"] \
    .aggregate(["mean", "max", "min", "std"])

rewards

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,516.520000,554.190000,450.830000,57.093456
200,546.313333,575.540000,498.490000,41.755869
300,563.842222,635.743333,505.873333,66.046456
400,563.542500,651.355000,484.955000,83.582686
500,594.742000,664.086000,530.754000,66.827170
...,...,...,...,...
9600,578.368667,610.757800,544.759100,33.016272
9700,578.594400,610.432500,544.575400,32.982672
9800,578.200667,609.538900,546.632300,31.453931
9900,579.157400,610.316400,548.290800,31.013834


In [53]:
plot_line_with_stddev(rewards, x_col='num_steps_trained', y_col='mean', stddev_col='std', 
                      title='Rewards vs. Steps', x_axis_label='step', y_axis_label='reward')

The rewards reach what appears to be nearly optimal by 3000 steps, then show some slow improvement above 8000.

In [54]:
regrets = df_ts \
    .groupby("num_steps_trained")["learner/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

regrets

Unnamed: 0_level_0,mean,max,min,std
num_steps_trained,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
100,999.000000,1080.0,953.0,70.363343
200,1850.000000,1988.0,1752.0,122.979673
300,2696.333333,2947.0,2419.0,265.008176
400,3530.666667,3790.0,3233.0,280.471627
500,4286.333333,4614.0,3979.0,317.987945
...,...,...,...,...
9600,83159.666667,86845.0,80011.0,3448.463039
9700,84116.000000,87744.0,80956.0,3418.114100
9800,84837.000000,88417.0,81686.0,3385.944624
9900,85625.333333,89232.0,82500.0,3391.713038


In [55]:
plot_cumulative_regret(regrets)
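
Cumulative regret is simply the running sum of the per-step regrets. As a small illustration, the sketch below accumulates the ten regret values printed by the random-action run earlier in this lesson; it uses only the standard library and is independent of RLlib's own bookkeeping.

```python
from itertools import accumulate

# Per-step regrets from the ten random actions shown earlier.
step_regrets = [43, 1, 30, 51, 36, 0, 18, 0, 20, 1]

# Running total: entry i is the sum of the regrets for steps 0..i.
cumulative = list(accumulate(step_regrets))
print(cumulative)
# [43, 44, 74, 125, 161, 161, 179, 179, 199, 200]
```

Because each per-step regret is non-negative, the curve never decreases. A policy that is learning shows up as a curve that flattens, not one that falls.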

## Evaluating the Trained Policy

Overall, how well did the trained policy perform? The results should be better than random, but less than the best case.

In [56]:
print("{:5.2f}% optimized return annualized".format(max(rewards["mean"]) / n_years))

 6.46% optimized return annualized


That's better than the random action baseline of 3.75%, but nowhere near the best case scenario of 15.18% return. Hence, our regrets grow...
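
One way to put these three numbers on a common scale is to ask what fraction of the gap between the random baseline and the best case the trained policy closed. The sketch below does that arithmetic with the figures reported in this lesson; the "fraction of the gap" measure is an illustrative choice, not a metric reported by RLlib.

```python
random_baseline = 3.75   # % annualized, random actions
trained_policy = 6.46    # % annualized, LinTS result above
best_case = 15.18        # % annualized, perfect foresight

# Share of the random-to-best gap recovered by the trained policy.
gap_closed = (trained_policy - random_baseline) / (best_case - random_baseline)
print(f"{gap_closed:.0%}")  # 24%
```

By this measure the policy recovers roughly a quarter of what perfect foresight would add over random choices.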

Note that investing solely in the S&P stock index would have produced a better than 8% return over that period -- that is, if one could wait 92 years. However, investing one's entire portfolio in stocks can be quite a risky policy in the short term, so we were exploring how to balance a portfolio given only limited information.

In any case, the contextual bandit performed well considering that it could only use *inflation* for the context of its decisions, and could only take actions once each year.

## Exercise 1

Try using a `LinUCBTrainer`-based trainer. How does the annualized return compare?

---

## Extra - Restoring from a Checkpoint

In the previous lesson, [05 Thompson Sampling](05-Thompson-Sampling.ipynb), we showed how to restore a trainer from a checkpoint, but almost "in passing". Let's look at this feature in a bit more detail. Our `MarketLinTSTrainer` trainer just extends the built-in `LinTSTrainer` with additional configuration information, which is restored from the checkpoint. Hence, we can use the following approach:

In [None]:
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG

TS_CONFIG["env"] = MarketBandit

trial = analysis.trials[0]
trainer = LinTSTrainer(config=TS_CONFIG)
trainer.restore(trial.checkpoint.value)

Next, get the model so we can inspect the distribution of the arm weights:

In [None]:
model = trainer.get_policy().model
n_arms = len(DEFAULT_TICKERS)  # one arm per ticker (four here)
means = [model.arms[i].theta.numpy() for i in range(n_arms)]
covs = [model.arms[i].covariance.numpy() for i in range(n_arms)]
means, covs, model.arms[0].theta.numpy()

A final note: when you checkpoint the model, it will change how the training performs in this notebook if you rerun the training! So, when doing experiments, be sure you are starting from scratch when that is desired!