# Ray RLlib - Multi-Armed Bandits - Exercise Solutions

© 2019-2021, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademyLogo.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [None]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import pandas as pd
import os
import time
import random
import ray
from ray.tune.progress_reporter import JupyterNotebookReporter

In [None]:
ray.init(ignore_reinit_error=True)

## 03: Simple Multi-Armed Bandits - Exercise 1

Change the `step` method to randomly change the `current_context` on each invocation:

```python
def step(self, action):
    reward = self.rewards_for_context[self.current_context][action]
    self.current_context = random.choice([-1.,1.])
    return (np.array([-self.current_context, self.current_context]), reward, True,
            {
                "regret": 10 - reward
            })
```

Repeat the training and analysis. Does the training behavior change in any appreciable way? Why or why not?

See the [solutions notebook](solutions/Multi-Armed-Bandits-Solutions.ipynb) for discussion of this and the following exercises.

In [None]:
class SimpleContextualBandit2 (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        self.current_context = random.choice([-1.,1.])
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBandit2(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'
    

In [None]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

In [None]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleContextualBandit2,
}

In [None]:
analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop, 
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=1)

In [None]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

In [None]:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
df

It trains just as easily as the original implementation that didn't switch contexts between steps. Is this surprising? Probably not, because the relationship between the reward and the context remains linear, so what LinUCB learns for one context is correct for the second context, too. Also, _Tune_ runs many episodes, so it studies both contexts.

## 03: Simple Multi-Armed Bandits - Exercise 2

Recall the `rewards_for_context` we used:

```python
self.rewards_for_context = {
    -1.: [-10, 0, 10],
    1.: [10, 0, -10],
}
```

We said that Linear Upper Confidence Bound assumes a linear dependency between the expected reward of an action and its context. It models the representation space using a set of linear predictors.
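
To make "a set of linear predictors" concrete, here is a minimal sketch of the textbook disjoint LinUCB rule applied to the reward table above. It is not RLlib's implementation: the `LinUCBArm` class, the `alpha` exploration weight, and the 200-step loop are assumptions chosen for illustration.

```python
import numpy as np

class LinUCBArm:
    """One ridge-regression predictor with a confidence bonus (hypothetical helper)."""

    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)      # regularized Gram matrix: I + sum of x x^T
        self.b = np.zeros(dim)    # running sum of reward * x
        self.alpha = alpha        # exploration strength (assumed value)

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b    # current linear weights for this arm
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

rewards_for_context = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}
rng = np.random.default_rng(0)
arms = [LinUCBArm(dim=2) for _ in range(3)]

for _ in range(200):
    context = float(rng.choice([-1.0, 1.0]))
    x = np.array([-context, context])   # same observation layout as the bandit env
    action = int(np.argmax([arm.ucb(x) for arm in arms]))
    arms[action].update(x, rewards_for_context[context][action])

def best_arm(context):
    """Arm with the highest upper confidence bound for this context."""
    x = np.array([-context, context])
    return int(np.argmax([arm.ucb(x) for arm in arms]))

print(best_arm(-1.0), best_arm(1.0))
```

Each arm keeps its own weight vector, and the square-root term is the confidence bonus, which shrinks as an arm is tried more often.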

Change the values for the rewards as follows, so they no longer have the same simple linear relationship:

```python
self.rewards_for_context = {
    -1.: [-10, 10, 0],
    1.: [0, 10, -10],
}
```

Also remove the change made for exercise 1, the line `self.current_context = random.choice([-1.,1.])` in the `step` method.

Run the training again and look at the results for the reward mean in TensorBoard. How successful was the training? How smooth is the plot for `episode_reward_mean`? How many steps were taken in the training?

In [None]:
class SimpleContextualBanditNonlinear (gym.Env):
    def __init__ (self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        self.observation_space = Box(low=-1., high=1., shape=(2, ), dtype=np.float64)  # Random (x,y), where x,y from -1 to 1
        self.current_context = None
        self.rewards_for_context = {   # Changed here:
            -1.: [-10, 10, 0],
            1.: [0, 10, -10],
        }

    def reset (self):
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step (self, action):
        reward = self.rewards_for_context[self.current_context][action]
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleContextualBanditNonlinear(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context}, rewards per context={self.rewards_for_context})'

In [None]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

In [None]:
print(f'current_context = {bandit.current_context}')
for i in range(10):
    action = bandit.action_space.sample()
    observation, reward, done, info = bandit.step(action)
    print(f'observation = {observation}, action = {action}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

In [None]:
# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

In [None]:
analysis = ray.tune.run("contrib/LinUCB", config=config, stop=stop, 
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=1)

In [None]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

In [None]:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
df

It ran the maximum of 20,000 steps (200 iterations of 100 steps each), and the best mean reward across different runs is well below 10.0. The `episode_reward_mean` is chaotic:

![Nonlinear model with LinUCB](../../../images/rllib/TensorBoard2.png)

Because LinUCB expects a linear relationship between the context and each reward, it's not surprising that it fails to converge to the desired reward mean.
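
One way to see the mismatch is to fit each arm's rewards with a linear function of the observation `[-context, context]` and look at the error that remains. The sketch below assumes a linear predictor without an intercept term, and `linear_fit_error` is a hypothetical helper, not part of RLlib.

```python
import numpy as np

def linear_fit_error(rewards_for_context):
    """Largest absolute error of the best no-intercept linear fit, one value per arm."""
    contexts = sorted(rewards_for_context)
    X = np.array([[-c, c] for c in contexts])   # same features the env returns
    errors = []
    for arm in range(3):
        y = np.array([rewards_for_context[c][arm] for c in contexts], dtype=float)
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares weights
        errors.append(float(np.max(np.abs(X @ theta - y))))
    return errors

linear = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}      # original rewards
nonlinear = {-1.0: [-10, 10, 0], 1.0: [0, 10, -10]}   # exercise 2 rewards

print("original rewards:  ", linear_fit_error(linear))
print("exercise 2 rewards:", linear_fit_error(nonlinear))
```

Under that assumption the original rewards are fit essentially exactly, while the exercise 2 rewards leave an error of 5 to 10 on every arm, so no choice of linear weights reproduces them.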

## 03: Simple Multi-Armed Bandits - Exercise 3

We briefly discussed another algorithm for selecting the next action, _Thompson Sampling_, in the [previous lesson](../02-Exploration-vs-Exploitation-Strategies.ipynb). Repeat exercises 1 and 2 using the linear version, called _Linear Thompson Sampling_ ([RLlib documentation](https://docs.ray.io/en/latest/rllib-algorithms.html?highlight=greedy#linear-thompson-sampling-contrib-lints)). To make this change, look at this code we used above:

```python
analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.
```

Change `contrib/LinUCB` to `contrib/LinTS`.  
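
For intuition about what changes when you switch algorithms, here is a minimal sketch of textbook linear Thompson sampling on the linear reward table from exercise 1. Instead of adding a confidence bonus, each arm samples a weight vector from its Gaussian posterior and the arm with the best sampled score is played. The `LinTSArm` class, the identity prior, and the 500-step loop are assumptions for illustration; this is not the code behind `contrib/LinTS`.

```python
import numpy as np

class LinTSArm:
    """Gaussian posterior over one arm's linear weights (hypothetical helper)."""

    def __init__(self, dim):
        self.precision = np.eye(dim)   # prior precision: identity (assumed)
        self.b = np.zeros(dim)         # running sum of reward * x

    def posterior(self):
        cov = np.linalg.inv(self.precision)
        return cov @ self.b, cov       # posterior mean and covariance

    def update(self, x, reward):
        self.precision += np.outer(x, x)
        self.b += reward * x

rewards_for_context = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}
rng = np.random.default_rng(0)
arms = [LinTSArm(dim=2) for _ in range(3)]

for _ in range(500):
    context = float(rng.choice([-1.0, 1.0]))
    x = np.array([-context, context])
    # Draw one plausible weight vector per arm, then act greedily on the draws.
    scores = [rng.multivariate_normal(*arm.posterior()) @ x for arm in arms]
    action = int(np.argmax(scores))
    arms[action].update(x, rewards_for_context[context][action])

def greedy_arm(context):
    """Arm with the highest posterior-mean reward for this context."""
    x = np.array([-context, context])
    return int(np.argmax([arm.posterior()[0] @ x for arm in arms]))

print(greedy_arm(-1.0), greedy_arm(1.0))
```

Exploration here comes entirely from the randomness of the posterior draws, which narrow as each arm collects more data.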

In [None]:
bandit = SimpleContextualBandit2()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBandit2,
}

analysis = ray.tune.run("contrib/LinTS", config=config, stop=stop, 
    progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
    verbose=1)

stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

In [None]:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
df

As before, the training takes only 200 steps and converges to the desired reward mean of `10.0`.

Now let's try the nonlinear bandit:

In [None]:
bandit = SimpleContextualBanditNonlinear()
observation = bandit.reset()

# `stop` defined above is unchanged.

config = {
    "env": SimpleContextualBanditNonlinear,
}

start_time = time.time()

analysis = ray.tune.run("contrib/LinTS", config=config, stop=stop, 
                        progress_reporter=JupyterNotebookReporter(overwrite=False),  # This is the default, actually.
                        verbose=1)

print("The trials took", time.time() - start_time, "seconds\n")

In [None]:
df = analysis.dataframe(metric="episode_reward_mean", mode="max")
df

This run with Thompson sampling yields similar results: the reward mean ends up between 4.5 and 5.0, and the `episode_reward_mean` graph in TensorBoard is somewhat chaotic over the 20,000 steps.

## 04: Linear Upper Confidence Bound - Exercise 1

Change the `training_iterations` from 20 to 40. Does the characteristic behavior of cumulative regret change at higher steps?

In [None]:
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.envs import ParametricItemRecoEnv

In [None]:
UCB_CONFIG["env"] = ParametricItemRecoEnv

# Total training steps will be 40 * timesteps_per_iteration (100 by default) = 4,000
training_iterations = 40

print("Running training for %s iterations" % training_iterations)

In [None]:
analysis = ray.tune.run(
    "contrib/LinUCB",
    config=UCB_CONFIG,
    stop={"training_iteration": training_iterations},
    num_samples=5,
    checkpoint_at_end=False,
    verbose=1
)

stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

In [None]:
# Combine the per-trial results into one DataFrame.
frame = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

df = frame.groupby("info/num_steps_trained")[
    "info/learner/default_policy/cumulative_regret"].aggregate(["mean", "max", "min", "std"])
df

In [None]:
df.plot(y="mean", yerr="std")

([image](../../../images/rllib/LinUCB-cumulative-regret2.png))

The slope appears to stop flattening, suggesting that the previous number of steps, 2000, was sufficient to reach the optimal behavior. Beyond that, regret continues to accumulate, but linearly in the number of steps, getting neither better nor worse.
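
A rough way to check "stops flattening" numerically is to compare the fitted slope of the cumulative regret over the first and second halves of the run. The sketch below uses a synthetic curve, not data from this notebook, and `fitted_slope` is a hypothetical helper.

```python
import numpy as np

def fitted_slope(steps, regret):
    """Slope of the least-squares line through (steps, regret)."""
    return float(np.polyfit(steps, regret, deg=1)[0])

# Synthetic curve for illustration only: an early transient that dies out,
# plus a constant per-step regret of 0.01.
steps = np.arange(100, 4100, 100)
regret = 20.0 * (1.0 - np.exp(-steps / 500.0)) + 0.01 * steps

half = len(steps) // 2
early_slope = fitted_slope(steps[:half], regret[:half])
late_slope = fitted_slope(steps[half:], regret[half:])
print(f"early slope = {early_slope:.4f}, late slope = {late_slope:.4f}")
```

With the grouped `df` computed above, the same helper could be applied to `df.index.to_numpy()` and `df["mean"].to_numpy()`. A late slope that settles at a constant positive value matches regret growing linearly.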

## 05: Linear Thompson Sampling - Exercise 1

Experiment with different $\delta$ values, for example 0.7 and 0.9. What do the cumulative regret and weights graphs look like? 

You can set the $\delta$ value like this:

```python
TS_CONFIG["delta"] = 0.7
```

In [None]:
from ray.rllib.contrib.bandits.agents import LinTSTrainer
from ray.rllib.contrib.bandits.agents.lin_ts import TS_CONFIG
from ray.rllib.contrib.bandits.envs import WheelBanditEnv

In [None]:
TS_CONFIG["env"] = WheelBanditEnv

training_iterations = 20
print("Running training for %s iterations" % training_iterations)

In [None]:
def run_ts (delta):
    TS_CONFIG["delta"] = delta

    start_time = time.time()

    analysis = ray.tune.run(
        LinTSTrainer,
        config=TS_CONFIG,
        stop={"training_iteration": training_iterations},
        num_samples=2,
        checkpoint_at_end=True,
        verbose=1)

    print("The trials took", time.time() - start_time, "seconds\n")

    # Combine the per-trial results into one DataFrame.
    df = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)

    return df, analysis

In [None]:
def process_df (df, analysis):
    ts_regrets = df \
        .groupby("info/num_steps_trained")["info/learner/default_policy/cumulative_regret"] \
        .aggregate(["mean", "max", "min", "std"])
    
    trial = analysis.trials[0]
    trainer = LinTSTrainer(config=TS_CONFIG)
    trainer.restore(trial.checkpoint.value)
    
    model = trainer.get_policy().model
    means = [model.arms[i].theta.numpy() for i in range(5)]
    covs = [model.arms[i].covariance.numpy() for i in range(5)]

    return ts_regrets, model, means, covs

In [None]:
delta = 0.7
ts_df7, analysis7 = run_ts(delta)

In [None]:
ts_regrets7, model7, means7, covs7 = process_df(ts_df7, analysis7)
ts_regrets7.head()

In [None]:
ts_regrets7.plot(y="mean", yerr="std")

([image](../../../images/rllib/LinTS-Cumulative-Regret-07.png))

The cumulative regret values are much higher than for $\delta = 0.5$ in the lesson, and the standard deviation may diverge. We mentioned in the lesson that the problem becomes harder for higher $\delta$, which fits this result.
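
A quick calculation suggests why. The sketch below assumes the standard wheel bandit definition, in which contexts are drawn uniformly from the unit disk and only those farther than $\delta$ from the center have a high-reward arm. Under that assumption the informative contexts make up $1 - \delta^2$ of all draws; `outer_fraction` is a hypothetical helper that checks this by sampling.

```python
import numpy as np

def outer_fraction(delta, n=200_000, seed=0):
    """Sampled share of unit-disk contexts whose norm exceeds delta."""
    rng = np.random.default_rng(seed)
    # Rejection-sample points uniformly from the unit disk.
    pts = rng.uniform(-1.0, 1.0, size=(n, 2))
    norms = np.linalg.norm(pts, axis=1)
    norms = norms[norms <= 1.0]
    return float(np.mean(norms > delta))

for delta in (0.5, 0.7, 0.9):
    print(f"delta={delta}: sampled {outer_fraction(delta):.3f}, exact {1 - delta**2:.3f}")
```

The exact shares are 0.75, 0.51, and 0.19, so at higher $\delta$ the learner sees far fewer contexts in which the high-reward arms pay off.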

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

colors  = ["blue", "black", "green", "red", "yellow"]

for i in range(0, 5):
    x, y = np.random.multivariate_normal(means7[i] / 30, covs7[i], 5000).T
    plt.scatter(x, y, color=colors[i])
    
plt.show()

([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-07.png))

Compare the separation of the clusters with the result for $\delta = 0.5$:

![image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-05.png)


In [None]:
delta = 0.9
ts_df9, analysis9 = run_ts(delta)

In [None]:
ts_regrets9, model9, means9, covs9 = process_df(ts_df9, analysis9)

In [None]:
ts_regrets9.plot(y="mean", yerr="std")

([image](../../../images/rllib/LinTS-Cumulative-Regret-09.png))

Qualitatively the same as for $\delta = 0.7$, but the cumulative regret values are even higher.

In [None]:
colors  = ["blue", "black", "green", "red", "yellow"]

for i in range(0, 5):
    x, y = np.random.multivariate_normal(means9[i] / 30, covs9[i], 5000).T
    plt.scatter(x, y, color=colors[i])
    
plt.show()

([image](../../../images/rllib/LinTS-Weight-Distribution-of-Arms-09.png))

## 06: Market Example - Exercise 1

Try using a `LinUCBTrainer`-based trainer. How does the annualized return compare?

In [None]:
# Some properties we'll need:
DEFAULT_MAX_INFLATION = 100.0
DEFAULT_TICKERS = ["sp500", "t.bill", "t.bond", "corp"]
DEFAULT_DATA_FILE = os.path.abspath(os.path.curdir) + "/../market.tsv"  # full path

def load_market_data (file_name):
    with open(file_name, "r") as f:
        return pd.read_table(f)

In [None]:
df = load_market_data(DEFAULT_DATA_FILE)
df

In [None]:
n_years = len(df)

In [None]:
from gym.utils import seeding  # used by MarketBandit.seed() below
from ray.rllib.contrib.bandits.agents.lin_ucb import UCB_CONFIG
from ray.rllib.contrib.bandits.agents.lin_ucb import LinUCBTrainer
import ray

In [None]:
class MarketBandit (gym.Env):

    def __init__ (self, config={}):
        self.max_inflation = config.get('max-inflation', DEFAULT_MAX_INFLATION)
        self.tickers = config.get('tickers', DEFAULT_TICKERS)
        self.data_file = config.get('data-file', DEFAULT_DATA_FILE)
        print(f"MarketBandit: max_inflation: {self.max_inflation}, tickers: {self.tickers}, data file: {self.data_file} (config: {config})")

        self.action_space = Discrete(4)
        self.observation_space = Box(
            low  = -self.max_inflation,
            high =  self.max_inflation,
            shape=(1, )
        )
        self.df = load_market_data(self.data_file)
        self.cur_context = None


    def reset (self):
        self.year = self.df["year"].min()
        self.cur_context = self.df.loc[self.df["year"] == self.year]["inflation"][0]
        self.done = False
        self.info = {}

        return [self.cur_context]


    def step (self, action):
        if self.done:
            reward = 0.
            regret = 0.
        else:
            row = self.df.loc[self.df["year"] == self.year]

            # calculate reward
            ticker = self.tickers[action]
            reward = float(row[ticker])

            # calculate regret
            max_reward = max(map(lambda t: float(row[t]), self.tickers))
            regret = round(max_reward - reward)

            # update the context
            self.cur_context = float(row["inflation"])

            # increment the year
            self.year += 1

            if self.year >= self.df["year"].max():
                self.done = True

        context = [self.cur_context]
        #context = self.observation_space.sample()

        self.info = {
            "regret": regret,
            "year": self.year
        }

        return [context, reward, self.done, self.info]


    def seed (self, seed=None):
        """Sets the seed for this env's random number generator(s).
        Note:
            Some environments use multiple pseudorandom number generators.
            We want to capture all such seeds used in order to ensure that
            there aren't accidental correlations between multiple generators.
        Returns:
            list<bigint>: Returns the list of seeds used in this env's random
              number generators. The first value in the list should be the
              "main" seed, or the value which a reproducer should pass to
              'seed'. Often, the main seed equals the provided 'seed', but
              this won't be true if seed=None, for example.
        """
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

In [None]:
import copy

market_config = copy.deepcopy(UCB_CONFIG)

market_config["env"] = MarketBandit
market_config["max-inflation"] = DEFAULT_MAX_INFLATION
market_config["tickers"] = DEFAULT_TICKERS
market_config["data-file"] = DEFAULT_DATA_FILE

stop = {
    "training_iteration": 100
}

In [None]:
MarketLinUCBTrainer = LinUCBTrainer.with_updates(
    name="MarketLinUCBTrainer",
    default_config=market_config,      # Will be merged with Trainer.COMMON_CONFIG (rllib/agents/trainer.py)
    #default_policy=[somePolicyClass]  # If we had a policy...
)

In [None]:
analysis = ray.tune.run(
    MarketLinUCBTrainer,
    config=market_config,
    stop=stop,
    num_samples=3,    
    checkpoint_at_end=True,
    verbose=1
)

In [None]:
stats = analysis.stats()
secs = stats["timestamp"] - stats["start_time"]
print(f'{secs:7.2f} seconds, {secs/60.0:7.2f} minutes')

In [None]:
# Combine the per-trial results into one DataFrame.
df_ts = pd.concat(list(analysis.trial_dataframes.values()), ignore_index=True)
    
df_ts.head()

In [None]:
rewards = df_ts \
    .groupby("info/num_steps_trained")["episode_reward_mean"] \
    .aggregate(["mean", "max", "min", "std"])

rewards

In [None]:
regrets = df_ts \
    .groupby("info/num_steps_trained")["info/learner/default_policy/cumulative_regret"] \
    .aggregate(["mean", "max", "min", "std"])

regrets

The results for _LinTS_ were ~340 for reward mean. So, training with _LinUCB_ isn't as successful.

In [None]:
rewards.plot(y="mean", yerr="std")

([image](../../../images/rllib/Market-Bandit-Rewards-vs-Steps-LinUCB.png))

In [None]:
regrets.plot(y="mean", yerr="std")

([image](../../../images/rllib/Market-Bandit-Cumulative-Regret-LinUCB.png))

What's the annualized return?

In [None]:
print("{:5.2f}% optimized return annualized".format(max(rewards["mean"]) / n_years))

The result is likely no better than the completely random choices investigated in the lesson, and it may be worse!

The market that we're modeling doesn't exhibit a linear relationship between the context, inflation in our case, and the rewards. Hence, it's not too surprising that a linear algorithm would fail to model the behavior perfectly. What's interesting here is that Thompson Sampling did a noticeably better job than Upper Confidence Bound.

In [None]:
ray.shutdown()