# Ray RLlib - Simple Multi-Armed Bandits Example

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

Let's explore a very simple contextual bandit example with 3 arms. We'll run trials using RLlib and [Tune](http://tune.io), Ray's hyperparameter tuning library. 

In [6]:
import gym
from gym.spaces import Discrete, Box
import numpy as np
import random
from ray import tune
import time

We define the bandit as a subclass of an OpenAI Gym environment. The action space has three discrete actions, one for each arm, and the observation space (the context) is a two-element vector with values in the range [-1.0, 1.0].

Note that we'll randomly pick the context when `reset` is called, but it stays fixed (static) throughout the episode (the set of steps between calls to `reset`). Hence, this is not a contextual bandit.

In [7]:
class SimpleBandit(gym.Env):
    def __init__(self, config=None):
        self.action_space = Discrete(3)     # 3 arms
        # The observation is the pair (-context, context), so both values lie in [-1, 1].
        self.observation_space = Box(low=-1., high=1., shape=(2,), dtype=np.float64)
        self.current_context = None

    def reset(self):
        # Pick the context at random; it stays fixed until the next reset.
        self.current_context = random.choice([-1., 1.])
        return np.array([-self.current_context, self.current_context])

    def step(self, action):
        # The best arm depends on the context: arm 2 for -1, arm 0 for 1.
        rewards_for_context = {
            -1.: [-10, 0, 10],
            1.: [10, 0, -10],
        }
        reward = rewards_for_context[self.current_context][action]
        # Every episode ends after a single step, so done is always True.
        return (np.array([-self.current_context, self.current_context]), reward, True,
                {
                    "regret": 10 - reward
                })

    def __repr__(self):
        return f'SimpleBandit(action_space={self.action_space}, observation_space={self.observation_space}, current_context={self.current_context})'

Try repeating the next two code cells enough times to see the current context set to `1.0` and `-1.0`.

In [8]:
bandit = SimpleBandit()
observation = bandit.reset()
f'Initial observation = {observation}, bandit = {repr(bandit)}'

'Initial observation = [ 1. -1.], current context = -1.0'

The `bandit.current_context` and the observation of the current environment will remain fixed through the episode.

In [10]:
for i in range(10):
    observation, reward, done, info = bandit.step(bandit.action_space.sample())
    print(f'observation = {observation}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =  -10, done = True , info = {'regret': 20}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
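
Sampling actions at random, as above, gives a useful baseline. The reward table in `SimpleBandit.step` implies that uniform random play averages a reward of 0 and a regret of 10 in either context, while the best arm is arm 2 for context `-1.0` and arm 0 for context `1.0`. The following sketch derives those numbers from a copy of the table. The names `REWARDS_FOR_CONTEXT`, `MAX_REWARD`, and `summarize` are illustrative and not part of the notebook.

```python
# Reward table copied from SimpleBandit.step: context -> reward for each of the 3 arms.
REWARDS_FOR_CONTEXT = {
    -1.0: [-10, 0, 10],
    1.0: [10, 0, -10],
}
MAX_REWARD = 10  # the constant used in the environment's "regret" entry


def summarize(rewards_for_context):
    """Per context: the best arm, plus mean reward and regret under uniform random play."""
    summary = {}
    for context, rewards in rewards_for_context.items():
        mean_reward = sum(rewards) / len(rewards)
        summary[context] = {
            "best_arm": rewards.index(max(rewards)),
            "random_mean_reward": mean_reward,
            "random_mean_regret": MAX_REWARD - mean_reward,
        }
    return summary


for context, stats in summarize(REWARDS_FOR_CONTEXT).items():
    print(f"context = {context:4.1f}: {stats}")
```

A trained policy should beat this baseline by choosing the best arm for each observed context, which drives the per-step regret toward 0.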


Now use Tune to train on this bandit with RLlib's `contrib/LinUCB` algorithm:

In [11]:
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config = {
    "env": SimpleBandit,
}

In [12]:
from ray.tune.progress_reporter import JupyterNotebookReporter

In [13]:
start_time = time.time()

analysis = tune.run("contrib/LinUCB", config=config, stop=stop, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # this is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

2020-06-04 14:14:30,481	INFO resource_spec.py:212 -- Starting Ray with 4.39 GiB memory available for workers and up to 2.21 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-04 14:14:30,801	INFO services.py:1170 -- View the Ray dashboard at localhost:8265


Trial name,status,loc
contrib_LinUCB_SimpleBandit_00000,RUNNING,


(pid=57907) 2020-06-04 14:14:38,676	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=57907) 2020-06-04 14:14:38,679	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=57907) 2020-06-04 14:14:38,694	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleBandit_00000:
  custom_metrics: {}
  date: 2020-06-04_14-14-38
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.7
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 2b8c7c1def084ab0ae5094cab7b30c3f
  experiment_tag: '0'
  grad_time_ms: 0.277
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.277
    learner:
      cumulative_regret: 30.0
      update_latency: 0.00015020370483398438
    num_steps_sampled: 100
    num_steps_trained: 100
    opt

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleBandit_00000,TERMINATED,,2,0.216743,200,10


The trials took 8.493189096450806 seconds
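
The status table shows the trial terminating after 2 iterations and 200 timesteps with a reward of 10, far short of the iteration and timestep limits. That is consistent with Tune ending a trial as soon as any one entry in the `stop` dictionary is reached, here `episode_reward_mean`. The sketch below illustrates that "any criterion" rule in plain Python. `triggered_criteria` is a hypothetical helper, not Tune's code, and the result values are modeled on the output above (the iteration number of the first result is assumed to be 1).

```python
stop = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}


def triggered_criteria(result, stop):
    """Return the names of the stop criteria that this result has reached."""
    return [
        metric
        for metric, threshold in stop.items()
        if metric in result and result[metric] >= threshold
    ]


# First reported result: no criterion reached yet, so training continues.
first = {"training_iteration": 1, "timesteps_total": 100, "episode_reward_mean": 9.7}
# Final result: the mean reward has reached its threshold, so the trial stops.
final = {"training_iteration": 2, "timesteps_total": 200, "episode_reward_mean": 10.0}

for name, result in (("first", first), ("final", final)):
    hits = triggered_criteria(result, stop)
    print(f"{name}: stop = {bool(hits)}, triggered by {hits}")
```

Because the criteria are combined with "or", the iteration and timestep limits act as a safety net in case the reward target is never reached.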



We can see some of the final data as a dataframe:

In [14]:
df = analysis.dataframe()
df

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.746,0.26,0.001,...,1341.276,1.0,30.0,0.000152,,,30.0,0.000152,<class '__main__.SimpleBandit'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


The easiest way to inspect the progression of training is to use TensorBoard.

1. If you are running on the Anyscale Platform, click the _Tensorboard_ link.
2. If you are running this notebook on a laptop, open a terminal window using the `+` under the _Edit_ menu, run the following command, then open the URL shown.

```
tensorboard --logdir ~/ray_results 
```

You may have many data sets plotted from previous tutorial lessons. In the _Runs_ on the left, look for one named something like this:

```
contrib/LinUCB/contrib_LinUCB_SimpleBandit_0_YYYY-MM-DD_HH-MM-SSxxxxxxxx  
```

If you have several of them, you want the one with the latest timestamp. To select just that one, click _toggle all runs_ below the list of runs, then select the one you want. You should see something like the following image:

![TensorBoard for SimpleBandit](../../images/rllib/TensorBoard-for-SimpleBandit.png)

The graph for the metric we were optimizing, the mean reward, is shown with a rectangle surrounding it. It improved steadily during the training runs.
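
Why does training reach the maximum mean reward so quickly? One way to see it: LinUCB models each arm's expected reward as a linear function of the observation, and in this environment each arm's reward is exactly linear in the observation `[-context, context]`. The sketch below is a minimal, plain-Python version of the per-arm ("disjoint") LinUCB idea for this two-dimensional case. It is not RLlib's `contrib/LinUCB` implementation; `LinUCBArm`, `train`, the helper functions, and the value of `ALPHA` are all assumptions made for illustration.

```python
import math
import random

REWARDS_FOR_CONTEXT = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}  # from SimpleBandit.step
NUM_ARMS = 3
ALPHA = 1.0  # exploration weight; an assumed value, not RLlib's setting


def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


class LinUCBArm:
    """Ridge-regression reward model for a single arm."""

    def __init__(self):
        self.A = [[1.0, 0.0], [0.0, 1.0]]  # identity plus the sum of x x^T
        self.b = [0.0, 0.0]                # sum of reward * x

    def estimate(self, x):
        theta = mat_vec(inverse_2x2(self.A), self.b)
        return dot(theta, x)

    def ucb(self, x, alpha):
        # Estimated reward plus a bonus that shrinks as the arm is pulled more often.
        bonus = alpha * math.sqrt(dot(x, mat_vec(inverse_2x2(self.A), x)))
        return self.estimate(x) + bonus

    def update(self, x, reward):
        for i in range(2):
            for j in range(2):
                self.A[i][j] += x[i] * x[j]
            self.b[i] += reward * x[i]


def observe(context):
    return [-context, context]  # same encoding as SimpleBandit.reset


def train(num_steps, seed=0):
    rng = random.Random(seed)
    arms = [LinUCBArm() for _ in range(NUM_ARMS)]
    cumulative_regret = 0.0
    for _ in range(num_steps):
        context = rng.choice([-1.0, 1.0])
        x = observe(context)
        scores = [arm.ucb(x, ALPHA) for arm in arms]
        action = scores.index(max(scores))  # ties go to the lowest arm index
        reward = REWARDS_FOR_CONTEXT[context][action]
        arms[action].update(x, reward)
        cumulative_regret += 10 - reward
    return arms, cumulative_regret


arms, regret = train(num_steps=200)
for context in (-1.0, 1.0):
    estimates = [round(arm.estimate(observe(context)), 2) for arm in arms]
    print(f"context = {context:4.1f}, estimated rewards per arm = {estimates}")
print("cumulative regret =", regret)
```

In this sketch a single pull of an arm already gives a usable estimate for both contexts, because the two observations are negatives of each other. Only a handful of exploratory pulls are needed before the policy settles on the best arm for each context.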

## Simple Context Bandit

`SimpleBandit` had a fixed context through entire episodes. What if we made it contextual, and allowed the `current_context` to randomly change at each step? To do that, we can simply subclass `SimpleBandit` and override `step` to change the `current_context` after calling `SimpleBandit.step()`. (We do this afterwards to set up the _next_ step, but whether we do this before or after doesn't affect the results.)

In [24]:
class SimpleContextBandit(SimpleBandit):
    def __init__(self, config=None):
        super().__init__(config)

    def step(self, action):
        result = super().step(action)
        self.current_context = random.choice([-1., 1.])
        return result

In [25]:
bandit2 = SimpleContextBandit()
observation2 = bandit2.reset()
f'Initial observation = {observation2}, bandit = {repr(bandit2)}'

Now the `bandit2.current_context` and the observation of the current environment will _change_ through the episode.

In [29]:
for i in range(10):
    observation, reward, done, info = bandit2.step(bandit2.action_space.sample())
    print(f'observation = {observation}, reward = {reward:4d}, done = {str(done):5s}, info = {info}')

observation = [-1.  1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], reward =    0, done = True , info = {'regret': 10}
observation = [-1.  1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [-1.  1.], reward =  -10, done = True , info = {'regret': 20}
observation = [-1.  1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =   10, done = True , info = {'regret': 0}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}
observation = [ 1. -1.], reward =    0, done = True , info = {'regret': 10}


Train with Tune again:

In [30]:
stop2 = {
    "training_iteration": 200,
    "timesteps_total": 100000,
    "episode_reward_mean": 10.0,
}

config2 = {
    "env": SimpleContextBandit,
}

In [31]:
start_time = time.time()

analysis2 = tune.run("contrib/LinUCB", config=config2, stop=stop2, 
                    progress_reporter=JupyterNotebookReporter(overwrite=False),  # this is the default, actually.
                    verbose=2)  # Change to 0 or 1 to reduce the output.

print("The trials took", time.time() - start_time, "seconds\n")

Trial name,status,loc
contrib_LinUCB_SimpleContextBandit_00000,RUNNING,


(pid=57910) 2020-06-04 14:30:36,705	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=57910) 2020-06-04 14:30:36,708	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=57910) 2020-06-04 14:30:36,716	INFO trainable.py:217 -- Getting current IP.
Result for contrib_LinUCB_SimpleContextBandit_00000:
  custom_metrics: {}
  date: 2020-06-04_14-30-36
  done: false
  episode_len_mean: 1.0
  episode_reward_max: 10.0
  episode_reward_mean: 9.7
  episode_reward_min: -10.0
  episodes_this_iter: 100
  episodes_total: 100
  experiment_id: 256312a17af2442bb2a44add4858c552
  experiment_tag: '0'
  grad_time_ms: 0.238
  hostname: DWAnyscaleMBP.local
  info:
    grad_time_ms: 0.238
    learner:
      cumulative_regret: 30.0
      update_latency: 0.000125885009765625
    num_steps_sampled: 100
    num_steps_trained: 100
  

Trial name,status,loc,iter,total time (s),ts,reward
contrib_LinUCB_SimpleContextBandit_00000,TERMINATED,,2,0.193764,200,10


The trials took 4.615687847137451 seconds



Let's look at the analysis dataframe:

In [32]:
analysis2.dataframe()

Unnamed: 0,episode_reward_max,episode_reward_min,episode_reward_mean,episode_len_mean,episodes_this_iter,num_steps_trained,num_steps_sampled,sample_time_ms,grad_time_ms,update_time_ms,...,info/sample_peak_throughput,info/opt_samples,learner/cumulative_regret,learner/update_latency,perf/cpu_util_percent,perf/ram_util_percent,info/learner/cumulative_regret,info/learner/update_latency,config/env,logdir
0,10.0,10.0,10.0,1.0,100,200,200,0.641,0.212,0.001,...,1560.904,1.0,30.0,0.000124,,,30.0,0.000124,<class '__main__.SimpleContextBandit'>,/Users/deanwampler/ray_results/contrib/LinUCB/...


It's similar to what we saw before. Switch to your TensorBoard window and look at the plots for this run. The name will be similar to the following:

```
contrib/LinUCB/contrib_LinUCB_SimpleContextBandit_0_YYYY-MM-DD_HH-MM-SSxxxxxxxx  
```

Toggle this one on by itself, then also turn on (check) a run for `SimpleBandit`. The plots for `tune/episode_reward_mean` are probably identical for both, while this won't be true for some of the other graphs.
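
One plausible reason the reward curves match: both environments return `done = True` from every `step`, which is why `episode_len_mean` is `1.0` in both runs. A conventional rollout loop therefore calls `reset` before every step, and `reset` draws a fresh context in both classes, so the extra draw in `SimpleContextBandit.step` is overwritten before it is ever used. The sketch below checks this with Gym-free stand-ins. `FixedContextEnv`, `ChangingContextEnv`, and `count_resets` are hypothetical names, the observation is simplified to the bare context value, and the loop is an assumption about how rollouts are driven rather than RLlib's sampler.

```python
import random

REWARDS_FOR_CONTEXT = {-1.0: [-10, 0, 10], 1.0: [10, 0, -10]}  # from SimpleBandit.step


class FixedContextEnv:
    """Gym-free stand-in for SimpleBandit (hypothetical name)."""

    def reset(self):
        self.current_context = random.choice([-1.0, 1.0])
        return self.current_context

    def step(self, action):
        reward = REWARDS_FOR_CONTEXT[self.current_context][action]
        return self.current_context, reward, True  # done is always True


class ChangingContextEnv(FixedContextEnv):
    """Gym-free stand-in for SimpleContextBandit (hypothetical name)."""

    def step(self, action):
        result = super().step(action)
        self.current_context = random.choice([-1.0, 1.0])  # replaced by the next reset()
        return result


def count_resets(env, num_steps):
    """Run a conventional rollout loop and count how often reset() is called."""
    resets = 0
    done = True  # forces a reset before the first step
    for _ in range(num_steps):
        if done:
            env.reset()
            resets += 1
        _, _, done = env.step(random.randrange(3))
    return resets


for env in (FixedContextEnv(), ChangingContextEnv()):
    print(type(env).__name__, "resets in 100 steps:", count_resets(env, 100))
```

Both stand-ins need one reset per step, so under this loop the learner sees the same distribution of contexts and rewards from either environment. To observe a real difference, `step` would have to return `done = False` for at least some steps.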