# Exercise 05. (Take-home) Advanced Topic: Adding an in-game Recommender using RLlib

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)<br>
➡️ [Next notebook](./ex_06_rllib_end_to_end_demo.ipynb) <br>
⬅️ [Previous notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>

### Learning objectives
This tutorial covers:

 * [Intro RecSys with RL](#recsys_rl)
 * [Create a RecSys RL environment](#recsys_env)
 * [Train a Contextual Bandit on the environment](#cb)
 * [Train using an RL Online algorithm on the environment](#online)
 * [Train using an RL Offline algorithm on the environment](#offline)

Along the way, you will see how to find RLlib algorithms for training policy models on environments.
 

## Intro RecSys with RL <a class="anchor" id="recsys_rl"></a>

A Recommender System <b>(RecSys)</b> suggests items that are most pertinent to a particular user.  Examples of recommender systems include:
<ul>
    <li>Video recommendations (e.g. YouTube, Netflix)</li>
    <li>Online shopping recommendations (e.g. Amazon)</li>
    <li>Advertisements on a website</li>
</ul>

<b>Two main approaches to training algorithms</b> for RecSys are: 
<ol>
    <li>Traditional Machine Learning <b>(ML)</b></li>
    <li>Reinforcement Learning <b>(RL)</b></li>
    </ol>

<b>In traditional ML</b>, data is gathered about users and products (features or X's), and the views or actions by users of those products (dependent variable or y's). A ranking algorithm is trained on all the data at once as if all the actions occurred in one time step (e.g. collaborative filtering).  Such a <b><i>static</i> model</b> is useful when there are millions of items and users, since learning from all data at once is efficient.

<b>In RL</b>, users interact with offers repeatedly over time.  At each iteration, we recommend items to a user, observe the user's behaviour, and receive a reward based on the user's actions.  The <b><i>dynamic</i> model</b> is trained iteratively on the latest observation of recommendation, action, and reward.  One caveat: because RL must compute a recommendation at every time step, only a pre-selected handful of top candidate items per user (from the traditional ML ranking model) is presented in the simulation environment.
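
The recommend, observe, reward, update loop described above can be sketched in a few lines of plain Python. This is only a toy illustration, not RLlib code: the three candidate items, their click probabilities, and the epsilon-greedy update rule are all assumptions made up for this example.

```python
import random


def run_interaction_loop(num_steps=2000, epsilon=0.1, seed=0):
    """Recommend one of a few candidate items per step and learn from clicks."""
    rng = random.Random(seed)
    # Hypothetical click probabilities for 3 pre-selected candidate items.
    true_click_prob = [0.2, 0.5, 0.8]
    value_estimates = [0.0] * len(true_click_prob)
    counts = [0] * len(true_click_prob)

    for _ in range(num_steps):
        # 1) Recommend: explore with probability epsilon, otherwise exploit.
        if rng.random() < epsilon:
            item = rng.randrange(len(true_click_prob))
        else:
            item = max(range(len(value_estimates)), key=value_estimates.__getitem__)
        # 2) Observe the user's behaviour and 3) receive a reward (1 = click).
        reward = 1.0 if rng.random() < true_click_prob[item] else 0.0
        # 4) Update the model from the latest (recommendation, reward) pair.
        counts[item] += 1
        value_estimates[item] += (reward - value_estimates[item]) / counts[item]

    return value_estimates, counts


estimates, counts = run_interaction_loop()
print("estimated click rates:", [round(e, 2) for e in estimates])
print("times recommended:", counts)
```

The model is updated after every single interaction, which is what makes it "dynamic" compared with a ranking model fitted once on all data.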

<b>Offline RL is particularly relevant in a RecSys context.</b>

<div class="alert alert-block alert-success">    
    <b>💡 Online vs. offline RL: where the algorithm's training data comes from</b> <br><br>
    ✔️ When the algorithm learns from a live environment (typically gaming platforms or complex systems simulations), this is called <b>online RL</b>, and evaluation during training is <b>on-policy</b>. <br><br>
    ✔️ When the algorithm learns from data gathered in log files (for RecSys: logs of user offers and actions), this is called <b>offline RL</b>, and evaluation during training is <b>off-policy</b>, because the policy (the RL word for model) used to log the data is different from the policy being learned from the data.
</div>
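
To make "off-policy" evaluation more concrete, here is a minimal sketch of one common estimator, inverse propensity scoring (IPS). It is not taken from this notebook or from RLlib's implementation; the two-item setup, the uniform logging policy, and all numbers are assumptions for illustration. The idea is to re-weight each logged reward by how much more (or less) likely the new policy is to take the logged action than the logging policy was.

```python
import random


def ips_estimate(logged, target_probs):
    """Estimate the target policy's mean reward from logged (action, reward, prob) tuples."""
    total = 0.0
    for action, reward, logging_prob in logged:
        # Importance weight: target policy probability / logging policy probability.
        total += reward * target_probs[action] / logging_prob
    return total / len(logged)


rng = random.Random(0)
true_reward = [0.2, 0.8]       # hypothetical click probabilities of two items
logging_probs = [0.5, 0.5]     # the logging (production) policy recommends uniformly

# Build a synthetic log file of (action, reward, logging probability).
logged = []
for _ in range(20000):
    action = 0 if rng.random() < logging_probs[0] else 1
    reward = 1.0 if rng.random() < true_reward[action] else 0.0
    logged.append((action, reward, logging_probs[action]))

# A new policy that prefers item 1; its true value is 0.1 * 0.2 + 0.9 * 0.8 = 0.74.
target_probs = [0.1, 0.9]
estimate = ips_estimate(logged, target_probs)
print(f"IPS estimate of the new policy's mean reward: {estimate:.3f}")
```

The new policy is never run against real users here; its value is estimated purely from the logs of the old one.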

Through the log files of historic user offers and user actions, offline RL in a RecSys context implicitly explores the last Recommender model put into production.  “Serendipitous” aspects of user experience can be explored through offline RL, since random actions the user did not historically take can be tried in the simulation.

This additional offline RL step after logging is an important part of modern Recommender Systems, to ensure current models do not propagate errors or bias.


<br>

## Create a RecSys RL Environment <a class="anchor" id="recsys_env"></a>

As we learned in the first two lessons, the first step in training an RL RecSys policy model is to create a live <b>environment</b> that can interact with an RL algorithm to train a recommender agent. 

In this notebook, we will use <b><a href="https://github.com/google-research/recsim">Google's RecSim environment</a></b>, which was developed for the YouTube recommendation problem.  The environment is <i>Timelimit-based</i>, meaning that an episode terminates after a fixed number (60) of videos have been watched. The RecSim environment consists of:

<img src="./images/recsim_environment.png" width="90%" />

* <b>Document Model</b>, in the range [0, 1].  
<ul>
    <li>On the 0-end of the scale, <b>"sweet"</b> documents lead to large amounts of <b>"click bait"</b> or immediate engagement. Sweetness values are drawn from ln Normal(μsweet, σsweet).</li>
    <li>On the 1-end of the scale, documents termed <b>kale</b>, are less click-bait, but tend to <b>increase user long-term satisfaction</b>. Kale values are drawn from ln Normal(μkale, σkale)</li>
    <li>Mixed doc values are drawn from linear interpolation between parameters of the two distributions in proportion to their kaleness.</li>
    </ul>
* <b>User Model</b>, simulated as having: 
<ul>
    <li><i>evolving, unknown contexts</i> (interests, preferences, satisfaction, activity, mood)</li>
    <li><i>unobservable events</i> that could impact user behavior (personalized promotions, or interruptions that cause the user to turn off a video, such as someone ringing their doorbell)</li>
    </ul>
* <b>Rewards</b>, or user satisfaction after their choice, modeled in the range [0, 1]. Satisfaction stochastically (and slowly) increases or decreases with the consumption of different types of content (kale or sweet).  
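
The trade-off between immediate engagement and long-term satisfaction can be illustrated with a deliberately simplified, deterministic toy model. This is not RecSim's actual implementation: the update rule, the parameter names (`sensitivity`, `memory_discount`), and all constants below are assumptions chosen only to show the qualitative effect. Following the scale described above, a kaleness of 0 means a fully "sweet" document and 1 a fully "kale" document.

```python
import math


def simulate_episode(kaleness, steps=60, sensitivity=0.1, memory_discount=0.9):
    """Toy model: always recommend a document with the given kaleness in [0, 1]."""
    net_kale_exposure = 0.0
    satisfaction = 0.5
    total_engagement = 0.0
    for _ in range(steps):
        # Sweet documents (kaleness near 0) offer more immediate engagement:
        # 2.0 for a pure sweet document, 1.0 for a pure kale document.
        base_engagement = 1.0 + (1.0 - kaleness)
        # Realized engagement is scaled by the user's current satisfaction.
        total_engagement += satisfaction * base_engagement
        # Kale slowly raises, and sweets slowly lower, the accumulated exposure.
        net_kale_exposure = memory_discount * net_kale_exposure + 2.0 * (kaleness - 0.5)
        # Squash the exposure into a satisfaction value in (0, 1).
        satisfaction = 1.0 / (1.0 + math.exp(-sensitivity * net_kale_exposure))
    return total_engagement, satisfaction


sweet_engagement, sweet_satisfaction = simulate_episode(kaleness=0.0)
kale_engagement, kale_satisfaction = simulate_episode(kaleness=1.0)
print(f"always sweet: engagement={sweet_engagement:.1f}, final satisfaction={sweet_satisfaction:.2f}")
print(f"always kale:  engagement={kale_engagement:.1f}, final satisfaction={kale_satisfaction:.2f}")
```

In this toy setting, a policy that only chases immediate engagement erodes satisfaction over the episode, which is the effect an RL agent is expected to learn to avoid.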


<b>RLlib comes with 3 RecSim environments</b>  <br>
<div class="alert alert-block alert-success">    
👉 <b>Long Term Satisfaction</b> (used in this tutorial) <br>
- Interest Evolution <br>
- Interest Exploration <br>
</div>

<br>

In [1]:
# import libraries
import numpy as np
from scipy.stats import linregress, sem
import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

# silence the many tensorflow warnings
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
import gym
import recsim

print(f"tensorflow: {tf.__version__}")
print(f"gym: {gym.__version__}")

# Import the built-in RecSim example environment: "Long Term Satisfaction", ready to be trained by RLlib.
from ray.rllib.examples.env.recommender_system_envs_with_recsim import LongTermSatisfactionRecSimEnv


ray: 3.0.0.dev0
tensorflow: 2.7.0
gym: 0.21.0


In [2]:
# Create a RecSim instance using the following config parameters 
lts_10_1_env = LongTermSatisfactionRecSimEnv({
    "num_candidates": 10,  # The number of possible documents/videos/candidates that we can recommend
    "slate_size": 1, # The number of recommendations that we will be making
    # Set to False for re-using the same candidate documents each timestep.
    "resample_documents": False,
    # Convert MultiDiscrete actions to Discrete (flatten action space).
    # e.g. slate_size=2 and num_candidates=10 -> MultiDiscrete([10, 10]) -> Discrete(100)  # 10x10
    "convert_to_discrete_action_space": True,
})

# # What are our spaces?
# pretty_print(f"observation space = {lts_10_1_env.observation_space}")
# pretty_print(f"action space = {lts_10_1_env.action_space}")
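
The `convert_to_discrete_action_space` comment above says that a slate of 2 out of 10 candidates becomes one of `Discrete(100)` actions. A minimal sketch of such a flattening is shown below. The helper names are hypothetical, and the exact index ordering RLlib uses internally is an assumption here; the point is only that every slate maps to exactly one integer and back.

```python
def slate_to_discrete(slate, num_candidates):
    """Encode a slate (list of document indices) as a single integer."""
    index = 0
    for doc in slate:
        # Treat the slate as digits of a number in base `num_candidates`.
        index = index * num_candidates + doc
    return index


def discrete_to_slate(index, num_candidates, slate_size):
    """Decode a single integer back into a slate of document indices."""
    slate = []
    for _ in range(slate_size):
        slate.append(index % num_candidates)
        index //= num_candidates
    return list(reversed(slate))


flat_action = slate_to_discrete([3, 7], num_candidates=10)
print(flat_action)
print(discrete_to_slate(flat_action, num_candidates=10, slate_size=2))
```

With `slate_size=1`, as in the environment created above, the flattened action is simply the index of the recommended document.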


In [3]:
# Start a new episode and look at initial observation.
obs = lts_10_1_env.reset()
pretty_print(obs)

"doc:\n  '0':\n  - 0.54881352186203\n  '1':\n  - 0.7151893377304077\n  '2':\n  - 0.6027633547782898\n  '3':\n  - 0.5448831915855408\n  '4':\n  - 0.42365479469299316\n  '5':\n  - 0.6458941102027893\n  '6':\n  - 0.4375872015953064\n  '7':\n  - 0.891772985458374\n  '8':\n  - 0.9636627435684204\n  '9':\n  - 0.3834415078163147\nresponse:\n- click: 1\n  engagement: 32.88135528564453\nuser: []\n"

In [4]:
# Let's send our first action (a 1-slate) back into the env using the env's `step()` method.
action = 3  # Discrete(10): 0-9 are all valid actions

# This method returns 4 items:
# - next observation (after having applied the action)
# - reward (after having applied the action)
# - `done` flag; if True, the episode is terminated and the environment needs to be `reset()` again.
# - info dict (we'll ignore this)
next_obs, reward, done, _ = lts_10_1_env.step(action)

# Print out the next observation.
# We expect the "doc" and "user" items to be the same as in the previous observation
# b/c we set "resample_documents" to False.
pretty_print(next_obs)
# Print out the reward and the value of the `done` flag.
print(f"reward = {reward:.2f}; done = {done}")

reward = 1.80; done = False


In [5]:
# Modifying wrapper around the LTS (Long Term Satisfaction) env:
# - allows us to tweak the user model (and thus: reward behavior)
# - adds user's current satisfaction value to observation

class LTSWithStrongerDissatisfactionEffect(gym.ObservationWrapper):

    def __init__(self, env):
        # Tweak incoming environment.
        env.environment._user_model._user_sampler._state_parameters.update({
            "sensitivity": 0.058,
            "time_budget": 120,
            "choc_stddev": 0.1,
            "kale_stddev": 0.1,
            #"innovation_stddev": 0.01,
            #"choc_mean": 1.25,
            #"kale_mean": 1.0,
            #"memory_discount": 0.9,
        })

        super().__init__(env)

        # Adjust observation space.
        if "response" in self.observation_space.spaces:
            self.observation_space.spaces["user"] = gym.spaces.Box(0.0, 1.0, (1, ), dtype=np.float32)
            for r in self.observation_space["response"]:
                if "engagement" in r.spaces:
                    r.spaces["watch_time"] = r.spaces["engagement"]
                    del r.spaces["engagement"]
                    break

    def observation(self, observation):
        if "response" in self.observation_space.spaces:
            observation["user"] = np.array([self.env.environment._user_model._user_state.satisfaction])
            for r in observation["response"]:
                if "engagement" in r:
                    r["watch_time"] = r["engagement"]
                    del r["engagement"]
        return observation


# Register the wrapped env with Tune under the string name "modified-lts".
tune.register_env("modified-lts", lambda env_config: LTSWithStrongerDissatisfactionEffect(LongTermSatisfactionRecSimEnv(env_config)))

print("ok; registered the string 'modified-lts' to be used in RLlib configs (see below)")


ok; registered the string 'modified-lts' to be used in RLlib configs (see below)
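
To see what the wrapper's `observation()` method does without needing RecSim, here is a sketch of the same transformation on a plain dictionary. The function name `transform_observation` and the sample values are hypothetical; unlike the wrapper above, this version copies the observation instead of modifying it in place.

```python
import copy


def transform_observation(observation, satisfaction):
    """Add the user's satisfaction and rename "engagement" to "watch_time"."""
    obs = copy.deepcopy(observation)
    # Expose the (normally hidden) satisfaction value under the "user" key.
    obs["user"] = [satisfaction]
    for response in obs.get("response") or []:
        if "engagement" in response:
            response["watch_time"] = response.pop("engagement")
    return obs


raw_obs = {
    "doc": {"0": [0.55], "1": [0.72]},
    "response": [{"click": 1, "engagement": 32.9}],
    "user": [],
}
print(transform_observation(raw_obs, satisfaction=0.7))
```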


In [6]:
# This cell should help you with your own analysis of the following two "suspicions":
# Always choosing the highest-valued document will lead to a decrease in rewards over the course of an episode,
# while always choosing the lowest-valued document will lead to an increase.
modified_lts_10_1_env = LTSWithStrongerDissatisfactionEffect(lts_10_1_env)

# Capture slopes of all trendlines over all episodes.
slopes = []
# Run 1000 episodes.
for _ in range(1000):
    obs = modified_lts_10_1_env.reset()  # Reset environment to get initial observation:

    # Compute actions that pick doc with highest/lowest feature value.
    action_sweetest = np.argmax([value for _, value in obs["doc"].items()])
    action_kaleiest = np.argmin([value for _, value in obs["doc"].items()])

    # Play one episode.
    done = False
    rewards = []
    while not done:
        #action = action_sweetest
        action = action_kaleiest
        #action = np.random.choice([action_kaleiest, action_sweetest])

        obs, reward, done, _ = modified_lts_10_1_env.step(action)
        rewards.append(reward)

    # Create linear model of rewards over time.
    reward_linreg = linregress(np.array((range(len(rewards)))), np.array(rewards))
    slopes.append(reward_linreg.slope)

print(np.mean(slopes))

KeyboardInterrupt: 
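
The cell above summarizes each episode by the slope of a straight line fitted to its rewards: a positive slope means rewards grew over the episode, a negative slope means they shrank. As a sketch of what that slope is, here is the ordinary least-squares formula written out in plain Python. The function name and the sample reward lists are made up for illustration.

```python
def least_squares_slope(rewards):
    """Slope of the best-fit line through (0, r0), (1, r1), ..., (n-1, r_{n-1})."""
    n = len(rewards)
    mean_x = (n - 1) / 2.0
    mean_y = sum(rewards) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rewards))
    variance_x = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance_x


print(least_squares_slope([1.0, 3.0, 5.0, 7.0]))   # steadily increasing rewards
print(least_squares_slope([9.0, 8.5, 7.0, 6.0]))   # decreasing rewards
```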

In [7]:
# Send an action (a 1-slate) into the modified env using the env's `step()` method.
action = 4  # Discrete(10): 0-9 are all valid actions

# This method returns 4 items:
# - next observation (after having applied the action)
# - reward (after having applied the action)
# - `done` flag; if True, the episode is terminated and the environment needs to be `reset()` again.
# - info dict (we'll ignore this)
next_obs, reward, done, _ = modified_lts_10_1_env.step(action)

# Print out the next observation.
# We expect the "doc" and "user" items to be the same as in the previous observation
# b/c we set "resample_documents" to False.
pretty_print(next_obs)
# Print out the reward and the value of the `done` flag.
print(f"reward = {reward:.2f}; done = {done}")


reward = 9.47; done = True


In [8]:
# Function that measures and outputs the random baseline reward.
# This is the expected accumulated reward per episode, if we act randomly (recommend random items) at each time step.
def measure_random_performance_for_env(env, episodes=1000, verbose=False):

    # Reset the env.
    env.reset()

    # Number of episodes already done.
    num_episodes = 0
    # Current episode's accumulated reward.
    episode_reward = 0.0
    # Collect all episode rewards here to be able to calculate a random baseline reward.
    episode_rewards = []

    # Enter while loop (to step through the episode).
    while num_episodes < episodes:
        # Produce a random action.
        action = env.action_space.sample()

        # Send the action to the env's `step()` method to receive: obs, reward, done, and info.
        obs, reward, done, _ = env.step(action)
        episode_reward += reward

        # Check whether the episode is done. If yes, reset and increase the episode counter.
        if done:
            if verbose:
                print(f"Episode done - accumulated reward={episode_reward}")
            elif num_episodes % 100 == 0:
                print(f" {num_episodes} ", end="")
            elif num_episodes % 10 == 0:
                print(".", end="")
            num_episodes += 1
            env.reset()
            episode_rewards.append(episode_reward)
            episode_reward = 0.0

    # Print out and return mean episode reward (and standard error of the mean).
    env_mean_random_reward = np.mean(episode_rewards)

    print(f"\n\nMean episode reward when acting randomly: {env_mean_random_reward:.2f}+/-{sem(episode_rewards):.2f}")

    return env_mean_random_reward, sem(episode_rewards)
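
The function above reports the baseline as "mean +/- standard error of the mean". As a small sketch of what that second number is, the standard error is the sample standard deviation divided by the square root of the number of episodes. The helper name and the episode rewards below are made up for illustration.

```python
import math
import statistics


def mean_and_sem(episode_rewards):
    """Return the mean and the standard error of the mean of a list of rewards."""
    mean = statistics.fmean(episode_rewards)
    # Sample standard deviation (n - 1 in the denominator), scaled by sqrt(n).
    standard_error = statistics.stdev(episode_rewards) / math.sqrt(len(episode_rewards))
    return mean, standard_error


mean_reward, sem_reward = mean_and_sem([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"Mean episode reward: {mean_reward:.2f}+/-{sem_reward:.2f}")
```

Because the standard error shrinks with the square root of the number of episodes, running four times as many random episodes only halves the uncertainty of the baseline.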


In [9]:
# TODO uncomment later - this takes too much time for now

# # Let's create a somewhat tougher version of this with 20 candidates (instead of 10) and a slate-size of 2.
# # We'll also keep using our wrapper from above to strengthen the dissatisfaction effect on the engagement:
# lts_20_2_env = LTSWithStrongerDissatisfactionEffect(LongTermSatisfactionRecSimEnv(config={
#     "num_candidates": 20,
#     "slate_size": 2,  # MultiDiscrete([20, 20]) -> Discrete(400)
#     "resample_documents": True,
#     # Convert to Discrete action space.
#     "convert_to_discrete_action_space": True,
#     # Wrap observations for RLlib bandit: Only changes dict keys ("item" instead of "doc").
#     "wrap_for_bandits": True,
# }))

# lts_20_2_env_mean_random_reward, _ = \
#     measure_random_performance_for_env(lts_20_2_env, episodes=1000)


## Train a Contextual Bandit on the environment <a class="anchor" id="cb"></a>

A Bandit session is one where we have an opportunity to recommend the user an item and observe their behaviour. We receive a reward if they click.

<ol>
    <li>Open RLlib docs <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html">and navigate to the Algorithms page.</a></li>
    <li>Scroll down and click the link of the algorithm you want to use, e.g. <i><b>Bandits</b></i>.</li>
    <li>On the <a href="https://docs.ray.io/en/master/rllib/rllib-algorithms.html#bandits">algo docs page </a>, click on the link <i><b>Implementation</b></i>.  This will open the <a href="https://github.com/ray-project/ray/blob/master/rllib/algorithms/bandit/bandit.py">algo code file on github</a>.</li>
    <li>Search the github code file for the word <i><b>config</b></i></li>
    <li>Typically the docstring example will show: </li>
    <ol>
        <li>Example code implementing RLlib API, then </li>
        <li>Example code implementing Ray Tune API.</li>
    </ol>
    <li>Scroll down to the config <b>__init__()</b> method.</li>
    <ol>
            <li>Algorithm default hyperparameter values are here.</li>
    </ol>
    </ol>

In [7]:
# Select RLlib Bandit algorithm w/Upper Confidence Bound (UCB) exploration
# and find that algorithm's config class

# config is an object instead of a dictionary since Ray version >= 1.13
from ray.rllib.algorithms.bandit import BanditLinUCBConfig
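
To give a feel for Upper Confidence Bound exploration, here is a minimal sketch of UCB1, a simpler, non-contextual relative of LinUCB. It is not RLlib's implementation and it ignores the document features that LinUCB would use; the arms, their click probabilities, and the function name are assumptions for illustration. Each arm's score is its average reward so far plus a bonus that is large for rarely tried arms, so the algorithm keeps exploring until it is reasonably sure which arm is best.

```python
import math
import random


def run_ucb1(click_probs, num_rounds=5000, seed=0):
    """Play a Bernoulli bandit with UCB1 and return pull counts and mean rewards."""
    rng = random.Random(seed)
    num_arms = len(click_probs)
    counts = [0] * num_arms
    means = [0.0] * num_arms

    for t in range(1, num_rounds + 1):
        if t <= num_arms:
            # Play every arm once so that each has an initial estimate.
            arm = t - 1
        else:
            # Pick the arm with the highest upper confidence bound.
            arm = max(
                range(num_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return counts, means


counts, means = run_ucb1([0.1, 0.5, 0.9])
print("pulls per arm:", counts)
print("estimated click rates:", [round(m, 2) for m in means])
```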

In [8]:
# Environment parameters are passed to the config object through `.environment(env_config=...)`.

BanditLinUCBConfig()\
    .environment(env_config={
        "num_candidates": 20,  # 20x19 = ~400 unique slates (arms)
        "slate_size": 2,
        "resample_documents": True,
        "convert_to_discrete_action_space": True,
        # Convert "doc" key into "item" key.
        "wrap_for_bandits": True,})


# For comparison, the older dictionary-style config (partial):
# bandit_config = {
#     "env": "modified-lts",
#     "env_config": {
#         "num_candidates": 20,  # 20x19 = ~400 unique slates (arms)
#         "slate_size": 2,
#         "resample_documents": True,

#         # Bandit-specific flags:
#         "convert_to_discrete_action_space": True,
#         # Convert "doc" key into "item" key.
#         "wrap_for_bandits": True,
#         # Use consistent seeds for the environment ...
#         "seed": 0,

<ray.rllib.algorithms.bandit.bandit.BanditLinUCBConfig at 0x7f82c03f8310>

In [9]:
# # Uncomment below to see the long list of Bandit default config values.
# print(f"Bandit's default config is:")
# print(pretty_print(BanditLinUCBConfig().to_dict()))

# Choose your config settings and instantiate a config object with those settings.
# Define algorithm config values.
env_name = "modified-lts"
evaluation_interval = 2   # 100, number of training iterations to run between eval steps
evaluation_duration = 20  # 100, number of eval episodes to run for the eval step
num_workers = 4          # +1 for head node, number of parallel workers or actors for rollouts
num_gpus = 0             # number of GPUs to use in the cluster
num_envs_per_worker = 1  # 1, no vectorization of environments to run at the same time

# Define trainer runtime config values.
checkpoint_freq = evaluation_interval  # checkpoint frequency, should be >= evaluation_interval
checkpoint_at_end = True                # always save last checkpoint
relative_checkpoint_dir = "my_LinUCB_logs" # redirect logs instead of ~/ray_results/
random_seed = 415
# Set the log level to DEBUG, INFO, WARN, or ERROR.
log_level = "ERROR"

# Create a new training config
# override certain default algorithm config values
bandit_config = (
    BanditLinUCBConfig()
    .framework(framework='torch')
    # .environment(env=env_name, disable_env_checking=False)
    .environment(
        env=env_name, 
        env_config={
            "num_candidates": 20,  # 20x19 = ~400 unique slates (arms)
            "slate_size": 2,
            "resample_documents": True,
            "convert_to_discrete_action_space": True,
            # Convert "doc" key into "item" key.
            "wrap_for_bandits": True,})
    .rollouts(num_rollout_workers=num_workers, num_envs_per_worker=num_envs_per_worker)
    .resources(num_gpus=num_gpus, )
#     .training(gamma=0.9, lr=0.01, kl_coeff=0.3)  # do not override defaults
    .evaluation(evaluation_interval=evaluation_interval, 
                evaluation_duration=evaluation_duration)
    .debugging(seed=random_seed, log_level=log_level)
)

print(type(bandit_config))


<class 'ray.rllib.algorithms.bandit.bandit.BanditLinUCBConfig'>


In [11]:
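# To start fresh, restart Ray in case it is already running.
# Note: the "modified-lts" env must be registered in the current session
# (see the `tune.register_env()` cell above). If you restarted the kernel or Ray,
# re-run that cell first; otherwise `build()` may fail with an EnvError.
if ray.is_initialized():
    ray.shutdown()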

# Use the config object's `build()` method for generating
# an RLlib Algorithm instance that we can then train.
linucb_algo = bandit_config.build()
print(f"Algorithm type: {type(linucb_algo)}")

# train the Bandit Algorithm instance
for i in range(300):
    # Call its `train()` method
    result = linucb_algo.train()
    print(f"Iteration={i}, Mean Reward={result['episode_reward_mean']}")

# To stop the Algorithm and release its blocked resources, use:
linucb_algo.stop()
print()


[2m[36m(RolloutWorker pid=5450)[0m 2022-08-01 22:18:24,018	ERROR worker.py:754 -- Exception raised in creation task: The actor died because of an error raised in its creation task, [36mray::RolloutWorker.__init__()[39m (pid=5450, ip=127.0.0.1, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x7f8119541fd0>)
[2m[36m(RolloutWorker pid=5450)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.9/site-packages/gym/envs/registration.py", line 235, in make
[2m[36m(RolloutWorker pid=5450)[0m     return registry.make(id, **kwargs)
[2m[36m(RolloutWorker pid=5450)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.9/site-packages/gym/envs/registration.py", line 128, in make
[2m[36m(RolloutWorker pid=5450)[0m     spec = self.spec(path)
[2m[36m(RolloutWorker pid=5450)[0m   File "/Users/sven/opt/anaconda3/envs/rllib_tutorial/lib/python3.9/site-packages/gym/envs/registration.py", line 151, in spec
[2m[36m(RolloutWorker pid=545

EnvError: The env string you provided ('modified-lts') is:
a) Not a supported/installed environment.
b) Not a tune-registered environment creator.
c) Not a valid env class string.

Try one of the following:
a) For Atari support: `pip install gym[atari] autorom[accept-rom-license]`.
   For VizDoom support: Install VizDoom
   (https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
   `pip install vizdoomgym`.
   For PyBullet support: `pip install pybullet`.
b) To register your custom env, do `from ray import tune;
   tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
   Then in your config, do `config['env'] = [name]`.
c) Make sure you provide a fully qualified classpath, e.g.:
   `ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`


### Exercises

1. 

### Homework

1. 

### References

* 
