# Exercise 05. (Take-home) Advanced Topic: Adding an in-game Recommender using RLlib

© 2019-2022, Anyscale. All Rights Reserved <br>
📖 [Back to Table of Contents](./ex_00_rllib_notebooks_table_of_contents.ipynb)<br>
➡️ [Next notebook](./ex_06_rllib_end_to_end_demo.ipynb) <br>
⬅️ [Previous notebook](./ex_04_offline_rl_with_rllib.ipynb) <br>

### Learning objectives
In this tutorial, you will learn how to:

 * [Create a RecSys RL environment](#recsys_env)
 * [Train a Contextual Bandit on the environment](#cb)
 * [Train using an online RL algorithm on the environment](#online)
 * [Train using an offline RL algorithm on the environment](#offline)
 
 
 Along the way, you will find RLlib algorithms to train policy models on environments.
 

## Create a RecSys RL Environment <a class="anchor" id="recsys_env"></a>

Recommender Systems (RecSys) suggest items, such as videos, products, or articles, to users based on their context and past behavior.

The first step in training an RL RecSys policy model is to create a live environment. Contextual bandits (or RL algorithms) interact with this environment to train a recommender agent through the RL simulation feedback loop ([discussed in notebook 01](./ex_01_intro_gym_and_rllib.ipynb)).

<a href="https://github.com/google-research/recsim">Google's RecSim environment</a> was developed for the YouTube recommendation problem. The environment is <i>time-limit-based</i>: an episode terminates after a fixed number (60) of videos have been watched.
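As a rough illustration of a time-limit-based episode, the sketch below uses a hypothetical toy environment (`ToyTimeLimitEnv` is made up for this example and is not part of RecSim or RLlib). It follows the classic `reset()`/`step()` convention and sets `done` after a fixed number of steps. The observations and rewards are random placeholders, not RecSim's user model.

```python
import random


class ToyTimeLimitEnv:
    """Hypothetical stand-in env: an episode ends after `max_steps` steps."""

    def __init__(self, max_steps=60, num_candidates=10, seed=0):
        self.max_steps = max_steps
        self.num_candidates = num_candidates
        self._rng = random.Random(seed)
        self._t = 0

    def reset(self):
        # Start a new episode; the observation is one feature per candidate doc.
        self._t = 0
        return [self._rng.random() for _ in range(self.num_candidates)]

    def step(self, action):
        # Returns (next_obs, reward, done, info), like the classic gym API.
        assert 0 <= action < self.num_candidates
        self._t += 1
        obs = [self._rng.random() for _ in range(self.num_candidates)]
        reward = self._rng.random()  # placeholder reward, not RecSim's model
        done = self._t >= self.max_steps
        return obs, reward, done, {}


def run_episode(env):
    """Plays one episode with random actions and returns the number of steps."""
    env.reset()
    steps, done = 0, False
    while not done:
        action = random.randrange(env.num_candidates)
        _, _, done, _ = env.step(action)
        steps += 1
    return steps


episode_length = run_episode(ToyTimeLimitEnv(max_steps=60))
print(episode_length)
```

Whatever actions the agent picks, the episode length equals the time limit, so the agent can only influence the reward collected within that budget.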

<img src="./images/recsim_environment.png" width="90%" />

The RecSim environment consists of:

* <b>Document Model</b>: each document has a feature value in the range [0, 1].  
<ul>
    <li>On the 0-end of the scale, <b>"sweet"</b> documents are <b>"click bait"</b> that leads to large amounts of immediate engagement. Sweetness values are drawn from a log-normal distribution, ln Normal(μ<sub>sweet</sub>, σ<sub>sweet</sub>).</li>
    <li>On the 1-end of the scale, documents termed <b>"kale"</b> are less click-bait, but tend to <b>increase long-term user satisfaction</b>. Kale values are drawn from a log-normal distribution, ln Normal(μ<sub>kale</sub>, σ<sub>kale</sub>).</li>
    <li>Mixed doc values are drawn from linear interpolation between parameters of the two distributions in proportion to their kaleness.</li>
    </ul>
* <b>User Model</b>, simulated as having: 
<ul>
    <li><i>evolving, unknown contexts</i> (interests, preferences, satisfaction, activity, mood)</li>
    <li><i>unobservable events</i> that could impact user behavior (personalized promotions, or interruptions that cause the user to turn off a video, such as someone ringing their doorbell)</li>
    </ul>
* <b>Rewards</b>, or user satisfaction after their choice, modeled in the range [0, 1]. Satisfaction stochastically (and slowly) increases or decreases with the consumption of different types of content (kale or sweet).  
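The linear interpolation for mixed documents can be sketched as follows. This is a minimal illustration, not RecSim's implementation: the function names and the parameter values (`MU_SWEET`, `SIGMA_SWEET`, `MU_KALE`, `SIGMA_KALE`) are assumptions chosen for the example, and only the idea of interpolating log-normal parameters in proportion to kaleness comes from the description above.

```python
import random

# Illustrative parameters only; these are assumptions, not RecSim's defaults.
MU_SWEET, SIGMA_SWEET = 1.5, 0.5
MU_KALE, SIGMA_KALE = 1.0, 0.5


def engagement_params(kaleness):
    """Linearly interpolates (mu, sigma) between sweet (0.0) and kale (1.0)."""
    if not 0.0 <= kaleness <= 1.0:
        raise ValueError("kaleness must lie in [0, 1]")
    mu = (1.0 - kaleness) * MU_SWEET + kaleness * MU_KALE
    sigma = (1.0 - kaleness) * SIGMA_SWEET + kaleness * SIGMA_KALE
    return mu, sigma


def sample_engagement(kaleness, rng):
    """Draws one engagement value from a log-normal distribution."""
    mu, sigma = engagement_params(kaleness)
    return rng.lognormvariate(mu, sigma)


rng = random.Random(42)
for k in (0.0, 0.5, 1.0):
    mu, sigma = engagement_params(k)
    sample = sample_engagement(k, rng)
    print(f"kaleness={k:.1f}: mu={mu:.2f}, sigma={sigma:.2f}, sample={sample:.2f}")
```

With these made-up values, a fully sweet document has the larger `mu` and therefore tends to produce more immediate engagement than a fully kale one.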


<b>RLlib comes with 3 RecSim environments</b>  <br>
<div class="alert alert-block alert-success">    
👉 - <b>Long Term Satisfaction</b> <- used in this tutorial <br>
- Interest Evolution <br>
- Interest Exploration <br>
</div>

<br>

In [13]:
# import libraries
import numpy as np
from scipy.stats import linregress, sem
import ray
from ray import tune
from ray.tune.logger import pretty_print
print(f"ray: {ray.__version__}")

# silence the many tensorflow warnings
import logging, os
logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
import gym
import recsim

# Import the built-in RecSim example environment: "Long Term Satisfaction", ready to be trained by RLlib.
from ray.rllib.examples.env.recommender_system_envs_with_recsim import LongTermSatisfactionRecSimEnv


ray: 3.0.0.dev0


In [2]:
# Create a RecSim instance using the following config parameters 
lts_10_1_env = LongTermSatisfactionRecSimEnv({
    "num_candidates": 10,  # The number of possible documents/videos/candidates that we can recommend
    "slate_size": 1, # The number of recommendations that we will be making
    # Set to False to reuse the same candidate documents at each timestep.
    "resample_documents": False,
    # Convert MultiDiscrete actions to Discrete (flatten action space).
    # e.g. slate_size=2 and num_candidates=10 -> MultiDiscrete([10, 10]) -> Discrete(100)  # 10x10
    "convert_to_discrete_action_space": True,
})

# # What are our spaces? (`pretty_print` expects a dict, so use plain `print` for strings.)
# print(f"observation space = {lts_10_1_env.observation_space}")
# print(f"action space = {lts_10_1_env.action_space}")
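The flattening mentioned in the config comments (for example, `slate_size=2` and `num_candidates=10` giving `Discrete(100)`) can be sketched in plain Python. The helper names below are hypothetical, and the row-major ordering is an assumption made for illustration; the ordering RLlib uses internally may differ.

```python
def flatten_slate(slate, num_candidates):
    """Maps a slate (tuple of doc indices) to a single integer action."""
    action = 0
    for doc_index in slate:
        if not 0 <= doc_index < num_candidates:
            raise ValueError("doc index out of range")
        action = action * num_candidates + doc_index
    return action


def unflatten_action(action, num_candidates, slate_size):
    """Inverse of `flatten_slate`: recovers the slate from the integer."""
    slate = []
    for _ in range(slate_size):
        action, doc_index = divmod(action, num_candidates)
        slate.append(doc_index)
    return tuple(reversed(slate))


print(flatten_slate((3, 7), num_candidates=10))
print(unflatten_action(37, num_candidates=10, slate_size=2))
```

The flat action space has `num_candidates ** slate_size` entries, which is why a slate size of 2 with 20 candidates yields 400 discrete actions.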




In [4]:
# Start a new episode and look at initial observation.
obs = lts_10_1_env.reset()
pretty_print(obs)

"doc:\n  '0':\n  - 0.54881352186203\n  '1':\n  - 0.7151893377304077\n  '2':\n  - 0.6027633547782898\n  '3':\n  - 0.5448831915855408\n  '4':\n  - 0.42365479469299316\n  '5':\n  - 0.6458941102027893\n  '6':\n  - 0.4375872015953064\n  '7':\n  - 0.891772985458374\n  '8':\n  - 0.9636627435684204\n  '9':\n  - 0.3834415078163147\nresponse:\n- click: 0\n  engagement: 43.58918380737305\nuser: []\n"

In [6]:
# Let's send our first action (a 1-slate) back into the env using the env's `step()` method.
action = 3  # Discrete(10): 0-9 are all valid actions

# This method returns 4 items:
# - next observation (after having applied the action)
# - reward (after having applied the action)
# - `done` flag; if True, the episode is terminated and the environment needs to be `reset()` again.
# - info dict (we'll ignore this)
next_obs, reward, done, _ = lts_10_1_env.step(action)

# Print out the next observation.
# We expect the "doc" and "user" items to be the same as in the previous observation
# b/c we set "resample_documents" to False.
pretty_print(next_obs)
# Print out the reward and the value of the `done` flag.
print(f"reward = {reward:.2f}; done = {done}")

reward = 48.11; done = False


In [9]:
# Modifying wrapper around the LTS (Long Term Satisfaction) env:
# - allows us to tweak the user model (and thus: reward behavior)
# - adds user's current satisfaction value to observation

class LTSWithStrongerDissatisfactionEffect(gym.ObservationWrapper):

    def __init__(self, env):
        # Tweak incoming environment.
        env.environment._user_model._user_sampler._state_parameters.update({
            "sensitivity": 0.058,
            "time_budget": 120,
            "choc_stddev": 0.1,
            "kale_stddev": 0.1,
            #"innovation_stddev": 0.01,
            #"choc_mean": 1.25,
            #"kale_mean": 1.0,
            #"memory_discount": 0.9,
        })

        super().__init__(env)

        # Adjust observation space.
        if "response" in self.observation_space.spaces:
            self.observation_space.spaces["user"] = gym.spaces.Box(0.0, 1.0, (1, ), dtype=np.float32)
            for r in self.observation_space["response"]:
                if "engagement" in r.spaces:
                    r.spaces["watch_time"] = r.spaces["engagement"]
                    del r.spaces["engagement"]
                    break

    def observation(self, observation):
        if "response" in self.observation_space.spaces:
            observation["user"] = np.array([self.env.environment._user_model._user_state.satisfaction])
            for r in observation["response"]:
                if "engagement" in r:
                    r["watch_time"] = r["engagement"]
                    del r["engagement"]
        return observation


# Register the wrapped env under a string name, so that RLlib configs can refer to it.
tune.register_env("modified_lts", lambda env_config: LTSWithStrongerDissatisfactionEffect(LongTermSatisfactionRecSimEnv(env_config)))

print("ok; registered the string 'modified_lts' to be used in RLlib configs (see below)")


ok; registered the string 'modified_lts' to be used in RLlib configs (see below)
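To make the effect of the wrapper's `observation()` method easier to see, here is a dependency-free sketch of the same transformation on a plain dict. The function name `transform_observation` and the sample values are made up for illustration; the real wrapper reads the satisfaction value from the user model rather than taking it as an argument.

```python
import copy


def transform_observation(observation, satisfaction):
    """Returns a modified copy of a RecSim-style observation dict."""
    obs = copy.deepcopy(observation)  # leave the caller's dict untouched
    # Expose the (normally hidden) user satisfaction in the "user" slot.
    obs["user"] = [satisfaction]
    # Rename "engagement" to "watch_time" in every response entry.
    for response in obs.get("response") or []:
        if "engagement" in response:
            response["watch_time"] = response.pop("engagement")
    return obs


raw_obs = {
    "doc": {"0": [0.55], "1": [0.72]},
    "response": [{"click": 0, "engagement": 43.6}],
    "user": [],
}
new_obs = transform_observation(raw_obs, satisfaction=0.8)
print(new_obs["user"], new_obs["response"])
```

Unlike this sketch, the wrapper above modifies the observation in place, which is fine there because the env produces a fresh observation on every step.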


In [14]:
# This cell helps you test two suspicions about the env's reward behavior:
# always choosing the highest-valued (sweetest) action decreases rewards over the course of an episode,
# while always choosing the lowest-valued (kaleiest) action increases them.
modified_lts_10_1_env = LTSWithStrongerDissatisfactionEffect(lts_10_1_env)

# Capture slopes of all trendlines over all episodes.
slopes = []
# Run 1000 episodes.
for _ in range(1000):
    obs = modified_lts_10_1_env.reset()  # Reset environment to get initial observation:

    # Compute actions that pick doc with highest/lowest feature value.
    action_sweetest = np.argmax([value for _, value in obs["doc"].items()])
    action_kaleiest = np.argmin([value for _, value in obs["doc"].items()])

    # Play one episode.
    done = False
    rewards = []
    while not done:
        #action = action_sweetest
        action = action_kaleiest
        #action = np.random.choice([action_kaleiest, action_sweetest])

        obs, reward, done, _ = modified_lts_10_1_env.step(action)
        rewards.append(reward)

    # Create linear model of rewards over time.
    reward_linreg = linregress(np.arange(len(rewards)), np.array(rewards))
    slopes.append(reward_linreg.slope)

print(np.mean(slopes))

0.000675220388627034
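The slope that `linregress` reports is the ordinary least-squares trend of reward against timestep. A minimal standard-library sketch of that calculation is shown below; `least_squares_slope` is a hypothetical helper written for this illustration, and the reward lists are made-up numbers.

```python
def least_squares_slope(values):
    """Slope of the best-fit line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance_x = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance_x


rising_rewards = [1.0, 3.0, 5.0, 7.0]
falling_rewards = [9.0, 8.0, 7.5, 6.0]
print(least_squares_slope(rising_rewards))
print(least_squares_slope(falling_rewards))
```

A positive slope means rewards trend upward within an episode, and a negative slope means they trend downward. Averaging the slopes over many episodes, as the cell above does, reduces the noise of any single episode.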


In [17]:
# Send an action (a 1-slate) into the modified env using the env's `step()` method.
action = 4  # Discrete(10): 0-9 are all valid actions

# This method returns 4 items:
# - next observation (after having applied the action)
# - reward (after having applied the action)
# - `done` flag; if True, the episode is terminated and the environment needs to be `reset()` again.
# - info dict (we'll ignore this)
next_obs, reward, done, _ = modified_lts_10_1_env.step(action)

# Print out the next observation.
# We expect the "doc" items to be the same as in the previous observation
# b/c we set "resample_documents" to False. The "user" item now holds the satisfaction value added by the wrapper.
pretty_print(next_obs)
# Print out the reward and the value of the `done` flag.
print(f"reward = {reward:.2f}; done = {done}")


reward = 8.32; done = True


In [18]:
# Function that measures and outputs the random baseline reward.
# This is the expected accumulated reward per episode, if we act randomly (recommend random items) at each time step.
def measure_random_performance_for_env(env, episodes=1000, verbose=False):

    # Reset the env.
    env.reset()

    # Number of episodes already done.
    num_episodes = 0
    # Current episode's accumulated reward.
    episode_reward = 0.0
    # Collect all episode rewards here to be able to calculate a random baseline reward.
    episode_rewards = []

    # Enter while loop (to step through the episodes).
    while num_episodes < episodes:
        # Produce a random action.
        action = env.action_space.sample()

        # Send the action to the env's `step()` method to receive: obs, reward, done, and info.
        obs, reward, done, _ = env.step(action)
        episode_reward += reward

        # Check whether the episode is done. If yes, reset the env and increase the episode counter.
        if done:
            if verbose:
                print(f"Episode done - accumulated reward={episode_reward}")
            elif num_episodes % 100 == 0:
                print(f" {num_episodes} ", end="")
            elif num_episodes % 10 == 0:
                print(".", end="")
            num_episodes += 1
            env.reset()
            episode_rewards.append(episode_reward)
            episode_reward = 0.0

    # Print out and return mean episode reward (and standard error of the mean).
    env_mean_random_reward = np.mean(episode_rewards)

    print(f"\n\nMean episode reward when acting randomly: {env_mean_random_reward:.2f}+/-{sem(episode_rewards):.2f}")

    return env_mean_random_reward, sem(episode_rewards)


In [19]:
# Let's create a somewhat tougher version of this with 20 candidates (instead of 10) and a slate-size of 2.
# We'll also keep using our wrapper from above to strengthen the dissatisfaction effect on the engagement:
lts_20_2_env = LTSWithStrongerDissatisfactionEffect(LongTermSatisfactionRecSimEnv(config={
    "num_candidates": 20,
    "slate_size": 2,  # MultiDiscrete([20, 20]) -> Discrete(400)
    "resample_documents": True,
    # Convert to Discrete action space.
    "convert_to_discrete_action_space": True,
    # Wrap observations for RLlib bandit: Only changes dict keys ("item" instead of "doc").
    "wrap_for_bandits": True,
}))

lts_20_2_env_mean_random_reward, _ = \
    measure_random_performance_for_env(lts_20_2_env, episodes=1000)


 0 ......... 100 ......... 200 ......... 300 ......... 400 ......... 500 ......... 600 ......... 700 ......... 800 ......... 900 .........

Mean episode reward when acting randomly: 1157.88+/-0.36
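The `+/-` figure in this output is the standard error of the mean (SEM). As a sketch, it can be computed with the standard library as the sample standard deviation divided by the square root of the number of episodes, which matches `scipy.stats.sem` with its default `ddof=1`. The helper name and the episode returns below are made up for illustration.

```python
import math
import statistics


def standard_error_of_mean(samples):
    """Sample standard deviation (n - 1 denominator) divided by sqrt(n)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return statistics.stdev(samples) / math.sqrt(len(samples))


episode_returns = [1150.0, 1162.0, 1158.0, 1149.0, 1171.0]  # made-up values
mean_return = statistics.mean(episode_returns)
sem_value = standard_error_of_mean(episode_returns)
print(f"{mean_return:.2f}+/-{sem_value:.2f}")
```

Because the SEM shrinks with the square root of the sample size, running more episodes tightens the baseline estimate, which helps when judging whether a trained policy really beats random behavior.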

