© 2019-2022, Anyscale. All Rights Reserved

![Anyscale Academy](../../images/AnyscaleAcademyLogo.png)

# Tutorial Notebook for ODSC West 2022

### Learning objectives
In this tutorial, you will learn about:
 * [Defining an MDP for a recommendation system](#recsys) - 10 min
 * [Offline Bandit](#offline-bandit) - 10 min
 * [Offline RL](#offline-rl) - 10 min

[Link to slides](https://github.com/anyscale/academy/blob/main/ray-rllib/odsc_west_workshop_2022/slides.pdf)

In [None]:
# uncomment for running on anyscale
#import os; os.environ["PATH"] = "/home/ray/anaconda3/bin:" + os.environ["PATH"]
#!pip uninstall -y torch
#!python -m pip install torch==1.12.1
#!python -m pip install seaborn
#import torch
#print(f"torch: {torch.__version__}")

#!pip uninstall -y matplotlib
#!python3 -m pip install matplotlib==3.5.3
#import matplotlib
#print(f"matplotlib: {matplotlib.__version__}")

In [None]:
import time
import random
import numpy as np

import pickle
import seaborn as sns
import matplotlib.pyplot as plt

from pprint import pprint
from rllib_recsim.utils import pretty_print_configs

In [None]:
from ray import tune, rllib, data, air
from ray.tune import register_env

In [None]:
import recsim 
# import custom long-term satisfaction recommendation system gym env
from rllib_recsim.rllib_recsim import ModifiedLongTermSatisfactionRecSimEnv

## RecSim <a class="anchor" id="recsys"></a>

In this notebook, we will use data generated with <b><a href="https://github.com/google-research/recsim">Google's RecSim environment</a></b>, which was developed for the YouTube recommendation problem.  It is a configurable environment, where ideally you would plug in your own users, products, and embedding features.

**Some further readings**

- <a href="https://github.com/google-research/recsim">RecSim github</a>
- <a href="https://arxiv.org/pdf/1909.04847.pdf">RecSim paper</a>

The following image depicts the components of the RecSim package:

<img src="./images/recsim_environment.png" width="70%" />

The environment is <i>time-limit-based</i>, meaning that an episode terminates after a fixed number (10) of videos have been watched.

### Document Model

<img src="./images/document_model.png" width="70%" />

Documents represent the candidate pool of items to recommend; their features are sampled in the range [0, 1]. In this tutorial, we use <b>a single feature, "sweetness"</b>, drawn from a uniform distribution over [0.8, 1.0] for "chocolaty" items and over [0, 0.2] for "kaley" items.
- The documents can be different at each step (produced by some "candidate generation" process), or fixed throughout the simulation.
- The recommendation algorithm observes the D candidate documents.  It then makes a selection (possibly ordered) of k documents and presents them in a "slate" to the user. We will focus on **slate size of 1** in this tutorial. 
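
As a rough illustration of the document model described above, the following sketch samples a candidate pool with one "sweetness" value per document. The helper name `sample_documents`, the 50/50 mix between the two clusters, and the pool size of 20 are assumptions made for illustration; the actual sampling happens inside the RecSim environment and may differ.

```python
import numpy as np


def sample_documents(num_docs, rng, p_chocolate=0.5):
    """Sample one sweetness value per document (hypothetical helper).

    Chocolaty items draw sweetness from U[0.8, 1.0), kaley items from U[0.0, 0.2).
    The mixing probability `p_chocolate` is an assumption for illustration.
    """
    is_chocolate = rng.random(num_docs) < p_chocolate
    chocolate = rng.uniform(0.8, 1.0, size=num_docs)
    kale = rng.uniform(0.0, 0.2, size=num_docs)
    return np.where(is_chocolate, chocolate, kale)


rng = np.random.default_rng(seed=0)
docs = sample_documents(num_docs=20, rng=rng)

# With a slate size of 1, the recommender picks a single document index.
sweetest = int(np.argmax(docs))
print(f"sweetness values: {np.round(docs, 2)}")
print(f"index of the sweetest document: {sweetest}")
```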

In [None]:
env = ModifiedLongTermSatisfactionRecSimEnv()
print(env)

The printed environment shows the main RecSim environment, which follows the [gym API](https://www.gymlibrary.dev/api/core/). For details on how the environment is implemented, see the source code accompanying this tutorial ([link](https://github.com/anyscale/academy/blob/main/ray-rllib/acm_recsys_tutorial_2022/rllib_recsim/rllib_recsim.py)).

Let's use `env.reset()` to begin the interaction with the environment:

```
gym.Env.reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None) → Tuple[ObsType, dict]
Resets the environment to an initial state and returns the initial observation.
```

In [None]:
obs = env.reset()
pprint(obs)

### User Model

<img src="./images/user_model.png" width="70%" />

In RecSim, users are represented by a set of features, some of which may be latent variables that our recommendation system does not observe during live interaction. In this tutorial, we assume that a sampled user has no observable features such as age or gender, so our AI must infer the user's latent state from the history of interactions in the current session.
- The user examines a "slate" of recommended items and makes a choice of one item. After making their choice, the user emits an engagement score which indicates how much that user was engaged with that particular item they chose. The agent has to learn to estimate the latent states of the user that shape their choice model in the future. 

In [None]:
# observable user features
obs['user']

In [None]:
# latent user features (HACK: we can cheat and access the user state within the env)
pprint(env.get_user_state())

### User Choice

<img src="./images/user_choice_model.png" width="70%" />

This module controls how a particular user would **respond** to a set of recommendations and how its **latent state would evolve** as a function of this interaction. 
In this tutorial's environment, engagement is assumed to be a function of two competing phenomena:
-   The love of the user for sweet items ($sweetness(item_t)$)
-   The long term satisfaction which cares about healthier options. It is inversely correlated with the sweetness of items suggested so far ($satisfaction_{t-1}$)

$$satisfaction_t := satisfaction_{t-1} * \sigma(-sweetness(item_{t}))$$
$$r(item_t) \propto sweetness(item_t) * satisfaction_{t-1}$$

For example, if a user who loves chocolate has not had any for a while, a chocolate item will be more engaging than a kaley item. On the other hand, if we keep recommending chocolate to the same person, they may lose interest and stop using our recommendations.
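
To make the two update rules above concrete, here is a minimal sketch that rolls them out for a fixed sequence of items. It assumes an initial satisfaction of 1.0 and a proportionality constant of 1 for the reward; both are illustrative choices, and the real environment may include additional terms.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate(sweetness_sequence, initial_satisfaction=1.0):
    """Roll out the two update rules for a fixed sequence of item sweetness values."""
    satisfaction = initial_satisfaction
    rewards = []
    for sweetness in sweetness_sequence:
        # r(item_t) is proportional to sweetness(item_t) * satisfaction_{t-1};
        # the proportionality constant is taken to be 1 here (assumption).
        rewards.append(sweetness * satisfaction)
        # satisfaction_t := satisfaction_{t-1} * sigma(-sweetness(item_t))
        satisfaction *= sigmoid(-sweetness)
    return rewards, satisfaction


choc_rewards, choc_satisfaction = simulate([0.9] * 10)
kale_rewards, kale_satisfaction = simulate([0.1] * 10)

print("always chocolate:", [round(r, 3) for r in choc_rewards[:4]])
print("always kale:     ", [round(r, 3) for r in kale_rewards[:4]])
print("final satisfaction (chocolate vs. kale):", choc_satisfaction, kale_satisfaction)
```

Under these simplified rules, satisfaction shrinks faster the sweeter the recommended item is, which is why the first chocolate recommendation pays off well while repeated ones quickly lose value.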

In [None]:
obs = env.reset()
pprint(obs)

The observation space is a dictionary of four keys. 
- The `user` key is empty, because there are no observable user features. 
- The `doc` key contains a set of documents presented with numerical keys. Each doc item has a scalar score representing its sweetness. 
- The `response` key includes a single record of `click` and `engagement` from the immediately preceding interaction. 
- The `time` key shows a normalized timestep within the session for this user: -0.5 corresponds to the beginning of the interaction and +0.5 corresponds to the end.
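
The structure described above can be pictured with a hand-built mock. This is a sketch only: every value is made up, the string document keys and the linear time scaling in `normalized_time` are assumptions, and the real observation returned by the environment may use different container types.

```python
from pprint import pprint

EPISODE_LENGTH = 10  # episodes in this tutorial last 10 steps


def normalized_time(step, episode_length=EPISODE_LENGTH):
    """Map step 0..episode_length onto [-0.5, +0.5] (assumed linear scaling)."""
    return step / episode_length - 0.5


# Hand-built mock with the four keys described above; all values are made up.
mock_obs = {
    "user": [],  # no observable user features
    "doc": {"0": [0.93], "1": [0.07], "2": [0.85]},  # sweetness per candidate
    "response": [{"click": 1, "engagement": 4.2}],  # previous interaction
    "time": normalized_time(step=0),
}

pprint(mock_obs)

# The candidate keys can be converted to integer actions.
candidate_actions = [int(key) for key in mock_obs["doc"]]
print(candidate_actions)
```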

Let's now check out the environment's reward behavior. We will always pick the sweetest item and plot the reward over time.

In [None]:
# Let's check out the reward space
obs = env.reset()
rewards = []
done = False
while not done:
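    # `obs["doc"]` is a dict keyed by document index, so take the key with the largest sweetness.
    action = int(max(obs["doc"], key=lambda k: obs["doc"][k]))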
    obs, reward, done, info = env.step(action)
    rewards.append(reward)


In [None]:
plt.plot(rewards)
plt.ylabel('engagement per step for maximum greedy policy')
plt.xlabel('step')

We can note a couple of things here already:

1. The immediate engagement would be high initially when the sweetest item is suggested for the first time.
2. As we keep recommending the sweetest items, the user's satisfaction tapers off significantly and, as a result, engagement quickly drops.
3. Episodes seem to last for 10 timesteps.

#### Exercise (2 min):
Instead of picking the item with the highest feature value, pick the item with the lowest feature value and see what happens.

- What do your observations imply about this environment?
- What policy maximizes engagement with the user?

In [None]:
# Let's check out the reward space
obs = env.reset()
rewards = []
done = False
while not done:
    action = None  # TODO (exercise): replace None with your code
    obs, reward, done, info = env.step(action)
    rewards.append(reward)

In [None]:
plt.plot(rewards)
plt.ylabel('engagement per step for minimum greedy policy')
plt.xlabel('step')

## Getting some baselines on trying to maximize engagement
Next, we will run some simple baselines to get a feeling for the reward we can accumulate in this environment using simple policies.

- Greedy minimum feature value (recommending the kaliest option)
- Greedy maximum feature value (recommending the chocolatiest option)
- Random policy (recommending random items from the pool)


In [None]:
# Function that measures and outputs a baseline reward.
# It computes the mean accumulated reward per episode when acting with the chosen baseline policy
# ("random", "argmax", or "argmin") at each time step.
def calc_baseline(baseline_type="random",
                  episodes=100):

    env = ModifiedLongTermSatisfactionRecSimEnv()
    # Reset the env.
    obs = env.reset()

    # Number of episodes already done.
    num_episodes = 0
    # Current episode's accumulated reward.
    episode_reward = 0.0
    episode_satisfaction = []
    # Collect all episode rewards here to be able to calculate a baseline reward.
    episode_rewards = []
    episode_satisfactions = []

    # Enter while loop (to step through the episode).
    time_step = 0
    while num_episodes < episodes:
        # Produce an action
        random_action = env.action_space.sample()
        argmax_action = int(max(obs['doc'], key=lambda x: obs['doc'][x]))
        argmin_action = int(min(obs['doc'], key=lambda x: obs['doc'][x]))

        action_dict = {
            'argmax': argmax_action, # greedy choc
            'argmin': argmin_action, # greedy kale
            'random': random_action,
        }

        action = action_dict[baseline_type]

        # Send the action to the env's `step()` method to receive: obs, reward, done, and info.
        obs, reward, done, _ = env.step(action)

        # Accumulate the rewards
        episode_reward += reward

        # Append the current satisfaction to episode_satisfaction
        episode_satisfaction.append(
            env.environment._user_model._user_state.satisfaction
        )

        time_step += 1
        # Check whether the episode is done; if yes, reset and increase the episode counter.
        if done:
            if num_episodes % 99 == 0:
                print(f" {num_episodes} ", end="")
            elif num_episodes % 9 == 0:
                print(".", end="")

            # increment on end of episode
            num_episodes += 1
            time_step = 0
            obs = env.reset()
            episode_rewards.append(episode_reward)
            episode_reward = 0.0
            episode_satisfactions.append(np.mean(episode_satisfaction))
            # Start a fresh list so the next mean covers one episode only.
            episode_satisfaction = []

    # Print out and return mean episode reward (and its standard deviation).
    env_mean_reward = np.mean(episode_rewards)
    env_sd_reward = np.std(episode_rewards)

    # Print out the satisfaction over the episodes
    env_mean_satisfaction = np.mean(episode_satisfactions)
    env_sd_satisfaction = np.std(episode_satisfactions)
    
    print(f"\nMean {baseline_type} baseline reward: {env_mean_reward:.2f}+/-{env_sd_reward:.2f}, satisfaction: {env_mean_satisfaction:.2f}+/-{env_sd_satisfaction:.2f}")

    return env_mean_reward, episode_rewards

In [None]:
num_episodes = 1000
kaliest_baseline, _ = calc_baseline(baseline_type="argmin", episodes=num_episodes)
sweetest_baseline,  _ = calc_baseline(baseline_type="argmax", episodes=num_episodes)
random_baseline, _ = calc_baseline(baseline_type="random", episodes=num_episodes)

### Discussion about the baselines

For every baseline, we printed not only the engagement score accumulated over the entire user session but also the satisfaction term averaged over the session.
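
When comparing baselines, it can help to report the uncertainty of the mean in addition to the spread across episodes. `calc_baseline` returns the per-episode rewards as its second value, so a summary like the following could be applied to them. The helper `summarize` and the example numbers are hypothetical and only show the arithmetic.

```python
import numpy as np


def summarize(episode_rewards):
    """Return mean, sample standard deviation, and standard error of the mean."""
    rewards = np.asarray(episode_rewards, dtype=float)
    mean = rewards.mean()
    std = rewards.std(ddof=1)  # sample standard deviation
    sem = std / np.sqrt(len(rewards))  # standard error of the mean
    return mean, std, sem


# Made-up episode rewards, only to show the call.
example_rewards = [10.0, 12.0, 14.0, 16.0]
mean, std, sem = summarize(example_rewards)
print(f"mean={mean:.2f}, std={std:.2f}, sem={sem:.2f}")
```

The standard error shrinks with the square root of the number of episodes, so running more episodes tightens the estimate of the mean without changing the per-episode spread.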

The question is whether we can automatically learn a policy in this recommendation environment that is better than random.


## Questions (2 min)

- Any questions so far?

# Offline RL with RecSys <a class="anchor" id="offline-rl"></a>

<img src="images/offline_rl.png">


### If we don't have a live environment, how do we know how well our trained policy will perform?

One of the challenges in offline RL is the evaluation of the trained policy. In online RL (when a simulator
is available), one can use the data collected for training to compute total episode rewards. Remember
that observations, actions, rewards, and done flags are all part of this training data. Alternatively,
one could run a separate worker (with the same trained policy) on a fresh evaluation-only environment.
In this latter case, we would also have the chance to switch off any exploratory behavior (e.g. stochasticity used
for better action entropy).
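
The idea of switching off exploration during evaluation can be sketched with a simple epsilon-greedy rule. Epsilon-greedy is used here only as a familiar example of exploratory behavior; the function `select_action` is hypothetical and is not RLlib's exploration API.

```python
import numpy as np


def select_action(q_values, epsilon, explore, rng):
    """Epsilon-greedy action selection with an on/off switch for exploration."""
    if explore and rng.random() < epsilon:
        # Exploratory step: pick any action uniformly at random.
        return int(rng.integers(len(q_values)))
    # Greedy step: pick the action with the highest estimated value.
    return int(np.argmax(q_values))


rng = np.random.default_rng(seed=0)
q_values = np.array([0.1, 0.7, 0.2])  # made-up value estimates

training_actions = [select_action(q_values, 0.5, True, rng) for _ in range(10)]
evaluation_actions = [select_action(q_values, 0.5, False, rng) for _ in range(10)]
print("training:  ", training_actions)
print("evaluation:", evaluation_actions)
```

With `explore=False`, every call returns the greedy action, so the evaluation reward reflects the learned policy rather than random exploration noise.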

In offline RL, no such data from a live environment is available to us. There are two common ways of addressing this dilemma, plus a third that requires a simulator:

1) We deploy the learned policies into production, or maybe just a portion of our production system (similar to A/B testing), and see what happens.

2) We use a method called "off policy evaluation" (OPE) to compute an estimate on how the new policy would perform if we were to deploy it into a real environment. There are different OPE methods available in RLlib off-the-shelf.

3) The third option - which we will use here - is kind of cheating and only possible if you actually do have a simulator available (but you only want to use it for evaluation, not for training, because you want to see how cool offline RL is :) )
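
To give a feel for the off-policy evaluation idea in option 2, here is a minimal sketch of ordinary (trajectory-wise) importance sampling. It is a textbook estimator written from scratch, not RLlib's implementation, and the function name and toy data are assumptions. It needs the probability with which the logging policy chose each action, as well as the probability the new policy would assign to that same action.

```python
import numpy as np


def importance_sampling_estimate(episodes, gamma=1.0):
    """Ordinary importance sampling estimate of the target policy's expected return.

    Each episode is a list of (behavior_prob, target_prob, reward) tuples, one per step.
    """
    estimates = []
    for episode in episodes:
        weight = 1.0
        discounted_return = 0.0
        for t, (behavior_prob, target_prob, reward) in enumerate(episode):
            # Re-weight the episode by how much more (or less) likely the
            # target policy is to take the logged action.
            weight *= target_prob / behavior_prob
            discounted_return += (gamma ** t) * reward
        estimates.append(weight * discounted_return)
    return float(np.mean(estimates))


# Toy one-step episodes: the behavior policy is uniform over 2 actions (0.5 each),
# the target policy always picks the action that yields reward 1.
episodes = [
    [(0.5, 1.0, 1.0)],
    [(0.5, 1.0, 1.0)],
    [(0.5, 0.0, 0.0)],
    [(0.5, 0.0, 0.0)],
]
estimate = importance_sampling_estimate(episodes)
print(estimate)
```

In this toy data, the logged episodes where the target policy disagrees with the logged action get weight 0, and the remaining ones get weight 2, so the estimate recovers the reward of 1 that the target policy would collect. On long episodes the product of ratios can become very large or very small, which is why this estimator tends to have high variance.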

In this tutorial, we will use the third option to show the effectiveness of offline RL in improving over existing policies running in production. We will also see how much benefit we can get by improving our dataset quality, starting from a totally random policy all the way to 20% expert demonstrations and 80% random.

We can use the currently running policies in production to collect some "historical data" that we can use to train RL agents with. Offline RL can be used to improve upon the existing policies deployed in production. We have prepared some datasets in advance for the purpose of this tutorial. They were all generated using `<path to the script>`. In this script we can mix the percentage of the "expert" data vs. random data to investigate the effect of dataset quality on the final performance of our models.   

Let's look at an example dataset that we prepared earlier:



In [None]:
prefix = "s3://air-example-data/rllib/offline_rl_recsim_data/"
train_data_path = prefix + "sampled_data_train_random_transitions_small"

dset = data.read_json(train_data_path)
df = dset.to_pandas()
print('Columns: ', df.columns)
print('Number of rows: ', len(df))
print('Value of the first row:')
print('-'*20)
print(df.iloc[0])
print('Value of the second row:')
print('-'*20)
print(df.iloc[1])


From the dataset schema, we can see that RLlib always expects a `type` column whose value is `SampleBatch`. Each row holds the usual transition entries (observation, next observation, action, reward, and done flag). It also contains an episode ID, a timestep indicator, and an `action_prob` that gives the probability of the action that was chosen at data-collection time. For the random policy, the action probability is always 1/20 (0.05).
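
As a small illustration of how such transition rows can be inspected, the sketch below builds a toy DataFrame and computes per-episode returns. The column names (`eps_id`, `t`, `actions`, `rewards`, `dones`, `action_prob`) are assumptions modeled on the fields described above, and all values are made up; check `df.columns` on the real dataset before reusing this.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names and values are assumptions.
toy_df = pd.DataFrame(
    {
        "eps_id": [0, 0, 0, 1, 1],
        "t": [0, 1, 2, 0, 1],
        "actions": [3, 17, 5, 0, 9],
        "rewards": [1.0, 0.5, 0.25, 2.0, 1.0],
        "dones": [False, False, True, False, True],
        "action_prob": [0.05] * 5,
    }
)

# A uniformly random policy over 20 candidates picks each action with probability 1/20.
assert (toy_df["action_prob"] - 1 / 20).abs().max() < 1e-9

# Sum the rewards within each episode to get per-episode returns.
episode_returns = toy_df.groupby("eps_id")["rewards"].sum()
print(episode_returns)
```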

Now that we understand the dataset format, we can use RLlib to train an offline RL algorithm. RLlib provides several out-of-the-box offline RL algorithms. Besides those offline-RL-specific algorithms, we can also use any off-policy algorithm (e.g. DQN) for offline RL. The only difference between the online and offline versions is that instead of an environment sampler, we use a dataset sampler to obtain the data. In the next section, we will use DQN and pass a dataset path in the input config.
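
Conceptually, a dataset sampler hands the learner batches of logged transitions without ever stepping an environment. The class below is a deliberately simplified, hypothetical sketch of that idea; it is not how RLlib implements its dataset reader.

```python
import random


class DatasetSampler:
    """Serve minibatches from a fixed list of logged transitions (hypothetical class)."""

    def __init__(self, transitions, seed=0):
        self._transitions = list(transitions)
        self._rng = random.Random(seed)

    def sample(self, batch_size):
        # Draw with replacement from the log; no environment is stepped.
        return [self._rng.choice(self._transitions) for _ in range(batch_size)]


# Each transition: (observation, action, reward, next_observation, done). Values are made up.
logged = [
    ((0.1,), 0, 1.0, (0.2,), False),
    ((0.2,), 1, 0.5, (0.3,), False),
    ((0.3,), 0, 0.0, (0.4,), True),
]
sampler = DatasetSampler(logged)
batch = sampler.sample(batch_size=4)
print(len(batch), batch[0])
```

Because the learner only ever sees what is in the log, it cannot try actions the logging policy never took, which is the main practical limitation of training from a dataset.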

### Offline Bandit

Below is an example script that trains a DQN bandit agent using the dataset we looked at above.

In [None]:
from ray.rllib.algorithms.dqn import DQN, DQNConfig

register_env("modified-lts", lambda config: ModifiedLongTermSatisfactionRecSimEnv())

In [None]:
env = ModifiedLongTermSatisfactionRecSimEnv()
action_space = env.action_space
observation_space = env.observation_space

bandit_config_offline = (
    DQNConfig()
    .framework("torch")
    .offline_data(
        input_='dataset',
        input_config={
            'format': 'json',
            'paths': train_data_path,
        }
    )
    .environment(
        action_space=action_space,
        observation_space=observation_space,
    )
    .evaluation(
        evaluation_interval=1,
        evaluation_config={
            "input": "sampler",
            "explore": False,
            "env": "modified-lts",
        },
    )
    .training(gamma=0)
)

Some explanations of the script above:

- Input is configured through the `.offline_data()` API:

```python
    .offline_data(
        input_='dataset',
        input_config={
            'format': 'json',
            'paths': train_data_path
        }
    )
```

- The environment is not passed to the config. Instead, we pass in the expected action and observation spaces to construct the policies. We could create them manually based on our knowledge of the system. In this case, we cheat and use the environment's attributes to get the correct action and observation spaces.

```python 
    .environment(
        action_space=action_space,
        observation_space=observation_space,
    )
```

- Evaluation config: since evaluation still runs in the environment simulator, we need to specify the environment explicitly here.

```python
    .evaluation(
        evaluation_interval=1,
        evaluation_config={
            "input": "sampler",
            "explore": False,
            "env": "modified-lts"
        },
    )
```

In [None]:
bandit_tuner = tune.Tuner(
    DQN,
    param_space=bandit_config_offline.to_dict(),
    run_config=air.RunConfig(
        local_dir="./results_notebook/offline_bandits/",
        stop={"training_iteration": 30},
    )
)
offline_bandits_results = bandit_tuner.fit()

Let's now look at the offline training results. From the evaluation reward below, we can see that by running offline RL on randomly collected transitions we can improve over the random policy. This is extremely useful in practical scenarios where our goal is to improve over existing production policies.

In [None]:
print('Mean Bandit Episode reward:')
offline_bandits_results[0].metrics['evaluation']['episode_reward_mean']

### Offline RL

Now let's make a simple change to train the same DQN recommendation agent with offline RL: we set `gamma=0.99` instead of `gamma=0`, so the agent also values future rewards.
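
To see why the discount factor matters, the following sketch computes the textbook one-step Q-learning target for both settings used in this notebook (`gamma=0` in the bandit configuration and `gamma=0.99` in the DQN configuration). It illustrates the principle only: RLlib's DQN loss includes further details that are not shown, and the function name and numbers are made up.

```python
import numpy as np


def td_target(reward, next_q_values, done, gamma):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a') for non-terminal steps."""
    bootstrap = 0.0 if done else float(np.max(next_q_values))
    return reward + gamma * bootstrap


next_q_values = np.array([1.0, 3.0, 2.0])  # made-up value estimates for the next state

bandit_target = td_target(reward=0.5, next_q_values=next_q_values, done=False, gamma=0.0)
dqn_target = td_target(reward=0.5, next_q_values=next_q_values, done=False, gamma=0.99)
print(bandit_target, dqn_target)
```

With `gamma=0`, the target collapses to the immediate reward, so the agent behaves like a bandit. With `gamma=0.99`, the estimated value of the next state dominates the target, which lets the agent trade immediate engagement for long-term satisfaction.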

In [None]:
dqn_config_offline = (
    DQNConfig()
    .offline_data(
        input_='dataset',
        input_config={
            'format': 'json',
            'paths': train_data_path,
        }
    )
    .environment(
        action_space=action_space,
        observation_space=observation_space,
    )
    .evaluation(
        evaluation_interval=1,
        evaluation_config={
            "input": "sampler",
            "explore": False,
            "env": "modified-lts",
        },
    )
    .training(gamma=0.99)
)

In [None]:
dqn_tuner = tune.Tuner(
    DQN,
    param_space=dqn_config_offline.to_dict(),
    run_config=air.RunConfig(
        local_dir="./results_notebook/offline_rl/",
        stop={"training_iteration": 30},
    )
)
offline_rl_results = dqn_tuner.fit()

In [None]:
print('Mean DQN Episode reward:')
offline_rl_results[0].metrics['evaluation']['episode_reward_mean']

In [None]:
import pandas as pd

# plot the results and compare to baselines
offline_rl_df = pd.read_csv("saved_runs/offline_rl/progress.csv")
offline_bandits_df = pd.read_csv("saved_runs/offline_bandits/progress.csv")

In [None]:

sns.lineplot(data=offline_rl_df, x="training_iteration", y="evaluation/episode_reward_mean", label="Offline_DQN")
sns.lineplot(data=offline_bandits_df, x="training_iteration", y="evaluation/episode_reward_mean", label="Offline_Bandits")
plt.axhline(random_baseline, color="red", linestyle='--', label="random baseline")
plt.legend()
plt.title('Offline RL vs. Baselines training performance')

## Conclusion

- Bandits converge to a short-sighted solution
    -  They optimize immediate reward.
- DQN takes long-term reward into account
    -  It achieves a policy better than random.
- Offline RL is a viable option when we don't have a simulator for training
    -  Since it can't explore freely, it won't perform as well as an online RL method.
    -  Its performance is bounded by the quality of the dataset.


### References

1. 📖 [Ray summit tutorials on RLlib](https://github.com/anyscale/ray-summit-2022-training/tree/main/ray-rllib)
2. 👩‍ [RLlib Documentation](https://docs.ray.io/en/latest/rllib/index.html)

## Thank you!

<a href="https://bit.ly/rllib_odsc_west_survey">Survey</a> - Please let us know how useful you have found this workshop.

**We would love to connect with you!**

**Twitter** - @anyscalecompute | @raydistributed <br>
<b><a href="https://github.com/ray-project/ray">Github</a></b> - 😜 give us a star!<br>
<b><a href="https://www.ray.io/community">Slack</a></b> - [+invitation link](https://docs.google.com/forms/d/e/1FAIpQLSfAcoiLCHOguOm8e7Jnn-JJdZaCxPGjgVCvFijHB5PLaQLeig/viewform)<br>
<b><a href="https://discuss.ray.io/">Discuss</a></b> - searchable questions <br>