https://learning.oreilly.com/scenarios/define-and-understand/9781098121587/

The goal of this set is to introduce you to fundamental concepts of reinforcement learning (RL) in the context of recommendation systems.

Typical applications include:

Product recommendation in online shops
Content recommendation on news websites
Next best offer in personalized email campaigns


Learning Goals
Understand the core concepts of RL systems and how to apply them
Understand how RL environments work and how to interact with them
Get familiar with Google RecSim for easy environment development

What Is an Environment?
In an RL scenario, the computer (our agent) has to make a choice (action) from a possible set of action items (candidate documents).

Depending on which documents the agent recommends, the user will take an action and send a response back to the agent. This response might include a reward, such as the play time of a video or the view time of a news article.

All of these components form a so-called environment for our RL system, which is an abstraction of the world our agent (which can be powered by an RL algorithm) operates within.
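
To make the agent/environment loop concrete before turning to RecSim, here is a minimal sketch of an environment with the usual reset and step methods. ToySlateEnv, its parameters, and its user model (the user always consumes the first recommended document) are hypothetical and invented for illustration; this is not how RecSim is implemented.

```python
import random


class ToySlateEnv:
    """Hypothetical stand-in for a recommender environment (not RecSim)."""

    def __init__(self, num_candidates=10, slate_size=3, max_steps=5, seed=0):
        self.num_candidates = num_candidates
        self.slate_size = slate_size
        self.max_steps = max_steps
        self._rng = random.Random(seed)
        self._steps = 0
        self._docs = []

    def _observe(self):
        # The agent sees one feature value per candidate document.
        return {"doc": list(self._docs)}

    def reset(self):
        # Start a new episode with a fresh set of candidate documents.
        self._steps = 0
        self._docs = [self._rng.random() for _ in range(self.num_candidates)]
        return self._observe()

    def step(self, slate):
        assert len(slate) == self.slate_size
        # Made-up user model: the user consumes the first recommended
        # document, and the reward equals that document's feature value.
        reward = self._docs[slate[0]]
        self._steps += 1
        done = self._steps >= self.max_steps
        return self._observe(), reward, done, {}


env = ToySlateEnv()
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    # The "agent" here is a fixed rule: always recommend documents 0, 1, 2.
    observation, reward, done, info = env.step([0, 1, 2])
    total_reward += reward
print(f"Episode finished with total reward {total_reward:.2f}")
```

The loop at the bottom has the same shape as the interaction with the RecSim environment later in this lab: reset once, then call step until done is True.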

In practice, these environments are very complex, and the real user behavior is never known and can only be abstracted using a model.

Instead of building such an environment from scratch, we will use the Python package RecSim from Google.

RecSim allows us to define the different building blocks of a recommender system in an easy way and comes with pre-built environments.

You can install RecSim with the following pip command:

pip install recsim
As you see from the "Requirement already satisfied" messages, RecSim is already installed on this machine.

Clear the terminal output, run Python, and load the RecSim library with the following code:

clear
python
import recsim
from recsim import environments
from pprint import pprint # for better print formatting

What Is RecSim?
The RecSim documentation on GitHub provides a figure that shows the different classes needed to build an environment in RecSim. The blue classes refer to the documents, the green classes to the user, and the red classes to the agent. The various boxes represent conditional probability distributions. Therefore, a RecSim environment can be seen as a dynamic Bayesian network (DBN).

We won't cover all of them in detail. Instead, we will rely on pre-built environments, with some customizations.

One customization is to define how many candidate documents should be observed by the agent (NUM_CANDIDATES) and how many of those should be recommended (SLATE_SIZE).

Run the following code so that our agent receives 10 candidate documents, of which it has to recommend 3:

NUM_CANDIDATES = 10
SLATE_SIZE = 3

Build the Environment
RecSim comes with three pre-built environments:

Long term satisfaction (the one we will use in this lab)
Interest evolution
Interest exploration
Each of these environments makes different assumptions about the user behavior.

We will continue with the long term satisfaction (LTS) environment. If you are interested in the other ones, check out the resource links at the end of this lab.

The LTS environment simulates a situation where a user of an online service interacts with items of content, which are characterized by their level of clickbaitiness (on a scale of 0 to 1).

Clickbaity items (choc) generate engagement but lead to a decrease in long-term satisfaction.

Nonclickbaity items (kale) increase satisfaction but do not generate as much engagement.

The challenge is to balance the two in order to achieve some long-term optimal trade-off, a very common scenario for content recommendation systems.
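
The choc/kale trade-off can be sketched with a few lines of arithmetic. The model below is a simplified assumption made for illustration: the update rule, the payoffs (5 for choc, 4 for kale), and the parameter values are made up and are not taken from RecSim's user model. It only shows the qualitative effect described above.

```python
import math


def simulate(clickbaitiness, steps=50, sensitivity=0.1, memory_discount=0.9):
    """Return a list of (satisfaction, engagement) for a fixed clickbaitiness in [0, 1]."""
    exposure = 0.0  # running balance: rises after kale, falls after choc
    history = []
    for _ in range(steps):
        # Satisfaction is a sigmoid of the accumulated exposure (assumption).
        satisfaction = 1.0 / (1.0 + math.exp(-sensitivity * exposure))
        # Choc pays more per view than kale (made-up numbers), scaled by satisfaction.
        base = 5.0 * clickbaitiness + 4.0 * (1.0 - clickbaitiness)
        history.append((satisfaction, base * satisfaction))
        # Clickbaity items push the balance down, nonclickbaity items push it up.
        exposure = memory_discount * exposure - 2.0 * (clickbaitiness - 0.5)
    return history


choc = simulate(1.0)  # always recommend the most clickbaity item
kale = simulate(0.0)  # always recommend the least clickbaity item
print("first step engagement: choc %.2f, kale %.2f" % (choc[0][1], kale[0][1]))
print("last step engagement:  choc %.2f, kale %.2f" % (choc[-1][1], kale[-1][1]))
```

In this toy model, choc wins on the first step but loses by the end of the episode, which is the tension an agent has to resolve.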

Let's initialize the environment using the following code and plug in our previously defined NUM_CANDIDATES and SLATE_SIZE:

lts_env = recsim.environments.long_term_satisfaction.create_environment({
    "num_candidates": NUM_CANDIDATES,
    "slate_size": SLATE_SIZE,
    "resample_documents": False
})
We will come back to what resample_documents means in a bit.

Observation Space
Let's find out how our fresh environment looks.

Run the following code to reset the environment to an initial state and print the initial observation:

observation_space = lts_env.reset()
pprint(observation_space)
Take a look at the output.

That's what an agent within our environment would see.

You can see that there are ten documents available for the agent to choose from. Each document has one feature—the clickbaitiness (on a scale of 0 to 1).

We can't see a response because it is still the initial state.

We can see a user, but without any features describing the user.

Let's reset the environment by running the code again.

Try running the following code multiple times:

observation_space = lts_env.reset()
pprint(observation_space)
You will see that the documents do not change; it will always be the same list of ten documents. That's because we set resample_documents to False when we set up the environment. That means the agent will always see the same documents for this environment.

Let's change the parameter and allow our environment to resample documents by running the following code:

lts_env_resample = recsim.environments.long_term_satisfaction.create_environment({
    "num_candidates": NUM_CANDIDATES,
    "slate_size": SLATE_SIZE,
    "resample_documents": True
})
Run the following code multiple times:

observation_space = lts_env_resample.reset()
pprint(observation_space)
You can see that the documents are now changing every time you run the code. That's because every time we observe a new state of the environment, we get a fresh sample of documents from the overall document store.

In RecSim, you can't manually define documents; they always need to be sampled from a larger repository.

To learn more about documents and how to set up your own documents model, check out the link in the resources at the end of this lab.

So far, we have reset the environment a couple of times, but we haven't actually taken action in it. Let's do that now!

Action Space
As you can see from the previous outputs, the response variable and the user features did not change as we ran our code. That's because we didn't actually take action inside our environment but just initialized the environment over and over again.

What actions can we take? Let's find out by exploring the action space.

Run the following code:

print(lts_env.action_space)
As you can see from the output, our agent faces a MultiDiscrete action space.

If you remember, we defined our environment in such a way that the agent can choose from 10 documents for each of the 3 slots in the slate.

And that is exactly what we can observe here—10 possible actions for each of the 3 slots.
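
As a rough illustration of what an action in such a space looks like, the following sketch draws one document index per slot. The helper sample_slate is hypothetical and uses only the standard library; it mimics the idea of sampling from a multi-discrete space and does not call gym or RecSim.

```python
import random


def sample_slate(nvec, rng):
    """Draw one index per slot, each uniformly from range(n)."""
    return [rng.randrange(n) for n in nvec]


rng = random.Random(42)
nvec = [10, 10, 10]  # 10 choices for each of the 3 slots
slate = sample_slate(nvec, rng)
print(slate)
```

Because each slot is drawn independently in this sketch, the same document can appear in more than one slot.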

Let's interact with our environment manually, pretending we are the agent that tries to make the best recommendation to create long-term user satisfaction.

Let's say we will always recommend the first three items from the document candidates.

RECOMMENDATIONS = [0, 1, 2]
Let's plug this into our environment and see what happens.

We can apply an action using the step method.

Run the following code:

observation_0 = lts_env.reset()
observation_1, reward, done, _ = lts_env.step(RECOMMENDATIONS)
pprint(observation_1)
You can see that this observation provides more detail than the ones before—it actually contains feedback from the user. Let's find out what that means.

User Behavior
The step method that we just called on the environment returned a tuple of four items (observation, reward, done, info), where:

observation is the agent's observation, which includes the user's state features, the documents, and the user responses.
reward is the amount of reward returned after the previous action.
done is a boolean that indicates whether the episode has ended (the user's time budget is empty).
info is a dictionary containing information for debugging/learning.
In our case, the reward is the engagement that the user has shown for the documents that we suggested to them.

In a real scenario, this reward could be the view time, for example.

You should see from the output which of our suggestions was clicked by the user ('click': 1) and the engagement for it.
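
If you prefer to extract the clicked item programmatically instead of reading the printed output, a small helper like the one below could do it. The sample observation is hand-written with made-up numbers; the 'click' and 'engagement' keys come from the output described above, while the layout as one dictionary per slot is an assumption you should check against your own output.

```python
def summarize_responses(observation):
    """Return (slot, engagement) for the first clicked item, or (None, 0.0)."""
    for slot, response in enumerate(observation["response"]):
        if response["click"] == 1:
            return slot, response["engagement"]
    return None, 0.0


# Hand-written example with made-up values, not real RecSim output.
sample_observation = {
    "response": (
        {"click": 0, "engagement": 0.0},
        {"click": 1, "engagement": 3.7},
        {"click": 0, "engagement": 0.0},
    )
}
clicked_slot, engagement = summarize_responses(sample_observation)
print(clicked_slot, engagement)
```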

Run this code to step through the environment until the user's time budget has been depleted, which means the end of our current episode:

done = False
total_reward = 0
step_count = 0
lts_env.reset()
while not done:
  observation, reward, done, _ = lts_env.step(RECOMMENDATIONS)
  total_reward += reward
  step_count += 1

print("Episode has ended after %(a)s steps with an accumulated reward of %(r)d. The latest observation was:" % {'a': step_count, 'r': total_reward})
pprint(observation)
As you can see, always selecting the first three documents has yielded an accumulated reward as printed in the output.

Set Up Simulated User Interactions

Import Packages
As in the previous lab, Define and Understand a Reinforcement Learning Environment, we will build the environment for a recommendation scenario with Google RecSim.

You can install RecSim with pip install recsim, which is already done on this machine.

Click on the Copy to Editor button to create a file interactions.py and add the necessary import statements for some utility packages of this scenario.

# Imports
from pprint import pprint # for better print formatting
import numpy as np
import gym 
gym is a popular Python library for developing and comparing reinforcement learning algorithms.

Create the Environment
To load the long term satisfaction (LTS) environment from RecSim, add the following code to interactions.py:

from recsim.environments.long_term_satisfaction import create_environment as LongTermSatisfactionRecSimEnv

Add the following code to create an environment with a slate_size of 3 and 20 document candidates:

# Environment
lts_20_3 = LongTermSatisfactionRecSimEnv({
    "num_candidates": 20, 
    "slate_size": 3,
    "resample_documents": True,
    "convert_to_discrete_action_space": True
})

print(lts_20_3.action_space)

Run python interactions.py to take a look at the action space.

As you can see from the output MultiDiscrete([20 20 20]), our action space lets us select 1 out of 20 candidate documents for each of 3 different slots.

This already gives us a complexity of 20 * 20 * 20 = 8000 unique possible combinations.

These kinds of recommendation problems will scale really fast in complexity!
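
The arithmetic behind this growth is easy to check. The sketch below counts slates in two ways: allowing the same document in several slots (which matches the 20 * 20 * 20 calculation above) and, as an additional assumption for comparison, requiring all documents in a slate to be distinct. Both function names are hypothetical.

```python
import math


def count_slates(num_candidates, slate_size):
    """Number of slates when each slot is chosen independently."""
    return num_candidates ** slate_size


def count_slates_without_repeats(num_candidates, slate_size):
    """Number of ordered slates in which no document appears twice."""
    return math.perm(num_candidates, slate_size)


for size in range(1, 5):
    print(size, count_slates(20, size), count_slates_without_repeats(20, size))
```

Each additional slot multiplies the number of possible actions by roughly the number of candidates, which is why slate problems become hard so quickly.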

To keep things simple and computation time short, let's pretend we are dealing with only 100 candidate documents for 1 slot.

To create the environment lts_100_1, add the following code:

lts_100_1 = LongTermSatisfactionRecSimEnv({
    "num_candidates": 100, 
    "slate_size": 1,
    "resample_documents": True
})
print(lts_100_1.action_space)

Run python interactions.py again.

As you can see from the output MultiDiscrete([100]), we are now dealing with an environment with 100 possible documents. This is much less complex than the 8,000 possible combinations before and makes the calculations much easier for selecting the best documents.

Customize the Environment
Remember, the LTS environment simulates a situation where a user of an online service interacts with items of content, which are characterized by their level of clickbaitiness (on a scale of 0 to 1).

Clickbaity items (choc) generate engagement but lead to a decrease in long-term satisfaction.

Nonclickbaity items (kale) increase satisfaction but do not generate as much engagement.

The challenge is to balance the two in order to achieve some long-term optimal trade-off, a very common scenario for content recommendation systems.

By default, this effect is very small.

Inspect the settings of the user model in a RecSim environment using the following code:

pprint(lts_100_1.environment._user_model._user_sampler._state_parameters)

Run python interactions.py to take a look.

We will now update some of these parameters to make the satisfaction effect larger for demonstration purposes (in other words, we assume that our users are more likely to become disengaged by too many clickbaity recommendations).

Run the following code to update the sensitivity, time_budget, choc_stddev, and kale_stddev in the user model:

lts_100_1.environment._user_model._user_sampler._state_parameters.update({
            "sensitivity": 0.1,
            "time_budget": 200,
            "choc_stddev": 0.01,
            "kale_stddev": 0.01
        })

pprint(lts_100_1.environment._user_model._user_sampler._state_parameters)

Inspect the updated settings by running python interactions.py.

Perform Random Actions
First, we want to explore how to interact with our environment using an agent that performs random actions.

Use the following code to define a function that steps through the environment until the user's time budget is depleted. We call this process one episode. The agent collects the rewards based on random recommendations.

Add the following code to define the function that runs one episode and call that function for the environment lts_100_1:

def run_one_episode_random(env, verbose = False):
  # Runs one episode until user time budget is depleted.
  # Returns all rewards in a list.

  # Rewards collected during this episode, one entry per step
  episode_rewards = []

  # Step counter
  step = 0

  # Reset environment before each episode
  env.reset()

  # User time budget
  done = False

  # Run episode until user time budget is depleted
  while not done:
    # Pick random action
    action = env.action_space.sample()
    # Perform action   
    obs, reward, done, _ = env.step(action)
    # Gather reward
    episode_rewards.append(reward)
    # Count step
    step += 1

    if verbose:
      print("Step %(s)d completed with a reward of %(r)d." % {'s': step, 'r': reward})

  return(episode_rewards)

random_rewards = run_one_episode_random(lts_100_1, verbose = True)

Run python interactions.py to observe the rewards.

As you can see from the output, the reward after each step (the user engagement—e.g., the watch time of a video) goes up and down without any real pattern.

We can also visualize these rewards using a simple plot. The plot will show each reward as a point and draw a trendline over time. Add the following code to create the plot random-rewards.png:

import matplotlib.pyplot as plt
#create scatterplot
y = random_rewards
x = range(0,len(y))
plt.scatter(x, y, c="lightgrey")

#calculate equation for trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)

#add trendline to plot
plt.plot(x, p(x))
plt.savefig("random-rewards.png")
Run python interactions.py again and open random-rewards.png.

You should see a relatively steady trendline.

Run python interactions.py again and open random-rewards.png.

If you do this a couple of times, you should see that the trendline sometimes goes up and sometimes goes down, which reflects the randomness of the agent's choices.

Our ultimate goal is to create an agent whose rewards get bigger over time, i.e., a trendline that goes mostly up instead of being flat.
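
The trendline in these plots is an ordinary least-squares line through the points (step, reward). As a sketch of what a degree-1 fit computes, the hypothetical helper below derives slope and intercept with plain Python; the sign of the slope tells you whether rewards tend to rise or fall over the episode.

```python
def linear_trend(values):
    """Return (slope, intercept) of the least-squares line through (i, values[i])."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Sum of squared deviations of x, and of the cross products of x and y.
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = linear_trend([1.0, 3.0, 5.0])
print(slope, intercept)
```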

Perform Greedy Actions
Let's run the same experiment using a greedy agent. That means our algorithm should always pick the most clickbaity item. What will happen? Let's find out!

The following code works just like the run_one_episode_random function; the only difference is that np.argmax([value for _, value in obs["doc"].items()]) will select the most clickbaity item.

def run_one_episode_greedy(env, verbose = False):
  # Runs one episode until user time budget is depleted.
  # Returns all rewards in a list.

  # Rewards collected during this episode, one entry per step
  episode_rewards = []

  # Step counter
  step = 0

  # Reset environment before each episode and keep the first observation
  obs = env.reset()
  done = False

  # Run episode until user time budget is depleted
  while not done:
    # Pick most clickbaity item
    sweetest = np.argmax([value for _, value in obs["doc"].items()])
    # Perform action   
    obs, reward, done, _ = env.step([sweetest])
    # Gather reward
    episode_rewards.append(reward)
    # Count step
    step += 1

    if verbose:
      print("Step %(s)d completed with a reward of %(r)d." % {'s': step, 'r': reward})

  return(episode_rewards)

greedy_rewards = run_one_episode_greedy(lts_100_1, verbose = False)

We will also create a plot for the greedy behavior that is stored under greedy-rewards.png.

Add the following code to the file:

y = greedy_rewards
x = range(0,len(y))
plt.clf() # clear plot
plt.scatter(x, y, c="lightgrey")

#calculate equation for trendline
z = np.polyfit(x, y, 1)
p = np.poly1d(z)

#add trendline to plot
plt.plot(x, p(x))
plt.savefig("greedy-rewards.png")
Run python interactions.py again and open greedy-rewards.png.

You should see that the trendline always points down and the rewards get smaller over time.

This shows that the user satisfaction goes down over time when they only consume clickbaity items.

Calculate a Random Baseline
Finally, before we continue to create more sophisticated agents using reinforcement learning, we want to find out which baseline these agents actually need to beat.

For this purpose, we will run 500 random episodes to get a reliable estimate of the mean total reward per episode.

The following function takes the run_one_episode_random function and executes it within a while loop that runs for a given number of episodes (500 by default).

The result of this loop will be the mean of all episode rewards, which is our random_baseline for this environment.

Add the following code to create the function and call it over 500 episodes:

def get_random_baseline_for_env(env, episodes=500, verbose=False):

  # counts number of elapsed episodes
  episodes_count = 0
  # Sum of rewards for each episode, one entry per episode
  episodes_all_rewards = [] 

  while episodes_count < episodes:
    episode_rewards = run_one_episode_random(env, verbose = False)
    episode_reward_sum = np.sum(episode_rewards)

    if verbose:
      print("Episode %(e)d elapsed with accumulated reward of %(r)d." % {'e': episodes_count, 'r': episode_reward_sum})
    elif episodes_count % 100 == 0:
      print(f" {episodes_count} ", end="")
    elif episodes_count % 10 == 0:
      print(".", end="")

    # Count one episode
    episodes_count += 1
    episodes_all_rewards.append(episode_reward_sum) 

  random_baseline = np.mean(episodes_all_rewards)

  print("%(E)d episodes ended with a mean accumulated reward of %(r)d per episode." % {'E': episodes, 'r': random_baseline})
  return(random_baseline)

random_baseline = get_random_baseline_for_env(lts_100_1, verbose = False)
print(random_baseline)

Run python interactions.py, which might take a couple of minutes to complete.

Inspect the output to see which reward is shown for the random baseline.

Now, whatever we do with reinforcement learning or any other technique, it must be able to beat at least this baseline if we are operating in this environment.
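
When comparing an agent against this baseline, it can help to know how noisy the estimate is. The sketch below, which assumes a hypothetical toy_episode function in place of a real RecSim episode, reports the mean together with its standard error. The estimate_baseline helper is not part of the lab files.

```python
import random
import statistics


def estimate_baseline(run_episode, episodes):
    """Return (mean, standard error) of the total reward over several episodes."""
    totals = [run_episode() for _ in range(episodes)]
    mean = statistics.mean(totals)
    stderr = statistics.stdev(totals) / len(totals) ** 0.5
    return mean, stderr


rng = random.Random(0)


def toy_episode():
    # Hypothetical stand-in for sum(run_one_episode_random(env)):
    # 20 steps with a reward drawn uniformly from [0, 1].
    return sum(rng.uniform(0.0, 1.0) for _ in range(20))


mean, stderr = estimate_baseline(toy_episode, episodes=500)
print(f"baseline: {mean:.2f} +/- {stderr:.2f}")
```

An agent whose mean episode reward is only within a couple of standard errors of the baseline has not clearly beaten random behavior.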

Build a Contextual Bandit Using RLlib

What Is RLlib?
RecSim provides us with a library that gives us easy access to customizable RL environments for recommendation scenarios.

To interact with an environment, we can code up primitive agents ourselves, simulating random or greedy behavior, for example (see the lab Set Up Simulated User Interactions).

There are different ways and technologies to build more complex agents and train them using RL.

One popular way is to use a Python library called RLlib.

RLlib is the reinforcement learning library of the Ray project, an open-source framework originally developed at UC Berkeley for scaling compute-intensive Python workloads, including deep RL.

RLlib offers support for production-level RL workloads while providing simple APIs for very different kinds of applications.

You can install RLlib using the command pip install "ray[rllib]" (already installed on this machine).

In addition to RLlib, we also need RecSim for the environment and PyTorch as the deep learning backend; you can install both using the following pip command: pip install recsim torch. Again, there is no need to run it here, as both are already preinstalled.

Click on the following Copy to Editor button to create a file train.py and add the necessary import statements for this lab:

# Imports
import ray
import gym
import numpy as np
from ray import tune
from pprint import pprint
import matplotlib.pyplot as plt
import progressbar
from os.path import exists
ray.init(object_store_memory=78643200) # Limit for this machine, delete when run on your own system

Create the Environment
RLlib comes with RecSim environments included so we can import them directly from there.

Add the following code to import the LongTermSatisfactionRecSimEnv:

# Environment
from ray.rllib.examples.env.recommender_system_envs_with_recsim import LongTermSatisfactionRecSimEnv

Again, we will make some customizations to make the satisfaction effect even stronger. But this time we are going to write a small wrapper function, which makes it a bit more flexible for us to pass these config parameters to a RecSim environment.

Add the following code to define the StrongLTS wrapper function:

def StrongLTS(env):
    env.environment._user_model._user_sampler._state_parameters.update({
            "sensitivity": 0.06,
            "time_budget": 120,
            "choc_stddev": 0.1,
            "kale_stddev": 0.1
        })
    return(env)

Then add the following code to create the environment strong_lts_20_2 using the wrapper:

lts_20_2 = LongTermSatisfactionRecSimEnv({
    "num_candidates": 20, 
    "slate_size": 2,
    "resample_documents": True,
    "convert_to_discrete_action_space": True
})

strong_lts_20_2 = StrongLTS(lts_20_2)

This will create an environment with a slate of 2 slots and 20 candidate documents, giving 20 * 20 = 400 possible slates.
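
A flat discrete action space over slates needs some rule that maps a single integer to one document per slot. One possible encoding, shown here purely as an assumption and not necessarily the one RLlib uses internally, is to read the integer as a number in base num_candidates. Both helper names are hypothetical.

```python
def flatten_slate(slate, num_candidates):
    """Encode a slate (list of document indices) as one integer."""
    index = 0
    for doc in slate:
        index = index * num_candidates + doc
    return index


def unflatten_slate(index, num_candidates, slate_size):
    """Decode the integer back into a slate; the inverse of flatten_slate."""
    slate = []
    for _ in range(slate_size):
        index, doc = divmod(index, num_candidates)
        slate.append(doc)
    return slate[::-1]


action = flatten_slate([3, 7], num_candidates=20)
print(action, unflatten_slate(action, num_candidates=20, slate_size=2))
```

With 20 candidates and 2 slots, the integers 0 to 399 cover every slate exactly once under this encoding.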

Calculate Random Baseline
Open the file random_baseline.py.

This file contains essentially the same functionality we have seen in the lab Set Up Simulated User Interactions.

The function get_random_baseline_for_env runs a custom number of episodes and returns the accumulated reward baseline for an agent acting with random behavior.

To access this function, add the following two lines to the script train.py:

# Calculate random baseline
from random_baseline import get_random_baseline_for_env
random_baseline = get_random_baseline_for_env(strong_lts_20_2, verbose = False)

Now run train.py to calculate the random baseline over 500 episodes (note that this can take a few minutes):

python train.py
The baseline should be around 1157. For convenience this value has been stored in a file called random_baseline.txt. You can always access it from there.

Use Trainers in RLlib
Let's go beyond the random agent.

Comment out the random_baseline calculation by clicking the Copy to Editor button below:

# random_baseline = get_random_baseline_for_env(strong_lts_20_2, verbose = False)
RLlib comes with a large number of prebuilt algorithms (trainers). You can find a complete list of these algorithms in the resources at the end.

In order to use one of these algorithms, you have to instantiate its associated trainer class.

Add the following code to import a contextual (linear) bandit trainer with Upper Confidence Bound (UCB) exploration:

# Trainer
from ray.rllib.agents.bandit import BanditLinUCBTrainer
from ray.rllib.agents.bandit.bandit import DEFAULT_CONFIG as BANDIT_DEFAULT_CONFIG
pprint(BANDIT_DEFAULT_CONFIG)
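
To give a feel for what Upper Confidence Bound exploration does, here is a sketch of the classic UCB1 rule for a plain multi-armed bandit. This is a deliberately simpler technique than the linear UCB trainer imported above (it ignores document features entirely), and the UCB1 class and the reward values are made up for illustration.

```python
import math


class UCB1:
    """Classic UCB1 for a context-free bandit (not RLlib's linear UCB)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms
        self.means = [0.0] * num_arms

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Score = estimated reward + exploration bonus that shrinks with use.
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the running mean reward for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


true_rewards = [0.1, 0.9]  # made-up, deterministic payoffs per arm
bandit = UCB1(len(true_rewards))
for _ in range(200):
    arm = bandit.select()
    bandit.update(arm, true_rewards[arm])
print(bandit.counts)
```

The bonus term makes rarely tried arms look attractive for a while, so the agent keeps exploring a little even after it has found a good arm.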

Configure a Trainer
Run python train.py and inspect the output.

What you see here are the default configurations for this trainer class.

As you can see, there are plenty of ways to customize trainers.

You can configure trainers in RLlib with a configuration dictionary.

Let's update some of these configs for our own needs.

For example, we want to specify the name of the environment and the environment setup.

Add the following custom config dictionary:

bandit_config = {
    "env": "strong_lts",
    "env_config": {
        "num_candidates": 20, 
        "slate_size": 2,
        "resample_documents": True,

        # Bandit-specific flags:
        "convert_to_discrete_action_space": True,
        # Convert "doc" key into "item" key.
        "wrap_for_bandits": True,
        # Set consistent random seeds for the environment
        "seed": 123,
    },
    # Seed for the trainer.
    "seed": 123
}

As you can see, we pass our environment here as a string value strong_lts.

The most convenient way to hook up a trainer from RLlib with a custom gym environment is to register the environment in Ray.

Add the following code to register our environment:

# Register the environment
tune.register_env("strong_lts", lambda env_config: StrongLTS(LongTermSatisfactionRecSimEnv(env_config)))

To actually create the trainer object with the above config, add the following code:

bandit_trainer = BanditLinUCBTrainer(config=bandit_config)

Finally, add the following code to inspect our trainer briefly:

pprint(bandit_trainer.get_config())
print(bandit_trainer.iteration)

Run python train.py to take a look.

You will see our trainer's configuration: for example, the env used and the expected action_space associated with it.

Also, our trainer is currently at iteration 0, which means this trainer has not been trained yet!

Let's change that.

Run a Single Training
It's time to run the first training!

One training iteration usually includes the following two steps:

Perform an action in the environment
Use the observed data (observations, actions taken, rewards) to update the policy model (e.g., a neural network) such that it would pick better actions in the future, leading to higher rewards.
Trainers in RLlib are trained using the .train method.

Add the following code to perform a single training call and print the results:

result = bandit_trainer.train()
del result["config"] # Erase config for better readability.
pprint(result)

Run python train.py and inspect the output.

You should see that our trainer is now at iteration 1.

So far, the trainer has not really learned anything but we can verify that the training is technically working correctly.

Run a Training Loop
We need to train our trainer many more times in order to find the best policy (model) that recommends the best actions for our environment.

In an RL setting, this can easily cover a couple of thousand episodes.

There are different learning approaches. In our case, we perform online learning where the model will be updated every time we receive the reward.

Add the following function, which performs a training over a given period of episodes. This function will also visualize the training progress and save it as a bandit-training.png file:

def run_training(trainer, episodes):
  # Train for n episodes and collect rewards.
  rewards = []
  for i in progressbar.progressbar(range(episodes)):
      # Update the model immediately on the received reward.
      result = trainer.train()
      # Collect rewards
      rewards.append(result["episode_reward_mean"])

  # Free up resources
  trainer.stop()

  # Plot episode rewards
  plt.figure(figsize=(10,7))
  start_at = 0
  smoothing_win = 200
  x = list(range(start_at, len(rewards)))
  y = [np.nanmean(rewards[max(i - smoothing_win, 0):i + 1]) for i in range(start_at, len(rewards))]
  plt.plot(x, y)
  plt.title("Average reward")
  plt.xlabel("Time/Training steps")

  # Add average random baseline reward (red line).
  with open ("random_baseline.txt", "r") as f:
    random_baseline = int(f.read())
  plt.axhline(y=random_baseline, color="r", linestyle="-")

  plt.savefig('bandit-training.png')
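
The smoothing line in run_training averages each reward with up to smoothing_win preceding values. As a sketch of that calculation in plain Python, the hypothetical helper below reproduces the same slicing and, like np.nanmean, skips NaN entries (which may occur if an iteration reports no mean reward).

```python
import math


def trailing_mean(values, window):
    """Average values[i] with up to `window` preceding values, ignoring NaN."""
    smoothed = []
    for i in range(len(values)):
        chunk = [v for v in values[max(i - window, 0):i + 1] if not math.isnan(v)]
        smoothed.append(sum(chunk) / len(chunk) if chunk else math.nan)
    return smoothed


print(trailing_mean([1.0, 2.0, 3.0, 4.0], window=1))
```

A larger window gives a smoother curve but reacts more slowly to changes in the trainer's performance.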

To actually run the training over 100 iterations, let's also add the following function call:

run_training(bandit_trainer, 100)

Before we run the script, we want to actually make sure that the result of the training process is saved.

Add the following lines to save the checkpoints and keep the latest checkpoint path in a separate file checkpoint_path.txt:

checkpoint_path = bandit_trainer.save("checkpoints")
print(f"Trainer was saved in '{checkpoint_path}' at iteration {bandit_trainer.iteration}.")

with open('checkpoint_path.txt', 'w') as f:
    f.write(checkpoint_path)

Now run the file python train.py.

Wait until the training is finished, indicated by the message Trainer was saved in 'checkpoints/checkpoint_...' at iteration 101.

Evaluate Training Performance
Let's find out how our trainer actually performs.

Open bandit-training.png and inspect the result.

You should see that the trainer performance actually drops after the first couple of iterations.

At some point, however, the trainer picks up a better policy and successively gets closer to the random baseline (red horizontal line).

If you like, you can rerun this training process with more than 100 iterations and see how far the total reward gets.

At some point, the trainer will hit a plateau very close to the baseline. Why is that? Let's find out by inspecting the trainer output!

Trainer Inference—Inspect Outputs
To find out what our trainer actually recommends, let's run some inference:

Open the file inference.py.

This file restores our trainer from the checkpoint and runs inference in a given environment.

Run python inference.py and inspect the result.

The output shows the observed documents and the action that the trainer suggests.

If you look carefully, you will see that the bandit always chooses the document with the highest feature value, i.e., the most clickbaity item. You can run python inference.py a couple more times to back up our assumption.

The reason for this behavior is that the bandit has learned to optimize for short-term rewards.

And that's why it won't be able to capture the long-term goal of keeping the user satisfaction high as well.

For that we will need a different algorithm that can handle a multigoal objective.
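
One way to see the difference between a short-term and a long-term objective is the discounted return, in which a factor gamma between 0 and 1 controls how much future rewards count. The sketch below uses two made-up reward sequences (not outputs of this lab): with gamma = 0 only the immediate reward matters, which resembles the bandit's view, while gamma = 1 counts the whole episode equally.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Made-up per-step rewards: choc starts high and decays, kale starts lower and grows.
choc_rewards = [2.5, 2.0, 1.5, 1.0, 0.5]
kale_rewards = [2.0, 2.2, 2.4, 2.6, 2.8]

for gamma in (0.0, 1.0):
    print(gamma, discounted_return(choc_rewards, gamma), discounted_return(kale_rewards, gamma))
```

Under the short-sighted view choc looks better, while over the whole episode kale collects more reward.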

We will try to improve this behavior using the SlateQ algorithm in the next lab Train a Deep Neural Network with SlateQ.

Train a Deep Neural Network with SlateQ

Imports
Click the Copy to Editor button below to create a file train.py and import the following packages:

# Imports

import ray
import gym
import numpy as np
from ray import tune
from pprint import pprint
import matplotlib.pyplot as plt
import progressbar
from os.path import exists
ray.init(object_store_memory=78643200)

These packages allow us to set up our reinforcement learning environment and implement state-of-the-art deep learning algorithms without having to write too much boilerplate code.

Environment
Let's define our environment using RecSim via RLlib just like we did in the lab Build a Contextual Bandit Using RLlib.

This time, however, we can't use a simple function to modify the environment parameters.

The SlateQ algorithm requires an observation space with information (features) about the users.

To add this information, we can define a short wrapper class.

This wrapper class StrongLTS will:

Increase the satisfaction effect in the user model as in the previous labs
Modify the observation space so it includes information about the user's state
Add the class by copying the following code:

# Environment

from ray.rllib.examples.env.recommender_system_envs_with_recsim import LongTermSatisfactionRecSimEnv

# LTS (Long Term Satisfaction) wrapper
class StrongLTS(gym.ObservationWrapper):

    def __init__(self, env):
        # Change user model
        env.environment._user_model._user_sampler._state_parameters.update({
            "sensitivity": 0.06,
            "time_budget": 120,
            "choc_stddev": 0.1,
            "kale_stddev": 0.1
        })

        super().__init__(env)

        # Add user features to observation space
        if "response" in self.observation_space.spaces:
            self.observation_space.spaces["user"] = gym.spaces.Box(0.0, 1.0, (1, ), dtype=np.float32)
            for r in self.observation_space["response"]:
                if "engagement" in r.spaces:
                    r.spaces["watch_time"] = r.spaces["engagement"]
                    del r.spaces["engagement"]
                    break

    def observation(self, observation):
        if "response" in self.observation_space.spaces:
            observation["user"] = np.array([self.env.environment._user_model._user_state.satisfaction])
            for r in observation["response"]:
                if "engagement" in r:
                    r["watch_time"] = r["engagement"]
                    del r["engagement"]
        return observation
Finally, add the following line to register the environment in Ray:

tune.register_env("strong_lts", lambda env_config: StrongLTS(LongTermSatisfactionRecSimEnv(env_config)))
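
To see what the wrapper's observation method does to a single observation, here is a minimal standalone sketch that works on a plain dictionary and needs neither gym nor RecSim. The function name rename_engagement and the sample observation are made up for illustration, and the real RecSim observation contains more fields. Unlike the wrapper, the sketch works on a copy instead of changing the observation in place.

```python
import copy

def rename_engagement(observation, satisfaction):
    """Mimic what StrongLTS.observation() does, on a plain dict (illustrative only)."""
    obs = copy.deepcopy(observation)  # leave the caller's dict untouched
    if "response" in obs:
        # Expose the user's satisfaction as a user feature.
        obs["user"] = [satisfaction]
        # Rename "engagement" to "watch_time" in every per-document response.
        for response in obs["response"]:
            if "engagement" in response:
                response["watch_time"] = response.pop("engagement")
    return obs

# Made-up sample observation for a slate of two documents.
sample = {
    "user": [],
    "doc": {"0": [0.2], "1": [0.9]},
    "response": [
        {"click": 1, "engagement": 3.5},
        {"click": 0, "engagement": 0.0},
    ],
}

transformed = rename_engagement(sample, satisfaction=0.7)
print(transformed["user"])      # the user feature now holds the satisfaction value
print(transformed["response"])  # "engagement" has become "watch_time"
```

The two changes match the two tasks of the wrapper listed above: the user entry is filled with a feature, and the response key is renamed.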

Configure the Trainer
Thanks to the APIs in RLlib, setting up and configuring the SlateQ trainer works much like setting up the bandit.

Add the following code to import the trainer object from RLlib and pass the configuration parameters in a dictionary:

from ray.rllib.agents.slateq import SlateQTrainer

slateq_config = {
    "env": "strong_lts",
    "framework": "torch",
    "env_config": {
        "num_candidates": 20, 
        "slate_size": 2,
        "resample_documents": True
    }
}

slateq_trainer = SlateQTrainer(config=slateq_config)
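
To get a feeling for the size of the decision the agent faces with these settings, the following sketch counts the possible slates. It assumes that a slate is an ordered selection of distinct documents, which is an assumption made for illustration and not something the configuration itself states.

```python
from itertools import permutations

num_candidates = 20  # same value as in env_config
slate_size = 2       # same value as in env_config

# Assumption: a slate is an ordered selection of distinct documents.
num_slates = sum(1 for _ in permutations(range(num_candidates), slate_size))
print(f"{num_slates} possible slates")  # 20 * 19 = 380
```

Under this assumption the number of slates grows quickly with the slate size, which is one way to see why recommending a slate is a harder problem than recommending a single item.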
Let's go ahead and train the trainer.

Training
Add the following training function, which we used in the lab Build a Contextual Bandit Using RLlib.

This function runs a specified number of training iterations (the episodes argument) and saves a chart of the training progress to a PNG file:

# Training

def run_training(trainer, episodes):
    rewards = []
    for i in progressbar.progressbar(range(episodes)):
        result = trainer.train()
        rewards.append(result["episode_reward_mean"])

    trainer.stop()
    plt.figure(figsize=(10, 7))
    start_at = 0
    smoothing_win = 200
    x = list(range(start_at, len(rewards)))
    y = [np.nanmean(rewards[max(i - smoothing_win, 0):i + 1]) for i in range(start_at, len(rewards))]
    plt.plot(x, y)
    plt.title("Average reward")
    plt.xlabel("Time/Training steps")

    # Add random baseline reward (red line).
    with open("random_baseline.txt", "r") as f:
        random_baseline = int(f.read())
    plt.axhline(y=random_baseline, color="r", linestyle="-")

    plt.savefig('slateq-training.png')
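
The smoothing step in the function above averages each reward with the values that came before it and skips NaN entries. The following sketch reproduces that logic with the standard library only so you can try it on a short list. The helper name trailing_mean and the reward values are made up for illustration.

```python
import math

def trailing_mean(values, window):
    """Average each point with up to `window` preceding points, ignoring NaNs."""
    smoothed = []
    for i in range(len(values)):
        # Same slice as in run_training: from i - window (at least 0) up to and including i.
        chunk = [v for v in values[max(i - window, 0):i + 1] if not math.isnan(v)]
        smoothed.append(sum(chunk) / len(chunk) if chunk else math.nan)
    return smoothed

rewards = [10.0, float("nan"), 30.0, 50.0]  # made-up values
print(trailing_mean(rewards, window=2))  # [10.0, 10.0, 20.0, 40.0]
```

Note that with smoothing_win set to 200 and only a handful of training iterations, every point of the chart is simply the average of all rewards collected so far.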

Call the function for 5 episodes by adding the following code:

# Run Training
run_training(slateq_trainer, 5)

You will see that training the SlateQ model takes much longer than training the bandit. The model is much heavier and more complex, but it is also more capable.

Save Checkpoints
Before we perform the training, add the following code to save a checkpoint and write the checkpoint path to a separate file:

# Save checkpoints
checkpoint_path = slateq_trainer.save("checkpoints")
print(f"Trainer was saved in '{checkpoint_path}' at iteration {slateq_trainer.iteration}.")

with open('checkpoint_path.txt', 'w') as f:
    f.write(checkpoint_path)

Now run the training job:

python train.py
This will probably take 2-3 minutes to complete.

Inspect the Training Progress
Once the training is finished, take a look at the chart slateq-training.png.

Depending on the decisions the algorithm makes at the beginning, the reward curve starts either above or below the baseline.

Also, the accumulated reward might still decrease during the first few episodes. This is because the algorithm is still exploring to find a good policy.

So far, we have trained for only 5 training iterations, so that's fine.

How can we improve the algorithm to achieve higher rewards? The easiest way is to train more!

As long as the average reward curve is still moving up or down rather than running flat, the algorithm is probably still learning or exploring.
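
If you prefer a number over eyeballing the chart, one rough way to check whether the curve is still moving is to compare the average of the most recent rewards with the average of the rewards just before them. The helper still_changing, its window and tolerance parameters, and the reward lists are all made up for illustration. This is a simple heuristic, not part of RLlib.

```python
from statistics import mean

def still_changing(rewards, window=5, tolerance=0.01):
    """Return True if the mean of the last `window` rewards differs from the
    mean of the `window` rewards before them by more than `tolerance` (relative)."""
    if len(rewards) < 2 * window:
        return True  # too little data to call it a plateau
    previous = mean(rewards[-2 * window:-window])
    recent = mean(rewards[-window:])
    scale = max(abs(previous), 1e-9)  # avoid division by zero
    return abs(recent - previous) / scale > tolerance

rising = [1000, 1020, 1040, 1060, 1080, 1100, 1120, 1140, 1160, 1180]
flat = [1160, 1161, 1159, 1160, 1160, 1161, 1160, 1159, 1160, 1160]
print(still_changing(rising))  # True: the curve is still moving
print(still_changing(flat))    # False: the curve has flattened out
```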

To continue the training where we left off, add the following code before calling the training function to restore the trainer from the saved checkpoints:

# Run Training
if exists("checkpoint_path.txt"):
    # Restore trainer
    with open("checkpoint_path.txt", "r") as f:
        checkpoint_path = f.read()
    slateq_trainer.restore(checkpoint_path)
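
The restore-train-save cycle can be tried out in isolation with the following sketch. ToyTrainer and run_once are hypothetical stand-ins written for this example. They are not RLlib classes or functions and only count iterations, but they show why each run of the script continues where the previous one stopped.

```python
import json
import os
import tempfile

class ToyTrainer:
    """Hypothetical stand-in for a trainer; it only counts iterations."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        self.iteration += 1

    def save(self, directory):
        path = os.path.join(directory, f"checkpoint-{self.iteration}.json")
        with open(path, "w") as f:
            json.dump({"iteration": self.iteration}, f)
        return path

    def restore(self, path):
        with open(path) as f:
            self.iteration = json.load(f)["iteration"]

def run_once(workdir, iterations):
    """One run of the script: restore if possible, train, then save."""
    pointer = os.path.join(workdir, "checkpoint_path.txt")
    trainer = ToyTrainer()
    if os.path.exists(pointer):
        with open(pointer) as f:
            trainer.restore(f.read().strip())
    for _ in range(iterations):
        trainer.train()
    with open(pointer, "w") as f:
        f.write(trainer.save(workdir))
    return trainer.iteration

with tempfile.TemporaryDirectory() as workdir:
    first = run_once(workdir, 5)
    second = run_once(workdir, 5)
print(first, second)  # 5 10
```

The second run starts from the saved state, so the iteration counter ends at 10 instead of starting again from zero.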

Now run the training script:

python train.py
Once the training is finished, check the chart slateq-training.png to inspect your training progress.

You should see how the overall reward improves episode over episode. If not, try running the training a few more times.

After around 40 episodes, the SlateQ algorithm should perform much better than the bandit and the baseline, hitting an average accumulated reward of more than 1160.

Of course, there are many more ways to improve the training performance. One of them is to tune the algorithm's hyperparameters using a grid search.

While hyperparameter tuning is beyond the scope of this tutorial, you will find some useful links in the resources at the end.
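
For orientation only, the idea behind a grid search fits in a few lines: try every combination of a few candidate values and keep the best one. In the sketch below, the evaluate function is a made-up stand-in for "train with these settings and return the mean reward", and the hyperparameter names lr and batch_size and their values are chosen purely for illustration.

```python
from itertools import product

def evaluate(lr, batch_size):
    """Hypothetical stand-in for a full training run; the formula is invented
    so that the example runs instantly."""
    return 1000 - 1e5 * abs(lr - 0.001) - abs(batch_size - 32)

grid = {
    "lr": [0.0001, 0.001, 0.01],
    "batch_size": [16, 32, 64],
}

# Evaluate every combination of the candidate values.
results = []
for lr, batch_size in product(grid["lr"], grid["batch_size"]):
    score = evaluate(lr, batch_size)
    results.append((score, {"lr": lr, "batch_size": batch_size}))

best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, best_score)
```

In a real setting each call to evaluate would be a complete training run, which is why a grid search becomes expensive as the number of values and hyperparameters grows.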

Inspect the Policy
Let's take a look under the hood of our trainer to understand the policy's model even better.

First, be sure to comment out the run_training command since we don't want to run another training loop.

You can comment out the line manually so that it reads as follows:

#run_training(slateq_trainer, 5)
Now add the following code to show the trainer policy as well as the underlying model:

policy = slateq_trainer.get_policy()
model = policy.model
print(f"Policy: {policy}")
print(f"Policy's model: {model}")
Execute the training file again:

python train.py
As you can see from the output, the model is a sequential deep learning model, which you might recognize from supervised machine learning.

It contains inputs (the observations) and outputs (the actions).

As with any deep learning model, you can also train this model on historical data.

In the context of reinforcement learning, this process is called offline learning, another powerful technique that can help you improve the quality of your model.

Offline learning will also help you if you don't have a training environment but want to learn purely from historical data. If you want to find out more about offline learning, check out the resource links at the end of this lab.