## Table Of Contents
1. [Problem Statement](#ps)
2. [Environment](#env)
3. [Observations](#obs)
4. [Declaring random and full slate Q agents](#create)
5. [Training and evaluating agents](#train)
6. [Results](#result)

Inspired by the [RecSim Overview tutorial](https://github.com/google-research/recsim/blob/master/recsim/colab/RecSim_Overview.ipynb).

In [1]:
import numpy as np
import tensorflow as tf

from recsim.environments import long_term_satisfaction
from recsim.agents import full_slate_q_agent, random_agent
from recsim.simulator import runner_lib

<a id='ps'></a>

## Problem Statement

Most practical recommender systems focus on estimating immediate user engagement without considering the long-term effects of recommendations on user behaviour. Reinforcement learning (RL) methods offer the potential to optimize recommendations for long-term user engagement. However, since users are often presented with slates of multiple items—which may have interacting effects on user choice—methods are required to deal with the combinatorics of the RL action space.
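To get a feel for the combinatorics mentioned above, the short sketch below counts how many distinct slates exist for a given corpus size. The helper name `count_slates` is ours, introduced only for illustration, and we assume a slate contains no repeated items.

```python
from math import comb, perm


def count_slates(num_candidates, slate_size, ordered=True):
    """Count slates of distinct items drawn from a candidate set.

    ordered=True treats the position of an item in the slate as meaningful;
    ordered=False counts unordered item sets.
    """
    if ordered:
        return perm(num_candidates, slate_size)
    return comb(num_candidates, slate_size)


# Small corpus, like the one used in this notebook.
print(count_slates(10, 3))                 # 720 ordered slates
print(count_slates(10, 3, ordered=False))  # 120 unordered slates

# The count grows quickly with corpus and slate size.
print(count_slates(100, 5))                # 9034502400
```

Even for a modest corpus, a Q-function with one output per slate becomes impractical, which is what motivates a decomposition over items.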

Google’s SlateQ algorithm addresses this challenge by decomposing the long-term value (LTV) of a slate of items into a tractable function of its component item-wise LTVs. For our project, we want to compare the efficiency of SlateQ to other RL methods like Deep Q-Networks that don’t decompose the LTV of a slate into its component-wise LTVs.
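As a rough illustration of the decomposition, SlateQ writes the value of a slate as a choice-probability-weighted sum of item-wise values, $Q(s, A) = \sum_{i \in A} P(i \mid s, A)\, \bar{Q}(s, i)$. The sketch below evaluates that sum for made-up numbers. The function name, the simple score-proportional choice model, and the assumption that a "no click" outcome contributes zero value are all simplifications of ours, not RecSim code.

```python
import numpy as np


def slate_q_value(item_ltvs, choice_scores, no_click_score=1.0):
    """Combine item-wise LTVs into a slate value.

    item_ltvs:      estimated long-term value of each item in the slate.
    choice_scores:  unnormalized user affinity for each item (hypothetical).
    no_click_score: unnormalized score of selecting nothing (hypothetical);
                    the no-click outcome is assumed to contribute zero value.
    """
    item_ltvs = np.asarray(item_ltvs, dtype=float)
    choice_scores = np.asarray(choice_scores, dtype=float)
    # Probability that the user picks each item from this slate.
    choice_probs = choice_scores / (choice_scores.sum() + no_click_score)
    return float(np.dot(choice_probs, item_ltvs))


# Three items with LTVs 10, 20, 30 and choice probabilities 0.2, 0.2, 0.4.
print(slate_q_value([10.0, 20.0, 30.0], [1.0, 1.0, 2.0]))
```

Only one value per item has to be learned here, rather than one per slate.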

<a id='env'></a>

## Environment

The `long_term_satisfaction` environment depicts a situation in which a user of an online service interacts with items of content, which are characterized by their level of clickbaitiness (on a scale of 0 to 1).

Clickbaity items generate engagement but decrease long-term satisfaction.
Non-clickbaity items increase satisfaction but do not generate as much engagement.

The challenge is to balance the two in order to achieve an optimal long-term trade-off.
The dynamics of this system are partially observable because satisfaction is a latent variable: it has to be inferred from increases and decreases in engagement.
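The trade-off described above can be made concrete with a toy model. The sketch below is not RecSim's implementation: the update rule and the parameters `memory_discount` and `sensitivity` are illustrative assumptions, chosen only so that clickbaity items (level above 0.5) push a latent satisfaction value down and non-clickbaity items push it up.

```python
import math


def simulate_satisfaction(clickbait_levels, memory_discount=0.9, sensitivity=0.3):
    """Toy latent-satisfaction trajectory for a sequence of consumed items.

    clickbait_levels: clickbaitiness in [0, 1] of each consumed item.
    Returns the satisfaction value in (0, 1) after each item.
    """
    net_exposure = 0.0
    trajectory = []
    for level in clickbait_levels:
        # Older exposure fades; the new item adds a signed contribution.
        net_exposure = memory_discount * net_exposure - 2.0 * (level - 0.5)
        # Squash the running exposure into a satisfaction value.
        satisfaction = 1.0 / (1.0 + math.exp(-sensitivity * net_exposure))
        trajectory.append(satisfaction)
    return trajectory


only_clickbait = simulate_satisfaction([1.0] * 20)
no_clickbait = simulate_satisfaction([0.0] * 20)
print(only_clickbait[-1] < 0.5 < no_clickbait[-1])  # True
```

In a model like this, an agent that only maximizes immediate engagement would keep serving clickbait and erode the satisfaction that future engagement depends on.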

![title](../images/simulator.png)

<a id='obs'></a>

## Observations

A RecSim observation is a dictionary with 3 keys:

- 'user', which represents the 'User Observable Features' in the structure diagram above,
- 'doc', containing the current corpus of recommendable videos and their observable features ('Document Observable Features'),
- and 'response', indicating the user's response to the last slate of recommendations ('User Response'). At this stage the 'response' key is vacuous and will be set to None, as no recommendation has been made yet.
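Because 'response' is None before the first recommendation, code that consumes observations has to handle that case. The sketch below shows one way to do so on a hand-written observation with the same three keys; the helper `describe_observation` and the mock values are ours, not part of RecSim.

```python
def describe_observation(observation):
    """Summarize a dictionary with 'user', 'doc' and 'response' keys."""
    response = observation['response']
    return {
        'num_user_features': len(observation['user']),
        'num_docs': len(observation['doc']),
        # No response exists until a slate has been recommended.
        'has_response': response is not None,
    }


# Hand-written stand-in for an initial observation.
mock_observation = {
    'user': [],
    'doc': {'10': [0.79], '11': [0.53]},
    'response': None,
}
print(describe_observation(mock_observation))
```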

In [2]:
# configure the environment and sample an initial observation

seed = 0
np.random.seed(seed)
env_config = {
  'num_candidates': 10,  # number of videos to choose from
  'slate_size': 3,
  'resample_documents': True,
  'seed': seed,
  }
ie_environment = long_term_satisfaction.create_environment(env_config)
initial_observation = ie_environment.reset()

In [3]:
# print sample record

print('User Observable Features')
print(initial_observation['user'])
print('User Response')
print(initial_observation['response'])
print('Document Observable Features')
for doc_id, doc_features in initial_observation['doc'].items():
  print('ID:', doc_id, 'features:', doc_features)

User Observable Features
[]
User Response
None
Document Observable Features
ID: 10 features: [0.79172504]
ID: 11 features: [0.52889492]
ID: 12 features: [0.56804456]
ID: 13 features: [0.92559664]
ID: 14 features: [0.07103606]
ID: 15 features: [0.0871293]
ID: 16 features: [0.0202184]
ID: 17 features: [0.83261985]
ID: 18 features: [0.77815675]
ID: 19 features: [0.87001215]


The document space consists of a corpus of 10 candidate documents, each represented by a single feature: its level of clickbaitiness.

In [4]:
# print the document observation space

print('Document observation space')
for key, space in ie_environment.observation_space['doc'].spaces.items():
  print(key, ':', space)

Document observation space
10 : Box(1,)
11 : Box(1,)
12 : Box(1,)
13 : Box(1,)
14 : Box(1,)
15 : Box(1,)
16 : Box(1,)
17 : Box(1,)
18 : Box(1,)
19 : Box(1,)


The user response has the following elements for each item in the slate:

- click: a boolean indicating whether the video was clicked.
- engagement: a nonnegative real number representing the degree of engagement, e.g. seconds watched from a recommended video.
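A response tuple with these two fields can be reduced to a single number by summing the engagement of clicked items. This matches the sample step later in this notebook, where the printed reward equals the engagement of the one clicked item, but treating it as the general reward definition is our assumption. The helper name is hypothetical.

```python
def total_clicked_engagement(responses):
    """Sum engagement over the items in a slate that were clicked."""
    return sum((r['engagement'] for r in responses if r['click']), 0.0)


# Response copied from the sample step shown later in this notebook.
sample_response = (
    {'click': 1, 'engagement': 64.61894045997103},
    {'click': 0, 'engagement': 0.0},
    {'click': 0, 'engagement': 0.0},
)
print(total_clicked_engagement(sample_response))  # 64.61894045997103
```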

In [5]:
# print the response observation space

print('Response observation space')
print(ie_environment.observation_space['response'])

Response observation space
Tuple(Dict(click:Discrete(2), engagement:Box()), Dict(click:Discrete(2), engagement:Box()), Dict(click:Discrete(2), engagement:Box()))


The user observation space holds the user's observable features. In this environment it is empty (`Box(0,)`), because the user's satisfaction is latent.

In [6]:
# print the user observation space

print('User observation space')
print(ie_environment.observation_space['user'])

User observation space
Box(0,)


A RecSim slate is a list of indices into the documents in observation['doc']. With a slate size of 3, for example, the slate [0, 1, 2] corresponds to the first three documents:

In [7]:
slate = [0, 1, 2]
for slate_doc in slate:
  print(list(initial_observation['doc'].items())[slate_doc])

('10', array([0.79172504]))
('11', array([0.52889492]))
('12', array([0.56804456]))


In [8]:
ie_environment.action_space

MultiDiscrete([10 10 10])
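The action space has one dimension per slate position, each ranging over the 10 candidate indices, so nominally it also contains slates that repeat a document. A random policy that avoids repeats can be sketched with NumPy as below. The helper `sample_slate` is ours; whether the environment accepts slates with duplicate indices is not shown in this notebook.

```python
import numpy as np


def sample_slate(num_candidates, slate_size, rng):
    """Draw slate_size distinct document indices uniformly at random."""
    return rng.choice(num_candidates, size=slate_size, replace=False).tolist()


rng = np.random.default_rng(0)
random_slate = sample_slate(num_candidates=10, slate_size=3, rng=rng)
print(random_slate)
```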

In [9]:
# take one step with the slate and print the observation, reward and done flag

observation, reward, done, _ = ie_environment.step(slate)
print(observation)
print(reward)
print(done)

{'user': array([], dtype=float64), 'doc': OrderedDict([('20', array([0.97861834])), ('21', array([0.79915856])), ('22', array([0.46147936])), ('23', array([0.78052918])), ('24', array([0.11827443])), ('25', array([0.63992102])), ('26', array([0.14335329])), ('27', array([0.94466892])), ('28', array([0.52184832])), ('29', array([0.41466194]))]), 'response': ({'click': 1, 'engagement': 64.61894045997103}, {'click': 0, 'engagement': 0.0}, {'click': 0, 'engagement': 0.0})}
64.61894045997103
False
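The single step above generalizes to a full episode by repeating `step` until `done` is True. The sketch below shows that loop. So that it runs on its own, it uses `ToyEnv`, a hypothetical stand-in with a dummy reward that only mimics the `reset`/`step` return shapes seen in the cells above; `run_episode` and `choose_slate` are also names of ours.

```python
class ToyEnv:
    """Hypothetical stand-in environment with a fixed episode length."""

    def __init__(self, episode_length=5):
        self._episode_length = episode_length
        self._steps_taken = 0

    def reset(self):
        self._steps_taken = 0
        return {'user': [], 'doc': {}, 'response': None}

    def step(self, slate):
        self._steps_taken += 1
        reward = float(len(slate))  # dummy reward, not an engagement model
        done = self._steps_taken >= self._episode_length
        observation = {'user': [], 'doc': {}, 'response': ()}
        return observation, reward, done, {}


def run_episode(env, choose_slate, max_steps=1000):
    """Run one episode; return (total reward, number of steps taken)."""
    observation = env.reset()
    total_reward = 0.0
    steps = 0
    while steps < max_steps:
        slate = choose_slate(observation)
        observation, reward, done, _ = env.step(slate)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# A fixed policy that always recommends the first three documents.
print(run_episode(ToyEnv(episode_length=5), lambda obs: [0, 1, 2]))  # (15.0, 5)
```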


<a id='create'></a>

## Creating Agents

In [13]:
# creating agents

def create_q_agent(sess, environment, eval_mode, summary_writer=None):
  kwargs = {
      'observation_space': environment.observation_space,
      'action_space': environment.action_space,
      'summary_writer': summary_writer,
      'eval_mode': eval_mode,
  }
  return full_slate_q_agent.FullSlateQAgent(sess, **kwargs)


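def create_random_agent(sess, environment, eval_mode, summary_writer=None):
  # sess, eval_mode and summary_writer are unused here; they are accepted so
  # the signature matches create_q_agent, since both functions are passed to
  # the runners as create_agent_fn.
  kwargs = {
      'action_space': environment.action_space,
  }
  return random_agent.RandomAgent(**kwargs)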

<a id='train'></a>

## Training the Agents

### Configuring the Environment

In [14]:
# environment config

seed = 0
np.random.seed(seed)
env_config = {
  'num_candidates': 10,
  'slate_size': 2,
  'resample_documents': True,
  'seed': seed,
  }

### Training the Random Agent

In [15]:
# training

tmp_random_dir = './results/baseline_random/'
runner = runner_lib.TrainRunner(
    base_dir=tmp_random_dir,
    create_agent_fn=create_random_agent,
    env=long_term_satisfaction.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10)
runner.run_experiment()

INFO:tensorflow:max_training_steps = 50, number_iterations = 10,checkpoint frequency = 1 iterations.
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Beginning training...
INFO:tensorflow:Reloaded checkpoint and will start from iteration 10

### Training the Standard Full Slate Q-Agent

In [16]:
# training

tmp_q_dir = './results/baseline_fullq/'
runner = runner_lib.TrainRunner(
    base_dir=tmp_q_dir,
    create_agent_fn=create_q_agent,
    env=long_term_satisfaction.create_environment(env_config),
    episode_log_file="",
    max_training_steps=50,
    num_iterations=10)
runner.run_experiment()

INFO:tensorflow:max_training_steps = 50, number_iterations = 10,checkpoint frequency = 1 iterations.
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating FullSlateQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x13cfd65c0>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 1)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000

Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.

Instructions for updating:
If using Keras pass *_constraint arguments to layers.

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor

Instructions for updating:
Please use tf.global_variables instead.

INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning training...
INFO:tensorflow:Restoring parameters from ./results/baseline_fullq/train/checkpoints/tf_ckpt-9
INFO:tensorflow:Reloaded checkpoint and will start from iteration 10

### Evaluating the Random Agent

In [17]:
# evaluating

runner = runner_lib.EvalRunner(
      base_dir=tmp_random_dir,
      create_agent_fn=create_random_agent,
      env=long_term_satisfaction.create_environment(env_config),
      max_eval_episodes=10,
      test_mode=True)
runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 10
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Beginning evaluation...
INFO:tensorflow:eval_file: ./results/baseline_random/eval_10/returns_600

### Evaluating the Q Agent

In [18]:
# evaluating

runner = runner_lib.EvalRunner(
      base_dir=tmp_q_dir,
      create_agent_fn=create_q_agent,
      env=long_term_satisfaction.create_environment(env_config),
      max_eval_episodes=10,
      test_mode=True)
runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 5
INFO:tensorflow:max_steps_per_episode = 27000
INFO:tensorflow:Creating FullSlateQAgent agent with the following parameters:
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:	 update_horizon: 1.000000
INFO:tensorflow:	 min_replay_history: 20000
INFO:tensorflow:	 update_period: 4
INFO:tensorflow:	 target_update_period: 8000
INFO:tensorflow:	 epsilon_train: 0.010000
INFO:tensorflow:	 epsilon_eval: 0.001000
INFO:tensorflow:	 epsilon_decay_period: 250000
INFO:tensorflow:	 tf_device: /cpu:*
INFO:tensorflow:	 use_staging: True
INFO:tensorflow:	 optimizer: <tensorflow.python.training.rmsprop.RMSPropOptimizer object at 0x148a67ef0>
INFO:tensorflow:	 max_tf_checkpoints_to_keep: 4
INFO:tensorflow:Creating a OutOfGraphReplayBuffer replay memory with the following parameters:
INFO:tensorflow:	 observation_shape: (11, 1)
INFO:tensorflow:	 observation_dtype: float32
INFO:tensorflow:	 terminal_dtype: <class 'numpy.uint8'>
INFO:tensorflow:	 stack_size: 1
INFO:tensorflow:	 replay_capacity: 1000000
INFO:tensorflow:	 batch_size: 32
INFO:tensorflow:	 update_horizon: 1
INFO:tensorflow:	 gamma: 0.990000
INFO:tensorflow:legacy_checkpoint_load: False
INFO:tensorflow:Beginning evaluation...
INFO:tensorflow:Restoring parameters from ./results/baseline_fullq/train/checkpoints/tf_ckpt-9
INFO:tensorflow:eval_file: ./results/baseline_fullq/eval_5/returns_600

<a id='result'></a>

## Viewing Results

In [20]:
%load_ext tensorboard

### Random Agent Results

In [21]:
%tensorboard --logdir=./results/baseline_random/

### Q Agent Results

In [22]:
%reload_ext tensorboard

In [24]:
%tensorboard --logdir=./results/baseline_fullq/ --port=8008