<a href="https://colab.research.google.com/github/ZiminPark/recsim/blob/master/recsim/colab/RecSim_Developing_an_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Developing an Agent

Let's move on to the last piece of the puzzle: developing an agent. This tutorial covers two things.

* basics: what data RecSim passes to an agent and what it expects the agent to return.

* design: which RecSim features are needed when developing agents.

# Basics


<p align="center"><img width="50%" src="https://github.com/google-research/recsim/blob/master/recsim/colab/figures/recsim_architecture_agent_centered.png?raw=true" /></p>

An agent must consume the following:

* observations of the user's state,
* observations of the user's response to recommendations,
* the available documents and their feature vectors.

In return, the agent produces a slate of $K$ recommendations and passes it to the user's choice and transition models.

To illustrate RecSim's agent API, we use a simple bandit agent for RecSim's *interest exploration* environment.

*Interest exploration* is a clustered bandit problem. We assume that the world consists of documents that can be clustered into topics, and that users can be clustered as well.

A user's affinity for a document is the document's quality plus the user's affinity for the document's topic.
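To make the affinity rule concrete, here is a minimal sketch, assuming a multinomial logit (softmax) choice over the slate plus a no-click option. The function name `click_probabilities`, the `no_click_score` parameter, and the example numbers are illustrative assumptions, not values taken from the environment.

```python
import math


def click_probabilities(qualities, topics, topic_affinity, no_click_score=0.0):
    """Softmax over (quality + topic affinity), with a final no-click entry."""
    # Affinity of the user for each document: quality plus topic affinity.
    scores = [q + topic_affinity[t] for q, t in zip(qualities, topics)]
    scores.append(no_click_score)  # assumed score of clicking nothing
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical user who prefers topic 1, shown one document from each topic.
probs = click_probabilities(
    qualities=[2.1, 1.3], topics=[0, 1], topic_affinity={0: 0.0, 1: 1.5})
print(probs)
```

In this made-up case the lower-quality document is the more likely click, because the topic affinity outweighs the quality gap.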

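In this setting, simply ranking documents with a high click-through rate at the top yields a myopic policy. Such a policy can converge to a suboptimal one, so active exploration is needed.

Let's look at each method separately and put them together later.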

In [None]:
!pip install --upgrade --no-cache-dir recsim

In [2]:
import functools
from gym import spaces
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from recsim import agent
from recsim import document
from recsim import user
from recsim.choice_model import MultinomialLogitChoiceModel
from recsim.simulator import environment
from recsim.simulator import recsim_gym
from recsim.simulator import runner_lib
from recsim.environments import interest_exploration

The environment is not our focus here, so we use the one that is provided.

In [8]:
env_config = {'slate_size': 2,
              'seed': 0,
              'num_candidates': 15,
              'resample_documents': True}
ie_environment = interest_exploration.create_environment(env_config)

In [9]:
initial_observation = ie_environment.reset()

## Observations

A RecSim observation is a dict with three keys:
* 'user', which represents the 'User Observable Features' in the figure above,
* 'doc', the documents available for recommendation and their features ('Document Observable Features'),
* 'response', the user's response to the last recommendation ('User Response').

This environment does not implement user observable features, so that field stays empty.

In [10]:
print('User Observable Features')
print(initial_observation['user'])
print('User Response')
print(initial_observation['response'])
print('Document Observable Features')
for doc_id, doc_features in initial_observation['doc'].items():
  print('ID:', doc_id, 'features:', doc_features)

User Observable Features
[]
User Response
None
Document Observable Features
ID: 15 features: {'quality': array(1.22720163), 'cluster_id': 1}
ID: 16 features: {'quality': array(1.29258489), 'cluster_id': 1}
ID: 17 features: {'quality': array(1.23977078), 'cluster_id': 1}
ID: 18 features: {'quality': array(1.46045555), 'cluster_id': 1}
ID: 19 features: {'quality': array(2.10233425), 'cluster_id': 0}
ID: 20 features: {'quality': array(1.09572905), 'cluster_id': 1}
ID: 21 features: {'quality': array(2.37256963), 'cluster_id': 0}
ID: 22 features: {'quality': array(1.34928002), 'cluster_id': 1}
ID: 23 features: {'quality': array(1.00670188), 'cluster_id': 1}
ID: 24 features: {'quality': array(1.20448562), 'cluster_id': 1}
ID: 25 features: {'quality': array(2.18351159), 'cluster_id': 0}
ID: 26 features: {'quality': array(1.19411585), 'cluster_id': 1}
ID: 27 features: {'quality': array(1.03514646), 'cluster_id': 1}
ID: 28 features: {'quality': array(2.29592623), 'cluster_id': 0}
ID: 29 feature
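Since agents work with this dictionary directly, a small helper can make its structure easier to see. The sketch below uses a hand-written observation that mimics the shape printed above (the values are copied from three of the printed documents); `docs_by_cluster` is a hypothetical helper, not part of RecSim.

```python
def docs_by_cluster(observation):
    """Map each cluster_id to the list of (position, doc_id) pairs in it."""
    clusters = {}
    for position, (doc_id, features) in enumerate(observation['doc'].items()):
        clusters.setdefault(features['cluster_id'], []).append((position, doc_id))
    return clusters


# Hand-written stand-in with the same keys as a RecSim observation.
mock_observation = {
    'user': [],
    'response': None,
    'doc': {
        '15': {'quality': 1.22720163, 'cluster_id': 1},
        '19': {'quality': 2.10233425, 'cluster_id': 0},
        '21': {'quality': 2.37256963, 'cluster_id': 0},
    },
}
grouped = docs_by_cluster(mock_observation)
print(grouped)
```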

In [15]:
print('Document observation space')
for key, space in ie_environment.observation_space['doc'].spaces.items():
  print(key, ':', space)
print('Response observation space')
print(ie_environment.observation_space['response'])

Document observation space
15 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
16 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
17 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
18 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
19 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
20 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
21 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
22 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
23 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
24 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
25 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
26 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
27 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
28 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), flo

## Slates

In [16]:
slate = [0, 1]
for slate_doc in slate:
  print(list(initial_observation['doc'].items())[slate_doc])

('15', {'quality': array(1.22720163), 'cluster_id': 1})
('16', {'quality': array(1.29258489), 'cluster_id': 1})


In [17]:
ie_environment.action_space

MultiDiscrete([15 15])
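The cells above suggest that a slate is a list of positions into the current document list, not a list of document IDs, and that `MultiDiscrete([15 15])` allows two positions, each between 0 and 14. A minimal sketch of that reading follows; `is_valid_slate` and `slate_to_doc_ids` are hypothetical helpers, and `nvec` stands in for `action_space.nvec`.

```python
def is_valid_slate(slate, nvec):
    """Check that the slate has one in-range position per slate slot."""
    return (len(slate) == len(nvec)
            and all(0 <= pos < limit for pos, limit in zip(slate, nvec)))


def slate_to_doc_ids(slate, doc_observation):
    """Translate slate positions into the document IDs they point at."""
    doc_ids = list(doc_observation.keys())
    return [doc_ids[pos] for pos in slate]


nvec = [15, 15]  # mirrors MultiDiscrete([15 15])
# Stand-in for observation['doc'] with IDs '15' through '29'.
doc_observation = {str(doc_id): {} for doc_id in range(15, 30)}
print(is_valid_slate([0, 1], nvec))
print(slate_to_doc_ids([0, 1], doc_observation))
```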

Given the first slate, the simulator steps the environment and produces a new observation and reward.

In [18]:
observation, reward, done, _ = ie_environment.step(slate)

The main job of the agent is to produce a valid slate for each step of the simulation. 

## A trivial agent

- At the most basic level, it is enough for an agent to implement only the step function.
- To start, let's build an agent that always recommends the first K documents.

In [19]:
from recsim.agent import AbstractEpisodicRecommenderAgent

- A RecSim agent inherits from *AbstractEpisodicRecommenderAgent*.

- The observation_space and action_space are required at initialization.

- They can be used to verify that the environment meets the preconditions for the agent's operation.

In [21]:
class StaticAgent(AbstractEpisodicRecommenderAgent):
  def __init__(self, observation_space, action_space):
    # Check if document corpus is large enough.
    if len(observation_space['doc'].spaces) < len(action_space.nvec):
      raise RuntimeError('Slate size larger than size of the corpus.')
    super(StaticAgent, self).__init__(action_space)

  def step(self, reward, observation):
    print(observation)
    return list(range(self._slate_size))
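To see how a runner might drive such an agent, here is a sketch of one episode loop. It assumes the gym-style `reset`/`step` interface used earlier and the `begin_episode`/`step` method names of the agent. `ToyEnvironment` and `FirstKAgent` are hypothetical stand-ins so that the snippet runs without RecSim, and the reward rule is made up.

```python
class FirstKAgent:
    """Stand-in for StaticAgent: always recommends the first K positions."""

    def __init__(self, slate_size):
        self._slate_size = slate_size

    def begin_episode(self, observation):
        # No reward is available before the first recommendation.
        return self.step(0, observation)

    def step(self, reward, observation):
        del reward, observation  # a static agent ignores its inputs
        return list(range(self._slate_size))


class ToyEnvironment:
    """Made-up environment: reward is the number of cluster-0 docs served."""

    def __init__(self, clusters, max_steps):
        self._clusters = clusters
        self._max_steps = max_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        return {'doc': self._clusters}

    def step(self, slate):
        self._steps += 1
        reward = sum(1 for pos in slate if self._clusters[pos] == 0)
        done = self._steps >= self._max_steps
        return {'doc': self._clusters}, reward, done, {}


toy_env = ToyEnvironment(clusters=[0, 1, 0, 1], max_steps=5)
toy_agent = FirstKAgent(slate_size=2)

toy_obs = toy_env.reset()
toy_slate = toy_agent.begin_episode(toy_obs)
total_reward, toy_done = 0, False
while not toy_done:
    toy_obs, toy_reward, toy_done, _ = toy_env.step(toy_slate)
    total_reward += toy_reward
    toy_slate = toy_agent.step(toy_reward, toy_obs)
print(total_reward)
```

The toy names are prefixed so that pasting the snippet into the notebook does not overwrite `observation`, `reward`, or `slate` from the earlier cells.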

In [24]:
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

In [25]:
def create_agent(sess, environment, eval_mode, summary_writer=None):
  return StaticAgent(environment.observation_space, environment.action_space)

tmp_base_dir = '/tmp/recsim/'

runner = runner_lib.EvalRunner(
  base_dir=tmp_base_dir,
  create_agent_fn=create_agent,
  env=ie_environment,
  max_eval_episodes=1,
  max_steps_per_episode=5,
  test_mode=True)

# We won't run this, but we totally could
# runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 1


INFO:tensorflow:max_steps_per_episode = 5



# Design: Hierarchical Agent Layers 

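- With the basics in place, let's take on something a bit more challenging.

- We can run a bandit algorithm. Let's build a policy under the assumption that the user's intent is observable.

- The passage below appears to describe a hierarchical recommender-system structure (preprocessing, recommendation, postprocessing). To see the context in which this comes up, let's go back to the original text.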

The way this problem is set up, a natural heuristic presents itself. We can run a bandit algorithm to reveal the average engagement of a user with each cluster of documents. That is, each cluster becomes an arm. Once the algorithm has chosen a cluster, we serve the highest quality video from that cluster. This is a metaphor for a situation that occurs often in recommender systems that serve as a front end to multiple (sub-)products: within each session, the user will interact with the recommender with some intent in mind, that is, to realize some task that can be fulfilled by one of the possible sub-products. Sometimes, the user will issue an explicit query (e.g., enter search terms), which effectively makes that intent observable up to query interpretation uncertainty. Most often, however, the intent will be latent -- the user will reveal it indirectly by choosing among a set of items from the slate. We assume that had the intent been observable, a product-specific policy would be available to fulfill it.

This set-up captures some typical features of practical recommender systems -- they tend to be very hierarchical, often very heuristic due to the complexity of the environment they operate in, and also very idiosyncratic to the task at hand. For this reason, RecSim's approach to agent engineering is very modular. Instead of providing a wide array of agents, we provide an easily extendable set of agent building blocks, called Agent Layers, which can be combined into hierarchies to create more complex agents.




## Hierarchical agent layers
![Hierarchical agent architecture](https://github.com/google-research/recsim/blob/master/recsim/colab/figures/agent_architecture.png?raw=true)

A hierarchical agent layer does not materialize a slate of documents, but relies on one or more base agents to do so. The hierarchical agent architecture in RecSim can roughly be described as follows:
* A hierarchical agent layer receives an observation and reward from the environment; it preprocesses the raw observation and passes it to one or more base agents.
* Each base agent outputs either a slate or an abstract action (depending on the use case), which is then post-processed by the layer to create/output the slate (concrete action). 

Hierarchical layers are recursively stackable in a fashion similar to Keras layers. Hierarchical layers are defined by their pre- and post-processing functions and can play many roles depending on how these are implemented. For example, a layer can be used as a pure feature injector: it can extract some feature from the (history of) observations and pass it to the base agent, while keeping the post-processing function vacuous. This allows decoupling of feature- and agent-engineering. Various regularizers can be implemented in a similar fashion by modifying the reward. Layers may also be stateful and dynamic, as the pre- or post-processing functions may implement parameter updates or learning mechanisms.

We will not discuss how to implement these layers here (the reader is referred to examples in the *layers/* directory), rather, we will show their usage and benefits. 
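Purely as a mental model of the pre- and post-processing idea, and not as a substitute for the real implementations in *layers/*, a layer can be pictured as a wrapper around a base agent. Every name below (`ToyLayer`, `EchoAgent`, `preprocess`, `postprocess`, the 'hint' field) is hypothetical.

```python
class EchoAgent:
    """Base agent that returns the 'hint' field of its observation as the slate."""

    def step(self, reward, observation):
        del reward
        return observation['hint']


class ToyLayer:
    """Wraps a base agent with pre- and post-processing functions."""

    def __init__(self, base_agent, preprocess, postprocess):
        self._base_agent = base_agent
        self._preprocess = preprocess
        self._postprocess = postprocess

    def step(self, reward, observation):
        # Pre-process the raw observation, delegate, then post-process.
        abstract_action = self._base_agent.step(
            reward, self._preprocess(observation))
        return self._postprocess(abstract_action)


# Feature injector: adds a field and leaves the action untouched.
injector = ToyLayer(
    EchoAgent(),
    preprocess=lambda obs: dict(obs, hint=[1, 0]),
    postprocess=lambda action: action)
# Layers stack: the outer layer reverses whatever the inner layer returns.
stacked = ToyLayer(
    injector,
    preprocess=lambda obs: obs,
    postprocess=lambda action: action[::-1])

print(injector.step(0, {'doc': {}}))
print(stacked.step(0, {'doc': {}}))
```

Because a layer exposes the same `step` method as the agent it wraps, the outer layer does not need to know whether its base agent is a plain agent or another layer.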




## ClusterClickStats

Recall that the *Interest Exploration* environment provides clicks as feedback, but does not keep track of cumulative click counts or impression counts. Since maintaining such statistics is generally useful, we provide an agent layer that does exactly that. That is, it monitors the stream of responses and retains the number of clicks and impressions from each cluster. The precondition is that the response space has a key 'click', as well as 'cluster_id'. If this is met, then the layer can be used with any environment/agent. Let's see how this works.

In [None]:
from recsim.agents.layers.cluster_click_statistics import ClusterClickStatsLayer


A hierarchical agent layer is instantiated in a similar way to usual agents, except that it takes in a constructor for a base agent, that is, an agent whose abstract action it can interpret. In the case of cluster click stats, it will not do any post-processing of the abstract action, that is, it simply relays the action of the base agent to the environment. This implies that the base agent will need to provide a full slate.

Once instantiated, the cluster click stats layer will inject a sufficient statistic containing clicks and impressions into the base agent's observation space. Thus, the combination of both will behave as if the base agent had an additional field in its observation space. We showcase this using our StaticAgent.

In [None]:
static_agent = StaticAgent(ie_environment.observation_space,
                           ie_environment.action_space)
static_agent.step(reward, observation)

{'user': array([], dtype=float64), 'doc': {'30': {'quality': 2.489224450301943, 'cluster_id': 0}, '31': {'quality': 2.125926607579561, 'cluster_id': 0}, '32': {'quality': 1.27448138607991, 'cluster_id': 1}, '33': {'quality': 1.2179911236932994, 'cluster_id': 1}, '34': {'quality': 1.177703750911228, 'cluster_id': 1}, '35': {'quality': 2.079489146813576, 'cluster_id': 0}, '36': {'quality': 1.1416765236282371, 'cluster_id': 1}, '37': {'quality': 1.2052916542615082, 'cluster_id': 1}, '38': {'quality': 1.2424683972006194, 'cluster_id': 1}, '39': {'quality': 1.8727966807396805, 'cluster_id': 0}, '40': {'quality': 1.1964488835024119, 'cluster_id': 1}, '41': {'quality': 1.282540205315461, 'cluster_id': 1}, '42': {'quality': 2.015585394934561, 'cluster_id': 0}, '43': {'quality': 2.464004827721051, 'cluster_id': 0}, '44': {'quality': 1.33980633202097, 'cluster_id': 1}}, 'response': ({'click': 0, 'quality': 1.2272016322975663, 'cluster_id': 1}, {'click': 0, 'quality': 1.2925848895378007, 'cluster

[0, 1]

In [None]:
cluster_static_agent = ClusterClickStatsLayer(StaticAgent,
                                              ie_environment.observation_space,
                                              ie_environment.action_space)
cluster_static_agent.step(reward, observation)

{'user': {'raw_observation': array([], dtype=float64), 'sufficient_statistics': {'impression_count': array([0, 2]), 'click_count': array([0, 0])}}, 'doc': {'30': {'quality': 2.489224450301943, 'cluster_id': 0}, '31': {'quality': 2.125926607579561, 'cluster_id': 0}, '32': {'quality': 1.27448138607991, 'cluster_id': 1}, '33': {'quality': 1.2179911236932994, 'cluster_id': 1}, '34': {'quality': 1.177703750911228, 'cluster_id': 1}, '35': {'quality': 2.079489146813576, 'cluster_id': 0}, '36': {'quality': 1.1416765236282371, 'cluster_id': 1}, '37': {'quality': 1.2052916542615082, 'cluster_id': 1}, '38': {'quality': 1.2424683972006194, 'cluster_id': 1}, '39': {'quality': 1.8727966807396805, 'cluster_id': 0}, '40': {'quality': 1.1964488835024119, 'cluster_id': 1}, '41': {'quality': 1.282540205315461, 'cluster_id': 1}, '42': {'quality': 2.015585394934561, 'cluster_id': 0}, '43': {'quality': 2.464004827721051, 'cluster_id': 0}, '44': {'quality': 1.33980633202097, 'cluster_id': 1}}, 'response': ({

[0, 1]

Observe how the 'user' field of the observation dictionary (as printed from within the static agent's step function) now has a new key 'sufficient_statistics', whereas the old user observation (which is vacuous) went under the 'raw_observation' key. This is done to avoid naming conflicts.
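The counts printed above are consistent with a simple tally over the response tuple: both documents in the first slate came from cluster 1 and neither was clicked. A minimal sketch of such a tally follows, assuming each response carries 'click' and 'cluster_id' keys as stated earlier. `update_click_stats` is a hypothetical helper, not the layer's code, and the cluster of the second response is taken from the earlier listing of document '16' because the printed output is cut off.

```python
def update_click_stats(stats, responses):
    """Add one impression per response and one click per clicked response."""
    for response in responses:
        cluster = response['cluster_id']
        stats['impression_count'][cluster] += 1
        stats['click_count'][cluster] += response['click']
    return stats


stats = {'impression_count': [0, 0], 'click_count': [0, 0]}
first_responses = (
    {'click': 0, 'quality': 1.2272016322975663, 'cluster_id': 1},
    {'click': 0, 'quality': 1.2925848895378007, 'cluster_id': 1},
)
print(update_click_stats(stats, first_responses))
```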

## AbstractClickBandit

The ClusterClickStats layer takes care of computing the necessary sufficient statistics for exploration. To implement the actual bandit policy, RecSim offers an abstract bandit layer implementation. The *AbstractClickBandit* takes as input a list of base agents, which it treats as arms. It will then utilize one of a few implemented bandit policies (UCB1, KL-UCB, ThompsonSampling) to mix the policies in a way that achieves sub-linear regret relative to the best policy (which is a priori unknown), subject to certain assumptions about the environment.
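For intuition about what such a policy does with the click statistics, here is a minimal sketch of the textbook UCB1 rule applied to per-arm click and impression counts. It is not RecSim's implementation, and `ucb1_select` is a hypothetical name.

```python
import math


def ucb1_select(click_count, impression_count):
    """Return the index of the arm with the highest UCB1 score."""
    # Any arm that has never been shown is tried first.
    for arm, shown in enumerate(impression_count):
        if shown == 0:
            return arm
    total = sum(impression_count)
    scores = [
        # Empirical click rate plus an exploration bonus that shrinks
        # as the arm accumulates impressions.
        clicks / shown + math.sqrt(2.0 * math.log(total) / shown)
        for clicks, shown in zip(click_count, impression_count)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


print(ucb1_select([0, 0], [0, 2]))     # arm 0 has never been shown
print(ucb1_select([50, 1], [100, 2]))  # equal click rates, arm 1 less explored
```

In the second call both arms have the same empirical click rate, so the rarely shown arm wins on its larger exploration bonus.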

In [None]:
from recsim.agents.layers.abstract_click_bandit import AbstractClickBanditLayer

To instantiate an abstract bandit, we must present a list of base agents. In our case, we will have one base agent for each cluster. That agent simply retrieves the documents of that cluster from the corpus and sorts them according to perceived quality.

In [None]:
class GreedyClusterAgent(agent.AbstractEpisodicRecommenderAgent):
  """Simple agent sorting all documents of a topic according to quality."""

  def __init__(self, observation_space, action_space, cluster_id, **kwargs):
    del observation_space
    super(GreedyClusterAgent, self).__init__(action_space)
    self._cluster_id = cluster_id

  def step(self, reward, observation):
    del reward  # This agent does not learn from rewards.
    my_docs = []
    my_doc_quality = []
    # Collect the positions and qualities of the documents in this cluster.
    for i, doc in enumerate(observation['doc'].values()):
      if doc['cluster_id'] == self._cluster_id:
        my_docs.append(i)
        my_doc_quality.append(doc['quality'])
    if not my_docs:
      return []
    # Order the positions from highest to lowest quality.
    sorted_indices = np.argsort(my_doc_quality)[::-1]
    return list(np.array(my_docs)[sorted_indices])


We will now instantiate one GreedyClusterAgent for each cluster.

In [None]:
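# Read the number of clusters from the 'cluster_id' space of the first document.
num_topics = list(ie_environment.observation_space.spaces['doc']
                  .spaces.values())[0].spaces['cluster_id'].n
# One constructor per cluster, with cluster_id already bound.
base_agent_ctors = [
    functools.partial(GreedyClusterAgent, cluster_id=i)
    for i in range(num_topics)
]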
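The `functools.partial` calls pre-bind `cluster_id`, so each entry of `base_agent_ctors` is a constructor that presumably only needs the spaces it is later called with. A small sketch with a stand-in class illustrates the mechanism (`ToyAgent` and the placeholder string arguments are hypothetical):

```python
import functools


class ToyAgent:
    """Stand-in with the same constructor arguments as GreedyClusterAgent."""

    def __init__(self, observation_space, action_space, cluster_id):
        self.cluster_id = cluster_id


toy_ctors = [functools.partial(ToyAgent, cluster_id=i) for i in range(2)]
# Whoever owns the constructors supplies the remaining arguments later.
toy_agents = [ctor('obs_space', 'act_space') for ctor in toy_ctors]
print([a.cluster_id for a in toy_agents])
```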

We can now instantiate our cluster bandit as a combination of ClusterClickStats, AbstractClickBandit, and GreedyClusterAgent:

In [None]:
bandit_ctor = functools.partial(AbstractClickBanditLayer,
                                arm_base_agent_ctors=base_agent_ctors)
cluster_bandit = ClusterClickStatsLayer(bandit_ctor,
                                        ie_environment.observation_space,
                                        ie_environment.action_space)

Our ClusterBandit is ready to use!

In [None]:
observation0 = ie_environment.reset()
slate = cluster_bandit.begin_episode(observation0)
print("Cluster bandit slate 0:")
doc_list = list(observation0['doc'].values())
for doc_position in slate:
  print(doc_list[doc_position])

Cluster bandit slate 0:
{'quality': 1.4686875120276195, 'cluster_id': 1}
{'quality': 1.4226918183479484, 'cluster_id': 1}
