<a href="https://colab.research.google.com/github/ZiminPark/recsim/blob/master/recsim/colab/RecSim_Developing_an_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Developing an Agent

Let's move on to the last piece of the puzzle: developing an agent. This tutorial covers two things.

* basics: what data RecSim passes to the agent and what it expects the agent to return

* design: which RecSim features help in developing agents.

# Basics


<p align="center"><img width="50%" src="https://github.com/google-research/recsim/blob/master/recsim/colab/figures/recsim_architecture_agent_centered.png?raw=true" /></p>

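The agent needs to consume the following:

* observations of the user state
* observations of the user's response to recommendations
* the available documents and their feature vectors.

In return, the agent produces a slate of $K$ recommendations and passes it to the user's choice and transition models.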
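The exchange between agent and environment can be sketched without RecSim at all. The following is a minimal, library-free illustration; `ToyEnvironment`, `FirstKAgent`, and the click rule are hypothetical stand-ins chosen for this example, not RecSim classes.

```python
# Hypothetical stand-ins for illustration only; these are not RecSim classes.
class ToyEnvironment:
    """Serves a fixed corpus; the user clicks the best document on the slate."""

    def __init__(self, qualities):
        self._qualities = list(qualities)

    def reset(self):
        return self._observe(response=None)

    def step(self, slate):
        # Assumed click rule: the highest-quality document on the slate wins.
        clicked = max(slate, key=lambda i: self._qualities[i])
        response = [{'click': int(i == clicked)} for i in slate]
        reward = self._qualities[clicked]
        return self._observe(response), reward

    def _observe(self, response):
        docs = {str(i): {'quality': q} for i, q in enumerate(self._qualities)}
        return {'user': [], 'doc': docs, 'response': response}


class FirstKAgent:
    """Always recommends the first K documents."""

    def __init__(self, slate_size):
        self._slate_size = slate_size

    def step(self, reward, observation):
        del reward, observation  # This trivial agent ignores its inputs.
        return list(range(self._slate_size))


toy_env = ToyEnvironment(qualities=[0.2, 0.9, 0.5])
toy_agent = FirstKAgent(slate_size=2)

toy_observation = toy_env.reset()
toy_reward = 0.0
total_reward = 0.0
for _ in range(3):
    # Agent consumes reward and observation, returns K recommendations.
    toy_slate = toy_agent.step(toy_reward, toy_observation)
    # Environment consumes the slate, returns the next observation and reward.
    toy_observation, toy_reward = toy_env.step(toy_slate)
    total_reward += toy_reward
```

The loop makes the contract explicit: the agent only ever sees a reward and an observation, and only ever emits a slate of document positions.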

To illustrate RecSim's agent API, we will build a simple bandit agent for RecSim's *interest exploration* environment.

*Interest exploration* is a clustered bandit problem. The world is assumed to consist of documents that can be clustered into topics, and users can be clustered as well.

A user's affinity for a document is the document's quality plus the user's affinity for the document's topic.
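As a rough illustration of this scoring rule, the sketch below adds quality and topic affinity and then turns the scores of a slate into choice probabilities with a softmax. The softmax form is an assumption made here in the spirit of a multinomial logit choice model; the exact functional form used by the environment is not shown in this tutorial, and all numbers are made up.

```python
import math


def affinity_score(doc_quality, topic_affinity):
    """Additive score: document quality plus the user's affinity for its topic."""
    return doc_quality + topic_affinity


def softmax_choice_probabilities(scores):
    """Softmax over slate scores (assumed multinomial-logit style choice)."""
    max_score = max(scores)  # Subtract the max for numerical stability.
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical user: mild interest in topic 0, strong interest in topic 1.
user_topic_affinity = {0: 0.1, 1: 1.5}
# Hypothetical slate of (quality, cluster_id) pairs.
example_slate = [(2.1, 0), (1.2, 1)]

scores = [affinity_score(q, user_topic_affinity[c]) for q, c in example_slate]
probabilities = softmax_choice_probabilities(scores)
```

In this made-up case the lower-quality document ends up with the higher choice probability, because the user's topic affinity outweighs the quality gap.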

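In this setting, a myopic policy emerges if documents with high click-through rates are simply ranked highest. Because such a policy can converge to a suboptimal solution, active exploration is required.

We will go through the agent's methods one at a time and put them together afterwards.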

In [2]:
!pip install --upgrade --no-cache-dir recsim


In [3]:
import functools
from gym import spaces
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from recsim import agent
from recsim import document
from recsim import user
from recsim.choice_model import MultinomialLogitChoiceModel
from recsim.simulator import environment
from recsim.simulator import recsim_gym
from recsim.simulator import runner_lib
from recsim.environments import interest_exploration

The environment is not our focus here, so we use the one provided.

In [4]:
env_config = {'slate_size': 2,
              'seed': 0,
              'num_candidates': 15,
              'resample_documents': True}
ie_environment = interest_exploration.create_environment(env_config)

In [5]:
initial_observation = ie_environment.reset()

## Observations

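A RecSim observation is a dict with three keys:
* 'user', which represents the 'User Observable Features' in the figure above,
* 'doc', which holds the documents available for recommendation and their features ('Document Observable Features'),
* 'response', which holds the user's response to the last recommendation ('User Response').

This environment does not implement observable user features, so that field is always empty.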
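To make the shape concrete, here is a small hand-written observation with the same three keys, plus a helper that counts candidate documents per cluster. The values are hypothetical and plain Python floats stand in for the NumPy arrays RecSim returns; `docs_per_cluster` is a helper invented for this example.

```python
from collections import Counter

# Hypothetical observation mirroring the three keys of a RecSim observation.
example_observation = {
    'user': [],  # This environment exposes no observable user features.
    'doc': {
        '15': {'quality': 1.23, 'cluster_id': 1},
        '16': {'quality': 1.29, 'cluster_id': 1},
        '19': {'quality': 2.10, 'cluster_id': 0},
    },
    'response': None,  # No slate has been served yet.
}


def docs_per_cluster(observation):
    """Counts how many candidate documents belong to each cluster."""
    return Counter(doc['cluster_id'] for doc in observation['doc'].values())


cluster_counts = docs_per_cluster(example_observation)
```

An agent typically iterates over `observation['doc']` in exactly this way, since the document IDs themselves change whenever documents are resampled.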

In [6]:
print('User Observable Features')
print(initial_observation['user'])
print('User Response')
print(initial_observation['response'])
print('Document Observable Features')
for doc_id, doc_features in initial_observation['doc'].items():
  print('ID:', doc_id, 'features:', doc_features)

User Observable Features
[]
User Response
None
Document Observable Features
ID: 15 features: {'quality': array(1.22720163), 'cluster_id': 1}
ID: 16 features: {'quality': array(1.29258489), 'cluster_id': 1}
ID: 17 features: {'quality': array(1.23977078), 'cluster_id': 1}
ID: 18 features: {'quality': array(1.46045555), 'cluster_id': 1}
ID: 19 features: {'quality': array(2.10233425), 'cluster_id': 0}
ID: 20 features: {'quality': array(1.09572905), 'cluster_id': 1}
ID: 21 features: {'quality': array(2.37256963), 'cluster_id': 0}
ID: 22 features: {'quality': array(1.34928002), 'cluster_id': 1}
ID: 23 features: {'quality': array(1.00670188), 'cluster_id': 1}
ID: 24 features: {'quality': array(1.20448562), 'cluster_id': 1}
ID: 25 features: {'quality': array(2.18351159), 'cluster_id': 0}
ID: 26 features: {'quality': array(1.19411585), 'cluster_id': 1}
ID: 27 features: {'quality': array(1.03514646), 'cluster_id': 1}
ID: 28 features: {'quality': array(2.29592623), 'cluster_id': 0}
ID: 29 feature

In [7]:
print('Document observation space')
for key, space in ie_environment.observation_space['doc'].spaces.items():
  print(key, ':', space)
print('Response observation space')
print(ie_environment.observation_space['response'])

Document observation space
15 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
16 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
17 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
18 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
19 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
20 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
21 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
22 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
23 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
24 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
25 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
26 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
27 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), float32))
28 : Dict(cluster_id:Discrete(2), quality:Box(0.0, inf, (), flo

## Slates

In [8]:
slate = [0, 1]
for slate_doc in slate:
  print(list(initial_observation['doc'].items())[slate_doc])

('15', {'quality': array(1.22720163), 'cluster_id': 1})
('16', {'quality': array(1.29258489), 'cluster_id': 1})


In [9]:
ie_environment.action_space

MultiDiscrete([15 15])

Given the first slate, the simulator takes a step in the environment and produces a new observation and a reward.

In [10]:
observation, reward, done, _ = ie_environment.step(slate)

The main job of the agent is to produce a valid slate for each step of the simulation. 
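What "valid" means can be spelled out with a small check. This sketch assumes a slate is a list of positions into the candidate list, one per slot, each within the bound given by a `MultiDiscrete`-style size vector such as `[15, 15]`. Rejecting duplicate positions is an additional assumption of this example, and `is_valid_slate` is a hypothetical helper, not part of RecSim.

```python
def is_valid_slate(slate, nvec):
    """Checks a slate against a MultiDiscrete-style size vector.

    nvec[i] is the number of candidates allowed in slot i, e.g. [15, 15].
    Rejecting duplicate positions is an assumption of this sketch.
    """
    if len(slate) != len(nvec):
        return False
    in_range = all(0 <= pos < limit for pos, limit in zip(slate, nvec))
    return in_range and len(set(slate)) == len(slate)


example_nvec = [15, 15]  # Two slots, 15 candidates each.

valid = is_valid_slate([0, 1], example_nvec)
out_of_range = is_valid_slate([0, 15], example_nvec)
too_short = is_valid_slate([0], example_nvec)
duplicated = is_valid_slate([3, 3], example_nvec)
```

Note that the slate holds positions in the current candidate list, not document IDs, which is why `[0, 1]` selected the documents with IDs '15' and '16' earlier.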

## A trivial agent

- At the most basic level, an agent only needs to implement a step function.
- To start, let's build an agent that always recommends the first K documents.

In [11]:
from recsim.agent import AbstractEpisodicRecommenderAgent

- A RecSim agent inherits from *AbstractEpisodicRecommenderAgent*.

- Its constructor takes observation_space and action_space.

- These can be used to verify that the environment meets the preconditions for the agent's operation.

In [12]:
class StaticAgent(AbstractEpisodicRecommenderAgent):
  def __init__(self, observation_space, action_space):
    # Check if document corpus is large enough.
    if len(observation_space['doc'].spaces) < len(action_space.nvec):
      raise RuntimeError('Slate size larger than size of the corpus.')
    super(StaticAgent, self).__init__(action_space)

  def step(self, reward, observation):
    print(observation)
    return list(range(self._slate_size))

In [13]:
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

In [14]:
def create_agent(sess, environment, eval_mode, summary_writer=None):
  return StaticAgent(environment.observation_space, environment.action_space)

tmp_base_dir = '/tmp/recsim/'

runner = runner_lib.EvalRunner(
  base_dir=tmp_base_dir,
  create_agent_fn=create_agent,
  env=ie_environment,
  max_eval_episodes=1,
  max_steps_per_episode=5,
  test_mode=True)

# We won't run this, but we totally could
# runner.run_experiment()

INFO:tensorflow:max_eval_episodes = 1

INFO:tensorflow:max_steps_per_episode = 5


# Design: Hierarchical Agent Layers 

- Now that we have a basic agent, let's try something harder.

- We could run a bandit algorithm. Let's build a policy under the assumption that the user's intent is observable.

- The original text, reproduced below, motivates this with the hierarchical structure of practical recommender systems.

The way this problem is set up, a natural heuristic presents itself. We can run a bandit algorithm to reveal the average engagement of a user with each cluster of documents. That is, each cluster becomes an arm. Once the algorithm has chosen a cluster, we serve the highest quality video from that cluster. This is a metaphor for a situation that occurs often in recommender systems that serve as a front end to multiple (sub-)products: within each session, the user will interact with the recommender with some intent in mind, that is, to realize some task that can be fulfilled by one of the possible sub-products. Sometimes, the user will issue an explicit query (e.g., enter search terms), which effectively makes that intent observable up to query interpretation uncertainty. Most often, however, the intent will be latent -- the user will reveal it indirectly by choosing among a set of items from the slate. We assume that had the intent been observable, a product-specific policy would be available to fulfill it.

This set-up captures some typical features of practical recommender systems -- they tend to be very hierarchical, often very heuristic due to the complexity of the environment they operate in, and also very idiosyncratic to the task at hand. For this reason, RecSim's approach to agent engineering is very modular. Instead of providing a wide array of agents, we provide an easily extendable set of agent building blocks, called Agent Layers, which can be combined into hierarchies to create more complex agents.




## Hierarchical agent layers

<p align="center"><img width="50%" src="https://github.com/google-research/recsim/blob/master/recsim/colab/figures/agent_architecture.png?raw=true" /></p>

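The hierarchical agent architecture works roughly as follows:
* A layer preprocesses the observation and reward coming from the environment and passes them to one or more base agents.

* Each base agent outputs a slate or an abstract action, which each layer then postprocesses into a concrete slate (concrete action).

* Layers can be chained together much like Keras layers, which makes them easy to modify.

We will not explain here how these layers are implemented (see the *layers/* directory). Instead, we look at how they are used and what they offer.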
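The wrapping pattern itself can be sketched in a few lines of plain Python. The classes below are hypothetical and only mimic the idea of a layer that builds its base agent from a constructor, preprocesses the observation, and passes the action through; they do not reproduce RecSim's actual layer implementation.

```python
import functools


class FirstKBaseAgent:
    """Hypothetical base agent: recommends the first K documents."""

    def __init__(self, slate_size):
        self._slate_size = slate_size

    def step(self, reward, observation):
        del reward, observation  # Unused by this trivial agent.
        return list(range(self._slate_size))


class StepCounterLayer:
    """Hypothetical layer: injects a step counter into the observation."""

    def __init__(self, base_agent_ctor, **kwargs):
        # The layer owns its base agent and builds it from a constructor.
        self._base_agent = base_agent_ctor(**kwargs)
        self._num_steps = 0
        self.last_seen = None  # Kept only so the example can be inspected.

    def step(self, reward, observation):
        self._num_steps += 1
        # Preprocess: copy the observation and add an extra field.
        enriched = dict(observation, num_steps=self._num_steps)
        self.last_seen = enriched
        # Delegate to the base agent; this layer does no postprocessing.
        return self._base_agent.step(reward, enriched)


# Layers nest like function composition: a layer can wrap another layer.
inner_ctor = functools.partial(StepCounterLayer, FirstKBaseAgent)
stacked_agent = StepCounterLayer(inner_ctor, slate_size=2)
stacked_slate = stacked_agent.step(0.0, {'doc': {}})
```

Passing a constructor rather than an instance lets each layer decide what observation and action spaces its base agent sees, which is the same reason `functools.partial` shows up when RecSim layers are combined.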



## ClusterClickStats

*Interest Exploration* provides clicks as feedback but does not keep track of cumulative click or impression counts. Because maintaining such statistics is useful, an agent layer is provided that does it.
The precondition is that the response space has the keys click and cluster_id. When that holds, the layer can be used with any environment or agent. Let's see how it is used.

In [15]:
from recsim.agents.layers.cluster_click_statistics import ClusterClickStatsLayer

A hierarchical agent layer is instantiated much like a regular agent. ClusterClickStats does not postprocess the abstract action at all.

ClusterClickStats injects sufficient statistics, namely click and impression counts, into the base agent's observation.
The combination therefore behaves like the base agent with additional fields in its observation space.

In [16]:
static_agent = StaticAgent(ie_environment.observation_space,
                           ie_environment.action_space)
static_agent.step(reward, observation)

{'user': array([], dtype=float64), 'doc': OrderedDict([('30', {'quality': array(2.48922445), 'cluster_id': 0}), ('31', {'quality': array(2.12592661), 'cluster_id': 0}), ('32', {'quality': array(1.27448139), 'cluster_id': 1}), ('33', {'quality': array(1.21799112), 'cluster_id': 1}), ('34', {'quality': array(1.17770375), 'cluster_id': 1}), ('35', {'quality': array(2.07948915), 'cluster_id': 0}), ('36', {'quality': array(1.14167652), 'cluster_id': 1}), ('37', {'quality': array(1.20529165), 'cluster_id': 1}), ('38', {'quality': array(1.2424684), 'cluster_id': 1}), ('39', {'quality': array(1.87279668), 'cluster_id': 0}), ('40', {'quality': array(1.19644888), 'cluster_id': 1}), ('41', {'quality': array(1.28254021), 'cluster_id': 1}), ('42', {'quality': array(2.01558539), 'cluster_id': 0}), ('43', {'quality': array(2.46400483), 'cluster_id': 0}), ('44', {'quality': array(1.33980633), 'cluster_id': 1})]), 'response': ({'click': 0, 'quality': array(1.22720163), 'cluster_id': 1}, {'click': 0, 'q

[0, 1]

In [17]:
cluster_static_agent = ClusterClickStatsLayer(StaticAgent,
                                              ie_environment.observation_space,
                                              ie_environment.action_space)
cluster_static_agent.step(reward, observation)

{'user': {'raw_observation': array([], dtype=float64), 'sufficient_statistics': {'impression_count': array([0, 2]), 'click_count': array([0, 0])}}, 'doc': OrderedDict([('30', {'quality': array(2.48922445), 'cluster_id': 0}), ('31', {'quality': array(2.12592661), 'cluster_id': 0}), ('32', {'quality': array(1.27448139), 'cluster_id': 1}), ('33', {'quality': array(1.21799112), 'cluster_id': 1}), ('34', {'quality': array(1.17770375), 'cluster_id': 1}), ('35', {'quality': array(2.07948915), 'cluster_id': 0}), ('36', {'quality': array(1.14167652), 'cluster_id': 1}), ('37', {'quality': array(1.20529165), 'cluster_id': 1}), ('38', {'quality': array(1.2424684), 'cluster_id': 1}), ('39', {'quality': array(1.87279668), 'cluster_id': 0}), ('40', {'quality': array(1.19644888), 'cluster_id': 1}), ('41', {'quality': array(1.28254021), 'cluster_id': 1}), ('42', {'quality': array(2.01558539), 'cluster_id': 0}), ('43', {'quality': array(2.46400483), 'cluster_id': 0}), ('44', {'quality': array(1.33980633

[0, 1]

We can see that the 'user' field has changed. `sufficient_statistics` has appeared as a new key, and the old user observation has moved to `raw_observation`. This is done to avoid naming conflicts.
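The bookkeeping behind `impression_count` and `click_count` can be approximated as follows. This is a sketch of the idea under the stated precondition (each response entry has `click` and `cluster_id`), not the code of `ClusterClickStatsLayer`; `update_cluster_stats` and the responses are hypothetical.

```python
def update_cluster_stats(stats, response):
    """Adds one slate's responses to per-cluster impression and click counts."""
    for doc_response in response:
        cluster = doc_response['cluster_id']
        stats['impression_count'][cluster] += 1
        stats['click_count'][cluster] += doc_response['click']
    return stats


num_clusters = 2
cluster_stats = {
    'impression_count': [0] * num_clusters,
    'click_count': [0] * num_clusters,
}

# Hypothetical responses to two slates of size 2.
first_response = [{'click': 0, 'cluster_id': 1}, {'click': 0, 'cluster_id': 1}]
second_response = [{'click': 1, 'cluster_id': 0}, {'click': 0, 'cluster_id': 1}]

update_cluster_stats(cluster_stats, first_response)
# Snapshot after the first slate: two impressions for cluster 1, no clicks.
after_first = {key: list(counts) for key, counts in cluster_stats.items()}
update_cluster_stats(cluster_stats, second_response)
```

After the first slate, the counts match the `sufficient_statistics` printed in the output above: two impressions for cluster 1 and no clicks.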

## AbstractClickBandit

We have seen that `ClusterClickStats` maintains the statistics needed for exploration. To implement an actual bandit policy, RecSim provides `AbstractClickBandit`.


`AbstractClickBandit` takes a list of base agents and treats each one as an arm. It implements several bandit policies (UCB1, KL-UCB, ThompsonSampling) that achieve sub-linear regret relative to the best policy.
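To give a feel for how such a policy uses click and impression counts, here is a textbook UCB1 index computed per arm. It is a sketch of the standard formula (empirical mean plus an exploration bonus); the constants and tie-breaking in RecSim's own implementation may differ, and the function names are hypothetical.

```python
import math


def ucb1_indices(click_count, impression_count, exploration=2.0):
    """Computes a UCB1 index for each arm from clicks and impressions.

    Arms that were never shown get an infinite index so they are tried first.
    """
    total = sum(impression_count)
    indices = []
    for clicks, impressions in zip(click_count, impression_count):
        if impressions == 0:
            indices.append(math.inf)
            continue
        mean = clicks / impressions
        bonus = math.sqrt(exploration * math.log(total) / impressions)
        indices.append(mean + bonus)
    return indices


def select_arm(click_count, impression_count):
    """Returns the position of the arm with the largest UCB1 index."""
    indices = ucb1_indices(click_count, impression_count)
    return max(range(len(indices)), key=indices.__getitem__)


# An arm that has never been shown is selected first.
untried_choice = select_arm(click_count=[0, 0], impression_count=[0, 2])
# Arm 1 has the better click rate so far, but arm 0 is barely explored.
explore_choice = select_arm(click_count=[0, 30], impression_count=[2, 100])
# With plenty of data on both arms, the better click rate wins.
exploit_choice = select_arm(click_count=[5, 300], impression_count=[1000, 1000])
```

The bonus term shrinks as an arm accumulates impressions, so exploration fades on its own once the per-cluster click rates are well estimated.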

In [18]:
from recsim.agents.layers.abstract_click_bandit import AbstractClickBanditLayer

To create the abstract bandit, we must supply a list of base agents. In our case, there is one base agent per cluster. Each one retrieves the documents of its cluster and sorts them by perceived quality.

In [19]:
class GreedyClusterAgent(agent.AbstractEpisodicRecommenderAgent):
  """Simple agent sorting all documents of a topic according to quality."""

  def __init__(self, observation_space, action_space, cluster_id, **kwargs):
    del observation_space
    super(GreedyClusterAgent, self).__init__(action_space)
    self._cluster_id = cluster_id

  def step(self, reward, observation):
    del reward
    my_docs = []
    my_doc_quality = []
    for i, doc in enumerate(observation['doc'].values()):
      if doc['cluster_id'] == self._cluster_id:
        my_docs.append(i)
        my_doc_quality.append(doc['quality'])
    if not bool(my_docs):
      return []
    sorted_indices = np.argsort(my_doc_quality)[::-1]
    return list(np.array(my_docs)[sorted_indices])


We will now instantiate one GreedyClusterAgent for each cluster.

In [20]:
num_topics = list(ie_environment.observation_space.spaces['doc']
                  .spaces.values())[0].spaces['cluster_id'].n
base_agent_ctors = [
    functools.partial(GreedyClusterAgent, cluster_id=i)
    for i in range(num_topics)
]

We can now instantiate our cluster bandit as a combination of ClusterClickStats, AbstractClickBandit, and GreedyClusterAgent:

In [22]:
bandit_ctor = functools.partial(AbstractClickBanditLayer,
                                arm_base_agent_ctors=base_agent_ctors)
cluster_bandit = ClusterClickStatsLayer(bandit_ctor,
                                        ie_environment.observation_space,
                                        ie_environment.action_space)

Our ClusterBandit is ready to use!

In [23]:
observation0 = ie_environment.reset()
slate = cluster_bandit.begin_episode(observation0)
print("Cluster bandit slate 0:")
doc_list = list(observation0['doc'].values())
for doc_position in slate:
  print(doc_list[doc_position])

Cluster bandit slate 0:
{'quality': array(2.36424144), 'cluster_id': 0}
{'quality': array(2.30721859), 'cluster_id': 0}
