<a href="https://colab.research.google.com/github/ZiminPark/bandit-reco/blob/master/notebooks/0.%20Getting%20Started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install recogym

## In this notebook

- We build a simple recommendation agent that **suggests an item in proportion to how many times it has been viewed**.

## Reinforcement Learning Setup

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/rl-setup.png?raw=1" alt="Drawing" style="width: 200px;"/>

## Organic and Bandit

- We focus on online advertising, but we want a framework that applies to recommendation in general.
- So we use the domain-agnostic terms **_Organic_** and **_Bandit_** sessions.  
    1. An **_Organic_** session is an observation of items the user interacts with.  For example, it could be views of products on an e-commerce website, listens to songs while streaming music, or readings of articles on an online newspaper.  
    2. A **_Bandit_** session is one where we have an opportunity to recommend the user an item and observe their behavior.  We receive a reward if they click.

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/organic-bandit.png?raw=1" alt="Drawing" style="width: 200px;"/>

## Offline and Online Learning


We could not apply RL directly to Criteo's recommendation system, because the learning period is costly. Instead, we need to leverage the vast amounts of offline training examples we already have, so that the algorithm performs as well as the current system before it is released into the online production environment.

RecoGym follows a similar flow. An agent is first given access to many offline training examples produced by a fixed policy. It then gets access to the online system, where it chooses the actions itself.

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/two-steps.png?raw=1" alt="Drawing" style="width: 250px;"/>
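The two-phase flow can be sketched without RecoGym at all. The toy below is only an illustration under assumptions: `ToyEnv`, `CountingAgent`, the click probabilities, and the uniform logging policy are all made up for this sketch and are not part of RecoGym.

```python
import random


class ToyEnv:
    """Hypothetical simulator: each product has a fixed click probability."""

    def __init__(self, click_probs, seed=0):
        self.click_probs = click_probs
        self.rng = random.Random(seed)

    def click(self, action):
        # Reward is 1 with the chosen product's click probability, else 0.
        return 1 if self.rng.random() < self.click_probs[action] else 0

    def logged_example(self):
        # The fixed logging policy picks a product uniformly at random.
        action = self.rng.randrange(len(self.click_probs))
        return action, self.click(action)


class CountingAgent:
    """Hypothetical agent: tracks clicks and impressions per product."""

    def __init__(self, num_products):
        self.clicks = [0] * num_products
        self.shown = [0] * num_products

    def train(self, action, reward):
        self.shown[action] += 1
        self.clicks[action] += reward

    def act(self):
        # Recommend the product with the best observed click rate so far.
        rates = [c / s if s else 0.0 for c, s in zip(self.clicks, self.shown)]
        return rates.index(max(rates))


env = ToyEnv(click_probs=[0.01, 0.05, 0.02], seed=42)
agent = CountingAgent(num_products=3)

# Phase 1 (offline): learn from logged data produced by the fixed policy.
for _ in range(20000):
    agent.train(*env.logged_example())

# Phase 2 (online): the agent now chooses the actions itself.
online_clicks = sum(env.click(agent.act()) for _ in range(5000))
online_ctr = online_clicks / 5000
print(f"Online CTR: {online_ctr:.4f}")
```

The point of the split is that all exploration cost is paid by the logged data, so the agent can already favor the better product the first time it acts online.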

## Let's see some code - Interacting with the environment 


The code snippet below shows how to initialize the environment and step through it in an 'offline' manner (here, offline means that the environment generates some recommendations for us). We print the results from the environment at each step.

### World creation

In [2]:
import gym, recogym
from copy import deepcopy

In [3]:
# env_1_args is a dictionary of default parameters that defines the simulated world 
# such as user behavior, number of products, etc.
from recogym import env_1_args, Configuration


env_1_args['random_seed'] = 42
env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

env.reset()

In [49]:
env_1_args

{'K': 5,
 'change_omega_for_bandits': False,
 'normalize_beta': False,
 'num_clusters': 2,
 'num_products': 10,
 'num_users': 100,
 'number_of_flips': 0,
 'phi_var': 0.1,
 'prob_bandit_to_organic': 0.05,
 'prob_leave_bandit': 0.01,
 'prob_leave_organic': 0.01,
 'prob_organic_to_bandit': 0.25,
 'random_seed': 42,
 'sigma_mu_organic': 3,
 'sigma_omega': 0.1,
 'sigma_omega_initial': 1,
 'with_ps_all': False}
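Reading the parameter names, one plausible interpretation is that a user's timeline alternates between organic and bandit events according to a small Markov chain, with a chance of leaving at every event. This is an assumption about the simulator's internals, not something confirmed here, and `simulate_timeline` is a hypothetical helper written only to illustrate that reading.

```python
import random


def simulate_timeline(args, seed=0, max_events=10_000):
    """Hypothetical helper: alternate organic/bandit events until the user leaves."""
    rng = random.Random(seed)
    state = 'organic'  # Assumption: every timeline starts with organic behavior.
    timeline = []
    while len(timeline) < max_events:
        timeline.append(state)
        if state == 'organic':
            leave = args['prob_leave_organic']
            switch = args['prob_organic_to_bandit']
            other = 'bandit'
        else:
            leave = args['prob_leave_bandit']
            switch = args['prob_bandit_to_organic']
            other = 'organic'
        draw = rng.random()
        if draw < leave:
            break  # The user leaves and the episode ends.
        if draw < leave + switch:
            state = other  # Otherwise the user stays in the current state.
    return timeline


# Values copied from the env_1_args output above.
toy_args = {
    'prob_leave_organic': 0.01,
    'prob_organic_to_bandit': 0.25,
    'prob_leave_bandit': 0.01,
    'prob_bandit_to_organic': 0.05,
}
timeline = simulate_timeline(toy_args, seed=42)
print(len(timeline), timeline.count('organic'), timeline.count('bandit'))
```

Under this reading, a user switches into bandit state much more readily than back out of it, so long runs of consecutive bandit events would be expected.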

### Act on the environment
We will now choose the product to recommend, and _hope_ for a click from the user.
For our first agent we will hardcode the actions taken.

In [47]:
actions = [None] + [1, 2, 3, 4, 5]  # Create a list of hard coded actions.
env.reset()
done = False  # Set Done to False.

i = 0  # Counting how many steps.

while not done and i < len(actions):
    action = actions[i]
    observation, reward, done, info = env.step(action)
    print(f"Step: {i} - Action: {action} - Observation: {observation.sessions()} - Reward: {reward}")
    i += 1
# Not sure why some steps come with an Observation and others don't...

Step: 0 - Action: None - Observation: [{'t': 0, 'u': 0, 'z': 'pageview', 'v': 3}, {'t': 1, 'u': 0, 'z': 'pageview', 'v': 3}, {'t': 2, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 3, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 4, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 5, 'u': 0, 'z': 'pageview', 'v': 3}, {'t': 6, 'u': 0, 'z': 'pageview', 'v': 5}] - Reward: None
Step: 1 - Action: 1 - Observation: [{'t': 8, 'u': 0, 'z': 'pageview', 'v': 4}, {'t': 9, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 10, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 11, 'u': 0, 'z': 'pageview', 'v': 9}] - Reward: 0
Step: 2 - Action: 2 - Observation: [] - Reward: 0
Step: 3 - Action: 3 - Observation: [{'t': 14, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 15, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 16, 'u': 0, 'z': 'pageview', 'v': 9}] - Reward: 0
Step: 4 - Action: 4 - Observation: [] - Reward: 0
Step: 5 - Action: 5 - Observation: [] - Reward: 0


In [48]:
env.step?

Okay, there's quite a bit going on here:  
- `Action`
   * `t` is the timestep (always incremented), it won't be useful today
   * `u` is the user id, as we have one user, for now, it's always 0
   * `a` is a number between `0` and `num_products - 1` that references the index of the product recommended.
   * `ps` is the propensity score or the probability that the agent assigned to this action
   * `ps-a` are the probabilities assigned to all actions by the agent (we can see that it's uniform for now: the agent randomly selects the recommended product)
- `observation` will either be `None` or a session of Organic data, showing the index of products the user views.
   * `t`, `u` have the same meaning as above
   * `z` is the type of event (always pageview for now)
   * `v` is the index of the viewed product
- `reward` is `0` if the user does not click on the recommended product and `1` if they do.  Notice that when a user clicks on a product (wherever the reward is `1`), they start a new Organic session.
- `done` is a True/False flag indicating if the episode (aka user's timeline) is over.  

Also, notice that the first `action` is `None`.  In our implementation, the agent observes Organic behavior before recommending anything.
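To make the observation fields concrete, the sketch below tallies views per product from a list of event dictionaries. The list is copied from the Step 0 output above; `count_views` is a hypothetical helper, not a RecoGym function.

```python
from collections import Counter

# Events copied from the Step 0 output above.
sessions = [
    {'t': 0, 'u': 0, 'z': 'pageview', 'v': 3},
    {'t': 1, 'u': 0, 'z': 'pageview', 'v': 3},
    {'t': 2, 'u': 0, 'z': 'pageview', 'v': 9},
    {'t': 3, 'u': 0, 'z': 'pageview', 'v': 9},
    {'t': 4, 'u': 0, 'z': 'pageview', 'v': 9},
    {'t': 5, 'u': 0, 'z': 'pageview', 'v': 3},
    {'t': 6, 'u': 0, 'z': 'pageview', 'v': 5},
]


def count_views(events):
    """Hypothetical helper: count pageviews per product index `v`."""
    return Counter(e['v'] for e in events if e['z'] == 'pageview')


views = count_views(sessions)
print(views[3], views[9], views[5])  # prints "3 3 1"
```

Per-product counts like these are the only signal a popularity-based agent needs from the Organic data.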

## Creating our first agent

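- Let's build an Agent in an object-oriented way.
- The code below records how often each item is viewed organically, then recommends an item by sampling in proportion to that popularity.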

In [5]:
import numpy as np
from recogym.agents import Agent

In [6]:
class BestOfAgent(Agent):
    def __init__(self, config):
        # Let the base Agent store the config, then start every product's view count at zero.
        Agent.__init__(self, config)
        self.organic_views = np.zeros(self.config.num_products)

    def train(self, observation, action, reward, done):
        """Train method learns from a tuple of data.
        this method can be called for offline or online learning
        """
        # Adding organic session to organic view counts.
        if observation:
            for session in observation.sessions():
                viewed_item_index = session['v']
                self.organic_views[viewed_item_index] += 1

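    def act(self, observation, reward, done):
        """Returns an action based on current observation and past history."""
        total_views = np.sum(self.organic_views)
        if total_views > 0:
            # Sample in proportion to how often each product was viewed.
            probabilities = self.organic_views / total_views
        else:
            # No views recorded yet: fall back to a uniform distribution.
            probabilities = np.ones(self.config.num_products) / self.config.num_products
        action = np.random.choice(self.config.num_products, p=probabilities)

        return {
            **super().act(observation, reward, done),
            'a': action,
            'ps': probabilities[action],
            'ps-a': probabilities,
        }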

- The `BestOfAgent` class above shows our preferred way of creating an agent.
- The `train` method takes in training data from the environment's `step_offline` method and therefore returns nothing.
- The `act` method returns an action to pass back into the environment.

The code below shows how this agent is first trained offline and then uses what it learned to recommend online.

### Code I tried out

- I wanted to see what `step_offline` actually does.

In [23]:
env.reset()
observation, reward, done = None, 0, False
action, observation, reward, done, info = env.step_offline(observation, reward, done)
observation.sessions()

[{'t': 0, 'u': 0, 'v': 5, 'z': 'pageview'}]

In [20]:
env.reset()
observation, reward, done = None, 0, False
action, observation, reward, done, info = env.step(observation, reward, done)  # Plain step, unsurprisingly, accepts only an action.
observation.sessions()

TypeError: ignored

### Back to the original material

In [50]:
# Instantiate the BestOfAgent class.
num_products = 10
agent = BestOfAgent(Configuration({**env_1_args, 'num_products': num_products}))
env.reset_random_seed()

# Train on 1000 users offline.
num_offline_users = 1000

for _ in range(num_offline_users):

    env.reset()
    observation, reward, done = None, 0, False
    while not done:
        old_observation = observation
        action, observation, reward, done, info = env.step_offline(observation, reward, done)
        agent.train(old_observation, action, reward, done)

In [51]:
agent.act(old_observation, reward, done)

{'a': 9,
 'ps': 0.2729454956227054,
 'ps-a': array([0.09145251, 0.02306316, 0.00381248, 0.06918949, 0.29233738,
        0.14143839, 0.01572061, 0.05963475, 0.03040572, 0.2729455 ]),
 't': 84,
 'u': 0}

In [25]:
agent.organic_views

array([1943.,  490.,   81., 1470., 6211., 3005.,  334., 1267.,  646.,
       5799.])
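As a quick check of how `act` turns these counts into sampling probabilities, the sketch below normalizes the printed view counts by hand. It uses plain Python lists rather than NumPy so that it stands alone; the numbers are copied from the output above.

```python
# View counts copied from the `agent.organic_views` output above.
organic_views = [1943, 490, 81, 1470, 6211, 3005, 334, 1267, 646, 5799]

total_views = sum(organic_views)
probabilities = [count / total_views for count in organic_views]

# Product 9 accounts for 5799 of 21246 views, so it is sampled about 27% of the time.
print(total_views)                 # prints 21246
print(round(probabilities[9], 4))  # prints 0.2729
```

This matches the `ps` value of about 0.2729 that `agent.act` reported for action 9.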

In [26]:
# Run 100 users online and track the click-through rate.
num_online_users = 100
num_clicks, num_events = 0, 0

for _ in range(num_online_users):

    # Reset the env; the first step takes no action and returns the initial organic observation.
    env.reset()
    observation, _, done, _ = env.step(None)
    reward = None
    done = False
    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, info = env.step(action['a'])

        # Used for calculating click through rate.
        num_clicks += 1 if reward == 1 else 0
        num_events += 1

In [28]:
num_events

8163

In [27]:
ctr = num_clicks / num_events
print(f"Click Through Rate: {ctr:.4f}")

Click Through Rate: 0.0141


## Testing our first agent

- As a baseline, let's use an agent that recommends every item with equal probability (`RandomAgent`).
- For this we use `reco-gym-v1`, a slightly more complex toy environment.

In [33]:
import gym, recogym
from recogym import env_1_args, Configuration
from recogym.agents import RandomAgent, random_args
from copy import deepcopy

env_1_args['random_seed'] = 42
env_1 = gym.make('reco-gym-v1')
env_1.init_gym(env_1_args)

# Create the two agents.
num_products = env_1_args['num_products']

best_of_agent = BestOfAgent(Configuration(env_1_args))
random_agent = RandomAgent(Configuration({**env_1_args, **random_args,}))

- We can use the `test_agent` method to compare the two agents.
- To use `test_agent`, we need to provide the following:
    1. a copy of the current env 
    2. a copy of the agent
    3. the number of training users
    4. the number of testing users

In [35]:
recogym.test_agent?

In [36]:
# test_agent returns the median CTR together with its 0.025 and 0.975 quantiles.
random_agent_results = recogym.test_agent(
    deepcopy(env_1),
    deepcopy(random_agent),
    num_offline_users=1000,
    num_online_users=1000
)
median_random_agent, lower_bound_random_agent, upper_bound_random_agent = random_agent_results

START: Agent Training #0
START: Agent Training @ Epoch #0
END: Agent Training @ Epoch #0 (22.391456365585327s)
START: Agent Evaluating @ Epoch #0
END: Agent Evaluating @ Epoch #0 (22.579938411712646s)


In [37]:
# test_agent returns the median CTR together with its 0.025 and 0.975 quantiles.
bestof_agent_results = recogym.test_agent(
    deepcopy(env_1),
    deepcopy(best_of_agent),
    num_offline_users=1000,
    num_online_users=1000
)
median_bestof_agent, lower_bound_bestof_agent, upper_bound_bestof_agent = bestof_agent_results

START: Agent Training #0
START: Agent Training @ Epoch #0
END: Agent Training @ Epoch #0 (22.80797266960144s)
START: Agent Evaluating @ Epoch #0
END: Agent Evaluating @ Epoch #0 (28.645923614501953s)


In [38]:
print(f'Random agent CTR  = {median_random_agent:.4f} ({lower_bound_random_agent:.4f}, {upper_bound_random_agent:.4f})')
print(f'Best of agent CTR = {median_bestof_agent:.4f} ({lower_bound_bestof_agent:.4f}, {upper_bound_bestof_agent:.4f})')

Random agent CTR  = 0.0109 (0.0102, 0.0117)
Best of agent CTR = 0.0147 (0.0138, 0.0155)


Even an agent as simple as `BestOfAgent` improves the click-through rate over the random baseline: the median CTR rises from 0.0109 to 0.0147, and the two intervals do not overlap.
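The three numbers reported per agent are a median and a 95% interval for the CTR. One common way to get such an interval from click counts is a Beta posterior over the click probability; whether `test_agent` does exactly this is an assumption here, so treat the following as an illustrative sketch only. The counts are hypothetical, and `ctr_interval` is a made-up helper that estimates the quantiles by Monte Carlo using only the standard library.

```python
import random


def ctr_interval(clicks, impressions, num_samples=20_000, seed=0):
    """Hypothetical helper: median and 95% interval of a Beta(clicks + 1, misses + 1) posterior."""
    rng = random.Random(seed)
    misses = impressions - clicks
    # Draw posterior samples and sort them so quantiles can be read off by index.
    samples = sorted(
        rng.betavariate(clicks + 1, misses + 1) for _ in range(num_samples)
    )

    def quantile(q):
        return samples[min(int(q * num_samples), num_samples - 1)]

    return quantile(0.5), quantile(0.025), quantile(0.975)


# Hypothetical counts, not taken from the runs above.
median, lower, upper = ctr_interval(clicks=120, impressions=8000)
print(f"CTR = {median:.4f} ({lower:.4f}, {upper:.4f})")
```

With more impressions the interval narrows, which is why comparing agents on only a few online users can be misleading.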