<a href="https://colab.research.google.com/github/ZiminPark/bandit-reco/blob/master/notebooks/0.%20Getting%20Started.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install recogym

## In this notebook

- Build a simple recommendation agent that **suggests an item in proportion to how many times it has been viewed**.

## Reinforcement Learning Setup

RecoGym follows the usual reinforcement learning setup: an agent (our recommendation algorithm) interacts with an environment (the user's behavior).  The agent receives a reward if the user clicks on the recommendation.

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/rl-setup.png?raw=1" alt="Drawing" style="width: 600px;"/>
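
The loop in the figure can be sketched without RecoGym at all. The toy below is only an illustration: `ToyEnvironment`, `UniformAgent`, `run_interaction`, and the click probabilities are hypothetical stand-ins, not part of the RecoGym API.

```python
import random


class ToyEnvironment:
    """Hypothetical user model: each product has a fixed click probability."""

    def __init__(self, click_probabilities, seed=0):
        self.click_probabilities = click_probabilities
        self.rng = random.Random(seed)

    def step(self, action):
        # Reward is 1 if the simulated user clicks on the recommendation, else 0.
        return 1 if self.rng.random() < self.click_probabilities[action] else 0


class UniformAgent:
    """Hypothetical agent that recommends a product uniformly at random."""

    def __init__(self, num_products, seed=0):
        self.num_products = num_products
        self.rng = random.Random(seed)

    def act(self):
        return self.rng.randrange(self.num_products)


def run_interaction(environment, agent, num_steps):
    """Run the agent-environment loop and return the list of rewards."""
    rewards = []
    for _ in range(num_steps):
        action = agent.act()               # the agent picks a recommendation
        reward = environment.step(action)  # the environment answers: click or no click
        rewards.append(reward)
    return rewards


toy_env = ToyEnvironment(click_probabilities=[0.01, 0.05, 0.02])
toy_agent = UniformAgent(num_products=3)
toy_rewards = run_interaction(toy_env, toy_agent, num_steps=1000)
print(f"Clicks: {sum(toy_rewards)} out of {len(toy_rewards)} recommendations")
```

A real agent would also use the rewards it receives to improve its choices; this one ignores them, which makes it a useful baseline.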

## Organic and Bandit

Even though our focus is biased towards online advertising, we tried to make RecoGym universal to all types of recommendation.  Hence, we introduce the domain-agnostic terms _Organic_ and _Bandit_ sessions.  An _Organic_ session is an observation of items the user interacts with.  For example, it could be views of products on an e-commerce website, listens to songs while streaming music, or readings of articles on an online newspaper.  A _Bandit_ session is one where we have an opportunity to recommend the user an item and observe their behavior.  We receive a reward if they click.

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/organic-bandit.png?raw=1" alt="Drawing" style="width: 450px;"/>
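
To make the distinction concrete, here is one way the two kinds of events could be represented. The Organic records reuse the `t`, `u`, `z`, `v` keys that appear in the environment output later in this notebook; the `'bandit'` event type and the `c` (click) key are assumptions made for this illustration, and the data is made up.

```python
# A user's timeline as a list of events (illustrative data only).
timeline = [
    {'t': 0, 'u': 0, 'z': 'pageview', 'v': 1},        # Organic: user viewed product 1
    {'t': 1, 'u': 0, 'z': 'pageview', 'v': 4},        # Organic: user viewed product 4
    {'t': 2, 'u': 0, 'z': 'bandit', 'a': 4, 'c': 0},  # Bandit: we recommended 4, no click
    {'t': 3, 'u': 0, 'z': 'bandit', 'a': 1, 'c': 1},  # Bandit: we recommended 1, click
    {'t': 4, 'u': 0, 'z': 'pageview', 'v': 1},        # Organic: a new session starts
]

organic_events = [event for event in timeline if event['z'] == 'pageview']
bandit_events = [event for event in timeline if event['z'] == 'bandit']

# Only Bandit events carry a reward.
total_reward = sum(event['c'] for event in bandit_events)
print(f"{len(organic_events)} organic events, {len(bandit_events)} bandit events, reward {total_reward}")
```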

## Offline and Online Learning

This project was born out of a desire to improve Criteo's recommendation system by exploring reinforcement learning algorithms. We quickly realized that we couldn't just apply RL algorithms blindly in a production system out of the box: the learning period would be too costly. Instead, we need to leverage the vast amounts of offline training examples we already have to make the algorithm perform as well as the current system before releasing it into the online production environment.

Thus, RecoGym follows a similar flow. An agent is first given access to many offline training examples produced from a fixed policy. Then, it has access to the online system, where it chooses the actions.

<img src="https://github.com/ZiminPark/bandit-reco/blob/master/notebooks/images/two-steps.png?raw=1" alt="Drawing" style="width: 450px;"/>
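
The two steps can be sketched with a toy simulation. Everything here is a hypothetical stand-in rather than RecoGym code: the click probabilities are invented, the fixed logging policy is assumed to be uniform random, and the "learning" is a simple per-product click-rate estimate.

```python
import random

rng = random.Random(42)
true_click_probabilities = [0.02, 0.10, 0.05]  # hypothetical, hidden from the agent
num_toy_products = len(true_click_probabilities)


def simulate_click(action):
    """Hypothetical user: clicks with the product's true click probability."""
    return 1 if rng.random() < true_click_probabilities[action] else 0


# Step 1: offline logs produced by a fixed (here: uniform random) logging policy.
offline_logs = []
for _ in range(5000):
    logged_action = rng.randrange(num_toy_products)
    offline_logs.append((logged_action, simulate_click(logged_action)))

# Learn from the logs: estimate a click rate per product.
shown = [0] * num_toy_products
clicked = [0] * num_toy_products
for logged_action, logged_reward in offline_logs:
    shown[logged_action] += 1
    clicked[logged_action] += logged_reward
estimated_ctr = [c / s if s else 0.0 for c, s in zip(clicked, shown)]

# Step 2: online, the agent now chooses the actions itself.
best_product = max(range(num_toy_products), key=lambda a: estimated_ctr[a])
online_rewards = [simulate_click(best_product) for _ in range(1000)]
print(f"Estimated CTRs: {[round(x, 3) for x in estimated_ctr]}")
print(f"Online CTR when always recommending product {best_product}: "
      f"{sum(online_rewards) / len(online_rewards):.3f}")
```

The point of the offline step is that the agent makes its early mistakes on logged data rather than on live users.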

## Let's see some code - Interacting with the environment 


The code below shows how to initialize the environment and step through it.  (In 'offline' mode, used later with `step_offline`, the environment generates the recommendations for us.)  We print the results from the environment at each step.

### World creation

In [None]:
import gym, recogym
from copy import deepcopy

In [None]:
# env_1_args is a dictionary of default parameters that defines the simulated world (such as user behavior, number of products, etc.).
from recogym import env_1_args, Configuration

# You can overwrite environment arguments here:
env_1_args['random_seed'] = 42

# Initialize the gym for the first time by calling .make() and .init_gym()
env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

# .reset() env before each episode (one episode per user).
env.reset()

### Act on the environment
We will now choose the product to recommend, and _hope_ for a click from the user.
For our first agent we will hardcode the actions taken.

In [None]:
# Create a list of hard coded actions.
actions = [None] + [1, 2, 3, 4, 5]

# Reset the environment and set Done to False.
env.reset()
done = False

# Counting how many steps.
i = 0

while not done and i < len(actions):
    action = actions[i]
    observation, reward, done, info = env.step(action)
    print(f"Step: {i} - Action: {action} - Observation: {observation.sessions()} - Reward: {reward}")
    i += 1

Step: 0 - Action: None - Observation: [{'t': 0, 'u': 0, 'z': 'pageview', 'v': 1}, {'t': 1, 'u': 0, 'z': 'pageview', 'v': 4}, {'t': 2, 'u': 0, 'z': 'pageview', 'v': 4}, {'t': 3, 'u': 0, 'z': 'pageview', 'v': 1}, {'t': 4, 'u': 0, 'z': 'pageview', 'v': 4}, {'t': 5, 'u': 0, 'z': 'pageview', 'v': 9}, {'t': 6, 'u': 0, 'z': 'pageview', 'v': 1}] - Reward: None
Step: 1 - Action: 1 - Observation: [] - Reward: 0
Step: 2 - Action: 2 - Observation: [] - Reward: 0
Step: 3 - Action: 3 - Observation: [] - Reward: 0
Step: 4 - Action: 4 - Observation: [] - Reward: 0
Step: 5 - Action: 5 - Observation: [] - Reward: 0


Okay, there's quite a bit going on here:  
- `action` is the recommendation. Here we pass a plain product index; agents such as the one defined below return a dictionary with the following fields:
   * `t` is the timestep (always incremented); it won't be useful today
   * `u` is the user id; since we have only one user for now, it's always 0
   * `a` is a number between `0` and `num_products - 1` that references the index of the recommended product
   * `ps` is the propensity score, that is, the probability that the agent assigned to this action
   * `ps-a` holds the probabilities assigned to all actions by the agent (uniform when the agent selects the recommended product at random)
- `observation` will either be `None` or a session of Organic data, showing the index of each product the user views.
   * `t`, `u` have the same meaning as above
   * `z` is the type of event (always pageview for now)
   * `v` is the index of the viewed product
- `reward` is `0` if the user does not click on the recommended product and `1` if they do.  Notice that when a user clicks on a product (wherever the reward is `1`), they start a new Organic session.
- `done` is a True/False flag indicating whether the episode (that is, the user's timeline) is over.  

Also, notice that the first `action` is `None`.  In our implementation, the agent observes Organic behavior before recommending anything.
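
As a small worked example of reading an `observation`, the snippet below counts how often each product index `v` appears in the Organic session printed at step 0 above. It uses only the standard library and hardcodes the session, so it does not call RecoGym.

```python
from collections import Counter

# Organic session copied from the "Step: 0" line of the output above.
sessions = [
    {'t': 0, 'u': 0, 'z': 'pageview', 'v': 1},
    {'t': 1, 'u': 0, 'z': 'pageview', 'v': 4},
    {'t': 2, 'u': 0, 'z': 'pageview', 'v': 4},
    {'t': 3, 'u': 0, 'z': 'pageview', 'v': 1},
    {'t': 4, 'u': 0, 'z': 'pageview', 'v': 4},
    {'t': 5, 'u': 0, 'z': 'pageview', 'v': 9},
    {'t': 6, 'u': 0, 'z': 'pageview', 'v': 1},
]

# Count the views per product index.
view_counts = Counter(event['v'] for event in sessions)
print(dict(view_counts))  # {1: 3, 4: 3, 9: 1}
```

Counts like these are exactly the signal a popularity-style agent can learn from.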

## Creating our first agent

Now that we have seen how the offline and online versions of the environment work, it is time to code our first recommendation agent!  Technically, an agent can be anything that produces actions for the environment to use.  However, we will show you the object-oriented way we like to create agents.

Below is the code for a very simple agent: the _best of_ agent. The _best of_ agent merely records how many times each product has been seen organically. When required to make a recommendation, it chooses a product at random, in proportion to the number of times each product has been viewed overall.

In [None]:
import numpy as np
from recogym.agents import Agent

# Define an Agent class.
class BestOfAgent(Agent):
    def __init__(self, config):
        # Set number of products as an attribute of the Agent.
        Agent.__init__(self, config)

        # Track number of times each item viewed in the organic sessions.
        self.organic_views = np.zeros(self.config.num_products)

    def train(self, observation, action, reward, done):
        """Train method learns from a tuple of data.
        this method can be called for offline or online learning
        """
        # Adding organic session to organic view counts.
        if observation:
            for session in observation.sessions():
                viewed_item_index = session['v']
                self.organic_views[viewed_item_index] += 1

    def act(self, observation, reward, done):
        """Act method returns an action based on current observation and past
        history
        """
        # Choosing action randomly in proportion with number of views.
        probabilities = self.organic_views / np.sum(self.organic_views)
        action = np.random.choice(self.config.num_products, p=probabilities)
        
        return {
            **super().act(observation, reward, done),
            **{
                'a': action,
                'ps': probabilities[action],
                'ps-a': probabilities,
            }
        }

The `BestOfAgent` class above demonstrates our preferred way to create agents for RecoGym. Notice that it has both a `train` and an `act` method. The `train` method is designed to take in training data from the environment's `step_offline` method and thus has nothing to return, while the `act` method must return an action to pass back into the environment.
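
The sampling step in `act` can be looked at in isolation. Note that `BestOfAgent.act` divides by the total view count, which is zero if the agent has not yet seen any Organic views. The sketch below uses only the standard library and adds a uniform fallback for that case; the helper names `view_proportional_probabilities` and `choose_action` are hypothetical, and the fallback is one possible choice rather than RecoGym behavior.

```python
import random


def view_proportional_probabilities(view_counts):
    """Turn organic view counts into recommendation probabilities.

    Falls back to a uniform distribution when nothing has been viewed yet,
    which avoids dividing by zero.
    """
    total = sum(view_counts)
    if total == 0:
        return [1.0 / len(view_counts)] * len(view_counts)
    return [count / total for count in view_counts]


def choose_action(view_counts, rng):
    """Sample a product index and return it with its propensity score."""
    probabilities = view_proportional_probabilities(view_counts)
    action = rng.choices(range(len(view_counts)), weights=probabilities, k=1)[0]
    return {'a': action, 'ps': probabilities[action], 'ps-a': probabilities}


sampling_rng = random.Random(0)
# Views of products 1, 4, and 9, matching the Organic session shown earlier.
example_counts = [0, 3, 0, 0, 3, 0, 0, 0, 0, 1]
example_action = choose_action(example_counts, sampling_rng)
print(example_action['a'], round(example_action['ps'], 3))
```

Products that were never viewed get probability zero here, so this agent never explores them.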

The code below shows how to use this agent: first train it offline, then use what it has learned to make recommendations online.

In [None]:
# Create an instance of the BestOfAgent class.
num_products = 10
agent = BestOfAgent(Configuration({
    **env_1_args,
    'num_products': num_products,
}))

# Reset the random seed back to 42, or whatever we set it to in env_1_args.
env.reset_random_seed()

# Train on 1000 users offline.
num_offline_users = 1000

for _ in range(num_offline_users):

    # Reset env and initialize observation, reward, and done.
    env.reset()
    observation, reward, done = None, 0, False
    while not done:
        old_observation = observation
        action, observation, reward, done, info = env.step_offline(observation, reward, done)
        agent.train(old_observation, action, reward, done)

# Test on 100 users online and track the click-through rate.
num_online_users = 100
num_clicks, num_events = 0, 0

for _ in range(num_online_users):

    # Reset env and observe the first Organic session (the first action is None).
    env.reset()
    observation, _, done, _ = env.step(None)
    reward = None
    done = None
    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, info = env.step(action['a'])

        # Used for calculating click through rate.
        num_clicks += 1 if reward == 1 else 0
        num_events += 1

ctr = num_clicks / num_events

print(f"Click Through Rate: {ctr:.4f}")

Click Through Rate: 0.0142


## Testing our first agent

Now that we have created our popularity-based agent, we should test it against an even simpler baseline: one that performs no learning and recommends products uniformly at random. To do this, we first create a fresh instance of the `reco-gym-v1` environment.

Next, we load a second agent for ours to compete against. Here we use the `RandomAgent` and create an instance of it alongside our `BestOfAgent`.

In [None]:
import gym, recogym
from recogym import env_1_args
from recogym.agents import RandomAgent, random_args

from copy import deepcopy

env_1_args['random_seed'] = 42

env_1 = gym.make('reco-gym-v1')
env_1.init_gym(env_1_args)

# Create the two agents.
num_products = env_1_args['num_products']

best_of_agent = BestOfAgent(Configuration(env_1_args))
random_agent = RandomAgent(Configuration({
    **env_1_args,
    **random_args,
}))

Now we have instances of our two agents. We can use the `test_agent` method from RecoGym and compare their performance.

To use `test_agent`, one must provide a copy of the current env, a copy of the agent, the number of training users, and the number of testing users.

In [None]:
# test_agent returns the median CTR and its 0.025 and 0.975 quantiles.
random_agent_results = recogym.test_agent(
    deepcopy(env_1),
    deepcopy(random_agent),
    num_offline_users=1000,
    num_online_users=1000
)
median_random_agent, lower_bound_random_agent, upper_bound_random_agent = random_agent_results

Start: Agent Training #0
Start: Agent Testing #0


In [None]:
# test_agent returns the median CTR and its 0.025 and 0.975 quantiles.
# Use a copy of env_1 so both agents are tested on the same environment.
bestof_agent_results = recogym.test_agent(
    deepcopy(env_1),
    deepcopy(best_of_agent),
    num_offline_users=1000,
    num_online_users=1000
)
median_bestof_agent, lower_bound_bestof_agent, upper_bound_bestof_agent = bestof_agent_results

In [None]:
print(f'Random agent CTR  = {median_random_agent:.4f} ({lower_bound_random_agent:.4f}, {upper_bound_random_agent:.4f})')
print(f'Best of agent CTR = {median_bestof_agent:.4f} ({lower_bound_bestof_agent:.4f}, {upper_bound_bestof_agent:.4f})')

We see an improvement in the click-through rate even for an agent as simple as the best-of agent.
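
The three numbers returned by `test_agent` are used above as a median CTR with 0.025 and 0.975 quantiles. One way to obtain such an interval from raw click counts is a Beta posterior over the click probability. The sketch below assumes a uniform prior and approximates the quantiles by sampling with the standard library; it is not claimed to be what `test_agent` does internally, and the helper `ctr_interval` and the click counts are made up for illustration.

```python
import random


def ctr_interval(num_clicks, num_events, num_samples=20000, seed=0):
    """Approximate the median and 95% interval of the CTR by sampling.

    Assumes a Beta(clicks + 1, non_clicks + 1) posterior, i.e. a uniform prior.
    """
    rng = random.Random(seed)
    successes = num_clicks
    failures = num_events - num_clicks
    samples = sorted(
        rng.betavariate(successes + 1, failures + 1) for _ in range(num_samples)
    )

    def quantile(q):
        # Empirical quantile of the sorted samples.
        return samples[min(int(q * num_samples), num_samples - 1)]

    return quantile(0.5), quantile(0.025), quantile(0.975)


ctr_median, ctr_lower, ctr_upper = ctr_interval(num_clicks=142, num_events=10000)
print(f"CTR = {ctr_median:.4f} ({ctr_lower:.4f}, {ctr_upper:.4f})")
```

When comparing two agents, overlapping intervals suggest the difference may not be reliable at the chosen number of test users.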