## What is RecoGym?

RecoGym is a Python [OpenAI Gym](https://gym.openai.com/) environment for testing recommendation algorithms.  It allows for the testing of both offline and reinforcement-learning based agents.  It provides a way to quickly test algorithms in a toy environment.

In this notebook we will code a simple recommendation agent that suggests an item in proportion to how many times it's been viewed.  We hope to inspire you to create your own agents and test them against our baseline models.

To get the most out of RecoGym, we suggest you have some experience coding in Python, some background knowledge of recommender systems, and familiarity with the reinforcement learning setup.  If any of the code below raises an error, check the Python requirements listed in the README.

## Reinforcement Learning Setup

RecoGym follows the usual reinforcement learning setup.  This means there are interactions between the environment (the user's behavior) and the agent (our recommendation algorithm).  The agent receives reward if the user clicks on the recommendation.

<img src="images/rl-setup.png" alt="Drawing" style="width: 600px;"/>
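
To make the loop in the figure concrete, here is a minimal sketch of an agent interacting with an environment. It does not use RecoGym: `ToyClickEnv`, its parameters, and `run_episode` are hypothetical names invented for illustration, and the click model (one "liked" product clicked with a fixed probability) is an assumption.

```python
import random


class ToyClickEnv:
    """Hypothetical stand-in for a user: only one product is ever clicked."""

    def __init__(self, liked_product=2, click_prob=0.5, max_steps=20, seed=0):
        self.liked_product = liked_product
        self.click_prob = click_prob
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        # start a new episode (one episode per user)
        self.steps = 0

    def step(self, action):
        """Take a recommended product index and return (reward, done)."""
        self.steps += 1
        clicked = action == self.liked_product and self.rng.random() < self.click_prob
        reward = 1 if clicked else 0
        done = self.steps >= self.max_steps
        return reward, done


def run_episode(toy_env, policy):
    """Run one episode and return the total reward (number of clicks)."""
    toy_env.reset()
    total_reward, done = 0, False
    while not done:
        action = policy()                    # the agent picks a recommendation
        reward, done = toy_env.step(action)  # the environment responds
        total_reward += reward
    return total_reward


toy_env = ToyClickEnv()
clicks_good = run_episode(toy_env, lambda: 2)  # always recommend the liked product
clicks_bad = run_episode(toy_env, lambda: 0)   # never recommend it
print(clicks_good, clicks_bad)
```

The agent that always recommends the liked product collects some clicks, while the other collects none, which is the reward signal an RL agent would learn from.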

## Organic and Bandit

Even though our focus is biased towards online advertising, we tried to make RecoGym universal to all types of recommendation.  Hence, we introduce the domain-agnostic terms Organic and Bandit sessions.  An Organic session is an observation of items the user interacts with.  For example, it could be views of products on an e-commerce website, listens to songs while streaming music, or readings of articles on an online newspaper.  A Bandit session is one where we have an opportunity to recommend the user an item and observe their behavior.  We receive a reward if they click.

<img src="images/organic-bandit.png" alt="Drawing" style="width: 450px;"/>
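
One way to picture the two kinds of events is as two small record types on a single user timeline. The sketch below is illustrative only: `OrganicEvent`, `BanditEvent`, and `summarize` are hypothetical names, not part of RecoGym.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class OrganicEvent:
    product_id: int  # product the user looked at on their own


@dataclass
class BanditEvent:
    action: int  # product we recommended
    reward: int  # 1 if the user clicked, 0 otherwise


Timeline = List[Union[OrganicEvent, BanditEvent]]


def summarize(timeline: Timeline) -> Tuple[List[int], int, int]:
    """Return (organically viewed products, number of recommendations, clicks)."""
    views = [e.product_id for e in timeline if isinstance(e, OrganicEvent)]
    bandit_events = [e for e in timeline if isinstance(e, BanditEvent)]
    clicks = sum(e.reward for e in bandit_events)
    return views, len(bandit_events), clicks


timeline = [
    OrganicEvent(4),    # user browses product 4
    BanditEvent(9, 0),  # we recommend product 9, no click
    BanditEvent(3, 1),  # we recommend product 3, click
    OrganicEvent(5),    # the click starts a new Organic session
]
views, num_recommendations, num_clicks = summarize(timeline)
print(views, num_recommendations, num_clicks)
```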

## Offline and Online Learning

This project was born out of a desire to improve Criteo's recommendation system by exploring reinforcement learning algorithms. We quickly realized that we can't just blindly apply RL algorithms in a production system out of the box. The learning period would be too costly. Instead, we need to leverage the vast amounts of offline training examples we already have to make the algorithm perform as well as the current system before releasing it into the online production environment.

Thus, RecoGym follows a similar flow. An agent is first given access to many offline training examples produced by a fixed policy. Then, it is given access to the online system, where it chooses the actions itself.

<img src="images/two-steps.png" alt="Drawing" style="width: 450px;"/>
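
The two-step flow can be sketched without RecoGym as follows. Everything here is an assumption made for illustration: the uniform-random logging policy, the hidden `click_probability` function, and the helper names are all hypothetical.

```python
import random


def click_probability(action):
    """Hypothetical ground truth, unknown to the agent: product 3 is clicked most."""
    return 0.30 if action == 3 else 0.05


def collect_offline_logs(num_events, num_products, rng):
    """Step 1: a fixed (uniform random) logging policy produces training examples."""
    logs = []
    for _ in range(num_events):
        action = rng.randrange(num_products)
        reward = 1 if rng.random() < click_probability(action) else 0
        logs.append((action, reward))
    return logs


def estimate_ctr(logs, num_products):
    """Estimate a click-through rate per product from the logged (action, reward) pairs."""
    shows = [0] * num_products
    clicks = [0] * num_products
    for action, reward in logs:
        shows[action] += 1
        clicks[action] += reward
    return [c / s if s else 0.0 for c, s in zip(clicks, shows)]


rng = random.Random(42)
num_products = 5

# offline phase: learn from logged data only
logs = collect_offline_logs(5000, num_products, rng)
ctr_estimates = estimate_ctr(logs, num_products)
best_action = max(range(num_products), key=lambda a: ctr_estimates[a])

# online phase: the agent now chooses the actions itself
online_events = 1000
online_clicks = sum(
    1 for _ in range(online_events) if rng.random() < click_probability(best_action)
)
online_ctr = online_clicks / online_events
print(best_action, round(online_ctr, 3))
```

Learning from the logs first means the agent already recommends a reasonable product on its first online step, instead of paying for exploration in production.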

## Let's see some code - Interacting with the environment 


The code snippet below shows how to initialize the environment and step through it in an 'offline' manner.  Here, 'offline' means that the environment itself generates the recommendations (the actions) rather than our agent.  We print out the results from the environment at each step.

In [1]:
import gym, reco_gym

# env_1_args is a dictionary of default parameters (e.g. number of products)
from reco_gym import env_1_args

# you can overwrite environment arguments here:
env_1_args['random_seed'] = 42

# initialize the gym for the first time by calling .make() and .init_gym()
env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

# .reset() env before each episode (one episode per user)
env.reset()
done = False

# counting how many steps
i = 0 

while not done:
    action, observation, reward, done, info = env.step_offline()
    print(f"Step: {i} - Action: {action} - Observation: {observation} - Reward: {reward}")
    i += 1

Step: 0 - Action: None - Observation: [('pageview', 4)] - Reward: None
Step: 1 - Action: 9 - Observation: None - Reward: 0
Step: 2 - Action: 3 - Observation: None - Reward: 0
Step: 3 - Action: 6 - Observation: [('pageview', 5), ('pageview', 4), ('pageview', 5)] - Reward: 0
Step: 4 - Action: 0 - Observation: None - Reward: 0
Step: 5 - Action: 4 - Observation: [('pageview', 5)] - Reward: 0
Step: 6 - Action: 2 - Observation: None - Reward: 0
Step: 7 - Action: 3 - Observation: None - Reward: 0
Step: 8 - Action: 6 - Observation: None - Reward: 0
Step: 9 - Action: 9 - Observation: None - Reward: 0
Step: 10 - Action: 5 - Observation: None - Reward: 0
Step: 11 - Action: 8 - Observation: None - Reward: 0
Step: 12 - Action: 4 - Observation: None - Reward: 0
Step: 13 - Action: 5 - Observation: [('pageview', 7)] - Reward: 0
Step: 14 - Action: 2 - Observation: None - Reward: 0
Step: 15 - Action: 6 - Observation: None - Reward: 0
Step: 16 - Action: 0 - Observation: None - Reward: 0
Step: 17 - Action

Okay, there's quite a bit going on here:  
- `action` is a number between `0` and `num_products - 1` that references the index of the product recommended.
- `observation` will either be `None` or a session of Organic data, showing the indices of the products the user views.
- `reward` is 0 if the user does not click on the recommended product and 1 if they do.  Notice that when a user clicks on a product (wherever the reward is 1), they start a new Organic session.
- `done` is a True/False flag indicating whether the episode (i.e. the user's timeline) is over.
- `info` is currently not used, so it is always an empty dictionary.

Also, notice that the first `action` is `None`.  In our implementation, the agent observes Organic behavior before recommending anything.
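
As a small illustration of how such observations could be consumed, the sketch below tallies pageviews per product. It assumes each non-`None` observation can be iterated as `(event_type, product_index)` pairs, which is how they appear in the printed output above; the real observation object may expose a different interface. `count_pageviews` is a hypothetical helper.

```python
from collections import Counter


def count_pageviews(observations):
    """Count how often each product index appears in Organic pageviews."""
    counts = Counter()
    for observation in observations:
        if observation is None:
            # a Bandit step with no new Organic session
            continue
        for event_type, product_index in observation:
            if event_type == 'pageview':
                counts[product_index] += 1
    return counts


# observations copied from the first six steps of the printed output above
sample_observations = [
    [('pageview', 4)],
    None,
    None,
    [('pageview', 5), ('pageview', 4), ('pageview', 5)],
    None,
    [('pageview', 5)],
]
pageview_counts = count_pageviews(sample_observations)
print(pageview_counts)
```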

Now, we'll show calling the environment in an online manner, where the agent needs to supply an action. For demonstration purposes, we will create a list of hard-coded actions. 

In [2]:
# create list of hard coded actions
actions = [None] + [1, 2, 3, 4, 5]

# reset env and set done to False
env.reset()
done = False

# counting how many steps
i = 0 

while not done and i < len(actions):
    action = actions[i]
    observation, reward, done, info = env.step(action)
    print(f"Step: {i} - Action: {action} - Observation: {observation} - Reward: {reward}")
    i += 1

Step: 0 - Action: None - Observation: [('pageview', 4), ('pageview', 4), ('pageview', 4), ('pageview', 4), ('pageview', 4), ('pageview', 4)] - Reward: None
Step: 1 - Action: 1 - Observation: None - Reward: 0
Step: 2 - Action: 2 - Observation: None - Reward: 0
Step: 3 - Action: 3 - Observation: [('pageview', 4), ('pageview', 9), ('pageview', 9)] - Reward: 0
Step: 4 - Action: 4 - Observation: [('pageview', 9)] - Reward: 0


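You'll notice that the offline and online APIs are nearly identical.  The only difference is which method you call: `env.step_offline()` chooses the action itself and returns it along with the other values, whereas `env.step(action)` expects you to supply the action.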

## Creating our first agent

Now that we have seen how the offline and online versions of the environment work, it's time to code our first recommendation agent!  Technically, an agent can be anything that produces actions for the environment to use.  However, we will show you the object-oriented way we like to create agents.

Below is the code for a very simple agent: the popularity-based agent. It records how many times a user sees each product organically; then, when required to make a recommendation, it chooses a product at random, with probability proportional to the number of times the product has been viewed.

In [3]:
import numpy as np
from numpy.random import choice

# define agent class
class PopularityAgent:
    def __init__(self, num_products):
        # set number of products as an attribute of agent
        self.num_products = num_products
        
        # track number of times each item viewed in Organic session
        self.organic_views = np.zeros(self.num_products)
        
    def train(self, observation, action, reward, done):
        """train method learns from a tuple of data.
            this method can be called for offline or online learning"""
        
        # adding organic session to organic view counts
        if observation is not None:
            for product in observation.get_views():
                self.organic_views[product] += 1
    
    def act(self, observation, reward, done):
        """act method returns an action based on current observation and past
            history"""
        
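        # choosing action randomly in proportion with number of views
        total_views = self.organic_views.sum()
        if total_views > 0:
            prob = self.organic_views / total_views
        else:
            # no organic views recorded yet: fall back to a uniform choice
            prob = np.full(self.num_products, 1 / self.num_products)
        action = choice(self.num_products, p = prob)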
        
        return action

The `PopularityAgent` class above demonstrates our preferred way to create agents for reco-gym. Notice that it has both a `train` and an `act` method. The `train` method is designed to take in training data from the environment's `step_offline` method and thus has nothing to return, whilst the `act` method must return an action to pass back into the environment.
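
To see what "in proportion to the number of views" means numerically, here is a small worked example using only the standard library. The view counts are made up, and `view_probabilities` is a hypothetical helper rather than part of the agent.

```python
import random


def view_probabilities(view_counts):
    """Turn raw view counts into sampling probabilities."""
    total = sum(view_counts)
    if total == 0:
        # nothing viewed yet: every product is equally likely
        return [1.0 / len(view_counts)] * len(view_counts)
    return [count / total for count in view_counts]


view_counts = [6, 0, 3, 1]               # made-up views for products 0..3
probs = view_probabilities(view_counts)  # 6/10, 0/10, 3/10, 1/10

# draw many recommendations and check the empirical frequencies
rng = random.Random(0)
samples = rng.choices(range(len(view_counts)), weights=probs, k=10_000)
share_of_product_0 = samples.count(0) / len(samples)
print(probs, round(share_of_product_0, 2))
```

Product 1 has never been viewed, so it is never recommended; this lack of exploration is one reason the popularity agent is only a simple baseline.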

The code below shows how to use this agent: first we train it offline, then we use what it has learned to make recommendations online.

In [4]:
# instantiate instance of PopularityAgent class
num_products = 10
agent = PopularityAgent(num_products)

# resets random seed back to 42, or whatever we set it to in env_1_args
env.reset_random_seed()

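# train on 1000 users offline
num_offline_users = 1000

# no observation exists before the first step; without this line the loop
# would reuse whatever `observation` was left over from an earlier cell
observation = None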

for _ in range(num_offline_users):
    
    #reset env and set done to False
    env.reset()
    done = False
    
    while not done:
        old_observation = observation
        action, observation, reward, done, info = env.step_offline()
        agent.train(old_observation, action, reward, done)

# run 100 users online and track click through rate
num_online_users = 100
num_clicks, num_events = 0, 0

for _ in range(num_online_users):
    
    # reset env, then take the initial step with no action to get the
    # first Organic observation
    env.reset()
    observation, reward, done, info = env.step(None)
    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, info = env.step(action)
        
        # used for calculating click through rate
        num_clicks += 1 if reward == 1 else 0
        num_events += 1

ctr = num_clicks / num_events


print(f"Click Through Rate: {ctr:.4f}")

Click Through Rate: 0.0263


## Testing our first agent

Now that we have created our popularity-based agent, we should test it against an even simpler baseline: one that performs no learning and recommends products uniformly at random. To do this, we first create a fresh instance of the `reco-gym-v1` toy environment.

Next we will load another agent for our agent to compete against. Here you can see we make use of the `RandomAgent` and create an instance of it in addition to our `PopularityAgent`.

In [5]:
import gym, reco_gym
from reco_gym import env_1_args

from copy import deepcopy

env_1_args['random_seed'] = 42

env = gym.make('reco-gym-v1')
env.init_gym(env_1_args)

# Import the random agent
from agents import RandomAgent, random_args

# Create the two agents
num_products = env_1_args['num_products']
popularity_agent = PopularityAgent(num_products)
agent_rand = RandomAgent(random_args)

Now that we have instances of our two agents, we can use the `test_agent` function from reco-gym to compare their performance.

To use `test_agent`, provide a copy of the current env, a copy of the agent instance, the number of training users, and the number of testing users.

In [6]:
# returns the median CTR followed by its 0.025 and 0.975 quantiles (a credible interval)
reco_gym.test_agent(deepcopy(env), deepcopy(agent_rand), 1000, 1000) 

Starting Agent Training
Starting Agent Testing


(0.018444507507702788, 0.017500405252641585, 0.019421321092238708)

In [7]:
# returns the median CTR followed by its 0.025 and 0.975 quantiles (a credible interval)
reco_gym.test_agent(deepcopy(env), deepcopy(popularity_agent), 1000, 1000) 

Starting Agent Training
Starting Agent Testing


(0.026837156159392188, 0.025703859122250695, 0.028002355026224057)

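We see an improvement in the click-through rate for the popularity agent: its median CTR is about 0.027, versus about 0.018 for the random agent, and the two intervals do not overlap.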
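
For intuition about what a median with 0.025 and 0.975 quantiles can represent, the sketch below computes a credible interval for a click-through rate from click counts. It is an assumption about one common approach (a Beta posterior under a uniform prior, approximated by sampling) and is not necessarily how `test_agent` computes its result. The function name and the example counts are hypothetical.

```python
import random


def ctr_credible_interval(num_clicks, num_events, num_samples=20_000, seed=0):
    """Return (median, 0.025 quantile, 0.975 quantile) of a Beta posterior over the CTR."""
    rng = random.Random(seed)
    alpha = num_clicks + 1              # clicks plus a uniform Beta(1, 1) prior
    beta = num_events - num_clicks + 1  # non-clicks plus the same prior
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(num_samples))

    def quantile(q):
        # empirical quantile of the sorted posterior draws
        return draws[min(int(q * num_samples), num_samples - 1)]

    return quantile(0.5), quantile(0.025), quantile(0.975)


# hypothetical counts: 263 clicks out of 10,000 recommendations
median, low, high = ctr_credible_interval(num_clicks=263, num_events=10_000)
print(f"{median:.4f} ({low:.4f}, {high:.4f})")
```

The interval narrows as the number of events grows, which is why comparing agents on more test users gives a clearer separation.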